This adds a handful of metrics to /api/v2/metrics/ recorded from the dispatcher main process
Adds logic in the dispatcher periodic tasks to calculate these for the last collection interval
Reports worker count, task count, scale up events, and availability
Add data to demo grafana dashboard
More fun in the grafana dashboard. The rows organize the panels and are
collapsible. Also, tested with multiple nodes and fixed some
labeling issues when there is more than one node.
Update the grafana alerting readme, add some fun prose about one of the
alerts, and reorganize some of the code for clarity.
Finally, drop the time to fire for alerts, because it's better to have them be a bit touchy so users can verify they work than to leave them unsure.
* fix name to be consistent
this is not a mean, it's the last value
so say that in the name
* add remaining capacity to dashboard
also make legends pretty with nice names
fix typo in URL and in grafana alert rule
Important learning: no newlines in rules/equations
Turns out datasourceUid can be set in prometheus_source.yml, and it can be anything we want, so I have set it to awx_alert. The PBFAnumbersetc value it was set to before was an autogenerated UID. It would actually work with that generated value, but because we want it to make sense, we're setting the value in prometheus_source.yml.
Finally, update the docs to reflect the grafana docs and explain how to export new rules a user might want to add.
Co-authored-by: Elijah DeLee <kdelee@redhat.com>
This rule alerts if the redis queue is larger than the rolling
average event insertion rate per second * 120. In other words, if the redis
queue is larger than what it appears we can process in two minutes.
It appears it has to meet this condition for 60 seconds to start firing.
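For anyone reading along, here is a rough Python sketch of the condition described above. It is not the actual grafana rule. The function names, the sample tuple layout, and the way the 60 second hold is modeled are all assumptions made for illustration.

```python
def queue_exceeds_capacity(queue_size, insert_rate_per_sec, window_seconds=120):
    """True when the queue holds more events than we appear able to
    insert in window_seconds at the rolling average rate."""
    return queue_size > insert_rate_per_sec * window_seconds


def should_fire(samples, hold_seconds=60):
    """Decide whether the alert would be firing at the newest sample.

    samples is a hypothetical list of
    (timestamp_seconds, queue_size, insert_rate_per_sec) tuples, oldest first.
    The alert fires only if the condition has held without interruption
    for at least hold_seconds up to the newest sample.
    """
    breach_start = None
    for timestamp, queue_size, rate in samples:
        if queue_exceeds_capacity(queue_size, rate):
            if breach_start is None:
                # Remember when the current unbroken breach began.
                breach_start = timestamp
        else:
            # Condition cleared, so the hold timer starts over.
            breach_start = None
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= hold_seconds
```

With an insertion rate of 10 events/second the threshold is 1200 events, so a queue that stays above 1200 across samples spanning 60 seconds would fire, while one that dips below it in between would not.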
Future commits will address how to configure contact points like slack.
Shout out to @jainnikhil30 and @rebeccahhh who figured this out in a jam
session this morning.