* Organize metrics into their respective services
* Serve per-service metrics on a per-service HTTP server
* Increase use of the prometheus client library over our custom metrics fields
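As an illustration only, here is a minimal sketch of what per-service exposure could look like with the prometheus_client library; the registry, port, and metric name are placeholders, not the project's actual values.

```python
from prometheus_client import CollectorRegistry, Counter, start_http_server

# Each service keeps its own registry so its endpoint only exposes its own metrics.
dispatcher_registry = CollectorRegistry()
dispatcher_tasks_total = Counter(
    "dispatcher_tasks_total",
    "Tasks processed by the dispatcher",
    registry=dispatcher_registry,
)

# Serve that registry on a port dedicated to this service.
start_http_server(8085, registry=dispatcher_registry)
dispatcher_tasks_total.inc()
```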
This adds a handful of metrics, recorded from the dispatcher main process, to /api/v2/metrics/
Adds logic in the dispatcher periodic tasks to calculate these for the last collection interval
Reports worker count, task count, scale-up events, and availability
Adds this data to the demo Grafana dashboard
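A rough sketch, assuming prometheus_client, of how a periodic dispatcher task could publish these values each collection interval; the metric names and the pool attributes are hypothetical.

```python
from prometheus_client import Counter, Gauge

dispatcher_pool_workers = Gauge(
    "dispatcher_pool_workers", "Number of dispatcher pool workers")
dispatcher_tasks_running = Gauge(
    "dispatcher_tasks_running", "Tasks currently running in the dispatcher pool")
dispatcher_pool_scale_up_events = Counter(
    "dispatcher_pool_scale_up_events", "Times the worker pool scaled up")
dispatcher_availability = Gauge(
    "dispatcher_availability", "Fraction of the last interval the dispatcher was responsive")

def record_interval(pool, interval_seconds, responsive_seconds):
    # Called from the dispatcher's periodic tasks once per collection interval.
    dispatcher_pool_workers.set(len(pool.workers))
    dispatcher_tasks_running.set(sum(1 for w in pool.workers if w.current_task))
    dispatcher_availability.set(responsive_seconds / interval_seconds)

def on_scale_up():
    # Called whenever the pool adds a worker.
    dispatcher_pool_scale_up_events.inc()
```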
Works by adding a dedicated producer in wsrelay that looks for
local Django Channels messages on the "metrics" group. The producer
sends these to the consumer running in the web container, which
handles each message by pushing it into the local Redis instance.
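A simplified sketch of that flow, assuming Django Channels and the redis-py client; the message type, payload shape, and Redis key are hypothetical, and the consumer is assumed to have been added to the "metrics" group elsewhere via group_add.

```python
import json
import redis
from channels.consumer import AsyncConsumer
from channels.layers import get_channel_layer

async def produce_metrics(payload):
    # wsrelay side: forward a local metrics payload to the "metrics" group.
    channel_layer = get_channel_layer()
    await channel_layer.group_send("metrics", {"type": "metrics.data", "payload": payload})

class MetricsConsumer(AsyncConsumer):
    async def metrics_data(self, event):
        # Web-container side: persist the payload in the local Redis instance.
        conn = redis.Redis()
        conn.set(f"metrics-{event['payload']['node']}", json.dumps(event["payload"]))
```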
The Django view that handles requests at the /api/v2/metrics
endpoint loads this data from Redis, formats it, and returns the
response.
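A minimal sketch of such a view; the Redis key prefix and payload layout are assumptions carried over from the sketch above.

```python
import json
import redis
from django.http import HttpResponse

def metrics_view(request):
    conn = redis.Redis()
    lines = []
    for key in conn.scan_iter(match="metrics-*"):
        payload = json.loads(conn.get(key))
        node = payload.get("node", "unknown")
        for name, value in payload.get("metrics", {}).items():
            # Render one Prometheus text-format sample per metric.
            lines.append(f'{name}{{node="{node}"}} {value}')
    return HttpResponse("\n".join(lines) + "\n", content_type="text/plain")
```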
- DependencyManager spawns dependencies if necessary
- WorkflowManager processes running workflows to see if a new job is
ready to spawn
- TaskManager starts tasks if they are unblocked and there is execution capacity
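A very rough sketch, with entirely hypothetical class and method names, of how the three managers above might be invoked in order during a scheduling pass.

```python
class DependencyManager:
    def spawn_dependencies(self, tasks):
        # Create any dependency jobs the pending tasks require, if necessary.
        return tasks

class WorkflowManager:
    def process_running_workflows(self):
        # Walk running workflows and spawn the next job for any node that is ready.
        pass

class TaskManager:
    def start_tasks(self, tasks):
        # Start each task that is unblocked, while execution capacity remains.
        pass

def schedule(pending_tasks):
    tasks = DependencyManager().spawn_dependencies(pending_tasks)
    WorkflowManager().process_running_workflows()
    TaskManager().start_tasks(tasks)
```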
We've observed this in development, and some users have reported 500s on /api/v2/metrics caused by a KeyError when a metric is missing for a particular instance
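A minimal sketch of the defensive lookup that avoids the KeyError: return a default when a metric has not yet been reported for an instance (the data shape and metric name are hypothetical).

```python
instance_data = {"node1": {"callback_receiver_events_insert_db": 12}}

def get_metric(node, field, default=0):
    # Missing nodes or missing fields fall back to a default instead of raising.
    return instance_data.get(node, {}).get(field, default)

value = get_metric("node2", "callback_receiver_events_insert_db")  # 0, not a KeyError/500
```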
Grafana via Prometheus.
This metric is a good indicator of how far behind the callback receiver
is: the higher the load, the further behind it falls and the greater the
number of seconds the metric will display.
A consistently high value may indicate the need to scale the control
plane horizontally, or to scale vertically by increasing the number of
callback receivers.
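A rough sketch, assuming prometheus_client and a hypothetical metric name, of how a "seconds behind" value could be computed and exposed.

```python
import time
from prometheus_client import Gauge

callback_receiver_event_processing_lag_seconds = Gauge(
    "callback_receiver_event_processing_lag_seconds",
    "Seconds between when an event was created and when it was processed",
)

def record_lag(event_created_timestamp: float) -> None:
    # The larger this value, the further behind the callback receiver is.
    callback_receiver_event_processing_lag_seconds.set(time.time() - event_created_timestamp)
```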
- Adds a Metrics() class that can track data such as the number of
events the callback receiver has inserted into the database
- Exposes this metric data at the /api/v2/metrics/ endpoint in a
Prometheus-friendly format
- Metric data is stored in memory, then periodically saved to Redis.
- Metric data is periodically broadcast to other nodes in the cluster,
so that each node has a copy of the most recent metric data collected.
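A simplified sketch of a Metrics class along those lines; the field names, Redis key, and broadcast interval are assumptions, and the cross-node broadcast itself is only stubbed.

```python
import json
import time
import redis

class Metrics:
    def __init__(self, node, broadcast_interval=3):
        self.node = node
        self.conn = redis.Redis()
        self.data = {"callback_receiver_events_insert_db": 0}  # in-memory counters
        self.broadcast_interval = broadcast_interval
        self.last_broadcast = 0.0

    def inc(self, field, value=1):
        # Track data in memory first; nothing touches Redis on the hot path.
        self.data[field] = self.data.get(field, 0) + value

    def save_to_redis(self):
        # Persist the in-memory counters under a per-node key.
        self.conn.set(f"awx_metrics_{self.node}", json.dumps(self.data))

    def maybe_broadcast(self):
        # Periodically save locally and broadcast so other nodes hold a fresh copy.
        if time.time() - self.last_broadcast > self.broadcast_interval:
            self.save_to_redis()
            # Cross-node broadcast (e.g. via the "metrics" channels group) omitted here.
            self.last_broadcast = time.time()
```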