* Enable new fancy asyncio metrics for dispatcherd
Remove old dispatcher metrics and patch in new data from local whatever
Update test fixture to new dispatcherd version
* Update dispatcherd again
* Handle node filter in URL, and catch more errors
* Add test for metric filter
* Split module for dispatcherd metrics
* AAP-57817 Add Redis connection retry using redis-py 7.0+ built-in mechanism
* Refactor Redis client helpers to use settings and eliminate code duplication
* Create awx/main/utils/redis.py and move Redis client functions to avoid circular imports
* Fix subsystem_metrics to share Redis connection pool between
client and pipeline
* Cache Redis clients in RelayConsumer and RelayWebsocketStatsManager to avoid creating new connection pools on every call
* Add cap and base config
* Add Redis retry logic with exponential backoff to handle connection failures during long-running operations
* Add REDIS_BACKOFF_CAP and REDIS_BACKOFF_BASE settings to allow
adjustment of retry timing in worst-case scenarios without code changes
* Simplify Redis retry tests by removing unnecessary reload logic
* Previously, we would error out because we assumed that any metrics
payload we got from Redis contained data and was for the current
host.
* Now, we no longer assume that a metrics payload is well formed and
for the current hostname, because the hostname could have changed
and we might not yet have collected metrics for the new host.
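
To illustrate the Redis retry settings above (REDIS_BACKOFF_CAP and REDIS_BACKOFF_BASE), here is a minimal sketch of a capped exponential backoff schedule. It assumes the delay before the n-th retry is min(cap, base * 2**n), the usual capped form; the function name and the values are hypothetical and are not AWX defaults or the redis-py implementation.

```python
def backoff_delays(base, cap, retries):
    """Return the delay (in seconds) to wait before each retry attempt."""
    # The delay doubles with every failure until it reaches the cap.
    return [min(cap, base * 2 ** failures) for failures in range(retries)]


# Hypothetical values chosen for illustration only.
delays = backoff_delays(base=0.5, cap=4.0, retries=5)
print(delays)
print("worst-case total wait:", sum(delays), "seconds")
```

Raising the cap lengthens the worst-case wait without changing the early retries, while raising the base stretches every step; that is presumably why both are exposed as settings.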
Separate out operation subsystem metrics to fix duplicate error
Remove unnecessary comments
Revert to single subsystem_metrics_* metric with labels
Format via black
* Organize metrics into their respective service
* Serve per-service metrics on a per-service http server
* Increase prometheus client usage over our custom metrics fields
This adds a handful of metrics to /api/v2/metrics/ recorded from the dispatcher main process
Adds logic in the dispatcher periodic tasks to calculate these for the last collection interval
Reports worker count, task count, scale up events, and availability
Add data to demo grafana dashboard
Works by adding a dedicated producer in wsrelay that looks for
local django channels messages in the "metrics" group. The producer
sends these to the consumer running in the web container.
The consumer running in the web container handles the message by
pushing it into the local redis instance.
The django view that handles a request at the /api/v2/metrics
endpoint will load this data from redis, format it, and return the
response.
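
The flow above can be sketched roughly as follows. This is a simplified model, not the wsrelay code: a plain dict stands in for the local redis instance, and the message shape, key prefix, function names, and metric name are all hypothetical.

```python
import json


def relay_producer(messages):
    """Yield only the local messages addressed to the "metrics" group."""
    for message in messages:
        if message.get("group") == "metrics":
            yield message


def web_consumer(message, store):
    """Push a relayed payload into the local store, keyed by source node."""
    store["metrics:" + message["node"]] = json.dumps(message["data"])


def metrics_view(store):
    """Load every stored payload and format it as Prometheus-style text."""
    lines = []
    for key in sorted(store):
        node = key.split(":", 1)[1]
        for name, value in sorted(json.loads(store[key]).items()):
            lines.append(f'{name}{{node="{node}"}} {value}')
    return "\n".join(lines) + "\n"


local_store = {}  # stand-in for the local redis instance (assumption)
incoming = [
    {"group": "metrics", "node": "awx-1", "data": {"events_inserted": 12}},
    {"group": "jobs", "node": "awx-1", "data": {"ignored": 1}},
]
for relayed in relay_producer(incoming):
    web_consumer(relayed, local_store)
print(metrics_view(local_store), end="")
```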
- DependencyManager spawns dependencies if necessary
- WorkflowManager processes running workflows to see if a new job is
ready to spawn
- TaskManager starts tasks if unblocked and has execution capacity
We've observed this in development, and some users have reported 500s on /api/v2/metrics caused by a KeyError here when a metric is missing from a certain instance
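
One way to avoid this kind of KeyError is to treat the stored payload as untrusted. The sketch below assumes the payload is JSON keyed by hostname; that shape and the helper name are hypothetical, not the actual AWX storage format.

```python
import json


def load_host_metrics(raw_payload, hostname):
    """Return the metrics dict for hostname, or {} if it is unavailable."""
    if not raw_payload:
        # Nothing has been collected yet.
        return {}
    try:
        payload = json.loads(raw_payload)
    except (TypeError, ValueError):
        # Malformed payload: report no metrics instead of raising.
        return {}
    if not isinstance(payload, dict):
        return {}
    # The hostname may have changed, so its entry may not exist yet.
    host_data = payload.get(hostname)
    return host_data if isinstance(host_data, dict) else {}


raw = '{"awx-1": {"events_inserted": 12}}'
print(load_host_metrics(raw, "awx-1"))
print(load_host_metrics(raw, "awx-renamed"))
```

Returning an empty dict lets the caller render an empty response for that instance rather than a 500.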
grafana via prometheus.
This metric is a good indicator of how far behind the callback receiver
is. The higher the load, the further behind it falls and the greater
the number of seconds the metric will display.
A high value may indicate the need to scale the control plane
horizontally or to vertically scale the number of callback
receivers.
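
As a rough illustration of a "seconds behind" metric, the sketch below assumes the lag is the time between an event's creation and the moment the callback receiver processes it. That definition and the function name are assumptions, not necessarily how the metric is computed.

```python
from datetime import datetime, timezone


def callback_receiver_lag_seconds(event_created, now=None):
    """Seconds elapsed between event creation and processing time."""
    now = now or datetime.now(timezone.utc)
    # Clamp at zero so clock skew never reports a negative lag.
    return max(0.0, (now - event_created).total_seconds())


created = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
processed = datetime(2024, 1, 1, 12, 0, 30, tzinfo=timezone.utc)
print(callback_receiver_lag_seconds(created, processed))
```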
- Adds a Metrics() class that can track data such as the number of
events the callback receiver inserted into the database
- Exposes this metric data at the api/v2/metrics/ endpoint.
This data is prometheus-friendly
- Metric data is stored in memory, then periodically saved to Redis.
- Metric data is periodically broadcast to other nodes in the cluster,
so that each node has a copy of the most recent metric data collected.
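
The "stored in memory, then periodically saved" pattern could look roughly like the sketch below. It is not the actual Metrics() class: a dict stands in for Redis, and the constructor arguments, method names, and interval are hypothetical.

```python
import time


class Metrics:
    """Accumulate counters in memory and flush them to a store periodically."""

    def __init__(self, store, save_interval=3.0, clock=time.monotonic):
        self.store = store  # stand-in for Redis (assumption)
        self.save_interval = save_interval
        self.clock = clock
        self.pending = {}
        self.last_save = clock()

    def inc(self, name, amount=1):
        """Record an increment in memory; flush if the interval has passed."""
        self.pending[name] = self.pending.get(name, 0) + amount
        if self.clock() - self.last_save >= self.save_interval:
            self.save()

    def save(self):
        """Add pending increments to the store and reset the in-memory state."""
        for name, amount in self.pending.items():
            self.store[name] = self.store.get(name, 0) + amount
        self.pending.clear()
        self.last_save = self.clock()


shared_store = {}
metrics = Metrics(shared_store, save_interval=0.0)  # flush on every call
metrics.inc("events_inserted", 5)
print(shared_store)
```

Batching writes this way keeps the hot path (the callback receiver inserting events) from touching Redis on every increment.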