29 Commits

Author SHA1 Message Date
Alan Rominger
5e93f60b9e AAP-41776 Enable new fancy asyncio metrics for dispatcherd (#16233)
* Enable new fancy asyncio metrics for dispatcherd

Remove old dispatcher metrics and patch in new data from local whatever

Update test fixture to new dispatcherd version

* Update dispatcherd again

* Handle node filter in URL, and catch more errors

* Add test for metric filter

* Split module for dispatcherd metrics
2026-02-04 15:28:34 -05:00
Lila Yasin
4f41b50a09 AAP-57817 Add Redis connection retry using redis-py 7.0+ built-in (#16176)
* AAP-57817 Add Redis connection retry using redis-py 7.0+ built-in mechanism

* Refactor Redis client helpers to use settings and eliminate code duplication

* Create awx/main/utils/redis.py and move Redis client functions to avoid circular imports

* Fix subsystem_metrics to share Redis connection pool between
  client and pipeline

* Cache Redis clients in RelayConsumer and RelayWebsocketStatsManager to avoid creating new connection pools on every call

* Add cap and base config

* Add Redis retry logic with exponential backoff to handle connection failures during long-running operations

* Add REDIS_BACKOFF_CAP and REDIS_BACKOFF_BASE settings to allow
  adjustment of retry timing in worst-case scenarios without code changes

* Simplify Redis retry tests by removing unnecessary reload logic
2025-12-01 09:08:47 -05:00
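
The capped exponential backoff described in the commit above can be sketched in plain Python. This is only an illustration of the timing idea, not the redis-py mechanism the commit relies on. The setting names come from the commit message, but the values here and the helper names `backoff_delay` and `call_with_retry` are assumptions.

```python
import time

# Hypothetical values; the commit names these settings but not their defaults.
REDIS_BACKOFF_BASE = 0.5  # seconds
REDIS_BACKOFF_CAP = 4.0   # seconds


def backoff_delay(failures, base=REDIS_BACKOFF_BASE, cap=REDIS_BACKOFF_CAP):
    """Delay before the next retry: doubles per failure, never exceeds cap."""
    return min(cap, base * (2 ** failures))


def call_with_retry(operation, retries=3, sleep=time.sleep):
    """Run operation(), retrying on ConnectionError with capped backoff."""
    for failures in range(retries + 1):
        try:
            return operation()
        except ConnectionError:
            if failures == retries:
                raise  # out of retries, surface the original error
            sleep(backoff_delay(failures))
```

Raising the cap lengthens the worst-case wait between attempts, while raising the base slows every retry, which is presumably why both are exposed as settings.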
Chris Meyers
51b2524b25 Gracefully handle hostname change in metrics code
* Previously, we would error out because we assumed that any metrics
  payload we got from redis contained data and was for the current
  host.
* Now, we no longer assume that a metrics payload is well formed and
  for the current hostname, because the hostname could have changed
  before we collected metrics for the new host.
2025-10-09 14:08:01 -04:00
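
A minimal sketch of the defensive lookup that the commit above describes. It assumes, purely for illustration, that the payload is a JSON object keyed by hostname; the real payload layout is not given in the commit message, and `metrics_for_host` is a hypothetical helper name.

```python
import json


def metrics_for_host(raw_payload, hostname):
    """Return the metrics dict for hostname, or {} if absent or malformed."""
    if not raw_payload:
        return {}  # nothing stored in redis yet
    try:
        data = json.loads(raw_payload)
    except (TypeError, ValueError):
        return {}  # payload is not valid JSON
    if not isinstance(data, dict):
        return {}
    host_metrics = data.get(hostname)
    # After a hostname change, only the old name may be present.
    return host_metrics if isinstance(host_metrics, dict) else {}
```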
Chris Coutinho
612e8e7688 Fix duplicate metrics in AWX subsystem_metrics (#15964)
Separate out operation subsystem metrics to fix duplicate error

Remove unnecessary comments

Revert to single subsystem_metrics_* metric with labels

Format via black
2025-10-09 10:28:55 +02:00
Alan Rominger
c3ee0c2d8a Sensible log behavior when redis is unavailable (#15466)
* Sensible log behavior when redis is unavailable

* Consistent behavior with dispatcher and callback
2025-04-10 13:45:05 -07:00
Chris Meyers
d388f91bcd Metrics dispatcher callback receiver swaparoo 2024-11-08 00:06:17 -05:00
Chris Meyers
8a902debd5 Per-service metrics http server
* Organize metrics into their respective service
* Serve per-service metrics on a per-service http server
* Increase prometheus client usage over our custom metrics fields
2024-02-05 15:17:24 -05:00
Alan Rominger
ef99770383 Add subsystem metrics for the dispatcher (#13989)
This adds a handful of metrics to /api/v2/metrics/ recorded from the dispatcher main process

Adds logic in the dispatcher periodic tasks to calculate these for the last collection interval
Reports worker count, task count, scale up events, and availability

Add data to demo grafana dashboard
2023-05-17 14:29:31 -04:00
Seth Foster
1c51ef8a69 Store serialized metrics locally (#13833) 2023-04-11 15:06:48 -04:00
Seth Foster
33f070081c Send subsystem metrics via wsrelay (#13333)
Works by adding a dedicated producer in wsrelay that looks for
local django channels message with group "metrics". The producer
sends this to the consumer running in the web container.

The consumer running in the web container handles the message by
pushing it into the local redis instance.

The django view that handles a request at the /api/v2/metrics
endpoint will load this data from redis, format it, and return the
response.
2023-03-29 22:09:18 -04:00
Shane McDonald
ab6d56c24e initial PoC for wsrelay
Checkpoint
2023-03-29 22:04:43 -04:00
Alan Rominger
1f939aa25e Merge pull request #12884 from AlanCoding/is_testing
[tech debt] Move the IS_TESTING method out of settings
2022-11-09 15:29:35 -05:00
Alan Rominger
a64467c5a6 Shortcut Instance.objects.me when possible 2022-10-05 09:11:42 -04:00
Alan Rominger
cfce31419d Move the IS_TESTING method out of settings 2022-09-28 11:19:10 -04:00
Alan Rominger
9e8ba6ca09 Merge pull request #12494 from AlanCoding/revival
Register system again if deleted by another pod
2022-08-17 10:12:39 -04:00
Alan Rominger
268ab128d7 Merge pull request #12527 from AlanCoding/offline_db
Further resiliency changes, specifically focused on case of database going offline
2022-08-17 10:10:50 -04:00
Seth Foster
55d295c2a6 Add metric to measure task manager transaction, including on_commit calls 2022-08-15 12:44:29 -04:00
Alan Rominger
30f556f845 Further resiliency changes focused on offline database
Make logs from database outage more manageable

Raise exception if update_model never recovers from problem
2022-08-10 16:16:57 -04:00
Alan Rominger
f7e6a32444 Optimize task manager with debug toolbar, adjust prefetch (#12588) 2022-08-10 10:05:13 -04:00
Alan Rominger
585d3f4e2a Register system again if deleted by another pod
Avoid cases where a missing instance
  would throw an error on startup;
  this gives time for the heartbeat to register it
2022-08-08 22:36:17 -04:00
Seth Foster
431b9370df Split TaskManager into
- DependencyManager spawns dependencies if necessary
- WorkflowManager processes running workflows to see if a new job is
  ready to spawn
- TaskManager starts tasks if unblocked and has execution capacity
2022-08-05 14:29:02 -04:00
Seth Foster
c92619a2dc Subsystem metrics reset_values should remove all redis keys 2022-06-16 16:54:37 -04:00
Elijah DeLee
7cbe112e4e possible work around for 500 on /api/v2/metrics (#12376)
We've observed this in development, and some users have reported 500s on /api/v2/metrics, because of a KeyError raised here when a metric is missing from a certain instance
2022-06-16 13:15:25 -04:00
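
The failure mode in the commit above (a metric missing for one instance raising a KeyError) can be illustrated with a small sketch. It is not the actual fix; the data layout (a dict of per-instance metric dicts) and the name `collect_metric` are assumptions.

```python
def collect_metric(per_instance, metric_name):
    """Map each instance to its value for metric_name.

    Instances that have not reported the metric are skipped instead of
    raising KeyError.
    """
    values = {}
    for instance, metrics in per_instance.items():
        if metric_name not in metrics:
            continue  # this instance has no data for the metric yet
        values[instance] = metrics[metric_name]
    return values
```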
Seth Foster
2f82b75748 Add subsystem metrics for task manager 2022-06-14 11:00:11 -04:00
Rebeccah
5f9326b131 added average event processing metric (in seconds) that can be served to
grafana via prometheus.

This metric is a good indicator of how far behind the callback receiver
is. The higher the load, the further behind it falls and the greater
the number of seconds the metric will display.

This number being high may indicate the need for horizontal scaling in
the control plane or vertically scaling the number of callback
receivers.
2022-06-06 15:14:56 -04:00
Seth Foster
acebff7be1 Fix sync-only operation in async context 2022-03-21 14:37:10 -04:00
Seth Foster
6db7cea148 variable name changes 2022-02-10 10:57:00 -05:00
Seth Foster
3993aa9524 Add metric for number of events emitted over websocket broadcast 2022-02-09 21:57:01 -05:00
Seth Foster
0c569c67fd Add subsystem metrics
- Adds a Metrics() class that can track data such as number of
  events the callback receiver inserted into database
- Exposes this metric data at the api/v2/metrics/ endpoint.
  This data is prometheus-friendly
- Metric data is stored in memory, then periodically saved to Redis.
- Metric data is periodically broadcast to other nodes in the cluster,
  so that each node has a copy of the most recent metric data collected.
2021-03-25 15:23:52 -04:00
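
The pattern in the commit above (count in memory, flush periodically, expose in a prometheus-friendly form) can be sketched as follows. This is a simplified illustration, not the AWX implementation: a plain dict stands in for Redis, and the class layout, `save_interval`, `render_prometheus`, and the metric and node names in the test are all assumptions.

```python
import time


class Metrics:
    """Accumulate counters in memory and flush them to a shared store."""

    def __init__(self, store, save_interval=3.0, clock=time.monotonic):
        self._store = store          # dict standing in for Redis
        self._values = {}            # in-memory counters since last save
        self._save_interval = save_interval
        self._clock = clock
        self._last_saved = clock()

    def inc(self, name, amount=1):
        self._values[name] = self._values.get(name, 0) + amount

    def should_save(self):
        return self._clock() - self._last_saved >= self._save_interval

    def save(self):
        # Add the in-memory deltas to the stored totals, then reset.
        for name, amount in self._values.items():
            self._store[name] = self._store.get(name, 0) + amount
        self._values.clear()
        self._last_saved = self._clock()


def render_prometheus(store, node):
    """Format stored counters as prometheus text lines labeled by node."""
    return "".join(f'{name}{{node="{node}"}} {store[name]}\n' for name in sorted(store))
```

Keeping increments in memory means the hot path never touches the store; only the periodic `save()` does.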