External-Mirrors/awx

mirror of https://github.com/ansible/awx.git synced 2026-03-11 06:29:31 -02:30

Author	SHA1	Message	Date
Alan Rominger	5e93f60b9e	AAP-41776 Enable new fancy asyncio metrics for dispatcherd (#16233) * Enable new fancy asyncio metrics for dispatcherd Remove old dispatcher metrics and patch in new data from local whatever Update test fixture to new dispatcherd version * Update dispatcherd again * Handle node filter in URL, and catch more errors * Add test for metric filter * Split module for dispatcherd metrics	2026-02-04 15:28:34 -05:00
Lila Yasin	4f41b50a09	AAP-57817 Add Redis connection retry using redis-py 7.0+ built-in (#16176) * AAP-57817 Add Redis connection retry using redis-py 7.0+ built-in mechanism * Refactor Redis client helpers to use settings and eliminate code duplication * Create awx/main/utils/redis.py and move Redis client functions to avoid circular imports * Fix subsystem_metrics to share Redis connection pool between client and pipeline * Cache Redis clients in RelayConsumer and RelayWebsocketStatsManager to avoid creating new connection pools on every call * Add cap and base config * Add Redis retry logic with exponential backoff to handle connection failures during long-running operations * Add REDIS_BACKOFF_CAP and REDIS_BACKOFF_BASE settings to allow adjustment of retry timing in worst-case scenarios without code changes * Simplify Redis retry tests by removing unnecessary reload logic	2025-12-01 09:08:47 -05:00
Chris Meyers	51b2524b25	Gracefully handle hostname change in metrics code * Previously, we would error out because we assumed that when we got a metrics payload from redis, there was data in it and it was for the current host. * Now, we no longer assume that a metrics payload is well formed and for the current hostname, because the hostname could have changed and we may not yet have collected metrics for the new host.	2025-10-09 14:08:01 -04:00
Chris Coutinho	612e8e7688	Fix duplicate metrics in AWX subsystem_metrics (#15964) Separate out operation subsystem metrics to fix duplicate error Remove unnecessary comments Revert to single subsystem_metrics_* metric with labels Format via black	2025-10-09 10:28:55 +02:00
Alan Rominger	c3ee0c2d8a	Sensible log behavior when redis is unavailable (#15466) * Sensible log behavior when redis is unavailable * Consistent behavior with dispatcher and callback	2025-04-10 13:45:05 -07:00
Chris Meyers	d388f91bcd	Metrics dispatcher callback receiver swaparoo	2024-11-08 00:06:17 -05:00
Chris Meyers	8a902debd5	Per-service metrics http server * Organize metrics into their respective service * Serve per-service metrics on a per-service http server * Increase prometheus client usage over our custom metrics fields	2024-02-05 15:17:24 -05:00
Alan Rominger	ef99770383	Add subsystem metrics for the dispatcher (#13989) This adds a handful of metrics to /api/v2/metrics/ recorded from the dispatcher main process Adds logic in the dispatcher periodic tasks to calculate these for the last collection interval Reports worker count, task count, scale up events, and availability Add data to demo grafana dashboard	2023-05-17 14:29:31 -04:00
Seth Foster	1c51ef8a69	Store serialized metrics locally (#13833)	2023-04-11 15:06:48 -04:00
Seth Foster	33f070081c	Send subsystem metrics via wsrelay (#13333) Works by adding a dedicated producer in wsrelay that looks for local django channels message with group "metrics". The producer sends this to the consumer running in the web container. The consumer running in the web container handles the message by pushing it into the local redis instance. The django view that handles a request at the /api/v2/metrics endpoint will load this data from redis, format it, and return the response.	2023-03-29 22:09:18 -04:00
Shane McDonald	ab6d56c24e	initial PoC for wsrelay Checkpoint	2023-03-29 22:04:43 -04:00
Alan Rominger	1f939aa25e	Merge pull request #12884 from AlanCoding/is_testing [tech debt] Move the IS_TESTING method out of settings	2022-11-09 15:29:35 -05:00
Alan Rominger	a64467c5a6	Shortcut Instance.objects.me when possible	2022-10-05 09:11:42 -04:00
Alan Rominger	cfce31419d	Move the IS_TESTING method out of settings	2022-09-28 11:19:10 -04:00
Alan Rominger	9e8ba6ca09	Merge pull request #12494 from AlanCoding/revival Register system again if deleted by another pod	2022-08-17 10:12:39 -04:00
Alan Rominger	268ab128d7	Merge pull request #12527 from AlanCoding/offline_db Further resiliency changes, specifically focused on case of database going offline	2022-08-17 10:10:50 -04:00
Seth Foster	55d295c2a6	Add metric to measure task manager transaction, including on_commit calls	2022-08-15 12:44:29 -04:00
Alan Rominger	30f556f845	Further resiliency changes focused on offline database Make logs from database outage more manageable Raise exception if update_model never recovers from problem	2022-08-10 16:16:57 -04:00
Alan Rominger	f7e6a32444	Optimize task manager with debug toolbar, adjust prefetch (#12588)	2022-08-10 10:05:13 -04:00
Alan Rominger	585d3f4e2a	Register system again if deleted by another pod Avoid cases where missing instance would throw error on startup this gives time for heartbeat to register it	2022-08-08 22:36:17 -04:00
Seth Foster	431b9370df	Split TaskManager into - DependencyManager spawns dependencies if necessary - WorkflowManager processes running workflows to see if a new job is ready to spawn - TaskManager starts tasks if unblocked and has execution capacity	2022-08-05 14:29:02 -04:00
Seth Foster	c92619a2dc	Subsystem metrics reset_values should remove all redis keys	2022-06-16 16:54:37 -04:00
Elijah DeLee	7cbe112e4e	possible work around for 500 on /api/v2/metrics (#12376) we've observed this in development and some users have reported experiencing 500's on /api/v2/metrics because of a key error here where a metric is missing from a certain instance	2022-06-16 13:15:25 -04:00
Seth Foster	2f82b75748	Add subsystem metrics for task manager	2022-06-14 11:00:11 -04:00
Rebeccah	5f9326b131	added average event processing metric (in seconds) that can be served to grafana via prometheus. This metric is a good indicator of how far behind the callback receiver is. The higher the load the further behind/the greater the number of seconds the metric will display. This number being high may indicate the need for horizontal scaling in the control plane or vertically scaling the number of callback receivers.	2022-06-06 15:14:56 -04:00
Seth Foster	acebff7be1	Fix sync-only operation in async context	2022-03-21 14:37:10 -04:00
Seth Foster	6db7cea148	variable name changes	2022-02-10 10:57:00 -05:00
Seth Foster	3993aa9524	Add metric for number of events emitted over websocket broadcast	2022-02-09 21:57:01 -05:00
Seth Foster	0c569c67fd	Add subsystem metrics - Adds a Metrics() class that can track data such as number of events the callback receiver inserted into database - Exposes this metric data at the api/v2/metrics/ endpoint. This data is prometheus-friendly - Metric data is stored in memory, then periodically saved to Redis. - Metric data is periodically broadcast to other nodes in the cluster, so that each node has a copy of the most recent metric data collected.	2021-03-25 15:23:52 -04:00

29 Commits