External-Mirrors/awx

mirror of https://github.com/ansible/awx.git synced 2026-03-03 09:48:51 -03:30

Author	SHA1	Message	Date
Alan Rominger	f377b5fdde	Use runtime log utility moved to DAB (#15675)	2024-12-11 10:38:24 -05:00
Rick Elrod	48edb15a03	Prevent Dispatcher deadlock when Redis disappears (#14249) This fixes https://github.com/ansible/awx/issues/14245 which has more information about this issue. This change addresses both: - A clashing signal handler (registering a callback to fire when the task manager times out, and hitting that callback in cases where we didn't expect to). Make dispatcher timeout use SIGUSR1, not SIGTERM. - Metrics not being reported should not make us crash, so that is now fixed as well. Signed-off-by: Rick Elrod <rick@elrod.me> Co-authored-by: Alan Rominger <arominge@redhat.com>	2023-07-18 10:43:46 -05:00
Alan Rominger	ef99770383	Add subsystem metrics for the dispatcher (#13989) This adds a handful of metrics to /api/v2/metrics/, recorded from the dispatcher main process. Adds logic in the dispatcher periodic tasks to calculate these for the last collection interval. Reports worker count, task count, scale-up events, and availability. Adds data to the demo Grafana dashboard.	2023-05-17 14:29:31 -04:00
Alan Rominger	f5785976be	Update to comply with new black rules	2023-02-01 14:59:38 -05:00
Jeff Bradberry	e029cf7196	Remove the django-qsstats-magic dependency	2022-11-10 15:37:44 -05:00
Alan Rominger	cba780a8f8	Fix dispatcher connection deadlock w scheduler and cleanup	2022-10-19 12:12:15 -04:00
Alan Rominger	974f845059	Revert "Merge pull request #12584 from AlanCoding/lazy_workers" This reverts commit `64157f7207`, reversing changes made to `9e8ba6ca09`.	2022-08-28 23:04:13 -04:00
Alan Rominger	e0c59d12c1	Change data structure so we can conditionally reap waiting jobs	2022-08-17 16:00:30 -04:00
Alan Rominger	6719010050	Add back in cleanup call	2022-08-17 15:42:48 -04:00
Alan Rominger	ccd46a1c0f	Move reaper logic into worker, avoiding bottlenecks	2022-08-17 15:42:47 -04:00
Alan Rominger	621833ef0e	Add extra workers if computing based on memory Co-authored-by: Elijah DeLee <kdelee@redhat.com>	2022-08-17 11:41:59 -04:00
Shane McDonald	3c51cb130f	Add grace period settings for task manager timeout, and pod / job waiting reapers Co-authored-by: Alan Rominger <arominge@redhat.com>	2022-08-17 11:39:01 -04:00
Shane McDonald	c649809eb2	Remove debug method that calls cleanup - It's unclear why this was here. - Removing it doesn't appear to cause any problems. - It still gets called during heartbeats.	2022-08-17 11:35:43 -04:00
Alan Rominger	a3fef27002	Add logs to debug waiting bottlenecking	2022-08-17 11:33:49 -04:00
Alan Rominger	278db2cdde	Split reaper for running and waiting jobs. Avoid running jobs that have already been reaped. Co-authored-by: Elijah DeLee <kdelee@redhat.com> Remove unnecessary extra actions. Fix waiting jobs in other cases of reaping.	2022-08-17 10:53:29 -04:00
Alan Rominger	c5985c4c81	Change lazy worker method name and adjust log	2022-08-10 16:12:03 -04:00
Alan Rominger	a9170236e1	Wait 60 seconds before scaling down a worker	2022-08-10 16:12:03 -04:00
Alan Rominger	f7e6a32444	Optimize task manager with debug toolbar, adjust prefetch (#12588)	2022-08-10 10:05:13 -04:00
Seth Foster	b3eb9e0193	pid kill each of the 3 task managers on timeout	2022-08-05 14:33:25 -04:00
Elijah DeLee	7eb0c7dd28	exit task manager loops early if we are timed out; add settings to define task manager timeout and grace period. This still gives us TASK_MANAGER_TIMEOUT_GRACE_PERIOD amount of time to get out of the task manager. Also, apply start task limit in WorkflowManager to starting pending workflows	2022-08-05 14:33:24 -04:00
Jeff Bradberry	1803c5bdb4	Fix up usage of django-guid It has replaced the class-based middleware, everything is function-based now.	2022-03-14 13:19:57 -04:00
Jeff Bradberry	b852baaa39	Fix up logger .warn() calls to use .warning() instead This is a usage that was deprecated in Python 3.0.	2022-03-07 18:11:36 -05:00
Elijah DeLee	4bd6c2a804	set max dispatch workers to same as max forks Right now, without this, we end up with a different number for max_workers than max_forks. For example, on a control node with 16 Gi of RAM, max_mem_capacity w/ 100 MB/fork = (16*1024)/100 --> 164, while max_workers = 5 * 16 --> 80. This means we would allow that control node to control up to 164 jobs, but all jobs after the 80th job will be stuck in `waiting`, waiting for a dispatch worker to free up to run the job.	2022-02-24 10:53:54 -05:00
Elijah DeLee	799968460d	Fixup conversion of memory and cpu settings to support k8s resource request format (#11725) fix memory and cpu settings to support k8s resource request format * fix conversion of memory setting to bytes This setting has not been getting set by default, and needed some fixing up to be compatible with setting the memory in the same way as we set it in the operator, as well as with other changes from last year which assume that ansible runner is returning memory in bytes. This way we can start setting this setting in the operator, and get a more accurate reflection of how much memory is available to the control pod in k8s. On platforms where services are all sharing memory, we deduct a penalty from the memory available. On k8s we don't need to do this because the web, redis, and task containers each have memory allocated to them. * Support CPU setting expressed in units used by k8s This setting has not been getting set by default, and needed some fixing up to be compatible with setting the CPU resource request/limits in the same way as we set it in the resource requests/limits. This way we can start setting this setting in the operator, and get a more accurate reflection of how much cpu is available to the control pod in k8s. Because cpu on k8s can be partial cores, migrate cpu field to decimal. k8s does not allow granularity of less than 100m (equivalent to 0.1 cores), so only store up to 1 decimal place. fix analytics to deal with decimal cpu: need to use DjangoJSONEncoder when Decimal fields are in data passed to json.dumps	2022-02-15 14:08:24 -05:00
Alan Rominger	b721a4b361	Remove dev-only log filters and downgrade periodic logs	2021-12-07 14:35:02 -05:00
Ryan Petrello	c2ef0a6500	move code linting to a stricter pep8-esque auto-formatting tool, black	2021-03-23 09:39:58 -04:00
Ryan Petrello	3cc3cf1f80	add a per-request GUID and log as it travels through background services see: https://github.com/ansible/awx/issues/9329	2021-02-17 12:54:13 -05:00
Ryan Petrello	57f8e48894	make --status more robust for dispatcher, and add support for receiver. make the --status flag work by fetching a periodically recorded snapshot of internal process state; additionally, update the callback receiver to also record these statistics so we can gain more insight into any performance issues	2020-09-17 15:33:37 -04:00
Ryan Petrello	0df6409244	remove task state tracking from the callback receiver we don't have support for displaying these stats anyways, so there's no point in using resources tracking them, especially for high-volume installs	2020-09-16 13:40:42 -04:00
Rebeccah	118e1b8df1	removing memcached mentions in comments; remove memcached folder as it is no longer needed, also address a couple of grammatical errors	2020-06-18 15:52:59 -04:00
Ryan Petrello	8b1806d4ca	add code for detecting (and killing) a hung task manager task	2020-02-26 07:53:04 -05:00
Ryan Petrello	8f33f1a6c2	remove another expensive logging lookup in the parent callback process	2020-01-24 16:46:32 -05:00
Ryan Petrello	306f504fb7	optimize the callback receiver to buffer writes on high throughput. additionally, optimize away several per-event host lookups and changed/failed propagation lookups. we've always performed these (fairly expensive) queries on every event save - if you're processing tens of thousands of events in short bursts, this is way too slow. this commit also introduces a new command for profiling the insertion rate of events, `awx-manage callback_stats`. see: https://github.com/ansible/awx/issues/5514	2020-01-14 12:04:26 -05:00
AlanCoding	eec08fdcca	Log case of duplicate UUIDs	2020-01-09 07:31:32 -05:00
Ryan Petrello	83550eeba0	make the callback receiver more robust to duplicate UUIDs from ansible	2019-11-01 09:24:52 -04:00
Ryan Petrello	32ee9838af	use the correct logger for the callback receiver the callback receiver and dispatcher share several modules, so add logic to use the correct logger	2019-03-15 08:09:47 -04:00
Ryan Petrello	4707dc2a05	clean up some unnecessary dispatcher reaping code	2019-01-24 11:11:05 -05:00
Ryan Petrello	b2442d42a3	detect dead DB connections in the dispatcher when reaping jobs	2019-01-22 08:40:26 -05:00
Ryan Petrello	f223df303f	convert py2 -> py3	2019-01-15 14:09:01 -05:00
Matthew Jones	7330102961	Remove a warning message for dispatcher pool for tests	2018-11-19 11:19:57 -05:00
Ryan Petrello	37234ca66e	prevent the dispatcher from using a nonsensical max_workers value	2018-11-16 10:16:39 -05:00
Ryan Petrello	0d29bbfdc6	make the dispatcher more fault-tolerant to prolonged database outages	2018-10-18 20:00:07 -04:00
Ryan Petrello	720a634702	don't attempt to recover special QUIT messages in the worker pool when `--reload` is sent to the dispatcher, it sends a special QUIT message to each worker in the pool so that it will exit gracefully at the next opportunity when a worker process exits unexpectedly, the dispatcher attempts to recover its queued messages and sends them to another worker in the pool; in this scenario, we should _never_ re-enqueue these special QUIT messages (because the process doesn't need to quit, it's already gone) To reproduce this race condition: 1. Launch an adhoc that does `sleep 60` 2. Run `awx-manage run_dispatcher --reload` to enqueue a `QUIT` message into the worker's queue 3. Find the pid of the worker running the `sleep 60` and `SIGKILL` it. 4. Observe that dispatcher attempts to requeue the `QUIT` message and logs a confusing error.	2018-10-15 12:17:52 -04:00
Ryan Petrello	ff1e8cc356	replace celery task decorators with a kombu-based publisher this commit implements the bulk of `awx-manage run_dispatcher`, a new command that binds to RabbitMQ via kombu and balances messages across a pool of workers that are similar to celeryd workers in spirit. Specifically, this includes: - a new decorator, `awx.main.dispatch.task`, which can be used to decorate functions or classes so that they can be designated as "Tasks" - support for fanout/broadcast tasks (at this point in time, only `conf.Setting` memcached flushes use this functionality) - support for job reaping - support for success/failure hooks for job runs (i.e., `handle_work_success` and `handle_work_error`) - support for auto scaling worker pool that scale processes up and down on demand - minimal support for RPC, such as status checks and pool recycle/reload	2018-10-11 10:53:30 -04:00
Ryan Petrello	da74f1d01f	refactor and test the callback receiver as a base for a task dispatcher	2018-10-11 10:53:26 -04:00

45 Commits
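
The capacity mismatch described in commit 4bd6c2a804 can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical and are not taken from the AWX code base, and rounding up to reach 164 is an assumption (the commit message only gives the result). The 100 MB/fork and 5 workers per Gi figures come from the commit message itself.

```python
import math


def mem_based_max_forks(total_mem_mib, mem_per_fork_mib=100):
    """Hypothetical helper: forks allowed by memory, rounded up (assumed)."""
    return math.ceil(total_mem_mib / mem_per_fork_mib)


def old_max_workers(total_mem_gib, workers_per_gib=5):
    """Hypothetical helper: the pre-fix worker cap of 5 workers per Gi of RAM."""
    return workers_per_gib * total_mem_gib


# Control node with 16 Gi of RAM, as in the commit message.
forks = mem_based_max_forks(16 * 1024)   # (16*1024)/100 = 163.84, rounded up
workers = old_max_workers(16)            # 5 * 16

# Jobs the node accepts but cannot dispatch until a worker frees up.
stuck_in_waiting = max(0, forks - workers)

print(forks, workers, stuck_in_waiting)
```

With these numbers the node would accept 164 jobs while only 80 dispatch workers exist, which is the gap the commit closes by setting the worker cap equal to the fork cap.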