External-Mirrors/awx

mirror of https://github.com/ansible/awx.git synced 2026-03-03 01:38:50 -03:30

Author	SHA1	Message	Date
John Westcott IV	cd4d83acb7	Compensating for NUL unicode characters NUL characters are not allowed in text fields in the database We used to strip them out of stdout but the exception changed And we want to be sure to strip them out of JSONBlob fields	2023-06-14 17:40:15 -04:00
John Westcott IV	e47d30974c	Removing psycopg2 references	2023-06-14 17:40:15 -04:00
Alan Rominger	fbaeb90268	Apply conservative database connection reduction changes (#14066) This is expected to free up 4 additional database connections per traditional node compared to roughly 12 in total before this change Out of these 3 are accomplished by using existing connection for recently added services then 1 is obtained by closing the connection for the idle callback receiver main process Signed-off-by: jessicamack <jmack@redhat.com> Co-authored-by: jessicamack <jmack@redhat.com>	2023-06-01 14:59:18 -04:00
Alan Rominger	ef99770383	Add subsystem metrics for the dispatcher (#13989) This adds a handful of metrics to /api/v2/metrics/ recorded from the dispatcher main process Adds logic in the dispatcher period tasks to calculate these for the last collection interval Reports worker count, task count, scale up events, and availability Add data to demo grafana dashboard	2023-05-17 14:29:31 -04:00
Alan Rominger	342e9197b8	Customize application_name for different connections in dispatcher service (#13074) * Introduce new method in settings, import in-line w NOQA mark * Further refine the app_name to use shorter service names like dispatcher * Clean up listener logic, change some names	2023-04-13 22:36:36 -04:00
Hao Liu	c8c8ed1775	Raise ValueError when no ready and enabled task instance	2023-03-29 22:09:19 -04:00
Hao Liu	25303ee625	Only select task instance that are ready and enabled When select a queue for task instance to run task only select task instance that are ready and enabled	2023-03-29 22:09:19 -04:00
Hao Liu	cd3f7666be	add get_task_queuename get_local_queuename will return the pod name of the instance now that web and task are in different pods when web container queue a task it will be put into a queue without a task worker to execute the task	2023-03-29 22:09:19 -04:00
Jessica Mack	43f4872fec	these methods don't need to be class methods Signed-off-by: Jessica Mack <jmack@redhat.com>	2023-03-29 22:04:43 -04:00
Gabriel Muniz	e15f4de0dd	Fix race with heartbeat and reaper logic (#13713) * Fix race with heartbeat and reaper logic * Fix tests to fail when over drift over heartbeat time * replaced modified with started time for reap() code and added test * fixed logic bug and cleaned up tests * Added comments to tests to call out reasoning	2023-03-17 14:24:31 -04:00
Alan Rominger	6c1d4a5cfd	Skip callback receiver bulk_create with 0 events	2023-02-04 12:10:39 -05:00
Alan Rominger	f5785976be	Update to comply with new black rules	2023-02-01 14:59:38 -05:00
Alan Rominger	8a4059d266	Workaround for events with NUL char, touch up error loop (#13398) * Workaround for events with NUL char, touch up error loop This fixes an error where some events would not save due to having the 0x00 character which errors in postgres this adds a line to replace it with empty text Hitting that kind of event put us in an infinite error loop so this change makes a number of changes to prevent similar loops the showcase example is a negative counter, this is not realistic in the real world but works for unit tests These error loop fixes seek to establish the cases where we clear the buffer Some logic is removed from the outer loop, with the idea that ensure_connection will better distinguish flake * From review comments, delay NUL char sanitization to later Use pop to make list operations more clear * Fix incorrect use of pop	2023-01-19 13:36:23 -05:00
Jeff Bradberry	721e19e1c8	Merge pull request #13181 from jbradberry/remove-qsstats Replace the querysets provided by django-qsstats-magic	2022-11-11 10:58:51 -05:00
Jeff Bradberry	e029cf7196	Remove the django-qsstats-magic dependency	2022-11-10 15:37:44 -05:00
Alan Rominger	1f939aa25e	Merge pull request #12884 from AlanCoding/is_testing [tech debt] Move the IS_TESTING method out of settings	2022-11-09 15:29:35 -05:00
Alan Rominger	192f45bbd0	Make canceling view non-atomic to fix 500 errors with job bursts (#13072) * Make canceling view non-atomic to fix 500 errors with job bursts * Update test calls for cancel method changes	2022-10-20 15:02:54 -04:00
Alan Rominger	cba780a8f8	Fix dispatcher connection deadlock w scheduler and cleanup	2022-10-19 12:12:15 -04:00
Alan Rominger	a64467c5a6	Shortcut Instance.objects.me when possible	2022-10-05 09:11:42 -04:00
Alan Rominger	cfce31419d	Move the IS_TESTING method out of settings	2022-09-28 11:19:10 -04:00
Alan Rominger	c59bbdecdb	Refactor canceling to work through messaging and signals, not database If canceled attempted before, still allow attempting another cancel in this case, attempt to send the sigterm signal again. Keep clicking, you might help! Replace other cancel_callbacks with sigterm watcher adapt special inventory mechanism for this too Get rid of the cancel_watcher method with exception in main thread Handle academic case of sigterm race condition Process cancelation as control signal Fully connect cancel method and run_dispatcher to control Never transition workflows directly to canceled, add logs	2022-09-01 15:20:31 -04:00
Alan Rominger	01037fa561	Fix sanity check to use the relevant active connection	2022-08-29 16:33:07 -04:00
Alan Rominger	974f845059	Revert "Merge pull request #12584 from AlanCoding/lazy_workers" This reverts commit `64157f7207`, reversing changes made to `9e8ba6ca09`.	2022-08-28 23:04:13 -04:00
Alan Rominger	e87fabe6bb	Submit job to dispatcher as part of transaction (#12573) Make it so that submitting a task to the dispatcher happens as part of the transaction. this applies to dispatcher task "publishers" which NOTIFY the pg_notify queue if the transaction is not successful, it will not be sent, as per postgres docs This keeps current behavior for pg_notify listeners practically, this only applies for the awx-manage run_dispatcher service this requires creating a separate connection and keeping it long-lived arbitrary code will occasionally close the main connection, which would stop listening Stop sending the waiting status websocket message this is required because the ordering cannot be maintained with other changes here the instance group data is moved to the running websocket message payload Move call to create_partition from task manager to pre_run_hook mock this in relevant unit tests	2022-08-18 09:43:53 -04:00
Alan Rominger	e0c59d12c1	Change data structure so we can conditionally reap waiting jobs	2022-08-17 16:00:30 -04:00
Alan Rominger	6719010050	Add back in cleanup call	2022-08-17 15:42:48 -04:00
Alan Rominger	ccd46a1c0f	Move reaper logic into worker, avoiding bottlenecks	2022-08-17 15:42:47 -04:00
Alan Rominger	621833ef0e	Add extra workers if computing based on memory Co-authored-by: Elijah DeLee <kdelee@redhat.com>	2022-08-17 11:41:59 -04:00
Shane McDonald	16be38bb54	Allow for passing custom job_explanation to reaper methods Co-authored-by: Alan Rominger <arominge@redhat.com>	2022-08-17 11:41:49 -04:00
Shane McDonald	3c51cb130f	Add grace period settings for task manager timeout, and pod / job waiting reapers Co-authored-by: Alan Rominger <arominge@redhat.com>	2022-08-17 11:39:01 -04:00
Shane McDonald	c649809eb2	Remove debug method that calls cleanup - It's unclear why this was here. - Removing it doesn't appear to cause any problems. - It still gets called during heartbeats.	2022-08-17 11:35:43 -04:00
Alan Rominger	a3fef27002	Add logs to debug waiting bottlenecking	2022-08-17 11:33:49 -04:00
Alan Rominger	278db2cdde	Split reaper for running and waiting jobs Avoid running jobs that have already been reaped Co-authored-by: Elijah DeLee <kdelee@redhat.com> Remove unnecessary extra actions Fix waiting jobs in other cases of reaping	2022-08-17 10:53:29 -04:00
Alan Rominger	64157f7207	Merge pull request #12584 from AlanCoding/lazy_workers Wait 60 seconds before scaling down a worker	2022-08-17 10:18:19 -04:00
Alan Rominger	9e8ba6ca09	Merge pull request #12494 from AlanCoding/revival Register system again if deleted by another pod	2022-08-17 10:12:39 -04:00
Alan Rominger	998000bfbe	Surface correct error from bulk_create on unrecoverable error	2022-08-10 16:16:57 -04:00
Alan Rominger	43a50cc62c	Fix event counting in error handling path	2022-08-10 16:16:57 -04:00
Alan Rominger	30f556f845	Further resiliency changes focused on offline database Make logs from database outage more manageable Raise exception if update_model never recovers from problem	2022-08-10 16:16:57 -04:00
Alan Rominger	c5985c4c81	Change lazy worker method name and adjust log	2022-08-10 16:12:03 -04:00
Alan Rominger	a9170236e1	Wait 60 seconds before scaling down a worker	2022-08-10 16:12:03 -04:00
Alan Rominger	f7e6a32444	Optimize task manager with debug toolbar, adjust prefetch (#12588)	2022-08-10 10:05:13 -04:00
Alan Rominger	585d3f4e2a	Register system again if deleted by another pod Avoid cases where missing instance would throw error on startup this gives time for heartbeat to register it	2022-08-08 22:36:17 -04:00
Seth Foster	b3eb9e0193	pid kill each of the 3 task managers on timeout	2022-08-05 14:33:25 -04:00
Elijah DeLee	7eb0c7dd28	exit task manager loops early if we are timed out add settings to define task manager timeout and grace period This gives us still TASK_MANAGER_TIMEOUT_GRACE_PERIOD amount of time to get out of the task manager. Also, apply start task limit in WorkflowManager to starting pending workflows	2022-08-05 14:33:24 -04:00
Alan Rominger	fd671ecc9d	Give specific messages if job was killed due to SIGTERM or SIGKILL (#12435) * Reap jobs on dispatcher startup to increase clarity, replace existing reaping logic * Exit jobs if receiving SIGTERM signal * Fix unwanted reaping on shutdown, let subprocess close out * Add some sanity tests for signal module * Add a log for an unhandled dispatcher error * Refine wording of error messages Co-authored-by: Elijah DeLee <kdelee@redhat.com>	2022-06-30 13:20:08 -04:00
Rebeccah	5f9326b131	added average event processing metric (in seconds) that can be served to grafana via prometheus. This metric is a good indicator of how far behind the callback receiver is. The higher the load the further behind/the greater the number of seconds the metric will display. This number being high may indicate the need for horizontal scaling in the control plane or vertically scaling the number of callback receivers.	2022-06-06 15:14:56 -04:00
Alan Rominger	1e6ca01686	Fix the callback receiver --status command	2022-05-19 15:00:49 -04:00
Alan Rominger	29d60844a8	Fix notification timing issue by sending in the latter of 2 events (#12110) * Track host_status_counts and use that to process notifications * Remove now unused setting * Back out changes to callback class not needed after all * Skirt the need for duck typing by leaning on the cached field * Delete tests for deleted task * Revert "Back out changes to callback class not needed after all" This reverts commit 3b8ae350d218991d42bffd65ce4baac6f41926b2. * Directly hardcode stats_event_type for callback class * Fire notifications if stats event was never sent * Remove test content for deleted methods * Add placeholder for when no hosts matched * Make field default be None, denote events processed with empty dict * Make UI process null value for host_status_counts * Fix tracking of EOF dispatch for system jobs * Reorganize EVENT_MAP into class properties * Consolidate conditional I missed from EVENT_MAP refactor * Give up on the null condition, also applies for empty hosts * Remove cls position argument not being used * Move wrapup method out of class, add tests	2022-04-29 13:54:31 -04:00
Alan Rominger	73e02e745a	Patches to make jobs robust to database restarts (#11905) * Simple patches to make jobs robust to database restarts * Add some wait time before retrying loop due to DB error * Apply dispatcher downtime setting to job updates, fix dispatcher bug This resolves a bug where the pg_is_down property never had the right value the loop is normally stuck in the conn.events() iterator so it never recognized successful database interactions this led to serial database outages terminating jobs New setting for allowable PG downtime is shared with task code any calls to update_model will use _max_attempts parameter to make it align with the patience time that the dispatcher respects when consuming new events * To avoid restart loops, handle DB errors on startup with prejudice * If reconnect consistently fails, exit with non-zero code	2022-03-30 09:14:20 -04:00
Alan Rominger	fe5736dc7f	Specifically abort the reaper if instance not registered	2022-03-29 14:08:58 -04:00


131 Commits