External-Mirrors/awx

mirror of https://github.com/ansible/awx.git synced 2026-06-22 23:27:46 -02:30

Author	SHA1	Message	Date
Seth Foster	5cc467d4cf	[AAP-74497] Reset orphaned waiting jobs when controller node is deprovisioned (#16467 ) Reset orphaned waiting jobs when controller node is deprovisioned Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-06-02 10:46:52 -04:00
Alan Rominger	7155400efc	AAP-12516 [option 2] Handle nested workflow artifacts via root node `ancestor_artifacts` (#16381) * Add new test for artifact precedence upstream node vs outer workflow * Fix bugs, upstream artifacts come first for precedence * Track nested artifacts path through ancestor_artifacts on root nodes * Fix case where first root node did not get the vars * touchup comment * Prevent conflict with sliced jobs hack	2026-04-02 15:18:11 -04:00
Alan Rominger	0aaca1bffd	Fix job cancel chain bugs (#16325 ) * Fix job cancel chain bugs * Early relief valve for canceled jobs, ATF related changes * Add test and fix for approval nodes as well * Revert unwanted change * Refactor workflow approval nodes to make it more clean * Revert data structure changes * Delete local utility file * Review comment addressing * Use canceled status in websocket * Delete slop * Add agent marker * Bugbot comment about status websocket mismatch	2026-03-18 12:08:27 -04:00
Jake Jackson	36a00ec46b	AAP-58539 Move to dispatcherd (#16209 ) * WIP First pass * started removing feature flags and adjusting logic * Add decorator * moved to dispatcher decorator * updated as many as I could find * Keep callback receiver working * remove any code that is not used by the call back receiver * add back auto_max_workers * added back get_auto_max_workers into common utils * Remove control and hazmat (squash this not done) * moved status out and deleted control as no longer needed * removed unused imports * adjusted test import to pull correct method * fixed imports and addressed clusternode heartbeat test * Update function comments * Add back hazmat for config and remove baseworker * added back hazmat per @alancoding feedback around config * removed baseworker completely and refactored it into the callback worker * Fix dispatcher run call and remove dispatch setting * remove dispatcher mock publish setting * Adjust heartbeat arg and more formatting * fixed the call to cluster_node_heartbeat missing binder * Fix attribute error in server logs	2026-01-23 20:49:32 +00:00
Alan Rominger	dce5ac73c5	Apply new rules from black update (#16232 )	2026-01-19 12:58:07 -05:00
Alan Rominger	94764a1f17	AAP-42649 Flag-gated use of "dispatcherd" as its own library (#15981 ) Use dynamic AWX max_workers value Make basic --status and --running commands work Make feature flag enabled true by default for development * [dispatcherd] Dispatcher socket-based `--status` demo working (#15908) * Fix Task Decorator to Work With and Without Feature Flag (AAP-41775) (#15911) * refactor(system): extract common heartbeat helpers and split cluster_node_heartbeat Extract common heartbeat logic into helper functions: _heartbeat_instance_management: consolidates instance management, health checks, and lost-instance detection. _heartbeat_check_versions: compares instance versions and initiates shutdown when necessary. _heartbeat_handle_lost_instances: reaps jobs and marks lost instances offline. Refactor the original cluster_node_heartbeat to use these helpers and retain legacy behavior (using bind_kwargs). Introduce adispatch_cluster_node_heartbeat for dispatcherd: uses the control API to retrieve running tasks and reaps them. Link the two implementations by attaching adispatch_cluster_node_heartbeat as the _new_method on cluster_node_heartbeat. * feat(publish): delegate heartbeat task submission to new dispatcherd implementation Update apply_async to check at runtime if FEATURE_NEW_DISPATCHER is enabled. When the task is cluster_node_heartbeat and a _new_method is attached, delegate the task submission to the new dispatcherd implementation. Preserve the original behavior for all other tasks and fallback on error. * refactor(system): extract task ID retrieval from dispatcherd into helper function Improves readability of adispatch_cluster_node_heartbeat by extracting the complex UUID parsing logic into a dedicated helper function. Adds clearer error handling and follows established code patterns. 
* fix(dispatcher): Enable task decorator to work with and without feature flag Implemented a new approach for handling task execution with feature flags by attaching alternative implementations to apply_async._new_method. This allows cluster_node_heartbeat to work correctly with both the legacy and new dispatcher systems without modifying core decorator logic. AAP-41775 * fix(dispatcher): Improve error handling and logging in feature flag implementation - Add error handling when attaching alternative dispatcher implementation - Fix method self-reference in apply_async to properly use cls.apply_async - Document limitations of this targeted approach for specific tasks - Add logging for better debugging of dispatcher selection - Ensure decorator timing by keeping method attachment after function definitions This completes the robust implementation for switching between dispatcher implementations based on feature flags. AAP-41775 * fix(dispatcher): Implement registry pattern for dispatcher feature flag compatibility Replaces direct method attribute assignment with a global registry for alternative implementations. The original approach tried to attach new methods directly to apply_async bound methods, which fails because bound methods don't support attribute assignment in Python. The registry pattern: - Creates a global ALTERNATIVE_TASK_IMPLEMENTATIONS dict in publish.py - Registers alternative implementations by task name - Modifies apply_async to check the registry when feature flag is enabled - Adds extensive logging throughout the process for debugging This enables cluster_node_heartbeat to work correctly with both the legacy and new dispatcher implementations based on the FEATURE_NEW_DISPATCHER flag. AAP-41775 * refactor(dispatcher): Remove excessive logging from dispatcher implementation Reduces verbose debugging logs while maintaining essential logging for critical operations. 
Preserves: - Task implementation selection based on feature flag - Registration success/failure messages - Critical error reporting Removed: - Registry content debugging messages - Repetitive task diagnostics - Non-essential information logging AAP-41775 * fix(dispatcher): Fix shallow copy in dispatcher schedule conversion This resolves "AttributeError: 'float' object has no attribute 'total_seconds'" errors when the dispatcher is restarted. Refs: AAP-41775 * Use IPC mechanism to get running tasks (#15926) * Allow tasks from tasks * Fix failure to limit to waiting jobs * Get job record with lock * Fix failures in dispatcherd feature branch (#15930) * Fully handle DispatcherCancel * Complete rest of preload import work * Complete dispatcherd integration & job cancellation (AAP-43033) (#15941) * feat(dispatcher): Implement job cancellation for new dispatcher Adds feature-flag-aware job cancellation that routes cancel requests to either the legacy dispatcher or the new dispatcherd library based on the FEATURE_NEW_DISPATCHER flag. - Updates cancel_dispatcher_process() to use dispatcherd's control API when enabled - Handles both direct cancellation and task manager workflow cancellation cases - Works with DispatcherCancel exception handling to properly handle SIGUSR1 signals AAP-43033 * fix(dispatcher): Update run_dispatcher.py to properly handle task cancellation Modifies the cancel command in run_dispatcher.py to properly cancel tasks when the FEATURE_NEW_DISPATCHER flag is enabled, rather than just listing running tasks. The implementation translates each task UUID to the appropriate filter format expected by the dispatcherd control API, maintaining the same behavior as the original implementation. Part of: AAP-43033 * refactor(system): Refactor dispatch_startup() to extract common startup logic and branch based on feature flag This commit refactors the dispatch_startup() function to improve clarity and consistency across the legacy and new dispatcherd flows. 
No dispatcher-specific functionality is needed beyond the changes made, so this refactoring improves robustness without altering core behavior. * refactor(system): Refactor inform_cluster_of_shutdown() for clarity * refactor(tasks): Replace @task with @task_awx across 22 tasks for dispatcher compatibility - Migrated all task decorators to use @task_awx, ensuring dispatcher-aware behavior. - Tested each task with the new dispatcherd, verifying that tasks using the registry pattern execute correctly without needing binder‐based alternative implementations. - Removed redundant logging and outdated comments. - Legacy tasks that do not require special parameter extraction continue to use their original logic. - This commit reflects our complete journey of testing and verifying dispatcherd compatibility across all 22 tasks. * refactor(publish): fix linter * Fix bug from the branch rebase * AAP-43763 Add tests for connection management in dispatcherd workers (#15949) * Add test for job cancel in live tests * Fix bug from the branch rebase * Add test for connection recovery after connection broke * Add test for breaking connection * Fix dispatcherd bugs: schedule aliases, job kwargs handling, cancel handling (#15960) * Put in job kwargs handling, not done before * AAP-44382 [dispatcherd] Fixes for running with feature flag off (#15973) * Use correct decorator for test of tasks * Finalize dispatcherd feature branch (#15975) * Work dispatcherd into dependency management system * Use util methods from DAB * Rename the dispatcherd feature flag, and flip default to not-enabled * Move to new submit_task method * Update the location of the sock file * AAP-44381 Make dispatcherd config loading more lazy (#15979) * Make dispatcherd config loading more lazy * Make submission error more obvious * Fix signal handling gap, hijack SIGUSR1 from dispatcherd (#15983) * Fix signal handling gap, hijack SIGUSR1 from dispatcherd * Minor adjustments to dispatcherd status command * [dispatcherd] 
Get rid of alternative task registry (#15984) Get rid of alternative task registry * Fix deadlock error and other cleanup errors (#15987) * Move to proper error handling location --------- Co-authored-by: artem_tiupin <70763601+art-tapin@users.noreply.github.com>	2025-05-16 09:39:22 -04:00
Alan Rominger	c3ee0c2d8a	Sensible log behavior when redis is unavailable (#15466 ) * Sensible log behavior when redis is unavailable * Consistent behavior with dispatcher and callback	2025-04-10 13:45:05 -07:00
Alan Rominger	f57a9863d6	Use advisory_lock from DAB (#15676 ) * Use advisory_lock from DAB * Remove the django-pglocks dep * Re-run updater script * Move the import in new location	2025-01-15 14:06:59 -05:00
Alan Rominger	f4cbb9f9a8	Fix bug where unrelated jobs were linked as dependencies (#15610 )	2024-11-06 14:43:36 -05:00
Hao Liu	6f2307f50e	Add TASK_MANAGER_LOCK_TIMEOUT (#15300) * Add TASK_MANAGER_LOCK_TIMEOUT `TASK_MANAGER_LOCK_TIMEOUT` controls the `idle_in_transaction_session_timeout` and `idle_session_timeout` configuration for task manager connections and the lock in the database, in the hope of preventing the situation where the task instance that holds the lock becomes unresponsive and prevents other instances from running the task manager * Add session timeout to periodic scheduler and all sub task manager locks	2024-06-27 09:42:41 -04:00
Chris Meyers	8a902debd5	Per-service metrics http server * Organize metrics into their respective service * Server per-service metrics on a per-service http server * Increase prometheus client usage over our custom metrics fields	2024-02-05 15:17:24 -05:00
jessicamack	209747d88e	Update for django-ansible-base split (#14783 ) * update paths and names * temp to get tests passing * fix typo	2024-01-19 12:30:32 -05:00
John Westcott IV	aacf9653c5	Use filtering/sorting from django-ansible-base (#14726 ) * Move filtering to DAB * add comment to trigger building a new image Signed-off-by: jessicamack <jmack@redhat.com> * remove unneeded comment Signed-off-by: jessicamack <jmack@redhat.com> * remove unused imports Signed-off-by: jessicamack <jmack@redhat.com> * change mock import Signed-off-by: jessicamack <jmack@redhat.com> --------- Signed-off-by: jessicamack <jmack@redhat.com> Co-authored-by: jessicamack <jmack@redhat.com>	2023-12-18 10:05:02 -05:00
Alan Rominger	333ef76cbd	Send notifications for dependency failures (#14603 ) * Send notifications for dependency failures * Delete tests for deleted method * Remove another test for removed method	2023-10-30 10:42:37 -04:00
Hao Liu	bb3acbb8ad	Debug log for scheduler commit duration (#14035 ) Co-authored-by: Alan Rominger <arominge@redhat.com>	2023-09-27 09:46:55 -04:00
Alan Rominger	ab5cc2e69c	Simplifications for DependencyManager (#13533 )	2023-07-27 15:42:29 -04:00
Alan Rominger	98bfe3f43f	Add missing trigger for failed-to-start nodes (#13802 )	2023-07-24 12:17:46 -04:00
Rick Elrod	48edb15a03	Prevent Dispatcher deadlock when Redis disappears (#14249 ) This fixes https://github.com/ansible/awx/issues/14245 which has more information about this issue. This change addresses both: - A clashing signal handler (registering a callback to fire when the task manager times out, and hitting that callback in cases where we didn't expect to). Make dispatcher timeout use SIGUSR1, not SIGTERM. - Metrics not being reported should not make us crash, so that is now fixed as well. Signed-off-by: Rick Elrod <rick@elrod.me> Co-authored-by: Alan Rominger <arominge@redhat.com>	2023-07-18 10:43:46 -05:00
Alan Rominger	94b34b801c	Avoid unbounded kwargs by fetching subtasks inside handle_work_error Update tests to new handle_work_error call pattern Handle blame correctly with multiple serial deps add new test case corresponding to this scenario	2022-12-19 16:02:51 -05:00
Elijah DeLee	86856f242a	Add max concurrent jobs and max forks per ig The intention of this feature is primarily to provide some notion of max capacity of container groups, but the logic I've left generic. Default is 0, which will be interpreted as no maximum number of jobs or forks. Includes refactor of variable and method names for clarity. instances_by_hostname is an internal attribute of TaskManagerInstances. Clarify when we are expecting the actual TaskManagerInstances object. Unify how we process running tasks and consume capacity. This has the effect that we do less expensive work in after_lock_init and have 1 less loop over all the running tasks. Previously we looped for both building the dependency graph as well as for calculating the starting capacity of all the instances and instance groups. Now we achieve both tasks in the same loop. Because of how this changes the somewhat subtle "do-si-do" of how to initialize the Task Manager models, introduce a wrapper class that tries to take some of that burden off of other areas where we re-use this like in the serializer and the metrics. Also use this wrapper class to handle niceties of how to track capacity consumption on instances and instance groups. Add tests for max_forks and max_concurrent_jobs Fixup tests that use TaskManagerModels to accommodate changes. assign ig before call to consume capacity if we don't do it in that order, then we don't correctly account for the container group jobs we are starting in the middle of the task manager run	2022-11-30 17:14:33 -05:00
Alan Rominger	cfce31419d	Move the IS_TESTING method out of settings	2022-09-28 11:19:10 -04:00
Alan Rominger	5648d9d96f	Avoid cache warning for dispatching control type tasks	2022-09-27 15:18:13 -04:00
Shane McDonald	9b034ad574	generate control node receptor.conf when a new remote execution/hop node is added regenerate the receptor.conf for all control node to peer out to the new remote execution node Signed-off-by: Hao Liu <haoli@redhat.com> Co-Authored-By: Seth Foster <fosterseth@users.noreply.github.com> Co-Authored-By: Shane McDonald <me@shanemcd.com>	2022-09-23 09:46:12 -04:00
Alan Rominger	e87fabe6bb	Submit job to dispatcher as part of transaction (#12573 ) Make it so that submitting a task to the dispatcher happens as part of the transaction. this applies to dispatcher task "publishers" which NOTIFY the pg_notify queue if the transaction is not successful, it will not be sent, as per postgres docs This keeps current behavior for pg_notify listeners practically, this only applies for the awx-manage run_dispatcher service this requires creating a separate connection and keeping it long-lived arbitrary code will occasionally close the main connection, which would stop listening Stop sending the waiting status websocket message this is required because the ordering cannot be maintained with other changes here the instance group data is moved to the running websocket message payload Move call to create_partition from task manager to pre_run_hook mock this in relevant unit tests	2022-08-18 09:43:53 -04:00
Seth Foster	55d295c2a6	Add metric to measure task manager transaction, including on_commit calls	2022-08-15 12:44:29 -04:00
Alan Rominger	f7e6a32444	Optimize task manager with debug toolbar, adjust prefetch (#12588 )	2022-08-10 10:05:13 -04:00
Seth Foster	e6f8852b05	Cache task_impact task_impact is now a field on the database It is calculated and set during create_unified_job set task_impact on .save for adhoc commands	2022-08-05 14:33:47 -04:00
Seth Foster	957b2b7188	Cache preferred instance groups When creating unified job, stash the list of pk values from the instance groups returned from preferred_instance_groups so that the task management system does not need to call out to this method repeatedly. .preferred_instance_groups_cache is the new field	2022-08-05 14:33:28 -04:00
Alan Rominger	b94b3a1e91	[task_manager_refactor] Move approval node expiration logic into queryset (#12502) Instead of loading all pending Workflow Approvals in the task manager, run a query that will only return the expired approvals directly expire all which are returned by that query Cache expires time as a new field in order to simplify WorkflowApproval filter	2022-08-05 14:33:27 -04:00
Elijah DeLee	7776a81e22	add job to dependency graph in start task We always add the job to the graph right before calling start task. Reduce complexity of proper operation by just doing this in start_task, because if you call start_task, you need to add it to the dependency graph	2022-08-05 14:33:26 -04:00
Elijah DeLee	bf89093fac	unify call pattern for get_tasks	2022-08-05 14:33:26 -04:00
Elijah DeLee	76d76d13b0	Start pending workflows in TaskManager we had tried doing this in the WorkflowManager, but we decided that we want to handle ALL pending jobs and "soft blockers" to jobs with the TaskManager/DependencyGraph and not duplicate that logic in the WorkflowManager.	2022-08-05 14:33:26 -04:00
Seth Foster	0a47d05d26	split schedule_task_manager into 3 each call to schedule_task_manager becomes one of ScheduleTaskManager ScheduleDependencyManager ScheduleWorkflowManager	2022-08-05 14:33:25 -04:00
Seth Foster	b3eb9e0193	pid kill each of the 3 task managers on timeout	2022-08-05 14:33:25 -04:00
Elijah DeLee	b26d2ab0e9	fix looking at wrong id for wf allow_simultaneous	2022-08-05 14:33:25 -04:00
Elijah DeLee	7eb0c7dd28	exit task manager loops early if we are timed out add settings to define task manager timeout and grace period This gives us still TASK_MANAGER_TIMEOUT_GRACE_PERIOD amount of time to get out of the task manager. Also, apply start task limit in WorkflowManager to starting pending workflows	2022-08-05 14:33:24 -04:00
Elijah DeLee	236c1df676	fix lint errors	2022-08-05 14:33:24 -04:00
Seth Foster	ff118f2177	Manage pending workflow jobs in Workflow Manager get_tasks uses UnifiedJob Additionally, make local overrides run after development settings	2022-08-05 14:31:48 -04:00
Elijah DeLee	29d91da1d2	we can do all the work in one loop more than saving the loop, we save building the WorkflowDag twice which makes LOTS of queries!!! Also, do a bulk update on the WorkflowJobNodes instead of saving in a loop :fear:	2022-08-05 14:31:48 -04:00
Seth Foster	431b9370df	Split TaskManager into - DependencyManager spawns dependencies if necessary - WorkflowManager processes running workflows to see if a new job is ready to spawn - TaskManager starts tasks if unblocked and has execution capacity	2022-08-05 14:29:02 -04:00
Alan Rominger	783b744bdb	Pass combined artifacts from nested workflows into downstream nodes (#12223 ) * Track combined artifacts on workflow jobs * Avoid schema change for passing nested workflow artifacts * Basic support for nested workflow artifacts, add test * Forgot that only does not work with polymorphic * Remove incorrect field * Consolidate logic and prevent recursion with UJ artifacts method * Stop trying to do precedence by status, filter for obvious ones * Review comments about sets * Fix up bug with convergence node paths and artifacts	2022-06-23 16:54:53 -03:00
Seth Foster	2f82b75748	Add subsystem metrics for task manager	2022-06-14 11:00:11 -04:00
Seth Foster	eba4a3f1c2	in case we fail a job in task manager, we need to add the project update to the inventoryupdate.source_project field	2022-05-12 15:21:17 -04:00
Seth Foster	0ae9fe3624	if dependency fails, fail job in task manager	2022-05-12 14:00:13 -04:00
Seth Foster	1b662fcca5	SCM inv source trigger project update - scm based inventory sources should launch project updates prior to running inventory updates for that source. - fixes scenario where a job is based on projectA, but the inventory source is based on projectB. Running the job will likely trigger a sync for projectA, but not projectB. comments	2022-05-12 14:00:12 -04:00
Alan Rominger	cb63d92bbf	Remove committed_capacity field, delete supporting code (#12086 ) * Remove committed_capacity field, delete supporting code * Track consumed capacity to solve the negatives problem * Use more verbose name for IG queryset	2022-04-22 13:41:32 -04:00
Elijah DeLee	689a216726	move static methods used by task manager (#12050) * move static methods used by task manager These static methods were being used to act on Instance-like objects that were SimpleNamespace objects with the necessary attributes. This change introduces dedicated classes to replace the SimpleNamespace objects and moves the former static methods to a place where they are more relevant instead of tacked onto models to which they were only loosely related. Accept in-memory data structure in init methods for tests * initialize remaining capacity AFTER we built map of instances	2022-04-21 13:05:06 -04:00
Elijah DeLee	e24fc43a45	Revert "Only fetch fields we need in task manager" This reverts commit `868e811b3f`. Turns out this does not play well with polymorphic models. Will try again with .defer()	2022-04-14 11:55:33 -04:00
Elijah DeLee	868e811b3f	Only fetch fields we need in task manager By using .only we select fewer columns, avoiding potentially large fields that we never reference. Also, small tweak to eliminate what was a duplicate dictionary of hostname:instance, because we don't need build and carry two copies of the same data.	2022-04-13 17:24:33 -04:00
Elijah DeLee	2e9974133a	calculate remaining capacity in static method this is to avoid additional queries when we already have all the active jobs fetched in the task manager	2022-04-13 11:56:07 -04:00


169 Commits