Commit Graph

169 Commits

Author SHA1 Message Date
Seth Foster
5cc467d4cf [AAP-74497] Reset orphaned waiting jobs when controller node is deprovisioned (#16467)
Reset orphaned waiting jobs when controller node is deprovisioned

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-06-02 10:46:52 -04:00
Alan Rominger
7155400efc AAP-12516 [option 2] Handle nested workflow artifacts via root node ancestor_artifacts (#16381)
* Add new test for artfact precedence upstream node vs outer workflow

* Fix bugs, upstream artifacts come first for precedence

* Track nested artifacts path through ancestor_artifacts on root nodes

* Fix case where first root node did not get the vars

* touchup comment

* Prevent conflict with sliced jobs hack
2026-04-02 15:18:11 -04:00
Alan Rominger
0aaca1bffd Fix job cancel chain bugs (#16325)
* Fix job cancel chain bugs

* Early relief valve for canceled jobs, ATF related changes

* Add test and fix for approval nodes as well

* Revert unwanted change

* Refactor workflow approval nodes to make it more clean

* Revert data structure changes

* Delete local utility file

* Review comment addressing

* Use canceled status in websocket

* Delete slop

* Add agent marker

* Bugbot comment about status websocket mismatch
2026-03-18 12:08:27 -04:00
Jake Jackson
36a00ec46b AAP-58539 Move to dispatcherd (#16209)
* WIP First pass
* started removing feature flags and adjusting logic
* Add decorator
* moved to dispatcher decorator
* updated as many as I could find
* Keep callback receiver working
* remove any code that is not used by the call back receiver
* add back auto_max_workers
* added back get_auto_max_workers into common utils
* Remove control and hazmat (squash this not done)
* moved status out and deleted control as no longer needed
* removed unused imports
* adjusted test import to pull correct method
* fixed imports and addressed clusternode heartbeat test
* Update function comments
* Add back hazmat for config and remove baseworker
* added back hazmat per @alancoding feedback around config
* removed baseworker completely and refactored it into the callback
  worker
* Fix dispatcher run call and remove dispatch setting
* remove dispatcher mock publish setting
* Adjust heartbeat arg and more formatting
* fixed the call to cluster_node_heartbeat missing binder
* Fix attribute error in server logs
2026-01-23 20:49:32 +00:00
Alan Rominger
dce5ac73c5 Apply new rules from black update (#16232) 2026-01-19 12:58:07 -05:00
Alan Rominger
94764a1f17 AAP-42649 Flag-gated use of "dispatcherd" as its own library (#15981)
Use dynamic AWX max_workers value

Make basic --status and --running commands work

Make feature flag enabled true by default for development

* [dispatcherd] Dispatcher socket-based `--status` demo working (#15908)

* Fix Task Decorator to Work With and Without Feature Flag (AAP-41775) (#15911)

* refactor(system): extract common heartbeat helpers and split cluster_node_heartbeat

Extract common heartbeat logic into helper functions:  _heartbeat_instance_management: consolidates instance management, health checks, and lost-instance detection.  _heartbeat_check_versions: compares instance versions and initiates shutdown when necessary.  _heartbeat_handle_lost_instances: reaps jobs and marks lost instances offline.

Refactor the original cluster_node_heartbeat to use these helpers and retain legacy behavior (using bind_kwargs).

Introduce adispatch_cluster_node_heartbeat for dispatcherd: uses the control API to retrieve running tasks and reaps them.

Link the two implementations by attaching adispatch_cluster_node_heartbeat as the _new_method on cluster_node_heartbeat.

* feat(publish): delegate heartbeat task submission to new dispatcherd implementation

Update apply_async to check at runtime if FEATURE_NEW_DISPATCHER is enabled.

When the task is cluster_node_heartbeat and a _new_method is attached, delegate the task submission to the new dispatcherd implementation.

Preserve the original behavior for all other tasks and fallback on error.

* refactor(system): extract task ID retrieval from dispatcherd into helper function

Improves readability of adispatch_cluster_node_heartbeat by extracting
the complex UUID parsing logic into a dedicated helper function.
Adds clearer error handling and follows established code patterns.

* fix(dispatcher): Enable task decorator to work with and without feature flag

Implemented a new approach for handling task execution with feature flags
by attaching alternative implementations to apply_async._new_method. This
allows cluster_node_heartbeat to work correctly with both the legacy and
new dispatcher systems without modifying core decorator logic.

AAP-41775

* fix(dispatcher): Improve error handling and logging in feature flag implementation

- Add error handling when attaching alternative dispatcher implementation
- Fix method self-reference in apply_async to properly use cls.apply_async
- Document limitations of this targeted approach for specific tasks
- Add logging for better debugging of dispatcher selection
- Ensure decorator timing by keeping method attachment after function definitions

This completes the robust implementation for switching between dispatcher
implementations based on feature flags.

AAP-41775

* fix(dispatcher): Implement registry pattern for dispatcher feature flag compatibility

Replaces direct method attribute assignment with a global registry for
alternative implementations. The original approach tried to attach new
methods directly to apply_async bound methods, which fails because bound
methods don't support attribute assignment in Python.

The registry pattern:
- Creates a global ALTERNATIVE_TASK_IMPLEMENTATIONS dict in publish.py
- Registers alternative implementations by task name
- Modifies apply_async to check the registry when feature flag is enabled
- Adds extensive logging throughout the process for debugging

This enables cluster_node_heartbeat to work correctly with both the legacy
and new dispatcher implementations based on the FEATURE_NEW_DISPATCHER flag.

AAP-41775

* refactor(dispatcher): Remove excessive logging from dispatcher implementation

Reduces verbose debugging logs while maintaining essential logging for critical
operations. Preserves:
- Task implementation selection based on feature flag
- Registration success/failure messages
- Critical error reporting

Removed:
- Registry content debugging messages
- Repetitive task diagnostics
- Non-essential information logging

AAP-41775

* fix(dispatcher): Fix shallow copy in dispatcher schedule conversion

This resolves "AttributeError: 'float' object has no attribute 'total_seconds'"
errors when the dispatcher is restarted.

Refs: AAP-41775

* Use IPC mechanism to get running tasks (#15926)
* Allow tasks from tasks
* Fix failure to limit to waiting jobs
* Get job record with lock
* Fix failures in dispatcherd feature branch (#15930)
* Fully handle DispatcherCancel
* Complete rest of preload import work
* Complete dispatcherd integration & job cancellation (AAP-43033) (#15941)
* feat(dispatcher): Implement job cancellation for new dispatcher

Adds feature-flag-aware job cancellation that routes cancel requests to either
the legacy dispatcher or the new dispatcherd library based on the
FEATURE_NEW_DISPATCHER flag.

- Updates cancel_dispatcher_process() to use dispatcherd's control API when enabled
- Handles both direct cancellation and task manager workflow cancellation cases
- Works with DispatcherCancel exception handling to properly handle SIGUSR1 signals

AAP-43033

* fix(dispatcher): Update run_dispatcher.py to properly handle task cancellation

Modifies the cancel command in run_dispatcher.py to properly cancel tasks
when the FEATURE_NEW_DISPATCHER flag is enabled, rather than just listing
running tasks.

The implementation translates each task UUID to the appropriate
filter format expected by the dispatcherd control API, maintaining the same
behavior as the original implementation.

Part of: AAP-43033

* refactor(system): Refactor dispatch_startup() to extract common startup logic and branch based on feature flag

This commit refactors the dispatch_startup() function to improve clarity and consistency across the legacy
and new dispatcherd flows.

No dispatcher-specific functionality is needed beyond the changes made, so this refactoring improves robustness without
altering core behavior.

* refactor(system): Refactor inform_cluster_of_shutdown() for clarity

* refactor(tasks): Replace @task with @task_awx across 22 tasks for dispatcher compatibility

- Migrated all task decorators to use @task_awx, ensuring dispatcher-aware behavior.
- Tested each task with the new dispatcherd, verifying that tasks using the registry pattern execute correctly without needing binder‐based alternative implementations.
- Removed redundant logging and outdated comments.
- Legacy tasks that do not require special parameter extraction continue to use their original logic.
- This commit reflects our complete journey of testing and verifying dispatcherd compatibility across all 22 tasks.

* refactor(publish): fix linter

* Fix bug from the branch rebase

* AAP-43763 Add tests for connection management in dispatcherd workers (#15949)

* Add test for job cancel in live tests
* Fix bug from the branch rebase
* Add test for connection recovery after connection broke
* Add test for breaking connection

* Fix dispatcherd bugs: schedule aliases, job kwargs handling, cancel handling (#15960)

* Put in job kwargs handling, not done before

* AAP-44382 [dispatcherd] Fixes for running with feature flag off (#15973)

* Use correct decorator for test of tasks

* Finalize dispatcherd feature branch (#15975)

* Work dispatcherd into dependency management system

* Use util methods from DAB

* Rename the dispatcherd feature flag, and flip default to not-enabled

* Move to new submit_task method

* Update the location of the sock file

* AAP-44381 Make dispatcherd config loading more lazy (#15979)

* Make dispatcherd config loading more lazy

* Make submission error more obvious

* Fix signal handling gap, hijack SIGUSR1 from dispatcherd (#15983)

* Fix signal handling gap, hijack SIGUSR1 from dispatcherd

* Minor adjustments to dispatcherd status command

* [dispatcherd] Get rid of alternative task registry (#15984)

Get rid of alternative task registry

* Fix deadlock error and other cleanup errors (#15987)
* Move to proper error handling location

---------

Co-authored-by: artem_tiupin <70763601+art-tapin@users.noreply.github.com>
2025-05-16 09:39:22 -04:00
Alan Rominger
c3ee0c2d8a Sensible log behavior when redis is unavailable (#15466)
* Sensible log behavior when redis is unavailable

* Consistent behavior with dispatcher and callback
2025-04-10 13:45:05 -07:00
Alan Rominger
f57a9863d6 Use advisory_lock from DAB (#15676)
* Use advisory_lock from DAB

* Remove the django-pglocks dep

* Re-run updater script

* Move the import in new location
2025-01-15 14:06:59 -05:00
Alan Rominger
f4cbb9f9a8 Fix bug where unrelated jobs were linked as dependencies (#15610) 2024-11-06 14:43:36 -05:00
Hao Liu
6f2307f50e Add TASK_MANAGER_LOCK_TIMEOUT (#15300)
* Add TASK_MANAGER_LOCK_TIMEOUT

`TASK_MANAGER_LOCK_TIMEOUT` controls the `idle_in_transaction_session_timeout` and `idle_session_timeout` configuration for task manager connections and lock in database

hope to prevent the situation that the task instance that holds the lock becomes unresponsive and preventing other instance to be able to run task manager

* Add session timeout to periodic scheduler and all sub task manager locks
2024-06-27 09:42:41 -04:00
Chris Meyers
8a902debd5 Per-service metrics http server
* Organize metrics into their respective service
* Server per-service metrics on a per-service http server
* Increase prometheus client usage over our custom metrics fields
2024-02-05 15:17:24 -05:00
jessicamack
209747d88e Update for django-ansible-base split (#14783)
* update paths and names

* temp to get tests passing

* fix typo
2024-01-19 12:30:32 -05:00
John Westcott IV
aacf9653c5 Use filtering/sorting from django-ansible-base (#14726)
* Move filtering to DAB

* add comment to trigger building a new image

Signed-off-by: jessicamack <jmack@redhat.com>

* remove unneeded comment

Signed-off-by: jessicamack <jmack@redhat.com>

* remove unused imports

Signed-off-by: jessicamack <jmack@redhat.com>

* change mock import

Signed-off-by: jessicamack <jmack@redhat.com>

---------

Signed-off-by: jessicamack <jmack@redhat.com>
Co-authored-by: jessicamack <jmack@redhat.com>
2023-12-18 10:05:02 -05:00
Alan Rominger
333ef76cbd Send notifications for dependency failures (#14603)
* Send notifications for dependency failures

* Delete tests for deleted method

* Remove another test for removed method
2023-10-30 10:42:37 -04:00
Hao Liu
bb3acbb8ad Debug log for scheduler commit duration (#14035)
Co-authored-by: Alan Rominger <arominge@redhat.com>
2023-09-27 09:46:55 -04:00
Alan Rominger
ab5cc2e69c Simplifications for DependencyManager (#13533) 2023-07-27 15:42:29 -04:00
Alan Rominger
98bfe3f43f Add missing trigger for failed-to-start nodes (#13802) 2023-07-24 12:17:46 -04:00
Rick Elrod
48edb15a03 Prevent Dispatcher deadlock when Redis disappears (#14249)
This fixes https://github.com/ansible/awx/issues/14245 which has
more information about this issue.

This change addresses both:
- A clashing signal handler (registering a callback to fire when
  the task manager times out, and hitting that callback in cases
  where we didn't expect to). Make dispatcher timeout use
  SIGUSR1, not SIGTERM.
- Metrics not being reported should not make us crash, so that is
  now fixed as well.

Signed-off-by: Rick Elrod <rick@elrod.me>
Co-authored-by: Alan Rominger <arominge@redhat.com>
2023-07-18 10:43:46 -05:00
Alan Rominger
94b34b801c Avoid unbounded kwargs by fetching subtasks inside handle_work_error
Update tests to new handle_work_error call pattern

Handle blame correctly with multiple serial deps
  add new test case corresponding to this scenario
2022-12-19 16:02:51 -05:00
Elijah DeLee
86856f242a Add max concurrent jobs and max forks per ig
The intention of this feature is primarily to provide some notion of max
capacity of container groups, but the logic I've left generic. Default
is 0, which will be interpereted as no maximum number of jobs or forks.

Includes refactor of variable and method names for clarity.
instances_by_hostname is an internal attribute of TaskManagerInstances.
Clarify when we are expecting the actual TaskManagerInstances object.

Unify how we process running tasks and consume capacity. This has the
effect that we do less expensive work in after_lock_init and have 1 less
loop over all the running tasks. Previously we looped for both building
the dependency graph as well as for calculating the starting capacity of
all the instances and instance groups. Now we acheive both tasks in the
same loop.

Because of how this changes the somewhat subtle "do-si-do" of how to
initialize the Task Manager models, introduce a wrapper class that tries
to take some of that burden off of other areas where we re-use this like
in the serializer and the metrics. Also use this wrapper class to handle
nicities of how to track capacity consumption on instances and instance
groups.

Add tests for max_forks and max_concurrent_jobs

Fixup tests that use TaskManagerModels to accomodate changes.

assign ig before call to consume capacity

if we don't do it in that order, then we don't correctly account for
the container group jobs we are starting in the middle of the task
manager run
2022-11-30 17:14:33 -05:00
Alan Rominger
cfce31419d Move the IS_TESTING method out of settings 2022-09-28 11:19:10 -04:00
Alan Rominger
5648d9d96f Avoid cache warning for dispatching control type tasks 2022-09-27 15:18:13 -04:00
Shane McDonald
9b034ad574 generate control node receptor.conf
when a new remote execution/hop node is added
regenerate the receptor.conf for all control node to
peer out to the new remote execution node

Signed-off-by: Hao Liu <haoli@redhat.com>
Co-Authored-By: Seth Foster <fosterseth@users.noreply.github.com>
Co-Authored-By: Shane McDonald <me@shanemcd.com>
2022-09-23 09:46:12 -04:00
Alan Rominger
e87fabe6bb Submit job to dispatcher as part of transaction (#12573)
Make it so that submitting a task to the dispatcher happens as part of the transaction.
  this applies to dispatcher task "publishers" which NOTIFY the pg_notify queue
  if the transaction is not successful, it will not be sent, as per postgres docs

This keeps current behavior for pg_notify listeners
  practically, this only applies for the awx-manage run_dispatcher service
  this requires creating a separate connection and keeping it long-lived
  arbitrary code will occasionally close the main connection, which would stop listening

Stop sending the waiting status websocket message
  this is required because the ordering cannot be maintained with other changes here
  the instance group data is moved to the running websocket message payload

Move call to create_partition from task manager to pre_run_hook
  mock this in relevant unit tests
2022-08-18 09:43:53 -04:00
Seth Foster
55d295c2a6 Add metric to measure task manager transaction, including on_commit calls 2022-08-15 12:44:29 -04:00
Alan Rominger
f7e6a32444 Optimize task manager with debug toolbar, adjust prefetch (#12588) 2022-08-10 10:05:13 -04:00
Seth Foster
e6f8852b05 Cache task_impact
task_impact is now a field on the database
It is calculated and set during create_unified_job

set task_impact on .save for adhoc commands
2022-08-05 14:33:47 -04:00
Seth Foster
957b2b7188 Cache preferred instance groups
When creating unified job, stash the list of pk values from the
instance groups returned from preferred_instance_groups so that the
task management system does not need to call out to this method
repeatedly.

.preferred_instance_groups_cache is the new field
2022-08-05 14:33:28 -04:00
Alan Rominger
b94b3a1e91 [task_manager_refactor] Move approval node expiration logic into queryset (#12502)
Instead of loading all pending Workflow Approvals in the task manager,
  run a query that will only return the expired apporovals
  directly expire all which are returned by that query

Cache expires time as a new field in order to simplify WorkflowApproval filter
2022-08-05 14:33:27 -04:00
Elijah DeLee
7776a81e22 add job to dependency graph in start task
We always add the job to the graph right before calling start task.
Reduce complexity of proper operation by just doing this in start_task,
because if you call start_task, you need to add it to the dependency
graph
2022-08-05 14:33:26 -04:00
Elijah DeLee
bf89093fac unify call pattern for get_tasks 2022-08-05 14:33:26 -04:00
Elijah DeLee
76d76d13b0 Start pending workflows in TaskManager
we had tried doing this in the WorkflowManager, but we decided that
we want to handle ALL pending jobs and "soft blockers" to jobs with the
TaskManager/DependencyGraph and not duplicate that logic in the
WorkflowManager.
2022-08-05 14:33:26 -04:00
Seth Foster
0a47d05d26 split schedule_task_manager into 3
each call to schedule_task_manager becomes one of

ScheduleTaskManager
ScheduleDependencyManager
ScheduleWorkflowManager
2022-08-05 14:33:25 -04:00
Seth Foster
b3eb9e0193 pid kill each of the 3 task managers on timeout 2022-08-05 14:33:25 -04:00
Elijah DeLee
b26d2ab0e9 fix looking at wrong id for wf allow_simultaneous 2022-08-05 14:33:25 -04:00
Elijah DeLee
7eb0c7dd28 exit task manager loops early if we are timed out
add settings to define task manager timeout and grace period

This gives us still TASK_MANAGER_TIMEOUT_GRACE_PERIOD amount of time to
get out of the task manager.

Also, apply start task limit in WorkflowManager to starting pending
workflows
2022-08-05 14:33:24 -04:00
Elijah DeLee
236c1df676 fix lint errors 2022-08-05 14:33:24 -04:00
Seth Foster
ff118f2177 Manage pending workflow jobs in Workflow Manager
get_tasks uses UnifiedJob
Additionally, make local overrides run after development settings
2022-08-05 14:31:48 -04:00
Elijah DeLee
29d91da1d2 we can do all the work in one loop
more than saving the loop, we save building the WorkflowDag twice which
makes LOTS of queries!!!

Also, do a bulk update on the WorkflowJobNodes instead of saving in a
loop :fear:
2022-08-05 14:31:48 -04:00
Seth Foster
431b9370df Split TaskManager into
- DependencyManager spawns dependencies if necessary
- WorkflowManager processes running workflows to see if a new job is
  ready to spawn
- TaskManager starts tasks if unblocked and has execution capacity
2022-08-05 14:29:02 -04:00
Alan Rominger
783b744bdb Pass combined artifacts from nested workflows into downstream nodes (#12223)
* Track combined artifacts on workflow jobs

* Avoid schema change for passing nested workflow artifacts

* Basic support for nested workflow artifacts, add test

* Forgot that only does not work with polymorphic

* Remove incorrect field

* Consolidate logic and prevent recursion with UJ artifacts method

* Stop trying to do precedence by status, filter for obvious ones

* Review comments about sets

* Fix up bug with convergence node paths and artifacts
2022-06-23 16:54:53 -03:00
Seth Foster
2f82b75748 Add subsystem metrics for task manager 2022-06-14 11:00:11 -04:00
Seth Foster
eba4a3f1c2 in case we fail a job in task manager, we need to add the project update to the inventoryupdate.source_project field 2022-05-12 15:21:17 -04:00
Seth Foster
0ae9fe3624 if dependency fails, fail job in task manager 2022-05-12 14:00:13 -04:00
Seth Foster
1b662fcca5 SCM inv source trigger project update
- scm based inventory sources should launch project updates prior to
running inventory updates for that source.
- fixes scenario where a job is based on projectA, but the inventory
source is based on projectB. Running the job will likely trigger a
sync for projectA, but not projectB.

comments
2022-05-12 14:00:12 -04:00
Alan Rominger
cb63d92bbf Remove committed_capacity field, delete supporting code (#12086)
* Remove committed_capacity field, delete supporting code

* Track consumed capacity to solve the negatives problem

* Use more verbose name for IG queryset
2022-04-22 13:41:32 -04:00
Elijah DeLee
689a216726 move static methods used by task manager (#12050)
* move static methods used by task manager

These static methods were being used to act on Instance-like objects
that were SimpleNamespace objects with the necessary attributes.

This change introduces dedicated classes to replace the SimpleNamespace
objects and moves the formerlly staticmethods to a place where they are
more relevant instead of tacked onto models to which they were only
loosly related.

Accept in-memory data structure in init methods for tests

* initialize remaining capacity AFTER we built map of instances
2022-04-21 13:05:06 -04:00
Elijah DeLee
e24fc43a45 Revert "Only fetch fields we need in task manager"
This reverts commit 868e811b3f.

Turns out this does not play well with polymorphic models.

Will try again with .defer()
2022-04-14 11:55:33 -04:00
Elijah DeLee
868e811b3f Only fetch fields we need in task manager
By using .only we select fewer columns, avoiding potentially large
fields that we never reference.

Also, small tweak to eliminate what was a duplicate dictionary of
hostname:instance, because we don't need build and carry two copies of
the same data.
2022-04-13 17:24:33 -04:00
Elijah DeLee
2e9974133a calculate remaining capacity in static method
this is to avoid additional queries when we allready have all
the active jobs fetched in the task manager
2022-04-13 11:56:07 -04:00