Commit Graph

62 Commits

Author SHA1 Message Date
Jake Jackson
36a00ec46b AAP-58539 Move to dispatcherd (#16209)
* WIP First pass
* started removing feature flags and adjusting logic
* Add decorator
* moved to dispatcher decorator
* updated as many as I could find
* Keep callback receiver working
* remove any code that is not used by the call back receiver
* add back auto_max_workers
* added back get_auto_max_workers into common utils
* Remove control and hazmat (squash this not done)
* moved status out and deleted control as no longer needed
* removed unused imports
* adjusted test import to pull correct method
* fixed imports and addressed clusternode heartbeat test
* Update function comments
* Add back hazmat for config and remove baseworker
* added back hazmat per @alancoding feedback around config
* removed baseworker completely and refactored it into the callback
  worker
* Fix dispatcher run call and remove dispatch setting
* remove dispatcher mock publish setting
* Adjust heartbeat arg and more formatting
* fixed the call to cluster_node_heartbeat missing binder
* Fix attribute error in server logs
2026-01-23 20:49:32 +00:00
Seth Foster
2fa2cd8beb Add timeout and on duplicate to system tasks (#16169)
Modify the invocation of @task_awx to accept timeout and
on_duplicate keyword arguments. These arguments are
only used in the new dispatcher implementation.

Add decorator params:
- timeout
- on_duplicate

to tasks to ensure better recovery for
stuck or long-running processes.

---------

Signed-off-by: Seth Foster <fosterbseth@gmail.com>
2025-11-12 23:18:57 -05:00
Alan Rominger
94764a1f17 AAP-42649 Flag-gated use of "dispatcherd" as its own library (#15981)
Use dynamic AWX max_workers value

Make basic --status and --running commands work

Make feature flag enabled true by default for development

* [dispatcherd] Dispatcher socket-based `--status` demo working (#15908)

* Fix Task Decorator to Work With and Without Feature Flag (AAP-41775) (#15911)

* refactor(system): extract common heartbeat helpers and split cluster_node_heartbeat

Extract common heartbeat logic into helper functions:  _heartbeat_instance_management: consolidates instance management, health checks, and lost-instance detection.  _heartbeat_check_versions: compares instance versions and initiates shutdown when necessary.  _heartbeat_handle_lost_instances: reaps jobs and marks lost instances offline.

Refactor the original cluster_node_heartbeat to use these helpers and retain legacy behavior (using bind_kwargs).

Introduce adispatch_cluster_node_heartbeat for dispatcherd: uses the control API to retrieve running tasks and reaps them.

Link the two implementations by attaching adispatch_cluster_node_heartbeat as the _new_method on cluster_node_heartbeat.

* feat(publish): delegate heartbeat task submission to new dispatcherd implementation

Update apply_async to check at runtime if FEATURE_NEW_DISPATCHER is enabled.

When the task is cluster_node_heartbeat and a _new_method is attached, delegate the task submission to the new dispatcherd implementation.

Preserve the original behavior for all other tasks and fallback on error.

* refactor(system): extract task ID retrieval from dispatcherd into helper function

Improves readability of adispatch_cluster_node_heartbeat by extracting
the complex UUID parsing logic into a dedicated helper function.
Adds clearer error handling and follows established code patterns.

* fix(dispatcher): Enable task decorator to work with and without feature flag

Implemented a new approach for handling task execution with feature flags
by attaching alternative implementations to apply_async._new_method. This
allows cluster_node_heartbeat to work correctly with both the legacy and
new dispatcher systems without modifying core decorator logic.

AAP-41775

* fix(dispatcher): Improve error handling and logging in feature flag implementation

- Add error handling when attaching alternative dispatcher implementation
- Fix method self-reference in apply_async to properly use cls.apply_async
- Document limitations of this targeted approach for specific tasks
- Add logging for better debugging of dispatcher selection
- Ensure decorator timing by keeping method attachment after function definitions

This completes the robust implementation for switching between dispatcher
implementations based on feature flags.

AAP-41775

* fix(dispatcher): Implement registry pattern for dispatcher feature flag compatibility

Replaces direct method attribute assignment with a global registry for
alternative implementations. The original approach tried to attach new
methods directly to apply_async bound methods, which fails because bound
methods don't support attribute assignment in Python.

The registry pattern:
- Creates a global ALTERNATIVE_TASK_IMPLEMENTATIONS dict in publish.py
- Registers alternative implementations by task name
- Modifies apply_async to check the registry when feature flag is enabled
- Adds extensive logging throughout the process for debugging

This enables cluster_node_heartbeat to work correctly with both the legacy
and new dispatcher implementations based on the FEATURE_NEW_DISPATCHER flag.

AAP-41775

* refactor(dispatcher): Remove excessive logging from dispatcher implementation

Reduces verbose debugging logs while maintaining essential logging for critical
operations. Preserves:
- Task implementation selection based on feature flag
- Registration success/failure messages
- Critical error reporting

Removed:
- Registry content debugging messages
- Repetitive task diagnostics
- Non-essential information logging

AAP-41775

* fix(dispatcher): Fix shallow copy in dispatcher schedule conversion

This resolves "AttributeError: 'float' object has no attribute 'total_seconds'"
errors when the dispatcher is restarted.

Refs: AAP-41775

* Use IPC mechanism to get running tasks (#15926)
* Allow tasks from tasks
* Fix failure to limit to waiting jobs
* Get job record with lock
* Fix failures in dispatcherd feature branch (#15930)
* Fully handle DispatcherCancel
* Complete rest of preload import work
* Complete dispatcherd integration & job cancellation (AAP-43033) (#15941)
* feat(dispatcher): Implement job cancellation for new dispatcher

Adds feature-flag-aware job cancellation that routes cancel requests to either
the legacy dispatcher or the new dispatcherd library based on the
FEATURE_NEW_DISPATCHER flag.

- Updates cancel_dispatcher_process() to use dispatcherd's control API when enabled
- Handles both direct cancellation and task manager workflow cancellation cases
- Works with DispatcherCancel exception handling to properly handle SIGUSR1 signals

AAP-43033

* fix(dispatcher): Update run_dispatcher.py to properly handle task cancellation

Modifies the cancel command in run_dispatcher.py to properly cancel tasks
when the FEATURE_NEW_DISPATCHER flag is enabled, rather than just listing
running tasks.

The implementation translates each task UUID to the appropriate
filter format expected by the dispatcherd control API, maintaining the same
behavior as the original implementation.

Part of: AAP-43033

* refactor(system): Refactor dispatch_startup() to extract common startup logic and branch based on feature flag

This commit refactors the dispatch_startup() function to improve clarity and consistency across the legacy
and new dispatcherd flows.

No dispatcher-specific functionality is needed beyond the changes made, so this refactoring improves robustness without
altering core behavior.

* refactor(system): Refactor inform_cluster_of_shutdown() for clarity

* refactor(tasks): Replace @task with @task_awx across 22 tasks for dispatcher compatibility

- Migrated all task decorators to use @task_awx, ensuring dispatcher-aware behavior.
- Tested each task with the new dispatcherd, verifying that tasks using the registry pattern execute correctly without needing binder‐based alternative implementations.
- Removed redundant logging and outdated comments.
- Legacy tasks that do not require special parameter extraction continue to use their original logic.
- This commit reflects our complete journey of testing and verifying dispatcherd compatibility across all 22 tasks.

* refactor(publish): fix linter

* Fix bug from the branch rebase

* AAP-43763 Add tests for connection management in dispatcherd workers (#15949)

* Add test for job cancel in live tests
* Fix bug from the branch rebase
* Add test for connection recovery after connection broke
* Add test for breaking connection

* Fix dispatcherd bugs: schedule aliases, job kwargs handling, cancel handling (#15960)

* Put in job kwargs handling, not done before

* AAP-44382 [dispatcherd] Fixes for running with feature flag off (#15973)

* Use correct decorator for test of tasks

* Finalize dispatcherd feature branch (#15975)

* Work dispatcherd into dependency management system

* Use util methods from DAB

* Rename the dispatcherd feature flag, and flip default to not-enabled

* Move to new submit_task method

* Update the location of the sock file

* AAP-44381 Make dispatcherd config loading more lazy (#15979)

* Make dispatcherd config loading more lazy

* Make submission error more obvious

* Fix signal handling gap, hijack SIGUSR1 from dispatcherd (#15983)

* Fix signal handling gap, hijack SIGUSR1 from dispatcherd

* Minor adjustments to dispatcherd status command

* [dispatcherd] Get rid of alternative task registry (#15984)

Get rid of alternative task registry

* Fix deadlock error and other cleanup errors (#15987)
* Move to proper error handling location

---------

Co-authored-by: artem_tiupin <70763601+art-tapin@users.noreply.github.com>
2025-05-16 09:39:22 -04:00
Chris Meyers
ad706d67c2 Add ee cleanup tests
* Adds cleanup tests to the live test.
2025-01-24 10:55:30 -05:00
Alan Rominger
f57a9863d6 Use advisory_lock from DAB (#15676)
* Use advisory_lock from DAB

* Remove the django-pglocks dep

* Re-run updater script

* Move the import in new location
2025-01-15 14:06:59 -05:00
Sasa Jovicic
7835e39bac Bugfix: adjust incorrectly passed keywords with exclude-strings argument (#15721)
* Fix incorrectly passed keywords with exclude-strings arg to ansible-runner worker cleanup command

Signed-off-by: Sasa Jovicic <jovicic.sasa@hotmail.com>

* Keep the quotes for each arg and adjust test_receptor

---------

Signed-off-by: Sasa Jovicic <jovicic.sasa@hotmail.com>
2025-01-02 16:16:14 -05:00
Hao Liu
cb04ad8ef5 Fix receptor work unit release after completion (#15679)
Fix bug introduced by https://github.com/ansible/awx/pull/15392 that cause workunit to NOT be auto released after job completes
2024-12-03 11:42:09 -05:00
Hao Liu
139d8f0ae2 Add RECEPTOR_KEEP_WORK_ON_ERROR setting
If RECEPTOR_KEEP_WORK_ON_ERROR is set to true receptor work unit will not be automatically released

Co-Authored-By: Chris Meyers <chrismeyersfsu@users.noreply.github.com>
2024-07-22 17:02:37 -04:00
Chris Meyers
7571df49d5 Pass --exclude="list of exclude dirs like this"
* Previously, the params were passed without quotes and each directory
  was being interpreted as a seperate command line flag.
* Added some structure around the error messages returned from
  receptorctl so we can more easily decide how to handle each case. For
  example, releasing the cleanup job from receptor doesn't absolutely
  need to succeed because we have a periodic job that does that. In
  fact, that is the thing that is making it fail .. but I digress.
2024-03-28 14:42:08 -04:00
Seth Foster
d1cacf64de Add functional and unit tests
Updated existing tests to support the
ReceptorAddress model

- cannot peer to self
- cannot peer to node that is already peered to me
- cannot peer to node more than once (via 2+ addresses)
- cannot set is_internal True

Other changes:
Change post save signal to only call
schedule_write_receptor_config() when an actual change is detected.

Make functional tests more robust by
checking for specific validation error in the
response.

I.e. instead of just checking for 400, just for 400
and that the error message corresponds to the
validation we are testing for.

Signed-off-by: Seth Foster <fosterbseth@gmail.com>
2024-02-02 10:37:41 -05:00
Seth Foster
c32f234ebb Add peers_from_control_nodes to ReceptorAddress
- write_receptor_config peers to ReceptorAddress entries
that have peers_from_control_nodes enabled

- peers_from_control_nodes and listener_port removed from Instance model

- peers_from_control_nodes added to ReceptorAddress model

- ReceptorAddress is now unique by address and protocol combination

- Write receptor config task is dispatched upon ReceptorAddress creation
or deletion, and when control node is first created

- InstanceLinkSerializer adds a target_address field and has logic
to grab the instance hostname associated with the peered ReceptorAddress

Signed-off-by: Seth Foster <fosterbseth@gmail.com>
2024-02-02 10:37:41 -05:00
Seth Foster
5cb3d3b078 Update receptor conf when address changes
Add post save and post delete hooks to
call write_receptor_config when
a receptor address is added / removed.

Add peers_from_control_nodes to
provision_instance

Signed-off-by: Seth Foster <fosterbseth@gmail.com>
2024-02-02 10:37:41 -05:00
Hao Liu
387e877485 Connect from controlplane node to mesh ingress 2024-02-02 10:37:41 -05:00
Alan Rominger
cdb4f0b7fd Consume job_explanation from runner, fix error reporting error (#13482) 2023-08-30 16:45:50 -04:00
Seth Foster
965127637b Make ip_address read only
Setting a different value for ip_address
and hostname does not work with the current
way we create receptor certs.
2023-08-29 13:06:54 -04:00
Seth Foster
81e06dace2 Add listener_port to provision_instance
API changes
- cannot change peers or enable
peers_from_control_nodes on VM deployments
- allow setting ip_address
- use ip_address over hostname in the generated
group_vars/all.yml
- Drop api/v2/peers endpoint

DB changes
- add ip_address unique constraint, but ignore "" entries

Other changes
- provision_instance should take listener_port option

Tests
- test that new controls doesn't disturb other peers
relationships
- test ip_address over hostname
2023-08-29 13:06:54 -04:00
Seth Foster
2bf6512a8e Do not change link state if Removing
inspect_established_receptor_connections should
not change link state is current state is Removing.

Other changes:
- rename inspect_execution_nodes to inspect_execution_and_hop_nodes
- Default link state is Adding
- Set min listener_port value to 1024
- inspect_established_receptor_connections now
runs as part of cluster_node_heartbeat task
2023-08-29 13:06:54 -04:00
Seth Foster
3bd25c682e Allow setting ip_address for execution nodes 2023-08-29 13:06:54 -04:00
Seth Foster
2a51f23b7d Add functional API tests
add tests for calling write_receptor_config

add write_receptor_config test

Do not set default listener_port on control node
2023-08-29 13:06:54 -04:00
Lorenzo Tanganelli
f7fdb7fe8d Add peers readonly api and instancelink constraint (#13916)
Add Disconnected link state

introspect_receptor_connections is a periodic
task that examines active receptor connections
and cross-checks it with the InstanceLink info.

Any links that should be active but are not
will be put into a Disconnected state. If
active, it will be in an Established state.

UI - Add hop creation and peers mgmt (#13922)

* add UI for mgmt peers, instance edit and add

* add peer info on detail and bug fix on detail

* remove unused chip and change peer label

* rename lookup, put Instance type disable on edit

---------

Co-authored-by: tanganellilore <lorenzo.tanagnelli@hotmail.it>
2023-08-29 13:06:54 -04:00
Seth Foster
d8abd4912b Add support in hop nodes in API 2023-08-29 13:06:54 -04:00
Alan Rominger
6f9ea1892b AAP-14538 Only process ansible_facts for successful jobs (#14313) 2023-08-04 17:10:14 -04:00
Seth Foster
8b49f910c7 Add settings.RECEPTOR_LOG_LEVEL, update work signing key path (#14098) 2023-07-06 11:39:30 -04:00
Seth Foster
af39b2cd3f Rename work signing private key filename (#14156) 2023-06-21 19:50:04 -04:00
Seth Foster
900c4fd8f1 Rename work signing private key filename (#14151) 2023-06-21 09:52:58 -04:00
Hao Liu
b96564da55 Rename/relocate receptor cert and keys (#14091) 2023-06-09 12:57:04 -04:00
Seth Foster
45c13c25a4 Set receptor log level to info (#13958) 2023-05-05 15:01:21 -04:00
Hao Liu
cd3f7666be add get_task_queuename
get_local_queuename will return the pod name of the instance

now that web and task are in different pods when web container queue a task it will be put into a queue without as task worker to execute the task
2023-03-29 22:09:19 -04:00
Seth Foster
db2253601d Allow TLS 1.2 for Receptor connections
- Required for FIPS environment where TLS 1.3 is
not supported
- TLS 1.3 can still be used if the nodes
both agree to use during handshake.
2023-03-27 11:07:30 -04:00
Alan Rominger
d5de1f9d11 Make use of new keepalive messages from ansible-runner
Make setting API configurable and process keepalive events
  when seen in the event callback

Use env var in pod spec and make it specific to K8S
2023-02-28 09:04:07 -05:00
Alan Rominger
f5785976be Update to comply with new black rules 2023-02-01 14:59:38 -05:00
Seth Foster
36c0d07b30 Result_traceback should not include job stdout
If a job fails, we do receptor work results and put that output
into result_traceback.

We should only do this if
1. Receptor unit has failed
2. Runner callback processed 0 events

Otherwise we risk putting too much data into this field.
2022-12-21 13:05:44 -05:00
Alan Rominger
f88b993b18 Fix bug, sign work based signing, not verification 2022-12-06 16:21:17 -05:00
Alan Rominger
84f2b91105 Fix fallout from turning off work signing in docker-compose 2022-11-18 13:25:05 -05:00
Alan Rominger
d6fea77082 Include stdout from health check if it is not nothing 2022-10-26 16:26:59 -04:00
Seth Foster
301807466d Only get receptor.conf lock in k8s environment
- Writing to receptor.conf only takes place in K8S, so only get a
lock if IS_K8S is true
2022-09-23 09:46:15 -04:00
Seth Foster
1fde9c4f0c add firewall rules to control node 2022-09-23 09:46:14 -04:00
Jeff Bradberry
08c18d71bf Move InstanceLink creation and updating to the async tasks
So that they get applied in situations that do not go through the API.
2022-09-23 09:46:14 -04:00
Jeff Bradberry
f3a9d4db07 Assign a default queue to wait_for_jobs() 2022-09-23 09:46:14 -04:00
Jeff Bradberry
ebd200380a Resolve a deadlock in write_receptor_config() 2022-09-23 09:46:13 -04:00
Jeff Bradberry
1b650d6927 When deprovisioning a node, kick off a task that waits on running jobs
After all jobs on the node are complete, delete the node then
broadcast the write_receptor_config task.

Also, make sure that write_receptor_config updates the state of links
that are in 'adding' state.
2022-09-23 09:46:13 -04:00
Jeff Bradberry
3bc86ca8cb Follow up on new execution node creation
- hop nodes are descoped
- links need to be created on execution node creation
- expose the 'edit' capabilities on the instance serializer
2022-09-23 09:46:13 -04:00
Seth Foster
3b024a057f Allow work signing for execution node (#12771)
- work-signing added to the generated receptor config
- During receptor task submission, signwork is True when submitting to
  an execution node
2022-09-23 09:46:13 -04:00
Jeff Bradberry
0465a10df5 Deal with exceptions when running execution_node_health_check (#12733) 2022-09-23 09:46:12 -04:00
Shane McDonald
9b034ad574 generate control node receptor.conf
when a new remote execution/hop node is added
regenerate the receptor.conf for all control node to
peer out to the new remote execution node

Signed-off-by: Hao Liu <haoli@redhat.com>
Co-Authored-By: Seth Foster <fosterseth@users.noreply.github.com>
Co-Authored-By: Shane McDonald <me@shanemcd.com>
2022-09-23 09:46:12 -04:00
Alan Rominger
c59bbdecdb Refactor canceling to work through messaging and signals, not database
If canceled attempted before, still allow attempting another cancel
in this case, attempt to send the sigterm signal again.
Keep clicking, you might help!

Replace other cancel_callbacks with sigterm watcher
  adapt special inventory mechanism for this too

Get rid of the cancel_watcher method with exception in main thread

Handle academic case of sigterm race condition

Process cancelation as control signal

Fully connect cancel method and run_dispatcher to control

Never transition workflows directly to canceled, add logs
2022-09-01 15:20:31 -04:00
John Westcott IV
268a4ad32d Modifying reaper of administrative work units to allow for change from Controller to Hybrid nodes (#12614) 2022-08-11 09:03:35 -03:00
Alan Rominger
452744b67e Delay update of artifacts and error fields until final job save (#11832)
* Delay update of artifacts until final job save

Save tracebacks from receptor module to callback object

Move receptor traceback check up to be more logical

Use new mock_me fixture to avoid DB call with me method

Update the special runner message to the delay_update pattern

* Move special runner message into post-processing of callback fields
2022-05-03 14:42:50 -04:00
John Westcott IV
a0ccc8c925 Merge pull request #5784 from ansible/runner_changes_42 (#12083) 2022-04-22 10:46:35 -04:00
Alan Rominger
799bac4066 Merge pull request #11860 from AlanCoding/hybrid_artifacts
Do not remove artifacts for local work
2022-03-18 10:37:06 -04:00