* Enable new fancy asyncio metrics for dispatcherd
Remove old dispatcher metrics and patch in new data from local whatever
Update test fixture to new dispatcherd version
* Update dispatcherd again
* Handle node filter in URL, and catch more errors
* Add test for metric filter
* Split module for dispatcherd metrics
* Additional dispatcher removal simplifications and waiting repear updates
* Fix double call and logging message
* Implement bugbot comment, should reap running on lost instances
* Add test case for new pending behavior
* WIP First pass
* started removing feature flags and adjusting logic
* Add decorator
* moved to dispatcher decorator
* updated as many as I could find
* Keep callback receiver working
* remove any code that is not used by the call back receiver
* add back auto_max_workers
* added back get_auto_max_workers into common utils
* Remove control and hazmat (squash this not done)
* moved status out and deleted control as no longer needed
* removed unused imports
* adjusted test import to pull correct method
* fixed imports and addressed clusternode heartbeat test
* Update function comments
* Add back hazmat for config and remove baseworker
* added back hazmat per @alancoding feedback around config
* removed baseworker completely and refactored it into the callback
worker
* Fix dispatcher run call and remove dispatch setting
* remove dispatcher mock publish setting
* Adjust heartbeat arg and more formatting
* fixed the call to cluster_node_heartbeat missing binder
* Fix attribute error in server logs
* update to Python 3.12
* remove use of utcnow
* switch to timezone.utc
datetime.UTC is an alias of datetime.timezone.utc. if we're doing the double import for datetime it's more straightforward to just import timezone as well and get it directly
* debug python env version issue
* change python version
* pin to SHA and remove debug portion
Upgrade to Django 5.2 LTS with compatibility fixes across fields, migrations, dispatch config, tests, and dev deps.
Dependencies:
- Upgrade django to 5.2.8 and relax requirements.in to >=5.2,<5.3.
- Bump django-debug-toolbar to >=6.0 for compatibility.
Backend:
- awx/conf/fields.py: switch URL TLD regex to use DomainNameValidator.ul in custom URLField.
- awx/main/management/commands/gather_analytics.py: use datetime.timezone.utc for naïve datetime handling.
- awx/main/dispatch/config.py: add mock_publish option; avoid DB access for test runs, set default max_workers, and support a noop broker.
Migrations (SQLite/Postgres compatibility):
- Add awx/main/migrations/_sqlite_helper.py with db-aware AlterIndexTogether/RenameIndex wrappers; consume in 0144_event_partitions.py and 0184_django_indexes.py.
- Update 0187_hop_nodes.py to use CheckConstraint(condition=...).
- Add 0205_alter_instance_peers_alter_job_hosts_and_more.py adjusting through_fields/relations on instance.peers, job.hosts, and role.ancestors.
- _dab_rbac.py: iterate roles with chunk_size=1000 for migration performance.
Tests:
Include hcp_terraform in default credential types in test_credential.py.
---------
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Alan Rominger <arominge@redhat.com>
* AAP-57817 Add Redis connection retry using redis-py 7.0+ built-in mechanism
* Refactor Redis client helpers to use settings and eliminate code duplication
* Create awx/main/utils/redis.py and move Redis client functions to avoid circular imports
* Fix subsystem_metrics to share Redis connection pool between
client and pipeline
* Cache Redis clients in RelayConsumer and RelayWebsocketStatsManager to avoid creating new connection pools on every call
* Add cap and base config
* Add Redis retry logic with exponential backoff to handle connection failures during long-running operations
* Add REDIS_BACKOFF_CAP and REDIS_BACKOFF_BASE settings to allow
adjustment of retry timing in worst-case scenarios without code changes
* Simplify Redis retry tests by removing unnecessary reload logic
Modify the invocation of @task_awx to accept timeout and
on_duplicate keyword arguments. These arguments are
only used in the new dispatcher implementation.
Add decorator params:
- timeout
- on_duplicate
to tasks to ensure better recovery for
stuck or long-running processes.
---------
Signed-off-by: Seth Foster <fosterbseth@gmail.com>
* Fix bug where collectstatic could error due to dispatcherd config
* Revert test because it will not work in test suite
* New publish mocking system
* Remove import of unused
* Fix default publish broker
Use dynamic AWX max_workers value
Make basic --status and --running commands work
Make feature flag enabled true by default for development
* [dispatcherd] Dispatcher socket-based `--status` demo working (#15908)
* Fix Task Decorator to Work With and Without Feature Flag (AAP-41775) (#15911)
* refactor(system): extract common heartbeat helpers and split cluster_node_heartbeat
Extract common heartbeat logic into helper functions: _heartbeat_instance_management: consolidates instance management, health checks, and lost-instance detection. _heartbeat_check_versions: compares instance versions and initiates shutdown when necessary. _heartbeat_handle_lost_instances: reaps jobs and marks lost instances offline.
Refactor the original cluster_node_heartbeat to use these helpers and retain legacy behavior (using bind_kwargs).
Introduce adispatch_cluster_node_heartbeat for dispatcherd: uses the control API to retrieve running tasks and reaps them.
Link the two implementations by attaching adispatch_cluster_node_heartbeat as the _new_method on cluster_node_heartbeat.
* feat(publish): delegate heartbeat task submission to new dispatcherd implementation
Update apply_async to check at runtime if FEATURE_NEW_DISPATCHER is enabled.
When the task is cluster_node_heartbeat and a _new_method is attached, delegate the task submission to the new dispatcherd implementation.
Preserve the original behavior for all other tasks and fallback on error.
* refactor(system): extract task ID retrieval from dispatcherd into helper function
Improves readability of adispatch_cluster_node_heartbeat by extracting
the complex UUID parsing logic into a dedicated helper function.
Adds clearer error handling and follows established code patterns.
* fix(dispatcher): Enable task decorator to work with and without feature flag
Implemented a new approach for handling task execution with feature flags
by attaching alternative implementations to apply_async._new_method. This
allows cluster_node_heartbeat to work correctly with both the legacy and
new dispatcher systems without modifying core decorator logic.
AAP-41775
* fix(dispatcher): Improve error handling and logging in feature flag implementation
- Add error handling when attaching alternative dispatcher implementation
- Fix method self-reference in apply_async to properly use cls.apply_async
- Document limitations of this targeted approach for specific tasks
- Add logging for better debugging of dispatcher selection
- Ensure decorator timing by keeping method attachment after function definitions
This completes the robust implementation for switching between dispatcher
implementations based on feature flags.
AAP-41775
* fix(dispatcher): Implement registry pattern for dispatcher feature flag compatibility
Replaces direct method attribute assignment with a global registry for
alternative implementations. The original approach tried to attach new
methods directly to apply_async bound methods, which fails because bound
methods don't support attribute assignment in Python.
The registry pattern:
- Creates a global ALTERNATIVE_TASK_IMPLEMENTATIONS dict in publish.py
- Registers alternative implementations by task name
- Modifies apply_async to check the registry when feature flag is enabled
- Adds extensive logging throughout the process for debugging
This enables cluster_node_heartbeat to work correctly with both the legacy
and new dispatcher implementations based on the FEATURE_NEW_DISPATCHER flag.
AAP-41775
* refactor(dispatcher): Remove excessive logging from dispatcher implementation
Reduces verbose debugging logs while maintaining essential logging for critical
operations. Preserves:
- Task implementation selection based on feature flag
- Registration success/failure messages
- Critical error reporting
Removed:
- Registry content debugging messages
- Repetitive task diagnostics
- Non-essential information logging
AAP-41775
* fix(dispatcher): Fix shallow copy in dispatcher schedule conversion
This resolves "AttributeError: 'float' object has no attribute 'total_seconds'"
errors when the dispatcher is restarted.
Refs: AAP-41775
* Use IPC mechanism to get running tasks (#15926)
* Allow tasks from tasks
* Fix failure to limit to waiting jobs
* Get job record with lock
* Fix failures in dispatcherd feature branch (#15930)
* Fully handle DispatcherCancel
* Complete rest of preload import work
* Complete dispatcherd integration & job cancellation (AAP-43033) (#15941)
* feat(dispatcher): Implement job cancellation for new dispatcher
Adds feature-flag-aware job cancellation that routes cancel requests to either
the legacy dispatcher or the new dispatcherd library based on the
FEATURE_NEW_DISPATCHER flag.
- Updates cancel_dispatcher_process() to use dispatcherd's control API when enabled
- Handles both direct cancellation and task manager workflow cancellation cases
- Works with DispatcherCancel exception handling to properly handle SIGUSR1 signals
AAP-43033
* fix(dispatcher): Update run_dispatcher.py to properly handle task cancellation
Modifies the cancel command in run_dispatcher.py to properly cancel tasks
when the FEATURE_NEW_DISPATCHER flag is enabled, rather than just listing
running tasks.
The implementation translates each task UUID to the appropriate
filter format expected by the dispatcherd control API, maintaining the same
behavior as the original implementation.
Part of: AAP-43033
* refactor(system): Refactor dispatch_startup() to extract common startup logic and branch based on feature flag
This commit refactors the dispatch_startup() function to improve clarity and consistency across the legacy
and new dispatcherd flows.
No dispatcher-specific functionality is needed beyond the changes made, so this refactoring improves robustness without
altering core behavior.
* refactor(system): Refactor inform_cluster_of_shutdown() for clarity
* refactor(tasks): Replace @task with @task_awx across 22 tasks for dispatcher compatibility
- Migrated all task decorators to use @task_awx, ensuring dispatcher-aware behavior.
- Tested each task with the new dispatcherd, verifying that tasks using the registry pattern execute correctly without needing binder‐based alternative implementations.
- Removed redundant logging and outdated comments.
- Legacy tasks that do not require special parameter extraction continue to use their original logic.
- This commit reflects our complete journey of testing and verifying dispatcherd compatibility across all 22 tasks.
* refactor(publish): fix linter
* Fix bug from the branch rebase
* AAP-43763 Add tests for connection management in dispatcherd workers (#15949)
* Add test for job cancel in live tests
* Fix bug from the branch rebase
* Add test for connection recovery after connection broke
* Add test for breaking connection
* Fix dispatcherd bugs: schedule aliases, job kwargs handling, cancel handling (#15960)
* Put in job kwargs handling, not done before
* AAP-44382 [dispatcherd] Fixes for running with feature flag off (#15973)
* Use correct decorator for test of tasks
* Finalize dispatcherd feature branch (#15975)
* Work dispatcherd into dependency management system
* Use util methods from DAB
* Rename the dispatcherd feature flag, and flip default to not-enabled
* Move to new submit_task method
* Update the location of the sock file
* AAP-44381 Make dispatcherd config loading more lazy (#15979)
* Make dispatcherd config loading more lazy
* Make submission error more obvious
* Fix signal handling gap, hijack SIGUSR1 from dispatcherd (#15983)
* Fix signal handling gap, hijack SIGUSR1 from dispatcherd
* Minor adjustments to dispatcherd status command
* [dispatcherd] Get rid of alternative task registry (#15984)
Get rid of alternative task registry
* Fix deadlock error and other cleanup errors (#15987)
* Move to proper error handling location
---------
Co-authored-by: artem_tiupin <70763601+art-tapin@users.noreply.github.com>
Retire workers after a certain age, allowing them to finish their
current task if they are not idle.
This mitigates any issues like memory leaks in long running workers,
especially if systems stay busy for months at a time.
Introduce new optional setting WORKER_MAX_LIFETIME_SECONDS, defaulting to 4 hours.
* Dump running tasks when running out of capacity
* Use same logic for max_workers and capacity
* Address case where CPU capacity is the constraint
* Add a test for correspondence
* Fake redis to make tests work
* AAP-37282 Add parse JQ data and test it for a `job` object in isolation (#15774)
* Add jq dependency
* Add file in progress
* Add license for jq
* Write test and get it passing
* Successfully test collection of `event_query.yml` data (#15761)
* Callback plugin method from cmeyers adapted to global collection list
Get tests passing
Mild rebranding
Put behind feature flag, flip true in dev
Add noqa flag
* Add missing wait_for_events
* feat: try grabbing query files from artifacts directory (#15776)
* Contract changes for the event_query collection callback plugin (#15785)
* Minor import changes to collection processing in callback plugin
* Move agreed location of event_query file
* feat: remaining schema changes for indirect host audits (#15787)
* Re-organize test file and move artifacts processing logic to callback (#15784)
* Rename the indirect host counting test file
* Combine artifacts saving logic
* Connect host audit model to jq logic via new task
* Add unit tests for indirect host counting (#15792)
* Do not get django flags from database (#15794)
* Document, implement, and test remaining indirect host audit fields (#15796)
* Document, implement, and test remaining indirect host audit fields
* Fix hashing
* AAP-39559 Wait for all event processing to finish, add fallback task (#15798)
* Wait for all event processing to finish, add fallback task
* Add flag check to periodic task
* feat: cleanup of old indirect host audit records (#15800)
* By default, do not count indirect hosts (#15801)
* By default, do not count indirect hosts
* Fix copy paste goof
* Fix linter issue from base branch
* prevent multiple tasks from processing the same job events, prevent p… (#15805)
prevent multiple tasks from processing the same job events, prevent periodic task from spawning another task per job
* Fix typos and other bugs found by Pablo review
* fix: rely on resolved_action instead of task, adapt to proposed query… (#15815)
* fix: rely on resolved_action instead of task, adapt to proposed query structure
* tests: update indirect host tests
* update remaining queries to new format
* update live test
* Remove polling loop for job finishing event processing (#15811)
* Remove polling loop for job finishing event processing
* Make awx/main/tests/live dramatically faster (#15780)
* AAP-37282 Add parse JQ data and test it for a `job` object in isolation (#15774)
* Add jq dependency
* Add file in progress
* Add license for jq
* Write test and get it passing
* Successfully test collection of `event_query.yml` data (#15761)
* Callback plugin method from cmeyers adapted to global collection list
Get tests passing
Mild rebranding
Put behind feature flag, flip true in dev
Add noqa flag
* Add missing wait_for_events
* feat: try grabbing query files from artifacts directory (#15776)
* Contract changes for the event_query collection callback plugin (#15785)
* Minor import changes to collection processing in callback plugin
* Move agreed location of event_query file
* feat: remaining schema changes for indirect host audits (#15787)
* Re-organize test file and move artifacts processing logic to callback (#15784)
* Rename the indirect host counting test file
* Combine artifacts saving logic
* Connect host audit model to jq logic via new task
* Document, implement, and test remaining indirect host audit fields (#15796)
* Document, implement, and test remaining indirect host audit fields
* Fix hashing
* AAP-39559 Wait for all event processing to finish, add fallback task (#15798)
* Wait for all event processing to finish, add fallback task
* Add flag check to periodic task
* feat: cleanup of old indirect host audit records (#15800)
* prevent multiple tasks from processing the same job events, prevent p… (#15805)
prevent multiple tasks from processing the same job events, prevent periodic task from spawning another task per job
* Remove polling loop for job finishing event processing (#15811)
* Remove polling loop for job finishing event processing
* Make awx/main/tests/live dramatically faster (#15780)
* temp
* remove test
* reorder migrations to allow indirect instances backport
* cleanup for rebase and merge into devel
---------
Co-authored-by: Peter Braun <pbraun@redhat.com>
Co-authored-by: jessicamack <jmack@redhat.com>
Co-authored-by: Peter Braun <pbranu@redhat.com>
* General upgrade of dependencies
* adjust licenses to match requirements
* add missing licenses
* another pass to fix licenses
* Try easy for for psycopg encoding pattern change
---------
Co-authored-by: jessicamack <jmack@redhat.com>
Fix deadlock scenario where dispatcher child process stuck in reading from queue loop after dispatcher parent process decided to quit
Co-authored-by: Alan Rominger <arominge@redhat.com>
* Organize metrics into their respective service
* Server per-service metrics on a per-service http server
* Increase prometheus client usage over our custom metrics fields
This fixes a bug where jobs within a workflow job were not canceled
when the workflow job was canceled by the user
The fix is to submit the cancel request as a part of the
transaction that WorkflowManager commits its work in
this requires that we send the message without expecting a reply
so this changes the control-with-reply cancel to just a control function
Dispatcher refactoring to get pg_notify publish payload
as separate method
Refactor periodic module under dispatcher entirely
Use real numbers for schedule reference time
Run based on due_to_run method
Review comments about naming and code comments
This fixes https://github.com/ansible/awx/issues/14245 which has
more information about this issue.
This change addresses both:
- A clashing signal handler (registering a callback to fire when
the task manager times out, and hitting that callback in cases
where we didn't expect to). Make dispatcher timeout use
SIGUSR1, not SIGTERM.
- Metrics not being reported should not make us crash, so that is
now fixed as well.
Signed-off-by: Rick Elrod <rick@elrod.me>
Co-authored-by: Alan Rominger <arominge@redhat.com>
NUL characters are not allowed in text fields in the database
We used to strip them out of stdout but the exception changed
And we want to be sure to strip them out of JSONBlob fields
This is expected to free up 4 additional database connections per traditional node
compare to roughly 12 in total before this change
Out of these 3 are accomplished by using existing connection for recently added services
then 1 is obtained by closing the connection for the idle callback receiver main process
Signed-off-by: jessicamack <jmack@redhat.com>
Co-authored-by: jessicamack <jmack@redhat.com>
This adds a handful of metrics to /api/v2/metrics/ recorded from the dispatcher main process
Adds logic in the dispatcher period tasks to calculate these for the last collection interval
Reports worker count, task count, scale up events, and availability
Add data to demo grafana dashboard
* Introduce new method in settings, import in-line w NOQA mark
* Further refine the app_name to use shorter service names like dispatcher
* Clean up listener logic, change some names
get_local_queuename will return the pod name of the instance
now that web and task are in different pods when web container queue a task it will be put into a queue without as task worker to execute the task
* Fix race with heartbeat and reaper logic
* Fix tests to fail when over drift over heartbeat time
* replaced modified with started time for reap() code and added test
* fixed logic bug and cleaned up tests
* Added comments to tests to call out reasoning
* Workaround for events with NUL char, touch up error loop
This fixes an error where some events would not save
due to having the 0x00 character which errors in postgres
this adds a line to replace it with empty text
Hitting that kind of event put us in an infinite error loop
so this change makes a number of changes to prevent similar loops
the showcase example is a negative counter,
this is not realistic in the real world but works for unit tests
These error loop fixes seek to esablish the cases where we clear the buffer
Some logic is removed from the outer loop, with the idea that
ensure_connection will better distinguish flake
* From review comments, delay NUL char sanitization to later
Use pop to make list operations more clear
* Fix incorrect use of pop