mirror of https://github.com/ansible/awx.git, synced 2026-03-18 01:17:35 -02:30
replace celery task decorators with a kombu-based publisher
This commit implements the bulk of `awx-manage run_dispatcher`, a new command that binds to RabbitMQ via kombu and balances messages across a pool of workers that are similar to celeryd workers in spirit. Specifically, this includes:

- a new decorator, `awx.main.dispatch.task`, which can be used to decorate functions or classes so that they can be designated as "Tasks"
- support for fanout/broadcast tasks (at this point in time, only `conf.Setting` memcached flushes use this functionality)
- support for job reaping
- support for success/failure hooks for job runs (i.e., `handle_work_success` and `handle_work_error`)
- support for an auto-scaling worker pool that scales processes up and down on demand
- minimal support for RPC, such as status checks and pool recycle/reload
@@ -1,15 +1,15 @@
# Task Manager Overview
-The task manager is responsible for deciding when jobs should be introduced to celery for running. When choosing a task to run the considerations are: (1) creation time, (2) job dependency, (3) capacity.
+The task manager is responsible for deciding when jobs should be scheduled to run. When choosing a task to run, the considerations are: (1) creation time, (2) job dependency, (3) capacity.
-Independent jobs are ran in order of creation time, earliest first. Jobs with dependencies are also ran in creation time order within the group of job dependencies. Capacity is the final consideration when deciding to release a job to be ran by celery.
+Independent jobs are run in order of creation time, earliest first. Jobs with dependencies are also run in creation-time order within the group of job dependencies. Capacity is the final consideration when deciding to release a job to be run by the task dispatcher.
## Task Manager Architecture
The task manager has a single entry point, `Scheduler().schedule()`. The method may be called in parallel, at any time, as many times as the user wants. The `schedule()` function first tries to acquire a single, global lock using the first record in the Instance table in the database. If the lock cannot be acquired, the method returns. The failure to acquire the lock indicates that another instance is currently running `schedule()`.
### Hybrid Scheduler: Periodic + Event
-The `schedule()` function is ran (a) periodically by a celery task and (b) on job creation or completion. The task manager system would behave correctly if ran, exclusively, via (a) or (b). We chose to trigger `schedule()` via both mechanisms because of the nice properties I will now mention. (b) reduces the time from launch to running, resulting a better user experience. (a) is a fail-safe in case we miss code-paths, in the present and future, that change the 3 scheduling considerations for which we should call `schedule()` (i.e. adding new nodes to tower changes the capacity, obscure job error handling that fails a job)
+The `schedule()` function is run (a) periodically by a background task and (b) on job creation or completion. The task manager system would behave correctly if run, exclusively, via (a) or (b). We chose to trigger `schedule()` via both mechanisms because of the nice properties described below. (b) reduces the time from launch to running, resulting in a better user experience. (a) is a fail-safe in case we miss code paths, in the present and future, that change the 3 scheduling considerations for which we should call `schedule()` (e.g. adding new nodes to Tower changes the capacity, or obscure job error handling that fails a job).
Empirically, the periodic task manager has served us well in the past, and we will continue to rely on it alongside the added event-triggered `schedule()`.
### Scheduler Algorithm
@@ -17,14 +17,14 @@ The `schedule()` function is ran (a) periodically by a celery task and (b) on jo
* Detect finished workflow jobs
* Spawn next workflow jobs if needed
* For each pending job, starting with the oldest created job
-* If job is not blocked, and there is capacity in the instance group queue, then mark the as `waiting` and submit the job to celery.
+* If the job is not blocked, and there is capacity in the instance group queue, then mark the job as `waiting` and submit the job to RabbitMQ.
### Job Lifecycle
| Job Status | State |
|:----------:|:------------------------------------------------------------------------------------------------------------------:|
| pending | Job launched. <br>1. Hasn't yet been seen by the scheduler <br>2. Is blocked by another task <br>3. Not enough capacity |
-| waiting | Job submitted to celery. |
-| running | Job running in celery. |
+| waiting | Job published to an AMQP queue. |
+| running | Job running on a Tower node. |
| successful | Job finished with ansible-playbook return code 0. |
| failed | Job finished with ansible-playbook return code other than 0. |
| error | System failure. |
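The pending-job loop from the scheduler algorithm (walk pending jobs oldest-first, skip blocked jobs, respect instance-group capacity, mark releasable jobs `waiting`) can be sketched as below. The job dictionaries and the `cost` field are hypothetical stand-ins for AWX's job models and task impact:

```python
from datetime import datetime

# Hypothetical in-memory job records; real AWX jobs are Django models.
jobs = [
    {"name": "job-b", "status": "pending", "created": datetime(2018, 1, 1, 10, 5),
     "blocked": False, "cost": 10},
    {"name": "job-a", "status": "pending", "created": datetime(2018, 1, 1, 10, 0),
     "blocked": False, "cost": 10},
    {"name": "job-c", "status": "pending", "created": datetime(2018, 1, 1, 10, 9),
     "blocked": True, "cost": 10},
]

def process_pending_tasks(jobs, capacity):
    """Walk pending jobs oldest-first; if a job is unblocked and there is
    enough remaining capacity, mark it `waiting` and 'submit' it."""
    submitted = []
    for job in sorted(jobs, key=lambda j: j["created"]):
        if job["status"] != "pending" or job["blocked"]:
            continue
        if job["cost"] > capacity:
            continue
        job["status"] = "waiting"
        capacity -= job["cost"]
        submitted.append(job["name"])  # real code publishes to RabbitMQ here
    return submitted

print(process_pending_tasks(jobs, capacity=15))  # ['job-a']
```

With a capacity of 15, only the oldest unblocked job fits: `job-b` is skipped for lack of remaining capacity and `job-c` because it is blocked, matching the three considerations (creation time, dependency, capacity) in the order the document lists them.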