replace celery task decorators with a kombu-based publisher

this commit implements the bulk of `awx-manage run_dispatcher`, a new
command that binds to RabbitMQ via kombu and balances messages across
a pool of workers that are similar to celeryd workers in spirit.
Specifically, this includes:

- a new decorator, `awx.main.dispatch.task`, which can be used to
  decorate functions or classes so that they can be designated as
  "Tasks"
- support for fanout/broadcast tasks (at this point in time, only
  `conf.Setting` memcached flushes use this functionality)
- support for job reaping
- support for success/failure hooks for job runs (i.e.,
  `handle_work_success` and `handle_work_error`)
- support for auto scaling worker pool that scale processes up and down
  on demand
- minimal support for RPC, such as status checks and pool recycle/reload
This commit is contained in:
Ryan Petrello
2018-08-08 13:41:07 -04:00
parent da74f1d01f
commit ff1e8cc356
54 changed files with 1606 additions and 1147 deletions

View File

@@ -102,7 +102,7 @@ available fields.
```
`CopyAPIView` will automatically detect sub objects of an object, and do a deep copy of all sub objects
as a background celery task. There are sometimes permission issues with sub object copy. For example,
as a background task. There are sometimes permission issues with sub object copy. For example,
when copying nodes of a workflow job template, there are cases where the user performing copy has no use
permission of related credential and inventory of some nodes, and it is desired those fields will be
`None`. In order to do that, developer should provide a static method `deep_copy_permission_check_func`

View File

@@ -1,15 +1,15 @@
# Task Manager Overview
The task manager is responsible for deciding when jobs should be introduced to celery for running. When choosing a task to run the considerations are: (1) creation time, (2) job dependency, (3) capacity.
The task manager is responsible for deciding when jobs should scheduled to run. When choosing a task to run the considerations are: (1) creation time, (2) job dependency, (3) capacity.
Independent jobs are ran in order of creation time, earliest first. Jobs with dependencies are also ran in creation time order within the group of job dependencies. Capacity is the final consideration when deciding to release a job to be ran by celery.
Independent jobs are ran in order of creation time, earliest first. Jobs with dependencies are also ran in creation time order within the group of job dependencies. Capacity is the final consideration when deciding to release a job to be ran by the task dispatcher.
## Task Manager Architecture
The task manager has a single entry point, `Scheduler().schedule()`. The method may be called in parallel, at any time, as many times as the user wants. The `schedule()` function tries to aquire a single, global, lock using the Instance table first record in the database. If the lock cannot be aquired the method returns. The failure to aquire the lock indicates that there is another instance currently running `schedule()`.
### Hybrid Scheduler: Periodic + Event
The `schedule()` function is ran (a) periodically by a celery task and (b) on job creation or completion. The task manager system would behave correctly if ran, exclusively, via (a) or (b). We chose to trigger `schedule()` via both mechanisms because of the nice properties I will now mention. (b) reduces the time from launch to running, resulting a better user experience. (a) is a fail-safe in case we miss code-paths, in the present and future, that change the 3 scheduling considerations for which we should call `schedule()` (i.e. adding new nodes to tower changes the capacity, obscure job error handling that fails a job)
The `schedule()` function is ran (a) periodically by a background task and (b) on job creation or completion. The task manager system would behave correctly if ran, exclusively, via (a) or (b). We chose to trigger `schedule()` via both mechanisms because of the nice properties I will now mention. (b) reduces the time from launch to running, resulting a better user experience. (a) is a fail-safe in case we miss code-paths, in the present and future, that change the 3 scheduling considerations for which we should call `schedule()` (i.e. adding new nodes to tower changes the capacity, obscure job error handling that fails a job)
Emperically, the periodic task manager has served us well in the past and we will continue to rely on it with the added event-triggered `schedule()`.
### Scheduler Algorithm
@@ -17,14 +17,14 @@ The `schedule()` function is ran (a) periodically by a celery task and (b) on jo
* Detect finished workflow jobs
* Spawn next workflow jobs if needed
* For each pending jobs; start with oldest created job
* If job is not blocked, and there is capacity in the instance group queue, then mark the as `waiting` and submit the job to celery.
* If job is not blocked, and there is capacity in the instance group queue, then mark the as `waiting` and submit the job to RabbitMQ.
### Job Lifecycle
| Job Status | State |
|:----------:|:------------------------------------------------------------------------------------------------------------------:|
| pending | Job launched. <br>1. Hasn't yet been seen by the scheduler <br>2. Is blocked by another task <br>3. Not enough capacity |
| waiting | Job submitted to celery. |
| running | Job running in celery. |
| waiting | Job published to an AMQP queue.
| running | Job running on a Tower node.
| successful | Job finished with ansible-playbook return code 0. |
| failed | Job finished with ansible-playbook return code other than 0. |
| error | System failure. |

155
docs/tasks.md Normal file
View File

@@ -0,0 +1,155 @@
Background Tasks in AWX
=======================
AWX runs a lot of Python code asynchronously _in the background_ - meaning
_outside_ of the context of an HTTP request, such as:
* Any time a Job is launched in AWX (a Job Template, an Adhoc Command, a Project
Update, an Inventory Update, a System Job), a background process retrieves
metadata _about_ that job from the database and forks some process (e.g.,
`ansible-playbook`, `awx-manage inventory_import`)
* Certain expensive or time consuming tasks run in the background
asynchronously (like deleting an inventory).
* AWX runs a variety of periodic background tasks on a schedule. Some examples
are:
- AWX's "Task Manager/Scheduler" wakes up periodically and looks for
`pending` jobs that have been launched and are ready to start running.
- AWX periodically runs code that looks for scheduled jobs and launches
them.
- AWX runs a variety of periodic tasks that clean up temporary files, and
perform various administrative checks
- Every node in an AWX cluster runs a periodic task that serves as
a heartbeat and capacity check
Tasks, Queues and Workers
----------------
To accomplish this, AWX makes use of a "Task Queue" abstraction. Task Queues
are used as a mechanism to distribute work across machines in an AWX
installation. A Task Queue's input is a unit of work called a Task. Dedicated
worker processes running on every AWX node constantly monitor these queues for
new work to perform.
AWX communicates with these worker processes via AMQP - using RabbitMQ,
specifically - to mediate between clients and workers. To initiate a task, the
client (generally, Python code in the AWX API) publishes a message to a queue,
and RabbitMQ then delivers that message to one or more workers.
Clustered AWX installations consist of multiple workers spread across every
node, giving way to high availability and horizontal scaling.
Direct vs Fanout Messages
-------------------------
AWX publishes tasks in two distinct ways.
*Direct* messages are bound _directly_ to a specific named queue. When you launch
a Job Template in AWX, it looks at the available capacity of the various nodes
in your cluster and chooses an `Execution Node` where the playbook will run.
In this scenario, AWX publishes a message to a direct queue associated with
that AWX node. The dispatcher process running on that AWX node is specifically
bound to _listen_ for new events on their instance-specific queue.
Certain direct queues in AWX are bound to by _every_ AWX node. For example,
when an inventory deletion task is published, any available node across the
entire AWX may perform the work. Under _direct_ exchanges, every published
message is consumed and handled by *one* worker process.
*Fanout* messages are sent out in a broadcast fashion. When you change
a setting value in the AWX API, a fanout message is broadcast to _every_ AWX
node in your cluster, and code runs on _every_ node.
Defining and Running Tasks
--------------------------
Tasks are defined in AWX's source code, and generally live in the
`awx.main.tasks` module. Tasks can be defined as simple functions:
from awx.main.dispatch.publish import task
@task()
def add(a, b):
return a + b
...or classes that define a `run` method:
@task()
class Adder:
def run(self, a, b):
return a + b
To publish a task and run it in the background, use the `apply_async()`
function:
add.apply_async([1, 1])
Adder.apply_async([1, 1])
When you run this function, a JSON message is composed and published to the
appropriate AMQP queue:
{
"uuid": "<some_unique_string>",
"args": [1, 1],
"kwargs": {},
"task": "awx.main.tasks.add"
}
When a background worker receives the message, it deserializes it and runs the
associated Python code:
awx.main.tasks.add(123)
Dispatcher Implementation
-------------------------
Every node in an AWX install runs `awx-manage run_dispatcher`, a Python process
that uses the `kombu` library to consume messages from the appropriate queues
for that node (the default shared queue, a queue specific to the node's
hostname, and the broadcast queue). The Dispatcher process manages a pool of
child processes that it distributes inbound messages to. These worker
processes perform the actual work of deserializing published tasks and running
the associated Python code.
Heartbeats, Capacity, and Job Reaping
------------------------------------
One of the most important tasks in a clustered AWX installation is the periodic
heartbeat task. This task runs periodically on _every_ node, and is used to
record a heartbeat and system capacity for that node (which is used by the
scheduler when determining where to placed launched jobs).
If a node in an AWX cluster discovers that one of its peers has not updated its
heartbeat within a certain grace period, it is assumed to be offline, and its
capacity is set to zero to avoid scheduling new tasks on that node.
Additionally, jobs allegedly running or scheduled to run on that node are
assumed to be lost, and "reaped", or marked as failed.
Debugging
---------
`awx-manage run_dispatcher` includes a few flags that allow interaction and
debugging:
```
[root@awx /]# awx-manage run_dispatcher --status
2018-09-14 18:39:22,223 WARNING awx.main.dispatch checking dispatcher status for awx
awx[pid:9610] workers total=4 min=4 max=60
. worker[pid:9758] sent=12 finished=12 qsize=0 rss=106.730MB [IDLE]
. worker[pid:9769] sent=5 finished=5 qsize=0 rss=105.141MB [IDLE]
. worker[pid:9782] sent=5 finished=4 qsize=1 rss=110.430MB
- running 0c1deb4d-25ae-49a9-804f-a8afd05aff29 RunJob(*[9])
. worker[pid:9787] sent=3 finished=3 qsize=0 rss=101.824MB [IDLE]
```
This outputs running and queued task UUIDs handled by a specific dispatcher
(which corresponds to `main_unifiedjob.celery_task_id` in the database):
```
[root@awx /]# awx-manage run_dispatcher --running
2018-09-14 18:39:22,223 WARNING awx.main.dispatch checking dispatcher running for awx
['eb3b0a83-86da-413d-902a-16d7530a6b25', 'f447266a-23da-42b4-8025-fe379d2db96f']
```
Additionally, you can tell the local running dispatcher to recycle all of the
workers in its pool. It will wait for any running jobs to finish and exit when
work has completed, spinning up replacement workers.
```
awx-manage run_dispatcher --reload
```