This fixes a bug where jobs within a workflow job were not canceled
when the workflow job was canceled by the user.
The fix is to submit the cancel request as part of the
transaction that the WorkflowManager commits its work in.
This requires that we send the message without expecting a reply,
so this changes the control-with-reply cancel to a plain control function.
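A minimal sketch of the idea, assuming Django and the Postgres NOTIFY/LISTEN transport; the channel name and message shape are illustrative, not AWX's actual control API:

```python
import json

from django.db import connection, transaction


def submit_cancel(job_id):
    """Hypothetical helper: fire-and-forget cancel over pg_notify.

    Postgres only delivers NOTIFY when the surrounding transaction
    commits, so sending the cancel here ties it to the same commit
    the WorkflowManager uses -- and a rollback discards it. Because
    no reply is expected, the sender never blocks on a worker.
    """
    with transaction.atomic():
        # ... WorkflowManager state changes would happen here ...
        with connection.cursor() as cursor:
            cursor.execute(
                "SELECT pg_notify(%s, %s)",
                ["dispatcher_control", json.dumps({"control": "cancel", "job_id": job_id})],
            )
```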
Dispatcher refactoring to extract the pg_notify publish payload
into a separate method
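Illustratively, such a refactor might look like the following; the class and method names are assumptions, not the actual dispatcher code:

```python
import json
import time


class Publisher:
    def get_publish_payload(self, task_name, args=(), kwargs=None, uuid=None):
        # Building the payload in its own method makes it easy to
        # unit-test and to reuse for logging or metrics.
        return json.dumps({
            "task": task_name,
            "args": list(args),
            "kwargs": kwargs or {},
            "uuid": uuid,
            "time_pub": time.time(),
        })

    def publish(self, cursor, channel, task_name, **meta):
        payload = self.get_publish_payload(task_name, **meta)
        cursor.execute("SELECT pg_notify(%s, %s)", [channel, payload])
```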
Refactor periodic module under dispatcher entirely
Use real numbers for schedule reference time
Run based on due_to_run method
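A sketch of what these two changes imply, with hypothetical class and attribute names (the real periodic module may be structured differently):

```python
import time


class BaseSchedule:
    """Sketch of a periodic schedule entry."""

    def __init__(self, period, last_run=None):
        self.period = period  # seconds between runs
        # Use a plain float timestamp (a "real number") as the
        # schedule reference time rather than a datetime object.
        self.last_run = last_run if last_run is not None else time.monotonic()

    def due_to_run(self, now=None):
        # The scheduler loop asks each entry whether it is due,
        # instead of computing next-run times centrally.
        now = now if now is not None else time.monotonic()
        return (now - self.last_run) >= self.period

    def mark_run(self, now=None):
        self.last_run = now if now is not None else time.monotonic()
```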
Address review comments about naming and code comments
get_local_queuename will return the pod name of the instance.
Now that web and task are in different pods, when the web container queues a task it would be put into a queue with no task worker to execute it.
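A sketch of what such a helper looks like; `CLUSTER_HOST_ID` is used here as an assumed instance identifier:

```python
import socket

from django.conf import settings


def get_local_queuename():
    # Sketch: the queue name is derived from the local instance's
    # identity (e.g., the pod name in a container deployment). If web
    # and task run in separate pods, a name derived from the *web*
    # pod points at a queue no task worker ever consumes -- the
    # failure mode described above.
    return getattr(settings, "CLUSTER_HOST_ID", None) or socket.gethostname()
```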
If a cancel was attempted before, still allow attempting another cancel;
in this case, attempt to send the SIGTERM signal again.
Keep clicking, you might help!
Replace other cancel_callbacks with the SIGTERM watcher;
adapt the special inventory mechanism for this too
Get rid of the cancel_watcher method that raised an exception in the main thread
Handle the academic case of a SIGTERM race condition
Process cancelation as a control signal
Fully connect cancel method and run_dispatcher to control
Never transition workflows directly to canceled, add logs
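A minimal sketch of the SIGTERM-watcher pattern these commits converge on; the flag and loop structure are illustrative, not the exact AWX implementation:

```python
import signal


class SigtermWatcher:
    """Sketch: replace polling cancel_callbacks with a SIGTERM handler
    that records a cancel flag for the running job process."""

    def __init__(self):
        self.cancel_flag = False
        # Keep a reference to any previously installed handler so a
        # second SIGTERM (the "keep clicking" case) still works.
        self._previous = signal.signal(signal.SIGTERM, self._handler)

    def _handler(self, signum, frame):
        # Setting a flag is safe inside a signal handler; the job
        # loop checks it between units of work.
        self.cancel_flag = True
        if callable(self._previous):
            self._previous(signum, frame)


def run_job(steps, watcher):
    for step in steps:
        if watcher.cancel_flag:
            break  # stop work; cleanup happens in the caller
        step()
```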
Instead, just have each worker connect directly to Redis.
This has a few benefits:
- it's simpler to explain and debug
- back pressure on the queue keeps messages around in Redis (which is
observable, and survives the restart of Python processes)
- it's likely notably more performant at high loads
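As a sketch of the direct-connection model, using redis-py; the queue key and event handling are assumptions:

```python
import json

import redis


def worker_loop(queue_key="callback_receiver"):
    # Each worker holds its own connection; no in-process fan-out
    # layer sits between Redis and the worker.
    conn = redis.Redis()
    while True:
        # BLPOP blocks until a message arrives. Unconsumed messages
        # simply accumulate in the Redis list, which is what makes
        # back pressure observable (LLEN) and restart-safe.
        item = conn.blpop(queue_key, timeout=1)
        if item is None:
            continue
        _key, payload = item
        process_event(json.loads(payload))


def process_event(event):
    ...  # persist the callback event
```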
Make the --status flag work by fetching a periodically recorded snapshot
of internal process state; additionally, update the callback receiver to
*also* record these statistics so we can gain more insight into any
performance issues
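One way such a snapshot mechanism can work, sketched with illustrative names (the real recording target and fields may differ):

```python
import json
import os
import time

STATUS_FILE = "/var/run/awx-dispatcher/status.json"  # assumed location


def record_statistics(pool):
    # Called periodically from the main loop: dump a snapshot of
    # internal process state somewhere a separate CLI process can read.
    snapshot = {
        "recorded_at": time.time(),
        "workers": [
            {"pid": w.pid, "busy": w.busy, "messages": w.messages_handled}
            for w in pool.workers
        ],
    }
    tmp = STATUS_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(snapshot, f)
    os.replace(tmp, STATUS_FILE)  # atomic swap; readers never see a partial file


def read_status():
    # `--status` just reads the latest snapshot instead of having to
    # RPC into a live (and possibly wedged) process.
    with open(STATUS_FILE) as f:
        return json.load(f)
```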
* Postgres NOTIFY/LISTEN channel names have size limitations as well as
character limitations. Respect those limitations while at the same time
generating a unique channel name.
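A sketch of one way to satisfy both constraints (channel names follow Postgres identifier rules, which cap them at 63 bytes); the exact sanitizing and hash length here are assumptions:

```python
import hashlib
import re


def pg_channel_name(name):
    """Derive a valid, unique Postgres channel name from any input.

    Restricting to [a-z0-9_] keeps the name a plain identifier, and
    a hash suffix keeps distinct inputs distinct even after
    sanitizing and truncating to the 63-byte limit.
    """
    safe = re.sub(r"[^a-z0-9_]", "_", name.lower())
    suffix = hashlib.sha1(name.encode()).hexdigest()[:8]
    return f"{safe[:54]}_{suffix}"  # 54 + 1 + 8 = 63 bytes max
```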
1. Install awx w/ a single node.
2. Start a long-running job.
3. Forcibly kill the `awx-manage run_dispatcher` process (e.g.,
SIGKILL) and do not start it again.
4. The job remains in the running state; without a second cluster to
   discover the job, it is never reaped.
5. This PR allows you to cancel the job from the UI+API.
This commit implements the bulk of `awx-manage run_dispatcher`, a new
command that binds to RabbitMQ via kombu and balances messages across
a pool of workers that are similar to celeryd workers in spirit.
Specifically, this includes:
- a new decorator, `awx.main.dispatch.task`, which can be used to
  decorate functions or classes so that they can be designated as
  "Tasks" (see the sketch after this list)
- support for fanout/broadcast tasks (at this point in time, only
`conf.Setting` memcached flushes use this functionality)
- support for job reaping
- support for success/failure hooks for job runs (i.e.,
`handle_work_success` and `handle_work_error`)
- support for an autoscaling worker pool that scales processes up and
  down on demand
- minimal support for RPC, such as status checks and pool recycle/reload
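A sketch of how the `task` decorator is meant to be used; the decorator path matches the commit message, but the call signatures and queue name are illustrative rather than the command's actual API:

```python
from awx.main.dispatch import task  # path per the commit message


@task()
def add(a, b):
    # A plain function designated as a "Task": publishing a call like
    # add.apply_async([1, 2]) (hypothetical publish API) serializes
    # the invocation and routes it to a dispatcher worker.
    return a + b


@task(queue="broadcast_all")  # hypothetical fanout queue name
def flush_setting_cache(key):
    # A fanout/broadcast task runs on every node -- the pattern the
    # commit uses for conf.Setting memcached flushes.
    ...
```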