the bigint migration removed the foreign key constraints for:
- host_id
- job_id (and projectupdate_id, etc...)
because of this, we no longer need to check explicitly for a host_id
IntegrityError (it can no longer occur)
additionally, while it's possible to insert an event with a mismatched
job_id now (for example, you can totally start a long-running job, and
delete the job record in the background using the ORM or psql), doing
so results in DoesNotExist errors in the code that handles the
playbook_on_stats events
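as a rough sketch of the failure mode (the handler function below is hypothetical; `Job` is the Django-style model and `DoesNotExist` is the standard Django exception):

```python
from awx.main.models import Job

def handle_playbook_on_stats(job_id, event_data):
    # hypothetical sketch: the parent job row may have been deleted mid-run
    # (via the ORM or psql), in which case the lookup raises DoesNotExist
    try:
        job = Job.objects.get(pk=job_id)
    except Job.DoesNotExist:
        # nothing left to update; the job record is gone
        return
    # ... update host summaries / notifications for `job` here ...
```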
instead, just have each worker connect directly to redis
this has a few benefits:
- it's simpler to explain and debug
- back pressure on the queue keeps messages around in redis (which is
observable, and survives the restart of Python processes)
- it's likely notably more performant at high loads
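a minimal sketch of what "connect directly" looks like with the redis-py client (the queue name and event handling are illustrative, not the actual callback receiver code):

```python
import json
import redis

def worker_loop(queue_name='callback_tasks'):
    # each worker process opens its own connection to redis rather than
    # receiving messages relayed through an intermediary process
    conn = redis.Redis()
    while True:
        # BLPOP blocks until a message arrives; anything not yet consumed
        # just sits in the redis list, which is the observable back
        # pressure described above (and it survives worker restarts)
        item = conn.blpop(queue_name, timeout=5)
        if item is None:
            continue
        _, payload = item
        event = json.loads(payload)
        # ... persist `event` to the database here ...
```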
make the --status flag work by fetching a periodically recorded snapshot
of internal process state; additionally, update the callback receiver to
*also* record these statistics so we can gain more insight into any
performance issues
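roughly what that could look like (the redis key and the worker attributes below are assumptions for illustration, not the actual recording code):

```python
import json
import time
import redis

STATS_KEY = 'callback_receiver_statistics'  # illustrative key name

def record_statistics(pool, interval=30):
    # periodically serialize a snapshot of internal process state so that
    # a separate `--status` invocation can read it later, even if the
    # workers themselves are busy or wedged
    conn = redis.Redis()
    while True:
        snapshot = {
            'timestamp': time.time(),
            'workers': [
                # pid / messages_handled are assumed worker attributes
                {'pid': w.pid, 'messages': w.messages_handled}
                for w in pool.workers
            ],
        }
        conn.set(STATS_KEY, json.dumps(snapshot))
        time.sleep(interval)

def print_status():
    # roughly what `--status` does: fetch and print the last snapshot
    raw = redis.Redis().get(STATS_KEY)
    print(raw.decode() if raw else 'no statistics recorded yet')
```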
Situations have come up where the 5+ minute kill signal for
run_task_manager is emitted to the worker process running it, but
because the worker improperly inherited the AWXConsumerBase().stop()
handler, a deadlock was ultimately triggered on the database
connection.
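the general pattern for avoiding that class of bug, sketched with a plain fork-based worker (an illustration of the idea, not the exact fix that landed):

```python
import os
import signal

def child_entrypoint(work):
    # a forked child inherits the parent's signal handlers; if the parent's
    # stop handler tears down shared state (like the database connection),
    # letting it run in the child can deadlock both sides. reset to the
    # default disposition (or install a child-safe handler) right after fork.
    signal.signal(signal.SIGTERM, signal.SIG_DFL)
    signal.signal(signal.SIGINT, signal.SIG_DFL)
    work()

def spawn_worker(work):
    pid = os.fork()
    if pid == 0:
        child_entrypoint(work)
        os._exit(0)
    return pid
```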
* Sleep before trying to reconnect
The most common reason for entering this reconnect loop is the Redis
service stopping before the callback receiver does when Tower services
are shut down.
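a hedged sketch of the retry loop (the five-second delay and queue name are illustrative):

```python
import time
import redis
from redis.exceptions import ConnectionError as RedisConnectionError

def consume_forever(handle_event, queue_name='callback_tasks', delay=5):
    # `handle_event` is whatever actually processes a message payload
    while True:
        try:
            conn = redis.Redis()
            while True:
                item = conn.blpop(queue_name, timeout=1)
                if item is not None:
                    handle_event(item[1])
        except RedisConnectionError:
            # redis usually disappears first during a full service shutdown;
            # sleeping here avoids spinning in a tight reconnect loop
            time.sleep(delay)
```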
success/failure notifications for *playbooks* include summary data about
the hosts involved, based on the contents of the playbook_on_stats event
the current implementation suffers from a number of race conditions that
sometimes can cause that data to be missing or incomplete; this change
makes it so that for *playbooks* we build (and send) the notification in
response to the playbook_on_stats event, not the EOF event
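roughly how the summary falls out of the stats event itself, assuming the usual Ansible stats shape (per-host counters keyed under `ok`, `changed`, `failures`, `dark`, `skipped`):

```python
def summarize_hosts(event_data):
    # build the per-host summary for the notification directly from the
    # playbook_on_stats payload rather than re-querying host rows later,
    # which is where the races crept in
    summary = {}
    for key in ('ok', 'changed', 'failures', 'dark', 'skipped'):
        for host, count in event_data.get(key, {}).items():
            summary.setdefault(host, {})[key] = count
    return summary
```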
additionally, optimize away several per-event host lookups and
changed/failed propagation lookups
we've always performed these (fairly expensive) queries *on every event
save* - if you're processing tens of thousands of events in short
bursts, this is way too slow
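the gist of the optimization, sketched as a simple in-memory cache (the `job.inventory.hosts` relation is the Django-style lookup being avoided; the cache shape is illustrative):

```python
# resolve host name -> host id once per (job, host) instead of issuing the
# same query on every single event save
_host_id_cache = {}

def get_host_id(job, host_name):
    key = (job.id, host_name)
    if key not in _host_id_cache:
        _host_id_cache[key] = (
            job.inventory.hosts.filter(name=host_name)
            .values_list('id', flat=True)
            .first()
        )
    return _host_id_cache[key]
```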
this commit also introduces a new command for profiling the insertion
rate of events, `awx-manage callback_stats`
see: https://github.com/ansible/awx/issues/5514
this _technically_ prevents a remote code exploit where a user who has
access to publish AMQP messages to the dispatch queue could craft
a special message that would import and run arbitrary Python functions;
that said, users with this privilege level are generally _already_ the
awx user (so they can already do this by hand if they want)
this commit implements the bulk of `awx-manage run_dispatcher`, a new
command that binds to RabbitMQ via kombu and balances messages across
a pool of workers that are similar to celeryd workers in spirit.
Specifically, this includes:
- a new decorator, `awx.main.dispatch.task`, which can be used to
decorate functions or classes so that they can be designated as
"Tasks"
- support for fanout/broadcast tasks (at this point in time, only
`conf.Setting` memcached flushes use this functionality)
- support for job reaping
- support for success/failure hooks for job runs (i.e.,
`handle_work_success` and `handle_work_error`)
- support for an auto-scaling worker pool that scales processes up and
down on demand
- minimal support for RPC, such as status checks and pool recycle/reload
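a minimal sketch of the registration half of such a decorator (an illustration of the pattern only; the registry, `name` keyword, and `resolve()` helper are assumptions, not the actual `awx.main.dispatch.task` implementation):

```python
TASK_REGISTRY = {}

def task(fn=None, *, name=None):
    # register a function (or class) under a dotted-path style name so the
    # dispatcher can look it up when a matching message arrives
    def _register(obj):
        key = name or '{}.{}'.format(obj.__module__, obj.__name__)
        TASK_REGISTRY[key] = obj
        return obj
    return _register(fn) if fn is not None else _register

def resolve(task_name):
    # only dispatch callables that were explicitly registered, rather than
    # importing arbitrary dotted paths out of the message body
    return TASK_REGISTRY[task_name]

@task
def add(x, y):
    # hypothetical example task
    return x + y
```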