38 Commits

Author SHA1 Message Date
Ryan Petrello
b744c4ebb7
further optimize callback receiver buffering for certain situations
see: https://github.com/ansible/awx/issues/9085
2021-01-14 17:17:12 -05:00
Chris Meyers
eb47c8dbc6 centralize reusable profiling code 2020-10-27 08:21:41 -04:00
Ryan Petrello
baad765179
refactor some callback receiver code
the bigint migration removed the foreign key constraints for:

- host_id
- job_id (and projectupdate_id, etc...)

because of this, we don't really need to check explicitly for a host_id
IntegrityError anymore (because it won't occur)

additionally, while it's possible to insert an event with a mismatched
job_id now (for example, you can totally start a long-running job, and
delete the job record in the background using the ORM or psql), doing
so results in DoesNotExist errors in the code that handles the
playbook_on_stats events
2020-09-25 13:12:42 -04:00
Ryan Petrello
cd0b9de7b9
remove multiprocessing.Queue usage from the callback receiver
instead, just have each worker connect directly to redis
this has a few benefits:

- it's simpler to explain and debug
- back pressure on the queue keeps messages around in redis (which is
  observable, and survives the restart of Python processes)
- it's likely notably more performant at high loads
2020-09-24 13:53:58 -04:00
Ryan Petrello
57f8e48894
make --status more robust for dispatcher, and add support for receiver
make the --status flag work by fetching a periodically recorded snapshot
of internal process state; additionally, update the callback receiver to
*also* record these statistics so we can gain more insight into any
performance issues
2020-09-17 15:33:37 -04:00
Jeff Bradberry
ced8f42835 Force worker processes to have a different signal handler from the parent
Situations have come up where the 5+ minute kill signal for
run_task_manager is emitted to the worker process running it, but
since the worker improperly inherited the AWXConsumerBase().stop()
handler a deadlock ultimately was triggered on the database
connection.
2020-06-04 15:41:28 -04:00
Ryan Petrello
b4b261b918
fix busted flake8 2020-05-01 13:51:37 -04:00
chris meyers
a8f52c1639 actually do exponential calc rather than *2
* Log the time until the next reconnect attempt in the log message,
rather than the attempt number
2020-04-28 15:24:08 -04:00
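
The two reconnect commits here describe sleeping before each reconnect and computing the delay exponentially instead of multiplying by 2. A minimal sketch of that idea follows. The function name `reconnect_delay` and the `base` and `cap` values are assumptions for illustration and are not taken from the AWX source.

```python
def reconnect_delay(attempt, base=1.0, cap=60.0):
    """Seconds to sleep before reconnect attempt `attempt` (0-based).

    Hypothetical helper: `base` and `cap` are illustrative defaults,
    not values used by AWX.
    """
    # Exponential growth (base * 2**attempt), bounded by `cap` so the
    # sleep never grows without limit during a long outage.
    return min(cap, base * (2 ** attempt))


def linear_delay(attempt, base=1.0):
    """One reading of the older "*2" behaviour, shown only for contrast."""
    return base * attempt * 2


if __name__ == "__main__":
    for attempt in range(8):
        # Log the time until the next attempt, not the attempt number.
        print(f"reconnecting in {reconnect_delay(attempt):.0f}s")
```

A caller would pass the returned value to `time.sleep()` before retrying, and reset the attempt counter once a connection succeeds.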
chris meyers
2ecd055d1e sleep backoff on cb receiver reconnect
* Sleep before trying to reconnect
The most common reason for entering this reconnect loop is that the Redis
service stops before the callback receiver when Tower services are stopped.
2020-04-28 12:47:40 -04:00
Christian Adams
a899a147e1 Fix new flake8 from pyflakes 2.2.0 release 2020-04-20 09:50:50 -04:00
Ryan Petrello
80147acc1c
work around redis connection failures in the callback receiver
if redis stops/starts, sometimes the callback receiver doesn't recover
without a restart; this fixes that
2020-04-09 15:38:03 -04:00
Ryan Petrello
c8044b4755
migrate event table primary keys from integer to bigint
see: https://github.com/ansible/awx/issues/6010
2020-03-26 15:54:38 -04:00
Ryan Petrello
d40a5dec8f
change when we send job notifications to avoid a race condition
success/failure notifications for *playbooks* include summary data about
the hosts, based on the contents of the playbook_on_stats event

the current implementation suffers from a number of race conditions that
sometimes can cause that data to be missing or incomplete; this change
makes it so that for *playbooks* we build (and send) the notification in
response to the playbook_on_stats event, not the EOF event
2020-03-19 10:01:52 -04:00
chris meyers
093d204d19
fix flake8 2020-03-18 16:10:19 -04:00
chris meyers
be58906aed
remove kombu 2020-03-18 16:10:17 -04:00
chris meyers
2a2c34f567
combine all the broker replacement pieces
* local redis for event processing
* postgres for message broker
* redis for websockets
2020-03-18 16:10:15 -04:00
chris meyers
558e92806b
POC postgres broker 2020-03-18 16:10:15 -04:00
chris meyers
355fb125cb
redis events 2020-03-18 16:10:15 -04:00
AlanCoding
e59cb07064
Add wording for control message log 2020-02-11 10:01:25 -05:00
Ryan Petrello
3c31e0ed16
some more minor callback cleanup and development tweaks 2020-01-27 17:18:09 -05:00
Ryan Petrello
78b00652bd
add the ability to enable profiling for the callback receiver workers 2020-01-27 12:03:53 -05:00
Bill Nottingham
4e46d5d7cd Fix some lint 2020-01-20 17:15:27 -05:00
Ryan Petrello
8bd9233d2c
remove some unnecessary callback receiver debugging code 2020-01-14 14:21:53 -05:00
Ryan Petrello
306f504fb7
optimize the callback receiver to buffer writes on high throughput
additionally, optimize away several per-event host lookups and
changed/failed propagation lookups

we've always performed these (fairly expensive) queries *on every event
save* - if you're processing tens of thousands of events in short
bursts, this is way too slow

this commit also introduces a new command for profiling the insertion
rate of events, `awx-manage callback_stats`

see: https://github.com/ansible/awx/issues/5514
2020-01-14 12:04:26 -05:00
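
The buffering described in this commit can be sketched roughly as below: events accumulate in memory and are written as a batch once the buffer is large enough or old enough. The class name `EventBuffer`, its parameters, and the thresholds are assumptions for illustration. In a Django application the `flush` callable might wrap something like a bulk insert, but that wiring is not shown and is not taken from the AWX source.

```python
import time


class EventBuffer:
    """Collect events and hand them to `flush` in batches.

    Hypothetical sketch: names and default thresholds are illustrative.
    """

    def __init__(self, flush, max_size=1000, max_age=1.0, clock=time.monotonic):
        self._flush = flush          # callable that persists a list of events
        self._max_size = max_size    # flush once this many events are pending
        self._max_age = max_age      # ...or once this many seconds have passed
        self._clock = clock
        self._pending = []
        self._last_flush = clock()

    def add(self, event):
        self._pending.append(event)
        too_big = len(self._pending) >= self._max_size
        too_old = self._clock() - self._last_flush >= self._max_age
        if too_big or too_old:
            self.flush()

    def flush(self):
        if self._pending:
            # Swap the list out first so a failing write cannot be
            # re-sent together with newer events by accident.
            batch, self._pending = self._pending, []
            self._flush(batch)
        self._last_flush = self._clock()
```

The point of the design is that one write per batch replaces one write (plus lookups) per event, which matters when tens of thousands of events arrive in short bursts.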
Ryan Petrello
3094b67664
work around a bug in the k8s client that leaves trash in /tmp 2019-10-29 11:24:17 -04:00
Ryan Petrello
d01088d33e
Revert "add support for awx-manage run_callback_receiver --status" 2019-10-18 09:49:02 -04:00
Ryan Petrello
ffb1707e74
add support for awx-manage run_callback_receiver --status 2019-10-17 11:10:27 -04:00
Ryan Petrello
17a803f49c
remove the old callback plugin import paths and callback-specific tests 2019-04-12 16:11:23 -04:00
Ryan Petrello
32ee9838af
use the correct logger for the callback receiver
the callback receiver and dispatcher share several modules, so add logic
to use the correct logger
2019-03-15 08:09:47 -04:00
Ryan Petrello
daeeaf413a
clean up unnecessary usage of the six library (awx only supports py3) 2019-01-25 00:19:48 -05:00
Ryan Petrello
4707dc2a05
clean up some unnecessary dispatcher reaping code 2019-01-24 11:11:05 -05:00
Ryan Petrello
f223df303f
convert py2 -> py3 2019-01-15 14:09:01 -05:00
Ryan Petrello
5950f26c69
only allow the task dispatch worker to import and run decorated tasks
this _technically_ prevents a remote code exploit where a user who has
access to publish AMQP messages to the dispatch queue could craft
a special message that would import and run arbitrary Python functions;
that said, the types of user with this privilege level are generally
_already_ the awx user (so they can already do this by hand if they
want)
2018-12-12 17:46:41 -05:00
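
The restriction described in this commit, running only functions that were explicitly decorated as tasks, can be illustrated with a small registry. This is a sketch under stated assumptions, not the AWX implementation: `task`, `run_message`, `_REGISTRY`, and the message shape (`task`, `args`, `kwargs` keys) are all hypothetical names chosen for the example.

```python
_REGISTRY = {}


def task(fn):
    """Mark `fn` as runnable by the worker (hypothetical decorator)."""
    _REGISTRY[f"{fn.__module__}.{fn.__qualname__}"] = fn
    return fn


def run_message(message):
    """Run the task named in `message`, refusing anything undecorated."""
    name = message["task"]
    try:
        # Look the name up in the registry instead of importing whatever
        # dotted path the message supplies.
        fn = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"{name} is not a registered task") from None
    return fn(*message.get("args", []), **message.get("kwargs", {}))


@task
def add(a, b):
    return a + b
```

With this shape, a crafted message naming an arbitrary function such as `os.system` is rejected with `ValueError` because it was never registered.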
Ryan Petrello
0391dbc292
add additional DB retry logic to the callback receiver
initially, I implemented this for _only_ the task worker, but it's
probably needed for callback event workers, too
2018-11-29 11:57:46 -05:00
AlanCoding
482395eb6a
reduce default verbosity of devel-specific callback logging 2018-10-26 10:03:46 -04:00
Ryan Petrello
0d29bbfdc6
make the dispatcher more fault-tolerant to prolonged database outages 2018-10-18 20:00:07 -04:00
Ryan Petrello
53ae05094e
use the proper logger for the callback receiver 2018-10-17 10:56:29 -04:00
Ryan Petrello
ff1e8cc356
replace celery task decorators with a kombu-based publisher
this commit implements the bulk of `awx-manage run_dispatcher`, a new
command that binds to RabbitMQ via kombu and balances messages across
a pool of workers that are similar to celeryd workers in spirit.
Specifically, this includes:

- a new decorator, `awx.main.dispatch.task`, which can be used to
  decorate functions or classes so that they can be designated as
  "Tasks"
- support for fanout/broadcast tasks (at this point in time, only
  `conf.Setting` memcached flushes use this functionality)
- support for job reaping
- support for success/failure hooks for job runs (i.e.,
  `handle_work_success` and `handle_work_error`)
- support for an auto-scaling worker pool that scales processes up and
  down on demand
- minimal support for RPC, such as status checks and pool recycle/reload
2018-10-11 10:53:30 -04:00