Commit Graph

30 Commits

Author SHA1 Message Date
Alan Rominger
998000bfbe Surface correct error from bulk_create on unrecoverable error 2022-08-10 16:16:57 -04:00
Alan Rominger
43a50cc62c Fix event counting in error handling path 2022-08-10 16:16:57 -04:00
Alan Rominger
30f556f845 Further resiliency changes focused on offline database
Make logs from database outage more manageable

Raise exception if update_model never recovers from problem
2022-08-10 16:16:57 -04:00
Rebeccah
5f9326b131 added average event processing metric (in seconds) that can be served to
grafana via prometheus.

This metric is a good indicator of how far behind the callback receiver
is. The higher the load the further behind/the greater the number of
seconds the metric will display.

This number being high may indicate the need for horizontal scaling in
the control plane or vertically scaling the number of callback
receivers.
2022-06-06 15:14:56 -04:00
Alan Rominger
1e6ca01686 Fix the callback receiver --status command 2022-05-19 15:00:49 -04:00
Alan Rominger
29d60844a8 Fix notification timing issue by sending in the latter of 2 events (#12110)
* Track host_status_counts and use that to process notifications

* Remove now unused setting

* Back out changes to callback class not needed after all

* Skirt the need for duck typing by leaning on the cached field

* Delete tests for deleted task

* Revert "Back out changes to callback class not needed after all"

This reverts commit 3b8ae350d218991d42bffd65ce4baac6f41926b2.

* Directly hardcode stats_event_type for callback class

* Fire notifications if stats event was never sent

* Remove test content for deleted methods

* Add placeholder for when no hosts matched

* Make field default be None, denote events processed with empty dict

* Make UI process null value for host_status_counts

* Fix tracking of EOF dispatch for system jobs

* Reorganize EVENT_MAP into class properties

* Consolidate conditional I missed from EVENT_MAP refactor

* Give up on the null condition, also applies for empty hosts

* Remove cls position argument not being used

* Move wrapup method out of class, add tests
2022-04-29 13:54:31 -04:00
Jeff Bradberry
1803c5bdb4 Fix up usage of django-guid
It has replaced the class-based middleware, everything is
function-based now.
2022-03-14 13:19:57 -04:00
Seth Foster
6db7cea148 variable name changes 2022-02-10 10:57:00 -05:00
Seth Foster
3993aa9524 Add metric for number of events emitted over websocket broadcast 2022-02-09 21:57:01 -05:00
Amol Gautam
a4a3ba65d7 Refactored tasks.py to a package
--- Added 3 new sub-package : awx.main.tasks.system , awx.main.tasks.jobs , awx.main.tasks.receptor
--- Modified the functional tests and unit tests accordingly
2022-01-14 11:55:41 -05:00
Alan Rominger
210d5084f0 Move skip flag up from event_data and pop it off 2021-06-08 13:33:54 -04:00
Alan Rominger
b551608f16 Move websocket skip logic into event_handler 2021-06-08 13:33:22 -04:00
Seth Foster
0c569c67fd Add subsystem metrics
- Adds a Metrics() class that can track data such as number of
events the callback receiver inserted into database
- Exposes this metric data at the api/v2/metrics/ endpoint.
This data is prometheus-friendly
- Metric data is stored in memory, then periodically saved to Redis.
- Metric data is periodically broadcast to other nodes in the cluster,
so that each node has a copy of the most recent metric data collected.
2021-03-25 15:23:52 -04:00
Ryan Petrello
c2ef0a6500 move code linting to a stricter pep8-esque auto-formatting tool, black 2021-03-23 09:39:58 -04:00
Ryan Petrello
3cc3cf1f80 add a per-request GUID and log as it travels through background services
see: https://github.com/ansible/awx/issues/9329
2021-02-17 12:54:13 -05:00
Ryan Petrello
b744c4ebb7 further optimize callback receiver buffering for certain situations
see: https://github.com/ansible/awx/issues/9085
2021-01-14 17:17:12 -05:00
Chris Meyers
eb47c8dbc6 centralize reusable profiling code 2020-10-27 08:21:41 -04:00
Ryan Petrello
baad765179 refactor some callback receiver code
the bigint migration removed the foreign key constraints for:

- host_id
- job_id (and projectupdate_id, etc...)

because of this, we don't really need to check explicitly for a host_id
IntegrityError anymore (because it won't occur)

additionally, while it's possible to insert an event with a mismatched
job_id now (for example, you can totally start a long-running job, and
delete the job record in the background using the ORM or psql), doing
so results in DoesNotExist errors in the code that handles the
playbook_on_stats events
2020-09-25 13:12:42 -04:00
Ryan Petrello
cd0b9de7b9 remove multiprocessing.Queue usage from the callback receiver
instead, just have each worker connect directly to redis
this has a few benefits:

- it's simpler to explain and debug
- back pressure on the queue keeps messages around in redis (which is
  observable, and survives the restart of Python processes)
- it's likely notably more performant at high loads
2020-09-24 13:53:58 -04:00
Christian Adams
a899a147e1 Fix new flake8 from pyflakes 2.2.0 release 2020-04-20 09:50:50 -04:00
Ryan Petrello
d40a5dec8f change when we send job notifications to avoid a race condition
success/failure notifications for *playbooks* include summary data about
the hosts in based on the contents of the playbook_on_stats event

the current implementation suffers from a number of race conditions that
sometimes can cause that data to be missing or incomplete; this change
makes it so that for *playbooks* we build (and send) the notification in
response to the playbook_on_stats event, not the EOF event
2020-03-19 10:01:52 -04:00
Ryan Petrello
78b00652bd add the ability to enable profiling for the callback receiver workers 2020-01-27 12:03:53 -05:00
Bill Nottingham
4e46d5d7cd Fix some lint 2020-01-20 17:15:27 -05:00
Ryan Petrello
8bd9233d2c remove some unnecessary callback receiver debugging code 2020-01-14 14:21:53 -05:00
Ryan Petrello
306f504fb7 optimize the callback receiver to buffer writes on high throughput
additionaly, optimize away several per-event host lookups and
changed/failed propagation lookups

we've always performed these (fairly expensive) queries *on every event
save* - if you're processing tens of thousands of events in short
bursts, this is way too slow

this commit also introduces a new command for profiling the insertion
rate of events, `awx-manage callback_stats`

see: https://github.com/ansible/awx/issues/5514
2020-01-14 12:04:26 -05:00
Ryan Petrello
17a803f49c remove the old callback plugin import paths and callback-specific tests 2019-04-12 16:11:23 -04:00
Ryan Petrello
0391dbc292 add additional DB retry logic to the callback receiver
initially, I implemented this for _only_ the task worker, but it's
probably needed for callback event workers, too
2018-11-29 11:57:46 -05:00
AlanCoding
482395eb6a reduce default verbosity of devel-specific callback logging 2018-10-26 10:03:46 -04:00
Ryan Petrello
53ae05094e use the proper logger for the callback receiver 2018-10-17 10:56:29 -04:00
Ryan Petrello
ff1e8cc356 replace celery task decorators with a kombu-based publisher
this commit implements the bulk of `awx-manage run_dispatcher`, a new
command that binds to RabbitMQ via kombu and balances messages across
a pool of workers that are similar to celeryd workers in spirit.
Specifically, this includes:

- a new decorator, `awx.main.dispatch.task`, which can be used to
  decorate functions or classes so that they can be designated as
  "Tasks"
- support for fanout/broadcast tasks (at this point in time, only
  `conf.Setting` memcached flushes use this functionality)
- support for job reaping
- support for success/failure hooks for job runs (i.e.,
  `handle_work_success` and `handle_work_error`)
- support for auto scaling worker pool that scale processes up and down
  on demand
- minimal support for RPC, such as status checks and pool recycle/reload
2018-10-11 10:53:30 -04:00