Commit Graph

45 Commits

Author SHA1 Message Date
Alan Rominger
f377b5fdde Use runtime log utility moved to DAB (#15675)
* Use runtime log utility moved to DAB
2024-12-11 10:38:24 -05:00
Rick Elrod
48edb15a03 Prevent Dispatcher deadlock when Redis disappears (#14249)
This fixes https://github.com/ansible/awx/issues/14245 which has
more information about this issue.

This change addresses both:
- A clashing signal handler (registering a callback to fire when
  the task manager times out, and hitting that callback in cases
  where we didn't expect to). Make dispatcher timeout use
  SIGUSR1, not SIGTERM.
- Metrics not being reported should not make us crash, so that is
  now fixed as well.

Signed-off-by: Rick Elrod <rick@elrod.me>
Co-authored-by: Alan Rominger <arominge@redhat.com>
2023-07-18 10:43:46 -05:00
Alan Rominger
ef99770383 Add subsystem metrics for the dispatcher (#13989)
This adds a handful of metrics to /api/v2/metrics/ recorded from the dispatcher main process

Adds logic in the dispatcher period tasks to calculate these for the last collection interval
Reports worker count, task count, scale up events, and availability

Add data to demo grafana dashboard
2023-05-17 14:29:31 -04:00
Alan Rominger
f5785976be Update to comply with new black rules 2023-02-01 14:59:38 -05:00
Jeff Bradberry
e029cf7196 Remove the django-qsstats-magic dependency 2022-11-10 15:37:44 -05:00
Alan Rominger
cba780a8f8 Fix dispatcher connection deadlock w scheduler and cleanup 2022-10-19 12:12:15 -04:00
Alan Rominger
974f845059 Revert "Merge pull request #12584 from AlanCoding/lazy_workers"
This reverts commit 64157f7207, reversing
changes made to 9e8ba6ca09.
2022-08-28 23:04:13 -04:00
Alan Rominger
e0c59d12c1 Change data structure so we can conditionally reap waiting jobs 2022-08-17 16:00:30 -04:00
Alan Rominger
6719010050 Add back in cleanup call 2022-08-17 15:42:48 -04:00
Alan Rominger
ccd46a1c0f Move reaper logic into worker, avoiding bottlenecks 2022-08-17 15:42:47 -04:00
Alan Rominger
621833ef0e Add extra workers if computing based on memory
Co-authored-by: Elijah DeLee <kdelee@redhat.com>
2022-08-17 11:41:59 -04:00
Shane McDonald
3c51cb130f Add grace period settings for task manager timeout, and pod / job waiting reapers
Co-authored-by: Alan Rominger <arominge@redhat.com>
2022-08-17 11:39:01 -04:00
Shane McDonald
c649809eb2 Remove debug method that calls cleanup
- It's unclear why this was here.
- Removing it doesnt appear to cause any problems.
- It still gets called during heartbeats.
2022-08-17 11:35:43 -04:00
Alan Rominger
a3fef27002 Add logs to debug waiting bottlenecking 2022-08-17 11:33:49 -04:00
Alan Rominger
278db2cdde Split reaper for running and waiting jobs
Avoid running jobs that have already been reapted

Co-authored-by: Elijah DeLee <kdelee@redhat.com>

Remove unnecessary extra actions

Fix waiting jobs in other cases of reaping
2022-08-17 10:53:29 -04:00
Alan Rominger
c5985c4c81 Change lazy worker method name and adjust log 2022-08-10 16:12:03 -04:00
Alan Rominger
a9170236e1 Wait 60 seconds before scaling down a worker 2022-08-10 16:12:03 -04:00
Alan Rominger
f7e6a32444 Optimize task manager with debug toolbar, adjust prefetch (#12588) 2022-08-10 10:05:13 -04:00
Seth Foster
b3eb9e0193 pid kill each of the 3 task managers on timeout 2022-08-05 14:33:25 -04:00
Elijah DeLee
7eb0c7dd28 exit task manager loops early if we are timed out
add settings to define task manager timeout and grace period

This gives us still TASK_MANAGER_TIMEOUT_GRACE_PERIOD amount of time to
get out of the task manager.

Also, apply start task limit in WorkflowManager to starting pending
workflows
2022-08-05 14:33:24 -04:00
Jeff Bradberry
1803c5bdb4 Fix up usage of django-guid
It has replaced the class-based middleware, everything is
function-based now.
2022-03-14 13:19:57 -04:00
Jeff Bradberry
b852baaa39 Fix up logger .warn() calls to use .warning() instead
This is a usage that was deprecated in Python 3.0.
2022-03-07 18:11:36 -05:00
Elijah DeLee
4bd6c2a804 set max dispatch workers to same as max forks
Right now, without this, we end up with a different number for max_workers than max_forks. For example, on a control node with 16 Gi of RAM,
  max_mem_capacity  w/ 100 MB/fork = (16*1024)/100 --> 164
  max_workers = 5 * 16 --> 80

This means we would allow that control node to control up to 164 jobs, but all jobs after the 80th job will be stuck in `waiting` waiting for a dispatch worker to free up to run the job.
2022-02-24 10:53:54 -05:00
Elijah DeLee
799968460d Fixup conversion of memory and cpu settings to support k8s resource request format (#11725)
fix memory and cpu settings to suport k8s resource request format

* fix conversion of memory setting to bytes

This setting has not been getting set by default, and needed some fixing
up to be compatible with setting the memory in the same way as we set it
in the operator, as well as with other changes from last year which
assume that ansible runner is returning memory in bytes.

This way we can start setting this setting in the operator, and get a
more accurate reflection of how much memory is available to the control
pod in k8s.

On platforms where services are all sharing memory, we deduct a
penalty from the memory available. On k8s we don't need to do this
because the web, redis, and task containers each have memory
allocated to them.

* Support CPU setting expressed in units used by k8s

This setting has not been getting set by default, and needed some fixing
up to be compatible with setting the CPU resource request/limits in the
same way as we set it in the resource requests/limits.

This way we can start setting this setting in the
operator, and get a more accurate reflection of how much cpu is
available to the control pod in k8s.

Because cpu on k8s can be partial cores, migrate cpu field to decimal.

k8s does not allow granularity of less than 100m (equivalent to 0.1 cores), so only
store up to 1 decimal place.

fix analytics to deal with decimal cpu

need to use DjangoJSONEncoder when Decimal fields in data passed to
json.dumps
2022-02-15 14:08:24 -05:00
Alan Rominger
b721a4b361 Remove dev-only log filters and downgrade periodic logs 2021-12-07 14:35:02 -05:00
Ryan Petrello
c2ef0a6500 move code linting to a stricter pep8-esque auto-formatting tool, black 2021-03-23 09:39:58 -04:00
Ryan Petrello
3cc3cf1f80 add a per-request GUID and log as it travels through background services
see: https://github.com/ansible/awx/issues/9329
2021-02-17 12:54:13 -05:00
Ryan Petrello
57f8e48894 make --status more robust for dispatcher, and add support for receiver
make the --status flag work by fetching a periodically recorded snapshot
of internal process state; additionally, update the callback receiver to
*also* record these statistics so we can gain more insight into any
performance issues
2020-09-17 15:33:37 -04:00
Ryan Petrello
0df6409244 remove task state tracking from the callback receiver
we don't have support for displaying these stats anyways, so there's
no point in using resources tracking them, especially for high-volume
installs
2020-09-16 13:40:42 -04:00
Rebeccah
118e1b8df1 removing memchache mentions in comments
remove memcached folder as it is no longer needed, also address a couple grammatical errors
2020-06-18 15:52:59 -04:00
Ryan Petrello
8b1806d4ca add code for detecting (and killing) a hung task manager task 2020-02-26 07:53:04 -05:00
Ryan Petrello
8f33f1a6c2 remove another expensive logging lookup in the parent callback process 2020-01-24 16:46:32 -05:00
Ryan Petrello
306f504fb7 optimize the callback receiver to buffer writes on high throughput
additionaly, optimize away several per-event host lookups and
changed/failed propagation lookups

we've always performed these (fairly expensive) queries *on every event
save* - if you're processing tens of thousands of events in short
bursts, this is way too slow

this commit also introduces a new command for profiling the insertion
rate of events, `awx-manage callback_stats`

see: https://github.com/ansible/awx/issues/5514
2020-01-14 12:04:26 -05:00
AlanCoding
eec08fdcca Log case of duplicate UUIDs 2020-01-09 07:31:32 -05:00
Ryan Petrello
83550eeba0 make the callback receiver more robust to duplicate UUIDs from ansible 2019-11-01 09:24:52 -04:00
Ryan Petrello
32ee9838af use the correct logger for the callback receiver
the callback receiver and dispatcher share several modules, so add logic
to use the correct logger
2019-03-15 08:09:47 -04:00
Ryan Petrello
4707dc2a05 clean up some unnecessary dispatcher reaping code 2019-01-24 11:11:05 -05:00
Ryan Petrello
b2442d42a3 detect dead DB connections in the dispatcher when reaping jobs 2019-01-22 08:40:26 -05:00
Ryan Petrello
f223df303f convert py2 -> py3 2019-01-15 14:09:01 -05:00
Matthew Jones
7330102961 Remove a warning message for dispatcher pool for tests 2018-11-19 11:19:57 -05:00
Ryan Petrello
37234ca66e prevent the dispatcher from using a nonsensical max_workers value 2018-11-16 10:16:39 -05:00
Ryan Petrello
0d29bbfdc6 make the dispatcher more fault-tolerant to prolonged database outages 2018-10-18 20:00:07 -04:00
Ryan Petrello
720a634702 don't attempt to recover special QUIT messages in the worker pool
when `--reload` is sent to the dispatcher, it sends a special QUIT
message to each worker in the pool so that it will exit gracefully at
the next opportunity

when a worker process exits unexpectedly, the dispatcher attempts to
recover its queued messages and sends them to another worker in the
pool; in this scenario, we should _never_ re-enqueue these special
QUIT messages (because the process doesn't need to quit, it's already
gone)

To reproduce this race condition:

1.  Launch an adhoc that does `sleep 60`
2.  Run `awx-manage run_dispatcher --reload` to enqueue a `QUIT` message
    into the worker's queue
3.  Find the pid of the worker running the `sleep 60` and `SIGKILL` it.
4.  Observe that dispatcher attempts to requeue the `QUIT` message and
    logs a confusing error.
2018-10-15 12:17:52 -04:00
Ryan Petrello
ff1e8cc356 replace celery task decorators with a kombu-based publisher
this commit implements the bulk of `awx-manage run_dispatcher`, a new
command that binds to RabbitMQ via kombu and balances messages across
a pool of workers that are similar to celeryd workers in spirit.
Specifically, this includes:

- a new decorator, `awx.main.dispatch.task`, which can be used to
  decorate functions or classes so that they can be designated as
  "Tasks"
- support for fanout/broadcast tasks (at this point in time, only
  `conf.Setting` memcached flushes use this functionality)
- support for job reaping
- support for success/failure hooks for job runs (i.e.,
  `handle_work_success` and `handle_work_error`)
- support for auto scaling worker pool that scale processes up and down
  on demand
- minimal support for RPC, such as status checks and pool recycle/reload
2018-10-11 10:53:30 -04:00
Ryan Petrello
da74f1d01f refactor and test the callback receiver as a base for a task dispatcher 2018-10-11 10:53:26 -04:00