This fixes https://github.com/ansible/awx/issues/14245 which has
more information about this issue.
This change addresses both:
- A clashing signal handler (registering a callback to fire when
the task manager times out, and hitting that callback in cases
where we didn't expect to). Make dispatcher timeout use
SIGUSR1, not SIGTERM.
- Metrics not being reported should not make us crash, so that is
now fixed as well.
Signed-off-by: Rick Elrod <rick@elrod.me>
Co-authored-by: Alan Rominger <arominge@redhat.com>
This adds a handful of metrics to /api/v2/metrics/ recorded from the dispatcher main process
Adds logic in the dispatcher period tasks to calculate these for the last collection interval
Reports worker count, task count, scale up events, and availability
Add data to demo grafana dashboard
Avoid running jobs that have already been reapted
Co-authored-by: Elijah DeLee <kdelee@redhat.com>
Remove unnecessary extra actions
Fix waiting jobs in other cases of reaping
add settings to define task manager timeout and grace period
This gives us still TASK_MANAGER_TIMEOUT_GRACE_PERIOD amount of time to
get out of the task manager.
Also, apply start task limit in WorkflowManager to starting pending
workflows
Right now, without this, we end up with a different number for max_workers than max_forks. For example, on a control node with 16 Gi of RAM,
max_mem_capacity w/ 100 MB/fork = (16*1024)/100 --> 164
max_workers = 5 * 16 --> 80
This means we would allow that control node to control up to 164 jobs, but all jobs after the 80th job will be stuck in `waiting` waiting for a dispatch worker to free up to run the job.
fix memory and cpu settings to suport k8s resource request format
* fix conversion of memory setting to bytes
This setting has not been getting set by default, and needed some fixing
up to be compatible with setting the memory in the same way as we set it
in the operator, as well as with other changes from last year which
assume that ansible runner is returning memory in bytes.
This way we can start setting this setting in the operator, and get a
more accurate reflection of how much memory is available to the control
pod in k8s.
On platforms where services are all sharing memory, we deduct a
penalty from the memory available. On k8s we don't need to do this
because the web, redis, and task containers each have memory
allocated to them.
* Support CPU setting expressed in units used by k8s
This setting has not been getting set by default, and needed some fixing
up to be compatible with setting the CPU resource request/limits in the
same way as we set it in the resource requests/limits.
This way we can start setting this setting in the
operator, and get a more accurate reflection of how much cpu is
available to the control pod in k8s.
Because cpu on k8s can be partial cores, migrate cpu field to decimal.
k8s does not allow granularity of less than 100m (equivalent to 0.1 cores), so only
store up to 1 decimal place.
fix analytics to deal with decimal cpu
need to use DjangoJSONEncoder when Decimal fields in data passed to
json.dumps
make the --status flag work by fetching a periodically recorded snapshot
of internal process state; additionally, update the callback receiver to
*also* record these statistics so we can gain more insight into any
performance issues
additionaly, optimize away several per-event host lookups and
changed/failed propagation lookups
we've always performed these (fairly expensive) queries *on every event
save* - if you're processing tens of thousands of events in short
bursts, this is way too slow
this commit also introduces a new command for profiling the insertion
rate of events, `awx-manage callback_stats`
see: https://github.com/ansible/awx/issues/5514
when `--reload` is sent to the dispatcher, it sends a special QUIT
message to each worker in the pool so that it will exit gracefully at
the next opportunity
when a worker process exits unexpectedly, the dispatcher attempts to
recover its queued messages and sends them to another worker in the
pool; in this scenario, we should _never_ re-enqueue these special
QUIT messages (because the process doesn't need to quit, it's already
gone)
To reproduce this race condition:
1. Launch an adhoc that does `sleep 60`
2. Run `awx-manage run_dispatcher --reload` to enqueue a `QUIT` message
into the worker's queue
3. Find the pid of the worker running the `sleep 60` and `SIGKILL` it.
4. Observe that dispatcher attempts to requeue the `QUIT` message and
logs a confusing error.
this commit implements the bulk of `awx-manage run_dispatcher`, a new
command that binds to RabbitMQ via kombu and balances messages across
a pool of workers that are similar to celeryd workers in spirit.
Specifically, this includes:
- a new decorator, `awx.main.dispatch.task`, which can be used to
decorate functions or classes so that they can be designated as
"Tasks"
- support for fanout/broadcast tasks (at this point in time, only
`conf.Setting` memcached flushes use this functionality)
- support for job reaping
- support for success/failure hooks for job runs (i.e.,
`handle_work_success` and `handle_work_error`)
- support for auto scaling worker pool that scale processes up and down
on demand
- minimal support for RPC, such as status checks and pool recycle/reload