If a job fails, we fetch the receptor work results and put that output
into result_traceback.
We should only do this if:
1. The receptor unit has failed
2. The runner callback processed 0 events
Otherwise we risk putting too much data into this field.
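A minimal sketch of that guard, using hypothetical helper and field names
rather than the actual AWX dispatcher/receptor code:

```python
# Hedged sketch: hypothetical names, not the actual AWX receptor integration.
def maybe_capture_work_results(job, unit_failed, events_processed, fetch_work_results):
    """Copy receptor work-result output into result_traceback only when the
    receptor work unit failed AND the runner callback processed zero events."""
    if unit_failed and events_processed == 0:
        # Otherwise we risk dumping arbitrarily large output into this field.
        job.result_traceback = fetch_work_results()
        job.save(update_fields=['result_traceback'])
```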
This fixes several things related to our wsbroadcast stats handling.
This was found during the ongoing wsrelay work.
There are really three fixes here:
- Logging was not actually enabled for the analytics.broadcast_websocket
module, so that has been added to our loggers config.
- analytics.broadcast_websocket was not actually able to connect to
Redis due to 68614b83c0 as part of
the work in #13187. But there was no easy way to know this because the
logging issue meant no exceptions showed up anywhere reasonable.
- Relatedly, and also as part of #13187, we jumped from
`prometheus-client` 0.7.1 up to 0.15.0. This included a breaking
change where a `Counter` ending with `_total` will clash with a
`Gauge` of the same name but without `_total` (sketched below). I am
not 100% sure of the reasoning here, other than "OpenMetrics
compatibility".
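A minimal reproduction of that clash, with made-up metric names (not the
actual wsbroadcast metrics):

```python
# Hedged sketch of the prometheus-client name clash, with hypothetical metric names.
from prometheus_client import CollectorRegistry, Counter, Gauge

registry = CollectorRegistry()

# The `_total` suffix is treated as part of the sample name only, so this
# counter is tracked in the registry under the base name 'demo_messages'.
Counter('demo_messages_total', 'messages sent', registry=registry)

# Registering a Gauge under that base name now trips the registry's
# duplicate-name check and raises a "Duplicated timeseries" ValueError.
Gauge('demo_messages', 'messages in flight', registry=registry)
```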
Refs #13301
Refs #13187
Signed-off-by: Rick Elrod <rick@elrod.me>
This will allow users of the operator to set these settings so that,
from the start, when the operator creates the default execution queue,
they can control the max_forks and max_concurrent_jobs on the default
container group.
The intention of this feature is primarily to provide some notion of max
capacity of container groups, but I've left the logic generic. Default
is 0, which will be interpreted as no maximum number of jobs or forks.
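A hedged sketch of the "0 means unlimited" semantics; the names here are
illustrative, not the actual task manager code:

```python
# Hedged sketch: hypothetical function and field names, not the AWX implementation.
from collections import namedtuple

Group = namedtuple('Group', ['max_concurrent_jobs', 'max_forks'])

def group_has_capacity(group, running_jobs, running_forks, task_forks):
    """Return True if `group` can take another task; a limit of 0 means no limit."""
    if group.max_concurrent_jobs and running_jobs + 1 > group.max_concurrent_jobs:
        return False
    if group.max_forks and running_forks + task_forks > group.max_forks:
        return False
    return True

# With the defaults (0, 0), nothing is rejected on these limits.
assert group_has_capacity(Group(0, 0), running_jobs=50, running_forks=500, task_forks=10)
assert not group_has_capacity(Group(2, 0), running_jobs=2, running_forks=5, task_forks=1)
```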
Includes refactor of variable and method names for clarity.
instances_by_hostname is an internal attribute of TaskManagerInstances.
Clarify when we are expecting the actual TaskManagerInstances object.
Unify how we process running tasks and consume capacity. This has the
effect that we do less expensive work in after_lock_init and have one
fewer loop over all the running tasks. Previously we looped both for
building the dependency graph and for calculating the starting capacity
of all the instances and instance groups. Now we achieve both tasks in
the same loop.
Because of how this changes the somewhat subtle "do-si-do" of how to
initialize the Task Manager models, introduce a wrapper class that tries
to take some of that burden off of other areas where we re-use this,
like in the serializer and the metrics. Also use this wrapper class to
handle the niceties of how to track capacity consumption on instances
and instance groups.
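A heavily simplified, hedged sketch of the single-pass idea; the names
below are hypothetical stand-ins, not the real TaskManagerModels /
TaskManagerInstances classes:

```python
# Hedged sketch: hypothetical names and structures, not the actual AWX classes.
from collections import defaultdict

class CapacityTracker:
    """Tracks remaining capacity per instance and per instance group."""

    def __init__(self, instance_capacity, group_capacity):
        self.instance_remaining = dict(instance_capacity)
        self.group_remaining = dict(group_capacity)

    def consume(self, task):
        self.instance_remaining[task['execution_node']] -= task['impact']
        self.group_remaining[task['instance_group']] -= task['impact']

def process_running_tasks(running_tasks, tracker):
    """One loop does what previously took two: build the dependency graph and
    subtract each running task's impact from its instance and instance group."""
    dependency_graph = defaultdict(list)
    for task in running_tasks:
        dependency_graph[task['instance_group']].append(task['id'])
        tracker.consume(task)
    return dependency_graph
```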
Add tests for max_forks and max_concurrent_jobs
Fixup tests that use TaskManagerModels to accommodate the changes.
Assign the instance group (ig) before the call to consume capacity.
If we don't do it in that order, then we don't correctly account for
the container group jobs we are starting in the middle of the task
manager run.
Since the original version of the migration a) invoked the .save()
method, and b) involved a model with a custom field that had a
post_save handler attached, this migration had a side-effect that
caused the codebase's version of the model to be used when the table
involved wasn't yet up to date. This triggers an UndefinedColumn error.
This change works around the problem by making use of queryset
.update() methods instead, which should avoid the post_save signal
trigger.
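A hedged sketch of the pattern, with hypothetical app, model, and field
names; queryset .update() issues a single SQL UPDATE and does not call
Model.save(), so the post_save handlers never fire:

```python
# Hedged sketch: hypothetical app, model, and field names.
from django.db import migrations

def forwards(apps, schema_editor):
    SomeModel = apps.get_model('main', 'SomeModel')  # historical model state
    # Instead of iterating and calling obj.save(), which fires post_save
    # handlers that may reference columns the table doesn't have yet,
    # update in bulk; .update() sends no pre_save/post_save signals.
    SomeModel.objects.filter(new_field__isnull=True).update(new_field='')

class Migration(migrations.Migration):
    dependencies = [('main', '0001_initial')]  # placeholder dependency
    operations = [migrations.RunPython(forwards, migrations.RunPython.noop)]
```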
aioredis was superseded by redis.
Someone referenced it directly but didn't add it to requirements.in, so
when we upgraded channels-redis and it dropped aioredis, this started
failing.
* Facts scaling fixes for large inventory, timing issue
Move the save of Ansible facts to before the job status changes;
this is considered an acceptable delay with the other
performance fixes here.
Remove a completely unrelated, unused facts method.
Scale-related changes to facts saving (sketched below):
- Use .iterator() on the queryset when looping
- Change save to bulk_update
- Apply bulk_update in batches of 100, to reduce memory
- Only save a single file modtime, avoiding a large dict
- Use a decorator for long function time logging
- Update the decorator to fill in the format statement
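A hedged sketch of the batching pattern, using hypothetical model and
field names rather than the actual facts-saving code:

```python
# Hedged sketch: hypothetical model and field names, not the real AWX code.
BATCH_SIZE = 100  # flush in small batches to bound memory use

def save_facts(host_model, host_queryset, facts_by_hostname):
    """Write gathered facts back to hosts with bulk_update, in batches."""
    batch = []
    for host in host_queryset.iterator():  # stream rows instead of loading them all
        facts = facts_by_hostname.get(host.name)
        if facts is None:
            continue
        host.ansible_facts = facts
        batch.append(host)
        if len(batch) >= BATCH_SIZE:
            host_model.objects.bulk_update(batch, ['ansible_facts'])
            batch = []
    if batch:
        host_model.objects.bulk_update(batch, ['ansible_facts'])
```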
* Fixes #13119 #13120 Cloud support & update brand
* Remove base64 import to pass lint
* Update references across the board
* Removed final reference to CyberArk Conjur Secret Lookup
Previously, in some cases, an InventoryUpdate sourced by an SCM project
would still run and be successful even after the project it is sourced
from failed to update. This would happen because the InventoryUpdate
would revert the project back to its last working revision. This
behavior is confusing and inconsistent with how we handle jobs (which
just refuse to launch when the project is failed).
This change pulls out the logic that the job launch serializer and
RunJob#pre_run_hook had implemented (independently) to check if the
project is in a failed state, and puts it into a method on the Project
model. This is then checked in the project launch serializer as well as
the inventory update serializer, along with
SourceControlMixin#sync_and_copy as a fallback for things that don't run
the serializer validation (such as scheduled jobs and WFJT jobs).
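A heavily simplified, hedged sketch of the shape of that refactor; the
method name and status values below are illustrative, not the actual
AWX API:

```python
# Hedged sketch: hypothetical names and simplified types, not the real models.
class Project:
    def __init__(self, status, scm_revision):
        self.status = status
        self.scm_revision = scm_revision

    @property
    def update_blocked(self):
        # Single home for the check that the job launch serializer and
        # RunJob#pre_run_hook used to implement independently.
        return self.status in ('error', 'failed', 'never updated') or not self.scm_revision

def validate_source_project(project):
    """Shared by serializer validation and the sync_and_copy fallback."""
    if project is None or project.update_blocked:
        raise ValueError('Project is in a failed state; refusing to start.')
```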
Signed-off-by: Rick Elrod <rick@elrod.me>
This takes some logic out of the queryset, using an established
assumption about the task manager: if a job lands on a hybrid node (or
is a project update), then it will have the same controller and
execution node. With that established, the queryset can be simplified.