* Remove committed_capacity field, delete supporting code
* Track consumed capacity to solve the negatives problem
* Use more verbose name for IG queryset
* Move static methods used by task manager
These static methods were being used to act on Instance-like objects
that were SimpleNamespace objects with the necessary attributes.
This change introduces dedicated classes to replace the SimpleNamespace
objects and moves the former static methods to a place where they are
more relevant, instead of leaving them tacked onto models to which they
were only loosely related.
Accept in-memory data structure in init methods for tests
* Initialize remaining capacity AFTER we build the map of instances
By using .only we select fewer columns, avoiding potentially large
fields that we never reference.
Also, a small tweak to eliminate what was a duplicate dictionary of
hostname:instance, because we don't need to build and carry two copies
of the same data.
- returns a special view to output the total number of children (and
grandchildren) events for all parent events of a job
- each value is the total number of children of that event
- intended to be consumed by the UI, as an efficient way to get the
number of children for a particular event
- see api/templates/api/job_job_events_children_summary.md for more info
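
The counting described above can be illustrated with a minimal sketch. It is
not the view's actual implementation: the function name `children_summary`
and the `(event_id, parent_id)` input shape are assumptions made for this
example only.

```python
from collections import defaultdict


def children_summary(events):
    """Map each parent event id to its total number of descendants.

    `events` is an iterable of (event_id, parent_id) pairs, where parent_id
    is None for top-level events. This input shape is hypothetical.
    """
    # Build parent -> direct children lookup.
    children = defaultdict(list)
    for event_id, parent_id in events:
        if parent_id is not None:
            children[parent_id].append(event_id)

    cache = {}

    def count(event_id):
        # Each child contributes itself plus all of its own descendants.
        if event_id not in cache:
            cache[event_id] = sum(
                1 + count(child) for child in children.get(event_id, ())
            )
        return cache[event_id]

    # Only events that actually have children appear in the summary.
    return {parent: count(parent) for parent in list(children)}
```

Grandchildren are included in a parent's total because each child adds its
own descendant count on top of itself.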
* Grafana notifications: Fix panel/dashboardId type
The latest Grafana fails with:
Error sending notification grafana: 400
[{"classification":"DeserializationError",
"message":"json: cannot unmarshal string into Go struct
field PostAnnotationsCmd.dashboardId of type int64"}]
So ensure the IDs are really int and not strings.
* Fix the dashboard/panelId=0 case
0 is a valid value for the IDs, so make sure it is allowed.
* Update tests to new behavior
Panel/Dashboard Id fields are not sent if they were not requested.
Also add tests for the ID=0 case.
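
A minimal sketch of the behavior described in the three Grafana bullets
above, assuming a hypothetical helper named `build_annotation`. The real
notification backend may be structured differently. Only the `dashboardId`
and `panelId` field names come from the error message above.

```python
def build_annotation(text, dashboard_id=None, panel_id=None):
    """Build an annotation payload with integer IDs (hypothetical helper)."""
    payload = {"text": text}
    # Test against None and "" instead of truthiness, so an ID of 0 is kept.
    if dashboard_id not in (None, ""):
        payload["dashboardId"] = int(dashboard_id)
    if panel_id not in (None, ""):
        payload["panelId"] = int(panel_id)
    return payload
```

A plain `if dashboard_id:` check would silently drop an ID of 0, which is the
case the second bullet fixes.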
* Simple patches to make jobs robust to database restarts
* Add some wait time before retrying loop due to DB error
* Apply dispatcher downtime setting to job updates, fix dispatcher bug
This resolves a bug where the pg_is_down property never had the right
value. The loop is normally stuck in the conn.events() iterator, so it
never recognized successful database interactions. This led to serial
database outages terminating jobs.
The new setting for allowable PG downtime is shared with task code.
Any calls to update_model will use the _max_attempts parameter to
align with the patience time that the dispatcher respects when
consuming new events.
* To avoid restart loops, handle DB errors on startup with prejudice
* If reconnect consistently fails, exit with non-zero code
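
The retry behavior described in the bullets above (wait between attempts,
give up after a bounded number of tries) can be sketched roughly as follows.
The names `retry_with_patience` and `DatabaseDown` are hypothetical and are
not taken from the code base.

```python
import time


class DatabaseDown(Exception):
    """Stand-in for a database connection error (hypothetical)."""


def retry_with_patience(operation, max_attempts=5, wait_seconds=1.0,
                        sleep=time.sleep):
    """Call `operation`, retrying on DatabaseDown up to `max_attempts` times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except DatabaseDown:
            if attempt == max_attempts:
                # Patience exhausted: let the caller decide how to fail.
                raise
            # Wait before retrying to avoid a tight loop during an outage.
            sleep(wait_seconds)
```

A service entry point could catch the final exception and call `sys.exit(1)`
so that a consistently failing reconnect ends with a non-zero exit code.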
There was a race condition because the callback receiver tried to run this code:
File "/awx_devel/awx/main/management/commands/run_callback_receiver.py", line 31, in handle
CallbackBrokerWorker(),
File "/awx_devel/awx/main/dispatch/worker/callback.py", line 49, in __init__
self.subsystem_metrics = s_metrics.Metrics(auto_pipe_execute=False)
File "/awx_devel/awx/main/analytics/subsystem_metrics.py", line 156, in __init__
self.instance_name = Instance.objects.me().hostname
Before get_or_register was being called by the dispatcher.
Occasionally create_partition will fail with:
relation "main_projectupdateevent_20220323_19" already exists
This change wraps the db command in a try/except block with its
own transaction.
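
As an illustration of the pattern only, here is a sketch that uses the
standard-library `sqlite3` module in place of PostgreSQL and Django. The
function body and the single-column table layout are assumptions, not the
actual create_partition code.

```python
import sqlite3


def create_partition(conn, table_name):
    """Create a table, tolerating a concurrent "already exists" failure.

    Returns True if the table was created, False if it already existed.
    """
    try:
        # The connection context manager commits on success and rolls back
        # on error, so a failure here does not affect the caller's work.
        with conn:
            conn.execute(f'CREATE TABLE "{table_name}" (id INTEGER PRIMARY KEY)')
        return True
    except sqlite3.OperationalError as exc:
        if "already exists" in str(exc):
            return False
        raise
```

Any other database error is re-raised, so only the benign duplicate-creation
case is swallowed.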
This JSONBlob field type is a wrapper around Django's new generic
JSONField, but with the database column type forced to be text. This
should behave close enough to our old wrapper around
django-jsonfield's JSONField and will avoid needing to do the
out-of-band database migration.
* We trigger notifications when the callback receiver processes the
playbook_on_stats event. This is the last event in ansible-playbook and
the process should exit very shortly after this event is emitted. The
trouble comes in with the isolated node feature. There is a management
playbook that runs periodically that pulls the events from the remote
node. It's possible that the management playbook runs, gets the
playbook_on_stats event, but does not see that the playbook is finished
running. Therefore the job status is still seen as 'running' BUT we have
kicked off the notification for the job. The notification worker will
enter a loop waiting on the job to enter the finished state. In this
case the time it takes for the job to enter the finished state can be
long, roughly 2 * the management playbook run time.
* This new setting allows the user to increase the time that the
notification spends waiting for the job to enter the finished state.
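
The waiting loop described above might look roughly like the sketch below.
The function name, the set of finished statuses, and the injectable `sleep`
and `clock` parameters are assumptions made for illustration. The real
setting name is not shown here.

```python
import time

# Assumed set of terminal job statuses, for illustration only.
FINISHED_STATUSES = {"successful", "failed", "error", "canceled"}


def wait_for_finished(get_status, timeout_seconds, poll_interval=1.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll `get_status` until the job is finished or the timeout elapses.

    Returns True if a finished status was seen, False on timeout.
    """
    deadline = clock() + timeout_seconds
    while True:
        if get_status() in FINISHED_STATUSES:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

With isolated nodes, `timeout_seconds` would need to cover roughly twice the
management playbook run time, which is why making it configurable helps.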