* Simple patches to make jobs robust to database restarts
* Add some wait time before retrying loop due to DB error
* Apply dispatcher downtime setting to job updates, fix dispatcher bug
This resolves a bug where the pg_is_down property
never had the right value
the loop is normally stuck in the conn.events() iterator
so it never recognized successful database interactions
this lead to serial database outages terminating jobs
New setting for allowable PG downtime is shared with task code
any calls to update_model will use _max_attempts parameter
to make it align with the patience time that the dispatcher
respects when consuming new events
* To avoid restart loops, handle DB errors on startup with prejudice
* If reconnect consistently fails, exit with non-zero code
There was a race condition because the callback reciever tried to run this code:
File "/awx_devel/awx/main/management/commands/run_callback_receiver.py", line 31, in handle
CallbackBrokerWorker(),
File "/awx_devel/awx/main/dispatch/worker/callback.py", line 49, in __init__
self.subsystem_metrics = s_metrics.Metrics(auto_pipe_execute=False)
File "/awx_devel/awx/main/analytics/subsystem_metrics.py", line 156, in __init__
self.instance_name = Instance.objects.me().hostname
Before get_or_register was being called by the dispatcher.
Occasionally the create_partition will error with,
relation "main_projectupdateevent_20220323_19" already exists
This change wraps the db command into a try except block with its
own transaction
This JSONBlob field type is a wrapper around Django's new generic
JSONField, but with the database column type forced to be text. This
should behave close enough to our old wrapper around
django-jsonfield's JSONField and will avoid needing to do the
out-of-band database migration.
* We trigger notifications when the callback receiver processes the
playbook_on_stats event. This is the last event in ansible-playbook and
the process should exist very shortly after this event is emitted. The
trouble comes in with the isolated node feature. There is a management
playbook that runs periodically that pulls the events from the remote
node. It's possible that the management playbooks runs, gets the
playbook_on_stats event, but does not see that the playbook is finished
running. Therefore the job status is still seen as 'running' BUT we have
kicked of the notification for the job. The notification worker will
enter a loop waiting on the job to enter the finished state. In this
case the time it takes for the job to enter the finished state can be
long, roughly 2 * the management playbook run time.
* This new setting allows the user to increase the time that the
notification spends waiting for the job to enter the finished state.