Commit Graph

131 Commits

Author SHA1 Message Date
John Westcott IV
cd4d83acb7 Compensating for NUL unicode characters
NUL characters are not allowed in text fields in the database

We used to strip them out of stdout but the exception changed

And we want to be sure to strip them out of JSONBlob fields
2023-06-14 17:40:15 -04:00
John Westcott IV
e47d30974c Removing psycopg2 references 2023-06-14 17:40:15 -04:00
Alan Rominger
fbaeb90268 Apply conservative database connection reduction changes (#14066)
This is expected to free up 4 additional database connections per traditional node
  compare to roughly 12 in total before this change

Out of these 3 are accomplished by using existing connection for recently added services
  then 1 is obtained by closing the connection for the idle callback receiver main process

Signed-off-by: jessicamack <jmack@redhat.com>
Co-authored-by: jessicamack <jmack@redhat.com>
2023-06-01 14:59:18 -04:00
Alan Rominger
ef99770383 Add subsystem metrics for the dispatcher (#13989)
This adds a handful of metrics to /api/v2/metrics/ recorded from the dispatcher main process

Adds logic in the dispatcher period tasks to calculate these for the last collection interval
Reports worker count, task count, scale up events, and availability

Add data to demo grafana dashboard
2023-05-17 14:29:31 -04:00
Alan Rominger
342e9197b8 Customize application_name for different connections in dispatcher service (#13074)
* Introduce new method in settings, import in-line w NOQA mark

* Further refine the app_name to use shorter service names like dispatcher

* Clean up listener logic, change some names
2023-04-13 22:36:36 -04:00
Hao Liu
c8c8ed1775 Raise ValueError when no ready and enabled task instance 2023-03-29 22:09:19 -04:00
Hao Liu
25303ee625 Only select task instance that are ready and enabled
When select a queue for task instance to run task only select task instance that are ready and enabled
2023-03-29 22:09:19 -04:00
Hao Liu
cd3f7666be add get_task_queuename
get_local_queuename will return the pod name of the instance

now that web and task are in different pods when web container queue a task it will be put into a queue without as task worker to execute the task
2023-03-29 22:09:19 -04:00
Jessica Mack
43f4872fec these methods don't need to be class methods
Signed-off-by: Jessica Mack <jmack@redhat.com>
2023-03-29 22:04:43 -04:00
Gabriel Muniz
e15f4de0dd Fix race with heartbeat and reaper logic (#13713)
* Fix race with heartbeat and reaper logic

* Fix tests to fail when over drift over heartbeat time

* replaced modified with started time for reap() code and added test

* fixed logic bug and cleaned up tests

* Added comments to tests to call out reasoning
2023-03-17 14:24:31 -04:00
Alan Rominger
6c1d4a5cfd Skip callback receiver bulk_create with 0 events 2023-02-04 12:10:39 -05:00
Alan Rominger
f5785976be Update to comply with new black rules 2023-02-01 14:59:38 -05:00
Alan Rominger
8a4059d266 Workaround for events with NUL char, touch up error loop (#13398)
* Workaround for events with NUL char, touch up error loop

This fixes an error where some events would not save
  due to having the 0x00 character which errors in postgres
  this adds a line to replace it with empty text

Hitting that kind of event put us in an infinite error loop
  so this change makes a number of changes to prevent similar loops
  the showcase example is a negative counter,
  this is not realistic in the real world but works for unit tests

These error loop fixes seek to esablish the cases where we clear the buffer
Some logic is removed from the outer loop, with the idea that
ensure_connection will better distinguish flake

* From review comments, delay NUL char sanitization to later

Use pop to make list operations more clear

* Fix incorrect use of pop
2023-01-19 13:36:23 -05:00
Jeff Bradberry
721e19e1c8 Merge pull request #13181 from jbradberry/remove-qsstats
Replace the querysets provided by django-qsstats-magic
2022-11-11 10:58:51 -05:00
Jeff Bradberry
e029cf7196 Remove the django-qsstats-magic dependency 2022-11-10 15:37:44 -05:00
Alan Rominger
1f939aa25e Merge pull request #12884 from AlanCoding/is_testing
[tech debt] Move the IS_TESTING method out of settings
2022-11-09 15:29:35 -05:00
Alan Rominger
192f45bbd0 Make canceling view non-atomic to fix 500 errors with job bursts (#13072)
* Make canceling view non-atomic to fix 500 errors with job bursts

* Update test calls for cancel method changes
2022-10-20 15:02:54 -04:00
Alan Rominger
cba780a8f8 Fix dispatcher connection deadlock w scheduler and cleanup 2022-10-19 12:12:15 -04:00
Alan Rominger
a64467c5a6 Shortcut Instance.objects.me when possible 2022-10-05 09:11:42 -04:00
Alan Rominger
cfce31419d Move the IS_TESTING method out of settings 2022-09-28 11:19:10 -04:00
Alan Rominger
c59bbdecdb Refactor canceling to work through messaging and signals, not database
If canceled attempted before, still allow attempting another cancel
in this case, attempt to send the sigterm signal again.
Keep clicking, you might help!

Replace other cancel_callbacks with sigterm watcher
  adapt special inventory mechanism for this too

Get rid of the cancel_watcher method with exception in main thread

Handle academic case of sigterm race condition

Process cancelation as control signal

Fully connect cancel method and run_dispatcher to control

Never transition workflows directly to canceled, add logs
2022-09-01 15:20:31 -04:00
Alan Rominger
01037fa561 Fix sanity check to use the relevant active connection 2022-08-29 16:33:07 -04:00
Alan Rominger
974f845059 Revert "Merge pull request #12584 from AlanCoding/lazy_workers"
This reverts commit 64157f7207, reversing
changes made to 9e8ba6ca09.
2022-08-28 23:04:13 -04:00
Alan Rominger
e87fabe6bb Submit job to dispatcher as part of transaction (#12573)
Make it so that submitting a task to the dispatcher happens as part of the transaction.
  this applies to dispatcher task "publishers" which NOTIFY the pg_notify queue
  if the transaction is not successful, it will not be sent, as per postgres docs

This keeps current behavior for pg_notify listeners
  practically, this only applies for the awx-manage run_dispatcher service
  this requires creating a separate connection and keeping it long-lived
  arbitrary code will occasionally close the main connection, which would stop listening

Stop sending the waiting status websocket message
  this is required because the ordering cannot be maintained with other changes here
  the instance group data is moved to the running websocket message payload

Move call to create_partition from task manager to pre_run_hook
  mock this in relevant unit tests
2022-08-18 09:43:53 -04:00
Alan Rominger
e0c59d12c1 Change data structure so we can conditionally reap waiting jobs 2022-08-17 16:00:30 -04:00
Alan Rominger
6719010050 Add back in cleanup call 2022-08-17 15:42:48 -04:00
Alan Rominger
ccd46a1c0f Move reaper logic into worker, avoiding bottlenecks 2022-08-17 15:42:47 -04:00
Alan Rominger
621833ef0e Add extra workers if computing based on memory
Co-authored-by: Elijah DeLee <kdelee@redhat.com>
2022-08-17 11:41:59 -04:00
Shane McDonald
16be38bb54 Allow for passing custom job_explanation to reaper methods
Co-authored-by: Alan Rominger <arominge@redhat.com>
2022-08-17 11:41:49 -04:00
Shane McDonald
3c51cb130f Add grace period settings for task manager timeout, and pod / job waiting reapers
Co-authored-by: Alan Rominger <arominge@redhat.com>
2022-08-17 11:39:01 -04:00
Shane McDonald
c649809eb2 Remove debug method that calls cleanup
- It's unclear why this was here.
- Removing it doesnt appear to cause any problems.
- It still gets called during heartbeats.
2022-08-17 11:35:43 -04:00
Alan Rominger
a3fef27002 Add logs to debug waiting bottlenecking 2022-08-17 11:33:49 -04:00
Alan Rominger
278db2cdde Split reaper for running and waiting jobs
Avoid running jobs that have already been reapted

Co-authored-by: Elijah DeLee <kdelee@redhat.com>

Remove unnecessary extra actions

Fix waiting jobs in other cases of reaping
2022-08-17 10:53:29 -04:00
Alan Rominger
64157f7207 Merge pull request #12584 from AlanCoding/lazy_workers
Wait 60 seconds before scaling down a worker
2022-08-17 10:18:19 -04:00
Alan Rominger
9e8ba6ca09 Merge pull request #12494 from AlanCoding/revival
Register system again if deleted by another pod
2022-08-17 10:12:39 -04:00
Alan Rominger
998000bfbe Surface correct error from bulk_create on unrecoverable error 2022-08-10 16:16:57 -04:00
Alan Rominger
43a50cc62c Fix event counting in error handling path 2022-08-10 16:16:57 -04:00
Alan Rominger
30f556f845 Further resiliency changes focused on offline database
Make logs from database outage more manageable

Raise exception if update_model never recovers from problem
2022-08-10 16:16:57 -04:00
Alan Rominger
c5985c4c81 Change lazy worker method name and adjust log 2022-08-10 16:12:03 -04:00
Alan Rominger
a9170236e1 Wait 60 seconds before scaling down a worker 2022-08-10 16:12:03 -04:00
Alan Rominger
f7e6a32444 Optimize task manager with debug toolbar, adjust prefetch (#12588) 2022-08-10 10:05:13 -04:00
Alan Rominger
585d3f4e2a Register system again if deleted by another pod
Avoid cases where missing instance
  would throw error on startup
  this gives time for heartbeat to register it
2022-08-08 22:36:17 -04:00
Seth Foster
b3eb9e0193 pid kill each of the 3 task managers on timeout 2022-08-05 14:33:25 -04:00
Elijah DeLee
7eb0c7dd28 exit task manager loops early if we are timed out
add settings to define task manager timeout and grace period

This gives us still TASK_MANAGER_TIMEOUT_GRACE_PERIOD amount of time to
get out of the task manager.

Also, apply start task limit in WorkflowManager to starting pending
workflows
2022-08-05 14:33:24 -04:00
Alan Rominger
fd671ecc9d Give specific messages if job was killed due to SIGTERM or SIGKILL (#12435)
* Reap jobs on dispatcher startup to increase clarity, replace existing reaping logic

* Exit jobs if receiving SIGTERM signal

* Fix unwanted reaping on shutdown, let subprocess close out

* Add some sanity tests for signal module

* Add a log for an unhandled dispatcher error

* Refine wording of error messages

Co-authored-by: Elijah DeLee <kdelee@redhat.com>
2022-06-30 13:20:08 -04:00
Rebeccah
5f9326b131 added average event processing metric (in seconds) that can be served to
grafana via prometheus.

This metric is a good indicator of how far behind the callback receiver
is. The higher the load the further behind/the greater the number of
seconds the metric will display.

This number being high may indicate the need for horizontal scaling in
the control plane or vertically scaling the number of callback
receivers.
2022-06-06 15:14:56 -04:00
Alan Rominger
1e6ca01686 Fix the callback receiver --status command 2022-05-19 15:00:49 -04:00
Alan Rominger
29d60844a8 Fix notification timing issue by sending in the latter of 2 events (#12110)
* Track host_status_counts and use that to process notifications

* Remove now unused setting

* Back out changes to callback class not needed after all

* Skirt the need for duck typing by leaning on the cached field

* Delete tests for deleted task

* Revert "Back out changes to callback class not needed after all"

This reverts commit 3b8ae350d218991d42bffd65ce4baac6f41926b2.

* Directly hardcode stats_event_type for callback class

* Fire notifications if stats event was never sent

* Remove test content for deleted methods

* Add placeholder for when no hosts matched

* Make field default be None, denote events processed with empty dict

* Make UI process null value for host_status_counts

* Fix tracking of EOF dispatch for system jobs

* Reorganize EVENT_MAP into class properties

* Consolidate conditional I missed from EVENT_MAP refactor

* Give up on the null condition, also applies for empty hosts

* Remove cls position argument not being used

* Move wrapup method out of class, add tests
2022-04-29 13:54:31 -04:00
Alan Rominger
73e02e745a Patches to make jobs robust to database restarts (#11905)
* Simple patches to make jobs robust to database restarts

* Add some wait time before retrying loop due to DB error

* Apply dispatcher downtime setting to job updates, fix dispatcher bug

This resolves a bug where the pg_is_down property
  never had the right value
  the loop is normally stuck in the conn.events() iterator
  so it never recognized successful database interactions
  this lead to serial database outages terminating jobs

New setting for allowable PG downtime is shared with task code
  any calls to update_model will use _max_attempts parameter
  to make it align with the patience time that the dispatcher
  respects when consuming new events

* To avoid restart loops, handle DB errors on startup with prejudice

* If reconnect consistently fails, exit with non-zero code
2022-03-30 09:14:20 -04:00
Alan Rominger
fe5736dc7f Specifically abort the reaper if instance not registered 2022-03-29 14:08:58 -04:00