Commit Graph

365 Commits

Author SHA1 Message Date
jessicamack
209747d88e Update for django-ansible-base split (#14783)
* update paths and names

* temp to get tests passing

* fix typo
2024-01-19 12:30:32 -05:00
John Westcott IV
aacf9653c5 Use filtering/sorting from django-ansible-base (#14726)
* Move filtering to DAB

* add comment to trigger building a new image

Signed-off-by: jessicamack <jmack@redhat.com>

* remove unneeded comment

Signed-off-by: jessicamack <jmack@redhat.com>

* remove unused imports

Signed-off-by: jessicamack <jmack@redhat.com>

* change mock import

Signed-off-by: jessicamack <jmack@redhat.com>

---------

Signed-off-by: jessicamack <jmack@redhat.com>
Co-authored-by: jessicamack <jmack@redhat.com>
2023-12-18 10:05:02 -05:00
Alan Rominger
333ef76cbd Send notifications for dependency failures (#14603)
* Send notifications for dependency failures

* Delete tests for deleted method

* Remove another test for removed method
2023-10-30 10:42:37 -04:00
Hao Liu
bb3acbb8ad Debug log for scheduler commit duration (#14035)
Co-authored-by: Alan Rominger <arominge@redhat.com>
2023-09-27 09:46:55 -04:00
Alan Rominger
ab5cc2e69c Simplifications for DependencyManager (#13533) 2023-07-27 15:42:29 -04:00
Alan Rominger
98bfe3f43f Add missing trigger for failed-to-start nodes (#13802) 2023-07-24 12:17:46 -04:00
Rick Elrod
48edb15a03 Prevent Dispatcher deadlock when Redis disappears (#14249)
This fixes https://github.com/ansible/awx/issues/14245 which has
more information about this issue.

This change addresses both:
- A clashing signal handler (registering a callback to fire when
  the task manager times out, and hitting that callback in cases
  where we didn't expect to). Make dispatcher timeout use
  SIGUSR1, not SIGTERM.
- Metrics not being reported should not make us crash, so that is
  now fixed as well.

Signed-off-by: Rick Elrod <rick@elrod.me>
Co-authored-by: Alan Rominger <arominge@redhat.com>
2023-07-18 10:43:46 -05:00
Hao Liu
cd3f7666be add get_task_queuename
get_local_queuename will return the pod name of the instance

now that web and task are in different pods when web container queue a task it will be put into a queue without as task worker to execute the task
2023-03-29 22:09:19 -04:00
Alan Rominger
94b34b801c Avoid unbounded kwargs by fetching subtasks inside handle_work_error
Update tests to new handle_work_error call pattern

Handle blame correctly with multiple serial deps
  add new test case corresponding to this scenario
2022-12-19 16:02:51 -05:00
Elijah DeLee
71f326b705 filter tasks when instance groups are filtered
this is necessary when requests are made to to
api/v2/job_templates/ID/instance_groups

Thanks to Sarah who found this!
2022-11-30 17:14:33 -05:00
Elijah DeLee
e403c603d6 use task manager models more consistently in serializer 2022-11-30 17:14:33 -05:00
Elijah DeLee
86856f242a Add max concurrent jobs and max forks per ig
The intention of this feature is primarily to provide some notion of max
capacity of container groups, but the logic I've left generic. Default
is 0, which will be interpereted as no maximum number of jobs or forks.

Includes refactor of variable and method names for clarity.
instances_by_hostname is an internal attribute of TaskManagerInstances.
Clarify when we are expecting the actual TaskManagerInstances object.

Unify how we process running tasks and consume capacity. This has the
effect that we do less expensive work in after_lock_init and have 1 less
loop over all the running tasks. Previously we looped for both building
the dependency graph as well as for calculating the starting capacity of
all the instances and instance groups. Now we acheive both tasks in the
same loop.

Because of how this changes the somewhat subtle "do-si-do" of how to
initialize the Task Manager models, introduce a wrapper class that tries
to take some of that burden off of other areas where we re-use this like
in the serializer and the metrics. Also use this wrapper class to handle
nicities of how to track capacity consumption on instances and instance
groups.

Add tests for max_forks and max_concurrent_jobs

Fixup tests that use TaskManagerModels to accomodate changes.

assign ig before call to consume capacity

if we don't do it in that order, then we don't correctly account for
the container group jobs we are starting in the middle of the task
manager run
2022-11-30 17:14:33 -05:00
Alan Rominger
cfce31419d Move the IS_TESTING method out of settings 2022-09-28 11:19:10 -04:00
Alan Rominger
5648d9d96f Avoid cache warning for dispatching control type tasks 2022-09-27 15:18:13 -04:00
Shane McDonald
9b034ad574 generate control node receptor.conf
when a new remote execution/hop node is added
regenerate the receptor.conf for all control node to
peer out to the new remote execution node

Signed-off-by: Hao Liu <haoli@redhat.com>
Co-Authored-By: Seth Foster <fosterseth@users.noreply.github.com>
Co-Authored-By: Shane McDonald <me@shanemcd.com>
2022-09-23 09:46:12 -04:00
Jeff Bradberry
604fac2295 Update task management to only do things with ready instances 2022-09-23 09:46:11 -04:00
Alan Rominger
2437a84b48 Minor changes to instance loop structure 2022-08-29 14:28:50 -04:00
Elijah DeLee
99815f8962 calcuate consumed capacity in same way in metrics
We should be consistent about this. Also this takes us from doing a as
many queries to the UnifiedJob table as we have instances to doing 1
query to the UnifiedJob table (and both do 1 query to Instances table)
2022-08-26 11:40:36 -04:00
Alan Rominger
e87fabe6bb Submit job to dispatcher as part of transaction (#12573)
Make it so that submitting a task to the dispatcher happens as part of the transaction.
  this applies to dispatcher task "publishers" which NOTIFY the pg_notify queue
  if the transaction is not successful, it will not be sent, as per postgres docs

This keeps current behavior for pg_notify listeners
  practically, this only applies for the awx-manage run_dispatcher service
  this requires creating a separate connection and keeping it long-lived
  arbitrary code will occasionally close the main connection, which would stop listening

Stop sending the waiting status websocket message
  this is required because the ordering cannot be maintained with other changes here
  the instance group data is moved to the running websocket message payload

Move call to create_partition from task manager to pre_run_hook
  mock this in relevant unit tests
2022-08-18 09:43:53 -04:00
Seth Foster
55d295c2a6 Add metric to measure task manager transaction, including on_commit calls 2022-08-15 12:44:29 -04:00
Alan Rominger
f7e6a32444 Optimize task manager with debug toolbar, adjust prefetch (#12588) 2022-08-10 10:05:13 -04:00
Seth Foster
e6f8852b05 Cache task_impact
task_impact is now a field on the database
It is calculated and set during create_unified_job

set task_impact on .save for adhoc commands
2022-08-05 14:33:47 -04:00
Alan Rominger
d06a3f060d Block sliced workflow jobs on any job type from their JT (#12551) 2022-08-05 14:33:45 -04:00
Seth Foster
957b2b7188 Cache preferred instance groups
When creating unified job, stash the list of pk values from the
instance groups returned from preferred_instance_groups so that the
task management system does not need to call out to this method
repeatedly.

.preferred_instance_groups_cache is the new field
2022-08-05 14:33:28 -04:00
Alan Rominger
b94b3a1e91 [task_manager_refactor] Move approval node expiration logic into queryset (#12502)
Instead of loading all pending Workflow Approvals in the task manager,
  run a query that will only return the expired apporovals
  directly expire all which are returned by that query

Cache expires time as a new field in order to simplify WorkflowApproval filter
2022-08-05 14:33:27 -04:00
Elijah DeLee
7776a81e22 add job to dependency graph in start task
We always add the job to the graph right before calling start task.
Reduce complexity of proper operation by just doing this in start_task,
because if you call start_task, you need to add it to the dependency
graph
2022-08-05 14:33:26 -04:00
Elijah DeLee
bf89093fac unify call pattern for get_tasks 2022-08-05 14:33:26 -04:00
Elijah DeLee
76d76d13b0 Start pending workflows in TaskManager
we had tried doing this in the WorkflowManager, but we decided that
we want to handle ALL pending jobs and "soft blockers" to jobs with the
TaskManager/DependencyGraph and not duplicate that logic in the
WorkflowManager.
2022-08-05 14:33:26 -04:00
Elijah DeLee
e603c23b40 fix sliced jobs blocking logic in depedency graph
We have to look at the sliced job's unified_job_template_id
Now, task_blocked_by works for sliced jobs too.
2022-08-05 14:33:26 -04:00
Alan Rominger
8af4dd5988 Fix unintended slice job blocking 2022-08-05 14:33:25 -04:00
Seth Foster
0a47d05d26 split schedule_task_manager into 3
each call to schedule_task_manager becomes one of

ScheduleTaskManager
ScheduleDependencyManager
ScheduleWorkflowManager
2022-08-05 14:33:25 -04:00
Seth Foster
b3eb9e0193 pid kill each of the 3 task managers on timeout 2022-08-05 14:33:25 -04:00
Elijah DeLee
b26d2ab0e9 fix looking at wrong id for wf allow_simultaneous 2022-08-05 14:33:25 -04:00
Elijah DeLee
7eb0c7dd28 exit task manager loops early if we are timed out
add settings to define task manager timeout and grace period

This gives us still TASK_MANAGER_TIMEOUT_GRACE_PERIOD amount of time to
get out of the task manager.

Also, apply start task limit in WorkflowManager to starting pending
workflows
2022-08-05 14:33:24 -04:00
Elijah DeLee
236c1df676 fix lint errors 2022-08-05 14:33:24 -04:00
Seth Foster
ff118f2177 Manage pending workflow jobs in Workflow Manager
get_tasks uses UnifiedJob
Additionally, make local overrides run after development settings
2022-08-05 14:31:48 -04:00
Elijah DeLee
29d91da1d2 we can do all the work in one loop
more than saving the loop, we save building the WorkflowDag twice which
makes LOTS of queries!!!

Also, do a bulk update on the WorkflowJobNodes instead of saving in a
loop :fear:
2022-08-05 14:31:48 -04:00
Elijah DeLee
ad08eafb9a add debug views for task manager(s)
implement https://github.com/ansible/awx/issues/12446
in development environment, enable set of views that run
the task manager(s).

Also introduce a setting that disables any calls to schedule()
that do not originate from the debug views when in the development
environment. With guards around both if we are in the development
environment and the setting, I think we're pretty safe this won't get
triggered unintentionally.

use MODE to determine if we are in devel env

Also, move test for skipping task managers to the tasks file
2022-08-05 14:31:24 -04:00
Seth Foster
431b9370df Split TaskManager into
- DependencyManager spawns dependencies if necessary
- WorkflowManager processes running workflows to see if a new job is
  ready to spawn
- TaskManager starts tasks if unblocked and has execution capacity
2022-08-05 14:29:02 -04:00
Alan Rominger
783b744bdb Pass combined artifacts from nested workflows into downstream nodes (#12223)
* Track combined artifacts on workflow jobs

* Avoid schema change for passing nested workflow artifacts

* Basic support for nested workflow artifacts, add test

* Forgot that only does not work with polymorphic

* Remove incorrect field

* Consolidate logic and prevent recursion with UJ artifacts method

* Stop trying to do precedence by status, filter for obvious ones

* Review comments about sets

* Fix up bug with convergence node paths and artifacts
2022-06-23 16:54:53 -03:00
Seth Foster
2f82b75748 Add subsystem metrics for task manager 2022-06-14 11:00:11 -04:00
Seth Foster
eba4a3f1c2 in case we fail a job in task manager, we need to add the project update to the inventoryupdate.source_project field 2022-05-12 15:21:17 -04:00
Seth Foster
0ae9fe3624 if dependency fails, fail job in task manager 2022-05-12 14:00:13 -04:00
Seth Foster
1b662fcca5 SCM inv source trigger project update
- scm based inventory sources should launch project updates prior to
running inventory updates for that source.
- fixes scenario where a job is based on projectA, but the inventory
source is based on projectB. Running the job will likely trigger a
sync for projectA, but not projectB.

comments
2022-05-12 14:00:12 -04:00
Alan Rominger
cb63d92bbf Remove committed_capacity field, delete supporting code (#12086)
* Remove committed_capacity field, delete supporting code

* Track consumed capacity to solve the negatives problem

* Use more verbose name for IG queryset
2022-04-22 13:41:32 -04:00
Elijah DeLee
689a216726 move static methods used by task manager (#12050)
* move static methods used by task manager

These static methods were being used to act on Instance-like objects
that were SimpleNamespace objects with the necessary attributes.

This change introduces dedicated classes to replace the SimpleNamespace
objects and moves the formerlly staticmethods to a place where they are
more relevant instead of tacked onto models to which they were only
loosly related.

Accept in-memory data structure in init methods for tests

* initialize remaining capacity AFTER we built map of instances
2022-04-21 13:05:06 -04:00
Elijah DeLee
e24fc43a45 Revert "Only fetch fields we need in task manager"
This reverts commit 868e811b3f.

Turns out this does not play well with polymorphic models.

Will try again with .defer()
2022-04-14 11:55:33 -04:00
Elijah DeLee
868e811b3f Only fetch fields we need in task manager
By using .only we select fewer columns, avoiding potentially large
fields that we never reference.

Also, small tweak to eliminate what was a duplicate dictionary of
hostname:instance, because we don't need build and carry two copies of
the same data.
2022-04-13 17:24:33 -04:00
Elijah DeLee
2e9974133a calculate remaining capacity in static method
this is to avoid additional queries when we allready have all
the active jobs fetched in the task manager
2022-04-13 11:56:07 -04:00
Elijah DeLee
4328b4cb67 drop call that queries all running and waiting jobs
this is to fix one more place in the task manager where we end up
querying all running and waiting jobs.

Partial fix for https://github.com/ansible/awx/issues/11671
2022-04-12 10:31:47 -04:00