325 Commits

Author SHA1 Message Date
Seth Foster
2f82b75748 Add subsystem metrics for task manager 2022-06-14 11:00:11 -04:00
Seth Foster
eba4a3f1c2 in case we fail a job in task manager, we need to add the project update to the inventoryupdate.source_project field 2022-05-12 15:21:17 -04:00
Seth Foster
0ae9fe3624 if dependency fails, fail job in task manager 2022-05-12 14:00:13 -04:00
Seth Foster
1b662fcca5 SCM inv source trigger project update
- scm based inventory sources should launch project updates prior to
running inventory updates for that source.
- fixes scenario where a job is based on projectA, but the inventory
source is based on projectB. Running the job will likely trigger a
sync for projectA, but not projectB.

comments
2022-05-12 14:00:12 -04:00
Alan Rominger
cb63d92bbf Remove committed_capacity field, delete supporting code (#12086)
* Remove committed_capacity field, delete supporting code

* Track consumed capacity to solve the negatives problem

* Use more verbose name for IG queryset
2022-04-22 13:41:32 -04:00
Elijah DeLee
689a216726 move static methods used by task manager (#12050)
* move static methods used by task manager

These static methods were being used to act on Instance-like objects
that were SimpleNamespace objects with the necessary attributes.

This change introduces dedicated classes to replace the SimpleNamespace
objects and moves the former static methods to a place where they are
more relevant instead of tacked onto models to which they were only
loosely related.

Accept in-memory data structure in init methods for tests

* initialize remaining capacity AFTER we built map of instances
2022-04-21 13:05:06 -04:00
Elijah DeLee
e24fc43a45 Revert "Only fetch fields we need in task manager"
This reverts commit 868e811b3f.

Turns out this does not play well with polymorphic models.

Will try again with .defer()
2022-04-14 11:55:33 -04:00
Elijah DeLee
868e811b3f Only fetch fields we need in task manager
By using .only we select fewer columns, avoiding potentially large
fields that we never reference.

Also, small tweak to eliminate what was a duplicate dictionary of
hostname:instance, because we don't need to build and carry two copies of
the same data.
2022-04-13 17:24:33 -04:00
Elijah DeLee
2e9974133a calculate remaining capacity in static method
this is to avoid additional queries when we already have all
the active jobs fetched in the task manager
2022-04-13 11:56:07 -04:00
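The idea in the commit above can be sketched roughly as follows: once the active jobs are already in memory, remaining capacity is a plain computation over that list rather than a new query. This is a minimal illustration, not the AWX implementation; `RunningTask` and `remaining_capacity` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RunningTask:
    # Hypothetical stand-in for an already-fetched job record.
    execution_node: str
    task_impact: int


def remaining_capacity(capacity, hostname, tasks):
    """Subtract the impact of in-memory tasks on `hostname` from its capacity."""
    consumed = sum(t.task_impact for t in tasks if t.execution_node == hostname)
    return capacity - consumed


# The task list is fetched once and reused for every instance.
tasks = [RunningTask("node1", 3), RunningTask("node2", 5), RunningTask("node1", 2)]
node1_remaining = remaining_capacity(10, "node1", tasks)
node2_remaining = remaining_capacity(10, "node2", tasks)
```

Because the function takes the task list as an argument, the same fetched data serves every instance in the run.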
Elijah DeLee
4328b4cb67 drop call that queries all running and waiting jobs
this is to fix one more place in the task manager where we end up
querying all running and waiting jobs.

Partial fix for https://github.com/ansible/awx/issues/11671
2022-04-12 10:31:47 -04:00
Jeff Bradberry
b852baaa39 Fix up logger .warn() calls to use .warning() instead
This is a usage that has been deprecated since Python 3.3.
2022-03-07 18:11:36 -05:00
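For illustration, the replacement looks like this with the standard library `logging` module. The logger name and message are made up for the example; only `Logger.warning()` itself is the real API.

```python
import io
import logging

# Capture output in memory so the example is self-contained.
stream = io.StringIO()
logger = logging.getLogger("task_manager_example")  # hypothetical logger name
logger.setLevel(logging.WARNING)
logger.addHandler(logging.StreamHandler(stream))
logger.propagate = False

# Deprecated spelling: logger.warn("No capacity for task %s", 42)
logger.warning("No capacity for task %s", 42)

logged = stream.getvalue()
```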
Jeff Bradberry
a3a216f91f Fix up new Django 3.0 deprecations
Mostly text based: force/smart_text, ugettext_*
2022-03-07 18:11:36 -05:00
Elijah DeLee
38f50f014b fix missing job lifecycle messages (#11801)
we were missing these messages for control type jobs that call start_task earlier than other types of jobs
2022-02-23 13:56:25 -05:00
Elijah DeLee
921b2bfb28 drop unused logic in task manager
There is no current need or use to keep a separate dependency graph for
each instance group. In the interest of making it clearer what the
current code does, eliminate this superfluous complication.

We no longer reference any accounting of instance group
capacity; instead we only look at capacity on instances.
2022-02-14 16:15:03 -05:00
Elijah DeLee
604cbc1737 Consume control capacity (#11665)
* Select control node before start task

Consume capacity on control nodes for controlling tasks and consider
remaining capacity on control nodes before selecting them.

This depends on the requirement that control and hybrid nodes should all
be in the instance group named 'controlplane'. Many tests do not satisfy that
requirement. I'll update the tests in another commit.

* update tests to use controlplane

We don't start any tasks if we don't have a controlplane instance group

Due to updates to fixtures, update tests to set node type and capacity
explicitly so they get expected result.

* Fixes for accounting of control capacity consumed

Update method is used to account for currently consumed capacity for
instance groups in the in-memory capacity tracking data structure we initialize in
after_lock_init and then update via calculate_capacity_consumed (both in
task_manager.py)

Also update fit_task_to_instance to consider control impact on instances

Trust that these functions do the right thing looking for a
node with capacity, and cut out redundant check for the whole group's
capacity per Alan's recommendation.

* Refactor now redundant code

Deal with control type tasks before we loop over the preferred instance
groups, which cuts out the need for some redundant logic.

Also, fix a bug where I was missing assigning the execution node in one case!

* set job explanation on tasks that need capacity

move the job explanation for jobs that need capacity to a function
so we can re-use it in the three places we need it.

* project updates always run on the controlplane

Instance group ordering makes no sense on project updates because they
always need to run on the control plane.

Also, since hybrid nodes should always run the control processes for the
jobs running on them as execution nodes, account for this when looking for an
execution node.

* fix misleading message

the variables and wording were both misleading, fix to be more accurate
description in the two different cases where this log may be emitted.

* use settings correctly

use settings.DEFAULT_CONTROL_PLANE_QUEUE_NAME instead of a hardcoded
name
cache the controlplane_ig object during the after lock init to avoid
an unnecessary query
eliminate mistakenly duplicated AWX_CONTROL_PLANE_TASK_IMPACT and use
only AWX_CONTROL_NODE_TASK_IMPACT

* add test for control capacity consumption

add test to verify that when there are 2 jobs and only capacity for one
that one will move into waiting and the other stays in pending

* add test for hybrid node capacity consumption

assert that the hybrid node is used for both control and execution and
capacity is deducted correctly

* add test for task.capacity_type = control

Test that control type tasks have the right capacity consumed and
get assigned to the right instance group

Also fix lint in the tests

* jobs_running not accurate for control nodes

We can either NOT use "idle instances" for control nodes, or we need
to update the jobs_running property on the Instance model to count
jobs where the node is the controller_node.

I didn't do that because it may be an expensive query, and it would be
hard to make it match with jobs_running on the InstanceGroup which
filters on tasks assigned to the instance group.

This change chooses to stop considering "idle" control nodes an option,
since we can't accurately identify them.

Without this change, we continue to overconsume capacity on control nodes,
because this method sees all control nodes as "idle" at the beginning
of the task manager run and then only counts jobs started in that run
in the in-memory tracking. Jobs that last over a number of task
manager runs therefore build up consumed capacity, which is accurately
reported via Instance.consumed_capacity

* Reduce default task impact for control nodes

This is something we can experiment with as far as what users
want at install time, but start with just 1 for now.

* update capacity docs

Describe usage of the new setting and the concept of control impact.

Co-authored-by: Alan Rominger <arominge@redhat.com>
Co-authored-by: Rebeccah <rhunter@redhat.com>
2022-02-14 10:13:22 -05:00
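The control-capacity test case described in this commit (two jobs, capacity for only one) can be sketched as below. This is a simplified model under stated assumptions, not the task manager's code: `ControlNode`, `Job`, and `schedule` are hypothetical, and `CONTROL_TASK_IMPACT` stands in for the `AWX_CONTROL_NODE_TASK_IMPACT` setting with the default of 1 mentioned above.

```python
from dataclasses import dataclass

CONTROL_TASK_IMPACT = 1  # stand-in for settings.AWX_CONTROL_NODE_TASK_IMPACT


@dataclass
class ControlNode:
    hostname: str
    remaining_capacity: int


@dataclass
class Job:
    name: str
    status: str = "pending"
    controller_node: str = ""


def schedule(jobs, nodes):
    """Assign each pending job to the control node with the most room, if any."""
    for job in jobs:
        candidates = [n for n in nodes if n.remaining_capacity >= CONTROL_TASK_IMPACT]
        if not candidates:
            continue  # no control capacity left: the job stays pending
        node = max(candidates, key=lambda n: n.remaining_capacity)
        node.remaining_capacity -= CONTROL_TASK_IMPACT  # consume control capacity
        job.controller_node = node.hostname
        job.status = "waiting"


nodes = [ControlNode("control-1", 1)]
jobs = [Job("job-a"), Job("job-b")]
schedule(jobs, nodes)
```

With a single unit of control capacity, the first job moves to waiting and the second remains pending until capacity is released.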
Amol Gautam
a4a3ba65d7 Refactored tasks.py to a package
--- Added 3 new sub-packages: awx.main.tasks.system, awx.main.tasks.jobs, awx.main.tasks.receptor
--- Modified the functional tests and unit tests accordingly
2022-01-14 11:55:41 -05:00
Jeff Bradberry
f340f491dc Control the visibility and use of hop node Instances
- the list, detail, and health check API views should not include them
- the Instance-InstanceGroup association views should not allow them
  to be changed
- the ping view excludes them
- list_instances management command excludes them
- Instance.set_capacity_value sets hop nodes to 0 capacity
- TaskManager will exclude them from the nodes available for job execution
- TaskManager.reap_jobs_from_orphaned_instances will consider hop nodes
  to be an orphaned instance
- The apply_cluster_membership_policies task will not manipulate hop nodes
- get_broadcast_hosts will ignore hop nodes
- active_count also will ignore hop nodes
2021-12-17 14:30:28 -05:00
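Two of the rules listed in this commit, zero capacity for hop nodes and their exclusion from job execution, can be sketched as follows. The `Node` class and both helper functions are hypothetical and only illustrate the filtering; they are not the AWX model methods.

```python
from dataclasses import dataclass


@dataclass
class Node:
    hostname: str
    node_type: str  # "control", "execution", "hybrid", or "hop"
    capacity: int


def effective_capacity(node):
    """Hop nodes only relay traffic, so they contribute no capacity."""
    return 0 if node.node_type == "hop" else node.capacity


def nodes_available_for_execution(nodes):
    """Exclude hop nodes from the candidates for running jobs."""
    return [n for n in nodes if n.node_type != "hop"]


cluster = [
    Node("exec-1", "execution", 20),
    Node("hop-1", "hop", 20),
    Node("hybrid-1", "hybrid", 10),
]
available = [n.hostname for n in nodes_available_for_execution(cluster)]
total_capacity = sum(effective_capacity(n) for n in cluster)
```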
Elijah DeLee
e10030b73d Allow setting default execution group pod spec
This will allow us to control the default container group created via settings, meaning
we could set this in the operator and the default container group would get created with it applied.

We need this for https://github.com/ansible/awx-operator/issues/242

Deepmerge the default podspec and the override

Without this, providing the `spec` for the podspec would override everything
contained, which ends up including the container used, which is not desired

Also, use the same deepmerge function def, as the code seems to be copy-pasted from
the utils
2021-12-10 15:02:45 -05:00
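To show why a deep merge matters here, the sketch below merges an override into a default pod spec without discarding sibling keys such as the container definition. The `deepmerge` function is a minimal illustration written for this example and is not the helper from the AWX utils; the pod spec values are invented, and lists are replaced wholesale rather than merged.

```python
import copy


def deepmerge(base, override):
    """Recursively merge `override` into a copy of `base` (dicts only)."""
    if isinstance(base, dict) and isinstance(override, dict):
        merged = copy.deepcopy(base)
        for key, value in override.items():
            if key in merged:
                merged[key] = deepmerge(merged[key], value)
            else:
                merged[key] = copy.deepcopy(value)
        return merged
    # Non-dict values (including lists) are simply replaced by the override.
    return copy.deepcopy(override)


default_pod_spec = {
    "metadata": {"namespace": "default"},
    "spec": {
        "serviceAccountName": "default",
        "containers": [{"name": "worker", "image": "example/ee:latest"}],
    },
}
override = {"spec": {"nodeSelector": {"disktype": "ssd"}}}

merged = deepmerge(default_pod_spec, override)
```

A shallow `dict.update()` with the same override would replace the entire `spec`, dropping the container entry, which is the behavior the commit avoids.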
Alan Rominger
b721a4b361 Remove dev-only log filters and downgrade periodic logs 2021-12-07 14:35:02 -05:00
chris meyers
9f8250bd47 add events to job lifecycle
* Note in the job lifecycle when the controller_node and execution_node
  are chosen. This event occurs most commonly in the task manager with a
  couple of exceptions that happen when we dynamically create dependent
  jobs on the fly in tasks.py
2021-11-10 08:50:16 +08:00
Alan Rominger
62e9e7ea80 Avoid setting controller_node to an execution node for container jobs (#11117) 2021-09-23 09:16:10 -04:00
Alan Rominger
daf4310176 Clean up work_type processing and fix execution vs control capacity (#10930)
* Clean up added work_type processing for mesh_code branch

* track both execution and control capacity

* Remove unused execution_capacity property

* Count all forms of capacity to make test pass

* Force jobs to be on execution nodes, updates on control nodes

* Introduce capacity_type property to abstract some details out

* Update test to cover all job types at same time

* Register OpenShift nodes as control types

* Remove unqualified consumed_capacity from task manager and make unit tests work

* Remove unqualified consumed_capacity from task manager and make unit tests work

* Update unit test to execution vs control TM logic changes

* Fix bug, else handling for work_type method
2021-08-26 07:24:14 -04:00
Alan Rominger
c3ad479fc6 Minor tweaks for the mesh_code branch from review (#10902) 2021-08-24 08:41:35 -04:00
beeankha
1a9fcdccc2 Change place where controller node is being looked for in the task manager 2021-08-24 08:41:35 -04:00
Alan Rominger
f47eb126e2 Adopt the node_type field in receptor logic (#10802)
* Adopt the node_type field in receptor logic

* Refactor Instance.objects.register so we do not reset capacity to 0
2021-08-24 08:41:34 -04:00
Alan Rominger
13300bdbd4 Update rebase to keep old control plane capacity check
Also do some basic work to separate control versus execution capacity
  this is to assure that we don't send jobs to the control node
2021-08-24 08:40:19 -04:00
Ryan Petrello
05cb876df5 implement an initial development environment for receptor-based clusters 2021-08-24 08:40:18 -04:00
Shane McDonald
ec8ac6f1a7 Introduce distinct controlplane instance group 2021-06-07 11:25:59 -04:00
Jim Ladd
84af610a1f remove rebase cruft 2021-06-04 09:17:09 -07:00
Jim Ladd
7b188aafea lint 2021-06-04 09:17:09 -07:00
Ryan Petrello
c7ab3ea86e move the partition data migration to be a post-upgrade async process
this copies the approach we took with the bigint migration
2021-06-04 09:17:07 -07:00
Jim Ladd
67046513ae Push changes before rebasing 2021-06-04 09:17:07 -07:00
Jim Ladd
0eb1984b22 Only create partitions for regular jobs 2021-06-04 09:17:06 -07:00
Jim Ladd
c87d7b0d79 fix import 2021-06-04 09:17:06 -07:00
Jim Ladd
612e91263c auto-create partition 2021-06-04 09:17:06 -07:00
Christian M. Adams
fe02c0b157 Fix error msg wording and sdb docs 2021-06-03 14:24:18 -04:00
Christian M. Adams
36f47f3696 The list secrets role rule is no longer needed for container groups 2021-05-26 14:38:56 -04:00
Christian M. Adams
536c02dc55 Simplify hostname parsing 2021-05-25 15:19:40 -04:00
Christian M. Adams
d607dfd5d8 Added error handling for pull secret creation requests
- Check (only) the existing secret to see if its value would change.
2021-05-25 14:58:01 -04:00
Christian M. Adams
cea6d8c3cb Use utf-8 & properly parse hostname from registry URL 2021-05-25 14:44:42 -04:00
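One way to parse the host from a registry URL or image reference is with the standard library `urllib.parse`, as sketched below. This is an assumption about the general approach, not the code from the commit; `registry_host` is a hypothetical helper.

```python
from urllib.parse import urlparse


def registry_host(url):
    """Return the host (and port, if any) from a registry URL or image reference."""
    if "://" not in url:
        # Without a scheme, urlparse treats the whole string as a path,
        # so mark the start of the network location explicitly.
        url = "//" + url
    return urlparse(url).netloc


host_from_url = registry_host("https://quay.io/ansible/awx-ee")
host_with_port = registry_host("registry.example.com:5000/team/image")
```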
Christian M. Adams
8316a1d198 Create pull secret in cluster and use it in PodSpec
- base64 encode secret values before creating the secret
- Construct valid .dockerconfigjson
- Cancel jobs that will obviously fail, and add error handling
- Check if the secret exists first, then attempt to replace it if it does.
2021-05-25 14:44:42 -04:00
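The first two steps, building a `.dockerconfigjson` document and base64-encoding it into a Secret, can be sketched as below. The function names, secret name, and credentials are hypothetical; the structure follows the standard Kubernetes `kubernetes.io/dockerconfigjson` Secret type, and the sketch only builds the manifest without talking to a cluster.

```python
import base64
import json


def build_dockerconfigjson(registry_host, username, password):
    """Build the JSON document that Kubernetes expects in .dockerconfigjson."""
    auth = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("utf-8")
    config = {
        "auths": {
            registry_host: {"username": username, "password": password, "auth": auth}
        }
    }
    return json.dumps(config)


def build_pull_secret(name, namespace, registry_host, username, password):
    """Return a Secret manifest; values under `data` must be base64 encoded."""
    dockerconfig = build_dockerconfigjson(registry_host, username, password)
    encoded = base64.b64encode(dockerconfig.encode("utf-8")).decode("utf-8")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "kubernetes.io/dockerconfigjson",
        "data": {".dockerconfigjson": encoded},
    }


secret = build_pull_secret("example-pull-secret", "default", "quay.io", "user", "pass")
```

A pod spec would then reference the secret by name under `imagePullSecrets`.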
Yanis Guenane
562f78e53d Rename awx to automation for pod names 2021-05-04 14:17:45 +02:00
Bill Nottingham
c8cf28f266 Assorted renaming and string changes 2021-04-30 14:32:05 -04:00
softwarefactory-project-zuul[bot]
6bea5dd294 Merge pull request #9957 from jbradberry/isolated-removal
Isolated removal

SUMMARY
Removal of the isolated nodes feature.
ISSUE TYPE

Feature Pull Request

COMPONENT NAME

API

AWX VERSION

Reviewed-by: Alan Rominger <arominge@redhat.com>
Reviewed-by: Jeff Bradberry <None>
Reviewed-by: Elyézer Rezende <None>
Reviewed-by: Bianca Henderson <beeankha@gmail.com>
2021-04-29 19:15:43 +00:00
Alan Rominger
67f7998ab9 Modify formatting in response to black update 2021-04-26 10:51:27 -04:00
Jeff Bradberry
1819a7963a Make the necessary changes to the models
- remove InstanceGroup.controller
- remove Instance.last_isolated_check
- remove .is_isolated and .is_controller methods/properties
- remove .choose_online_controller_node() method
- remove .supports_isolation() and replace with .can_run_containerized
- simplify .can_run_containerized
2021-04-22 10:17:02 -04:00
Ryan Petrello
300f5a3a1f use flake8 to lint for a few things black doesn't catch
black does *not* warn about missing or extraneous imports,
so let's bring back flake8 in our linting to check for them
2021-04-12 12:55:39 -04:00
Shane McDonald
2d48b24ef2 Update pod reaper to work with receptor launched pods 2021-04-05 17:45:15 -04:00
Ryan Petrello
c2ef0a6500 move code linting to a stricter pep8-esque auto-formatting tool, black 2021-03-23 09:39:58 -04:00
Ryan Petrello
f850f8d3e0 introduce a new global flag for denoting K8S-based deployments
- In K8S-based installs, only container groups are intended to be used
  for playbook execution (JTs, adhoc, inventory updates), so in this
  scenario, other job types have a task impact of zero.
- In K8S-based installs, traditional instances have *zero* capacity
  (because they're only members of the control plane, where services
  such as http/s and local control plane execution run)
- This commit also includes some changes that allow for the task manager
  to launch tasks with task_impact=0 on instances that have capacity=0
  (previously, an instance with zero capacity would never be selected
  as the "execution node")

This means that when IS_K8S=True, any Job Template associated with an
Instance Group will never actually go from pending -> running (because
there's no capacity - all playbooks must run through Container Groups).
For an improved ux, our intention is to introduce logic into the
operator install process such that the *default* group that's created at
install time is a *Container Group* that's configured to point at the
K8S cluster where awx itself is deployed.
2021-03-03 18:52:55 -05:00
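The capacity rule described in this commit can be illustrated with a small sketch: a task with zero impact still fits on an instance with zero capacity, while a task with non-zero impact does not. `select_instance` and the hostname are hypothetical and do not reflect the task manager's actual selection code.

```python
from typing import Dict, Optional


def select_instance(task_impact: int, capacities: Dict[str, int]) -> Optional[str]:
    """Return the first instance whose remaining capacity covers the task impact."""
    for hostname, remaining in capacities.items():
        # Using <= (rather than requiring remaining > 0) lets a
        # zero-impact task land on a zero-capacity instance.
        if task_impact <= remaining:
            return hostname
    return None


# In a K8S-based install the control plane instance has zero capacity.
control_plane = {"awx-task-0": 0}
project_update_node = select_instance(0, control_plane)  # zero-impact task
playbook_node = select_instance(1, control_plane)  # needs real capacity
```

The second call returns `None`, which mirrors why a Job Template tied to a regular Instance Group would stay pending when IS_K8S=True.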