Commit Graph

29 Commits

Author SHA1 Message Date
Elijah DeLee
604cbc1737 Consume control capacity (#11665)
* Select control node before start task

Consume capacity on control nodes for controlling tasks and consider
remainging capacity on control nodes before selecting them.

This depends on the requirement that control and hybrid nodes should all
be in the instance group named 'controlplane'. Many tests do not satisfy that
requirement. I'll update the tests in another commit.

* update tests to use controlplane

We don't start any tasks if we don't have a controlplane instance group

Due to updates to fixtures, update tests to set node type and capacity
explicitly so they get expected result.

* Fixes for accounting of control capacity consumed

Update method is used to account for currently consumed capacity for
instance groups in the in-memory capacity tracking data structure we initialize in
after_lock_init and then update via calculate_capacity_consumed (both in
task_manager.py)

Also update fit_task_to_instance to consider control impact on instances

Trust that these functions do the right thing looking for a
node with capacity, and cut out redundant check for the whole group's
capacity per Alan's reccomendation.

* Refactor now redundant code

Deal with control type tasks before we loop over the preferred instance
groups, which cuts out the need for some redundant logic.

Also, fix a bug where I was missing assigning the execution node in one case!

* set job explanation on tasks that need capacity

move the job explanation for jobs that need capacity to a function
so we can re-use it in the three places we need it.

* project updates always run on the controlplane

Instance group ordering makes no sense on project updates because they
always need to run on the control plane.

Also, since hybrid nodes should always run the control processes for the
jobs running on them as execution nodes, account for this when looking for a
execution node.

* fix misleading message

the variables and wording were both misleading, fix to be more accurate
description in the two different cases where this log may be emitted.

* use settings correctly

use settings.DEFAULT_CONTROL_PLANE_QUEUE_NAME instead of a hardcoded
name
cache the controlplane_ig object during the after lock init to avoid
an uneccesary query
eliminate mistakenly duplicated AWX_CONTROL_PLANE_TASK_IMPACT and use
only AWX_CONTROL_NODE_TASK_IMPACT

* add test for control capacity consumption

add test to verify that when there are 2 jobs and only capacity for one
that one will move into waiting and the other stays in pending

* add test for hybrid node capacity consumption

assert that the hybrid node is used for both control and execution and
capacity is deducted correctly

* add test for task.capacity_type = control

Test that control type tasks have the right capacity consumed and
get assigned to the right instance group

Also fix lint in the tests

* jobs_running not accurate for control nodes

We can either NOT use "idle instances" for control nodes, or we need
to update the jobs_running property on the Instance model to count
jobs where the node is the controller_node.

I didn't do that because it may be an expensive query, and it would be
hard to make it match with jobs_running on the InstanceGroup which
filters on tasks assigned to the instance group.

This change chooses to stop considering "idle" control nodes an option,
since we can't acurrately identify them.

The way things are without any change, is we are continuing to over consume capacity on control nodes
because this method sees all control nodes as "idle" at the beginning
of the task manager run, and then only counts jobs started in that run
in the in-memory tracking. So jobs which last over a number of task
manager runs build up consuming capacity, which is accurately reported
via Instance.consumed_capacity

* Reduce default task impact for control nodes

This is something we can experiment with as far as what users
want at install time, but start with just 1 for now.

* update capacity docs

Describe usage of the new setting and the concept of control impact.

Co-authored-by: Alan Rominger <arominge@redhat.com>
Co-authored-by: Rebeccah <rhunter@redhat.com>
2022-02-14 10:13:22 -05:00
Jeff Bradberry
bb14a95076 Remove the Instance.objects.active_count() method
Literally nothing uses it.  The similar Host.objects.active_count()
method seems to be what is actually important for licensing.
2022-01-14 16:21:41 -05:00
Amol Gautam
a4a3ba65d7 Refactored tasks.py to a package
--- Added 3 new sub-package : awx.main.tasks.system , awx.main.tasks.jobs , awx.main.tasks.receptor
--- Modified the functional tests and unit tests accordingly
2022-01-14 11:55:41 -05:00
Jeff Bradberry
f340f491dc Control the visibility and use of hop node Instances
- the list, detail, and health check API views should not include them
- the Instance-InstanceGroup association views should not allow them
  to be changed
- the ping view excludes them
- list_instances management command excludes them
- Instance.set_capacity_value sets hop nodes to 0 capacity
- TaskManager will exclude them from the nodes available for job execution
- TaskManager.reap_jobs_from_orphaned_instances will consider hop nodes
  to be an orphaned instance
- The apply_cluster_membership_policies task will not manipulate hop nodes
- get_broadcast_hosts will ignore hop nodes
- active_count also will ignore hop nodes
2021-12-17 14:30:28 -05:00
Jeff Bradberry
62d50d27be Update a couple of the existing tests 2021-11-10 08:50:12 +08:00
Alan Rominger
46ac9506e6 Assure consistent ordering with default IG first (#11034)
* Assure consistent ordering with default IG first

* Write conditional a little more defensively to pass tests
2021-09-08 11:11:46 -04:00
Alan Rominger
6a17e5b65b Allow manually running a health check, and make other adjustments to the health check trigger (#11002)
* Full finalize the planned work for health checks of execution nodes

* Implementation of instance health_check endpoint

* Also do version conditional to node_type

* Do not use receptor mesh to check main cluster nodes health

* Fix bugs from testing health check of cluster nodes, add doc

* Add a few fields to health check serializer missed before

* Light refactoring of error field processing

* Fix errors clearing error, write more unit tests

* Update health check info in docs

* Bump migration of health check after rebase

* Mark string for translation

* Add related health_check link for system auditors too

* Handle health_check cluster node timeout, add errors for peer judgement
2021-09-03 16:37:37 -04:00
Alan Rominger
5a6e9a06e2 Exclude control-only nodes from IG policy calculations (#10985)
* Exclude control-only nodes from IG policy calculations

Also, as a reverse to this, exclude execution-only nodes from
  the calculations if the group in question is the controlplane

* Incorporate review comments
2021-09-01 08:09:46 -04:00
Alan Rominger
68f79a1f3a Always use controlplane as project update backup IG (#10949)
* Always use controlplane as project update backup IG

Before, this was done conditionally to container_group jobs
this logic changes it so that controlgroup will always be a
firm backstop for project updates

* Code a little more defensively to make unit tests pass

* Fix unit tests
2021-08-30 14:23:09 -04:00
Alan Rominger
daf4310176 Clean up work_type processing and fix execution vs control capacity (#10930)
* Clean up added work_type processing for mesh_code branch

* track both execution and control capacity

* Remove unused execution_capacity property

* Count all forms of capacity to make test pass

* Force jobs to be on execution nodes, updates on control nodes

* Introduce capacity_type property to abstract some details out

* Update test to cover all job types at same time

* Register OpenShift nodes as control types

* Remove unqualified consumed_capacity from task manager and make unit tests work

* Remove unqualified consumed_capacity from task manager and make unit tests work

* Update unit test to execution vs control TM logic changes

* Fix bug, else handling for work_type method
2021-08-26 07:24:14 -04:00
Ryan Petrello
c2ef0a6500 move code linting to a stricter pep8-esque auto-formatting tool, black 2021-03-23 09:39:58 -04:00
Ryan Petrello
8b00b8c9c2 remove deprecated legacy manual inventory source support
see: https://github.com/ansible/awx/issues/6309
2020-04-03 10:54:43 -04:00
AlanCoding
aa4842aea5 Various JT.organization cleanup items
cleanup from PR review suggestions

bump migration number

fix test

revert change to old-app JT form no longer needed
2020-03-12 15:45:47 -04:00
AlanCoding
7d0b207571 Organization on JT as read-only field
Set JT.organization with value from its project

Remove validation requiring JT.organization

Undo some of the additional org definitions in tests

Revert some tests no longer needed for feature

exclude workflow approvals from unified organization field

revert awxkit changes for providing organization

Roll back additional JT creation permission requirement

Fix up more issues by persisting organization field when project is removed

Restrict project org editing, logging, and testing

Grant removed inventory org admin permissions in migration

Add special validate_unique for job templates
  this deals with enforcing name-organization uniqueness

Add back in special message where config is unknown
  when receiving 403 on job relaunch

Fix logical and performance bugs with data migration

within JT.inventory.organization make-permission-explicit migration

remove nested loops so we do .iterator() on JT queryset

in reverse migration, carefully remove execute role on JT
  held by org admins of inventory organization,
  as well as the execute_role holders

Use current state of Role model in logic, with 1 notable exception
  that is used to filter on ancestors
  the ancestor and descentent relationship in the migration model
    is not reliable
  output of this is saved as an integer list to avoid future
    compatibility errors

make the parents rebuilding logic skip over irrelevant models
  this is the largest performance gain for small resource numbers
2020-03-12 15:45:46 -04:00
AlanCoding
daa9282790 Initial (editable) pass of adding JT.organization
This is the old version of this feature from 2019
  this allows setting the organization in the data sent
  to the API when creating a JT, and exposes the field
  in the UI as well

Subsequent commit changes the field from editable
  to read-only, but as of this commit, the machinery
  is not hooked up to infer it from project
2020-03-12 15:45:46 -04:00
Ryan Petrello
f223df303f convert py2 -> py3 2019-01-15 14:09:01 -05:00
AlanCoding
a99ebbb02f Apply policy task more selectively 2018-08-03 06:56:15 -04:00
Ryan Petrello
3cdd0a94bd simplify dynamic queue binding
we recently made a change so that instances no longer bind to
instance-group specific queues, but now instead they each bind to
a direct queue for their specific hostname
(https://github.com/ansible/tower/pull/1922)

Because of this, we shouldn't *need* to reconfigure the queue binds at
runtime anymore when group membership changes. Under our new model,
every celeryd listens on a queue named after its hostname; when the
scheduler finds a task to run, it picks an Instance in the target
Instance Group and sends the task to the queue for that Instance's
hostname.
2018-07-28 00:48:37 -04:00
Ryan Petrello
15aaca8f03 make InstanceGroup.policy_instance_list non-exclusive by default
see: https://github.com/ansible/tower/issues/2583
2018-07-25 17:26:13 -04:00
Ryan Petrello
870adc14f9 fix a bug in the instance policy algorithm when both min and % are used
see: https://github.com/ansible/tower/issues/897
2018-05-24 12:28:46 -04:00
chris meyers
14c6265b27 ensure instance policy percentages round up 2018-04-25 10:11:40 -04:00
chris meyers
a56771c8f0 send all tower work to a user-hidden queue
* Before, we had a special group, tower, that ran any async work that
tower needed done. This allowed users fine grain control over which
nodes did background work. However, this granularity was too complicated
for users. So now, all tower system work goes to a special non-user
exposed celery queue. Tower remains the fallback instance group to
execute jobs on. The tower group will be created upon install and
protected from deletion.
2018-04-20 13:04:36 -04:00
chris meyers
838b723c73 add all instances to special tower instance group
* All instances except isolated instances
* Also, prevent any tower attributes from being modified via the API
2018-03-29 16:47:52 -04:00
Matthew Jones
4002f2071d Adding instance group policy unit tests
also remove async call for applying topology change
2018-02-07 11:14:53 -05:00
AlanCoding
aca1efa552 do not use inventory source instance groups 2017-08-04 13:11:59 -04:00
adamscmRH
7d2102102b Fix duplicate instances, & styling 2017-07-03 15:44:12 -04:00
adamscmRH
4b3b184fc8 Fix duplicate Instances in API 2017-07-03 15:23:39 -04:00
AlanCoding
fbf24492d8 Instance group preference order consistent with docs 2017-05-15 15:45:10 -04:00
Matthew Jones
5508bad97c Updating unit tests for task manager refactoring
* Purging old task manager unit tests
* Migrating those tests to functional tests
* Updating fixtures and factories to support a change in the way the
  task manager is tested
* Fix an issue with the mk_credential fixture when used in functional
  tests. Previously it had trouble with multiplel invocations when
  persistence was used
2017-05-10 12:45:37 -04:00