Commit Graph

151 Commits

Author SHA1 Message Date
Seth Foster
3e8202590c Remove Disconnected link state
Dynamically flipping from Established
to Disconnected is not the intended
usage of InstanceLink State.

- Link state starts in Adding and becomes
Established once any control node first sees the link
is in the status KnownConnectionCosts
2023-08-29 13:06:54 -04:00
Seth Foster
2bf6512a8e Do not change link state if Removing
inspect_established_receptor_connections should
not change link state if current state is Removing.

Other changes:
- rename inspect_execution_nodes to inspect_execution_and_hop_nodes
- Default link state is Adding
- Set min listener_port value to 1024
- inspect_established_receptor_connections now
runs as part of cluster_node_heartbeat task
2023-08-29 13:06:54 -04:00
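The link-state rules described in the two commit messages above can be sketched roughly as follows. This is an illustrative sketch only, not the AWX implementation: the `LinkState` enum and the `next_link_state` helper are hypothetical names, and only the state names (Adding, Established, Removing) come from the commit messages.

```python
from enum import Enum


class LinkState(Enum):
    # State names are from the commit messages; the enum itself is hypothetical.
    ADDING = "adding"
    ESTABLISHED = "established"
    REMOVING = "removing"


def next_link_state(current: LinkState, seen_in_known_connection_costs: bool) -> LinkState:
    """Return the state a link should have after one inspection pass."""
    # A link that is being removed is never changed by the inspection.
    if current is LinkState.REMOVING:
        return current
    # Adding becomes Established the first time a control node sees the link.
    if current is LinkState.ADDING and seen_in_known_connection_costs:
        return LinkState.ESTABLISHED
    # There is no transition from Established back to a disconnected state.
    return current
```

In this sketch an Established link that is temporarily not reported stays Established, which mirrors the removal of the Disconnected state.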
Seth Foster
e41ad82687 optional listener port UI (#14300) 2023-08-29 13:06:54 -04:00
Seth Foster
ed2a59c1a3 receptor python packages 2023-08-29 13:06:54 -04:00
Seth Foster
2a51f23b7d Add functional API tests
add tests for calling write_receptor_config

add write_receptor_config test

Do not set default listener_port on control node
2023-08-29 13:06:54 -04:00
Jake Jackson
40fca6db57 [hop_node] Validate listener_port is defined for peers (#14056)
add peer listener_port validation and update install bundle if listener_port is defined or not defined.
2023-08-29 13:06:54 -04:00
Lorenzo Tanganelli
f7fdb7fe8d Add peers readonly api and instancelink constraint (#13916)
Add Disconnected link state

introspect_receptor_connections is a periodic
task that examines active receptor connections
and cross-checks it with the InstanceLink info.

Any links that should be active but are not
will be put into a Disconnected state. If
active, it will be in an Established state.

UI - Add hop creation and peers mgmt (#13922)

* add UI for mgmt peers, instance edit and add

* add peer info on detail and bug fix on detail

* remove unused chip and change peer label

* rename lookup, put Instance type disable on edit

---------

Co-authored-by: tanganellilore <lorenzo.tanagnelli@hotmail.it>
2023-08-29 13:06:54 -04:00
Seth Foster
d8abd4912b Add support in hop nodes in API 2023-08-29 13:06:54 -04:00
John Westcott IV
7e25a694f3 Making all non-complicated JSONBlobs JSONFields 2023-06-14 17:40:15 -04:00
Gabriel Muniz
a63067da38 Add instance groups roles (#13584)
* adding roles to instance groups
added ResourceMixin to Instancegroup and changed the filtered_queryset

* added necessary changes to rebuild relationship between IG and roles

* added description to InstanceGroupAccess

* preliminary ui plug for demo purposes

* preliminary ui plug for demo purposes
added inventory special logic for use_role to allow attaching instance groups
added more tests to handle those cases

* Add access_list to InstanceGroup

* scratch branch to test migration work

* refactored to shorten logic

* Added migration and am removing logic that enabled Org admin permissions

* Add Obj admin role to JT, Inv, Org

* Changed tests to reflect new permissions

* refactored some of the tests

* cleaned up more tests and reworded help on InstanceGroupAccess

* Removed unnecessary delete of Route for instance group perms change

* Fix UI tests and migration

* fixed permissions on prompt for InstanceGroups

* added related object roles endpoint

* added ui/api function for options instance_groups

* separate the migrations in order to avoid issues with migrations not being finished

* changed migrations parent class to disable the activity stream error in migrations

* Added logging to migration as activitystream is disabled

* added clarifying comment to jobtemplateaccess and linted UI addition

* renamed migrations to avoid collisions

* Rename migrations to avoid collisions
2023-03-14 21:37:22 -04:00
Alan Rominger
f5785976be Update to comply with new black rules 2023-02-01 14:59:38 -05:00
Elijah DeLee
e403c603d6 use task manager models more consistently in serializer 2022-11-30 17:14:33 -05:00
Elijah DeLee
86856f242a Add max concurrent jobs and max forks per ig
The intention of this feature is primarily to provide some notion of max
capacity of container groups, but the logic I've left generic. Default
is 0, which will be interpreted as no maximum number of jobs or forks.

Includes refactor of variable and method names for clarity.
instances_by_hostname is an internal attribute of TaskManagerInstances.
Clarify when we are expecting the actual TaskManagerInstances object.

Unify how we process running tasks and consume capacity. This has the
effect that we do less expensive work in after_lock_init and have 1 less
loop over all the running tasks. Previously we looped for both building
the dependency graph as well as for calculating the starting capacity of
all the instances and instance groups. Now we achieve both tasks in the
same loop.

Because of how this changes the somewhat subtle "do-si-do" of how to
initialize the Task Manager models, introduce a wrapper class that tries
to take some of that burden off of other areas where we re-use this like
in the serializer and the metrics. Also use this wrapper class to handle
niceties of how to track capacity consumption on instances and instance
groups.

Add tests for max_forks and max_concurrent_jobs

Fixup tests that use TaskManagerModels to accommodate changes.

assign ig before call to consume capacity

if we don't do it in that order, then we don't correctly account for
the container group jobs we are starting in the middle of the task
manager run
2022-11-30 17:14:33 -05:00
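The "0 means no maximum" convention from the commit message above could look roughly like this. It is a minimal sketch under stated assumptions: `GroupLimits` and `has_room` are hypothetical names, and the real task manager tracks capacity through its own model classes.

```python
from dataclasses import dataclass


@dataclass
class GroupLimits:
    # Hypothetical container; the field names mirror the feature names in the commit.
    max_concurrent_jobs: int = 0  # 0 means no maximum
    max_forks: int = 0            # 0 means no maximum


def has_room(limits: GroupLimits, running_jobs: int, forks_in_use: int, task_forks: int) -> bool:
    """Return True if one more task needing `task_forks` forks fits under the limits."""
    # Only enforce a limit when it is set to a positive value.
    if limits.max_concurrent_jobs > 0 and running_jobs + 1 > limits.max_concurrent_jobs:
        return False
    if limits.max_forks > 0 and forks_in_use + task_forks > limits.max_forks:
        return False
    return True
```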
Alan Rominger
d571b9bbbc Refactor test_get_cleanup_task_kwargs_active_jobs and add new test
This takes some logic out of the queryset logic,
  using some established assumptions about the task manager
  if a job lands on a hybrid node (or is a project update) then
  it will have the same controller and execution node

With that established, the queryset can be simplified
2022-11-02 15:14:16 -04:00
Jeff Bradberry
65179d9cd0 Add a new Instance.health_check_started field
This will enable us to provide more useful information for the user,
now that all user-triggered health checks are async.

Also, de-bounce the health check endpoint to not allow additional
health check tasks to be triggered when one is already in progress.
2022-09-27 17:09:41 -04:00
Jeff Bradberry
08c18d71bf Move InstanceLink creation and updating to the async tasks
So that they get applied in situations that do not go through the API.
2022-09-23 09:46:14 -04:00
Seth Foster
eaa4f2483f Run instance health check in task container
awx-web container does not have access to receptor socket, and the
execution node health check requires receptorctl.

This change runs the health check asynchronously in the task container.
2022-09-23 09:46:14 -04:00
Jeff Bradberry
1b650d6927 When deprovisioning a node, kick off a task that waits on running jobs
After all jobs on the node are complete, delete the node then
broadcast the write_receptor_config task.

Also, make sure that write_receptor_config updates the state of links
that are in 'adding' state.
2022-09-23 09:46:13 -04:00
Jeff Bradberry
3bc86ca8cb Follow up on new execution node creation
- hop nodes are descoped
- links need to be created on execution node creation
- expose the 'edit' capabilities on the instance serializer
2022-09-23 09:46:13 -04:00
Shane McDonald
9b034ad574 generate control node receptor.conf
when a new remote execution/hop node is added
regenerate the receptor.conf for all control nodes to
peer out to the new remote execution node

Signed-off-by: Hao Liu <haoli@redhat.com>
Co-Authored-By: Seth Foster <fosterseth@users.noreply.github.com>
Co-Authored-By: Shane McDonald <me@shanemcd.com>
2022-09-23 09:46:12 -04:00
Jeff Bradberry
3bcd539b3d Make sure that the health checks handle the state transitions properly
- nodes with states Provisioning, Provisioning Fail, Deprovisioning,
  and Deprovisioning Fail should bypass health checks and should never
  transition due to the existing machinery
- nodes with states Unavailable and Installed can transition to Ready
  if they check out as healthy
- nodes in the Ready state should transition to Unavailable if they
  fail a check
2022-09-23 09:46:11 -04:00
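The transition rules listed in the commit message above can be written as a small function. This is a sketch, not the AWX code: the function name is hypothetical, and the state strings reuse the display names from the commit message rather than whatever values the `Instance` model actually stores.

```python
# States that the health check must leave alone (display names from the commit message).
BYPASS_STATES = {
    "Provisioning",
    "Provisioning Fail",
    "Deprovisioning",
    "Deprovisioning Fail",
}


def state_after_health_check(state: str, healthy: bool) -> str:
    """Return the node state after a health check result is applied."""
    if state in BYPASS_STATES:
        return state  # never transition because of a health check
    if state in {"Unavailable", "Installed"} and healthy:
        return "Ready"
    if state == "Ready" and not healthy:
        return "Unavailable"
    return state  # any other combination is left unchanged
```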
Jeff Bradberry
2fba3db48f Add state fields to Instance and InstanceLink
Also, listener_port to Instance.
2022-09-23 09:46:11 -04:00
Alan Rominger
61093b2532 Treat instance_groups prompt as template-less 2022-09-22 16:08:22 -04:00
Alan Rominger
68e11d2b81 Add WorkflowJob.instance_groups and distinguish from char_prompts
This removes a loop that ran on import
  the loop was giving the wrong behavior
  and it initialized too many fields as char_prompts fields

With this, we will now enumerate the char_prompts type fields manually
2022-09-22 15:39:49 -04:00
John Westcott IV
4f5596eb0c Adding unit/functional tests, fixing tests
Making common class for LabelList

Fixing related field name

Fixing get_effective_slice_ct to look for correct field and also override _eager_field
2022-09-22 15:39:16 -04:00
John Westcott IV
809df74050 Adding EE/IG/labels/forks/timeout/job_slice_count to schedules
Modifying schedules to work with related fields

Updating awx.awx.workflow_job_template_node
2022-09-22 15:35:27 -04:00
John Westcott IV
33c0fb79d6 JT param everything (#12646)
* Making almost all fields promptable on job templates and config models
* Adding EE, IG and label access checks
* Changing jobs preferred instance group function to handle the new IG cache field
* Adding new ask fields to job template modules
* Address unit/functional tests
* Adding migration file
2022-09-22 15:16:12 -04:00
Shane McDonald
c5976e2584 Add setting for missed heartbeats before marking node offline 2022-08-17 11:39:30 -04:00
Seth Foster
e6f8852b05 Cache task_impact
task_impact is now a field on the database
It is calculated and set during create_unified_job

set task_impact on .save for adhoc commands
2022-08-05 14:33:47 -04:00
Alan Rominger
cb63d92bbf Remove committed_capacity field, delete supporting code (#12086)
* Remove committed_capacity field, delete supporting code

* Track consumed capacity to solve the negatives problem

* Use more verbose name for IG queryset
2022-04-22 13:41:32 -04:00
Elijah DeLee
689a216726 move static methods used by task manager (#12050)
* move static methods used by task manager

These static methods were being used to act on Instance-like objects
that were SimpleNamespace objects with the necessary attributes.

This change introduces dedicated classes to replace the SimpleNamespace
objects and moves the formerly static methods to a place where they are
more relevant instead of tacked onto models to which they were only
loosely related.

Accept in-memory data structure in init methods for tests

* initialize remaining capacity AFTER we built map of instances
2022-04-21 13:05:06 -04:00
Elijah DeLee
2e9974133a calculate remaining capacity in static method
this is to avoid additional queries when we already have all
the active jobs fetched in the task manager
2022-04-13 11:56:07 -04:00
Elijah DeLee
4328b4cb67 drop call that queries all running and waiting jobs
this is to fix one more place in the task manager where we end up
querying all running and waiting jobs.

Partial fix for https://github.com/ansible/awx/issues/11671
2022-04-12 10:31:47 -04:00
Alan Rominger
3d22c8ae91 Merge pull request #11968 from AlanCoding/cleanup_tweaks
Minor tweaks to ansible-runner cleanup task arguments
2022-03-29 15:00:33 -04:00
Alan Rominger
deac08ba8a Add regression test for overly aggressive cleanup behavior 2022-03-28 22:23:33 -04:00
Jeff Bradberry
6c1adade25 Merge pull request #11947 from jbradberry/django-3.2-upgrade
Remove the out-of-band JSONField migration
2022-03-28 12:02:53 -04:00
Alan Rominger
85ec83c3fd Minor tweaks to ansible-runner cleanup task arguments 2022-03-28 10:52:09 -04:00
Lucas Dias
01ce3440eb added os.path and module import 2022-03-25 14:26:00 +01:00
Jeff Bradberry
e3f3ab224a Replace all previously text-based json fields with JSONBlob
This JSONBlob field type is a wrapper around Django's new generic
JSONField, but with the database column type forced to be text.  This
should behave close enough to our old wrapper around
django-jsonfield's JSONField and will avoid needing to do the
out-of-band database migration.
2022-03-24 15:21:54 -04:00
Lucas Dias
18b1440d7c fixed hardcode tmp ha.py 2022-03-24 17:59:43 +01:00
Jeff Bradberry
ac6a82eee4 Merge pull request #11654 from jbradberry/django-3.2-upgrade
Django 3.2 upgrade
2022-03-17 10:34:22 -04:00
Alan Rominger
99bbc347ec Fill in errors for hop nodes when Last Seen is out of date, and clear them when not (#11714)
* Process unresponsive and newly responsive hop nodes

* Use more natural way to zero hop node capacity, add test

* Use warning as opposed to warn for log messages
2022-03-09 13:21:32 -05:00
Jeff Bradberry
05142a779d Replace all usage of customized json fields with the Django builtin
The event_data field on event models, however, is getting an
overridden version that retains the underlying text data type for the
column, to avoid a heavy data migration on those tables.

Also, certain of the larger tables are getting these fields with the
NOT NULL constraint turned off, to avoid a long migration.

Remove the django.utils.six monkey patch we did at the beginning of
the upgrade.
2022-03-07 18:11:36 -05:00
Jeff Bradberry
b852baaa39 Fix up logger .warn() calls to use .warning() instead
This is a usage that was deprecated in Python 3.0.
2022-03-07 18:11:36 -05:00
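For illustration, the kind of change this commit describes looks like the following. The logger name and message are made up for the example; only the `warn` to `warning` rename comes from the commit message.

```python
import logging

logger = logging.getLogger("example")  # hypothetical logger name

# Deprecated alias, replaced by this kind of change:
#   logger.warn("node %s is unresponsive", "hop-1")
# Preferred spelling with identical arguments:
logger.warning("node %s is unresponsive", "hop-1")
```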
Jeff Bradberry
a3a216f91f Fix up new Django 3.0 deprecations
Mostly text based: force/smart_text, ugettext_*
2022-03-07 18:11:36 -05:00
Elijah DeLee
799968460d Fixup conversion of memory and cpu settings to support k8s resource request format (#11725)
fix memory and cpu settings to support k8s resource request format

* fix conversion of memory setting to bytes

This setting has not been getting set by default, and needed some fixing
up to be compatible with setting the memory in the same way as we set it
in the operator, as well as with other changes from last year which
assume that ansible runner is returning memory in bytes.

This way we can start setting this setting in the operator, and get a
more accurate reflection of how much memory is available to the control
pod in k8s.

On platforms where services are all sharing memory, we deduct a
penalty from the memory available. On k8s we don't need to do this
because the web, redis, and task containers each have memory
allocated to them.

* Support CPU setting expressed in units used by k8s

This setting has not been getting set by default, and needed some fixing
up to be compatible with setting the CPU resource request/limits in the
same way as we set it in the resource requests/limits.

This way we can start setting this setting in the
operator, and get a more accurate reflection of how much cpu is
available to the control pod in k8s.

Because cpu on k8s can be partial cores, migrate cpu field to decimal.

k8s does not allow granularity of less than 100m (equivalent to 0.1 cores), so only
store up to 1 decimal place.

fix analytics to deal with decimal cpu

need to use DjangoJSONEncoder when Decimal fields in data passed to
json.dumps
2022-02-15 14:08:24 -05:00
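The conversions described in the commit message above could be sketched as follows. This is an assumption-laden sketch, not the AWX implementation: the helper names are hypothetical, only a subset of Kubernetes quantity suffixes is handled, and rounding down to one decimal place is a choice made here for the example.

```python
from decimal import Decimal, ROUND_DOWN

# Subset of Kubernetes memory suffixes: binary (Ki, Mi, ...) and decimal (k, M, ...).
_MEMORY_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def cpu_to_cores(value: str) -> Decimal:
    """Convert a CPU quantity such as '500m' or '2' to cores with one decimal place."""
    if value.endswith("m"):
        cores = Decimal(value[:-1]) / Decimal(1000)  # millicores to cores
    else:
        cores = Decimal(value)
    # Keep one decimal place; rounding down is an assumption of this sketch.
    return cores.quantize(Decimal("0.1"), rounding=ROUND_DOWN)


def memory_to_bytes(value: str) -> int:
    """Convert a memory quantity such as '512Mi' or '1G' to a number of bytes."""
    # Check two-letter suffixes before one-letter ones.
    for suffix in sorted(_MEMORY_UNITS, key=len, reverse=True):
        if value.endswith(suffix):
            return int(Decimal(value[: -len(suffix)]) * _MEMORY_UNITS[suffix])
    return int(value)  # a plain number is already in bytes
```

Because `json.dumps` cannot serialize `Decimal` values on its own, data that carries the result of `cpu_to_cores` needs an encoder that handles them, which is why the commit switches to `DjangoJSONEncoder`.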
Elijah DeLee
604cbc1737 Consume control capacity (#11665)
* Select control node before start task

Consume capacity on control nodes for controlling tasks and consider
remaining capacity on control nodes before selecting them.

This depends on the requirement that control and hybrid nodes should all
be in the instance group named 'controlplane'. Many tests do not satisfy that
requirement. I'll update the tests in another commit.

* update tests to use controlplane

We don't start any tasks if we don't have a controlplane instance group

Due to updates to fixtures, update tests to set node type and capacity
explicitly so they get expected result.

* Fixes for accounting of control capacity consumed

Update method is used to account for currently consumed capacity for
instance groups in the in-memory capacity tracking data structure we initialize in
after_lock_init and then update via calculate_capacity_consumed (both in
task_manager.py)

Also update fit_task_to_instance to consider control impact on instances

Trust that these functions do the right thing looking for a
node with capacity, and cut out redundant check for the whole group's
capacity per Alan's recommendation.

* Refactor now redundant code

Deal with control type tasks before we loop over the preferred instance
groups, which cuts out the need for some redundant logic.

Also, fix a bug where I was missing assigning the execution node in one case!

* set job explanation on tasks that need capacity

move the job explanation for jobs that need capacity to a function
so we can re-use it in the three places we need it.

* project updates always run on the controlplane

Instance group ordering makes no sense on project updates because they
always need to run on the control plane.

Also, since hybrid nodes should always run the control processes for the
jobs running on them as execution nodes, account for this when looking for an
execution node.

* fix misleading message

the variables and wording were both misleading, fix to be more accurate
description in the two different cases where this log may be emitted.

* use settings correctly

use settings.DEFAULT_CONTROL_PLANE_QUEUE_NAME instead of a hardcoded
name
cache the controlplane_ig object during the after lock init to avoid
an unnecessary query
eliminate mistakenly duplicated AWX_CONTROL_PLANE_TASK_IMPACT and use
only AWX_CONTROL_NODE_TASK_IMPACT

* add test for control capacity consumption

add test to verify that when there are 2 jobs and only capacity for one
that one will move into waiting and the other stays in pending

* add test for hybrid node capacity consumption

assert that the hybrid node is used for both control and execution and
capacity is deducted correctly

* add test for task.capacity_type = control

Test that control type tasks have the right capacity consumed and
get assigned to the right instance group

Also fix lint in the tests

* jobs_running not accurate for control nodes

We can either NOT use "idle instances" for control nodes, or we need
to update the jobs_running property on the Instance model to count
jobs where the node is the controller_node.

I didn't do that because it may be an expensive query, and it would be
hard to make it match with jobs_running on the InstanceGroup which
filters on tasks assigned to the instance group.

This change chooses to stop considering "idle" control nodes an option,
since we can't accurately identify them.

Without any change, we continue to over-consume capacity on control nodes
because this method sees all control nodes as "idle" at the beginning
of the task manager run, and then only counts jobs started in that run
in the in-memory tracking. So jobs which last over a number of task
manager runs build up consuming capacity, which is accurately reported
via Instance.consumed_capacity

* Reduce default task impact for control nodes

This is something we can experiment with as far as what users
want at install time, but start with just 1 for now.

* update capacity docs

Describe usage of the new setting and the concept of control impact.

Co-authored-by: Alan Rominger <arominge@redhat.com>
Co-authored-by: Rebeccah <rhunter@redhat.com>
2022-02-14 10:13:22 -05:00
Alan Rominger
285ff080d0 Prevent duplicate query in local health check 2022-01-27 15:27:07 -05:00
Jeff Bradberry
334c33ca07 Handle receptorctl advertisements for hop nodes
counting it towards their heartbeat.  Also, leave off the link to the
health check endpoint from hop node Instances.
2022-01-24 16:51:45 -05:00
Amol Gautam
a4a3ba65d7 Refactored tasks.py to a package
--- Added 3 new sub-packages: awx.main.tasks.system, awx.main.tasks.jobs, awx.main.tasks.receptor
--- Modified the functional tests and unit tests accordingly
2022-01-14 11:55:41 -05:00