Commit Graph

151 Commits

Author SHA1 Message Date
Jeff Bradberry
f340f491dc Control the visibility and use of hop node Instances
- the list, detail, and health check API views should not include them
- the Instance-InstanceGroup association views should not allow them
  to be changed
- the ping view excludes them
- list_instances management command excludes them
- Instance.set_capacity_value sets hop nodes to 0 capacity
- TaskManager will exclude them from the nodes available for job execution
- TaskManager.reap_jobs_from_orphaned_instances will consider hop nodes
  to be an orphaned instance
- The apply_cluster_membership_policies task will not manipulate hop nodes
- get_broadcast_hosts will ignore hop nodes
- active_count also will ignore hop nodes
2021-12-17 14:30:28 -05:00
Jeff Bradberry
c8f1e714e1 Capture hop nodes and the peer links between nodes 2021-12-17 14:30:18 -05:00
Alan Rominger
7b35902d33 Respect settings to keep files and work units
Add new logic to cleanup orphaned work units
  from administrative tasks

Remove noisy log which is often irrelevant
  about running-cleanup-on-execution-nodes
  we already have other logs for this
2021-11-10 08:50:10 +08:00
Alan Rominger
b70793db5c Consolidate cleanup actions under new ansible-runner worker cleanup command (#11160)
* Primary development of integrating runner cleanup command

* Fixup image cleanup signals and their tests

* Use alphabetical sort to solve the cluster coordination problem

* Update test to new pattern

* Clarity edits to interface with ansible-runner cleanup method

* Another change corresponding to ansible-runner CLI updates

* Fix incomplete implementation of receptor remote cleanup

* Share receptor utils code between worker_info and cleanup

* Complete task logging from calling runner cleanup command

* Wrap up unit tests and some contract changes that fall out of those

* Fix bug in CLI construction

* Fix queryset filter bug
2021-10-05 16:32:03 -04:00
Alan Rominger
3fc63489f1 Filter controller_node selection to online nodes (#11120) 2021-09-24 23:01:32 -04:00
Rebeccah
55f2125a51 if the user provides a uuid and it exists, allow that to tie to the instance, which allows the user to update the instance based on the UUID (includeding updating the hostname) should they choose to do so. 2021-09-21 16:54:11 -04:00
Alan Rominger
6a17e5b65b Allow manually running a health check, and make other adjustments to the health check trigger (#11002)
* Full finalize the planned work for health checks of execution nodes

* Implementation of instance health_check endpoint

* Also do version conditional to node_type

* Do not use receptor mesh to check main cluster nodes health

* Fix bugs from testing health check of cluster nodes, add doc

* Add a few fields to health check serializer missed before

* Light refactoring of error field processing

* Fix errors clearing error, write more unit tests

* Update health check info in docs

* Bump migration of health check after rebase

* Mark string for translation

* Add related health_check link for system auditors too

* Handle health_check cluster node timeout, add errors for peer judgement
2021-09-03 16:37:37 -04:00
Jim Ladd
262cd3c695 set default uuid 2021-09-03 10:05:15 -07:00
Alan Rominger
424dbe8208 Use ansible-runner imports for cpu and memory calculation (#10954)
* Use ansible-runner imports for cpu and memory calculation

* Fix bug with capacity and memory adjustment
2021-08-27 21:46:53 -04:00
Alan Rominger
daf4310176 Clean up work_type processing and fix execution vs control capacity (#10930)
* Clean up added work_type processing for mesh_code branch

* track both execution and control capacity

* Remove unused execution_capacity property

* Count all forms of capacity to make test pass

* Force jobs to be on execution nodes, updates on control nodes

* Introduce capacity_type property to abstract some details out

* Update test to cover all job types at same time

* Register OpenShift nodes as control types

* Remove unqualified consumed_capacity from task manager and make unit tests work

* Remove unqualified consumed_capacity from task manager and make unit tests work

* Update unit test to execution vs control TM logic changes

* Fix bug, else handling for work_type method
2021-08-26 07:24:14 -04:00
Alan Rominger
940c189c12 Corresponding AWX changes for runner --worker-info schema update (#10926) 2021-08-24 08:41:36 -04:00
Alan Rominger
928c35ede5 Model changes for instance last_seen field to replace modified (#10870)
* Model changes for instance last_seen field to replace modified

* Break up refresh_capacity into smaller units

* Rename execution node methods, fix last_seen clustering

* Use update_fields to make it clear save only affects capacity

* Restructing to pass unit tests

* Fix bug where a PATCH did not update capacity value
2021-08-24 08:41:35 -04:00
Alan Rominger
3b1e40d227 Use the ansible-runner worker --worker-info to perform execution node capacity checks (#10825)
* Introduce utilities for --worker-info health check integration

* Handle case where ansible-runner is not installed

* Add ttl parameter for health check

* Reformulate return data structure and add lots of error cases

* Move up the cleanup tasks, close sockets

* Integrate new --worker-info into the execution node capacity check

* Undo the raw value override from the PoC

* Additional refinement to execution node check frequency

* Put in more complete network diagram

* Followup on comment to remove modified from from health check responsibilities
2021-08-24 08:41:35 -04:00
Alan Rominger
f47eb126e2 Adopt the node_type field in receptor logic (#10802)
* Adopt the node_type field in receptor logic

* Refactor Instance.objects.register so we do not reset capacity to 0
2021-08-24 08:41:34 -04:00
Alan Rominger
b53d3bc81d Undo some things not compatible with hybrid node hack (#10763) 2021-08-24 08:41:34 -04:00
Alan Rominger
9881bb72b8 Treat the awx_1 node as a hybrid node for now, use local work type (#10726) 2021-08-24 08:40:21 -04:00
Alan Rominger
f597205fa7 Run capacity checks with container isolation (#10688)
This requires swapping out the container images
  for the execution nodes from awx-ee to the awx image

For completeness, the hop node image is switched to the raw
  receptor image

A few outright bugs are fixed here
  memory calculation just was not right at all
  the execution_capacity calculation was reverse of intention

Drop in a few TODOs about error handling from debugging
2021-08-24 08:40:19 -04:00
Alan Rominger
13300bdbd4 Update rebase to keep old control plane capacity check
Also do some basic work to separate control versus execution capacity
  this is to assure that we don't send jobs to the control node
2021-08-24 08:40:19 -04:00
Alan Rominger
39e23db523 Make minor changes to add needed imports 2021-08-24 08:40:19 -04:00
Ryan Petrello
05cb876df5 implement an initial development environment for receptor-based clusters 2021-08-24 08:40:18 -04:00
Rebeccah
706f3f97ea add a new field to the instance model for use with receptor changes (incoming) 2021-07-26 15:53:56 -04:00
Shane McDonald
ec8ac6f1a7 Introduce distinct controlplane instance group 2021-06-07 11:25:59 -04:00
Jeff Bradberry
1819a7963a Make the necessary changes to the models
- remove InstanceGroup.controller
- remove Instance.last_isolated_check
- remove .is_isolated and .is_controller methods/properties
- remove .choose_online_controller_node() method
- remove .supports_isolation() and replace with .can_run_containerized
- simplify .can_run_containerized
2021-04-22 10:17:02 -04:00
Ryan Petrello
c2ef0a6500 move code linting to a stricter pep8-esque auto-formatting tool, black 2021-03-23 09:39:58 -04:00
Shane McDonald
1c4a376758 Explicit db field for is_container_group
We now have Container Groups that dont require a credential.
2021-03-15 13:28:39 -04:00
Shane McDonald
57b317d440 Get system jobs working under new deployment model (#9221) 2021-03-03 18:52:55 -05:00
Ryan Petrello
f850f8d3e0 introduce a new global flag for denoating K8S-based deployments
- In K8S-based installs, only container groups are intended to be used
  for playbook execution (JTs, adhoc, inventory updates), so in this
  scenario, other job types have a task impact of zero.
- In K8S-based installs, traditional instances have *zero* capacity
  (because they're only members of the control plane where services
  - http/s, local control plane execution - run)
- This commit also includes some changes that allow for the task manager
  to launch tasks with task_impact=0 on instances that have capacity=0
  (previously, an instance with zero capacity would never be selected
  as the "execution node"

This means that when IS_K8S=True, any Job Template associated with an
Instance Group will never actually go from pending -> running (because
there's no capacity - all playbooks must run through Container Groups).
For an improved ux, our intention is to introduce logic into the
operator install process such that the *default* group that's created at
install time is a *Container Group* that's configured to point at the
K8S cluster where awx itself is deployed.
2021-03-03 18:52:55 -05:00
Shane McDonald
286b1d4e25 InstanceGroup#is_containerized -> InstanceGroup#is_container_group 2021-03-03 18:52:55 -05:00
Chris Meyers
2eac5a8873 reduce per-job database query count
* Do not query the database for the set of Instance that belong to the
group for which we are trying to fit a job on, for each job.
* Instead, cache the set of instances per-instance group.
2020-10-19 10:54:56 -04:00
Bill Nottingham
05ad85e7a6 Remove the model for the now unused TowerAnalyticsState. 2020-09-09 20:18:04 -04:00
Ryan Petrello
b01d204137 if redis is unreachable, set instance capacity to zero 2020-09-03 15:11:53 -04:00
Shane McDonald
45ce6d794e Initial migration of rabbitmq -> redis for k8s installs 2020-03-18 16:10:17 -04:00
Rebeccah
1f05372ac9 change the logic to not break existing policy_instance testing 2019-11-12 13:13:34 -05:00
Rebeccah
d0327fc044 added onto the when saved function for instance groups that sets policy variables to their default. 2019-11-12 13:13:34 -05:00
Shane McDonald
bd5003ca98 Task manager / scheduler Kubernetes integration 2019-10-04 13:21:21 -04:00
Shane McDonald
a9059edc65 Allow associating a credential with an instance group 2019-10-04 12:54:31 -04:00
Jim Ladd
cdcf2fa4c2 Increase instance version length 2019-10-03 21:24:07 -04:00
AlanCoding
fedd1cf22f Replace JobOrigin with ActivityStream.action_node 2019-05-31 07:10:07 -04:00
softwarefactory-project-zuul[bot]
874465a2d4 Merge pull request #3865 from chrismeyersfsu/fix-enabled_still_online
disabled instance does not mean offline instance

Reviewed-by: https://github.com/softwarefactory-project-zuul[bot]
2019-05-23 16:55:09 +00:00
AlanCoding
f4c18843a3 Resolve default ordering warnings from tests 2019-05-20 10:58:36 -04:00
chris meyers
8aa28092ff disabled instance does not mean offline instance
* Disabling an instance is used to stop and instance from being the
target of new jobs to run.
* The instance should still perform it's heartbeat so that it isn't
considered offline.
* If the instance was allowed to go offline on an openshift cluster it
would be deleted from the database.
2019-05-13 11:44:47 -04:00
Ryan Petrello
e4a50f3595 enforce a stable list order when attaching/detaching instance groups 2019-05-07 14:53:00 -04:00
softwarefactory-project-zuul[bot]
fad0274373 Merge pull request #3686 from vismay-golwala/instance_group_delete
[WIP] Disallow deleting controller or isolated instance groups

Reviewed-by: https://github.com/softwarefactory-project-zuul[bot]
2019-04-24 15:19:19 +00:00
AlanCoding
8c2b3e9b84 Fix Django 2.0 deprecation warnings 2019-04-22 14:17:14 -04:00
Vismay Golwala
e0c4fd4b3a Disallow deleting controller or isolated instance groups
Added two new properties to the InstanceGroup model - `is_controller`
and `is_isolated`. Used these properties to hide the trash icon for
instance groups that are either controller or isolated.

Signed-off-by: Vismay Golwala <vgolwala@redhat.com>
2019-04-15 16:08:27 -04:00
Ryan Petrello
c586fa9821 add a minimal framework for generating analytics/metrics
annotate queries & add license analytics
2019-03-27 19:53:00 -04:00
Bianca
f1e3be5ec8 Removing unicode fix-related lines in ha.py 2019-01-17 14:42:38 -05:00
Ryan Petrello
f223df303f convert py2 -> py3 2019-01-15 14:09:01 -05:00
Ryan Petrello
ff1e8cc356 replace celery task decorators with a kombu-based publisher
this commit implements the bulk of `awx-manage run_dispatcher`, a new
command that binds to RabbitMQ via kombu and balances messages across
a pool of workers that are similar to celeryd workers in spirit.
Specifically, this includes:

- a new decorator, `awx.main.dispatch.task`, which can be used to
  decorate functions or classes so that they can be designated as
  "Tasks"
- support for fanout/broadcast tasks (at this point in time, only
  `conf.Setting` memcached flushes use this functionality)
- support for job reaping
- support for success/failure hooks for job runs (i.e.,
  `handle_work_success` and `handle_work_error`)
- support for auto scaling worker pool that scale processes up and down
  on demand
- minimal support for RPC, such as status checks and pool recycle/reload
2018-10-11 10:53:30 -04:00
Ryan Petrello
67d1267d98 enforce 0 <= Instance.capacity_adjustment
see: https://github.com/ansible/tower/issues/2839
2018-08-21 15:34:19 -04:00