- the list, detail, and health check API views should not include them
- the Instance-InstanceGroup association views should not allow them
to be changed
- the ping view excludes them
- list_instances management command excludes them
- Instance.set_capacity_value sets hop nodes to 0 capacity
- TaskManager will exclude them from the nodes available for job execution
- TaskManager.reap_jobs_from_orphaned_instances will consider hop nodes
to be an orphaned instance
- The apply_cluster_membership_policies task will not manipulate hop nodes
- get_broadcast_hosts will ignore hop nodes
- active_count also will ignore hop nodes
Add new logic to cleanup orphaned work units
from administrative tasks
Remove noisy log which is often irrelevant
about running-cleanup-on-execution-nodes
we already have other logs for this
* Primary development of integrating runner cleanup command
* Fixup image cleanup signals and their tests
* Use alphabetical sort to solve the cluster coordination problem
* Update test to new pattern
* Clarity edits to interface with ansible-runner cleanup method
* Another change corresponding to ansible-runner CLI updates
* Fix incomplete implementation of receptor remote cleanup
* Share receptor utils code between worker_info and cleanup
* Complete task logging from calling runner cleanup command
* Wrap up unit tests and some contract changes that fall out of those
* Fix bug in CLI construction
* Fix queryset filter bug
* Full finalize the planned work for health checks of execution nodes
* Implementation of instance health_check endpoint
* Also do version conditional to node_type
* Do not use receptor mesh to check main cluster nodes health
* Fix bugs from testing health check of cluster nodes, add doc
* Add a few fields to health check serializer missed before
* Light refactoring of error field processing
* Fix errors clearing error, write more unit tests
* Update health check info in docs
* Bump migration of health check after rebase
* Mark string for translation
* Add related health_check link for system auditors too
* Handle health_check cluster node timeout, add errors for peer judgement
* Clean up added work_type processing for mesh_code branch
* track both execution and control capacity
* Remove unused execution_capacity property
* Count all forms of capacity to make test pass
* Force jobs to be on execution nodes, updates on control nodes
* Introduce capacity_type property to abstract some details out
* Update test to cover all job types at same time
* Register OpenShift nodes as control types
* Remove unqualified consumed_capacity from task manager and make unit tests work
* Remove unqualified consumed_capacity from task manager and make unit tests work
* Update unit test to execution vs control TM logic changes
* Fix bug, else handling for work_type method
* Model changes for instance last_seen field to replace modified
* Break up refresh_capacity into smaller units
* Rename execution node methods, fix last_seen clustering
* Use update_fields to make it clear save only affects capacity
* Restructing to pass unit tests
* Fix bug where a PATCH did not update capacity value
* Introduce utilities for --worker-info health check integration
* Handle case where ansible-runner is not installed
* Add ttl parameter for health check
* Reformulate return data structure and add lots of error cases
* Move up the cleanup tasks, close sockets
* Integrate new --worker-info into the execution node capacity check
* Undo the raw value override from the PoC
* Additional refinement to execution node check frequency
* Put in more complete network diagram
* Followup on comment to remove modified from from health check responsibilities
This requires swapping out the container images
for the execution nodes from awx-ee to the awx image
For completeness, the hop node image is switched to the raw
receptor image
A few outright bugs are fixed here
memory calculation just was not right at all
the execution_capacity calculation was reverse of intention
Drop in a few TODOs about error handling from debugging
- In K8S-based installs, only container groups are intended to be used
for playbook execution (JTs, adhoc, inventory updates), so in this
scenario, other job types have a task impact of zero.
- In K8S-based installs, traditional instances have *zero* capacity
(because they're only members of the control plane where services
- http/s, local control plane execution - run)
- This commit also includes some changes that allow for the task manager
to launch tasks with task_impact=0 on instances that have capacity=0
(previously, an instance with zero capacity would never be selected
as the "execution node"
This means that when IS_K8S=True, any Job Template associated with an
Instance Group will never actually go from pending -> running (because
there's no capacity - all playbooks must run through Container Groups).
For an improved ux, our intention is to introduce logic into the
operator install process such that the *default* group that's created at
install time is a *Container Group* that's configured to point at the
K8S cluster where awx itself is deployed.
* Do not query the database for the set of Instance that belong to the
group for which we are trying to fit a job on, for each job.
* Instead, cache the set of instances per-instance group.
* Disabling an instance is used to stop and instance from being the
target of new jobs to run.
* The instance should still perform it's heartbeat so that it isn't
considered offline.
* If the instance was allowed to go offline on an openshift cluster it
would be deleted from the database.
Added two new properties to the InstanceGroup model - `is_controller`
and `is_isolated`. Used these properties to hide the trash icon for
instance groups that are either controller or isolated.
Signed-off-by: Vismay Golwala <vgolwala@redhat.com>
this commit implements the bulk of `awx-manage run_dispatcher`, a new
command that binds to RabbitMQ via kombu and balances messages across
a pool of workers that are similar to celeryd workers in spirit.
Specifically, this includes:
- a new decorator, `awx.main.dispatch.task`, which can be used to
decorate functions or classes so that they can be designated as
"Tasks"
- support for fanout/broadcast tasks (at this point in time, only
`conf.Setting` memcached flushes use this functionality)
- support for job reaping
- support for success/failure hooks for job runs (i.e.,
`handle_work_success` and `handle_work_error`)
- support for auto scaling worker pool that scale processes up and down
on demand
- minimal support for RPC, such as status checks and pool recycle/reload