* Unsure exactly why this happens but there seems to be a race condition
related to the time window between Receptor work_results and work
release. This sleep extends that window and hopefully avoids the race
condition.
Add new logic to cleanup orphaned work units
from administrative tasks
Remove noisy log which is often irrelevant
about running-cleanup-on-execution-nodes
we already have other logs for this
* Primary development of integrating runner cleanup command
* Fixup image cleanup signals and their tests
* Use alphabetical sort to solve the cluster coordination problem
* Update test to new pattern
* Clarity edits to interface with ansible-runner cleanup method
* Another change corresponding to ansible-runner CLI updates
* Fix incomplete implementation of receptor remote cleanup
* Share receptor utils code between worker_info and cleanup
* Complete task logging from calling runner cleanup command
* Wrap up unit tests and some contract changes that fall out of those
* Fix bug in CLI construction
* Fix queryset filter bug
-- Updated devel build to take most recent receptor binary
-- Added signWork parameter when sedning job to receptor
-- Modified docker-compose tasks to generate RSA key pair to use for work-signing
-- Modified docker-compose templates and jinja templates for implementing work-sign
-- Modified Firewall rules on the receptor jinja config
Add firewall rules to dev env
For project updates jobs triggered due a job template run,
we must ensure that project_update job to run on at the same
controller which dispatched the original job template, otherwise
the job might fail for being unable to find the playbook YAML file.
* Full finalize the planned work for health checks of execution nodes
* Implementation of instance health_check endpoint
* Also do version conditional to node_type
* Do not use receptor mesh to check main cluster nodes health
* Fix bugs from testing health check of cluster nodes, add doc
* Add a few fields to health check serializer missed before
* Light refactoring of error field processing
* Fix errors clearing error, write more unit tests
* Update health check info in docs
* Bump migration of health check after rebase
* Mark string for translation
* Add related health_check link for system auditors too
* Handle health_check cluster node timeout, add errors for peer judgement
* Exclude control-only nodes from IG policy calculations
Also, as a reverse to this, exclude execution-only nodes from
the calculations if the group in question is the controlplane
* Incorporate review comments
* Clean up added work_type processing for mesh_code branch
* track both execution and control capacity
* Remove unused execution_capacity property
* Count all forms of capacity to make test pass
* Force jobs to be on execution nodes, updates on control nodes
* Introduce capacity_type property to abstract some details out
* Update test to cover all job types at same time
* Register OpenShift nodes as control types
* Remove unqualified consumed_capacity from task manager and make unit tests work
* Remove unqualified consumed_capacity from task manager and make unit tests work
* Update unit test to execution vs control TM logic changes
* Fix bug, else handling for work_type method
* Model changes for instance last_seen field to replace modified
* Break up refresh_capacity into smaller units
* Rename execution node methods, fix last_seen clustering
* Use update_fields to make it clear save only affects capacity
* Restructing to pass unit tests
* Fix bug where a PATCH did not update capacity value
* Introduce utilities for --worker-info health check integration
* Handle case where ansible-runner is not installed
* Add ttl parameter for health check
* Reformulate return data structure and add lots of error cases
* Move up the cleanup tasks, close sockets
* Integrate new --worker-info into the execution node capacity check
* Undo the raw value override from the PoC
* Additional refinement to execution node check frequency
* Put in more complete network diagram
* Followup on comment to remove modified from from health check responsibilities
This requires swapping out the container images
for the execution nodes from awx-ee to the awx image
For completeness, the hop node image is switched to the raw
receptor image
A few outright bugs are fixed here
memory calculation just was not right at all
the execution_capacity calculation was reverse of intention
Drop in a few TODOs about error handling from debugging