* Unsure exactly why this happens, but there seems to be a race condition
related to the time window between Receptor work_results and work
release. This sleep extends that window and should avoid the race
condition.
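A minimal sketch of the workaround, assuming receptorctl's ReceptorControl
client (the helper name and grace period are illustrative):

```python
import time

RESULTS_RELEASE_GRACE_SECONDS = 1  # illustrative value; tune as needed

def read_results_then_release(receptor_ctl, unit_id):
    # Drain the work unit's results stream before releasing the unit.
    result_socket = receptor_ctl.get_work_results(unit_id)
    stdout = result_socket.read()
    # Workaround: releasing immediately after reading results appears to
    # race with Receptor's own bookkeeping, so widen the window first.
    time.sleep(RESULTS_RELEASE_GRACE_SECONDS)
    receptor_ctl.simple_command(f'work release {unit_id}')
    return stdout
```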
Add new logic to clean up orphaned work units
left behind by administrative tasks
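A rough sketch of what that cleanup might look like, assuming receptorctl's
ReceptorControl client (the socket path and the source of known unit IDs
are illustrative):

```python
from receptorctl.socket_interface import ReceptorControl

def cleanup_orphaned_work_units(socket_path, known_unit_ids):
    # Any work unit on the node that the application no longer tracks
    # (e.g. left behind by an administrative task) is considered orphaned.
    ctl = ReceptorControl(socket_path)
    work_units = ctl.simple_command('work list')  # dict keyed by unit ID
    for unit_id in work_units:
        if unit_id not in known_unit_ids:
            ctl.simple_command(f'work release {unit_id}')
```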
Remove noisy log message about running cleanup on execution nodes;
it is often irrelevant and we already have other logs covering this.
Skip checking the health of a mesh instance when the instance is not registered
with the application. This prevents an 'UnboundLocalError' when
running the application attached to a multi-use Receptor mesh network.
Signed-off-by: Ethan Paul <24588726+enpaul@users.noreply.github.com>
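The shape of the fix, sketched (run_health_check and the registry lookup
are illustrative):

```python
def get_instance_health(hostname, registered_hostnames):
    # Previously, 'data' was only assigned inside a conditional, so a mesh
    # node unknown to the application reached 'return data' unassigned and
    # raised UnboundLocalError. Skipping unregistered instances avoids it.
    if hostname not in registered_hostnames:
        return None
    data = run_health_check(hostname)  # hypothetical helper
    return data
```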
* Primary development of integrating runner cleanup command
* Fixup image cleanup signals and their tests
* Use alphabetical sort to solve the cluster coordination problem (see the sketch after this list)
* Update test to new pattern
* Clarity edits to interface with ansible-runner cleanup method
* Another change corresponding to ansible-runner CLI updates
* Fix incomplete implementation of receptor remote cleanup
* Share receptor utils code between worker_info and cleanup
* Complete task logging from calling runner cleanup command
* Wrap up unit tests and some contract changes that fall out of those
* Fix bug in CLI construction
* Fix queryset filter bug
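The coordination trick from the alphabetical-sort item can be sketched like
this; every node derives the same ordering, so they all agree on which one
runs the cluster-wide cleanup without any locking (hostnames are
illustrative):

```python
def pick_coordinator(hostnames):
    # Deterministic choice: sorting is stable across nodes, so each node
    # independently elects the same coordinator.
    return sorted(hostnames)[0]

cluster = ['awx-2.example.com', 'awx-1.example.com', 'awx-3.example.com']
this_node = 'awx-1.example.com'  # normally the local hostname

if this_node == pick_coordinator(cluster):
    print('running cluster-wide cleanup on', this_node)
```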
-- Updated devel build to use the most recent receptor binary
-- Added signWork parameter when sending jobs to receptor
-- Modified docker-compose tasks to generate an RSA key pair for work signing
-- Modified docker-compose templates and jinja templates to implement work signing
-- Modified firewall rules in the receptor jinja config
Add firewall rules to dev env
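Generating the work-signing key pair might look like this in Python (paths
are illustrative); the control node signs submitted work with the private
key and execution nodes verify it with the public key:

```python
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import rsa

# Generate an RSA key pair for Receptor work signing.
key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

with open('work_signing_private.pem', 'wb') as f:
    f.write(key.private_bytes(
        encoding=serialization.Encoding.PEM,
        format=serialization.PrivateFormat.PKCS8,
        encryption_algorithm=serialization.NoEncryption(),
    ))

with open('work_signing_public.pem', 'wb') as f:
    f.write(key.public_key().public_bytes(
        encoding=serialization.Encoding.PEM,
        format=serialization.PublicFormat.SubjectPublicKeyInfo,
    ))
```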
For project update jobs triggered by a job template run, we must
ensure that the project_update job runs on the same
controller that dispatched the original job template; otherwise
the job might fail because it cannot find the playbook YAML file.
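A sketch of the constraint (the helper is illustrative; it assumes a
controller_node field on both tasks, per the description above):

```python
def pin_dependent_project_update(job, project_update):
    # The synced project directory only exists on the controller that
    # dispatched the original job, so the dependent project_update must
    # run there too, or it will not find the playbook YAML file.
    project_update.controller_node = job.controller_node
    project_update.save(update_fields=['controller_node'])
```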
* Fully finalize the planned work for health checks of execution nodes
* Implementation of instance health_check endpoint (usage example after this list)
* Also make the version check conditional on node_type
* Do not use receptor mesh to check main cluster nodes health
* Fix bugs from testing health check of cluster nodes, add doc
* Add a few fields to health check serializer missed before
* Light refactoring of error field processing
* Fix errors when clearing the error field, write more unit tests
* Update health check info in docs
* Bump migration of health check after rebase
* Mark string for translation
* Add related health_check link for system auditors too
* Handle health_check cluster node timeout, add errors for peer judgement
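Example usage of the health_check endpoint (host, credentials, and
response fields are illustrative):

```python
import requests

# POST triggers a fresh health check for instance 42; GET would return
# the most recently recorded result.
resp = requests.post(
    'https://awx.example.com/api/v2/instances/42/health_check/',
    auth=('admin', 'password'),
)
resp.raise_for_status()
info = resp.json()
print(info.get('errors'), info.get('last_health_check'))
```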
* Exclude control-only nodes from IG policy calculations
Conversely, exclude execution-only nodes from the calculations
when the group in question is the controlplane group (see the
sketch after this list)
* Incorporate review comments
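A sketch of the eligibility rule described above (the group name and
node_type values are illustrative):

```python
CONTROL_PLANE_GROUP = 'controlplane'

def policy_eligible_instances(group_name, instances):
    # Control-only nodes carry no execution capacity, so ordinary instance
    # group policy calculations skip them; conversely, the control plane
    # group only considers control-capable nodes.
    if group_name == CONTROL_PLANE_GROUP:
        return [i for i in instances if i.node_type in ('control', 'hybrid')]
    return [i for i in instances if i.node_type in ('execution', 'hybrid')]
```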
* Clean up added work_type processing for mesh_code branch
* Track both execution and control capacity
* Remove unused execution_capacity property
* Count all forms of capacity to make test pass
* Force jobs to be on execution nodes, updates on control nodes
* Introduce capacity_type property to abstract some details out (see the sketch after this list)
* Update test to cover all job types at same time
* Register OpenShift nodes as control types
* Remove unqualified consumed_capacity from task manager and make unit tests work
* Update unit test to execution vs control TM logic changes
* Fix bug in else handling for the work_type method
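The capacity_type abstraction might be sketched like this (the class and
the task-type set are illustrative):

```python
CONTROL_TASK_TYPES = {'project_update', 'system_job'}

class UnifiedTask:
    def __init__(self, task_type):
        self.task_type = task_type

    @property
    def capacity_type(self):
        # Administrative tasks consume control capacity and must run on
        # control-capable nodes; ordinary jobs consume execution capacity
        # and are forced onto execution nodes.
        return 'control' if self.task_type in CONTROL_TASK_TYPES else 'execution'

assert UnifiedTask('project_update').capacity_type == 'control'
assert UnifiedTask('run').capacity_type == 'execution'
```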