For project updates jobs triggered due a job template run,
we must ensure that project_update job to run on at the same
controller which dispatched the original job template, otherwise
the job might fail for being unable to find the playbook YAML file.
* Write tests to assure air-tightness of overwrite
* Candidate fix for group overwrite air-tightness
* Another proposed fix for the id mapping
* Further double down on tracking old instance_id
* Separate unchanging data case and fix some test issues
* parametrize final confusion test
* cut down some more on test cases and fix bug with prior fix
* Rewrite of _delete_host code sharing with update method
This is a start-to-finish rewrite of the host overwrite bug fix
this method is much more conservative,
it does this by keeping the overall code structure where hosts
are deleted before host updates are made
To fix the bug, we share code between the method that deletes hosts
and the method that updates the hosts
A data structure is created and passed to both methods
By having both methods use the same data structure which maps
the in-memory hosts to DB hosts, we assure that the deletions
will ONLY delete hosts that will not be updated
* Full finalize the planned work for health checks of execution nodes
* Implementation of instance health_check endpoint
* Also do version conditional to node_type
* Do not use receptor mesh to check main cluster nodes health
* Fix bugs from testing health check of cluster nodes, add doc
* Add a few fields to health check serializer missed before
* Light refactoring of error field processing
* Fix errors clearing error, write more unit tests
* Update health check info in docs
* Bump migration of health check after rebase
* Mark string for translation
* Add related health_check link for system auditors too
* Handle health_check cluster node timeout, add errors for peer judgement
* Exclude control-only nodes from IG policy calculations
Also, as a reverse to this, exclude execution-only nodes from
the calculations if the group in question is the controlplane
* Incorporate review comments
* Always use controlplane as project update backup IG
Before, this was done conditionally to container_group jobs
this logic changes it so that controlgroup will always be a
firm backstop for project updates
* Code a little more defensively to make unit tests pass
* Fix unit tests
* Clean up added work_type processing for mesh_code branch
* track both execution and control capacity
* Remove unused execution_capacity property
* Count all forms of capacity to make test pass
* Force jobs to be on execution nodes, updates on control nodes
* Introduce capacity_type property to abstract some details out
* Update test to cover all job types at same time
* Register OpenShift nodes as control types
* Remove unqualified consumed_capacity from task manager and make unit tests work
* Remove unqualified consumed_capacity from task manager and make unit tests work
* Update unit test to execution vs control TM logic changes
* Fix bug, else handling for work_type method
* Model changes for instance last_seen field to replace modified
* Break up refresh_capacity into smaller units
* Rename execution node methods, fix last_seen clustering
* Use update_fields to make it clear save only affects capacity
* Restructing to pass unit tests
* Fix bug where a PATCH did not update capacity value
* Introduce utilities for --worker-info health check integration
* Handle case where ansible-runner is not installed
* Add ttl parameter for health check
* Reformulate return data structure and add lots of error cases
* Move up the cleanup tasks, close sockets
* Integrate new --worker-info into the execution node capacity check
* Undo the raw value override from the PoC
* Additional refinement to execution node check frequency
* Put in more complete network diagram
* Followup on comment to remove modified from from health check responsibilities