Commit Graph

1279 Commits

Author SHA1 Message Date
Alan Rominger
3a3fffb2dd Fixed error dropped on floor - save receptor detail when it applies 2021-11-10 08:50:16 +08:00
Bianca Henderson
58cdbca5cf Update error message to be more accurate 2021-11-10 08:50:15 +08:00
Bianca Henderson
8275082896 Update error messages for when exceptions are caught 2021-11-10 08:50:14 +08:00
Bianca Henderson
d79da1ef9f Catch exceptions that might pop up when releasing work units 2021-11-10 08:50:14 +08:00
Alan Rominger
ecb84e090c Revert "Merge pull request #5354 from ansible/jobs_killed_via_receptor_should_get_reaped"
This reverts commit 8736858d80, reversing
changes made to 84e77c9db9.
2021-11-10 08:50:14 +08:00
Jim Ladd
0f77ca605d add unit tests 2021-11-10 08:50:14 +08:00
Jim Ladd
231fcc8178 drop lines picked up during merge resolution 2021-11-10 08:50:13 +08:00
Alan Rominger
2839091b22 Avoid extra check if we have job_explanation 2021-11-10 08:50:13 +08:00
Alan Rominger
47e67481b3 Avoid reaping tentative jobs 2021-11-10 08:50:13 +08:00
Alan Rominger
55059b015f Avoid resultsock shutdown before reading from it 2021-11-10 08:50:13 +08:00
Alan Rominger
eb6c58682d Alternative for reaping lost jobs, in work unit reaper 2021-11-10 08:50:13 +08:00
Jim Ladd
26055de772 cancel job if receptor no longer knows about the work item 2021-11-10 08:50:13 +08:00
Jim Ladd
ebb4581595 update exception log message to be descriptive
.. instead of surfacing exception
2021-11-10 08:50:12 +08:00
Jim Ladd
d1fecc11c9 when releasing receptor work, do so in try/except 2021-11-10 08:50:12 +08:00
Jeff Bradberry
7010015e8a Change the ActivityStream registration for InstanceGroups
to include the m2m fields.  Also to avoid spamminess, disable the
activity stream on the apply_cluster_membership_policies task.
2021-11-10 08:50:12 +08:00
Jeff Bradberry
1e5231d68b Enable ActivityStream capture for Instances 2021-11-10 08:50:12 +08:00
Alan Rominger
77076dbd67 Reduce the number of triggers for execution node health checks 2021-11-10 08:50:11 +08:00
Alan Rominger
f34c96ecf5 Error handling when node is missing from mesh for jobs and checks 2021-11-10 08:50:11 +08:00
chris meyers
3065e29deb avoid work_results and work release race
* Unsure exactly why this happens but there seems to be a race condition
  related to the time window between Receptor work_results and work
  release. This sleep extends that window and hopefully avoids the race
  condition.
2021-11-10 08:50:10 +08:00
Bianca Henderson
481047bed8 Change log level from 'warning' to 'exception' 2021-11-10 08:50:10 +08:00
Bianca Henderson
f72292cce2 Move error handling into try/catch block 2021-11-10 08:50:10 +08:00
Alan Rominger
7b35902d33 Respect settings to keep files and work units
Add new logic to cleanup orphaned work units
  from administrative tasks

Remove noisy log which is often irrelevant
  about running-cleanup-on-execution-nodes
  we already have other logs for this
2021-11-10 08:50:10 +08:00
Ethan Paul
c77aaece1d Skip additional instance checks on unrecognized hosts
Skip checking the health of a mesh instance when the instance is not registered
with the application. This prevents encountering an 'UnbouncLocalError' when
running the application attached to a multi-use Receptor mesh network

Signed-off-by: Ethan Paul <24588726+enpaul@users.noreply.github.com>
2021-10-29 14:06:36 -04:00
Shane McDonald
60a357eda1 Merge pull request #10906 from oweel/10829-idle_timeout_setting
Added idle_timeout setting to job settings
2021-10-13 13:16:53 -04:00
chris meyers
611a537b55 add missing create partition for scm backed inv
* This will resolve missing project update job events issue
2021-10-13 07:51:40 -04:00
chris meyers
64811d0b6b fix python black lint requirements 2021-10-12 17:09:30 -04:00
Amol Gautam
f79a57c3e2 Changed Work Submission parameter for K8s work 2021-10-11 08:10:26 -07:00
Alan Rominger
b70793db5c Consolidate cleanup actions under new ansible-runner worker cleanup command (#11160)
* Primary development of integrating runner cleanup command

* Fixup image cleanup signals and their tests

* Use alphabetical sort to solve the cluster coordination problem

* Update test to new pattern

* Clarity edits to interface with ansible-runner cleanup method

* Another change corresponding to ansible-runner CLI updates

* Fix incomplete implementation of receptor remote cleanup

* Share receptor utils code between worker_info and cleanup

* Complete task logging from calling runner cleanup command

* Wrap up unit tests and some contract changes that fall out of those

* Fix bug in CLI construction

* Fix queryset filter bug
2021-10-05 16:32:03 -04:00
Amol Gautam
24a6edef9e AWX dev environment changes for receptor work signing feature
-- Updated devel build to take most recent receptor binary
-- Added signWork parameter when sedning job to receptor
-- Modified docker-compose tasks to generate RSA key pair to use for work-signing
-- Modified docker-compose templates and jinja templates for implementing work-sign
-- Modified Firewall rules on the receptor jinja config

Add firewall rules to dev env
2021-10-05 11:41:34 -07:00
Alan Rominger
af5f8e8a4a Always set project sync execution_node to current host (#11204) 2021-10-05 13:08:40 -04:00
Jim Ladd
815a45cf2f call 'work cancel' on inactive controller jobs 2021-10-01 12:55:06 -07:00
Alan Rominger
7c9626b0e7 Fix bug that would run --worker-info health checks on control or hybrid nodes (#11161)
* Fix bug that would run health check on control nodes

* Prevent running execution node health check against main cluster nodes
2021-09-29 09:34:22 -04:00
Alan Rominger
3664cc3369 Disable autodiscovery except for docker-compose (#11142) 2021-09-27 11:36:11 -04:00
Marcelo Moreira de Mello
6d4b4cac37 Project updates must run on controller nodes
For project updates jobs triggered due a job template run,
we must ensure that project_update job to run on at the same
controller which dispatched the original job template, otherwise
the job might fail for being unable to find the playbook YAML file.
2021-09-25 23:05:45 -04:00
Marcelo Moreira de Mello
045785c36f Refactored get_conn_type() method to use Enum 2021-09-23 10:51:50 -04:00
Marcelo Moreira de Mello
45600d034d Initial StreamTLS support for receptor nodes 2021-09-23 10:50:17 -04:00
Alan Rominger
e914c23c42 Pass --delete flag to worker for execution node cleanup (#11078)
* Pass --delete flag to worker for execution node cleanup

* Remove the pdd_wrapper_ directory
2021-09-16 11:21:41 -04:00
beeankha
48eb06f320 Add verify_ssl to container_auth_data params 2021-09-16 09:49:53 -04:00
beeankha
ac8b49b39d Change the way auth info is passed to Runner for EEs pulled from protected registries 2021-09-15 08:49:28 -04:00
Alan Rominger
6a17e5b65b Allow manually running a health check, and make other adjustments to the health check trigger (#11002)
* Full finalize the planned work for health checks of execution nodes

* Implementation of instance health_check endpoint

* Also do version conditional to node_type

* Do not use receptor mesh to check main cluster nodes health

* Fix bugs from testing health check of cluster nodes, add doc

* Add a few fields to health check serializer missed before

* Light refactoring of error field processing

* Fix errors clearing error, write more unit tests

* Update health check info in docs

* Bump migration of health check after rebase

* Mark string for translation

* Add related health_check link for system auditors too

* Handle health_check cluster node timeout, add errors for peer judgement
2021-09-03 16:37:37 -04:00
Alan Rominger
5a6e9a06e2 Exclude control-only nodes from IG policy calculations (#10985)
* Exclude control-only nodes from IG policy calculations

Also, as a reverse to this, exclude execution-only nodes from
  the calculations if the group in question is the controlplane

* Incorporate review comments
2021-09-01 08:09:46 -04:00
Alan Rominger
dc4b014d12 Make status command in error handling cleaner (#10823) 2021-08-31 12:02:39 -04:00
Jim Ladd
467a37f8fe use settings.DEFAULT_EXECUTION_QUEUE_NAME in lieu of default 2021-08-26 11:15:14 -07:00
Jim Ladd
88a6412b54 only need to update IG's policy_instance_list field 2021-08-26 11:15:14 -07:00
Jim Ladd
502eaf9fb9 handle case where default IG does not exist
* also, only add discovered execution node to default group
  if `register`-ing the node actually resulted in a confirmed
  change
2021-08-26 11:15:14 -07:00
Jim Ladd
de8eab0434 inspect_execution_nodes should *not* block when retreiving lock
* would otherwise hold up cluster heartbeat task
* furthermore, only really need one node to run through
  `inspect_execution_nodes` each interval
2021-08-26 11:15:14 -07:00
Jim Ladd
f317fca9e4 add auto-discovered nodes to default IG
* add advisory_lock to avoid IG update race logic
* update IG by way of policy_instance_list
2021-08-26 11:15:14 -07:00
Jim Ladd
561fc289fb disable discovered instances by default 2021-08-26 11:15:14 -07:00
Alan Rominger
daf4310176 Clean up work_type processing and fix execution vs control capacity (#10930)
* Clean up added work_type processing for mesh_code branch

* track both execution and control capacity

* Remove unused execution_capacity property

* Count all forms of capacity to make test pass

* Force jobs to be on execution nodes, updates on control nodes

* Introduce capacity_type property to abstract some details out

* Update test to cover all job types at same time

* Register OpenShift nodes as control types

* Remove unqualified consumed_capacity from task manager and make unit tests work

* Remove unqualified consumed_capacity from task manager and make unit tests work

* Update unit test to execution vs control TM logic changes

* Fix bug, else handling for work_type method
2021-08-26 07:24:14 -04:00
Shane McDonald
274e487a96 Attempt to surface streaming errors that were being eaten (#10918) 2021-08-24 10:33:00 -04:00