Commit Graph

1258 Commits

Author SHA1 Message Date
Alan Rominger
97ce7d226b Error handling when node is missing from mesh for jobs and checks 2021-10-19 10:06:02 -04:00
chris meyers
098201be0b avoid work_results and work release race
* Unsure exactly why this happens but there seems to be a race condition
  related to the time window between Receptor work_results and work
  release. This sleep extends that window and hopefully avoids the race
  condition.
2021-10-15 16:27:24 -04:00
Bianca Henderson
0e15c269cd Merge pull request #5351 from ansible/exec_node_error_handling
[4.1] Improve Error Handling for When Job Cannot Be Delivered to an Execution Node
2021-10-15 13:29:27 -04:00
Alan Rominger
54e3377254 Respect settings to keep files and work units
Add new logic to cleanup orphaned work units
  from administrative tasks

Remove noisy log which is often irrelevant
  about running-cleanup-on-execution-nodes
  we already have other logs for this
2021-10-15 11:36:36 -04:00
Bianca Henderson
dd622dcc30 Change log level from 'warning' to 'exception' 2021-10-15 08:08:24 -04:00
Bianca Henderson
412bf7f282 Move error handling into try/catch block 2021-10-14 14:52:38 -04:00
chris meyers
611a537b55 add missing create partition for scm backed inv
* This will resolve missing project update job events issue
2021-10-13 07:51:40 -04:00
chris meyers
64811d0b6b fix python black lint requirements 2021-10-12 17:09:30 -04:00
Amol Gautam
f79a57c3e2 Changed Work Submission parameter for K8s work 2021-10-11 08:10:26 -07:00
Alan Rominger
b70793db5c Consolidate cleanup actions under new ansible-runner worker cleanup command (#11160)
* Primary development of integrating runner cleanup command

* Fixup image cleanup signals and their tests

* Use alphabetical sort to solve the cluster coordination problem

* Update test to new pattern

* Clarity edits to interface with ansible-runner cleanup method

* Another change corresponding to ansible-runner CLI updates

* Fix incomplete implementation of receptor remote cleanup

* Share receptor utils code between worker_info and cleanup

* Complete task logging from calling runner cleanup command

* Wrap up unit tests and some contract changes that fall out of those

* Fix bug in CLI construction

* Fix queryset filter bug
2021-10-05 16:32:03 -04:00
Amol Gautam
24a6edef9e AWX dev environment changes for receptor work signing feature
-- Updated devel build to take most recent receptor binary
-- Added signWork parameter when sedning job to receptor
-- Modified docker-compose tasks to generate RSA key pair to use for work-signing
-- Modified docker-compose templates and jinja templates for implementing work-sign
-- Modified Firewall rules on the receptor jinja config

Add firewall rules to dev env
2021-10-05 11:41:34 -07:00
Alan Rominger
af5f8e8a4a Always set project sync execution_node to current host (#11204) 2021-10-05 13:08:40 -04:00
Jim Ladd
815a45cf2f call 'work cancel' on inactive controller jobs 2021-10-01 12:55:06 -07:00
Alan Rominger
7c9626b0e7 Fix bug that would run --worker-info health checks on control or hybrid nodes (#11161)
* Fix bug that would run health check on control nodes

* Prevent running execution node health check against main cluster nodes
2021-09-29 09:34:22 -04:00
Alan Rominger
3664cc3369 Disable autodiscovery except for docker-compose (#11142) 2021-09-27 11:36:11 -04:00
Marcelo Moreira de Mello
6d4b4cac37 Project updates must run on controller nodes
For project updates jobs triggered due a job template run,
we must ensure that project_update job to run on at the same
controller which dispatched the original job template, otherwise
the job might fail for being unable to find the playbook YAML file.
2021-09-25 23:05:45 -04:00
Marcelo Moreira de Mello
045785c36f Refactored get_conn_type() method to use Enum 2021-09-23 10:51:50 -04:00
Marcelo Moreira de Mello
45600d034d Initial StreamTLS support for receptor nodes 2021-09-23 10:50:17 -04:00
Alan Rominger
e914c23c42 Pass --delete flag to worker for execution node cleanup (#11078)
* Pass --delete flag to worker for execution node cleanup

* Remove the pdd_wrapper_ directory
2021-09-16 11:21:41 -04:00
beeankha
48eb06f320 Add verify_ssl to container_auth_data params 2021-09-16 09:49:53 -04:00
beeankha
ac8b49b39d Change the way auth info is passed to Runner for EEs pulled from protected registries 2021-09-15 08:49:28 -04:00
Alan Rominger
6a17e5b65b Allow manually running a health check, and make other adjustments to the health check trigger (#11002)
* Full finalize the planned work for health checks of execution nodes

* Implementation of instance health_check endpoint

* Also do version conditional to node_type

* Do not use receptor mesh to check main cluster nodes health

* Fix bugs from testing health check of cluster nodes, add doc

* Add a few fields to health check serializer missed before

* Light refactoring of error field processing

* Fix errors clearing error, write more unit tests

* Update health check info in docs

* Bump migration of health check after rebase

* Mark string for translation

* Add related health_check link for system auditors too

* Handle health_check cluster node timeout, add errors for peer judgement
2021-09-03 16:37:37 -04:00
Alan Rominger
5a6e9a06e2 Exclude control-only nodes from IG policy calculations (#10985)
* Exclude control-only nodes from IG policy calculations

Also, as a reverse to this, exclude execution-only nodes from
  the calculations if the group in question is the controlplane

* Incorporate review comments
2021-09-01 08:09:46 -04:00
Alan Rominger
dc4b014d12 Make status command in error handling cleaner (#10823) 2021-08-31 12:02:39 -04:00
Jim Ladd
467a37f8fe use settings.DEFAULT_EXECUTION_QUEUE_NAME in lieu of default 2021-08-26 11:15:14 -07:00
Jim Ladd
88a6412b54 only need to update IG's policy_instance_list field 2021-08-26 11:15:14 -07:00
Jim Ladd
502eaf9fb9 handle case where default IG does not exist
* also, only add discovered execution node to default group
  if `register`-ing the node actually resulted in a confirmed
  change
2021-08-26 11:15:14 -07:00
Jim Ladd
de8eab0434 inspect_execution_nodes should *not* block when retreiving lock
* would otherwise hold up cluster heartbeat task
* furthermore, only really need one node to run through
  `inspect_execution_nodes` each interval
2021-08-26 11:15:14 -07:00
Jim Ladd
f317fca9e4 add auto-discovered nodes to default IG
* add advisory_lock to avoid IG update race logic
* update IG by way of policy_instance_list
2021-08-26 11:15:14 -07:00
Jim Ladd
561fc289fb disable discovered instances by default 2021-08-26 11:15:14 -07:00
Alan Rominger
daf4310176 Clean up work_type processing and fix execution vs control capacity (#10930)
* Clean up added work_type processing for mesh_code branch

* track both execution and control capacity

* Remove unused execution_capacity property

* Count all forms of capacity to make test pass

* Force jobs to be on execution nodes, updates on control nodes

* Introduce capacity_type property to abstract some details out

* Update test to cover all job types at same time

* Register OpenShift nodes as control types

* Remove unqualified consumed_capacity from task manager and make unit tests work

* Remove unqualified consumed_capacity from task manager and make unit tests work

* Update unit test to execution vs control TM logic changes

* Fix bug, else handling for work_type method
2021-08-26 07:24:14 -04:00
Shane McDonald
274e487a96 Attempt to surface streaming errors that were being eaten (#10918) 2021-08-24 10:33:00 -04:00
Alan Rominger
940c189c12 Corresponding AWX changes for runner --worker-info schema update (#10926) 2021-08-24 08:41:36 -04:00
Alan Rominger
928c35ede5 Model changes for instance last_seen field to replace modified (#10870)
* Model changes for instance last_seen field to replace modified

* Break up refresh_capacity into smaller units

* Rename execution node methods, fix last_seen clustering

* Use update_fields to make it clear save only affects capacity

* Restructing to pass unit tests

* Fix bug where a PATCH did not update capacity value
2021-08-24 08:41:35 -04:00
Alan Rominger
3b1e40d227 Use the ansible-runner worker --worker-info to perform execution node capacity checks (#10825)
* Introduce utilities for --worker-info health check integration

* Handle case where ansible-runner is not installed

* Add ttl parameter for health check

* Reformulate return data structure and add lots of error cases

* Move up the cleanup tasks, close sockets

* Integrate new --worker-info into the execution node capacity check

* Undo the raw value override from the PoC

* Additional refinement to execution node check frequency

* Put in more complete network diagram

* Followup on comment to remove modified from from health check responsibilities
2021-08-24 08:41:35 -04:00
Alan Rominger
4e84c7c4c4 Use the existing get_receptor_ctl method (#10813) 2021-08-24 08:41:35 -04:00
Alan Rominger
f47eb126e2 Adopt the node_type field in receptor logic (#10802)
* Adopt the node_type field in receptor logic

* Refactor Instance.objects.register so we do not reset capacity to 0
2021-08-24 08:41:34 -04:00
Alan Rominger
9881bb72b8 Treat the awx_1 node as a hybrid node for now, use local work type (#10726) 2021-08-24 08:40:21 -04:00
Alan Rominger
f597205fa7 Run capacity checks with container isolation (#10688)
This requires swapping out the container images
  for the execution nodes from awx-ee to the awx image

For completeness, the hop node image is switched to the raw
  receptor image

A few outright bugs are fixed here
  memory calculation just was not right at all
  the execution_capacity calculation was reverse of intention

Drop in a few TODOs about error handling from debugging
2021-08-24 08:40:19 -04:00
Alan Rominger
e7be86867d Fix rebase bug specific to ad hoc commands 2021-08-24 08:40:19 -04:00
Alan Rominger
13300bdbd4 Update rebase to keep old control plane capacity check
Also do some basic work to separate control versus execution capacity
  this is to assure that we don't send jobs to the control node
2021-08-24 08:40:19 -04:00
Alan Rominger
39e23db523 Make minor changes to add needed imports 2021-08-24 08:40:19 -04:00
Ryan Petrello
05cb876df5 implement an initial development environment for receptor-based clusters 2021-08-24 08:40:18 -04:00
softwarefactory-project-zuul[bot]
68e309ee32 Merge pull request #10607 from AlanCoding/unused_exception
Remove unused exception about custom venvs

random cleanup

Reviewed-by: Bianca Henderson <beeankha@gmail.com>
2021-07-09 15:43:00 +00:00
Alan Rominger
17f9b57028 Remove unused exception about custom venvs 2021-07-07 11:38:37 -04:00
Alan Rominger
e96080a512 No result_traceback is blank, not null 2021-07-07 11:37:30 -04:00
Alan Rominger
f126a6343b Fix bug setting execution_node to null (not blank) (#5169) 2021-06-28 10:51:06 -04:00
Shane McDonald
1ed170fff0 Dont overwrite result_traceback if it was already set. 2021-06-28 10:51:06 -04:00
Alan Rominger
390e1f9a0a Fix obvious logical bug with project folder pre-creation (#5155) 2021-06-28 10:51:04 -04:00
Shane McDonald
397908543d Disable activity stream for updates in status handler 2021-06-28 10:51:04 -04:00