This takes some logic out of the queryset, relying on an established assumption about the task manager: if a job lands on a hybrid node (or is a project update), then it will have the same controller and execution node.
With that established, the queryset can be simplified.
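A hedged sketch of the resulting lookup; UnifiedJob, controller_node, and execution_node are real AWX names, but the exact filter and helper name are illustrative:

    from django.db.models import Q

    from awx.main.models import UnifiedJob

    def tasks_touching(hostname):
        # No special-case branch for hybrid nodes or project updates is
        # needed: for those, controller_node == execution_node, so the
        # two clauses below already cover them.
        return UnifiedJob.objects.filter(
            Q(controller_node=hostname) | Q(execution_node=hostname)
        )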
This will enable us to provide more useful information for the user,
now that all user-triggered health checks are async.
Also, de-bounce the health check endpoint so that additional health check tasks cannot be triggered while one is already in progress.
The awx-web container does not have access to the receptor socket, and the execution node health check requires receptorctl.
This change runs the health check asynchronously in the task container.
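A hedged sketch of the de-bounced endpoint; the view shape and the health_check_pending flag are illustrative rather than AWX's exact API:

    from rest_framework import status
    from rest_framework.generics import GenericAPIView
    from rest_framework.response import Response

    from awx.main.tasks.system import execution_node_health_check  # assumed import path

    class InstanceHealthCheck(GenericAPIView):
        def post(self, request, *args, **kwargs):
            instance = self.get_object()
            if instance.health_check_pending:  # one is already queued or running
                # De-bounce: report status instead of dispatching a duplicate task
                return Response({'msg': 'Health check already in progress.'})
            # Dispatch to the task container, which (unlike awx-web) can reach
            # the receptor socket that receptorctl requires.
            execution_node_health_check.apply_async([instance.hostname])
            return Response(status=status.HTTP_200_OK)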
After all jobs on the node are complete, delete the node, then broadcast the write_receptor_config task.
Also, make sure that write_receptor_config updates the state of links that are in the 'adding' state.
When a new remote execution/hop node is added, regenerate receptor.conf for all control nodes so that they peer out to the new remote execution node.
Signed-off-by: Hao Liu <haoli@redhat.com>
Co-Authored-By: Seth Foster <fosterseth@users.noreply.github.com>
Co-Authored-By: Shane McDonald <me@shanemcd.com>
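Roughly the shape of the link-state handling described above; the InstanceLink model and state values here are illustrative, not guaranteed to match the real code:

    from awx.main.models import InstanceLink  # hypothetical import for this sketch

    def write_receptor_config():
        # ... re-render receptor.conf so every control node peers out to the
        # newly added execution/hop node, then reload receptor ...
        # Links created during provisioning start out in the 'adding' state;
        # once the new config is written they can be marked established.
        InstanceLink.objects.filter(link_state='adding').update(link_state='established')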
- nodes with states Provisioning, Provisioning Fail, Deprovisioning,
  and Deprovisioning Fail should bypass health checks and should never
  be transitioned by the existing machinery
- nodes with states Unavailable and Installed can transition to Ready
if they check out as healthy
- nodes in the Ready state should transition to Unavailable if they
fail a check
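Sketched as a transition function (the state spellings are hypothetical constants):

    SKIP_STATES = {'provisioning', 'provision-fail', 'deprovisioning', 'deprovision-fail'}

    def next_node_state(current, passed_health_check):
        if current in SKIP_STATES:
            return current  # these states bypass health checks entirely
        if current in ('unavailable', 'installed') and passed_health_check:
            return 'ready'
        if current == 'ready' and not passed_health_check:
            return 'unavailable'
        return current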
This removes a loop that ran on import. The loop was giving the wrong behavior, and it initialized too many fields as char_prompts fields. With this change, we now enumerate the char_prompts type fields manually.
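A sketch of the manual enumeration; the exact membership of the tuple is illustrative, not the full list:

    # Spelled out explicitly instead of being derived by a loop at import
    # time, which had initialized too many fields as char_prompts fields.
    CHAR_PROMPTS_LIST = (
        'job_type',
        'job_tags',
        'skip_tags',
        'limit',
        'verbosity',
        'diff_mode',
    )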
* Making almost all fields promptable on job templates and config models
* Adding EE, IG and label access checks
* Changing the job's preferred instance group function to handle the new IG cache field
* Adding new ask fields to job template modules
* Address unit/functional tests
* Adding migration file
* Remove committed_capacity field, delete supporting code
* Track consumed capacity to solve the negatives problem
* Use more verbose name for IG queryset
* move static methods used by task manager
These static methods were being used to act on Instance-like objects
that were SimpleNamespace objects with the necessary attributes.
This change introduces dedicated classes to replace the SimpleNamespace
objects and moves the former static methods to a place where they are
more relevant, instead of being tacked onto models to which they were
only loosely related.
Accept an in-memory data structure in init methods for tests
* initialize remaining capacity AFTER we build the map of instances
This JSONBlob field type is a wrapper around Django's new generic
JSONField, but with the database column type forced to be text. This
should behave close enough to our old wrapper around
django-jsonfield's JSONField and will avoid needing to do the
out-of-band database migration.
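Roughly the shape of the field, assuming Django >= 3.1's built-in JSONField (a sketch, not the exact implementation):

    from django.db.models import JSONField

    class JSONBlob(JSONField):
        def db_type(self, connection):
            # Keep the column as text (what django-jsonfield used), so
            # swapping field classes requires no schema change.
            return 'text'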
* Process unresponsive and newly responsive hop nodes
* Use more natural way to zero hop node capacity, add test
* Use warning as opposed to warn for log messages
The event_data field on event models, however, is getting an
overridden version that retains the underlying text data type for the
column, to avoid a heavy data migration on those tables.
Also, certain of the larger tables are getting these fields with the
NOT NULL constraint turned off, to avoid a long migration.
Remove the django.utils.six monkey patch we did at the beginning of
the upgrade.
fix memory and cpu settings to support k8s resource request format
* fix conversion of memory setting to bytes
This setting has not been getting set by default, and needed some fixing
up to be compatible with setting the memory in the same way as we set it
in the operator, as well as with other changes from last year which
assume that ansible runner is returning memory in bytes.
This way we can start setting this setting in the operator, and get a
more accurate reflection of how much memory is available to the control
pod in k8s.
On platforms where services are all sharing memory, we deduct a
penalty from the memory available. On k8s we don't need to do this
because the web, redis, and task containers each have memory
allocated to them.
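A hedged sketch of the memory conversion; the helper name and suffix table are illustrative:

    MEM_MULTIPLIERS = {
        'K': 10**3, 'M': 10**6, 'G': 10**9,
        'Ki': 2**10, 'Mi': 2**20, 'Gi': 2**30,
    }

    def convert_mem_str_to_bytes(mem):
        # Accepts k8s-style strings like '1Gi' or '500M'; try the
        # two-letter binary suffixes before the one-letter decimal ones.
        for suffix in sorted(MEM_MULTIPLIERS, key=len, reverse=True):
            if mem.endswith(suffix):
                return int(mem[:-len(suffix)]) * MEM_MULTIPLIERS[suffix]
        return int(mem)  # already plain bytes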
* Support CPU setting expressed in units used by k8s
This setting has not been getting set by default, and needed some fixing
up to be compatible with setting the CPU in the same way as we set the
resource requests/limits in the operator.
This way we can start setting this setting in the
operator, and get a more accurate reflection of how much cpu is
available to the control pod in k8s.
Because cpu on k8s can be partial cores, migrate cpu field to decimal.
k8s does not allow granularity of less than 100m (equivalent to 0.1 cores), so only
store up to 1 decimal place.
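A sketch of the unit handling (illustrative, not the exact code):

    from decimal import Decimal

    def convert_cpu_str_to_decimal_cpu(cpu):
        # '500m' means 500 millicores, i.e. 0.5 cores.
        cores = Decimal(cpu[:-1]) / 1000 if cpu.endswith('m') else Decimal(cpu)
        # k8s granularity bottoms out at 100m, so one decimal place suffices.
        return cores.quantize(Decimal('0.1'))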
fix analytics to deal with decimal cpu
need to use DjangoJSONEncoder when there are Decimal fields in data passed to json.dumps
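In miniature:

    import json
    from decimal import Decimal

    from django.core.serializers.json import DjangoJSONEncoder

    json.dumps({'cpu': Decimal('1.5')}, cls=DjangoJSONEncoder)  # '{"cpu": "1.5"}'
    # Plain json.dumps({'cpu': Decimal('1.5')}) raises TypeError.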
* Select control node before starting task
Consume capacity on control nodes for controlling tasks and consider
remaining capacity on control nodes before selecting them.
This depends on the requirement that control and hybrid nodes should all
be in the instance group named 'controlplane'. Many tests do not satisfy that
requirement. I'll update the tests in another commit.
* update tests to use controlplane
We don't start any tasks if we don't have a controlplane instance group.
Due to updates to fixtures, update tests to set node type and capacity
explicitly so they get the expected result.
* Fixes for accounting of control capacity consumed
The update method is used to account for currently consumed capacity for
instance groups in the in-memory capacity tracking data structure, which we
initialize in after_lock_init and then update via calculate_capacity_consumed
(both in task_manager.py).
Also update fit_task_to_instance to consider control impact on instances.
Trust that these functions do the right thing when looking for a node with
capacity, and cut out the redundant check of the whole group's capacity,
per Alan's recommendation.
* Refactor now redundant code
Deal with control type tasks before we loop over the preferred instance
groups, which cuts out the need for some redundant logic.
Also, fix a bug where I missed assigning the execution node in one case!
* set job explanation on tasks that need capacity
move the job explanation for jobs that need capacity to a function
so we can re-use it in the three places we need it.
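A sketch of the shared helper; the name and message text are illustrative:

    from django.utils.translation import gettext_noop

    def set_needs_capacity_explanation(task):
        # Called from the three places that previously inlined this.
        task.job_explanation = gettext_noop(
            'This job is not ready to start because there is not enough available capacity.'
        )
        task.save(update_fields=['job_explanation'])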
* project updates always run on the controlplane
Instance group ordering makes no sense on project updates because they
always need to run on the control plane.
Also, since hybrid nodes should always run the control processes for the
jobs running on them as execution nodes, account for this when looking for
an execution node.
* fix misleading message
The variables and wording were both misleading; fix them to more accurately
describe the two different cases where this log may be emitted.
* use settings correctly
use settings.DEFAULT_CONTROL_PLANE_QUEUE_NAME instead of a hardcoded
name
cache the controlplane_ig object during after_lock_init to avoid
an unnecessary query
eliminate mistakenly duplicated AWX_CONTROL_PLANE_TASK_IMPACT and use
only AWX_CONTROL_NODE_TASK_IMPACT
* add test for control capacity consumption
add test to verify that when there are 2 jobs and only capacity for one,
one will move into waiting and the other stays in pending
* add test for hybrid node capacity consumption
assert that the hybrid node is used for both control and execution and
capacity is deducted correctly
* add test for task.capacity_type = control
Test that control type tasks have the right capacity consumed and
get assigned to the right instance group
Also fix lint in the tests
* jobs_running not accurate for control nodes
We can either NOT use "idle instances" for control nodes, or we need
to update the jobs_running property on the Instance model to count
jobs where the node is the controller_node.
I didn't do that because it may be an expensive query, and it would be
hard to make it match with jobs_running on the InstanceGroup which
filters on tasks assigned to the instance group.
This change chooses to stop considering "idle" control nodes an option,
since we can't accurately identify them.
Without this change, we keep over-consuming capacity on control nodes,
because this method sees all control nodes as "idle" at the beginning
of the task manager run and then only counts jobs started during that run
in the in-memory tracking. So jobs that span multiple task manager runs
build up consumed capacity, which is accurately reported
via Instance.consumed_capacity.
* Reduce default task impact for control nodes
This is something we can experiment with as far as what users
want at install time, but start with just 1 for now.
* update capacity docs
Describe usage of the new setting and the concept of control impact.
Co-authored-by: Alan Rominger <arominge@redhat.com>
Co-authored-by: Rebeccah <rhunter@redhat.com>
--- Added 3 new sub-packages: awx.main.tasks.system, awx.main.tasks.jobs, awx.main.tasks.receptor
--- Modified the functional tests and unit tests accordingly
- the list, detail, and health check API views should not include them
- the Instance-InstanceGroup association views should not allow them
to be changed
- the ping view excludes them
- list_instances management command excludes them
- Instance.set_capacity_value sets hop nodes to 0 capacity
- TaskManager will exclude them from the nodes available for job execution
- TaskManager.reap_jobs_from_orphaned_instances will consider hop nodes
  to be orphaned instances
- The apply_cluster_membership_policies task will not manipulate hop nodes
- get_broadcast_hosts will ignore hop nodes
- active_count also will ignore hop nodes
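Most of the items above reduce to the same queryset exclusion; a minimal sketch (node_type='hop' is the real field value from this feature, the rest is illustrative):

    from awx.main.models import Instance

    # Wherever instances are enumerated for capacity, health, membership,
    # or broadcast, hop nodes are filtered out.
    workers = Instance.objects.exclude(node_type='hop')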
Add new logic to clean up orphaned work units from administrative tasks.
Remove the noisy and often irrelevant log about running cleanup on
execution nodes; we already have other logs for this.
* Primary development of integrating runner cleanup command
* Fixup image cleanup signals and their tests
* Use alphabetical sort to solve the cluster coordination problem (see the sketch after this list)
* Update test to new pattern
* Clarity edits to interface with ansible-runner cleanup method
* Another change corresponding to ansible-runner CLI updates
* Fix incomplete implementation of receptor remote cleanup
* Share receptor utils code between worker_info and cleanup
* Complete task logging from calling runner cleanup command
* Wrap up unit tests and some contract changes that fall out of those
* Fix bug in CLI construction
* Fix queryset filter bug
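The alphabetical-sort bullet above is a lock-free coordination trick: every node evaluates the same deterministic rule, so exactly one instance elects itself to run the cleanup. A sketch with illustrative names:

    def should_run_cleanup(my_hostname, cluster_hostnames):
        # Deterministic and identical on every node in the cluster.
        return my_hostname == sorted(cluster_hostnames)[0]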
* Fully finalize the planned work for health checks of execution nodes
* Implementation of instance health_check endpoint
* Also do version conditional to node_type
* Do not use receptor mesh to check main cluster nodes health
* Fix bugs from testing health check of cluster nodes, add doc
* Add a few fields to health check serializer missed before
* Light refactoring of error field processing
* Fix errors clearing error, write more unit tests
* Update health check info in docs
* Bump migration of health check after rebase
* Mark string for translation
* Add related health_check link for system auditors too
* Handle health_check cluster node timeout, add errors for peer judgement
* Clean up added work_type processing for mesh_code branch
* track both execution and control capacity
* Remove unused execution_capacity property
* Count all forms of capacity to make test pass
* Force jobs to be on execution nodes, updates on control nodes
* Introduce capacity_type property to abstract some details out (see the sketch after this list)
* Update test to cover all job types at same time
* Register OpenShift nodes as control types
* Remove unqualified consumed_capacity from task manager and make unit tests work
* Update unit test to execution vs control TM logic changes
* Fix bug, else handling for work_type method
* Model changes for instance last_seen field to replace modified
* Break up refresh_capacity into smaller units
* Rename execution node methods, fix last_seen clustering
* Use update_fields to make it clear save only affects capacity
* Restructuring to pass unit tests
* Fix bug where a PATCH did not update capacity value
* Introduce utilities for --worker-info health check integration
* Handle case where ansible-runner is not installed
* Add ttl parameter for health check
* Reformulate return data structure and add lots of error cases
* Move up the cleanup tasks, close sockets
* Integrate new --worker-info into the execution node capacity check
* Undo the raw value override from the PoC
* Additional refinement to execution node check frequency
* Put in more complete network diagram
* Follow-up on comment to remove modified from health check responsibilities
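A hedged sketch of the capacity_type property mentioned above; the class and attribute names are illustrative:

    class UnifiedJobSketch:
        job_kind = 'run'  # e.g. 'project_update' for SCM updates

        @property
        def capacity_type(self):
            # Updates run on control nodes and consume control capacity;
            # regular jobs must land on execution nodes and consume
            # execution capacity.
            return 'control' if self.job_kind == 'project_update' else 'execution'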