819 Commits

Author SHA1 Message Date
Alan Rominger
d3eb2c1975 Add new flake8 rules to do some meaningful corrections 2022-09-27 20:36:42 -04:00
Shane McDonald
260e1d4f2d Make static asset location consistent across all deployments (#12819) 2022-09-02 17:12:06 +00:00
Alan Rominger
974f845059 Revert "Merge pull request #12584 from AlanCoding/lazy_workers"
This reverts commit 64157f7207, reversing
changes made to 9e8ba6ca09.
2022-08-28 23:04:13 -04:00
Shane McDonald
2ef531b2dc Fix browsable API in development environment
Fallout from https://github.com/ansible/awx/pull/12722
2022-08-26 17:19:16 -04:00
Jessica Steurer
ff49cc5636 Merge pull request #12552 from whitej6/jlw-generic-oidc
Implement Generic OIDC Provider
2022-08-23 15:38:43 -03:00
Shane McDonald
1ed7a50755 Fix STATIC_ROOT in defaults
Reasoning:

- This is breaking the UI in official image builds of devel
- This is always being overridden in our packaging
- PROJECTS_ROOT and JOBOUTPUT_ROOT also hardcode /var/lib/awx
2022-08-23 12:39:54 -04:00
Jeremy White
9f3396d867 rebasing 2022-08-23 09:51:04 -05:00
Shane McDonald
6d11003975 Remove need for settings.py during image build 2022-08-22 13:46:42 -04:00
Shane McDonald
37d9c9eb1b Consolidate and refactor logging configuration code 2022-08-19 17:16:27 -04:00
Shane McDonald
c5976e2584 Add setting for missed heartbeats before marking node offline 2022-08-17 11:39:30 -04:00
Shane McDonald
3c51cb130f Add grace period settings for task manager timeout, and pod / job waiting reapers
Co-authored-by: Alan Rominger <arominge@redhat.com>
2022-08-17 11:39:01 -04:00
Alan Rominger
a9170236e1 Wait 60 seconds before scaling down a worker 2022-08-10 16:12:03 -04:00
Seth Foster
e6f8852b05 Cache task_impact
task_impact is now a field on the database
It is calculated and set during create_unified_job

set task_impact on .save for adhoc commands
2022-08-05 14:33:47 -04:00
Elijah DeLee
7eb0c7dd28 exit task manager loops early if we are timed out
add settings to define task manager timeout and grace period

This gives us still TASK_MANAGER_TIMEOUT_GRACE_PERIOD amount of time to
get out of the task manager.

Also, apply start task limit in WorkflowManager to starting pending
workflows
2022-08-05 14:33:24 -04:00
Seth Foster
ff118f2177 Manage pending workflow jobs in Workflow Manager
get_tasks uses UnifiedJob
Additionally, make local overrides run after development settings
2022-08-05 14:31:48 -04:00
Elijah DeLee
ad08eafb9a add debug views for task manager(s)
implement https://github.com/ansible/awx/issues/12446
in development environment, enable set of views that run
the task manager(s).

Also introduce a setting that disables any calls to schedule()
that do not originate from the debug views when in the development
environment. With guards around both if we are in the development
environment and the setting, I think we're pretty safe this won't get
triggered unintentionally.

use MODE to determine if we are in devel env

Also, move test for skipping task managers to the tasks file
2022-08-05 14:31:24 -04:00
Seth Foster
431b9370df Split TaskManager into
- DependencyManager spawns dependencies if necessary
- WorkflowManager processes running workflows to see if a new job is
  ready to spawn
- TaskManager starts tasks if unblocked and has execution capacity
2022-08-05 14:29:02 -04:00
Seth Foster
2f82b75748 Add subsystem metrics for task manager 2022-06-14 11:00:11 -04:00
Alan Rominger
aaad634483 Only use in-memory cache for database settings, set ttl=5 (#12166)
* Only use in-memory cache for database settings

Make necessary adjustments to monkeypatch
  as it is very vulnerable to recursion
  Remove migration exception that is now redundant

Clear cache if a setting is changed

* Use dedicated middleware for setting cache stuff
  Clear cache for each request

* Add tests for in-memory cache
2022-05-10 21:58:22 -04:00
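The commit above describes an in-memory cache with a five-second lifetime that is cleared when a setting changes. As a rough illustration only, that idea can be sketched as follows. The `TTLCache` class, its method names, and the `SOME_SETTING` key are hypothetical and are not taken from the AWX code; the injectable `clock` exists just to make expiry easy to demonstrate.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl seconds (hypothetical)."""

    def __init__(self, ttl=5, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._data = {}

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, stored_at = entry
        if self.clock() - stored_at >= self.ttl:
            del self._data[key]  # expired; force a fresh lookup next time
            return default
        return value

    def set(self, key, value):
        self._data[key] = (value, self.clock())

    def clear(self):
        # Called when a setting changes so stale values are never served.
        self._data.clear()


# Usage with a fake clock so the expiry is deterministic.
now = [0.0]
cache = TTLCache(ttl=5, clock=lambda: now[0])
cache.set("SOME_SETTING", "value")
fresh = cache.get("SOME_SETTING")    # still within the 5 second window
now[0] = 6.0
expired = cache.get("SOME_SETTING")  # past the window, so the default is returned
```

A short lifetime like this trades a small window of staleness for fewer database reads; clearing on change narrows that window further within a single process.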
Alan Rominger
29d60844a8 Fix notification timing issue by sending in the latter of 2 events (#12110)
* Track host_status_counts and use that to process notifications

* Remove now unused setting

* Back out changes to callback class not needed after all

* Skirt the need for duck typing by leaning on the cached field

* Delete tests for deleted task

* Revert "Back out changes to callback class not needed after all"

This reverts commit 3b8ae350d218991d42bffd65ce4baac6f41926b2.

* Directly hardcode stats_event_type for callback class

* Fire notifications if stats event was never sent

* Remove test content for deleted methods

* Add placeholder for when no hosts matched

* Make field default be None, denote events processed with empty dict

* Make UI process null value for host_status_counts

* Fix tracking of EOF dispatch for system jobs

* Reorganize EVENT_MAP into class properties

* Consolidate conditional I missed from EVENT_MAP refactor

* Give up on the null condition, also applies for empty hosts

* Remove cls position argument not being used

* Move wrapup method out of class, add tests
2022-04-29 13:54:31 -04:00
John Westcott IV
a0ccc8c925 Merge pull request #5784 from ansible/runner_changes_42 (#12083) 2022-04-22 10:46:35 -04:00
Alan Rominger
7822da03fb Merge pull request #11865 from AlanCoding/galaxy_task_env
Add user-defined environment variables to ansible-galaxy commands
2022-04-01 15:24:54 -04:00
Alan Rominger
73e02e745a Patches to make jobs robust to database restarts (#11905)
* Simple patches to make jobs robust to database restarts

* Add some wait time before retrying loop due to DB error

* Apply dispatcher downtime setting to job updates, fix dispatcher bug

This resolves a bug where the pg_is_down property
  never had the right value
  the loop is normally stuck in the conn.events() iterator
  so it never recognized successful database interactions
  this led to serial database outages terminating jobs

New setting for allowable PG downtime is shared with task code
  any calls to update_model will use _max_attempts parameter
  to make it align with the patience time that the dispatcher
  respects when consuming new events

* To avoid restart loops, handle DB errors on startup with prejudice

* If reconnect consistently fails, exit with non-zero code
2022-03-30 09:14:20 -04:00
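The retry behaviour described in the commit above (wait before retrying after a database error, and give up after a bounded number of attempts) might look roughly like the sketch below. Every name here is hypothetical: `DatabaseDown`, `update_with_retry`, and its parameters are illustrative stand-ins and do not reproduce the signature of `update_model` or the real setting.

```python
import time


class DatabaseDown(Exception):
    """Hypothetical stand-in for a database connection error."""


def update_with_retry(update, max_attempts=5, wait_seconds=0.0, sleep=time.sleep):
    """Call update() until it succeeds or max_attempts is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return update()
        except DatabaseDown:
            if attempt == max_attempts:
                raise  # patience exhausted; let the caller decide what to do
            sleep(wait_seconds)  # pause so a restarting database can recover


# Usage: an update that fails twice, then succeeds on the third call.
calls = {"count": 0}


def flaky_update():
    calls["count"] += 1
    if calls["count"] < 3:
        raise DatabaseDown("connection refused")
    return "saved"


result = update_with_retry(flaky_update, max_attempts=5)
```

Sharing one attempt limit between the job code and the event consumer, as the commit message suggests, keeps both sides equally patient during an outage.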
Alan Rominger
c77667788a Add user-defined environment variables to ansible-galaxy commands 2022-03-29 09:57:40 -04:00
Shane McDonald
b6573ec2e2 Merge pull request #11961 from shanemcd/respect-system-tmp
Respect system configured tmp directory
2022-03-25 08:13:53 -04:00
Shane McDonald
ea59e895af Respect system configured tmp directory 2022-03-24 13:51:02 -04:00
Chris Meyers
59bd73bff8 add setting for notification job status retry loop
* We trigger notifications when the callback receiver processes the
playbook_on_stats event. This is the last event in ansible-playbook and
the process should exit very shortly after this event is emitted. The
trouble comes in with the isolated node feature. There is a management
playbook that runs periodically that pulls the events from the remote
node. It's possible that the management playbook runs, gets the
playbook_on_stats event, but does not see that the playbook is finished
running. Therefore the job status is still seen as 'running' BUT we have
kicked off the notification for the job. The notification worker will
enter a loop waiting on the job to enter the finished state. In this
case the time it takes for the job to enter the finished state can be
long, roughly 2 * the management playbook run time.
* This new setting allows the user to increase the time that the
notification spends waiting for the job to enter the finished state.
2022-03-22 09:20:14 -04:00
Jeff Bradberry
ac6a82eee4 Merge pull request #11654 from jbradberry/django-3.2-upgrade
Django 3.2 upgrade
2022-03-17 10:34:22 -04:00
Jeff Bradberry
38ccea0f1f Fix up warnings
- the default auto-increment primary key field type is now
  configurable, and Django's check command issues a warning if you are
  just assuming the historical behavior of using AutoField.

- Django 3.2 brings in automatic AppConfig discovery, so all of our
  explicit `default_app_config = ...` assignments in __init__.py
  modules are no longer needed, and raise a RemovedInDjango41Warning.
2022-03-14 13:19:57 -04:00
Jeff Bradberry
1803c5bdb4 Fix up usage of django-guid
It has replaced the class-based middleware, everything is
function-based now.
2022-03-14 13:19:57 -04:00
Alan Rominger
d4a4ba7fdb Move location of AWX_ISOLATION_SHOW_PATHS so it is editable 2022-03-11 11:08:04 -05:00
Jeff Bradberry
e620bef2a5 Fix Django 3.1 deprecation removal problems
- FieldDoesNotExist now has to be imported from django.core.exceptions
- Django docs specifically say not to import
  django.conf.global_settings, which now has the side-effect of
  triggering one of the check errors
2022-03-07 18:11:36 -05:00
Jeff Bradberry
faa12880a9 Squash a few deprecation warnings
- inspect.getargspec() -> inspect.getfullargspec()
- register pytest.mark.fixture_args
- replace use of DRF's deprecated NullBooleanField
- fix some usage of naive datetimes in the tests
- fix some strings with backslashes that ought to be raw strings
2022-03-07 18:11:36 -05:00
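For reference, the first substitution listed in the commit above, `inspect.getargspec()` to `inspect.getfullargspec()`, can be illustrated with a small standard-library sketch. The `launch` function is a hypothetical stand-in used only for introspection; it is not code from the repository.

```python
import inspect


def launch(job, *extra, timeout=30, **options):
    """Hypothetical function used only to demonstrate introspection."""
    return job


# getfullargspec() reports keyword-only arguments and annotations,
# which the deprecated getargspec() could not represent.
spec = inspect.getfullargspec(launch)
print(spec.args)        # positional parameter names
print(spec.varargs)     # name of the *args parameter
print(spec.varkw)       # name of the **kwargs parameter
print(spec.kwonlyargs)  # keyword-only parameter names
```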
Jeff Bradberry
df61d1a59c Upgrade to Django 3.0
- upgrades
  - Django 3.0.14
  - django-jsonfield 1.4.1 (from 1.2.0)
  - django-oauth-toolkit 1.4.1 (from 1.1.3)
    - Stopping here because later versions have changes to the
      underlying model to support OpenID Connect.  Presumably this can
      be dealt with via a migration in our project.
  - django-guid 2.2.1 (from 2.2.0)
  - django-debug-toolbar 3.2.4 (from 1.11.1)
  - python3-saml 1.13.0 (from 1.9.0)
  - xmlsec 1.3.12 (from 1.3.3)

- Remove our project's use of django.utils.six in favor of directly
  using six, in awx.sso.fields.

- Temporarily monkey patch six back in as django.utils.six, since
  django-jsonfield makes use of that import, and is no longer being
  updated.  Hopefully we can do away with this dependency with the new
  generalized JSONField brought in with Django 3.1.

- Force a json decoder to be used with all instances of JSONField
  brought in by django-jsonfield.  This deals with the 'cast to text'
  problem noted previously in our UPGRADE_BLOCKERS.

- Remove the validate_uris validator from the OAuth2Application in
  migration 0025, per the UPGRADE_BLOCKERS, and remove that note.

- Update the TEMPLATES setting to satisfy Django Debug Toolbar.  It
  requires at least one entry that has APP_DIRS=True, and as near as I
  can tell our custom OPTIONS.loaders setting was effectively doing
  the same thing as Django's own machinery if this setting is set.
2022-03-07 18:11:36 -05:00
Shane McDonald
9f021b780c Move default show paths to production.py
This breaks the dev env
2022-03-07 16:08:58 -05:00
Shane McDonald
a5b888c193 Add default container mounts to AWX_ISOLATION_SHOW_PATHS 2022-03-07 11:45:23 -05:00
Marcelo Moreira de Mello
5e8107621e Allow isolated paths as hostPath volume @ k8s/ocp/container groups 2022-02-28 10:22:20 -05:00
John Westcott IV
cb57752903 Changing session cookie name and added a way for clients to know what the name is #11413 (#11679)
* Changing session cookie name and added a way for clients to know what the key name is
* Adding session information to docs
* Fixing how awxkit gets the session id header
2022-02-27 07:27:25 -05:00
Shane McDonald
88f66d5c51 Enable Podman ipv6 support by default 2022-02-24 08:51:51 -05:00
Elijah DeLee
604cbc1737 Consume control capacity (#11665)
* Select control node before start task

Consume capacity on control nodes for controlling tasks and consider
remaining capacity on control nodes before selecting them.

This depends on the requirement that control and hybrid nodes should all
be in the instance group named 'controlplane'. Many tests do not satisfy that
requirement. I'll update the tests in another commit.

* update tests to use controlplane

We don't start any tasks if we don't have a controlplane instance group

Due to updates to fixtures, update tests to set node type and capacity
explicitly so they get expected result.

* Fixes for accounting of control capacity consumed

Update method is used to account for currently consumed capacity for
instance groups in the in-memory capacity tracking data structure we initialize in
after_lock_init and then update via calculate_capacity_consumed (both in
task_manager.py)

Also update fit_task_to_instance to consider control impact on instances

Trust that these functions do the right thing looking for a
node with capacity, and cut out redundant check for the whole group's
capacity per Alan's recommendation.

* Refactor now redundant code

Deal with control type tasks before we loop over the preferred instance
groups, which cuts out the need for some redundant logic.

Also, fix a bug where I was missing assigning the execution node in one case!

* set job explanation on tasks that need capacity

move the job explanation for jobs that need capacity to a function
so we can re-use it in the three places we need it.

* project updates always run on the controlplane

Instance group ordering makes no sense on project updates because they
always need to run on the control plane.

Also, since hybrid nodes should always run the control processes for the
jobs running on them as execution nodes, account for this when looking for an
execution node.

* fix misleading message

the variables and wording were both misleading, fix to be more accurate
description in the two different cases where this log may be emitted.

* use settings correctly

use settings.DEFAULT_CONTROL_PLANE_QUEUE_NAME instead of a hardcoded
name
cache the controlplane_ig object during the after lock init to avoid
an unnecessary query
eliminate mistakenly duplicated AWX_CONTROL_PLANE_TASK_IMPACT and use
only AWX_CONTROL_NODE_TASK_IMPACT

* add test for control capacity consumption

add test to verify that when there are 2 jobs and only capacity for one
that one will move into waiting and the other stays in pending

* add test for hybrid node capacity consumption

assert that the hybrid node is used for both control and execution and
capacity is deducted correctly

* add test for task.capacity_type = control

Test that control type tasks have the right capacity consumed and
get assigned to the right instance group

Also fix lint in the tests

* jobs_running not accurate for control nodes

We can either NOT use "idle instances" for control nodes, or we need
to update the jobs_running property on the Instance model to count
jobs where the node is the controller_node.

I didn't do that because it may be an expensive query, and it would be
hard to make it match with jobs_running on the InstanceGroup which
filters on tasks assigned to the instance group.

This change chooses to stop considering "idle" control nodes an option,
since we can't accurately identify them.

Without this change, we continue to over-consume capacity on control nodes
because this method sees all control nodes as "idle" at the beginning
of the task manager run, and then only counts jobs started in that run
in the in-memory tracking. So jobs which last over a number of task
manager runs build up consuming capacity, which is accurately reported
via Instance.consumed_capacity

* Reduce default task impact for control nodes

This is something we can experiment with as far as what users
want at install time, but start with just 1 for now.

* update capacity docs

Describe usage of the new setting and the concept of control impact.

Co-authored-by: Alan Rominger <arominge@redhat.com>
Co-authored-by: Rebeccah <rhunter@redhat.com>
2022-02-14 10:13:22 -05:00
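The in-memory capacity accounting described in the commit above can be pictured with a small sketch. This is an assumption-laden illustration, not the task manager's code: the `Node` class, `pick_control_node`, and the rule of preferring the node with the most remaining capacity are all hypothetical. Only the default control impact of 1 and the two-jobs-one-slot scenario come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical in-memory record of one control node's capacity."""
    name: str
    capacity: int
    consumed: int = 0

    @property
    def remaining(self):
        return self.capacity - self.consumed


def pick_control_node(nodes, control_impact=1):
    """Return a node that can absorb control_impact, or None if none can."""
    candidates = [n for n in nodes if n.remaining >= control_impact]
    if not candidates:
        return None
    # Assumed selection rule: prefer the node with the most room left.
    best = max(candidates, key=lambda n: n.remaining)
    best.consumed += control_impact  # account for the task we just placed
    return best


# Two jobs, capacity for one: the first is placed, the second is not.
nodes = [Node("control-1", capacity=1)]
first = pick_control_node(nodes)   # fits, so the job could move to waiting
second = pick_control_node(nodes)  # no capacity left, so the job stays pending
```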
Alex Corey
62b0c2b647 Fix tooltip documentation 2022-02-02 16:18:41 -05:00
Marcelo Moreira de Mello
0fef88c358 Support user customization of container mount options and mount paths 2022-01-21 17:12:32 -05:00
Elijah DeLee
faba64890e Merge pull request #11559 from kdelee/pending_container_group_jobs_take2
Add resource requests to default podspec
2022-01-20 09:54:20 -05:00
John Westcott IV
e63ce9ed08 Api 4XX error msg customization #1236 (#11527)
* Adding API_400_ERROR_LOG_FORMAT setting
* Adding functional tests for API_400_ERROR_LOG_FORMAT
Co-authored-by: nixocio <nixocio@gmail.com>
2022-01-19 11:16:21 -05:00
Elijah DeLee
987924cbda Add resource requests to default podspec
Extend the timeout, assuming that we want to let the kubernetes scheduler
start containers when it wants to start them. This allows us to make
resource requests knowing that when some jobs queue up waiting for
resources, they will not get reaped in as short of a
timeout.
2022-01-18 13:34:39 -05:00
Amol Gautam
a4a3ba65d7 Refactored tasks.py to a package
--- Added 3 new sub-packages: awx.main.tasks.system, awx.main.tasks.jobs, awx.main.tasks.receptor
--- Modified the functional tests and unit tests accordingly
2022-01-14 11:55:41 -05:00
Jeff Bradberry
db999b82ed Merge pull request #11431 from jbradberry/receptor-mesh-models
Modify Instance and introduce InstanceLink
2022-01-11 10:55:54 -05:00
John Westcott IV
c92468062d SAML user attribute flags issue #5303 (PR #11430)
* Adding SAML option in SAML configuration to specify system auditor and system superusers by role or attribute
* Adding keycloak container and documentation on how to start keycloak alongside AWX (including configuration of both)
2022-01-10 16:52:44 -05:00
Seth Foster
956638e564 Revert "Remove unnecessary DEBUG logger level settings (#11441)"
This reverts commit 8126f734e3.
2022-01-10 11:46:19 -05:00
Jeff Bradberry
f1c5da7026 Remove the auto-discovery feature 2022-01-10 11:37:19 -05:00