Commit Graph

53 Commits

Author SHA1 Message Date
Ryan Petrello
ff1e8cc356 replace celery task decorators with a kombu-based publisher
this commit implements the bulk of `awx-manage run_dispatcher`, a new
command that binds to RabbitMQ via kombu and balances messages across
a pool of workers that are similar to celeryd workers in spirit.
Specifically, this includes:

- a new decorator, `awx.main.dispatch.task`, which can be used to
  decorate functions or classes so that they can be designated as
  "Tasks"
- support for fanout/broadcast tasks (at this point in time, only
  `conf.Setting` memcached flushes use this functionality)
- support for job reaping
- support for success/failure hooks for job runs (i.e.,
  `handle_work_success` and `handle_work_error`)
- support for auto scaling worker pool that scale processes up and down
  on demand
- minimal support for RPC, such as status checks and pool recycle/reload
2018-10-11 10:53:30 -04:00
Ryan Petrello
67d1267d98 enforce 0 <= Instance.capacity_adjustment
see: https://github.com/ansible/tower/issues/2839
2018-08-21 15:34:19 -04:00
AlanCoding
a99ebbb02f Apply policy task more selectively 2018-08-03 06:56:15 -04:00
Ryan Petrello
3cdd0a94bd simplify dynamic queue binding
we recently made a change so that instances no longer bind to
instance-group specific queues, but now instead they each bind to
a direct queue for their specific hostname
(https://github.com/ansible/tower/pull/1922)

Because of this, we shouldn't *need* to reconfigure the queue binds at
runtime anymore when group membership changes. Under our new model,
every celeryd listens on a queue named after its hostname; when the
scheduler finds a task to run, it picks an Instance in the target
Instance Group and sends the task to the queue for that Instance's
hostname.
2018-07-28 00:48:37 -04:00
Ryan Petrello
15aaca8f03 make InstanceGroup.policy_instance_list non-exclusive by default
see: https://github.com/ansible/tower/issues/2583
2018-07-25 17:26:13 -04:00
chris meyers
aeca21ab5b deny topology changes to iso instances via api 2018-07-06 14:50:17 -04:00
Ryan Petrello
ef6433c6f9 Revert "fix celery task reaper"
This reverts commit 1359208a99.
2018-06-18 16:18:20 -04:00
chris meyers
1359208a99 fix celery task reaper
* celery workers have internal queue names that are named after the
system hostname. This may differ from what tower knows the host by,
Instance.hostname
This adds a mapping so we can convert internal celery names to Instance
names for purposes of reaping jobs.
2018-06-15 16:56:53 -04:00
Ryan Petrello
84eacfc360 fix a few isolated dev issues
the main goal of this change is to make `make docker-isolated` work out
of the box

- specify the proper version for awx-expect --version
- update some deprecated playbook bits
- change the isolated container to privileged so bwrap will work
- fix awx-manage test_isolated_connection
- expedite the first isolated heartbeat so you don't have to wait 10m;
  this is accomplished by _not_ setting Instance.last_isolated_check to
  now() at insertion time (which causes the next check not to happen for
  10 minutes)
- fix a bug that caused isolated node execution to fail when bwrap was
  enabled

see: https://github.com/ansible/tower/issues/2150

This reverts commit 9863fe71dc.
2018-06-13 14:17:58 -04:00
chris meyers
b876c2af62 add total job count for instance + instance group 2018-06-05 12:05:22 -04:00
chris meyers
b94cf379f6 do not choose offline instances 2018-06-04 10:06:59 -04:00
chris meyers
e720fe5dd0 decide the node a job will run early
* Deciding the Instance that a Job runs on at celery task run-time makes
it hard to evenly distribute tasks among Instnaces. Instead, the task
manager will look at the world of running jobs and choose an instance
node to run on; applying a deterministic job distribution algo.
2018-06-04 10:06:59 -04:00
Ryan Petrello
4abac44411 prevent unicode in instance hostnames and instance group names
see: https://github.com/ansible/tower/issues/1721
2018-05-18 16:28:43 -04:00
Matthew Jones
05419d010b Update group cluster policies on save, not just created
Currently updating policy settings doesn't trigger a re-evaluation of
instance group policies, this makes sure we re-evaluate in the event
that anything changes.
2018-04-24 21:40:11 -04:00
chris meyers
c3100afd0e fixed isolated instance query
* Was considering an isolated instance: any instance that has at least 1
group with no controller. This is technically correct since an iso node
can not be a part of a non-iso group.
* The query is now more robust and considers a node an iso node if ALL
groups that a node belong to ALL have a controller.
* Also added better debugging for the special tower instance group
* Added a check for the existance of the special tower group so that
logs are less "messy" during the install process.
2018-04-03 13:50:57 -04:00
Chris Meyers
47fa99d3ad Merge pull request #1154 from chrismeyersfsu/enhancement-tower_in_all_groups
add all instances to special tower instance group
2018-04-02 09:39:04 -04:00
chris meyers
838b723c73 add all instances to special tower instance group
* All instances except isolated instances
* Also, prevent any tower attributes from being modified via the API
2018-03-29 16:47:52 -04:00
chris meyers
8438331563 make jobs_running more rich in OPTIONS
* Expose jobs_running as an IntegerField
2018-03-28 16:01:24 -04:00
AlanCoding
7881c921ac block deletion of resources w unprocessed events 2018-03-16 10:14:28 -04:00
chris meyers
5d5d8152c5 prevent instance group delete if running jobs
* related to https://github.com/ansible/ansible-tower/issues/7936
2018-03-15 14:25:49 -04:00
Matthew Jones
70bf78e29f Apply capacity algorithm changes
* This also adds fields to the instance view for tracking cpu and
  memory usage as well as information on what the capacity ranges are
* Also adds a flag for enabling/disabling instances which removes them
  from all queues and has them stop processing new work
* The capacity is now based almost exclusively on some value relative
  to forks
* capacity_adjustment allows you to commit an instance to a certain
  amount of forks, cpu focused or memory focused
* Each job run adds a single fork overhead (that's the reasoning
  behind the +1)
2018-02-01 16:57:09 -05:00
Matthew Jones
6e9930a45f Use on_commit hook for triggering ig policy
* also Apply console handlers to loggers for dev environment
2018-02-01 16:56:43 -05:00
Matthew Jones
d9e774c4b6 Updates for automatic triggering of policies
* Switch policy router queue to not be "tower" so that we don't
  fall into a chicken/egg scenario
* Show fixed policy list in serializer so a user can determine if
  an instance is manually managed
* Change IG membership mixin to not directly handle applying topology
  changes. Instead it just makes sure the policy instance list is
  accurate
* Add create/delete hooks for instances and groups to trigger policy
  re-evaluation
* Update policy algorithm for fairer distribution
* Fix an issue where CELERY_ROUTES wasn't renamed after celery/django
  upgrade
* Update unit tests to be more explicit
* Update count calculations used by algorithm to only consider
  non-manual instances
* Adding unit tests and fixture
* Don't propagate logging messages from awx.main.tasks and
  awx.main.scheduler
* Use advisory lock to prevent policy eval conflicts
* Allow updating instance groups from view
2018-02-01 16:56:16 -05:00
Matthew Jones
56abfa732e Adding initial instance group policies
and policy evaluation planner
2018-02-01 16:56:15 -05:00
Chris Meyers
c9ff3e99b8 celeryd attach to queues dynamically
* Based on the tower topology (Instance and InstanceGroup
relationships), have celery dyamically listen to queues on boot
* Add celery task capable of "refreshing" what queues each celeryd
worker listens to. This will be used to support changes in the topology.
* Cleaned up some celery task definitions.
* Converged wrongly targeted job launch/finish messages to 'tower'
queue, rather than a 1-off queue.
* Dynamically route celery tasks destined for the local node
* separate beat process

add support for separate beat process
2018-02-01 16:37:33 -05:00
AlanCoding
5327a4c622 Use global capacity algorithm in serializer
The task manager was doing work to compute currently consumed
capacity, this is moved into the manager and applied in the
same form to the instance group list.
2017-08-28 12:07:47 -04:00
AlanCoding
1112557c79 set capacity to 0 if instance has not checked in lately 2017-07-27 16:20:04 -04:00
Matthew Jones
c7a85d9738 Mass rename from ansible_(awx|tower) -> (awx|tower) 2017-07-26 13:33:26 -04:00
AlanCoding
dd1a261bc3 setup playbook and heartbeat for isolated deployments
* Allow isolated_group_ use in setup playbook
* Tweaks to host/queue registration commands complementing setup
* Create isolated heartbeat task and check capacity
* Add content about isolated instances to acceptance docs
2017-06-19 12:13:36 -04:00
Ryan Petrello
422950f45d Support for executing job and adhoc commands on isolated Tower nodes (#6524) 2017-06-14 11:47:30 -04:00
Aaron Tan
604243428c Add URL and type fields to instances/instance groups. 2017-05-30 17:00:27 -04:00
Matthew Jones
705f8af440 Update views and serializers to support instance group (ramparts)
* includes top level views for instances and instance groups and
  extending those views to be able to view running jobs
* Associative endpoints on Organizations, Inventories, and Job
  Templates
* Related and summary field entries where appropriate
* Adding job model references to executing instance group
* Fix up default queue properties for clustering from the settings file
* Update production and default settings for instance queues in settings
2017-05-10 12:33:03 -04:00
Matthew Jones
4ced911c00 Implementing models for instance groups, updating task manager
* New InstanceGroup model and associative relationship with Instances
* Associative instances between Organizations, Inventory, and Job
  Templates and InstanceGroups
* Migrations for adding fields and tables for Instance Groups
* Adding activity stream reference for instance groups
* Task Manager Refactoring:
** Simplifying task manager relationships and move away from the
   interstitial hash tables
** Simplify dependency determination logic
** Reduce task manager runtime complexity by removing the partial
   references and moving the logic into the task manager directly or
   relying on Job model logic for determinism
2017-05-10 12:32:54 -04:00
Matthew Jones
ea8b78ca49 Protect cluster nodes after an upgrade
* Modify instance model to container a version number for the node
* Update that version number during the heartbeat
* If during a heartbeat any of the nodes are of a newer version then
  shutdown the current node.

The idea behind this is that if all nodes were upgraded at the same
time then at the moment of the healthcheck they should all be at the
newer version. Otherwise we put the system in a state where it can
receive the upgrade but stay down until that happens. During setup
playbook run the services will be fully restarted.
2017-04-10 15:37:33 -04:00
Aaron Tan
9e4655419e Fix flake8 E302 errors. 2016-11-15 20:59:39 -05:00
Matthew Jones
343966f744 Implement gathering overall task capacity
For use when running/planning jobs
2016-11-07 13:45:01 -05:00
Chris Meyers
25b85c4a0b rename scheduler config singleton 2016-11-01 14:07:00 -05:00
Chris Meyers
13c89ab78c HAify job schedules and more task_manager renaming 2016-11-01 13:50:42 -05:00
Matthew Jones
3de4aae548 Fixing up HA induced flake8 issues 2016-09-15 13:51:17 -04:00
Matthew Jones
0c1e1fa2fb Refactor Tower HA Instance logic and models
* Gut the HA middleware
* Purge concept of primary and secondary.
* UUID is not the primary host identifier, now it's based mostly on the
  username.  Some work probably still left to do to make sure this is
  legit.  Also removed unique constraint from the uuid field.  This
  might become the cluster ident now... or it may just deprecate
* No more secondary -> primary redirection
* Initial revision of /api/v1/ping
* Revise and gut tower-manage register_instance
* Rename awx/main/socket.py to awx/main/socket_queue.py to prevent
  conflict with the "socket" module from python base
* Revist/gut the Instance manager... not sure if this manager is really
  needed anymore
2016-09-08 13:37:53 -04:00
John Mitchell
32d1c0e4db fixed copyright date 2015-06-11 16:10:23 -04:00
Matthew Jones
31d0342d41 More copyright headers for api side stuff 2015-05-29 12:10:40 -04:00
Matthew Jones
b3da3b34a3 Changing some legal headers for python source files 2015-05-29 12:10:39 -04:00
Luke Sneeringer
d6699353e5 Save hostnames, not IP addresses, for HA. 2014-12-02 10:34:25 -06:00
Luke Sneeringer
ec1c770099 Added job cancelation when switching to secondary. 2014-10-20 08:04:17 -05:00
Luke Sneeringer
e23801313e Do a OneToOne with UnifiedJob. 2014-10-20 08:04:17 -05:00
Luke Sneeringer
f6a501bb28 Register signal against subclasses separately. 2014-10-20 08:04:17 -05:00
Luke Sneeringer
a72e72f20f Adding a post_save receiver. 2014-10-20 08:04:17 -05:00
Luke Sneeringer
ec7aa1867f Adding JobOrigin model and migration. 2014-10-20 08:04:16 -05:00
Luke Sneeringer
60dae748dc Adding IP Address to the instance.
Needed for redirecting.
2014-10-20 08:04:16 -05:00