this commit implements the bulk of `awx-manage run_dispatcher`, a new
command that binds to RabbitMQ via kombu and balances messages across
a pool of workers that are similar in spirit to celeryd workers.
Specifically, this includes:
- a new decorator, `awx.main.dispatch.task`, which can be used to
  decorate functions or classes so that they can be designated as
  "Tasks" (see the sketch after this list)
- support for fanout/broadcast tasks (at this point in time, only
`conf.Setting` memcached flushes use this functionality)
- support for job reaping
- support for success/failure hooks for job runs (i.e.,
`handle_work_success` and `handle_work_error`)
- support for an auto-scaling worker pool that scales processes up and
  down on demand
- minimal support for RPC, such as status checks and pool recycle/reload
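For illustration, here is a minimal sketch of how such a registration
decorator can work; the registry and option names are hypothetical,
not the actual `awx.main.dispatch` internals:

```python
# Hypothetical sketch, not AWX's actual code: register functions or
# classes under their dotted path so a dispatcher can look them up.
TASK_REGISTRY = {}

def task(fn=None, **options):
    """Mark a function or class as a dispatchable "Task"."""
    def _register(obj):
        TASK_REGISTRY['{}.{}'.format(obj.__module__, obj.__name__)] = (obj, options)
        return obj
    if fn is None:
        return _register      # used with arguments: @task(queue='tower')
    return _register(fn)      # used bare: @task

@task
def add(a, b):
    return a + b

@task(queue='tower')
class CleanupJob:
    def run(self):
        print('cleaning up old jobs')
```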
we recently made a change so that instances no longer bind to
instance-group-specific queues; instead, each instance now binds to
a direct queue named after its own hostname
(https://github.com/ansible/tower/pull/1922)
Because of this, we shouldn't *need* to reconfigure queue bindings at
runtime anymore when group membership changes. Under our new model,
every celeryd listens on a queue named after its hostname; when the
scheduler finds a task to run, it picks an Instance in the target
Instance Group and sends the task to the queue for that Instance's
hostname.
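As a hedged sketch of that routing model (the instance selection and
the publish callable here are stand-ins, not AWX's real API):

```python
import random

class Instance:
    def __init__(self, hostname):
        self.hostname = hostname

def route_task(task_name, args, group_instances, publish):
    # Pick an instance from the target Instance Group; the queue name
    # *is* the hostname, so no group-specific bindings are required.
    instance = random.choice(group_instances)
    publish(queue=instance.hostname, body={'task': task_name, 'args': args})

route_task('awx.main.tasks.RunJob', [42],
           [Instance('awx-1'), Instance('awx-2')],
           publish=lambda queue, body: print(queue, body))
```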
* celery workers have internal queue names based on the system
hostname. This may differ from the name Tower knows the host by
(Instance.hostname).
This adds a mapping so we can convert internal celery names to Instance
names for purposes of reaping jobs.
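A sketch of the kind of mapping this adds (function name and matching
rule are illustrative):

```python
def map_celery_names_to_instances(worker_names, instance_hostnames):
    """worker_names: celery identities like 'celery@node1.example.org';
    instance_hostnames: the Instance.hostname values Tower knows about."""
    mapping = {}
    for worker in worker_names:
        system_hostname = worker.split('@', 1)[-1]
        for hostname in instance_hostnames:
            if system_hostname == hostname:
                mapping[worker] = hostname
    return mapping

# e.g. {'celery@node1.example.org': 'node1.example.org'}
print(map_celery_names_to_instances(
    ['celery@node1.example.org'], ['node1.example.org']))
```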
the main goal of this change is to make `make docker-isolated` work out
of the box
- specify the proper version for awx-expect --version
- update some deprecated playbook bits
- change the isolated container to privileged so bwrap will work
- fix awx-manage test_isolated_connection
- expedite the first isolated heartbeat so you don't have to wait 10m;
  this is accomplished by _not_ setting Instance.last_isolated_check to
  now() at insertion time (which would cause the next check not to
  happen for 10 minutes); see the sketch after this list
- fix a bug that caused isolated node execution to fail when bwrap was
enabled
see: https://github.com/ansible/tower/issues/2150
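A minimal sketch of the heartbeat-expediting point above, assuming a
10-minute check interval (names are illustrative):

```python
from datetime import datetime, timedelta, timezone

ISOLATED_CHECK_INTERVAL = timedelta(minutes=10)

def isolated_check_due(last_isolated_check, now=None):
    now = now or datetime.now(timezone.utc)
    if last_isolated_check is None:
        return True   # never stamped at insert time: check fires immediately
    return now - last_isolated_check >= ISOLATED_CHECK_INTERVAL

print(isolated_check_due(None))  # True: no 10 minute wait after registration
```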
This reverts commit 9863fe71dc.
* Deciding the Instance that a Job runs on at celery task run-time makes
it hard to evenly distribute tasks among Instances. Instead, the task
manager will look at the world of running jobs and choose an instance
node to run on, applying a deterministic job distribution algorithm
(see the sketch below).
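One way such a deterministic pick can look, as a sketch with simplified
capacity accounting:

```python
def choose_instance(capacity, consumed):
    """capacity: {hostname: total capacity}; consumed: {hostname: capacity
    used by running jobs}. Ties break by hostname so repeated scheduler
    passes make the same choice."""
    def remaining(host):
        return capacity[host] - consumed.get(host, 0)
    return max(sorted(capacity), key=remaining)

print(choose_instance({'awx-1': 100, 'awx-2': 100}, {'awx-1': 40}))  # awx-2
```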
Currently, updating policy settings doesn't trigger a re-evaluation of
instance group policies; this makes sure we re-evaluate in the event
that anything changes (see the sketch below).
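A sketch of one way to wire that up with Django model signals (the task
body and the exact senders are assumptions, not the actual
implementation):

```python
from celery import shared_task
from django.db.models.signals import post_save, post_delete

@shared_task
def apply_cluster_membership_policies():
    ...  # re-evaluate policy_instance_percentage / minimum / list here

def schedule_policy_evaluation(sender, **kwargs):
    apply_cluster_membership_policies.apply_async()

# Any create/update/delete on instances or groups now triggers a
# re-evaluation instead of being silently ignored.
for signal in (post_save, post_delete):
    signal.connect(schedule_policy_evaluation, sender='main.Instance')
    signal.connect(schedule_policy_evaluation, sender='main.InstanceGroup')
```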
* Was considering an isolated instance to be any instance that has at
least 1 group with a controller. This is technically correct, since an
iso node cannot be part of a non-iso group.
* The query is now more robust and considers a node an iso node only if
ALL groups that the node belongs to have a controller (see the sketch
after this list).
* Also added better debugging for the special tower instance group
* Added a check for the existence of the special tower group so that
logs are less "messy" during the install process.
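A sketch of the stricter query in Django ORM terms (the related name is
an assumption):

```python
def isolated_instances(Instance):
    # An instance is isolated only if *every* group it belongs to has a
    # controller, i.e. it is in no controller-less (non-iso) group.
    return Instance.objects.exclude(rampart_groups__controller__isnull=True)
```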
* This also adds fields to the instance view for tracking cpu and
memory usage as well as information on what the capacity ranges are
* Also adds a flag for enabling/disabling instances which removes them
from all queues and has them stop processing new work
* The capacity is now based almost exclusively on a value relative
to forks
* capacity_adjustment allows you to commit an instance to a certain
number of forks, either CPU-focused or memory-focused
* Each job run adds a single fork of overhead (that's the reasoning
behind the +1; see the sketch below)
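As a sketch of the capacity math described above, treating both CPU and
memory capacity as fork counts (the interpolation detail is an
assumption):

```python
def instance_capacity(cpu_capacity, mem_capacity, capacity_adjustment):
    """capacity_adjustment in [0.0, 1.0] slides between the more
    conservative and the more generous of the two fork-based values."""
    lower, higher = sorted((cpu_capacity, mem_capacity))
    return int(lower + (higher - lower) * capacity_adjustment)

def task_impact(forks):
    return forks + 1   # each job run carries one fork of overhead

print(instance_capacity(cpu_capacity=16, mem_capacity=42,
                        capacity_adjustment=0.5))  # 29
print(task_impact(5))  # 6
```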
* Switch policy router queue to not be "tower" so that we don't
fall into a chicken/egg scenario
* Show fixed policy list in serializer so a user can determine if
an instance is manually managed
* Change IG membership mixin to not directly handle applying topology
changes. Instead it just makes sure the policy instance list is
accurate
* Add create/delete hooks for instances and groups to trigger policy
re-evaluation
* Update policy algorithm for fairer distribution
* Fix an issue where CELERY_ROUTES wasn't renamed after celery/django
upgrade
* Update unit tests to be more explicit
* Update count calculations used by algorithm to only consider
non-manual instances
* Adding unit tests and fixture
* Don't propagate logging messages from awx.main.tasks and
awx.main.scheduler
* Use an advisory lock to prevent policy eval conflicts (see the sketch
after this list)
* Allow updating instance groups from view
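The advisory-lock bullet, sketched with the django-pglocks package
(the lock name is illustrative):

```python
from django_pglocks import advisory_lock

def apply_cluster_membership_policies():
    # Concurrent policy evaluations queue up on the same Postgres
    # advisory lock instead of racing each other.
    with advisory_lock('cluster_policy_lock', wait=True):
        ...  # recompute each group's policy_instance_list here
```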
* Based on the tower topology (Instance and InstanceGroup
relationships), have celery dynamically listen to queues on boot
* Add a celery task capable of "refreshing" which queues each celeryd
worker listens to. This will be used to support changes in the topology
(see the sketch after this list).
* Cleaned up some celery task definitions.
* Converged wrongly targeted job launch/finish messages to the 'tower'
queue, rather than a one-off queue.
* Dynamically route celery tasks destined for the local node
* Add support for a separate beat process
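A sketch of the queue-refresh idea using celery's broadcast control API
(worker naming and queue derivation are simplified):

```python
from celery import Celery

app = Celery('awx')

def refresh_worker_queues(worker_name, desired_queues, current_queues):
    # Reconcile a running worker's consumers with the desired topology.
    for queue in desired_queues - current_queues:
        app.control.add_consumer(queue, destination=[worker_name], reply=True)
    for queue in current_queues - desired_queues:
        app.control.cancel_consumer(queue, destination=[worker_name], reply=True)
```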
The task manager was doing work to compute currently consumed
capacity; this is moved into the manager and applied in the same
form to the instance group list (see the sketch below).
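A sketch of the shared computation (attribute names assumed):

```python
from collections import defaultdict

def consumed_capacity_by_group(running_jobs):
    """running_jobs: objects with .instance_group and .task_impact; the
    same tally now serves both the task manager and the instance group
    list view."""
    consumed = defaultdict(int)
    for job in running_jobs:
        consumed[job.instance_group] += job.task_impact
    return dict(consumed)
```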
* includes top level views for instances and instance groups and
extending those views to be able to view running jobs
* Associative endpoints on Organizations, Inventories, and Job
Templates
* Related and summary field entries where appropriate
* Adding job model references to executing instance group
* Fix up default queue properties for clustering from the settings file
* Update production and default settings for instance queues in settings
* New InstanceGroup model and associative relationship with Instances
(see the model sketch after this list)
* Associations between Organizations, Inventories, and Job Templates
and InstanceGroups
* Migrations for adding fields and tables for Instance Groups
* Adding activity stream reference for instance groups
* Task Manager Refactoring:
** Simplify task manager relationships and move away from the
interstitial hash tables
** Simplify dependency-determination logic
** Reduce task manager runtime complexity by removing the partial
references and moving the logic into the task manager directly, or
relying on Job model logic for determinism
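For orientation, a rough sketch of the new model's shape (field options
and the related name are assumptions):

```python
from django.db import models

class InstanceGroup(models.Model):
    name = models.CharField(max_length=250, unique=True)
    instances = models.ManyToManyField(
        'Instance', related_name='rampart_groups',
        help_text='Instances that are members of this InstanceGroup',
    )
```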
* Modify the instance model to contain a version number for the node
* Update that version number during the heartbeat
* If, during a heartbeat, any of the nodes are of a newer version, then
shut down the current node (see the sketch below).
The idea behind this is that if all nodes were upgraded at the same
time, then at the moment of the health check they should all be at the
newer version. Otherwise we put the system in a state where it can
receive the upgrade but stays down until that happens. During the setup
playbook run, the services will be fully restarted.
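A sketch of the version guard (version parsing and the shutdown hook
are simplified):

```python
from packaging.version import Version

def check_cluster_version(my_version, peer_versions, shutdown):
    # If any node heartbeats a newer version, this node shuts down and
    # stays down until the setup playbook restarts it post-upgrade.
    if any(Version(v) > Version(my_version) for v in peer_versions):
        shutdown()

check_cluster_version('3.2.0', ['3.2.0', '3.3.0'],
                      shutdown=lambda: print('newer peer found; shutting down'))
```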
* Gut the HA middleware
* Purge concept of primary and secondary.
* UUID is no longer the primary host identifier; now it's based mostly
on the hostname. Some work is probably still left to do to make sure
this is legit. Also removed the unique constraint from the uuid field;
this might become the cluster ident now... or it may just be deprecated
* Initial revision of /api/v1/ping (see the sketch after this list)
* Revise and gut tower-manage register_instance
* Rename awx/main/socket.py to awx/main/socket_queue.py to prevent
conflict with the "socket" module from python base
* Revisit/gut the Instance manager... not sure if this manager is really
needed anymore
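And a minimal sketch of what the ping view can look like in DRF (the
payload fields are assumptions):

```python
from rest_framework.permissions import AllowAny
from rest_framework.response import Response
from rest_framework.views import APIView

class Ping(APIView):
    """Unauthenticated health probe for /api/v1/ping."""
    permission_classes = (AllowAny,)
    authentication_classes = ()

    def get(self, request):
        return Response({
            'version': '3.2.0',
            'instances': [{'node': 'awx-1', 'heartbeat': None}],
        })
```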