* Deciding the Instance that a Job runs on at celery task run-time makes
it hard to evenly distribute tasks among Instnaces. Instead, the task
manager will look at the world of running jobs and choose an instance
node to run on; applying a deterministic job distribution algo.
Currently updating policy settings doesn't trigger a re-evaluation of
instance group policies, this makes sure we re-evaluate in the event
that anything changes.
* Was considering an isolated instance: any instance that has at least 1
group with no controller. This is technically correct since an iso node
can not be a part of a non-iso group.
* The query is now more robust and considers a node an iso node if ALL
groups that a node belong to ALL have a controller.
* Also added better debugging for the special tower instance group
* Added a check for the existance of the special tower group so that
logs are less "messy" during the install process.
* This also adds fields to the instance view for tracking cpu and
memory usage as well as information on what the capacity ranges are
* Also adds a flag for enabling/disabling instances which removes them
from all queues and has them stop processing new work
* The capacity is now based almost exclusively on some value relative
to forks
* capacity_adjustment allows you to commit an instance to a certain
amount of forks, cpu focused or memory focused
* Each job run adds a single fork overhead (that's the reasoning
behind the +1)
* Switch policy router queue to not be "tower" so that we don't
fall into a chicken/egg scenario
* Show fixed policy list in serializer so a user can determine if
an instance is manually managed
* Change IG membership mixin to not directly handle applying topology
changes. Instead it just makes sure the policy instance list is
accurate
* Add create/delete hooks for instances and groups to trigger policy
re-evaluation
* Update policy algorithm for fairer distribution
* Fix an issue where CELERY_ROUTES wasn't renamed after celery/django
upgrade
* Update unit tests to be more explicit
* Update count calculations used by algorithm to only consider
non-manual instances
* Adding unit tests and fixture
* Don't propagate logging messages from awx.main.tasks and
awx.main.scheduler
* Use advisory lock to prevent policy eval conflicts
* Allow updating instance groups from view
* Based on the tower topology (Instance and InstanceGroup
relationships), have celery dyamically listen to queues on boot
* Add celery task capable of "refreshing" what queues each celeryd
worker listens to. This will be used to support changes in the topology.
* Cleaned up some celery task definitions.
* Converged wrongly targeted job launch/finish messages to 'tower'
queue, rather than a 1-off queue.
* Dynamically route celery tasks destined for the local node
* separate beat process
add support for separate beat process
The task manager was doing work to compute currently consumed
capacity, this is moved into the manager and applied in the
same form to the instance group list.
* includes top level views for instances and instance groups and
extending those views to be able to view running jobs
* Associative endpoints on Organizations, Inventories, and Job
Templates
* Related and summary field entries where appropriate
* Adding job model references to executing instance group
* Fix up default queue properties for clustering from the settings file
* Update production and default settings for instance queues in settings
* New InstanceGroup model and associative relationship with Instances
* Associative instances between Organizations, Inventory, and Job
Templates and InstanceGroups
* Migrations for adding fields and tables for Instance Groups
* Adding activity stream reference for instance groups
* Task Manager Refactoring:
** Simplifying task manager relationships and move away from the
interstitial hash tables
** Simplify dependency determination logic
** Reduce task manager runtime complexity by removing the partial
references and moving the logic into the task manager directly or
relying on Job model logic for determinism
* Modify instance model to container a version number for the node
* Update that version number during the heartbeat
* If during a heartbeat any of the nodes are of a newer version then
shutdown the current node.
The idea behind this is that if all nodes were upgraded at the same
time then at the moment of the healthcheck they should all be at the
newer version. Otherwise we put the system in a state where it can
receive the upgrade but stay down until that happens. During setup
playbook run the services will be fully restarted.
* Gut the HA middleware
* Purge concept of primary and secondary.
* UUID is not the primary host identifier, now it's based mostly on the
username. Some work probably still left to do to make sure this is
legit. Also removed unique constraint from the uuid field. This
might become the cluster ident now... or it may just deprecate
* No more secondary -> primary redirection
* Initial revision of /api/v1/ping
* Revise and gut tower-manage register_instance
* Rename awx/main/socket.py to awx/main/socket_queue.py to prevent
conflict with the "socket" module from python base
* Revist/gut the Instance manager... not sure if this manager is really
needed anymore