this commit implements the bulk of `awx-manage run_dispatcher`, a new
command that binds to RabbitMQ via kombu and balances messages across
a pool of workers similar in spirit to celeryd workers.
Specifically, this includes:
- a new decorator, `awx.main.dispatch.task`, which can be used to
decorate functions or classes so that they can be designated as
"Tasks" (see the sketch after this list)
- support for fanout/broadcast tasks (at this point in time, only
`conf.Setting` memcached flushes use this functionality)
- support for job reaping
- support for success/failure hooks for job runs (i.e.,
`handle_work_success` and `handle_work_error`)
- support for an auto-scaling worker pool that scales worker processes
up and down on demand
- minimal support for RPC, such as status checks and pool recycle/reload
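As a rough illustration of how the decorator might be used (the import
path matches the description above, but the fanout queue name and the
class-based `run()` entry point are assumptions, not the actual API):

```python
from awx.main.dispatch import task

@task()
def add(a, b):
    # a plain function designated as a "Task"
    return a + b

@task(queue='tower_broadcast_all')  # hypothetical fanout queue name
class FlushSettingsCache(object):
    # a class designated as a "Task"; run() is an assumed entry point
    def run(self, key):
        print('flushing memcached entry for %s' % key)
```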
If database connectivity is lost/interrupted in this block of celery
internals, beat is *not* smart enough to recover, and it gets stuck in
an endless fail loop. We don't _need_ to talk to the database here
anyway; just use settings.CLUSTER_HOST_ID to get what we need (see the
sketch below).
see: https://github.com/ansible/tower/issues/2957
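In other words, a minimal sketch of the idea (not the literal diff):

```python
from django.conf import settings

# no database round trip (e.g., Instance.objects.me()) is needed just
# to discover the local node's name
hostname = settings.CLUSTER_HOST_ID
```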
we recently made a change so that instances no longer bind to
instance-group-specific queues; now they each bind to
a direct queue for their specific hostname
(https://github.com/ansible/tower/pull/1922)
Because of this, we shouldn't *need* to reconfigure the queue binds at
runtime anymore when group membership changes. Under our new model,
every celeryd listens on a queue named after its hostname; when the
scheduler finds a task to run, it picks an Instance in the target
Instance Group and sends the task to the queue for that Instance's
hostname.
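A sketch of a submission under this model (the function and task names
are illustrative):

```python
def submit(task, job, instance_group):
    # pick an Instance in the target Instance Group and publish the
    # task to the direct queue named after that Instance's hostname
    instance = instance_group.instances.order_by('hostname').first()
    task.apply_async(args=[job.id], queue=instance.hostname)
```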
* We now target Instances in the task manager when transitioning jobs
from pending to waiting, whereas before we submitted jobs to Instance
Groups to be picked up by Instances in those Instance Groups.
Subscribing Instances to their Instance Groups is no longer needed; this
change removes the Instance Group queue subscription.
* Each time a route is needed (i.e., when a task is submitted to
celery), the router will be queried. This is ideal: with the previous
method we had to consider how a change in the routes would propagate to
all celery workers and nodes (see the sketch after this list).
* fully describe the default awx queue
* Our dynamic queue registration would correct awx_private_queue.
However, we don't want celery to even create an "invalid"/extra
queue-exchange-route. This change makes sure we don't create extraneous
things in rabbitmq.
* reduce the cluster queue registration output. Only output when the
queue registration list changes.
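A sketch of both ideas, the fully-described default queue and a router
that is consulted on every submission (the exchange type and the route
predicate here are assumptions):

```python
from kombu import Exchange, Queue

# fully describe the default queue so celery does not auto-create an
# extra "invalid" queue/exchange/route alongside it
CELERY_TASK_QUEUES = (
    Queue('awx_private_queue',
          Exchange('awx_private_queue', type='direct'),
          routing_key='awx_private_queue'),
)

def route_task(name, args, kwargs, options, task=None, **kw):
    # celery consults this router on every apply_async, so route
    # changes take effect everywhere with no worker restarts
    if name.startswith('awx.main.tasks'):
        return {'queue': 'awx_private_queue',
                'routing_key': 'awx_private_queue'}
    return None  # fall back to the default queue

CELERY_TASK_ROUTES = (route_task,)
```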
refactor the existing handlers into the related
"real" handler classes, which are swapped
out dynamically by an external logger "proxy" handler class
(sketched below); the real handler swap-out only happens on setting change
remove restart_local_services method
get rid of uWSGI fifo file
change TCP/UDP return type contract so that it mirrors
the request futures object
add details to socket error messages
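A minimal sketch of the proxy pattern, assuming class and method names:

```python
import logging

class ProxyHandler(logging.Handler):
    """Stays installed in the logging tree and delegates to a swappable
    "real" handler (e.g. HTTPS/TCP/UDP) built from current settings."""

    def __init__(self):
        super(ProxyHandler, self).__init__()
        self._handler = None

    def emit(self, record):
        if self._handler is not None:
            self._handler.emit(record)

    def swap(self, new_handler):
        # called only when the external logging settings change
        old, self._handler = self._handler, new_handler
        if old is not None:
            old.close()
```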
* Using Kombu's default Broadcast() constructor requires only one
parameter, which defines the exchange name; the queue name
is randomly generated per-node.
* This caused problems if/when celery enters an infinite restart loop,
because too many rabbit queues get created and rabbit OOMs
(gracefully).
* To remedy this, we tell Broadcast the queue name to use, derived
from a constant plus the node name, so that the per-node queue
name is stable (see the sketch below).
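Roughly (the exchange and queue names shown are what the description
implies, but treat them as illustrative):

```python
from django.conf import settings
from kombu.common import Broadcast

# the single-argument form names the exchange and lets kombu generate a
# random queue name per node; pinning queue= keeps the name stable
# across restarts, so queues don't pile up
broadcast_queue = Broadcast(
    'tower_broadcast_all',
    queue='tower_broadcast_all_%s' % settings.CLUSTER_HOST_ID,
)
```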
* Before, we had a special group, tower, that ran any async work that
tower needed done. This allowed users fine-grained control over which
nodes did background work. However, this granularity was too complicated
for users, so now all tower system work goes to a special celery queue
that is not exposed to users. Tower remains the fallback instance group
to execute jobs on. The tower group will be created upon install and
protected from deletion.
In verbose unified job models (inventory updates, system jobs,
etc.), do not delay dispatch just because the encoded
event data is not part of the data written to the buffer.
This allows output from these commands to be submitted
to the callback queue as it is produced, instead
of waiting until the buffer is closed (a sketch follows).
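A hypothetical sketch of that buffering behavior (the class is
illustrative, not AWX's actual event filter):

```python
class LineDispatcher(object):
    """File-like buffer that forwards complete lines to the callback
    queue as they arrive, instead of waiting for close()."""

    def __init__(self, dispatch):
        self._buffer = ''
        self._dispatch = dispatch  # e.g. publishes to the callback queue

    def write(self, data):
        self._buffer += data
        # flush each complete line immediately, even when no encoded
        # event data has been written to the buffer yet
        while '\n' in self._buffer:
            line, self._buffer = self._buffer.split('\n', 1)
            self._dispatch(line + '\n')

    def close(self):
        if self._buffer:
            self._dispatch(self._buffer)
            self._buffer = ''
```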
* This also adds fields to the instance view for tracking cpu and
memory usage as well as information on what the capacity ranges are
* Also adds a flag for enabling/disabling instances which removes them
from all queues and has them stop processing new work
* The capacity is now based almost exclusively on a value relative
to forks
* capacity_adjustment allows you to commit an instance to a certain
number of forks, cpu-focused or memory-focused (see the sketch below)
* Each job run adds a single fork overhead (that's the reasoning
behind the +1)
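A sketch of the fork-based blend described above; the exact formula is
an assumption extrapolated from this description:

```python
def get_capacity(cpu_capacity, mem_capacity, capacity_adjustment):
    # both inputs are fork counts; capacity_adjustment (0.0-1.0) slides
    # between the more conservative and the more generous of the two
    low, high = sorted([cpu_capacity, mem_capacity])
    return int(low + (high - low) * capacity_adjustment)

def consumed_capacity(forks):
    # each job run adds a single fork of overhead (the "+1")
    return forks + 1
```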
* Switch policy router queue to not be "tower" so that we don't
fall into a chicken/egg scenario
* Show fixed policy list in serializer so a user can determine if
an instance is manually managed
* Change IG membership mixin to not directly handle applying topology
changes. Instead it just makes sure the policy instance list is
accurate
* Add create/delete hooks for instances and groups to trigger policy
re-evaluation
* Update policy algorithm for fairer distribution
* Fix an issue where CELERY_ROUTES wasn't renamed after celery/django
upgrade
* Update unit tests to be more explicit
* Update count calculations used by algorithm to only consider
non-manual instances
* Adding unit tests and fixture
* Don't propagate logging messages from awx.main.tasks and
awx.main.scheduler
* Use an advisory lock to prevent policy evaluation conflicts (sketched
after this list)
* Allow updating instance groups from view
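The advisory-lock guard, sketched with django-pglocks (the lock name
and wrapper are illustrative):

```python
from django_pglocks import advisory_lock

def rebalance(evaluate_policies):
    # serialize policy evaluation across the cluster with a Postgres
    # advisory lock, so concurrent create/delete hooks don't conflict
    with advisory_lock('instance_group_policy_eval', wait=True):
        evaluate_policies()
```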
* Based on the tower topology (Instance and InstanceGroup
relationships), have celery dynamically listen to queues on boot
* Add a celery task capable of "refreshing" which queues each celeryd
worker listens to. This will be used to support changes in the topology
(see the sketch after this list).
* Cleaned up some celery task definitions.
* Converged wrongly targeted job launch/finish messages onto the
'tower' queue, rather than a one-off queue.
* Dynamically route celery tasks destined for the local node
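A sketch of the queue-refresh mechanism using celery's remote-control
API (the app import path and helper name are assumptions):

```python
from awx import celery_app  # assumed import path for the celery app

def refresh_worker_queues(hostname, group_names, current_queues):
    # compute the queues this node's worker should consume, then use
    # celery's control API to adjust the running worker in place
    desired = set([hostname] + list(group_names))
    current = set(current_queues)
    for queue in desired - current:
        celery_app.control.add_consumer(queue, reply=True)
    for queue in current - desired:
        celery_app.control.cancel_consumer(queue, reply=True)
    return sorted(desired)
```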
* separate beat process:
add support for separate beat process
update our event data search algorithm to be a bit lazier in event data
discovery; this drastically improves processing speeds for stdout >5MB
see: https://github.com/ansible/awx/issues/417
The original test set no longer works after Django 1.11 due to
stricter rules against dynamic model definition. The refactored test set
targets each existing model that applies named URL rules, instead of
abstract general use cases, thus significantly improving maintainability
and readability.
Signed-off-by: Aaron Tan <jangsutsr@gmail.com>
this approach totally removes the process of reading and writing stdout
files on the local file system at settings.JOBOUTPUT_ROOT when jobs are
run; now stdout content is only written on-demand as it's fetched for
the deprecated `stdout` endpoint
see: https://github.com/ansible/awx/issues/200
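Conceptually, something like this (a sketch; the helper and field
access are assumptions about the on-demand path):

```python
def result_stdout(job):
    # rebuild stdout on demand from stored job events, rather than
    # reading a file under settings.JOBOUTPUT_ROOT
    return ''.join(
        event.stdout for event in job.job_events.order_by('start_line')
    )
```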
* release_3.2.1:
fallback to empty dict when processing extra_data
fix migration problem from 3.1.1
move 0005a migration to 0005b
feedback on ad hoc prohibited vars error msg
Fix the way we include i18n files in sdist
Fix migrations to support 3.1.2 -> 3.2.1+ upgrade path
fix missing parameter to update_capacity method
fix WARNING log when launching ad hoc command
Validate against ansible variables on ad hoc launch
do not allow ansible connection type of local for ad_hoc
work around an ansible 2.4 inventory caching bug
fix scan job migration unicode issue
Assert isolated nodes have capacity set to 0 and restored based on version
Set capacity to zero if the isolated node has an old version