Situations have come up where the 5+ minute kill signal for
run_task_manager is emitted to the worker process running it, but
since the worker improperly inherited the AWXConsumerBase().stop()
handler a deadlock ultimately was triggered on the database
connection.
this resolves an issue that causes an endless hang on with Cyberark AIM
lookups when a certificate *and* key are specified
the underlying issue here is that we can't rely on the underyling Python
ssl implementation to *only* read from the fifo that stores the pem data
*only once*; in reality, we need to just use *actual* tempfiles for
stability purposes
see: https://github.com/ansible/awx/issues/6986
see: https://github.com/urllib3/urllib3/issues/1880
this change fixes a bug introduced in the optimization at https://github.com/ansible/awx/pull/7352
1. Create inventory with multiple hosts
2. Run a playbook with a limit to match only one host
3. Run job, verify that it only acts on the one host
4. Go to inventory host list and see that all the hosts have last_job updated to point to the job that only acted on one host.
There is some history here.
https://github.com/ansible/awx/pull/7190 <- This PR was an attempt at fixing a
bug notting ran into where some jobs on k8s installs would get stuck in Waiting
forever.
The PR mentioned above introduced a bug where there are no instance groups on a
fresh k8s-based install. This is because this process currently happens in the
launch scripts, before the database is up.
With this patch, queue / instance group registration happens in the heartbeat,
right after auto-registering the instance.
When targeting, ../workflow_job_templates/id#/workflow_nodes/ endpoint,
user could not set all_parents_must_converge to true.
3.7.1 backport for awx issue #7063
* broadcast websockets have stats tracked (i.e. connection status,
number of messages total, messages per minute, etc). Previous to this
change, stats were tracked by ip address, if it was defined on the
instance, XOR hostname. This changeset tracks stats by hostname.
* The websocket backplane interconnect is done via ip address for
Kubernetes and OpenShift. On init run_wsbroadcast reads all Instances
from the DB and makes a decision to use the ip address or the hostname
based, with preference given to the ip address if defined. For
Kubernetes and OpenShift the nodes can load the Instance before the
ip_address is set. This would cause the connection to be tried by
hostname rather than ip address. This changeset ensures that an ip
address set after an Instance record is created will be detected and
used.
fixes schema differences from script
add back in default groups from script
change hostnames to reflect script
add in some hostvars
Generally allow giving plugin options from source variables
allows testing with insecure connection with ovirt_insecure
this is a behavior change from the script
* There are 2 data-structures that django channels redis uses: (1) zset
and (2) list. (1) is used for group membership where the key is the
logic user group and the value(s) are websocket clients. The score of
the zset entry is used for group expiration. We can not rely on group
expiration for clean-up because there is no interface privided by redis
channels to refresh the expiration. Choosing a small value for
group_expiry could result on our websocket backplane group expiring,
which would result in job events not being delivered. Instead, we
increase the group expiration to 5 years and clean up on daphne service
start.
* The list (2) data-structure is used by django channels redis to queue
websocket events per-websocket-client as needed. The need arises to
queue per-websocket-client events when the consumer can not keep up with
the producer. The consumer here is daphne, the producer is AWX.
* When AWX is operating healthy group membership in Redis is reflective
of the real-world. When AWX is unhealthy i.e. daphne cycles, the zset
will contain stale websocket client entries. This can be observed by
running `zrange asgi::group:jobs-status_changed 0 -1`. If the entries
returned look like:
specific.fUkXXpYj!DKOIfwPICNgw
specific.fUkXXpYj!FQcdopZeiRdG
specific.lpTSAgnk!IOKldfzcfdDp
specific.lpTSAgnk!NbvRUZsDpIQx
The entries with `fUkXXpYj` are stale. Note that this changeset fixes
this by removing all `asgi:*` entries on daphne start.
* Also note that individual message themselves have an expiration that
is configurable and defaults to 60.
* Also note that zset's tracking group membership will be deleted by
django channels redis when they are empty.