With the change to use pk-based interval slicing for the job events
table, we need analytics.gather to be the code that manages all of the
"expensive" collector slicing. While we are at it, let's ship each
chunked tarball file as we produce it.
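
A minimal sketch of the idea, assuming inclusive primary-key ranges and two hypothetical callables (`collect` builds one chunked tarball for a pk range, `ship` uploads it). None of these names come from the actual analytics.gather code; they only illustrate slicing by pk and shipping each chunk as soon as it exists:

```python
def pk_slices(min_pk, max_pk, chunk_size):
    """Yield inclusive (start, end) primary-key ranges covering min_pk..max_pk."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    start = min_pk
    while start <= max_pk:
        end = min(start + chunk_size - 1, max_pk)
        yield start, end
        start = end + 1


def gather(min_pk, max_pk, chunk_size, collect, ship):
    """Collect each pk slice and ship its tarball immediately.

    `collect` and `ship` are hypothetical stand-ins for the real collector
    and uploader; shipping per chunk avoids holding every tarball until the end.
    """
    shipped = 0
    for start, end in pk_slices(min_pk, max_pk, chunk_size):
        tarball = collect(start, end)
        if tarball is not None:  # a slice with no rows produces nothing to ship
            ship(tarball)
            shipped += 1
    return shipped
```

Keeping the slicing in one place means individual collectors only ever see a bounded pk range.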
- Adds a Metrics() class that can track data such as the number of
events the callback receiver inserted into the database
- Exposes this metric data at the api/v2/metrics/ endpoint.
This data is Prometheus-friendly.
- Metric data is stored in memory, then periodically saved to Redis.
- Metric data is periodically broadcast to other nodes in the cluster,
so that each node has a copy of the most recent metric data collected.
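
As a rough illustration of the points above, here is a sketch of an in-memory counter store with a periodic save hook and Prometheus text output. The class body, the `save` callable (standing in for a Redis writer), and the metric name used in the usage note are all assumptions, not the real Metrics() implementation:

```python
import threading


class Metrics:
    """Hypothetical in-memory counter store with a periodic save hook."""

    def __init__(self, save):
        self._save = save  # assumed callable, e.g. something that writes to Redis
        self._lock = threading.Lock()
        self._values = {}

    def inc(self, name, amount=1):
        """Increase a counter, creating it at zero if needed."""
        with self._lock:
            self._values[name] = self._values.get(name, 0) + amount

    def snapshot(self):
        """Return a copy of the current values."""
        with self._lock:
            return dict(self._values)

    def flush(self):
        """Persist the current values; meant to be called on a timer."""
        self._save(self.snapshot())

    def render(self):
        """Render the counters in the Prometheus text exposition format."""
        lines = []
        for name, value in sorted(self.snapshot().items()):
            lines.append(f"# TYPE {name} counter")
            lines.append(f"{name} {value}")
        return "".join(f"{line}\n" for line in lines)
```

For example, calling `inc("callback_receiver_events_insert_db", 5)` (a made-up metric name) and then `render()` yields a `# TYPE` line followed by the sample line for that counter.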
- In K8S-based installs, only container groups are intended to be used
for playbook execution (JTs, adhoc, inventory updates), so in this
scenario, other job types have a task impact of zero.
- In K8S-based installs, traditional instances have *zero* capacity
(because they're only members of the control plane, where services
such as http/s and local control plane execution run)
- This commit also includes some changes that allow for the task manager
to launch tasks with task_impact=0 on instances that have capacity=0
(previously, an instance with zero capacity would never be selected
as the "execution node")
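
The selection rule described in the last bullet can be sketched as follows. The `Instance` dataclass and `pick_execution_node` function are hypothetical simplifications, not the task manager's real code; the point is only that a zero-impact task fits on a zero-capacity instance:

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for an instance record."""
    hostname: str
    capacity: int
    consumed: int = 0

    @property
    def remaining(self):
        return self.capacity - self.consumed


def pick_execution_node(instances, task_impact):
    """Return the first instance that can absorb task_impact, or None."""
    for instance in instances:
        # A zero-impact task fits anywhere, including capacity=0 instances.
        if task_impact == 0 or instance.remaining >= task_impact:
            return instance
    return None
```

Under this rule a control-plane instance with capacity=0 can still be chosen for a task whose impact is zero, while any task with a positive impact finds no node there and stays pending.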
This means that when IS_K8S=True, any Job Template associated with an
Instance Group will never actually go from pending -> running (because
there's no capacity - all playbooks must run through Container Groups).
For an improved UX, our intention is to introduce logic into the
operator install process such that the *default* group that's created at
install time is a *Container Group* that's configured to point at the
K8S cluster where AWX itself is deployed.
- a new unique name field to EE
- a new configure-Tower-in-Tower setting DEFAULT_EXECUTION_ENVIRONMENT
- an Org-level execution_environment_admin_role
- a default_environment field on Project
- a new Container Registry credential type
- order EEs by reverse of the created timestamp
- a method to resolve which EE to use on jobs
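
One way such a resolution method could work is sketched below. The precedence order (job-level EE, then the project's default_environment, then the global DEFAULT_EXECUTION_ENVIRONMENT, then the most recently created EE) is an assumption made for illustration; the list above only says that a resolution method exists. EEs are modeled as plain dicts with a hypothetical `created` key:

```python
def resolve_execution_environment(job_ee=None, project_default=None,
                                  global_default=None, all_ees=()):
    """Pick an EE using an assumed precedence; returns None if nothing exists."""
    # Assumed order: most specific setting wins.
    for candidate in (job_ee, project_default, global_default):
        if candidate is not None:
            return candidate
    # Fall back to the newest EE, matching "reverse of the created timestamp".
    ordered = sorted(all_ees, key=lambda ee: ee["created"], reverse=True)
    return ordered[0] if ordered else None
```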
This middleware already existed, and we were trying to log this
data, but it was not working.
The hope is that these logs can be shipped via external logging
so that we can use Kibana to track the response time of different endpoints.
Various points in the job lifecycle (e.g. created, running, processing
events) are structured as JSON and written to /var/log/tower/job_lifecycle.log.
As part of this work, the DependencyGraph is reworked to return
which job object is doing the blocking, rather than a boolean.
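
A small sketch of why returning the blocking job is more useful than a boolean: the lifecycle record can then name what is doing the blocking. The class below is a deliberately reduced, hypothetical graph (one running job per project), and the JSON field names are assumptions rather than the real log schema:

```python
import json


class DependencyGraph:
    """Hypothetical reduction: tracks at most one running job per project."""

    def __init__(self):
        self._running_by_project = {}

    def mark_running(self, job_name, project):
        self._running_by_project[project] = job_name

    def blocked_by(self, project):
        """Return the blocking job's name, or None when nothing blocks."""
        return self._running_by_project.get(project)


def lifecycle_record(job_name, state, blocked_by=None):
    """Build one JSON log line; field names here are illustrative only."""
    record = {"job": job_name, "state": state}
    if blocked_by is not None:
        record["blocked_by"] = blocked_by
    return json.dumps(record, sort_keys=True)
```

With a boolean, the log line could only say that a job was blocked; with the object, it can say which job is responsible.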
* The cron-run logrotate will now rotate our log files instead of Python
* If no error log file is specified in the config, do not include
it as a parameter to the rsyslog omhttp module. This is useful for
containers.
* The namespace for isolated logging was not enabled. Add a handler and
logger so that it's enabled. This is particularly useful when the
logging level is switched to DEBUG.
Instead, just have each worker connect directly to Redis.
This has a few benefits:
- it's simpler to explain and debug
- back pressure on the queue keeps messages around in redis (which is
observable, and survives the restart of Python processes)
- it's likely notably more performant at high loads
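
A sketch of a worker loop under the assumption that the queue is a Redis list read with a blocking pop. The `client` argument is expected to behave like a redis-py client (its `blpop(keys, timeout)` returns a `(key, value)` tuple, or `None` on timeout); the function name and parameters are hypothetical:

```python
def drain(client, queue, handle, timeout=1, max_messages=None):
    """Pop messages straight from the Redis list and hand them to `handle`.

    Messages that no worker has popped yet simply stay in Redis, which is
    where the back pressure and restart survival come from.
    """
    handled = 0
    while max_messages is None or handled < max_messages:
        item = client.blpop(queue, timeout=timeout)
        if item is None:  # timed out with nothing queued
            break
        _key, payload = item
        handle(payload)
        handled += 1
    return handled
```

Each worker process would run a loop like this with its own connection, so there is no intermediate in-process queue to explain or debug.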
The analytics change PR adjusted the logging for awx.analytics,
which solved the issue, but should have used the targeted awx.main.analytics.
Also flip a couple of loggers to use the regular awx.analytics (awx analytics)
logger instead of awx.main.analytics (the automation analytics task system).
Errors/warnings when gathering analytics are split about 50/50 between
the gathering code in analytics and the task code that calls it, so
they should be in the same place for debugging sanity.