* perf: stop eagerly updating Host.last_job_host_summary on every job completion
The playbook_on_stats wrapup path bulk-updates last_job_host_summary_id
on every host touched by a job. In the Q4CY25 scale lab this query had
a median execution time of 75 seconds due to index churn on main_host.
Replace all reads of the denormalized FK with a new classmethod
JobHostSummary.latest_for_host(host_id) that queries for the most
recent summary on demand. This eliminates the write-side bulk_update
of last_job_host_summary_id entirely.
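A minimal sketch of such a classmethod (the '-created' ordering and manager
access are assumptions, not lifted from the change):

    @classmethod
    def latest_for_host(cls, host_id):
        # derive the most recent summary at read time instead of
        # maintaining a denormalized FK on Host
        return cls.objects.filter(host_id=host_id).order_by('-created').first()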
Changes:
- Add JobHostSummary.latest_for_host() classmethod
- Serializer: use latest_for_host() instead of obj.last_job_host_summary
- Dashboard view: use subquery instead of FK traversal for failed hosts
- Inventory.update_computed_fields: use subquery for failed host count
- events.py: remove last_job_host_summary_id from bulk_update
- signals.py: simplify _update_host_last_jhs to only update last_job
- access.py/managers.py: remove select_related/defer through the FK
The FK field on Host is left in place for now (removal requires a
migration) but is no longer written to.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Fix .pk AttributeError, add job_template annotations, annotate host sublists
- Add 'pk' to AnnotatedSummary dynamic type (fixes AttributeError in get_related)
- Add job_template_id and job_template_name to subquery annotations so list
views include these fields in summary_fields.last_job (matching detail views)
- Traverse the job__ FK from JobHostSummary instead of using a separate UnifiedJob
  subquery with OuterRef on another annotation (cleaner SQL, avoids an alias
  issue); see the sketch after this list
- Annotate all host sublist views (InventoryHostsList, GroupHostsList,
GroupAllHostsList, InventorySourceHostsList) to prevent N+1 queries
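Roughly, the annotation pattern described above is the following sketch
(annotation names and the selected fields are illustrative):

    from django.db.models import OuterRef, Subquery

    latest = (
        JobHostSummary.objects
        .filter(host_id=OuterRef('pk'))
        .order_by('-created')
    )
    hosts = Host.objects.annotate(
        # traverse job__ from the summary rather than a second UnifiedJob subquery
        last_job_template_id=Subquery(latest.values('job__job_template__id')[:1]),
        last_job_template_name=Subquery(latest.values('job__job_template__name')[:1]),
    )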
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Update test_events to use JobHostSummary.latest_for_host instead of stale FKs
Tests were asserting host.last_job_id and host.last_job_host_summary_id
which are no longer updated. Use JobHostSummary.latest_for_host() to
derive the same data, matching the new read-time derivation approach.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Remove stale failures_url from deprecated DashboardView
The failures_url linked to ?last_job_host_summary__failed=True which
filters on the now-stale FK. The dashboard count itself was already
fixed to use a subquery annotation. Since DashboardView is deprecated
and has_active_failures is a SerializerMethodField (not filterable),
remove the failures_url entirely rather than creating a custom filter.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Apply black formatting to changed files
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Refactor: replace 10 subquery annotations with bulk prefetch
Instead of annotating every host queryset with 10 correlated subqueries
(summary + job + job_template fields), annotate only _latest_summary_id
and bulk-fetch the full JobHostSummary objects after pagination via
select_related('job', 'job__job_template').
This reduces the SQL from 10 correlated subqueries to 1 subquery + 1 IN
query, addressing review feedback about annotation overhead on host list
views.
- _annotate_host_latest_summary: only annotates _latest_summary_id
- _prefetch_latest_summaries: bulk-fetches and attaches to host objects
- HostSummaryPrefetchMixin: hooks into list() after pagination
- Serializer uses real JobHostSummary objects (no more AnnotatedSummary)
- to_representation always overwrites stale FK values
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Refactor: move latest summary to QuerySet._fetch_all + Host.latest_summary
Per review feedback, replace the view-level HostSummaryPrefetchMixin
with a custom QuerySet that bulk-attaches summaries at evaluation time
(like prefetch_related), and a Host.latest_summary property as the
single access point.
- HostLatestSummaryQuerySet: overrides _fetch_all() to bulk-fetch
JobHostSummary objects with select_related after queryset evaluation
- HostManager now inherits from the custom queryset via from_queryset()
- Host.latest_summary property: uses cache if available, falls back to
individual query
- Remove _annotate_host_latest_summary, _prefetch_latest_summaries,
HostSummaryPrefetchMixin from views — no more list() override needed
- Remove last_job/last_job_host_summary from SUMMARIZABLE_FK_FIELDS
- Serializer uses obj.latest_summary and DEFAULT_SUMMARY_FIELDS loop
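Condensed, the shape of this is roughly the sketch below (the real HostManager
keeps its existing behavior and only gains the queryset class; cache attribute
and other details are assumptions):

    from django.db import models
    from django.db.models import OuterRef, Subquery

    class HostLatestSummaryQuerySet(models.QuerySet):
        def with_latest_summary_id(self):
            latest = (JobHostSummary.objects
                      .filter(host_id=OuterRef('pk'))
                      .order_by('-created'))
            return self.annotate(_latest_summary_id=Subquery(latest.values('id')[:1]))

        # _fetch_all() is overridden to bulk-fetch the JobHostSummary rows for the
        # annotated ids (with select_related) and attach them to each host right
        # after the queryset is evaluated (override omitted here for brevity).

    HostManager = models.Manager.from_queryset(HostLatestSummaryQuerySet)

    class Host(models.Model):
        objects = HostManager()

        @property
        def latest_summary(self):
            # single access point: use the bulk-attached cache when present,
            # otherwise fall back to an individual query
            if hasattr(self, '_latest_summary_cache'):
                return self._latest_summary_cache
            return JobHostSummary.latest_for_host(self.id)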
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Fix: scope annotation to views, restore license_error/canceled_on
- Remove with_latest_summary_id() from HostManager.get_queryset() to
avoid applying the correlated subquery to every Host query globally
(count, exists, internal relations)
- Apply with_latest_summary_id() in get_queryset() of the 6
host-serving views only
- Restore license_error and canceled_on to last_job summary fields
to avoid breaking API change
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Guard _fetch_all() to skip bulk-attach on non-annotated querysets
Without this guard, _fetch_all() would set _latest_summary_cache=None
on every host in non-annotated querysets (e.g. Host.objects.filter()),
masking the per-object fallback query in Host.latest_summary.
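Roughly, the guarded override looks like this sketch (checking
query.annotations is an assumed detail; a plain Host queryset is assumed):

    def _fetch_all(self):
        needs_attach = self._result_cache is None
        super()._fetch_all()
        # only bulk-attach when this queryset was built via with_latest_summary_id();
        # otherwise leave hosts untouched so Host.latest_summary falls back to its
        # per-object query instead of seeing a cached None
        if not needs_attach or '_latest_summary_id' not in self.query.annotations:
            return
        ids = {h._latest_summary_id for h in self._result_cache if h._latest_summary_id}
        summaries = (JobHostSummary.objects.filter(id__in=ids)
                     .select_related('job', 'job__job_template').in_bulk())
        for h in self._result_cache:
            h._latest_summary_cache = summaries.get(h._latest_summary_id)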
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Remove name from last_job_host_summary and canceled_on from last_job summary
Per reviewer feedback: these fields were not in the original API contract
via SUMMARIZABLE_FK_FIELDS and their addition would be an API change.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add functional tests for HostLatestSummaryQuerySet and Host.latest_summary
Tests cover:
- with_latest_summary_id() annotation and most-recent selection
- _fetch_all() bulk-attach behavior on annotated querysets
- _fetch_all() skips non-annotated querysets (preserves fallback)
- .count() and .exists() do NOT trigger _fetch_all
- Host.latest_summary cache hits (zero queries) and fallback
- Host.latest_job property
- select_related on bulk-attached summaries (no N+1)
- Chaining preserves annotation
- Multiple jobs / partial host coverage
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Apply black formatting to test_host_queryset.py
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ben Thomasson <bthomass@redhat.com>
* Fix flake8 F841: remove unused job1/job2 variables in tests
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ben Thomasson <bthomass@redhat.com>
* Add comment explaining why Prefetch was not used for host latest summary
Django Prefetch cannot handle latest per group -- [:1] slicing fetches
1 record globally, not per host (Django ticket #26780). The custom
_fetch_all override uses the same 2-query pattern as prefetch_related
internally, customized for this use case.
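For reference, the Prefetch one might be tempted to write is sketched below;
per the note above, the [:1] slice does not yield one row per host (and sliced
prefetch querysets are not supported at all on older Django versions). The
related name and to_attr are assumptions:

    from django.db.models import Prefetch

    Host.objects.prefetch_related(
        Prefetch(
            'job_host_summaries',
            queryset=JobHostSummary.objects.order_by('-created')[:1],
            to_attr='latest_summaries',
        )
    )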
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Fix null handling to keep old behavior
---------
Signed-off-by: Ben Thomasson <bthomass@redhat.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: AlanCoding <arominge@redhat.com>
Removes ability to directly create and delete
receptor addresses for a given node.
Instead, receptor addresses are created automatically
if listener_port is set on the Instance.
For example, patching a "hop" instance
with {"listener_port": 6667}
will create a canonical receptor address with port 6667.
Likewise, peers_from_control_nodes on the instance
sets the peers_from_control_nodes on the canonical
address (if listener port is also set).
protocol is a read-only field that simply reflects
the canonical address protocol.
Other Changes:
- rename k8s_routable to is_internal
- add protocol to ReceptorAddress
- remove peers_from_control_nodes and listener_port
from Instance model
Signed-off-by: Seth Foster <fosterbseth@gmail.com>
Creates a non-deletable address that acts as
the "main" address for this instance.
All other addresses for that instance must
be non-canonical.
When listener_port on an instance is set, automatically
create a canonical receptor address where:
- address is hostname of instance
- port is listener_port
- canonical is True
Additionally, protocol field is added to instance to
denote the receptor listener protocol to use (ws, tcp).
The receptor config listener information is derived from
the listener_port and protocol information. Having a
canonical address that mirrors the listener_port ensures that
an address exists that matches the receptor config information.
Other changes:
- Add managed field to receptor address.
If managed is True, no fields on this address can be edited
via the API.
If canonical is True, only the address field cannot be edited.
- Add managed field to instance. If managed is True, users
cannot set node_state to deprovisioning (i.e. cannot delete node)
This is a change to our mechanism for preventing users from deleting
the mesh ingress hop node.
- Field is_internal is now renamed to k8s_routable
- Add reverse_peers on instance which is a list of instance IDs
that peer to this instance (via an address)
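A rough sketch of the canonical-address derivation described above (the helper
name and where it is hooked in are assumptions, not part of this change):

    def ensure_canonical_address(instance):
        # keep the canonical ReceptorAddress in sync with the instance's
        # listener_port and protocol; only applies when listener_port is set
        if not instance.listener_port:
            return None
        addr, _ = ReceptorAddress.objects.update_or_create(
            instance=instance,
            canonical=True,
            defaults={
                'address': instance.hostname,
                'port': instance.listener_port,
                'protocol': instance.protocol,
            },
        )
        return addr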
Signed-off-by: Seth Foster <fosterbseth@gmail.com>
register_peers has inputs:
source: source instance
peers: list of instances the source should peer to
InstanceLink "target" is now expected to be a ReceptorAddress
For each peer, we can just use the first receptor address. If
multiple receptor addresses exist, throw a command error.
Currently this command is only used on VM-deployments, where
there is only a single receptor address per instance, so this
should work fine.
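The per-peer address resolution amounts to something like this sketch (the
relation name, error wording, and handling of the no-address case are
assumptions):

    from django.core.management.base import CommandError

    def address_for_peer(peer):
        addresses = list(peer.receptor_addresses.all())
        if len(addresses) > 1:
            raise CommandError(f"{peer.hostname} has multiple receptor addresses")
        if not addresses:
            raise CommandError(f"{peer.hostname} has no receptor address to peer to")
        # InstanceLink.target now points at this ReceptorAddress
        return addresses[0]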
Other changes:
drop listener_port field from Instance. Listener port is now just
"port" on ReceptorAddress
Signed-off-by: Seth Foster <fosterbseth@gmail.com>
group_vars all.yaml changes:
- peer entry has two fields, address and port
- receptor_port is inferred from the first
receptor_address entry that uses protocol tcp (see the sketch below)
other changes:
ActivityStream now records when receptor_addresses
are peered to
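A sketch of that inference and of the resulting peer entry shape (names are
illustrative):

    def receptor_port_for(instance):
        # receptor_port comes from the first tcp receptor_address entry
        for addr in instance.receptor_addresses.all():
            if addr.protocol == 'tcp':
                return addr.port
        return None

    # a peer entry in group_vars/all.yml then carries just two fields:
    #   {"address": addr.address, "port": addr.port}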
Signed-off-by: Seth Foster <fosterbseth@gmail.com>
- write_receptor_config peers to ReceptorAddress entries
that have peers_from_control_nodes enabled
- peers_from_control_nodes and listener_port removed from Instance model
- peers_from_control_nodes added to ReceptorAddress model
- ReceptorAddress is now unique by address and protocol combination
- Write receptor config task is dispatched upon ReceptorAddress creation
or deletion, and when control node is first created
- InstanceLinkSerializer adds a target_address field and has logic
to grab the instance hostname associated with the peered ReceptorAddress
Signed-off-by: Seth Foster <fosterbseth@gmail.com>
Add post save and post delete hooks to
call write_receptor_config when
a receptor address is added / removed.
Add peers_from_control_nodes to
provision_instance
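The hooks are roughly of this shape (receiver name, import path, and dispatch
style are assumptions):

    from django.db.models.signals import post_save, post_delete
    from django.dispatch import receiver

    @receiver(post_save, sender=ReceptorAddress)
    @receiver(post_delete, sender=ReceptorAddress)
    def schedule_write_receptor_config(sender, instance, **kwargs):
        # regenerate the receptor config whenever an address is added or removed;
        # exactly how the task is dispatched is assumed here
        write_receptor_config.delay()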
Signed-off-by: Seth Foster <fosterbseth@gmail.com>
API changes
- cannot change peers or enable
peers_from_control_nodes on VM deployments
- allow setting ip_address
- use ip_address over hostname in the generated
group_vars/all.yml
- Drop api/v2/peers endpoint
DB changes
- add ip_address unique constraint, but ignore "" entries (see the sketch
  after this list)
Other changes
- provision_instance should take a listener_port option
Tests
- test that new controls don't disturb other peer relationships
- test ip_address over hostname
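One way to express "unique but ignore empty strings" as a conditional
constraint (sketch; the constraint name is made up):

    from django.db import models
    from django.db.models import Q

    # inside the Instance model:
    class Meta:
        constraints = [
            models.UniqueConstraint(
                fields=['ip_address'],
                name='unique_ip_address_when_set',
                condition=~Q(ip_address=''),
            ),
        ]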
* Remove committed_capacity field, delete supporting code
* Track consumed capacity to solve the negatives problem
* Use more verbose name for IG queryset
- Django's PostgreSQL JSONField wraps values in a JsonAdapter, so deal
with that when it happens. This goes away in Django 3.1.
- Setting related *_id fields clears the actual relation field, so
trying to fake objects for tests is a problem
- Instance.objects.me() was inappropriately creating stub objects
every time while running tests, but some of our tests now create
real db objects. Ditch that logic and use a proper fixture where needed.
- awxkit tox.ini was pinned at Python 3.8
* Select control node before start task
Consume capacity on control nodes for controlling tasks and consider
remaining capacity on control nodes before selecting them.
This depends on the requirement that control and hybrid nodes should all
be in the instance group named 'controlplane'. Many tests do not satisfy that
requirement. I'll update the tests in another commit.
* update tests to use controlplane
We don't start any tasks if we don't have a controlplane instance group.
Due to updates to fixtures, update tests to set node type and capacity
explicitly so they get the expected result.
* Fixes for accounting of control capacity consumed
The update method is used to account for currently consumed capacity for
instance groups in the in-memory capacity tracking data structure we
initialize in after_lock_init and then update via calculate_capacity_consumed
(both in task_manager.py).
Also update fit_task_to_instance to consider control impact on instances
Trust that these functions do the right thing when looking for a
node with capacity, and cut out the redundant check for the whole group's
capacity per Alan's recommendation.
* Refactor now redundant code
Deal with control type tasks before we loop over the preferred instance
groups, which cuts out the need for some redundant logic.
Also, fix a bug where I missed assigning the execution node in one case!
* set job explanation on tasks that need capacity
move the job explanation for jobs that need capacity to a function
so we can re-use it in the three places we need it.
* project updates always run on the controlplane
Instance group ordering makes no sense on project updates because they
always need to run on the control plane.
Also, since hybrid nodes should always run the control processes for the
jobs running on them as execution nodes, account for this when looking for an
execution node.
* fix misleading message
the variables and wording were both misleading; fix them to give a more
accurate description in the two different cases where this log may be emitted.
* use settings correctly
use settings.DEFAULT_CONTROL_PLANE_QUEUE_NAME instead of a hardcoded
name
cache the controlplane_ig object during after_lock_init to avoid
an unnecessary query
eliminate mistakenly duplicated AWX_CONTROL_PLANE_TASK_IMPACT and use
only AWX_CONTROL_NODE_TASK_IMPACT
* add test for control capacity consumption
add a test to verify that when there are 2 jobs and only capacity for one,
one moves into waiting and the other stays in pending
* add test for hybrid node capacity consumption
assert that the hybrid node is used for both control and execution and
capacity is deducted correctly
* add test for task.capacity_type = control
Test that control type tasks have the right capacity consumed and
get assigned to the right instance group
Also fix lint in the tests
* jobs_running not accurate for control nodes
We can either NOT use "idle instances" for control nodes, or we need
to update the jobs_running property on the Instance model to count
jobs where the node is the controller_node.
I didn't do that because it may be an expensive query, and it would be
hard to make it match with jobs_running on the InstanceGroup which
filters on tasks assigned to the instance group.
This change chooses to stop considering "idle" control nodes an option,
since we can't accurately identify them.
Without any change, we keep over-consuming capacity on control nodes
because this method sees all control nodes as "idle" at the beginning
of the task manager run, and then only counts jobs started in that run
in the in-memory tracking. So jobs which span several task manager runs
build up consumed capacity, which is accurately reported
via Instance.consumed_capacity.
* Reduce default task impact for control nodes
This is something we can experiment with as far as what users
want at install time, but start with just 1 for now.
* update capacity docs
Describe usage of the new setting and the concept of control impact.
Co-authored-by: Alan Rominger <arominge@redhat.com>
Co-authored-by: Rebeccah <rhunter@redhat.com>
- the list, detail, and health check API views should not include them
- the Instance-InstanceGroup association views should not allow them
to be changed
- the ping view excludes them
- list_instances management command excludes them
- Instance.set_capacity_value sets hop nodes to 0 capacity
- TaskManager will exclude them from the nodes available for job execution
- TaskManager.reap_jobs_from_orphaned_instances will consider hop nodes
to be orphaned instances
- The apply_cluster_membership_policies task will not manipulate hop nodes
- get_broadcast_hosts will ignore hop nodes
- active_count also will ignore hop nodes
This will allow us to control the default container group created via settings, meaning
we could set this in the operator and the default container group would get created with it applied.
We need this for https://github.com/ansible/awx-operator/issues/242
Deepmerge the default podspec and the override.
Without this, providing the `spec` for the podspec would override everything
it contains, including the container used, which is not desired.
Also, use the same deepmerge function definition, as the code appears to be
copy-pasted from the utils.
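A minimal sketch of the merge behavior (helper and variable names are
illustrative; the real change reuses the existing util's deepmerge):

    def deepmerge(base, override):
        # recursively merge override into base so a partial 'spec' override
        # does not wipe out the default container definition
        if isinstance(base, dict) and isinstance(override, dict):
            merged = dict(base)
            for key, value in override.items():
                merged[key] = deepmerge(base[key], value) if key in base else value
            return merged
        return override

    # pod_spec = deepmerge(default_pod_spec, user_supplied_override)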
* Clean up added work_type processing for mesh_code branch
* track both execution and control capacity
* Remove unused execution_capacity property
* Count all forms of capacity to make test pass
* Force jobs to be on execution nodes, updates on control nodes
* Introduce capacity_type property to abstract some details out
* Update test to cover all job types at same time
* Register OpenShift nodes as control types
* Remove unqualified consumed_capacity from task manager and make unit tests work
* Remove unqualified consumed_capacity from task manager and make unit tests work
* Update unit test to execution vs control TM logic changes
* Fix bug, else handling for work_type method
* Model changes for instance last_seen field to replace modified
* Break up refresh_capacity into smaller units
* Rename execution node methods, fix last_seen clustering
* Use update_fields to make it clear save only affects capacity
* Restructing to pass unit tests
* Fix bug where a PATCH did not update capacity value
keep pre-upgrade events in an old table (instead of a partition)
- instead of creating a default partition, keep all events in special
"unpartitioned" tables
- track these tables via distinct proxy=true models
- when generating the queryset for a UnifiedJob's events, look at the
creation date of the job; if it's before the date of the migration,
  query the old unpartitioned table, otherwise use the newer partitioned table
  that provides auto-partitioning (a sketch follows this list)
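The queryset selection then reduces to roughly this (model names and how the
migration date is looked up are assumptions):

    # PARTITION_MIGRATION_DATE: when the event-partitioning migration was
    # applied (how this is stored/looked up is an assumption)

    def get_event_queryset(unified_job):
        if unified_job.created < PARTITION_MIGRATION_DATE:
            # pre-upgrade job: events live in the old, unpartitioned table
            return UnpartitionedJobEvent.objects.filter(job=unified_job)
        # post-upgrade job: events live in the auto-partitioned table
        return JobEvent.objects.filter(job=unified_job)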