Allow manually running a health check, and make other adjustments to the health check trigger (#11002)

* Fully finalize the planned work for health checks of execution nodes

* Implementation of instance health_check endpoint

* Also do version conditional to node_type

* Do not use receptor mesh to check main cluster nodes health

* Fix bugs from testing health check of cluster nodes, add doc

* Add a few fields to health check serializer missed before

* Light refactoring of error field processing

* Fix errors clearing error, write more unit tests

* Update health check info in docs

* Bump migration of health check after rebase

* Mark string for translation

* Add related health_check link for system auditors too

* Handle health_check cluster node timeout, add errors for peer judgement
Author: Alan Rominger
Date: 2021-09-03 16:37:37 -04:00 (committed via GitHub)
parent 169c0f6642
commit 6a17e5b65b
15 changed files with 285 additions and 53 deletions

View File

@@ -175,7 +175,7 @@ init:
 	fi; \
 	$(MANAGEMENT_COMMAND) provision_instance --hostname=$(COMPOSE_HOST) --node_type=$(MAIN_NODE_TYPE); \
 	$(MANAGEMENT_COMMAND) register_queue --queuename=controlplane --instance_percent=100;\
-	$(MANAGEMENT_COMMAND) register_queue --queuename=default;
+	$(MANAGEMENT_COMMAND) register_queue --queuename=default --instance_percent=100;
 	if [ ! -f /etc/receptor/certs/awx.key ]; then \
 		rm -f /etc/receptor/certs/*; \
 		receptor --cert-init commonname="AWX Test CA" bits=2048 outcert=/etc/receptor/certs/ca.crt outkey=/etc/receptor/certs/ca.key; \

View File

@@ -25,7 +25,7 @@ __all__ = [
     'ProjectUpdatePermission',
     'InventoryInventorySourcesUpdatePermission',
     'UserPermission',
-    'IsSuperUser',
+    'IsSystemAdminOrAuditor',
    'InstanceGroupTowerPermission',
     'WorkflowApprovalPermission',
 ]
@@ -236,13 +236,18 @@ class UserPermission(ModelAccessPermission):
         raise PermissionDenied()


-class IsSuperUser(permissions.BasePermission):
+class IsSystemAdminOrAuditor(permissions.BasePermission):
     """
-    Allows access only to admin users.
+    Allows write access only to system admin users.
+    Allows read access only to system auditor users.
     """

     def has_permission(self, request, view):
-        return request.user and request.user.is_superuser
+        if not request.user:
+            return False
+        if request.method == 'GET':
+            return request.user.is_superuser or request.user.is_system_auditor
+        return request.user.is_superuser


 class InstanceGroupTowerPermission(ModelAccessPermission):
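The access rule introduced by `IsSystemAdminOrAuditor` can be reduced to a small pure function for illustration — a sketch of the decision table only, not the DRF class (the real class also rejects requests with no authenticated user):

```python
# Sketch of the IsSystemAdminOrAuditor decision: reads (GET) are open to
# superusers and system auditors; every other method requires a superuser.

def allowed(method: str, is_superuser: bool, is_system_auditor: bool) -> bool:
    if method == 'GET':
        return is_superuser or is_system_auditor
    return is_superuser

# A system auditor may read health check data but not trigger one.
assert allowed('GET', False, True)
assert not allowed('POST', False, True)
assert allowed('POST', True, False)
```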

View File

@@ -4786,6 +4786,9 @@ class InstanceSerializer(BaseSerializer):
             "hostname",
             "created",
             "modified",
+            "last_seen",
+            "last_health_check",
+            "errors",
             'capacity_adjustment',
             "version",
             "capacity",
@@ -4806,6 +4809,8 @@ class InstanceSerializer(BaseSerializer):
         res = super(InstanceSerializer, self).get_related(obj)
         res['jobs'] = self.reverse('api:instance_unified_jobs_list', kwargs={'pk': obj.pk})
         res['instance_groups'] = self.reverse('api:instance_instance_groups_list', kwargs={'pk': obj.pk})
+        if self.context['request'].user.is_superuser or self.context['request'].user.is_system_auditor:
+            res['health_check'] = self.reverse('api:instance_health_check', kwargs={'pk': obj.pk})
         return res

     def get_consumed_capacity(self, obj):
@@ -4818,6 +4823,13 @@ class InstanceSerializer(BaseSerializer):
         return float("{0:.2f}".format(((float(obj.capacity) - float(obj.consumed_capacity)) / (float(obj.capacity))) * 100))


+class InstanceHealthCheckSerializer(BaseSerializer):
+
+    class Meta:
+        model = Instance
+        read_only_fields = ('uuid', 'hostname', 'version', 'last_health_check', 'errors', 'cpu', 'memory', 'cpu_capacity', 'mem_capacity', 'capacity')
+        fields = read_only_fields
+
+
 class InstanceGroupSerializer(BaseSerializer):

     show_capabilities = ['edit', 'delete']

View File

@@ -0,0 +1,33 @@
+{% ifmeth GET %}
+# Health Check Data
+
+Health checks are used to obtain important data about an instance.
+Instance fields affected by the health check are shown in this view.
+
+Fundamentally, health checks require running code on the machine in question.
+
+- For instances with `node_type` of "control" or "hybrid", health checks are
+  performed as part of a periodic task that runs in the background.
+- For instances with `node_type` of "execution", health checks are done by submitting
+  a work unit through the receptor mesh.
+
+If run through the receptor mesh, the invoked command is:
+
+```
+ansible-runner worker --worker-info
+```
+
+For execution nodes, these checks are _not_ performed on a regular basis.
+Health checks against functional nodes are run when the node is first discovered.
+Health checks against nodes with errors are repeated at a reduced frequency.
+{% endifmeth %}
+
+{% ifmeth POST %}
+# Manually Initiate a Health Check
+
+For purposes of error remediation or debugging, a health check can be
+manually initiated by making a POST request to this endpoint.
+
+This will submit the work unit to the target node through the receptor mesh and wait for it to finish.
+The model will be updated with the result.
+Up-to-date values of the fields will be returned in the response data.
+{% endifmeth %}
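The GET-side notes above distinguish how a check runs by `node_type`; a minimal sketch of that dispatch decision (illustrative names only — the real logic lives in the API view and task code):

```python
# How a health check is carried out, keyed on node_type (per the doc above).

def health_check_mechanism(node_type: str) -> str:
    if node_type in ('control', 'hybrid'):
        # runs locally as part of a periodic background task
        return 'periodic-task'
    if node_type == 'execution':
        # submits `ansible-runner worker --worker-info` over the receptor mesh
        return 'receptor-work-unit'
    raise ValueError(f'unknown node_type: {node_type}')

assert health_check_mechanism('hybrid') == 'periodic-task'
assert health_check_mechanism('execution') == 'receptor-work-unit'
```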

View File

@@ -3,7 +3,7 @@
 from django.conf.urls import url

-from awx.api.views import InstanceList, InstanceDetail, InstanceUnifiedJobsList, InstanceInstanceGroupsList
+from awx.api.views import InstanceList, InstanceDetail, InstanceUnifiedJobsList, InstanceInstanceGroupsList, InstanceHealthCheck

 urls = [
@@ -11,6 +11,7 @@ urls = [
     url(r'^(?P<pk>[0-9]+)/$', InstanceDetail.as_view(), name='instance_detail'),
     url(r'^(?P<pk>[0-9]+)/jobs/$', InstanceUnifiedJobsList.as_view(), name='instance_unified_jobs_list'),
     url(r'^(?P<pk>[0-9]+)/instance_groups/$', InstanceInstanceGroupsList.as_view(), name='instance_instance_groups_list'),
+    url(r'^(?P<pk>[0-9]+)/health_check/$', InstanceHealthCheck.as_view(), name='instance_health_check'),
 ]

 __all__ = ['urls']

View File

@@ -108,6 +108,7 @@ from awx.api.permissions import (
     InstanceGroupTowerPermission,
     VariableDataPermission,
     WorkflowApprovalPermission,
+    IsSystemAdminOrAuditor,
 )
 from awx.api import renderers
 from awx.api import serializers
@@ -408,6 +409,56 @@ class InstanceInstanceGroupsList(InstanceGroupMembershipMixin, SubListCreateAtta
         return None


+class InstanceHealthCheck(GenericAPIView):
+
+    name = _('Instance Health Check')
+    model = models.Instance
+    serializer_class = serializers.InstanceHealthCheckSerializer
+    permission_classes = (IsSystemAdminOrAuditor,)
+
+    def get(self, request, *args, **kwargs):
+        obj = self.get_object()
+        data = self.get_serializer(data=request.data).to_representation(obj)
+        return Response(data, status=status.HTTP_200_OK)
+
+    def post(self, request, *args, **kwargs):
+        obj = self.get_object()
+        if obj.node_type == 'execution':
+            from awx.main.tasks import execution_node_health_check
+
+            runner_data = execution_node_health_check(obj.hostname)
+            obj.refresh_from_db()
+            data = self.get_serializer(data=request.data).to_representation(obj)
+            # Add in some extra unsaved fields
+            for extra_field in ('transmit_timing', 'run_timing'):
+                if extra_field in runner_data:
+                    data[extra_field] = runner_data[extra_field]
+        else:
+            from awx.main.tasks import cluster_node_health_check
+
+            if settings.CLUSTER_HOST_ID == obj.hostname:
+                cluster_node_health_check(obj.hostname)
+            else:
+                cluster_node_health_check.apply_async([obj.hostname], queue=obj.hostname)
+                start_time = time.time()
+                prior_check_time = obj.last_health_check
+                while time.time() - start_time < 50.0:
+                    obj.refresh_from_db(fields=['last_health_check'])
+                    if obj.last_health_check != prior_check_time:
+                        break
+                    if time.time() - start_time < 1.0:
+                        time.sleep(0.1)
+                    else:
+                        time.sleep(1.0)
+                else:
+                    obj.mark_offline(errors=_('Health check initiated by user determined this instance to be unresponsive'))
+            obj.refresh_from_db()
+            data = self.get_serializer(data=request.data).to_representation(obj)
+        return Response(data, status=status.HTTP_200_OK)
+
+
 class InstanceGroupList(ListCreateAPIView):

     name = _("Instance Groups")
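The POST handler's wait loop above polls the database with a short-then-long sleep and a roughly 50-second timeout; isolated as a generic helper it looks like this (a sketch with an injected `read_value` callable, not the view code itself):

```python
import time


def wait_for_change(read_value, prior_value, timeout=50.0, fast_window=1.0):
    """Poll read_value() until it differs from prior_value.

    Polls every 0.1s during the first fast_window seconds so quick checks
    return promptly, then every 1.0s until timeout. Returns True when a
    change is observed; False on timeout (the view then marks the node
    offline as unresponsive).
    """
    start = time.time()
    while time.time() - start < timeout:
        if read_value() != prior_value:
            return True
        if time.time() - start < fast_window:
            time.sleep(0.1)
        else:
            time.sleep(1.0)
    return False


# Simulated check: the value changes on the third poll.
values = iter([0, 0, 1])
assert wait_for_change(lambda: next(values), 0, timeout=5.0) is True
assert wait_for_change(lambda: 0, 0, timeout=0.3) is False
```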

View File

@@ -23,7 +23,7 @@ from rest_framework import status
 # AWX
 from awx.api.generics import APIView, GenericAPIView, ListAPIView, RetrieveUpdateDestroyAPIView
-from awx.api.permissions import IsSuperUser
+from awx.api.permissions import IsSystemAdminOrAuditor
 from awx.api.versioning import reverse
 from awx.main.utils import camelcase_to_underscore
 from awx.main.tasks import handle_setting_changes
@@ -150,7 +150,7 @@ class SettingLoggingTest(GenericAPIView):

     name = _('Logging Connectivity Test')
     model = Setting
     serializer_class = SettingSingletonSerializer
-    permission_classes = (IsSuperUser,)
+    permission_classes = (IsSystemAdminOrAuditor,)
     filter_backends = []

     def post(self, request, *args, **kwargs):

View File

@@ -0,0 +1,25 @@
+# Generated by Django 2.2.20 on 2021-08-31 17:41
+
+from django.db import migrations, models
+
+
+class Migration(migrations.Migration):
+
+    dependencies = [
+        ('main', '0154_set_default_uuid'),
+    ]
+
+    operations = [
+        migrations.AddField(
+            model_name='instance',
+            name='errors',
+            field=models.TextField(blank=True, default='', editable=False, help_text='Any error details from the last health check.'),
+        ),
+        migrations.AddField(
+            model_name='instance',
+            name='last_health_check',
+            field=models.DateTimeField(
+                editable=False, help_text='Last time a health check was run on this instance to refresh cpu, memory, and capacity.', null=True
+            ),
+        ),
+    ]

View File

@@ -82,11 +82,22 @@ class Instance(HasPolicyEditsMixin, BaseModel):
         editable=False,
         help_text=_('Total system memory of this instance in bytes.'),
     )
+    errors = models.TextField(
+        default='',
+        blank=True,
+        editable=False,
+        help_text=_('Any error details from the last health check.'),
+    )
     last_seen = models.DateTimeField(
         null=True,
         editable=False,
         help_text=_('Last time instance ran its heartbeat task for main cluster nodes. Last known connection to receptor mesh for execution nodes.'),
     )
+    last_health_check = models.DateTimeField(
+        null=True,
+        editable=False,
+        help_text=_('Last time a health check was run on this instance to refresh cpu, memory, and capacity.'),
+    )
     # Capacity management
     capacity = models.PositiveIntegerField(
         default=100,
@@ -152,15 +163,16 @@ class Instance(HasPolicyEditsMixin, BaseModel):
             grace_period += settings.RECEPTOR_SERVICE_ADVERTISEMENT_PERIOD
         return self.last_seen < ref_time - timedelta(seconds=grace_period)

-    def mark_offline(self, update_last_seen=False, perform_save=True):
-        if self.cpu_capacity == 0 and self.mem_capacity == 0 and self.capacity == 0 and (not update_last_seen):
+    def mark_offline(self, update_last_seen=False, perform_save=True, errors=''):
+        if self.cpu_capacity == 0 and self.mem_capacity == 0 and self.capacity == 0 and self.errors == errors and (not update_last_seen):
             return
         self.cpu_capacity = self.mem_capacity = self.capacity = 0
+        self.errors = errors
         if update_last_seen:
             self.last_seen = now()
         if perform_save:
-            update_fields = ['capacity', 'cpu_capacity', 'mem_capacity']
+            update_fields = ['capacity', 'cpu_capacity', 'mem_capacity', 'errors']
             if update_last_seen:
                 update_fields += ['last_seen']
             self.save(update_fields=update_fields)
@@ -180,11 +192,12 @@ class Instance(HasPolicyEditsMixin, BaseModel):
         self.mem_capacity = get_mem_effective_capacity(self.memory)
         self.set_capacity_value()

-    def save_health_data(self, version, cpu, memory, uuid=None, last_seen=None, has_error=False):
-        update_fields = []
-        if last_seen is not None and self.last_seen != last_seen:
-            self.last_seen = last_seen
+    def save_health_data(self, version, cpu, memory, uuid=None, update_last_seen=False, errors=''):
+        self.last_health_check = now()
+        update_fields = ['last_health_check']
+        if update_last_seen:
+            self.last_seen = self.last_health_check
             update_fields.append('last_seen')
         if uuid is not None and self.uuid != uuid:
@@ -207,25 +220,26 @@ class Instance(HasPolicyEditsMixin, BaseModel):
             self.memory = new_memory
             update_fields.append('memory')

-        if not has_error:
+        if not errors:
             self.refresh_capacity_fields()
+            self.errors = ''
         else:
-            self.mark_offline(perform_save=False)
-        update_fields.extend(['cpu_capacity', 'mem_capacity', 'capacity'])
+            self.mark_offline(perform_save=False, errors=errors)
+        update_fields.extend(['cpu_capacity', 'mem_capacity', 'capacity', 'errors'])
         self.save(update_fields=update_fields)

     def local_health_check(self):
         """Only call this method on the instance that this record represents"""
-        has_error = False
+        errors = None
         try:
             # if redis is down for some reason, that means we can't persist
             # playbook event data; we should consider this a zero capacity event
             redis.Redis.from_url(settings.BROKER_URL).ping()
         except redis.ConnectionError:
-            has_error = True
+            errors = _('Failed to connect to Redis')
-        self.save_health_data(awx_application_version, get_cpu_count(), get_mem_in_bytes(), last_seen=now(), has_error=has_error)
+        self.save_health_data(awx_application_version, get_cpu_count(), get_mem_in_bytes(), update_last_seen=True, errors=errors)


 class InstanceGroup(HasPolicyEditsMixin, BaseModel, RelatedJobsMixin):

View File

@@ -177,7 +177,7 @@ def dispatch_startup():
 def inform_cluster_of_shutdown():
     try:
         this_inst = Instance.objects.get(hostname=settings.CLUSTER_HOST_ID)
-        this_inst.mark_offline(update_last_seen=True)  # No thank you to new jobs while shut down
+        this_inst.mark_offline(update_last_seen=True, errors=_('Instance received normal shutdown signal'))
         try:
             reaper.reap(this_inst)
         except Exception:
@@ -408,14 +408,37 @@ def cleanup_execution_environment_images():
                 logger.debug(f"Failed to delete image {image_name}")


+@task(queue=get_local_queuename)
+def cluster_node_health_check(node):
+    '''
+    Used for the health check endpoint, refreshes the status of the instance, but must be run on the target node
+    '''
+    if node == '':
+        logger.warn('Local health check incorrectly called with blank string')
+        return
+    elif node != settings.CLUSTER_HOST_ID:
+        logger.warn(f'Local health check for {node} incorrectly sent to {settings.CLUSTER_HOST_ID}')
+        return
+    try:
+        this_inst = Instance.objects.me()
+    except Instance.DoesNotExist:
+        logger.warn(f'Instance record for {node} missing, could not check capacity.')
+        return
+    this_inst.local_health_check()
+
+
 @task(queue=get_local_queuename)
 def execution_node_health_check(node):
+    if node == '':
+        logger.warn('Remote health check incorrectly called with blank string')
+        return
     try:
         instance = Instance.objects.get(hostname=node)
     except Instance.DoesNotExist:
         logger.warn(f'Instance record for {node} missing, could not check capacity.')
         return
-    data = worker_info(node)
+
+    data = worker_info(node, work_type='ansible-runner' if instance.node_type == 'execution' else 'local')

     prior_capacity = instance.capacity
@@ -424,7 +447,7 @@ def execution_node_health_check(node):
         cpu=data.get('cpu_count', 0),
         memory=data.get('mem_in_bytes', 0),
         uuid=data.get('uuid'),
-        has_error=bool(data.get('errors')),
+        errors='\n'.join(data.get('errors', [])),
     )

     if data['errors']:
@@ -436,6 +459,8 @@ def execution_node_health_check(node):
     else:
         logger.info('Set capacity of execution node {} to {}, worker info data:\n{}'.format(node, instance.capacity, json.dumps(data, indent=2)))

+    return data
+

 def inspect_execution_nodes(instance_list):
     with advisory_lock('inspect_execution_nodes_lock', wait=False):
@@ -488,10 +513,12 @@ def inspect_execution_nodes(instance_list):
                 logger.warn(f'Execution node attempting to rejoin as instance {hostname}.')
                 execution_node_health_check.apply_async([hostname])
             elif instance.capacity == 0:
-                # Periodically re-run the health check of errored nodes, in case someone fixed it
-                # TODO: perhaps decrease the frequency of these checks
-                logger.debug(f'Restarting health check for execution node {hostname} with known errors.')
-                execution_node_health_check.apply_async([hostname])
+                # nodes with a proven connection that still need remediation run health checks at a reduced frequency
+                if not instance.last_health_check or (nowtime - instance.last_health_check).total_seconds() >= settings.EXECUTION_NODE_REMEDIATION_CHECKS:
+                    # Periodically re-run the health check of errored nodes, in case someone fixed it
+                    # TODO: perhaps decrease the frequency of these checks
+                    logger.debug(f'Restarting health check for execution node {hostname} with known errors.')
+                    execution_node_health_check.apply_async([hostname])


 @task(queue=get_local_queuename)
@@ -556,7 +583,7 @@ def cluster_node_heartbeat():
             # If auto deprovisioning is on, don't bother setting the capacity to 0
             # since we will delete the node anyway.
             if other_inst.capacity != 0 and not settings.AWX_AUTO_DEPROVISION_INSTANCES:
-                other_inst.mark_offline()
+                other_inst.mark_offline(errors=_('Another cluster node has determined this instance to be unresponsive'))
                 logger.error("Host {} last checked in at {}, marked as lost.".format(other_inst.hostname, other_inst.last_seen))
             elif settings.AWX_AUTO_DEPROVISION_INSTANCES:
                 deprovision_hostname = other_inst.hostname
@@ -3028,12 +3055,17 @@ class AWXReceptorJob:

         # We establish a connection to the Receptor socket
         receptor_ctl = get_receptor_ctl()

+        res = None
         try:
-            return self._run_internal(receptor_ctl)
+            res = self._run_internal(receptor_ctl)
+            return res
         finally:
             # Make sure to always release the work unit if we established it
             if self.unit_id is not None and settings.RECEPTOR_RELEASE_WORK:
                 receptor_ctl.simple_command(f"work release {self.unit_id}")
+            # If an error occurred without the job itself failing, it could be a broken instance
+            if self.work_type == 'ansible-runner' and (res is None or getattr(res, 'rc', None) is None):
+                execution_node_health_check(self.task.instance.execution_node)

     def _run_internal(self, receptor_ctl):
         # Create a socketpair. Where the left side will be used for writing our payload

View File

@@ -1,13 +1,22 @@
 import pytest

-from awx.api.versioning import reverse
+from unittest import mock

+from awx.api.versioning import reverse
 from awx.main.models.ha import Instance

+import redis
+
+# Django
+from django.test.utils import override_settings
+
+
+INSTANCE_KWARGS = dict(hostname='example-host', cpu=6, memory=36000000000, cpu_capacity=6, mem_capacity=42)
+

 @pytest.mark.django_db
 def test_disabled_zeros_capacity(patch, admin_user):
-    instance = Instance.objects.create(hostname='example-host', cpu=6, memory=36000000000, cpu_capacity=6, mem_capacity=42)
+    instance = Instance.objects.create(**INSTANCE_KWARGS)
     url = reverse('api:instance_detail', kwargs={'pk': instance.pk})
@@ -20,7 +29,7 @@ def test_disabled_zeros_capacity(patch, admin_user):

 @pytest.mark.django_db
 def test_enabled_sets_capacity(patch, admin_user):
-    instance = Instance.objects.create(hostname='example-host', enabled=False, cpu=6, memory=36000000000, cpu_capacity=6, mem_capacity=42, capacity=0)
+    instance = Instance.objects.create(enabled=False, capacity=0, **INSTANCE_KWARGS)
     assert instance.capacity == 0
     url = reverse('api:instance_detail', kwargs={'pk': instance.pk})
@@ -30,3 +39,25 @@ def test_enabled_sets_capacity(patch, admin_user):

     instance.refresh_from_db()
     assert instance.capacity > 0
+
+
+@pytest.mark.django_db
+def test_auditor_user_health_check(get, post, system_auditor):
+    instance = Instance.objects.create(**INSTANCE_KWARGS)
+    url = reverse('api:instance_health_check', kwargs={'pk': instance.pk})
+    r = get(url=url, user=system_auditor, expect=200)
+    assert r.data['cpu_capacity'] == instance.cpu_capacity
+    post(url=url, user=system_auditor, expect=403)
+
+
+@pytest.mark.django_db
+@mock.patch.object(redis.client.Redis, 'ping', lambda self: True)
+def test_health_check_usage(get, post, admin_user):
+    instance = Instance.objects.create(**INSTANCE_KWARGS)
+    url = reverse('api:instance_health_check', kwargs={'pk': instance.pk})
+    r = get(url=url, user=admin_user, expect=200)
+    assert r.data['cpu_capacity'] == instance.cpu_capacity
+    assert r.data['last_health_check'] is None
+    with override_settings(CLUSTER_HOST_ID=instance.hostname):  # force direct call of cluster_node_health_check
+        r = post(url=url, user=admin_user, expect=200)
+    assert r.data['last_health_check'] is not None

View File

@@ -324,6 +324,23 @@ def test_instance_group_capacity(instance_factory, instance_group_factory):
     assert ig_single.capacity == 100


+@pytest.mark.django_db
+def test_health_check_clears_errors():
+    instance = Instance.objects.create(hostname='foo-1', enabled=True, capacity=0, errors='something went wrong')
+    data = dict(version='ansible-runner-4.2', cpu=782, memory=int(39e9), uuid='asdfasdfasdfasdfasdf', errors='')
+    instance.save_health_data(**data)
+    for k, v in data.items():
+        assert getattr(instance, k) == v
+
+
+@pytest.mark.django_db
+def test_health_check_oh_no():
+    instance = Instance.objects.create(hostname='foo-2', enabled=True, capacity=52, cpu=8, memory=int(40e9))
+    instance.save_health_data('', 0, 0, errors='This it not a real instance!')
+    assert instance.capacity == instance.cpu_capacity == 0
+    assert instance.errors == 'This it not a real instance!'
+
+
 @pytest.mark.django_db
 class TestInstanceGroupOrdering:
     def test_ad_hoc_instance_groups(self, instance_group_factory, inventory, default_instance_group):

View File

@@ -28,13 +28,16 @@ def get_receptor_ctl():
     return ReceptorControl(receptor_sockfile)


-def worker_info(node_name):
+def worker_info(node_name, work_type='ansible-runner'):
     receptor_ctl = get_receptor_ctl()

     transmit_start = time.time()
     error_list = []
     data = {'errors': error_list, 'transmit_timing': 0.0}

-    result = receptor_ctl.submit_work(worktype='ansible-runner', payload='', params={"params": f"--worker-info"}, ttl='20s', node=node_name)
+    kwargs = {}
+    if work_type != 'local':
+        kwargs['ttl'] = '20s'
+    result = receptor_ctl.submit_work(worktype=work_type, payload='', params={"params": f"--worker-info"}, node=node_name, **kwargs)
     unit_id = result['unitid']

     run_start = time.time()
@@ -90,9 +93,11 @@ def worker_info(node_name):
         error_list.extend(remote_data.pop('errors', []))  # merge both error lists
         data.update(remote_data)

-    # see tasks.py usage of keys
-    missing_keys = set(('runner_version', 'mem_in_bytes', 'cpu_count')) - set(data.keys())
-    if missing_keys:
-        data['errors'].append('Worker failed to return keys {}'.format(' '.join(missing_keys)))
+    # If we have a connection error, missing keys would be a trivial consequence of that
+    if not data['errors']:
+        # see tasks.py usage of keys
+        missing_keys = set(('runner_version', 'mem_in_bytes', 'cpu_count')) - set(data.keys())
+        if missing_keys:
+            data['errors'].append('Worker failed to return keys {}'.format(' '.join(missing_keys)))

     return data

View File

@@ -422,6 +422,7 @@ os.environ.setdefault('DJANGO_LIVE_TEST_SERVER_ADDRESS', 'localhost:9013-9199')
 # heartbeat period can factor into some forms of logic, so it is maintained as a setting here
 CLUSTER_NODE_HEARTBEAT_PERIOD = 60
 RECEPTOR_SERVICE_ADVERTISEMENT_PERIOD = 60  # https://github.com/ansible/receptor/blob/aa1d589e154d8a0cb99a220aff8f98faf2273be6/pkg/netceptor/netceptor.go#L34
+EXECUTION_NODE_REMEDIATION_CHECKS = 60 * 10  # once every 10 minutes, check whether an execution node's errors have been resolved

 BROKER_URL = 'unix:///var/run/redis/redis.sock'

 CELERYBEAT_SCHEDULE = {

View File

@@ -61,10 +61,9 @@ Here is a listing of work types that you may encounter:
 - `kubernetes-runtime-auth` - user-space jobs ran in a container group
 - `kubernetes-incluster-auth` - project updates and management jobs on OpenShift Container Platform

-### Auto-discovery of execution nodes
+### Auto-discovery of Execution Nodes

-Instances in control plane must be registered by the installer via `awx-manage`
-commands like `awx-manage register_queue` or `awx-manage register_instance`.
+Instances in the control plane must be registered by the installer via `awx-manage register_queue` or `awx-manage register_instance`.
 Execution-only nodes are automatically discovered after they have been configured and join the receptor mesh.
 Control nodes should see them as a "Known Node".
@@ -72,32 +71,38 @@ Control nodes should see them as a "Known Node".
 Control nodes check the receptor network (reported via `receptorctl status`) when their heartbeat task runs.
 Nodes on the receptor network are compared against the `Instance` model in the database.
-If a node appears in the mesh network which is not in the database, then a "health check" is started.
-Fields like `cpu`, `memory`, and `version` will obtain a non-default value through this process.
+If a node appears in the receptor mesh which is not in the database,
+then a database entry is created and added to the "default" instance group.

 In order to run jobs on execution nodes, either the installer needs to pre-register the node,
 or the user needs to make a PATCH request to `/api/v2/instances/N/` to change the `enabled` field to true.
-Execution nodes should automatically be placed in the default instance group.

 #### Health Check Mechanics

-All relevant data for health checks is reported from the ansible-runner command:
+Fields like `cpu`, `memory`, and `version` will obtain a non-default value from the health check.
+If the instance has problems that would prevent jobs from running, `capacity` will be set to zero,
+and details will be shown in the instance's `errors` field.
+
+For execution nodes, relevant data for health checks is reported from the ansible-runner command:

 ```
 ansible-runner worker --worker-info
 ```

 This will output YAML data to standard out containing CPU, memory, and other metrics used to compute `capacity`.
 AWX invokes this command by submitting a receptor work unit (of type `ansible-runner`) to the target execution node.

-If you have the development environment running, you can run a one-off health check of a node with this command:
-
-```
-echo "from awx.main.utils.receptor import worker_info; worker_info('receptor-1')" | awx-manage shell_plus --quiet
-```
-
-This must be run as the awx user inside one of the hybrid or control nodes.
-This will not affect the actual `Instance` record, but will just run the command and report the data.
+##### Health Check Triggers
+
+Health checks for execution nodes have several triggers that can cause one to run.
+
+- When an execution node is auto-discovered, a health check is started
+- For execution nodes with errors, health checks are re-run about once every 10 minutes for auto-remediation
+- If a job had an error _not from the Ansible subprocess_, then a health check is started to check for instance errors
+- System administrators can manually trigger a health check by making a POST request to `/api/v2/instances/N/health_check/`
+
+Healthy execution nodes will _not_ have health checks run on a regular basis.
+Control and hybrid nodes run health checks via a periodic task (bypassing ansible-runner).

 ### Development Environment
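The 10-minute auto-remediation trigger added in this commit gates on the age of `last_health_check`; a minimal sketch of that gate, with the 600-second default mirroring the `EXECUTION_NODE_REMEDIATION_CHECKS` setting:

```python
from datetime import datetime, timedelta

EXECUTION_NODE_REMEDIATION_CHECKS = 60 * 10  # seconds, as in settings


def should_recheck(last_health_check, now, interval=EXECUTION_NODE_REMEDIATION_CHECKS):
    """An errored node is re-checked if it was never checked, or if the
    last check is at least the remediation interval old."""
    if last_health_check is None:
        return True
    return (now - last_health_check).total_seconds() >= interval


now = datetime(2021, 9, 3, 12, 0, 0)
assert should_recheck(None, now)                           # never checked: check now
assert should_recheck(now - timedelta(minutes=11), now)    # stale: check again
assert not should_recheck(now - timedelta(minutes=5), now)  # recent: wait
```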