Merge pull request #11070 from beeankha/receptor_related_docs_changes
Update Tasks and Clustering Doc Files
Commit 7eefa897b3
@@ -72,13 +72,13 @@ Recommendations and constraints:

### Provisioning and Deprovisioning Instances and Groups

* **Provisioning** - Provisioning instances after installation is supported by updating the `inventory` file and re-running the setup playbook. It's important that this file contains all passwords and related information used when installing the cluster; if this is not the case, other instances may be reconfigured (this can be done intentionally).

* **Deprovisioning** - AWX does not automatically deprovision instances since it cannot distinguish between an instance that was taken offline intentionally or due to failure. To deprovision an instance manually, shut it down (or stop the `automation-controller-service`) and run the AWX deprovision command:

```
$ awx-manage deprovision_instance --hostname=<hostname>
```

Starting with AWX version 19.3.0, deprovisioning an instance results in one or more Receptor configurations needing to be updated across one or more nodes; this cannot be done via a manual process, so the Automation Mesh Installer needs to deprovision the nodes.

Adding nodes to and removing them from the mesh does not require that every node be listed in the inventory file; in other words, the absence of a node from the inventory file _does not_ indicate that the node should be removed. Instead, a `hostvar` of `node_state: deprovision` conveys to the mesh installer that the node should be deprovisioned, as sketched below.
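A minimal sketch of such an inventory entry in INI form; the group name and hostname are illustrative and not taken from this document:

```
# Marks this node for removal the next time the mesh installer is run
[execution_nodes]
execution-node-1.example.org node_state=deprovision
```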
* **Removing/Deprovisioning Instance Groups** - AWX does not automatically deprovision or remove instance groups, even though re-provisioning will often leave them unused. They may still show up in API endpoints and stats monitoring. These groups can be removed with the following command:
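A sketch of the usual form of that command, assuming the `awx-manage unregister_queue` subcommand; the queue name is a placeholder:

```
$ awx-manage unregister_queue --queuename=<name>
```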
@@ -1,154 +0,0 @@

## Receptor Mesh

AWX uses a [Receptor](https://github.com/ansible/receptor) mesh to transmit "user-space" unified jobs to the node where they run:

- jobs
- ad hoc commands
- inventory updates

> NOTE: User-space jobs are what carry out the user's Ansible automation. These job types run inside the designated execution environment so that the needed content is available.

> NOTE: The word "node" corresponds to entries in the `Instance` database model, or the `/api/v2/instances/` endpoint, and refers to a machine participating in the cluster / mesh.

The unified jobs API reports `controller_node` and `execution_node` fields.
The execution node is where the job runs, and the controller node interfaces between the job and server functions.

Before a job can start, the controller node prepares the `private_data_dir` needed for the job to run.
Next, the controller node sends the data via `ansible-runner`'s `transmit`, and connects to the output stream with `process`.
For details on these commands, see the [ansible-runner docs on remote execution](https://ansible-runner.readthedocs.io/en/latest/remote_jobs.html).

On the other side, the execution node runs the job under `ansible-runner worker`.
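This transmit/worker/process flow can also be exercised by hand with `ansible-runner` directly; a minimal sketch, assuming a prepared private data directory at `./demo` containing a playbook `project/test.yml` (both names are illustrative):

```
# transmit: package the private data dir and playbook as a job payload (controller side)
# worker:   execute the job from the streamed payload (execution side)
# process:  consume the result stream and rebuild job artifacts locally (controller side)
$ ansible-runner transmit ./demo -p test.yml | ansible-runner worker | ansible-runner process ./demo
```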
### Split of Control Plane versus Execution Plane

Instances in the **control plane** run persistent AWX services (such as the web server and task dispatcher), project updates, and management jobs.

The task manager logic will not send user-space jobs to **control-only** nodes.
In the inventory definition, the user can set a flag to designate this node type.

**Execution-only** nodes have a minimal set of software requirements needed to participate in the receptor mesh and run jobs under ansible-runner with podman isolation.
These _only_ run user-space jobs, and may be geographically separated (with high latency) from the control plane.
They may not even have a direct connection to the cluster, and instead use other receptor **hop** nodes to communicate.
The hop and execution-only nodes may be referred to collectively as the **execution plane**.

**Hybrid** nodes (control & execution) are instances in the control plane that are allowed to run user-space jobs.

#### Receptor Configuration Work Type

Execution-only nodes need to advertise the "ansible-runner" work type.
User-space jobs are submitted as a receptor work unit with this work type.

An entry like this should appear in the node's `receptor.conf` (receptor configuration file):

```
- work-command:
    worktype: ansible-runner
    command: ansible-runner
    params: worker
    allowruntimeparams: true
```

Control (and hybrid) nodes advertise the "local" work type instead.
The entry is the same as above, except that it has `worktype: local`.
Project updates are submitted as this work type.
If user-space jobs run on a hybrid node, they will also run as the "local" work type.
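Following that description, the corresponding entry on a control or hybrid node would look like this (the same block as above, with only the work type changed):

```
- work-command:
    worktype: local
    command: ansible-runner
    params: worker
    allowruntimeparams: true
```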
Here is a listing of work types that you may encounter:

- `local` - any ansible-runner job run in a traditional install
- `ansible-runner` - remote execution of user-space jobs
- `kubernetes-runtime-auth` - user-space jobs run in a container group
- `kubernetes-incluster-auth` - project updates and management jobs on OpenShift Container Platform

### Auto-discovery of Execution Nodes

Instances in the control plane must be registered by the installer via `awx-manage register_queue` or `awx-manage register_instance`.
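A sketch of what that registration can look like, with placeholder names (the exact flags may differ between installer versions):

```
$ awx-manage register_instance --hostname=<hostname>
$ awx-manage register_queue --queuename=controlplane --hostnames=<hostname>
```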
Execution-only nodes are automatically discovered after they have been configured and join the receptor mesh.
Control nodes should see them as "Known Nodes".

Control nodes check the receptor network (reported via `receptorctl status`) when their heartbeat task runs.
Nodes on the receptor network are compared against the `Instance` model in the database.
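The same view of the mesh can be inspected manually with `receptorctl`; a sketch, assuming the Receptor control socket lives at `/var/run/receptor/receptor.sock` (the socket path varies by install):

```
$ receptorctl --socket /var/run/receptor/receptor.sock status
```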
If a node appears in the receptor mesh that is not in the database, a database entry is created and the new instance is added to the "default" instance group.

In order to run jobs on execution nodes, either the installer needs to pre-register the node, or a user needs to make a PATCH request to `/api/v2/instances/N/` to change the `enabled` field to true.
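A sketch of that request using `curl`; the host, credentials, and instance ID are placeholders:

```
$ curl -u admin:password -X PATCH \
    -H "Content-Type: application/json" \
    -d '{"enabled": true}' \
    https://awx.example.org/api/v2/instances/42/
```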
#### Health Check Mechanics

Fields like `cpu`, `memory`, and `version` will obtain a non-default value from the health check.
If the instance has problems that would prevent jobs from running, `capacity` will be set to zero, and details will be shown in the instance's `errors` field.

For execution nodes, relevant data for health checks is reported from the ansible-runner command:

```
ansible-runner worker --worker-info
```

This will output YAML data to standard output containing CPU, memory, and other metrics used to compute `capacity`.
AWX invokes this command by submitting a receptor work unit (of type `ansible-runner`) to the target execution node.

##### Health Check Triggers

Health checks for execution nodes have several triggers that can cause them to run:
- When an execution node is auto-discovered, a health check is started.
- For execution nodes with errors, health checks are re-run about once every 10 minutes for auto-remediation.
- If a job had an error _not from the Ansible subprocess_, then a health check is started to check for instance errors.
- System administrators can manually trigger a health check by making a POST request to `/api/v2/instances/N/health_check/`, as sketched below.
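In this sketch, the host, credentials, and instance ID are again placeholders:

```
$ curl -u admin:password -X POST https://awx.example.org/api/v2/instances/42/health_check/
```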
Healthy execution nodes will _not_ have health checks run on a regular basis.

Control and hybrid nodes run health checks via a periodic task (bypassing ansible-runner).

### Development Environment

A cluster (of containers) with execution nodes and a hop node is created by the docker-compose Makefile target.
By default, it will create 1 hybrid node, 1 hop node, and 2 execution nodes.
You can switch the type of the AWX nodes between hybrid and control with the following syntax:

```
MAIN_NODE_TYPE=control COMPOSE_TAG=devel make docker-compose
```

Running the above command will create a cluster of 1 control node, 1 hop node, and 2 execution nodes.

The number of nodes can be changed:

```
CONTROL_PLANE_NODE_COUNT=2 EXECUTION_NODE_COUNT=3 COMPOSE_TAG=devel make docker-compose
```

This will spin up the topology represented below.
(The names shown are the receptor node names, which in some cases differ from the AWX Instance names and network addresses.)
```
                                          ┌──────────────┐
                                          │              │
┌──────────────┐                ┌─────────┤  receptor-1  │
│              │                │         │              │
│    awx_1     │◄───────────┐   │         └──────────────┘
│              │            │   ▼
└──────┬───────┘     ┌──────┴───────┐     ┌──────────────┐
       │             │              │     │              │
       │             │ receptor-hop │◄────┤  receptor-2  │
       ▼             │              │     │              │
┌──────────────┐     └──────────────┘     └──────────────┘
│              │                ▲
│    awx_2     │                │         ┌──────────────┐
│              │                │         │              │
└──────────────┘                └─────────┤  receptor-3  │
                                          │              │
                                          └──────────────┘
```
All execution (`receptor-*`) nodes connect to the hop node.
Only the `awx_1` node connects to the hop node out of the AWX cluster.
`awx_1` connects to `awx_2`, fulfilling the requirement that the AWX cluster is fully connected.

For example, if a job is launched with `awx_2` as the `controller_node` and `receptor-3` as the `execution_node`,
then `awx_2` communicates with `receptor-3` via `awx_1` and then `receptor-hop`.
@@ -157,10 +157,14 @@ One of the most important tasks in a clustered AWX installation is the periodic

If a node in an AWX cluster discovers that one of its peers has not updated its heartbeat within a certain grace period, it is assumed to be offline, and its capacity is set to zero to avoid scheduling new tasks on that node. Additionally, jobs allegedly running or scheduled to run on that node are assumed to be lost, and "reaped", or marked as failed.

## Reaping Receptor Work Units

Each AWX job launch will start a "Receptor work unit". This work unit handles all of the `stdin`, `stdout`, and `status` of the job running on the mesh and will also write data to disk.

Files such as `status`, `stdin`, and `stdout` are created in a specific Receptor directory which is named via a randomly-generated 8-character string (_e.g._ `qLL2JFNT`). This string is also the work unit ID in Receptor, and is used in various Receptor commands (_e.g._ `work results qLL2JFNT`).

The files that get written to disk via the work unit are cleaned up after the AWX job finishes; this is done by issuing the `work release` command. In some cases the release process might fail, or, if AWX crashes during a job's execution, the `work release` command is never issued to begin with.

Because of this, there is a periodic task that obtains a list of all Receptor work units and finds which ones belong to AWX jobs that are in a completed state (where the status is either `canceled`, `error`, or `succeeded`). This task calls `work release` on each of these work units and cleans up the files on disk.
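The same cleanup can also be done by hand with `receptorctl`; a sketch, assuming an illustrative control socket path and the example work unit ID from above:

```
# List all work units known to this Receptor node
$ receptorctl --socket /var/run/receptor/receptor.sock work list

# Release (clean up) a single finished work unit by its ID
$ receptorctl --socket /var/run/receptor/receptor.sock work release qLL2JFNT
```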
## AWX Jobs
@@ -20,7 +20,7 @@ Once you have a local copy, run the commands in the following sections from the

## Overview

Here are the main `make` targets:

- `docker-compose-build` - used for building the development image, which is used by the `docker-compose` target
- `docker-compose` - make target for development, passes the awx_devel image and tag

@@ -59,7 +59,7 @@ AWX requires access to a PostgreSQL database, and by default, one will be created

## Starting the Development Environment

### Build the Image

The AWX base container image (defined in the Dockerfile templated from [Dockerfile.j2](./../ansible/roles/dockerfile/templates/Dockerfile.j2)) contains basic OS dependencies and symbolic links into the development environment that make running the services easy.
@@ -96,6 +96,56 @@ $ make docker-compose

> To run docker-compose in detached mode, start the containers using the following command: `$ make docker-compose COMPOSE_UP_OPTS=-d`

##### _(alternative method)_ Spin up a development environment with a customized mesh node cluster

With the introduction of Receptor, a cluster (of containers) with execution nodes and a hop node can be created by the docker-compose Makefile target.
By default, it will create 1 hybrid node, 1 hop node, and 2 execution nodes.
You can switch the type of the AWX nodes between hybrid and control with the following syntax:

```
MAIN_NODE_TYPE=control COMPOSE_TAG=devel make docker-compose
```

Running the above command will create a cluster of 1 control node, 1 hop node, and 2 execution nodes.

The number of nodes can be changed:

```
CONTROL_PLANE_NODE_COUNT=2 EXECUTION_NODE_COUNT=3 COMPOSE_TAG=devel make docker-compose
```

This will spin up the topology represented below.
(The names shown are the receptor node names, which in some cases differ from the AWX Instance names and network addresses.)
```
                                          ┌──────────────┐
                                          │              │
┌──────────────┐                ┌─────────┤  receptor-1  │
│              │                │         │              │
│    awx_1     │◄───────────┐   │         └──────────────┘
│              │            │   ▼
└──────┬───────┘     ┌──────┴───────┐     ┌──────────────┐
       │             │              │     │              │
       │             │ receptor-hop │◄────┤  receptor-2  │
       ▼             │              │     │              │
┌──────────────┐     └──────────────┘     └──────────────┘
│              │                ▲
│    awx_2     │                │         ┌──────────────┐
│              │                │         │              │
└──────────────┘                └─────────┤  receptor-3  │
                                          │              │
                                          └──────────────┘
```
All execution (`receptor-*`) nodes connect to the hop node.
Only the `awx_1` node connects to the hop node out of the AWX cluster.
`awx_1` connects to `awx_2`, fulfilling the requirement that the AWX cluster is fully connected.

For example, if a job is launched with `awx_2` as the `controller_node` and `receptor-3` as the `execution_node`,
then `awx_2` communicates with `receptor-3` via `awx_1` and then `receptor-hop`.
##### Wait for migrations to complete
The first time you start the environment, database migrations need to run in order to build the PostgreSQL database. It will take a few moments, but eventually you will see output in your terminal session that looks like the following:

@@ -116,7 +166,7 @@ awx_1 | Applying auth.0001_initial... OK
...
```

##### Clean and build the UI

```bash
$ docker exec tools_awx_1 make clean-ui ui-devel
@@ -136,7 +186,7 @@ $ docker exec -ti tools_awx_1 awx-manage createsuperuser

> Remember the username and password, as you will use them to log into the web interface for the first time.

##### Load demo data

Optionally, you may also want to load some demo data. This will create a demo project, inventory, and job template.
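A sketch of the usual invocation, assuming the `tools_awx_1` container name used elsewhere in this document and the `create_preload_data` management command:

```bash
$ docker exec tools_awx_1 awx-manage create_preload_data
```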
@@ -206,7 +256,7 @@ In order to launch all developer services:

`launch_awx.sh` also calls `bootstrap_development.sh`, so if all you are doing is launching the supervisor to start all services, you don't need to call `bootstrap_development.sh` first.

### Start a Cluster

Certain features or bugs are only applicable when running a cluster of AWX nodes. To bring up a 3-node cluster development environment, simply run the command below.