Mirror of https://github.com/ansible/awx.git (synced 2026-01-09 15:02:07 -03:30)

[dev docs] Re-document websockets infrastructure (#13992)

Re-add documentation for how AWX websockets and channels work, in the post-web/task split world.

Signed-off-by: Rick Elrod <rick@elrod.me>

Parent: 5afdfb1135 · Commit: 88d1a484fa

# Websockets

AWX uses websockets to update the UI in real time as events happen within the
task backend. As jobs progress through their lifecycles, events are emitted by
the task system and sent to the UI, which renders this information in some
useful way for end users to see.

## Architecture

This document is not intended as a code walkthrough, but as an architectural
overview of how the infrastructure works. We start with a summary of how things
worked historically, and then focus primarily on how things work now.

Since the [web/task split](https://github.com/ansible/awx/pull/13423) landed,
the websockets infrastructure needed to be reworked. As we will see below, the
system makes use of
[django-channels](https://channels.readthedocs.io/en/stable/), which in turn
makes use of Redis. Because we wanted separate deployments for the web (API)
component and the task backend component, we could no longer depend on a Redis
instance shared between these pieces.

When a message gets sent, it lands in Redis (which is used as a pub/sub backend
for django-channels), and django-channels acts on it and forwards it on to
clients. (There is a concept of "groups" of clients that we will cover later.)

Note that django-channels processes messages in Daphne, an ASGI server. This is
in contrast to WSGI, which we currently use for the rest of AWX. That is, *only*
websocket endpoints are served with Daphne; everything else is served with
uWSGI. **_Everything_ goes through nginx and, from there, based on URL path,
gets routed to uWSGI or Daphne as required.**
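
To make the routing concrete, here is a minimal sketch of what such an nginx
configuration can look like. The upstream names, ports, and paths are
illustrative assumptions, not AWX's actual configuration:

```nginx
# Hypothetical upstreams; real names, ports, and socket paths will differ.
upstream uwsgi {
    server 127.0.0.1:8050;
}

upstream daphne {
    server 127.0.0.1:8051;
}

server {
    listen 443 ssl;

    # Websocket endpoints are routed to Daphne (ASGI)...
    location /websocket/ {
        proxy_pass http://daphne;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }

    # ...and everything else is routed to uWSGI (WSGI).
    location / {
        uwsgi_pass uwsgi;
        include /etc/nginx/uwsgi_params;
    }
}
```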

The notable modules for this component are:

* `awx/main/consumers.py` - the django-channels "consumers" where websocket
  clients are actually handled (think of these like views, but specific to
  websockets).

  * This is also where `emit_channel_notification` is defined, which is what the
    task system calls to actually send out notifications. This sends a message
    to the local Redis instance, where wsrelay (discussed below) will pick it
    up and relay it out to the web nodes.

* `awx/main/wsrelay.py` (formerly `awx/main/wsbroadcast.py`) - an asyncio
  websockets _client_ that connects to the "relay" (formerly "broadcast")
  endpoint. This is a daemon. It formerly ran in each web container, but now
  runs in each task container instead.

* `awx/main/management/commands/run_heartbeet.py` - discussed below; used to
  send a heartbeat payload to pg_notify every few seconds, so that all task
  pods running `wsrelay.py` (above) know about each web pod.
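
To make the emit path concrete, here is a simplified, in-memory sketch of what
`emit_channel_notification` conceptually does. The real helper in
`awx/main/consumers.py` publishes to the Redis-backed channel layer; the
`FakeChannelLayer` class, the group name, and the payload below are
illustrative assumptions, not AWX code:

```python
import json
from collections import defaultdict


class FakeChannelLayer:
    """In-memory stand-in for the Redis-backed django-channels channel layer."""

    def __init__(self):
        self.groups = defaultdict(list)

    def group_send(self, group, message):
        # django-channels would deliver this message to every channel
        # (client) currently subscribed to `group`.
        self.groups[group].append(message)


channel_layer = FakeChannelLayer()


def emit_channel_notification(group, payload):
    # Sketch only: the real function sends the serialized payload to the
    # pod-local Redis, where wsrelay picks it up and relays it out.
    channel_layer.group_send(group, {"text": json.dumps(payload)})


# Hypothetical job-status notification:
emit_channel_notification("jobs-status_changed", {"unified_job_id": 42, "status": "running"})
```

In the real system, the consumer on the receiving end of the group forwards the
message text on to the connected websocket clients.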

## Before the web/task split

<img src="img/websockets-old.png">

Consider a Kubernetes deployment of AWX. Before the web/task split, each pod had
a web container, a task container, and a redis container (and possibly others,
not relevant here). A daemon called `wsbroadcast` lived in the web container and
ran once within each pod. When a websocket message was emitted (from some task
system daemon, for example), the message would get sent to a django-channels
"group", which would store it in the pod-local Redis before django-channels
acted on it. It would also get sent to a special "broadcast" group.

The `wsbroadcast` daemon acted as a full websocket client and connected to a
special internal endpoint (no end-user clients connected to this
endpoint). Importantly, **it connected to this endpoint on all pods (nodes)
except for the one it was running on**. Immediately after connecting, it got
added to the "broadcast" group on each pod and would thus receive a copy of each
message being sent from each node.

It would then "re-send" this message to its local clients by storing it in its
own pod-local Redis for django-channels to process.

## Current Implementation

<img src="img/websockets-new.png">

In the post-web/task-split world, web and task containers live in entirely
independent pods, each with their own Redis. The former `wsbroadcast` has been
renamed to `wsrelay`, and we have moved to a kind of "fan-out" architecture.

Messages originate only from the task system (container). The `wsrelay` daemon
initiates a websocket connection to each web pod on its "relay" endpoint (which
corresponds to the `RelayConsumer` class). As a task progresses and events are
emitted, websocket messages get sent to Redis (via django-channels), and
`wsrelay` running in the same pod picks them up. It then takes each message and
passes it along to every web pod, via the websocket connection it already has
open.

There is actually a bit more going on here, and the interested reader is invited
to read through the `EventConsumer` class and the `wsrelay` code for the full
deep dive. But, briefly, a few pieces not covered above:

### Only Relaying Messages When Necessary

The `RelayConsumer` class can communicate to and from the `wsrelay`
client. Because of this, we are able to inform `wsrelay` when there is a client
actively listening for websocket updates. `wsrelay` can then keep note of that
(and which web pod it is coming from), and only send messages to web pods which
have active clients on them. This prevents us from sending lots of messages to
web pods that immediately discard them because no connected clients care about
them.
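
The bookkeeping described above can be sketched as follows. The class and
method names are illustrative (the real logic lives in `wsrelay` and
`RelayConsumer`), but the idea is the same: track which web pods have active
subscribers, and fan out only to those:

```python
class RelaySketch:
    """Illustrative bookkeeping, not the actual wsrelay implementation."""

    def __init__(self):
        # web hostname -> set of groups with at least one active client
        self.active_groups = {}
        self.sent = []  # stand-in for the open websocket connections

    def on_web_pod_update(self, host, groups):
        # RelayConsumer tells wsrelay which groups have listeners on that pod.
        self.active_groups[host] = set(groups)

    def relay(self, group, message):
        # Fan out only to web pods that have a client subscribed to `group`;
        # pods with no interested clients never see the message.
        for host, groups in self.active_groups.items():
            if group in groups:
                self.sent.append((host, group, message))


relay = RelaySketch()
relay.on_web_pod_update("web-0", ["jobs-status_changed"])
relay.on_web_pod_update("web-1", [])  # no active clients on this pod
relay.relay("jobs-status_changed", {"status": "successful"})
```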

### The Heartbeet

There is also a "heartbeet" system (a play on "heartbeat") that goes along with
the above. Remember that `wsrelay` lives in each task pod, and there can be an
arbitrary number of web and task pods (independent of each other). Because of
this, `wsrelay` (on every task pod) needs to know which web pods are up and need
to be connected to (on their "relay" endpoints). To accomplish this, we use
pg_notify: since web and task pods are all able to connect to the database, we
can safely use it as a central communication point.

In each web container there is a process, `run_heartbeet.py`, which sends out a
heartbeat payload to pg_notify every
`settings.BROADCAST_WEBSOCKET_BEACON_FROM_WEB_RATE_SECONDS` seconds. This is
done in a broadcast fashion to a specific pg_notify channel, and each `wsrelay`
instance (running in each task pod) is concurrently listening to pg_notify for
these messages. When `wsrelay` sees a heartbeat packet, it checks whether the
web node is already known. If not, it creates a connection to it and adds it to
its list of nodes to relay websocket messages to.

It can also handle web nodes going offline. If `run_heartbeet.py` receives
SIGTERM or SIGINT, it sends an "offline" heartbeat packet, and `wsrelay` will
*remove* the web node from its list of active connections.
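
The relay-side bookkeeping can be sketched like this. The packet field names
(`hostname`, `action`) and the class are illustrative assumptions; the real
packets arrive over pg_notify and drive actual websocket connections:

```python
import json


class HeartbeetTracker:
    """Illustrative sketch of wsrelay's web-node bookkeeping."""

    def __init__(self):
        self.known_hosts = set()  # web pods we hold relay connections to

    def on_pg_notify(self, payload):
        packet = json.loads(payload)
        host = packet["hostname"]
        if packet.get("action") == "offline":
            # run_heartbeet.py sends this on SIGTERM/SIGINT.
            self.known_hosts.discard(host)
        elif host not in self.known_hosts:
            # New web pod: this is where wsrelay would open a websocket
            # to the pod's relay endpoint.
            self.known_hosts.add(host)


tracker = HeartbeetTracker()
tracker.on_pg_notify(json.dumps({"hostname": "web-0", "action": "online"}))
tracker.on_pg_notify(json.dumps({"hostname": "web-1", "action": "online"}))
tracker.on_pg_notify(json.dumps({"hostname": "web-0", "action": "offline"}))
```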

This also gracefully handles the situation where there is a network blip, or a
web pod *unexpectedly* goes offline. Eventually the connection to the node will
time out and `wsrelay` will remove the pod from its list. The next time it sees
a heartbeat from the node, it will reconnect as normal. Similarly, this handles
new web pods scaling up at random.

### The Relay Endpoint and Security

The relay endpoint, the endpoint that `wsrelay` connects to as a client, is
protected so that only `wsrelay` can connect to it. This is important, as
otherwise anyone could connect to the endpoint and see information (such as
playbook output) that they might not otherwise have the necessary permissions
to see.

Authentication is accomplished via a shared secret that is generated and set at
playbook install time. The shared secret is used to derive a payload that is
exchanged via the HTTP(S) header `secret`. The payload consists of a `secret`
field, containing the shared secret, and a `nonce`, which is used to mitigate
replay attacks.

Note that the nonce timestamp is considered valid if it is within a 300-second
threshold. This allows for clock skew between machines.

```python
{
    "secret": settings.BROADCAST_WEBSOCKET_SECRET,
    "nonce": time.time()
}
```

The payload is signed using HMAC-SHA256 with
`settings.BROADCAST_WEBSOCKET_SECRET` as the key. The final payload that is
sent, including the HTTP header, is of the form:

`secret: nonce_plaintext:HMAC_SHA256({"secret": settings.BROADCAST_WEBSOCKET_SECRET, "nonce": nonce_plaintext})`

Upon receiving the payload, `RelayConsumer` verifies the `secret` header using
the known shared secret and ensures that the secret value of the payload
matches the known shared secret, `settings.BROADCAST_WEBSOCKET_SECRET`. If it
does not match, the connection is closed. If it does match, the nonce is
compared to the current time. If the nonce is off by more than 300 seconds, the
connection is closed. If both tests pass, the connection is accepted.
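
Putting the pieces together, the header construction and verification can be
sketched in plain Python. The helper names and the literal secret are ours for
illustration; see `awx/main/consumers.py` for the actual implementation:

```python
import hashlib
import hmac
import json
import time

# Stand-in for settings.BROADCAST_WEBSOCKET_SECRET.
SHARED_SECRET = b"example-shared-secret"


def make_secret_header(now=None):
    nonce = str(now if now is not None else time.time())
    payload = json.dumps({"secret": SHARED_SECRET.decode(), "nonce": nonce})
    digest = hmac.new(SHARED_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    # Header value: the nonce in plaintext, then the HMAC of the full payload.
    return f"{nonce}:{digest}"


def verify_secret_header(value, max_skew=300, now=None):
    nonce, _, digest = value.partition(":")
    # Recompute the HMAC from the plaintext nonce and the known shared secret...
    payload = json.dumps({"secret": SHARED_SECRET.decode(), "nonce": nonce})
    expected = hmac.new(SHARED_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(digest, expected):
        return False
    # ...then reject nonces outside the 300-second window (clock skew allowance).
    current = now if now is not None else time.time()
    return abs(current - float(nonce)) <= max_skew


header = make_secret_header(now=1000.0)
```

Fixing `now` in the example makes the check deterministic; the real code uses
the current time on both sides.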

## Protocol

You can connect to the AWX channels implementation using any standard websocket
library by pointing it at `/websocket/`. You must provide a valid auth token in
the request URL. This will drop you into the `EventConsumer`.

Once you've connected, you are not subscribed to any event groups. You subscribe
by sending a JSON payload that looks like the following:

```json
{
    "groups": {
        "jobs": ["status_changed", "summary"],
        "schedules": ["changed"],
        "ad_hoc_command_events": [ids...],
        "job_events": [ids...],
        "workflow_events": [ids...],
        "project_update_events": [ids...],
        "inventory_update_events": [ids...],
        "system_job_events": [ids...],
        "control": ["limit_reached_<user_id>"]
    }
}
```

These map to the event group and the event type that the user is interested
in. Sending in a new groups dictionary will clear all previously-subscribed
groups before subscribing to the newly requested ones. This is intentional, and
makes single-page navigation much easier, since users only need to care about
their current subscriptions.

Note that, as mentioned above, `wsrelay` only relays messages to a web pod if
there is a user actively listening for a message of whatever type is being
sent.
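
A client-side subscription might be sketched as follows. The payload builder
matches the JSON shape documented above; the connection snippet in the comment
assumes the third-party `websockets` library and omits authentication details:

```python
import json


def build_subscription(groups):
    """Build the JSON subscription payload described above."""
    return json.dumps({"groups": groups})


payload = build_subscription({
    "jobs": ["status_changed", "summary"],
    "job_events": [123],  # 123 is a hypothetical job event id
})

# With an async websocket client (sketch only; a valid auth token must be
# supplied in the request URL, as described above):
#
#   import asyncio, websockets
#
#   async def listen():
#       async with websockets.connect("wss://awx.example.org/websocket/") as ws:
#           await ws.send(payload)
#           async for message in ws:
#               print(json.loads(message))
```

Remember that sending a new payload replaces, rather than extends, the current
set of subscriptions.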