Mirror of https://github.com/ansible/awx.git (synced 2026-01-09 15:02:07 -03:30)

[dev docs] Re-document websockets infrastructure (#13992)

Re-add documentation for how AWX websockets and channels work, in the post-web/task split world.

Signed-off-by: Rick Elrod <rick@elrod.me>

Parent: 5afdfb1135 · Commit: 88d1a484fa

# Websockets

AWX uses websockets to update the UI in real time as events happen within the
task backend. As jobs progress through their lifecycles, events are emitted by
the task system and sent to the UI, which renders this information in some
useful way for end users to see.

## Architecture

This document is not intended as a code walkthrough, but as an architectural
overview of how the infrastructure works. We start with a summary of how things
worked historically, and then focus primarily on how things work now.

Since the [web/task split](https://github.com/ansible/awx/pull/13423) landed,
the websockets infrastructure needed to be reworked. As we will see below, the
system makes use of
[django-channels](https://channels.readthedocs.io/en/stable/), which in turn
makes use of Redis. Because we wanted separate deployments for the web (API)
component and the task backend component, we could no longer depend on a Redis
instance shared between these pieces.

When a message gets sent, it lands in Redis (which is used as a pub/sub backend
for django-channels), and django-channels acts on it and forwards it on to
clients. (There is a concept of "groups" of clients that we will cover later.)

Note that django-channels processes messages in Daphne, an ASGI server. This is
in contrast to WSGI, which we currently use for the rest of AWX. That is, *only*
websocket endpoints are served with Daphne; everything else is served with
uWSGI. **_Everything_ goes through nginx and, from there, based on URL path,
gets routed to uWSGI or Daphne as required.**
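
To make the routing concrete, here is a minimal sketch of what such an nginx
configuration can look like. The upstream names, ports, and paths are
illustrative assumptions, not AWX's actual configuration:

```nginx
# Hypothetical upstreams; real names, ports, and socket paths will differ.
upstream uwsgi {
    server 127.0.0.1:8050;
}

upstream daphne {
    server 127.0.0.1:8051;
}

server {
    listen 443 ssl;

    # Websocket endpoints are routed to Daphne (ASGI)...
    location /websocket/ {
        proxy_pass http://daphne;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }

    # ...and everything else is routed to uWSGI (WSGI).
    location / {
        uwsgi_pass uwsgi;
        include /etc/nginx/uwsgi_params;
    }
}
```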

The notable modules for this component are:

* `awx/main/consumers.py` - the django-channels "consumers" where websocket
  clients are actually handled (think of these like views, but specific to
  websockets).

  * This is also where `emit_channel_notification` is defined, which is what the
    task system calls to actually send out notifications. This sends a message
    to the local Redis instance, where wsrelay (discussed below) will pick it
    up and relay it out to the web nodes.

* `awx/main/wsrelay.py` (formerly `awx/main/wsbroadcast.py`) - an asyncio
  websockets _client_ that connects to the "relay" (formerly "broadcast")
  endpoint. This is a daemon. It formerly ran in each web container, but now
  runs in each task container instead.

* `awx/main/management/commands/run_heartbeet.py` - discussed below; used to
  send a heartbeat payload to pg_notify every few seconds, so that all task
  pods running `wsrelay.py` (above) know about each web pod.
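
To make the emit path concrete, here is a simplified, in-memory sketch of what
`emit_channel_notification` conceptually does. The real helper in
`awx/main/consumers.py` publishes to the Redis-backed channel layer; the
`FakeChannelLayer` class, the group name, and the payload below are
illustrative assumptions, not AWX code:

```python
import json
from collections import defaultdict


class FakeChannelLayer:
    """In-memory stand-in for the Redis-backed django-channels channel layer."""

    def __init__(self):
        self.groups = defaultdict(list)

    def group_send(self, group, message):
        # django-channels would deliver this message to every channel
        # (client) currently subscribed to `group`.
        self.groups[group].append(message)


channel_layer = FakeChannelLayer()


def emit_channel_notification(group, payload):
    # Sketch only: the real function sends the serialized payload to the
    # pod-local Redis, where wsrelay picks it up and relays it out.
    channel_layer.group_send(group, {"text": json.dumps(payload)})


# Hypothetical job-status notification:
emit_channel_notification("jobs-status_changed", {"unified_job_id": 42, "status": "running"})
```

In the real system, the consumer on the receiving end of the group forwards the
message text on to the connected websocket clients.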

## Before the web/task split

<img src="img/websockets-old.png">

Consider a Kubernetes deployment of AWX. Before the web/task split, each pod had
a web container, a task container, and a redis container (and possibly others,
not relevant here). A daemon called `wsbroadcast` lived in the web container and
ran once within each pod. When a websocket message was emitted (from some task
system daemon, for example), the message would get sent to a django-channels
"group", which would store it in the pod-local Redis before django-channels
acted on it. It would also get sent to a special "broadcast" group.

The `wsbroadcast` daemon acted as a full websocket client and connected to a
special internal endpoint (no end-user clients connected to this
endpoint). Importantly, **it connected to this endpoint on all pods (nodes)
except for the one it was running on**. Immediately after connecting, it got
added to the "broadcast" group on each pod and would thus receive a copy of each
message being sent from each node.

It would then "re-send" this message to its local clients by storing it in its
own pod-local Redis for django-channels to process.

## Current Implementation

<img src="img/websockets-new.png">

In the post-web/task-split world, web and task containers live in entirely
independent pods, each with their own Redis. The former `wsbroadcast` has been
renamed to `wsrelay`, and we have moved to a kind of "fan-out" architecture.

Messages originate only from the task system (container). The `wsrelay` daemon
initiates a websocket connection to each web pod on its "relay" endpoint (which
corresponds to the `RelayConsumer` class). As a task progresses and events are
emitted, websocket messages get sent to Redis (via django-channels), and
`wsrelay` running in the same pod picks them up. It then takes each message and
passes it along to every web pod, via the websocket connection it already has
open.

There is actually a bit more going on here, and the interested reader is invited
to read through the `EventConsumer` class and the `wsrelay` code for the full
deep dive. But, briefly, a few pieces not covered above:

### Only Relaying Messages When Necessary

The `RelayConsumer` class can communicate to and from the `wsrelay`
client. Because of this, we are able to inform `wsrelay` when there is a client
actively listening for websocket updates. `wsrelay` can then keep note of that
(and which web pod it is coming from), and only send messages to web pods which
have active clients on them. This prevents us from sending lots of messages to
web pods that immediately discard them because no connected clients care about
them.
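
The bookkeeping described above can be sketched as follows. The class and
method names are illustrative (the real logic lives in `wsrelay` and
`RelayConsumer`), but the idea is the same: track which web pods have active
subscribers, and fan out only to those:

```python
class RelaySketch:
    """Illustrative bookkeeping, not the actual wsrelay implementation."""

    def __init__(self):
        # web hostname -> set of groups with at least one active client
        self.active_groups = {}
        self.sent = []  # stand-in for the open websocket connections

    def on_web_pod_update(self, host, groups):
        # RelayConsumer tells wsrelay which groups have listeners on that pod.
        self.active_groups[host] = set(groups)

    def relay(self, group, message):
        # Fan out only to web pods that have a client subscribed to `group`;
        # pods with no interested clients never see the message.
        for host, groups in self.active_groups.items():
            if group in groups:
                self.sent.append((host, group, message))


relay = RelaySketch()
relay.on_web_pod_update("web-0", ["jobs-status_changed"])
relay.on_web_pod_update("web-1", [])  # no active clients on this pod
relay.relay("jobs-status_changed", {"status": "successful"})
```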

### The Heartbeet

There is also a "heartbeet" system (a play on "heartbeat") that goes along with
the above. Remember that `wsrelay` lives in each task pod, and there can be an
arbitrary number of web and task pods (independent of each other). Because of
this, `wsrelay` (on every task pod) needs to know which web pods are up and need
to be connected to (on their "relay" endpoints). To accomplish this, we use
pg_notify: since web and task pods are all able to connect to the database, we
can safely use it as a central communication point.

In each web container there is a process, `run_heartbeet.py`, which sends out a
heartbeat payload to pg_notify every
`settings.BROADCAST_WEBSOCKET_BEACON_FROM_WEB_RATE_SECONDS` seconds. This is
done in a broadcast fashion to a specific pg_notify channel, and each `wsrelay`
instance (running in each task pod) is concurrently listening to pg_notify for
these messages. When `wsrelay` sees a heartbeat packet, it checks whether the
web node is already known. If not, it creates a connection to it and adds it to
its list of nodes to relay websocket messages to.

It can also handle web nodes going offline. If `run_heartbeet.py` receives
SIGTERM or SIGINT, it sends an "offline" heartbeat packet, and `wsrelay` will
*remove* the web node from its list of active connections.
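
The relay-side bookkeeping can be sketched like this. The packet field names
(`hostname`, `action`) and the class are illustrative assumptions; the real
packets arrive over pg_notify and drive actual websocket connections:

```python
import json


class HeartbeetTracker:
    """Illustrative sketch of wsrelay's web-node bookkeeping."""

    def __init__(self):
        self.known_hosts = set()  # web pods we hold relay connections to

    def on_pg_notify(self, payload):
        packet = json.loads(payload)
        host = packet["hostname"]
        if packet.get("action") == "offline":
            # run_heartbeet.py sends this on SIGTERM/SIGINT.
            self.known_hosts.discard(host)
        elif host not in self.known_hosts:
            # New web pod: this is where wsrelay would open a websocket
            # to the pod's relay endpoint.
            self.known_hosts.add(host)


tracker = HeartbeetTracker()
tracker.on_pg_notify(json.dumps({"hostname": "web-0", "action": "online"}))
tracker.on_pg_notify(json.dumps({"hostname": "web-1", "action": "online"}))
tracker.on_pg_notify(json.dumps({"hostname": "web-0", "action": "offline"}))
```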

This also gracefully handles the situation where there is a network blip, or a
web pod *unexpectedly* goes offline. Eventually the connection to the node will
time out and `wsrelay` will remove the pod from its list. The next time it sees
a heartbeat from the node, it will reconnect as normal. Similarly, this handles
new web pods scaling up at random.

### The Relay Endpoint and Security

The relay endpoint, the endpoint that `wsrelay` connects to as a client, is
protected so that only `wsrelay` can connect to it. This is important, as
otherwise anyone could connect to the endpoint and see information (such as
playbook output) that they might not otherwise have the necessary permissions
to see.

Authentication is accomplished via a shared secret that is generated and set at
playbook install time. The shared secret is used to derive a payload that is
exchanged via the HTTP(S) header `secret`. The payload consists of a `secret`
field, containing the shared secret, and a `nonce`, which is used to mitigate
replay attacks.

Note that the nonce timestamp is considered valid if it is within a 300-second
threshold. This allows for clock skew between machines.

```python
{
    "secret": settings.BROADCAST_WEBSOCKET_SECRET,
    "nonce": time.time()
}
```

The payload is signed using HMAC-SHA256 with
`settings.BROADCAST_WEBSOCKET_SECRET` as the key. The final payload that is
sent, including the HTTP header, is of the form:

`secret: nonce_plaintext:HMAC_SHA256({"secret": settings.BROADCAST_WEBSOCKET_SECRET, "nonce": nonce_plaintext})`

Upon receiving the payload, `RelayConsumer` verifies the `secret` header using
the known shared secret and ensures that the secret value of the payload
matches the known shared secret, `settings.BROADCAST_WEBSOCKET_SECRET`. If it
does not match, the connection is closed. If it does match, the nonce is
compared to the current time. If the nonce is off by more than 300 seconds, the
connection is closed. If both tests pass, the connection is accepted.
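
Putting the pieces together, the header construction and verification can be
sketched in plain Python. The helper names and the literal secret are ours for
illustration; see `awx/main/consumers.py` for the actual implementation:

```python
import hashlib
import hmac
import json
import time

# Stand-in for settings.BROADCAST_WEBSOCKET_SECRET.
SHARED_SECRET = b"example-shared-secret"


def make_secret_header(now=None):
    nonce = str(now if now is not None else time.time())
    payload = json.dumps({"secret": SHARED_SECRET.decode(), "nonce": nonce})
    digest = hmac.new(SHARED_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    # Header value: the nonce in plaintext, then the HMAC of the full payload.
    return f"{nonce}:{digest}"


def verify_secret_header(value, max_skew=300, now=None):
    nonce, _, digest = value.partition(":")
    # Recompute the HMAC from the plaintext nonce and the known shared secret...
    payload = json.dumps({"secret": SHARED_SECRET.decode(), "nonce": nonce})
    expected = hmac.new(SHARED_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(digest, expected):
        return False
    # ...then reject nonces outside the 300-second window (clock skew allowance).
    current = now if now is not None else time.time()
    return abs(current - float(nonce)) <= max_skew


header = make_secret_header(now=1000.0)
```

Fixing `now` in the example makes the check deterministic; the real code uses
the current time on both sides.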

## Protocol

You can connect to the AWX channels implementation using any standard websocket
library by pointing it at `/websocket/`. You must provide a valid auth token in
the request URL. This will drop you into the `EventConsumer`.

Once you've connected, you are not subscribed to any event groups. You subscribe
by sending a JSON payload that looks like the following:

```json
{
    "groups": {
        "jobs": ["status_changed", "summary"],
        "schedules": ["changed"],
        "ad_hoc_command_events": [ids...],
        "job_events": [ids...],
        "workflow_events": [ids...],
        "project_update_events": [ids...],
        "inventory_update_events": [ids...],
        "system_job_events": [ids...],
        "control": ["limit_reached_<user_id>"]
    }
}
```

These map to the event group and the event type that the user is interested
in. Sending in a new groups dictionary will clear all previously-subscribed
groups before subscribing to the newly requested ones. This is intentional, and
makes single-page navigation much easier, since users only need to care about
their current subscriptions.

Note that, as mentioned above, `wsrelay` only relays messages to a web pod if
there is a user actively listening for a message of whatever type is being
sent.
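
A client-side subscription might be sketched as follows. The payload builder
matches the JSON shape documented above; the connection snippet in the comment
assumes the third-party `websockets` library and omits authentication details:

```python
import json


def build_subscription(groups):
    """Build the JSON subscription payload described above."""
    return json.dumps({"groups": groups})


payload = build_subscription({
    "jobs": ["status_changed", "summary"],
    "job_events": [123],  # 123 is a hypothetical job event id
})

# With an async websocket client (sketch only; a valid auth token must be
# supplied in the request URL, as described above):
#
#   import asyncio, websockets
#
#   async def listen():
#       async with websockets.connect("wss://awx.example.org/websocket/") as ws:
#           await ws.send(payload)
#           async for message in ws:
#               print(json.loads(message))
```

Remember that sending a new payload replaces, rather than extends, the current
set of subscriptions.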