* Make the daphne logs quieter by raising the level to INFO instead of
DEBUG
* Output the django channels name of broadcast clients. This way, if the
queue gets backed up, we can find it in redis.
* zcard asgi::group:jobs-status_changed <-- use this to watch a group
set that continues to grow. Issue this command in a loop while refreshing
the browser page on the jobs list. Before this change the set size would
continue to grow as daphne channel names were added to the group. After
this change the set size stays stable at the expected value of 1.
* Replying to websocket group membership with the previous state, delta,
and new state has proven to be quite stable. This debug message is not
very helpful and is noisy in the dev env. This change removes the debug
message.
* Gather broadcast websocket metrics and push them into redis at a
configurable interval (in seconds).
* Pop metrics from redis in the web view layer to display via the api on
demand
* Sending health about websockets over websockets is not a great idea.
* I tried sending health data via prometheus and encountered problems
that will need PRs to the prometheus_client library to solve. Circle
back to this later.
* This change adds more than just an unsubscribe reply.
* Websockets can request to join/leave groups. They do so using a single
idempotent request. This change replies to group requests over the
websockets with the diff of the group subscription, i.e. what groups the
user is currently in, what groups were left, and what groups were
joined.
* The user in the channels session is a lazy user class. This does not
conform to what the generic Role ancestry code expects. The Role ancestry
code expects a User object. This change converts the lazy object into a
proper User object before calling the permission code path.
* asgiref async_to_sync was causing a Redis connection _for each_ call
to emit_channel_notification, i.e. every event that the callback receiver
processes. This is a "known" issue
https://github.com/django/channels_redis/pull/130#issuecomment-424274470
and the advice is to slow down the rate at which you call
async_to_sync. That is not an option for us. Instead, we put the async
group_send call onto the event loop for the current thread and wait for
it to be processed immediately.
The known issue has to do with the event loop + socket relationship. Each
connection to redis is achieved via a socket. That connection can only be
waited on by the event loop that corresponds to the calling thread.
async_to_sync creates a _new thread_ for each invocation. Thus, a new
connection to redis is required, which explains the excess redis
connections that can be observed via netstat | grep redis | wc -l.
Now that we have the CSRF middleware, we have a reliable token
available to us which we can use to verify individual ws_receive
payloads; this is _simpler_ than making sure you've properly configured
trusted origins, and it's also more secure than Origin header checks.
See: https://github.com/ansible/tower/issues/2661
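
A rough sketch of what per-payload token verification could look like.
Everything here is an assumption for illustration: the payload is taken
to be JSON, the field name `xrftoken` is hypothetical, and the expected
token is assumed to already be available as a plain string. It does not
model how Django stores or masks CSRF tokens.

```python
import hmac
import json


def token_matches(raw_payload, expected_token, field="xrftoken"):
    """Return True only if the payload carries the expected token."""
    try:
        data = json.loads(raw_payload)
    except ValueError:
        return False  # not valid JSON, reject
    if not isinstance(data, dict):
        return False
    supplied = data.get(field)
    if not isinstance(supplied, str) or not expected_token:
        return False
    # Constant-time comparison avoids leaking how much of the token matched.
    return hmac.compare_digest(supplied.encode(), expected_token.encode())
```

A payload with a missing or mismatched token is simply rejected before
any group join/leave handling runs.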
There's a race between our `ws_connect` and `ws_receive` methods;
it's possible to fall into a scenario where we're handling a legitimate
message *before* django-channels is able to persist the `user_id` into
the channel session. This results in a scenario where a user can open
a browser tab and never receive new websocket messages. In this
scenario, we should just toss the message back into the queue and try
again later (up to a reasonable limit of retries).
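
The requeue-with-a-retry-limit idea above can be sketched roughly as
follows. This is not the actual consumer code: the queue is a plain
deque, the session is a plain dict, and `MAX_RETRIES`, `process`, and
the `retries` message field are hypothetical names chosen for the
example.

```python
from collections import deque

MAX_RETRIES = 3  # hypothetical limit


def process(queue, session, handled, dropped):
    """Handle one message, or requeue it if the session is not ready yet."""
    message = queue.popleft()
    if "user_id" in session:
        # ws_connect has finished persisting the user; handle normally.
        handled.append(message["text"])
        return
    retries = message.get("retries", 0)
    if retries < MAX_RETRIES:
        # Session not ready: put the message back and try again later.
        queue.append({**message, "retries": retries + 1})
    else:
        # Give up once the retry budget is spent.
        dropped.append(message["text"])
```

Bounding the retries keeps a message for a session that never becomes
valid from circulating in the queue forever.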