Add alerting rule to Grafana

This rule alerts if the Redis queue is larger than the rolling average
event insertion rate per second multiplied by 120. In other words, it
fires when the Redis queue holds more events than we appear able to
process in two minutes.

The condition apparently has to hold for 60 seconds before the alert
starts firing.
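
As a rough illustration of the comparison this rule makes, the check can be sketched in Python. The function name and the sample readings are hypothetical and do not come from the rule file; the real rule evaluates Grafana expressions against Prometheus metrics, and this sketch ignores the 60 second hold period.

```python
def queue_exceeds_capacity(queue_size, insert_rate_per_second, window_seconds=120):
    """Return True when the queue holds more events than can be inserted in the window."""
    # Mirrors the rule's comparison: queue size > insertion rate * 120.
    return queue_size > insert_rate_per_second * window_seconds


# Hypothetical readings: 50 events/second gives a threshold of 50 * 120 = 6000 events.
print(queue_exceeds_capacity(7000, 50))  # True: 7000 > 6000
print(queue_exceeds_capacity(6000, 50))  # False: the comparison is strict
```

The strict `>` matches the comparison used in the rule's math expression, so a queue that is exactly at the two-minute threshold does not satisfy the condition.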

Future commits will address how to configure contact points such as Slack.

Shout out to @jainnikhil30 and @rebeccahhh, who figured this out in a jam
session this morning.
Elijah DeLee 2022-09-07 13:01:10 -04:00
parent a227fea5ef
commit 10d06f219d
2 changed files with 150 additions and 0 deletions


@ -480,3 +480,8 @@ $ PROMETHEUS=yes GRAFANA=yes make docker-compose
3. Navigate to `http://localhost:3001`. Sign in, using `admin` for both username and password.
4. In the left navigation menu, go to Dashboards -> Browse, then find and click the "awx-demo" dashboard. It should contain graphs.
5. Now you can modify these and add panels for whichever metrics you like.
### Alerts in Grafana
We configure alerts in Grafana using the provisioning files method. This feature is new in Grafana as of August 2022. Documentation is available at https://grafana.com/docs/grafana/latest/administration/provisioning/#alerting, but it does not show every parameter the config accepts. One way to learn how to build rules is to create them in the UI and use the browser developer tools (for example, Chrome DevTools) to inspect the payload sent when you save a rule. It appears that the "data" portion of the payload for each rule uses the same syntax that the provisioning file config needs. To reload the alerts without restarting the container, send a POST from within the container with `curl -X POST http://admin:admin@localhost:3000/api/admin/provisioning/alerting/reload`. Keep in mind that the Grafana container does not include `curl` by default; you can install it with `apk add curl`.
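
If installing `curl` in the container is inconvenient, the same reload request could be sent from the host. The sketch below is only an illustration: it assumes the host port `3001` from the steps above reaches the same Grafana instance, that the `admin`/`admin` credentials are unchanged, and the helper names `build_reload_request` and `reload_alerting` are hypothetical.

```python
import base64
import urllib.request


def build_reload_request(base_url="http://localhost:3001", user="admin", password="admin"):
    """Build (but do not send) the POST that asks Grafana to reload provisioned alerts."""
    url = base_url.rstrip("/") + "/api/admin/provisioning/alerting/reload"
    # HTTP basic auth: base64 of "user:password" (assumed default credentials).
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, data=b"", method="POST")
    request.add_header("Authorization", f"Basic {token}")
    return request


def reload_alerting(request):
    """Send the prepared request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling `reload_alerting(build_reload_request())` sends the request. Building the request separately from sending it makes it possible to inspect the URL and headers without a running Grafana instance.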


@ -0,0 +1,145 @@
---
apiVersion: 1
groups:
  - folder: awx
    interval: 60s
    name: awx_rules
    orgId: 1
    rules:
      - condition: A
        dashboardUid: awx
        data:
          - datasourceUid: PBFA97CFB590B2093
            model:
              editorMode: code
              expr: irate(callback_receiver_events_insert_db{node='awx_1'}[1m])
              hide: false
              intervalMs: 1000
              legendFormat: __auto
              maxDataPoints: 43200
              range: true
              refId: events_insertion_rate_per_second
            queryType: ""
            refId: events_insertion_rate_per_second
            relativeTimeRange:
              from: 300
              to: 0
          - datasourceUid: -100
            model:
              conditions:
                - evaluator:
                    params:
                      - 3
                    type: gt
                  operator:
                    type: and
                  query:
                    params:
                      - event_insertion_rate
                  reducer:
                    params: []
                    type: last
                  type: query
              datasource:
                type: __expr__
                uid: -100
              expression: events_insertion_rate_per_second
              hide: false
              intervalMs: 1000
              maxDataPoints: 43200
              reducer: mean
              refId: mean_event_insertion_rate
              type: reduce
            queryType: ""
            refId: mean_event_insertion_rate
            relativeTimeRange:
              from: 0
              to: 0
          - datasourceUid: PBFA97CFB590B2093
            model:
              datasource:
                type: prometheus
                uid: PBFA97CFB590B2093
              editorMode: code
              expr: callback_receiver_events_queue_size_redis{node='awx_1'}
              hide: false
              intervalMs: 1000
              legendFormat: __auto
              maxDataPoints: 43200
              range: true
              refId: redis_queue_size
            queryType: ""
            refId: redis_queue_size
            relativeTimeRange:
              from: 300
              to: 0
          - datasourceUid: -100
            model:
              conditions:
                - evaluator:
                    params:
                      - 3
                    type: gt
                  operator:
                    type: and
                  query:
                    params:
                      - event_insertion_rate
                  reducer:
                    params: []
                    type: last
                  type: query
              datasource:
                type: __expr__
                uid: -100
              expression: redis_queue_size
              hide: false
              intervalMs: 1000
              maxDataPoints: 43200
              reducer: last
              refId: mean_redis_queue_size
              type: reduce
            queryType: ""
            refId: mean_redis_queue_size
            relativeTimeRange:
              from: 0
              to: 0
          - datasourceUid: -100
            model:
              conditions:
                - evaluator:
                    params:
                      - 0
                      - 0
                    type: gt
                  operator:
                    type: and
                  query:
                    params:
                      - mean_redis_queue_size
                  reducer:
                    params: []
                    type: avg
                  type: query
              datasource:
                name: Expression
                type: __expr__
                uid: __expr__
              expression: '(
                ${mean_redis_queue_size} >
                ($mean_event_insertion_rate\ * 120))'
              hide: false
              intervalMs: 1000
              maxDataPoints: 43200
              refId: redis_queue_growing_faster_than_insertion_rate
              type: math
            queryType: ""
            refId: redis_queue_growing_faster_than_insertion_rate
            relativeTimeRange:
              from: 0
              to: 0
        for: 60s
        noDataState: OK
        panelId: 1
        title: redis_queue_too_large_to_clear_in_2_min
        uid: redis_queue_too_large_to_clear_in_2_min