diff --git a/tools/docker-compose/README.md b/tools/docker-compose/README.md
index e13b54a369..9080a6339b 100644
--- a/tools/docker-compose/README.md
+++ b/tools/docker-compose/README.md
@@ -480,3 +480,8 @@ $ PROMETHEUS=yes GRAFANA=yes make docker-compose
 3. Navigate to `http://localhost:3001`. Sign in, using `admin` for both username and password.
 4. In the left navigation menu go to Dashboards->Browse, find the "awx-demo" and click. These should have graphs.
 5. Now you can modify these and add panels for whichever metrics you like.
+
+### Alerts in Grafana
+
+We configure alerts in Grafana using the provisioning files method. This feature is new in Grafana as of August 2022. Documentation is available at https://grafana.com/docs/grafana/latest/administration/provisioning/#alerting, but it does not show every parameter of the config. One way to understand how to build rules is to build them in the UI and use the browser developer tools (for example, Chrome DevTools) to inspect the payload as you save the rules. It appears that the "data" portion of the payload for each rule uses the same syntax as the provisioning file config.
+To reload the alerts without restarting the container, send a POST from within the container with `curl -X POST http://admin:admin@localhost:3000/api/admin/provisioning/alerting/reload`. Keep in mind that the Grafana container does not include `curl` by default; you can install it with `apk add curl`.
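The reload request described in the README hunk can also be sent without installing `curl`. The following is a minimal sketch using only the Python standard library. The endpoint path and the `admin`/`admin` credentials come from the README text; the function names and the `base_url` parameter are hypothetical. The default port 3000 assumes the call is made from inside the container. From the host, the port mapping mentioned in step 3 suggests `http://localhost:3001` instead, but that is an assumption.

```python
import base64
import urllib.request

# Path taken from the README text; everything else here is illustrative.
RELOAD_PATH = "/api/admin/provisioning/alerting/reload"


def build_reload_request(
    base_url: str = "http://localhost:3000",
    user: str = "admin",
    password: str = "admin",
) -> urllib.request.Request:
    """Build (but do not send) the POST that asks Grafana to reload alert provisioning."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        base_url.rstrip("/") + RELOAD_PATH,
        data=b"",  # empty request body
        method="POST",
        headers={"Authorization": f"Basic {token}"},
    )


def reload_alerts(base_url: str = "http://localhost:3000") -> int:
    """Send the reload request and return the HTTP status code."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=10) as resp:
        return resp.status
```

Splitting request construction from sending keeps the network call in one small function, so the URL and the authentication header can be checked without a running Grafana instance.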
diff --git a/tools/grafana/alerting/alerts.yml b/tools/grafana/alerting/alerts.yml
new file mode 100644
index 0000000000..155bcf9733
--- /dev/null
+++ b/tools/grafana/alerting/alerts.yml
@@ -0,0 +1,145 @@
+---
+apiVersion: 1
+groups:
+  - folder: awx
+    interval: 60s
+    name: awx_rules
+    orgId: 1
+    rules:
+      - condition: A
+        dashboardUid: awx
+        data:
+          - datasourceUid: PBFA97CFB590B2093
+            model:
+              editorMode: code
+              expr: irate(callback_receiver_events_insert_db{node='awx_1'}[1m])
+              hide: false
+              intervalMs: 1000
+              legendFormat: __auto
+              maxDataPoints: 43200
+              range: true
+              refId: events_insertion_rate_per_second
+            queryType: ""
+            refId: events_insertion_rate_per_second
+            relativeTimeRange:
+              from: 300
+              to: 0
+          - datasourceUid: -100
+            model:
+              conditions:
+                - evaluator:
+                    params:
+                      - 3
+                    type: gt
+                  operator:
+                    type: and
+                  query:
+                    params:
+                      - event_insertion_rate
+                  reducer:
+                    params: []
+                    type: last
+                  type: query
+              datasource:
+                type: __expr__
+                uid: -100
+              expression: events_insertion_rate_per_second
+              hide: false
+              intervalMs: 1000
+              maxDataPoints: 43200
+              reducer: mean
+              refId: mean_event_insertion_rate
+              type: reduce
+            queryType: ""
+            refId: mean_event_insertion_rate
+            relativeTimeRange:
+              from: 0
+              to: 0
+          - datasourceUid: PBFA97CFB590B2093
+            model:
+              datasource:
+                type: prometheus
+                uid: PBFA97CFB590B2093
+              editorMode: code
+              expr: callback_receiver_events_queue_size_redis{node='awx_1'}
+              hide: false
+              intervalMs: 1000
+              legendFormat: __auto
+              maxDataPoints: 43200
+              range: true
+              refId: redis_queue_size
+            queryType: ""
+            refId: redis_queue_size
+            relativeTimeRange:
+              from: 300
+              to: 0
+          - datasourceUid: -100
+            model:
+              conditions:
+                - evaluator:
+                    params:
+                      - 3
+                    type: gt
+                  operator:
+                    type: and
+                  query:
+                    params:
+                      - event_insertion_rate
+                  reducer:
+                    params: []
+                    type: last
+                  type: query
+              datasource:
+                type: __expr__
+                uid: -100
+              expression: redis_queue_size
+              hide: false
+              intervalMs: 1000
+              maxDataPoints: 43200
+              reducer: last
+              refId: mean_redis_queue_size
+              type: reduce
+            queryType: ""
+            refId: mean_redis_queue_size
+            relativeTimeRange:
+              from: 0
+              to: 0
+          - datasourceUid: -100
+            model:
+              conditions:
+                - evaluator:
+                    params:
+                      - 0
+                      - 0
+                    type: gt
+                  operator:
+                    type: and
+                  query:
+                    params:
+                      - mean_redis_queue_size
+                  reducer:
+                    params: []
+                    type: avg
+                  type: query
+              datasource:
+                name: Expression
+                type: __expr__
+                uid: __expr__
+              expression: '(
+                ${mean_redis_queue_size} >
+                ($mean_event_insertion_rate\ * 120))'
+              hide: false
+              intervalMs: 1000
+              maxDataPoints: 43200
+              refId: redis_queue_growing_faster_than_insertion_rate
+              type: math
+            queryType: ""
+            refId: redis_queue_growing_faster_than_insertion_rate
+            relativeTimeRange:
+              from: 0
+              to: 0
+        for: 60s
+        noDataState: OK
+        panelId: 1
+        title: redis_queue_too_large_to_clear_in_2_min
+        uid: redis_queue_too_large_to_clear_in_2_min
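The arithmetic behind the rule's final `math` expression may be easier to follow outside Grafana. The sketch below is an assumption-laden illustration, not Grafana's evaluation engine. It mirrors the two `reduce` steps (mean of the insertion-rate samples, last value of the Redis queue size) and then compares the queue size against what could be inserted in 120 seconds, which matches the rule title `redis_queue_too_large_to_clear_in_2_min`. The function name and sample values are hypothetical, and the `for: 60s` pending period and no-data handling are deliberately left out.

```python
from statistics import fmean
from typing import Sequence


def queue_too_large_to_clear(
    insertion_rate_samples: Sequence[float],
    queue_size_samples: Sequence[float],
    clear_window_s: float = 120.0,
) -> bool:
    """Return True when the queue holds more events than can be inserted in the window."""
    # Corresponds to refId mean_event_insertion_rate (reducer: mean).
    mean_rate = fmean(insertion_rate_samples)
    # Corresponds to refId mean_redis_queue_size, which uses reducer: last
    # despite its name.
    last_queue_size = queue_size_samples[-1]
    # Corresponds to the math step: queue size > mean rate * 120.
    return last_queue_size > mean_rate * clear_window_s


if __name__ == "__main__":
    # Mean rate is 3 events/s, so up to 360 events can be cleared in 120 s.
    print(queue_too_large_to_clear([2.0, 4.0], [100, 500]))
    print(queue_too_large_to_clear([2.0, 4.0], [500, 100]))
```

With a mean rate of 3 events per second, a queue of 500 events exceeds the 360 that could be inserted in two minutes, so the first call returns `True` and the second returns `False`.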