Adding a guide on how to enable and use exemplars

Closes #38688

Signed-off-by: Alexander Schwartz <aschwart@redhat.com>
Alexander Schwartz 2025-04-08 12:23:28 +02:00 committed by GitHub
parent 9eb336ae41
commit 2ad776553a
7 changed files with 128 additions and 0 deletions

Binary file not shown.

@ -0,0 +1,101 @@
<#import "/templates/guide.adoc" as tmpl>
<#import "/templates/links.adoc" as links>
<@tmpl.guide
title="Analyzing outliers and errors with exemplars"
summary="Use exemplars to connect a metric to a recorded trace to analyze the root cause of errors or latencies.">
Metrics are aggregations over several events, and show you if your system is operating within defined bounds.
They are well suited to monitoring error rates or tail latencies, and to setting up alerting or driving performance optimizations.
Still, the aggregation makes it difficult to find root causes for latencies or errors reported in metrics.
Root causes for errors and latencies can be found by enabling tracing.
To connect a metric to a recorded trace, there is the concept of https://grafana.com/docs/grafana/latest/fundamentals/exemplars/[exemplars].
Once exemplars are set up, {project_name} reports metrics with their last recorded trace as an exemplar.
A dashboard tool like Grafana can link the exemplar from a metrics dashboard to a trace view.
Metrics that support exemplars are:
* `http_server_requests_seconds_count` (including histograms) +
See the {section} <@links.observability id="metrics-for-troubleshooting-http"/> for details on this metric.
* `keycloak_credentials_password_hashing_validations_total` +
See the {section} <@links.observability id="metrics-for-troubleshooting-keycloak"/> for details on this metric.
* `keycloak_user_events_total` +
See the {section} <@links.observability id="metrics-for-troubleshooting-keycloak"/> for details on this metric.
The following screenshot shows a heatmap visualization for latencies that displays an exemplar when you hover over one of the pink indicators.
.Heatmap diagram with exemplar
image::observability/exemplar.png[]
== Setting up exemplars
To benefit from exemplars, perform the following steps:
. Enable metrics for {project_name} as described in {section} <@links.observability id="configuration-metrics" />.
. Enable tracing for {project_name} as described in {section} <@links.observability id="tracing" />.
. Enable exemplar storage in your monitoring system.
+
For Prometheus, this is a https://prometheus.io/docs/prometheus/latest/feature_flags/#exemplars-storage[preview feature that you need to enable].
. Scrape the metrics using the `OpenMetricsText1.0.0` protocol, which is not enabled by default in Prometheus.
+
If you are using `PodMonitors` or similar in a Kubernetes environment, this can be achieved by adding it to the spec of the custom resource:
+
[source]
----
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  ...
spec:
  scrapeProtocols:
    - OpenMetricsText1.0.0
----
. Configure your metrics datasource so that it knows which tracing datasource to link to.
+
When using Grafana and Prometheus, this means setting up an `exemplarTraceIdDestinations` entry for the Prometheus datasource, which then points to your tracing datasource that is provided by tools like Jaeger or Tempo.
. Enable exemplars in your dashboards.
+
Enable the *Exemplars* toggle in each query on each dashboard where you want to show exemplars.
When set up correctly, you will notice little dots or stars in your dashboards that you can click on to view the traces.
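
For a standalone Prometheus and Grafana setup without the Kubernetes operator, the monitoring-side steps above might look roughly like the following shell sketch, which writes two configuration files. The file names, the job name `keycloak`, the datasource name, the URLs, and the datasource UID `tempo` are placeholders assumed for this illustration, and the `scrape_protocols` setting requires a Prometheus version that supports it.

```shell
#!/bin/sh
# Sketch only: file names, job name, URLs and UIDs below are placeholders.

# Prometheus scrape configuration that asks for the OpenMetrics format first.
cat > prometheus.yml <<'EOF'
scrape_configs:
  - job_name: keycloak
    scrape_protocols:
      - OpenMetricsText1.0.0
      - PrometheusText0.0.4
    static_configs:
      - targets: ['localhost:9000']
EOF

# Grafana datasource provisioning that links the trace_id exemplar label
# to a tracing datasource identified by its UID.
cat > grafana-datasources.yml <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://localhost:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo
EOF

# Exemplar storage is a feature flag, so Prometheus would be started like this:
# prometheus --config.file=prometheus.yml --enable-feature=exemplar-storage
```

The `name` of the destination is set to `trace_id` because that is the label the exemplars carry in the metrics output.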
[NOTE]
====
* If you do not specify the scrape protocol, Prometheus does not offer it during content negotiation by default. {project_name} then falls back to the PrometheusText protocol, which does not contain exemplars.
* If you enabled tracing and metrics, but the request sampling did not record a trace, the exposed metric will not contain any exemplars.
* If you access the metrics endpoint with your browser, the content negotiation will lead to the format PrometheusText being returned, and you will not see any exemplars.
====
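
To see the effect of content negotiation described in the note, a small shell sketch like the following can count the exemplars returned for each format. The function names `count_exemplars` and `compare_formats` are made up for this illustration, and the sketch assumes that a request without an explicit `Accept` header for OpenMetrics is answered in the PrometheusText format.

```shell
#!/bin/sh
# Count the lines read from stdin that carry an exemplar.
count_exemplars() {
  # grep -c prints 0 and exits non-zero when nothing matches;
  # "|| true" keeps the function successful in that case.
  grep -c '#.*trace_id' || true
}

# Fetch the metrics once per format and report the number of exemplars.
# Usage: compare_formats http://localhost:9000/metrics
compare_formats() {
  url="$1"
  printf 'default format:  %s\n' \
    "$(curl -s "$url" | count_exemplars)"
  printf 'OpenMetricsText: %s\n' \
    "$(curl -s "$url" -H 'Accept: application/openmetrics-text; version=1.0.0; charset=utf-8' | count_exemplars)"
}
```

Against an instance with metrics and tracing enabled, the first count is expected to stay at zero, while the second should become non-zero once traces have been recorded.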
== Verifying that exemplars work as expected
Perform the following steps to verify that {project_name} is set up correctly for exemplars:
. Follow the instructions to set up metrics and tracing for {project_name}.
. For test purposes, record all traces by setting the tracing ratio to `1.0`.
See <@links.observability id="tracing" anchor="sampling" /> for recommended sampling settings in production systems.
. Log in to the Keycloak instance to create some traces.
. Scrape the metrics with a command similar to the following and search for those metrics that have an exemplar set:
+
[source]
----
$ curl -s http://localhost:9000/metrics \
-H 'Accept: application/openmetrics-text; version=1.0.0; charset=utf-8' \
| grep "#.*trace_id"
----
+
This should result in an output similar to the following. Note the additional `#` after which the span and trace IDs are added:
+
[source]
----
http_server_requests_seconds_count {...} ... # {span_id="...",trace_id="..."} ...
----
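
To look up one of the reported traces in your tracing tool, you need the trace ID from the exemplar. The following shell sketch extracts the distinct trace IDs from metrics output in the format shown above. The function name `extract_trace_ids` is made up for this illustration, and the pattern assumes that trace IDs are lowercase hexadecimal strings.

```shell
#!/bin/sh
# Print the distinct trace IDs found in the exemplars of
# OpenMetrics text read from stdin.
extract_trace_ids() {
  # The exemplar follows "# {" at the end of a metric line;
  # capture the hexadecimal value of its trace_id label.
  sed -n 's/.*# {.*trace_id="\([0-9a-f]*\)".*/\1/p' | sort -u
}

# Usage against a running instance (not executed here):
# curl -s http://localhost:9000/metrics \
#   -H 'Accept: application/openmetrics-text; version=1.0.0; charset=utf-8' \
#   | extract_trace_ids
```

Each printed ID can then be entered in the search field of the tracing tool.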
</@tmpl.guide>

@ -14,6 +14,7 @@ This guide provides instructions on how to visualize collected {project_name} me
* {project_name} metrics are enabled. Follow <@links.observability id="configuration-metrics"/> {section} for more details.
* Grafana instance is running and {project_name} metrics are collected into a Prometheus instance.
* For the HTTP request latency heatmaps to work, enable histograms for HTTP metrics by setting `http-metrics-histograms-enabled` to `true`.
== {project_name} Grafana dashboards
@ -93,4 +94,9 @@ Exporting a dashboard to JSON format may be useful. For example, you may want to
++++
</div>
++++
== Further reading
Continue reading on how to connect traces to dashboards in the <@links.observability id="exemplars" /> {section}.
</@tmpl.guide>

@ -5,6 +5,7 @@
title="HTTP metrics"
summary="Learn about metrics for monitoring the {project_name} HTTP requests processing"
tileVisible="false"
includedOptions="http-metrics-histograms-enabled http-metrics-slos"
>
<#include "partials/prerequisites-metrics-troubleshooting.adoc" />
@ -37,6 +38,8 @@ m| http_server_requests_seconds_sum
| The total duration for all the requests processed.
|===
You can enable histograms for this metric by setting `http-metrics-histograms-enabled` to `true`, and add additional buckets for service level objectives using the option `http-metrics-slos`.
include::partials/histogram_note_http.adoc[]
=== Active requests

@ -13,3 +13,4 @@ metrics-for-troubleshooting-embedded-caches-multi-site
metrics-for-troubleshooting-external-infinispan-multi-site
tracing
grafana-dashboards
exemplars

@ -135,6 +135,7 @@ WARNING: For a production-ready environment, sampling should be properly set to
The used sampler can be changed via the `tracing-sampler-type` property.
[[sampling]]
=== Default sampler
The default sampler for {project_name} is `traceidratio`, which controls the rate of trace sampling based on a specified ratio configurable via the `tracing-sampler-ratio` property.

@ -20,6 +20,7 @@ package org.keycloak.it.cli.dist;
import static io.restassured.RestAssured.given;
import static io.restassured.RestAssured.when;
import static org.hamcrest.Matchers.containsString;
import static org.hamcrest.Matchers.matchesPattern;
import static org.hamcrest.Matchers.not;
import static org.junit.jupiter.api.Assertions.assertThrows;
@ -130,6 +131,21 @@ public class MetricsDistTest {
}
@Test
@Launch({ "start-dev", "--metrics-enabled=true", "--tracing-enabled=true" })
void testMetricsEndpointWithCacheMetricsHistogramsAndExemplars(KeycloakDistribution distribution) {
runClientCredentialGrantWithUnknownClientId(distribution);
distribution.setRequestPort(9000);
// Exemplars are only present when metrics and traces are enabled
// Chain the request so that the Accept header is actually sent with the GET
given().accept("application/openmetrics-text; version=1.0.0; charset=utf-8")
.when().get("/metrics").then()
.statusCode(200)
// http_server_requests_seconds_count{method="GET",outcome="CLIENT_ERROR",status="404",uri="NOT_FOUND"} 7.0 # {span_id="59fb88a687095d04",trace_id="a4d15d4deaa6f6ee7ac2da092f292925"} 1.0 1743780073.651
.body(matchesPattern("(?s).*http_server_requests_seconds_count.*,trace_id=.*"));
}
@Test
@Launch({ "start-dev", "--metrics-enabled=true", "--features=user-event-metrics", "--event-metrics-user-enabled=true" })
void testMetricsEndpointWithUserEventMetrics(KeycloakDistribution distribution) {