Creating Monitoring Dashboards
Guidelines for developers
Recently our teams at Hotels.com™, part of Expedia Group™, started moving from Graphite to an internal metrics platform based on Prometheus. We saw this as an opportunity to improve our observability and, among other things, we provided a set of simple guidelines to help with the migration.
We believe these guidelines would be useful to the community, so we are sharing them in this blog post. Some of the examples apply to our tech stack (i.e. Spring Boot, Micrometer, Kubernetes) but the idea is the same for other technologies and libraries.
Purpose of this Guide
Having meaningful and carefully crafted monitoring dashboards for your services is of utmost importance. The purpose of this guide is to:
- Provide you with a set of handy resources around monitoring
- Promote best practices on monitoring metrics and dashboards
- Help you create Grafana dashboards based on Prometheus metrics
If you want to learn more about monitoring and best practices, we suggest reading the following resources by Google:
- Site Reliability Engineering: How Google Runs Production Systems (Chapter 6, Monitoring Distributed Systems)
- The Site Reliability Workbook: Practical Ways to Implement SRE (Chapter 4, Monitoring)
Below is a non-exhaustive list of observability principles to keep in mind, all of which also apply to dashboards:
- Keep it simple: avoid creating complex dashboards that you will never use, or alerts that can trigger false-positive notifications.
- Keep it consistent: use consistent and meaningful names in your dashboards and alerts.
- Use logs, metrics, and traces wisely and in conjunction with each other.
- Avoid high-cardinality metrics.
- Avoid complex and slow queries in your dashboards.
What to Monitor
Latency could take the form of percentiles (e.g. p90, p99). Be aware that failed requests can make these calculations misleading.
For traffic, an example would be the number of requests per second (RPS).
What counts as an error depends on your service or system. A typical metric could be the rate of non-2XX status code responses.
Saturation shows how overloaded your service or system is. This could be monitoring the number of elements in a queue. You may also want to look into utilisation which reflects how busy the service is. An example of that is monitoring the busy threads.
Ideally, you should discuss and agree on a set of business metrics with your product owner, as they are based on business needs. Business metrics could be custom metrics reported by one or more services.
Indicative examples are listed below:
- A team responsible for sign-ins would need to report metrics for sign-in attempts, failed attempts due to invalid passwords, or even sign-ins coming from different channels but still hitting the same endpoint.
- A team owning the autocomplete functionality across multiple brands would need to monitor the number of requests and error rates per brand.
In a micro-services architecture, there could be many external calls from your service to other services. These calls are usually wrapped with Hystrix or other Circuit Breaker libraries. Monitoring core metrics (traffic, latencies, errors) for these calls is very important.
Connection Pools & Thread Pools Metrics
Having a dashboard that displays metrics for Tomcat threads, Circuit Breaker thread pools and HTTP client connection pools for 3rd party calls is useful.
Useful metrics for JVM applications include memory and CPU, GC, or even memory pools. We suggest re-using the JVM (Micrometer) Grafana dashboard.
Many services rely on infrastructures such as a cache, a database, or a queue. Even if your team does not own these components, monitoring them can help you identify the root cause of an issue. Although the 4 golden signals apply to most infrastructure systems, these systems can also have extra characteristics you need to monitor (e.g. the size of the queue or cache hits/misses).
In addition to infrastructure metrics you may need to monitor Platform metrics, such as ones provided by Kubernetes or by the Service Mesh (e.g. Istio). Usually incident response and SRE teams look into such dashboards to have the big picture and to achieve faster Mean Time to Detect (MTTD) and Mean Time To Recover (MTTR).
Prometheus Best Practices
Be super careful with high-cardinality metrics. As stated in the docs:
Every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.
Popular metrics libraries may have mechanisms in place to prevent this issue. For example, Micrometer provides the maximumAllowableTags method through its Meter Filters. Recent versions of Spring Boot Actuator use this by default for URI tags; they expose the management.metrics.web.client.max-uri-tags property with a default value of 100 (you may need to decrease that value though). If your library doesn't provide this out-of-the-box you will need to implement this logic.
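If you do end up implementing this logic yourself, the idea fits in a few lines. The Python below is a minimal sketch and not Micrometer's implementation: the class name TagValueLimiter, the limit of 2, and the "other" fallback value are all assumptions made for the example. It remembers the first N distinct values of a label and collapses everything after that into a single bucket, so the number of time series stays bounded.

```python
class TagValueLimiter:
    """Caps the number of distinct values a single label may take.

    Hypothetical helper for illustration; not part of any metrics library.
    """

    def __init__(self, max_values, fallback="other"):
        self.max_values = max_values
        self.fallback = fallback
        self._seen = set()

    def normalise(self, value):
        # Values we already track pass through unchanged.
        if value in self._seen:
            return value
        # New values are accepted only while we are under the limit.
        if len(self._seen) < self.max_values:
            self._seen.add(value)
            return value
        # Past the limit, collapse into one bounded bucket.
        return self.fallback


limiter = TagValueLimiter(max_values=2)
uris = ["/hotels", "/search", "/hotels", "/user/123", "/user/456"]
normalised = [limiter.normalise(u) for u in uris]
print(normalised)
```

Micrometer itself lets you choose what happens once the limit is reached. Collapsing into a fallback value is just one option, picked here to keep the sketch short.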
Let’s now look at practical examples you can re-use. Before we dive deep into queries, understanding the Prometheus format is crucial.
Understanding the Prometheus Format
If you hit the /prometheus endpoint under which your application exposes Prometheus metrics, you will see a set of metrics:
Taking the last two lines as an example, the name of the metric is http_server_requests_seconds_count and they both contain a set of labels such as the application name app, the endpoint uri, etc. In this case, the only difference is the
This is a representation of a single metric across multiple dimensions, by using labels. Having these multiple dimensions allows us to run powerful queries that could span across multiple URLs, AWS regions, and even across different applications.
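To make the name-plus-labels structure concrete, here is a minimal Python sketch that splits exposition lines into their parts. The two sample lines are made up for illustration (the app value, the status label and the numbers are assumptions), and the parser is deliberately simplified: it only handles lines that carry labels and does not deal with comments or escaped quotes.

```python
import re

# Hypothetical sample of what a /prometheus endpoint might return.
SAMPLE = '''\
http_server_requests_seconds_count{app="my-app",uri="/hotels",status="200"} 1024.0
http_server_requests_seconds_count{app="my-app",uri="/hotels",status="500"} 3.0
'''

LINE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_line(line):
    """Split one exposition line into (metric name, labels dict, value)."""
    match = LINE.match(line)
    if match is None:
        raise ValueError(f"unsupported line: {line!r}")
    labels = dict(LABEL.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


series = [parse_line(line) for line in SAMPLE.splitlines()]
names = {name for name, _, _ in series}
# Labels whose values differ between the two series.
differing = {k for k in series[0][1] if series[0][1][k] != series[1][1][k]}
print(names, differing)
```

Both lines share one metric name. Each distinct combination of label values is its own time series, which is why label choice drives cardinality.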
Now that we have a basic understanding of the metrics format we can look into useful queries. This section includes very basic examples but you can use them as a starting point.
RPS — Overall
The following query shows the Requests Per Second (RPS) across all endpoints:
- http_server_requests_seconds_count stores the count of HTTP requests.
- app is a label that reflects the name of the application. You can use a regex and the '=~' operator for a set of applications.
- We append the time selector [1m] which translates the instant vector into a range vector (over the last minute).
- Up to this point, we have a range vector which we need to transform into an instant vector in order for it to be displayed. We do this by applying the rate function, which shows the per-second increase.
- Finally, we aggregate the results using the
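The steps above can be sketched as plain string composition. This is only an illustration of how the pieces nest: the application name my-app is hypothetical, and sum is assumed as the aggregation operator.

```python
metric = "http_server_requests_seconds_count"
app = "my-app"  # hypothetical application name

# Step 1: select the series for one application (an instant vector).
instant_vector = f'{metric}{{app="{app}"}}'
# Step 2: the [1m] selector turns it into a range vector.
range_vector = f"{instant_vector}[1m]"
# Step 3: rate() turns the range back into a per-second instant vector.
per_second = f"rate({range_vector})"
# Step 4: aggregate all series into one number (sum is an assumption here).
rps_overall = f"sum({per_second})"

print(rps_overall)
```

The printed string is a valid PromQL expression that you can paste into a Grafana panel after replacing the application name with your own.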
RPS — Aggregations
Often you need to aggregate results per label. For example, plot the RPS per Kubernetes pod, per endpoint, or even per client.
To show the RPS per pod:
For the RPS per endpoint and per client you can use the uri and client labels respectively. In these cases, as mentioned earlier in this guide, you need to be mindful of high-cardinality issues.
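Aggregating per label simply groups series by that label's value and sums within each group. The Python sketch below mimics this on a handful of made-up per-second rates. The pod label name and all values are assumptions, and in PromQL the equivalent would take the form sum by (pod) (rate(...[1m])).

```python
from collections import defaultdict

# Hypothetical per-series request rates (requests per second),
# shaped like the output of rate() before aggregation.
rates = [
    ({"pod": "my-app-1", "uri": "/hotels"}, 12.0),
    ({"pod": "my-app-1", "uri": "/search"}, 3.0),
    ({"pod": "my-app-2", "uri": "/hotels"}, 10.0),
]


def sum_by(series, label):
    """Sum the values of all series that share the same value for `label`."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[labels[label]] += value
    return dict(totals)


rps_per_pod = sum_by(rates, "pod")
rps_per_uri = sum_by(rates, "uri")
print(rps_per_pod, rps_per_uri)
```

The result has one entry per distinct label value, so a label with thousands of values produces thousands of lines on the panel.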
To show the latency (e.g. p99) per endpoint you can use the following query:
- http_server_requests_seconds stores the latency of HTTP requests.
- quantile=0.99 gives the p99. You can read more about quantiles.
- Finally, we aggregate the results per endpoint using the
Note that you can calculate quantiles from both histograms and summaries.
Quantiles may not be reported by default by your application. For example, for Spring Boot applications you need to set the management.metrics.distribution.percentiles for this. We recommend reporting the p50, p75, p90, p95, p99, and p999 percentiles and defining a variable for these in your dashboards.
If you want to include or exclude particular endpoints you can do this with the uri label. For instance, uri=~"/api/v1/.*" will only plot endpoints under /api/v1/, while uri!~"/swagger.*" will exclude the Swagger endpoints.
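One detail worth knowing is that Prometheus anchors regex matchers, so the pattern must match the whole label value rather than a substring. The Python sketch below imitates that behaviour with re.fullmatch on a made-up list of endpoints. Prometheus uses a different regex engine, so treat this as an approximation that holds for simple patterns like these.

```python
import re

# Hypothetical uri label values.
uris = ["/api/v1/hotels", "/api/v2/hotels", "/swagger-ui.html", "/internal/api/v1/ping"]


def matches(pattern, value):
    # The whole value must match, mirroring anchored =~ and !~ matchers.
    return re.fullmatch(pattern, value) is not None


included = [u for u in uris if matches(r"/api/v1/.*", u)]        # like uri=~"/api/v1/.*"
not_swagger = [u for u in uris if not matches(r"/swagger.*", u)]  # like uri!~"/swagger.*"
print(included, not_swagger)
```

Note that /internal/api/v1/ping is not included by the first matcher, because the pattern has to match from the very start of the value.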
Failed requests are not representative examples of latency as they could fail fast (e.g. 500) or take a lot of time to complete if a timeout is not in place or is mis-configured. We recommend visualising latencies for successful requests and, if needed, having another panel for tracking latencies for failed requests.
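A small worked example shows how much fast failures can distort a percentile. The latencies below are invented, and the nearest-rank percentile function is a simplification of what a metrics library or Prometheus would compute.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds.
successful = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
failed_fast = [5] * 10  # e.g. requests rejected immediately with a 500

p50_success_only = percentile(successful, 50)
p50_mixed = percentile(successful + failed_fast, 50)
print(p50_success_only, p50_mixed)
```

With failures mixed in, the median drops from 140 ms to 5 ms even though successful requests are no faster, which is why a separate panel for failed requests is the safer choice.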
The simplest way to visualise your errors is by using a Stat panel, similar to Figure 5.
A more descriptive way would be to visualise success rates per endpoint:
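The calculation behind such a panel is a ratio: successful request rate divided by total request rate, grouped by endpoint. The Python sketch below runs it on made-up rates and treats only 2XX responses as successful, which is an assumption you should adapt to your own definition of an error.

```python
from collections import defaultdict

# Hypothetical request rates (per second) keyed by (uri, status).
request_rates = {
    ("/hotels", "200"): 95.0,
    ("/hotels", "500"): 5.0,
    ("/search", "200"): 40.0,
    ("/search", "404"): 10.0,
}


def success_rate_per_endpoint(rates):
    """Return the percentage of 2XX responses for each uri."""
    total = defaultdict(float)
    successful = defaultdict(float)
    for (uri, status), rate in rates.items():
        total[uri] += rate
        if status.startswith("2"):
            successful[uri] += rate
    # Endpoints without traffic are skipped to avoid dividing by zero.
    return {uri: 100.0 * successful[uri] / total[uri] for uri in total if total[uri] > 0}


success = success_rate_per_endpoint(request_rates)
print(success)
```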
It is important to be able to identify issues with your dependencies. The same signals can be used to monitor such calls. You can get these metrics from Circuit Breaker libraries such as Hystrix or Resilience4J.
To check which metrics are exposed by your Circuit Breaker you can either go through the documentation or hit your
Hystrix uses keys (in particular command keys and command group keys) to identify and group commands. These are available as
group labels when using the Hystrix metrics publisher. The result of the call is stored in the
Resilience4J exposes the name and kind labels as documented. The name is used to identify the call while the kind is the result.
The following queries return the RPS per Kubernetes pod for a selected key/name:
To plot the latency for a selected quantile (e.g. 0.99):
Finally, for errors:
Building panels for dependencies manually is time-consuming. We recommend using Grafana’s Repeat panel feature.
For this you first need to define a variable for your keys/names:
You can then select a single value for the key/name from the dropdown list, create one panel, and use the Repeating option under the General settings of your panel.
Once you select the All option, Grafana will render multiple panels, one for each dependency.
We encourage our teams to create their dashboards inside a folder and to use the same folder at least for dashboards related to the same service. The name of the folder could match the one of the project, or reflect the pillar, family name, etc.
The taxonomy depends on many factors, including the structure of a company but the following categories are usually company-agnostic:
- Business area (for us that would be “search”, “lodging”, etc.)
- Family or tech pillar name
- Technology name (e.g. micrometer, dropwizard, elasticache)
- Service/Infrastructure name
Recent Grafana versions support Dashboard Links, Panel Links, and Data Links. These could be either links to other dashboards or links to useful URLs. They rely on tags and once links have been created they will be available on your dashboard’s page.
On top of Dashboard Links we suggest using Panel Links. These could be links to monitoring systems used for logging (e.g. Splunk) or distributed tracing (e.g. Haystack) and redirect to a particular search associated with the service and the panel.
Templating is another key feature of Grafana, which allows you to avoid duplication by using variables instead of hard-coded values. We have seen that feature in the queries we used earlier for Hystrix and Resilience4J metrics. You can define variables for the datasource, the application name, the Kubernetes pods, or even the percentiles you want to plot metrics for. The values of these variables will show up as dropdown lists, and you can use the selected values in your queries.
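Conceptually, a dashboard variable is a placeholder that is replaced with the selected dropdown value before the query runs. The Python sketch below imitates that substitution with string.Template, which happens to use the same $name syntax. The variable names app and quantile and their values are assumptions for the example. Grafana performs the substitution itself, so you never need to write code like this.

```python
from string import Template

# Hypothetical query using dashboard variables named app and quantile.
query = Template('http_server_requests_seconds{app="$app", quantile="$quantile"}')

# Values a user might pick from the dropdown lists.
selected = {"app": "my-app", "quantile": "0.99"}
rendered = query.substitute(selected)
print(rendered)
```

Switching the dropdown to another application or percentile changes only the selected values, so a single panel definition serves every combination.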
Last but not least, annotations enable you to mark points with events. This is handy for correlating metrics with events such as deployments or A/B tests and we highly recommend using it.
In this article we went through best practices on monitoring metrics and dashboards and showed you how to create Grafana dashboards based on Prometheus metrics. These examples can be used as a starting point to craft more complex queries and more visualisations. However, always keep in mind that less is more, and simple is better than complex!