EXPEDIA GROUP TECHNOLOGY — SOFTWARE

Creating Monitoring Dashboards

Guidelines for developers

Nikos Katirtzis
Expedia Group Technology

--

A screenshot of a dashboard
Figure 1: Example of a Single Pane of Glass Monitoring Dashboard.

Recently our teams at Hotels.com™, part of Expedia Group™, started moving from Graphite to an internal metrics platform based on Prometheus. We saw this as an opportunity to improve our observability and, among other things, we provided a set of simple guidelines to help with the migration.

We believe these guidelines would be useful to the community and hence we share them in this blog post. Some of the examples apply to our tech stack (i.e. Spring Boot, Micrometer, Kubernetes) but the idea is the same for other technologies and libraries.

Purpose of this Guide

Having meaningful and carefully crafted monitoring dashboards for your services is of utmost importance. The purpose of this guide is to:

  • Provide you with a handful of useful resources around monitoring
  • Promote best practices on monitoring metrics and dashboards
  • Help you create Grafana dashboards based on Prometheus metrics

If you want to learn more about monitoring and best practices, we suggest reading the following resources by Google:

Site Reliability Engineering, How Google runs production systems (Chapter 6 — Monitoring Distributed Systems)

The Site Reliability Workbook, Practical ways to implement SRE (Chapter 4 — Monitoring)

Principles

Below is a non-exhaustive list of principles to keep in mind in the context of observability, which also apply to dashboards:

  • Keep it simple: avoid creating complex dashboards that you will never use or alerts that can trigger false-positive notifications.
  • Keep it consistent: use consistent and meaningful names in your dashboards and alerts.
  • Use logs, metrics, and traces wisely and in conjunction with each other.
  • Avoid high-cardinality metrics.
  • Avoid complex and slow queries in your dashboards.

What to Monitor

Core Metrics

As a first set of metrics, you should look into monitoring the 4 golden signals as defined by Google, or follow the RED method, which is more relevant to micro-services.

Latency (Duration)

This could take the form of percentiles (e.g. p90, p99). Be aware of failed requests, which can lead to misleading calculations.

Traffic (Rate)

An example of this would be the number of requests per second (RPS).

Errors

This will depend on what you consider as an error for your service or system. A typical metric could be the rate of non-2XX status code responses.

Saturation

Saturation shows how overloaded your service or system is. This could be monitoring the number of elements in a queue. You may also want to look into utilisation which reflects how busy the service is. An example of that is monitoring the busy threads.

Business Metrics

Ideally, you need to discuss and decide on this set of metrics with your product owner as they are based on business needs. Business metrics could be custom metrics reported by one or more services.

Indicative examples are listed below:

  • A team responsible for sign-ins would need to report metrics for sign-in attempts, failed attempts due to invalid passwords, or even sign-ins coming from different channels but still hitting the same endpoint.
  • A team owning the autocomplete functionality across multiple brands would need to monitor the number of requests and error rates per brand.

Dependencies Metrics

In a micro-services architecture, there could be many external calls from your service to other services. These calls are usually wrapped with Hystrix or other Circuit Breaker libraries. Monitoring core metrics (traffic, latencies, errors) for these calls is very important.

Connection Pools & Thread Pools Metrics

Having a dashboard that displays metrics for Tomcat threads, Circuit Breaker thread pools and HTTP client connection pools for 3rd party calls is useful.

JVM Metrics

Useful metrics for JVM applications include memory and CPU, GC, or even memory pools. We suggest re-using the JVM (Micrometer) Grafana dashboard.

Infrastructure Metrics

Many services rely on infrastructure such as a cache, a database, or a queue. Even if your team does not own these components, monitoring them can help you identify the root cause of an issue. Although the 4 golden signals apply to most infrastructure systems, these systems can also have extra characteristics you need to monitor (e.g. the size of the queue or cache hits/misses).

Platform Metrics

In addition to infrastructure metrics you may need to monitor platform metrics, such as those provided by Kubernetes or by the service mesh (e.g. Istio). Usually incident response and SRE teams look into such dashboards to get the big picture and to achieve faster Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR).

Prometheus Best Practices

The open-source community has come up with a set of best practices on metric names and labels which we encourage you to follow.

Be super careful with high-cardinality metrics. As stated in the docs:

Every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

Popular metrics libraries may have mechanisms in place to prevent this issue. For example, Micrometer provides the maximumAllowableTags method through its Meter Filters. Recent versions of Spring Boot Actuator use this by default for URI tags and expose the management.metrics.web.client.max-uri-tags property with a default value of 100 (you may need to decrease that value, though). If your library doesn't provide this out of the box, you will need to implement the logic yourself.
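One quick way to spot a label that is getting out of hand is to count its distinct values directly in Prometheus. Below is a minimal sketch using the request metric and the uri label discussed later in this guide:

    # number of distinct uri values currently reported for this metric
    count(count by (uri) (http_server_requests_seconds_count))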

Let’s now look at practical examples you can re-use. Before we dive deep into queries, understanding the Prometheus format is crucial.

Understanding the Prometheus Format

If you hit the /prometheus endpoint under which your application exposes Prometheus metrics, you will see a set of metrics:

Figure 2: Prometheus metrics exposed by Spring Boot Actuator.
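The exact output depends on your application, but the last two lines of such an output could look roughly like the following (the values, and the status label, are illustrative):

    http_server_requests_seconds_count{app="my-service",uri="/api/v1/search",client="client-a",status="200"} 1027.0
    http_server_requests_seconds_count{app="my-service",uri="/api/v1/search",client="client-b",status="200"} 539.0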

Taking the last two lines as an example, the name of the metric is http_server_requests_seconds_count, and both lines contain a set of labels such as the application name (app), the endpoint (uri), etc. In this case, the only difference between them is the client.

This is a representation of a single metric across multiple dimensions, using labels. Having these multiple dimensions allows us to run powerful queries that span multiple URLs, AWS regions, and even different applications.

Queries

Now that we have a basic understanding of the metrics format we can look into useful queries. This section includes very basic examples but you can use them as a starting point.

Rate

RPS — Overall

The following query shows the Requests Per Second (RPS) across all endpoints:

Figure 3: Query for Requests Per Second across all endpoints.
  • http_server_requests_seconds_count stores the count of HTTP requests.
  • app is a label that reflects the name of the application. You can use a regex and the '=~' operator for a set of applications.
  • We append the time selector [1m] which translates the instant vector into a range vector (over the last minute).
  • Up to this point, we have a range vector, which we need to transform into an instant vector in order for it to be displayed. We do this by applying the rate function, which computes the per-second rate of increase.
  • Finally, we aggregate the results using the sum aggregation operator.
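Putting these steps together, the query in Figure 3 looks roughly like this in text form (the app label value is illustrative):

    sum(rate(http_server_requests_seconds_count{app="my-service"}[1m]))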
A dashboard pane showing rate requests per second
Figure 4: Visualising the overall RPS.

If you want to display a single number, you can use the Singlestat visualisation (or the Stat panel in recent versions of Grafana).

Screenshot showing singlestat visualisation
Figure 5: Singlestat visualisation of the overall RPS.

RPS — Aggregations

Often you need to aggregate results per label. For example, plot the RPS per Kubernetes pod, per endpoint, or even per client.

To show the RPS per pod:

Figure 6: Query for Requests Per Second by Kubernetes pod.
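A sketch of such a query, assuming the pod name is exposed under a pod label (the exact label name depends on how your Kubernetes metrics are relabelled):

    sum by (pod) (rate(http_server_requests_seconds_count{app="my-service"}[1m]))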
A screenshot of RPS by Kubernetes pod
Figure 7: Visualising the RPS by Kubernetes pod.

For the RPS per endpoint and client you can use the uri and client labels respectively. In these cases, as mentioned earlier in this guide, you need to be mindful of high-cardinality issues.
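As a sketch, these are the same query with the aggregation label switched (same illustrative app value as before):

    # RPS per endpoint
    sum by (uri) (rate(http_server_requests_seconds_count{app="my-service"}[1m]))

    # RPS per client
    sum by (client) (rate(http_server_requests_seconds_count{app="my-service"}[1m]))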

Duration

To show the latency (e.g. p99) per endpoint you can use the following query:

Figure 8: Query for the duration/latency per uri.
  • http_server_requests_seconds stores the latency of HTTP requests.
  • quantile=0.99 gives the p99. You can read more about quantiles in the Prometheus documentation.
  • Finally, we aggregate the results per endpoint using the max aggregator.
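Putting the above together, a sketch of the query in Figure 8 (the app label value is illustrative, and your application needs to publish the 0.99 quantile as explained below):

    max by (uri) (http_server_requests_seconds{app="my-service", quantile="0.99"})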
A screenshot visualising the duration/latency per uri
Figure 9: Visualising the duration/latency per uri.

Note that you can calculate quantiles from both histograms and summaries.

Quantiles may not be reported by default by your application. For example, for Spring Boot applications you need to set the management.metrics.distribution.percentiles property for this. We recommend reporting the p50, p75, p90, p95, p99, p999 percentiles and defining a variable for these in your dashboards.

If you want to include or exclude particular endpoints you can do this with the uri label. For instance uri=~"/api/v1/.*" will only plot endpoints under the /api/v1/ path, while uri!~"/swagger.*" will exclude the Swagger endpoints.

Failed requests are not representative examples of latency as they could fail fast (e.g. 500) or take a lot of time to complete if a timeout is not in place or is misconfigured. We recommend visualising latencies for successful requests and, if needed, having another panel for tracking latencies of failed requests.
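As a sketch, and assuming the request timer carries a status label with the HTTP status code (as in the format example earlier), you could restrict the latency panel to successful requests like this:

    max by (uri) (http_server_requests_seconds{app="my-service", quantile="0.99", status=~"2.."})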

Errors

The simplest way to visualise your errors is by using a Stat panel, similar to Figure 5.

Figure 10: Query for error rates (5XXs).
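A minimal sketch of such a query, again assuming a status label holding the HTTP status code:

    sum(rate(http_server_requests_seconds_count{app="my-service", status=~"5.."}[1m]))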

Success Rates

A more descriptive way would be to visualise success rates per endpoint:

Figure 11: Query for success rates (200s).
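A sketch of such a query, dividing the rate of 200 responses by the overall rate per endpoint (assuming the same status label as above):

    # ratio of 200s to all responses, per endpoint; multiply by 100 for a percentage
    sum by (uri) (rate(http_server_requests_seconds_count{app="my-service", status="200"}[1m]))
      /
    sum by (uri) (rate(http_server_requests_seconds_count{app="my-service"}[1m]))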
A screenshot visualising success rates (200s) per uri
Figure 12: Visualising success rates (200s) per uri.

Dependencies

It is important to be able to identify issues with your dependencies. The same signals can be used to monitor such calls. You can get these metrics from Circuit Breaker libraries such as Hystrix or Resilience4J.

To check which metrics are exposed by your Circuit Breaker you can either go through the documentation or hit your /prometheus endpoint.

Hystrix uses keys (in particular command keys and command group keys) to identify and group commands. These are available as key and group labels when using the Hystrix metrics publisher. The result of the call is stored in the event label.

Resilience4J exposes the name, state, and kind labels as documented. The name label identifies the call, while the kind label holds the result.

RPS

The following queries return the RPS per Kubernetes pod for a selected key/name:

Hystrix

Figure 13: Query for the RPS per pod for a selected Hystrix key.

Resilience4J

Figure 14: Query for the RPS per pod for a selected Resilience4J name.
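The exact metric names depend on the library versions and the metrics binder in use, so treat the queries below as sketches and check your /prometheus endpoint for the real names. Assuming a pod label and Grafana variables for the selected key/name, they could look like this:

    # Hystrix: RPS per pod for the selected command key (metric name may differ in your setup)
    sum by (pod) (rate(hystrix_execution_total{key="$key"}[1m]))

    # Resilience4J: RPS per pod for the selected name (metric name may differ in your setup)
    sum by (pod) (rate(resilience4j_circuitbreaker_calls_seconds_count{name="$name"}[1m]))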

Latency

To plot the latency for a selected quantile (e.g. 0.99):

Hystrix

Figure 15: Query for the latency of a Hystrix call for a selected quantile.

Resilience4J

Figure 16: Query for the latency of a Resilience4J call for a selected quantile.
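Again treating the metric names as placeholders to be checked against your /prometheus endpoint, and assuming a quantile variable as recommended earlier:

    # Hystrix: latency for the selected key and quantile (metric name may differ in your setup)
    max(hystrix_latency_total_seconds{key="$key", quantile="$quantile"})

    # Resilience4J: latency for the selected name and quantile (metric name may differ in your setup)
    max(resilience4j_circuitbreaker_calls_seconds{name="$name", quantile="$quantile"})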

Errors

Finally, for errors:

Hystrix

Figure 17: Query for the error rate of a selected Hystrix key.

Resilience4J

Figure 18: Query for the error rate of a selected Resilience4J name.
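A sketch, with the same caveats about metric names, computing the share of failed calls using the event and kind labels mentioned above:

    # Hystrix: failure events over all events for the selected key
    sum(rate(hystrix_execution_total{key="$key", event="failure"}[1m]))
      /
    sum(rate(hystrix_execution_total{key="$key"}[1m]))

    # Resilience4J: failed calls over all calls for the selected name
    sum(rate(resilience4j_circuitbreaker_calls_seconds_count{name="$name", kind="failed"}[1m]))
      /
    sum(rate(resilience4j_circuitbreaker_calls_seconds_count{name="$name"}[1m]))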

Building panels for dependencies manually is time-consuming. We recommend using Grafana’s Repeat panel feature.

For this you first need to define a variable for your keys/names:

A screenshot showing how to define a variable for your Hystrix keys
Figure 19: Defining a variable for your Hystrix keys.
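With a Prometheus datasource, such a variable is typically defined with a label_values query on the relevant metric, for example (metric name illustrative):

    label_values(hystrix_execution_total, key)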

You can then select a single value for the key/name from the dropdown list, create one panel, and use the Repeating option under the General settings of your panel.

Once you select the All option, Grafana will render multiple panels for you, one for each dependency.

Dashboard Guidelines

We encourage our teams to create their dashboards inside a folder and to use the same folder at least for dashboards related to the same service. The name of the folder could match the name of the project, or reflect the pillar, family name, etc.

We also strongly recommend the use of tags. Tags are helpful when searching for dashboards and allow you to add links to other dashboards or URLs.

The taxonomy depends on many factors, including the structure of the company, but the following categories are usually company-agnostic:

  • Business area (for us that would be “search”, “lodging”, etc.)
  • Family or tech pillar name
  • Technology name (e.g. micrometer, dropwizard, elasticache)
  • Service/Infrastructure name

Recent Grafana versions support Dashboard Links, Panel Links, and Data Links. These could be either links to other dashboards or links to useful URLs. They rely on tags, and once the links have been created they will be available on your dashboard's page.

A screenshot showing dashboard Links to other dashboards and external monitoring systems
Figure 20: Dashboard Links to other dashboards and external monitoring systems.

On top of Dashboard Links we suggest using Panel Links. These could be links to monitoring systems used for logging (e.g. Splunk) or distributed tracing (e.g. Haystack) and redirect to a particular search associated with the service and the panel.

A screenshot with Panel Links to other dashboards and external monitoring systems
Figure 21: Panel Links to other dashboards and external monitoring systems.

Templating is another key feature of Grafana that allows you to avoid duplication by using variables instead of hard-coded values. We saw that feature in the queries we used earlier for Hystrix and Resilience4J metrics. You can define variables for the datasource, the application name, the Kubernetes pods, or even the percentiles you want to plot metrics for. The values of these variables show up as dropdown lists, and you can use the selected values in your queries.
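As a sketch, a single panel query can then reference several of these variables at once (variable names are illustrative):

    max by (uri) (http_server_requests_seconds{app="$application", quantile="$percentile"})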

Last but not least, annotations enable you to mark points in time with events. This is handy for correlating metrics with events such as deployments or A/B tests, and we highly recommend using them.

Conclusion

In this article we went through best practices on monitoring metrics and dashboards and showed you how to create Grafana dashboards based on Prometheus metrics. These examples can be used as a starting point to craft more complex queries and more visualisations. However, always keep in mind that less is more, and simple is better than complex!

Note: Thanks to Vinod Canumalla and Fabian Piau for reviewing the blogpost.

Learn more about technology at Expedia Group
