SRE: Observability: Metric Namespaces and Structures

Structured metric namespaces make information quickly accessible during incidents. Metric names and dimensions need careful consideration in order to support a wide variety of queries and expansions. One approach I’ve found effective for modeling flexible metrics is to think of them as a tree. A tree brings a number of benefits, including viewing specific subsets of data, defining a metric in terms of its children, and establishing ratios. This post explores properties of metric namespaces that enable answering progressively more specific questions, drilling down to subsets of the data, and viewing a metric in terms of the metrics that it’s composed of. Many of these concepts will be familiar, as they are first-class ideas in cloud native monitoring solutions such as Prometheus and DogStatsD.

What

Metric spaces are the conceptual space where metrics live. They are often bounded within a database or an account:

The metric space is not only where metrics live; it also encompasses the structures within it. Correct naming and structure unlock a number of huge benefits. The metric namespace in the diagram above has no explicit structure. Each metric is floating in space. The metrics don’t share anything other than the fact that they exist in the same metric space. In this structureless setup each metric has to be used individually. To see the rate of HTTP requests for a service, its metric has to be accessed directly by name: service_N_http_requests_total.

Suppose that it’s useful to see the total number of requests being made across all instances of a service. In the example above, what happens if a new service is created?

If the sum is being calculated from service_1 and service_2 when service_3 is added, there is nothing to connect them; there’s no structure to the metrics. The request count doesn’t reflect the actual total request count until service_3_http_requests_total is manually added to the sum. This can be seen in the chart below:
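The problem can be sketched in a few lines of Python. This is only an illustration: the rates are made-up numbers and the hand-maintained list TOTAL_INPUTS is a hypothetical stand-in for a dashboard query that names each metric explicitly.

```python
# Hypothetical per-second request rates, one flat metric name per service.
rates = {
    "service_1_http_requests_total": 40.0,
    "service_2_http_requests_total": 55.0,
}

# The "total" is a hand-maintained list of metric names (an assumption
# standing in for a query that spells out every metric).
TOTAL_INPUTS = [
    "service_1_http_requests_total",
    "service_2_http_requests_total",
]


def total_rate(rates, inputs):
    """Sum only the metrics that were explicitly listed."""
    return sum(rates[name] for name in inputs)


before = total_rate(rates, TOTAL_INPUTS)

# A new service starts reporting, but nothing links it to the total.
rates["service_3_http_requests_total"] = 25.0
after = total_rate(rates, TOTAL_INPUTS)

# What the total should be once every service is counted.
actual = sum(rates.values())

print(before, after, actual)
```

The reported total stays the same after service_3 appears, while the real total has grown. The gap only closes when someone edits the list by hand.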

Metric Trees

One alternative to a structureless space is to adopt an explicit structure using the metric as a namespace. The following chart illustrates this structure as a tree:

In Prometheus and Datadog metric structures are created using labels and tags, respectively. Using tags, expanding the total rate is dynamic; whenever a new service is added, it has a reference through the tree to the main metric.

In Prometheus, getting the per-second rate dynamically across all services can be seen below:

With a structured namespace it is possible to dynamically calculate the sum across a root node (in this case, the rates of the individual services are summed into a single metric).
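As a rough model of why this works, the Python sketch below stores one metric name plus a “service” label instead of one name per service. The sample values and the helper sum_metric are assumptions for illustration; in PromQL the equivalent query would typically look something like sum(rate(http_requests_total[1m])).

```python
# Each sample is (metric name, labels, per-second rate). Values are made up.
samples = [
    ("http_requests_total", {"service": "service_1"}, 40.0),
    ("http_requests_total", {"service": "service_2"}, 55.0),
]


def sum_metric(samples, name):
    """Sum every series that shares the metric name, whatever its labels."""
    return sum(value for metric, _labels, value in samples if metric == name)


before = sum_metric(samples, "http_requests_total")

# A new service reports under the same metric name with its own label value.
samples.append(("http_requests_total", {"service": "service_3"}, 25.0))
after = sum_metric(samples, "http_requests_total")

print(before, after)
```

Because the new series hangs off the same root metric, the total picks it up without any change to the query.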

Defining a metric in terms of its children

Using a tree, each dimension of a metric (e.g. the “service” label) contains that node’s individual request rate. Using a metric namespace, not only is the total rate now available, but the rate per individual service instance can also be visualized:

Modeling the data explicitly with a namespace makes it possible to see, along any selected dimension, a composite view of what makes up the parent data.
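A minimal sketch of this parent and children relationship, assuming made-up rates and a hypothetical helper sum_by: grouping by one label yields the children, and summing the children gives back the parent.

```python
from collections import defaultdict

# (labels, per-second rate) for a single metric. Values are illustrative.
samples = [
    ({"service": "service_1", "machine_id": "a"}, 30.0),
    ({"service": "service_1", "machine_id": "b"}, 10.0),
    ({"service": "service_2", "machine_id": "c"}, 55.0),
]


def sum_by(samples, label):
    """Group series by one label and sum the values within each group."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels[label]] += value
    return dict(totals)


per_service = sum_by(samples, "service")  # the children
total = sum(per_service.values())         # the parent, rebuilt from its children

print(per_service, total)
```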

Increasingly Specific — Subsets of data

Namespaces also support narrowing down on specific subsets of data. Trees support answering questions like: What’s the p99 latency on all successful http requests for canary deploy machines?

The tree above models the conceptual space and doesn’t necessarily model how the metrics are stored on disk. Using a well defined metric namespace allows expansion along any dimension.

The graph below shows the p99 and p50 from the tree above:

Combined with the technique above this can get more specific along any dimension of the metric: What’s the p99 latency on all successful http requests for canary deploy machines per machine?

The following shows Prometheus’s support for expanding this metric by machine_id:

Since the metric has a defined structure, the top-level metric can be expanded dynamically by machine_id.
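The narrowing and expansion can be sketched in Python as a filter followed by a group-by. Everything here is an assumption for illustration: the label names status and deploy, the latency values, and the nearest-rank percentile over raw observations (Prometheus itself estimates quantiles from histogram buckets rather than raw values).

```python
import math
from collections import defaultdict

# (labels, latency in milliseconds). Label names and values are hypothetical.
observations = [
    ({"status": "200", "deploy": "canary", "machine_id": "m1"}, 12.0),
    ({"status": "200", "deploy": "canary", "machine_id": "m1"}, 15.0),
    ({"status": "200", "deploy": "canary", "machine_id": "m1"}, 20.0),
    ({"status": "200", "deploy": "canary", "machine_id": "m2"}, 30.0),
    ({"status": "200", "deploy": "canary", "machine_id": "m2"}, 90.0),
    ({"status": "500", "deploy": "canary", "machine_id": "m1"}, 400.0),
    ({"status": "200", "deploy": "stable", "machine_id": "m3"}, 10.0),
]


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(q / 100 * len(ordered))
    return ordered[rank - 1]


def select(observations, **wanted):
    """Keep only observations whose labels match every requested value."""
    return [
        (labels, value)
        for labels, value in observations
        if all(labels.get(key) == val for key, val in wanted.items())
    ]


# Narrow down: successful requests on canary machines only.
subset = select(observations, status="200", deploy="canary")
p99_all = percentile([value for _, value in subset], 99)
p50_all = percentile([value for _, value in subset], 50)

# Expand the same subset along one more dimension: machine_id.
by_machine = defaultdict(list)
for labels, value in subset:
    by_machine[labels["machine_id"]].append(value)
p99_per_machine = {m: percentile(vs, 99) for m, vs in by_machine.items()}

print(p99_all, p50_all, p99_per_machine)
```

The same subset answers both the broad question and the per-machine question; only the grouping changes.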

Ratios Rule

A ratio is another way to link (correlate) data within an unstructured space. It’s extremely powerful and is the basis for the SLO availability and error rate calculations popularized by Google SRE. Ratios allow the end user to explicitly link metrics, establishing a metric structure. These links are most often expressed as percentages, e.g. availability might be calculated as successful requests / total requests and error rate as error requests / total requests. Other important ratios show how often one state out of multiple possible states occurs.

To illustrate this, suppose there was an application that performed a request that could result in one of multiple states, e.g. cache_hit:true vs cache_hit:false. In order to see the cache hit ratio, the data needs to be structured in a way to answer:

The graph below shows an example rate of cache hits and misses. Requests alternate between cache hits and misses, and there are roughly 160 requests per second:

The following chart correlates the cache hit rate to the total request rate, showing a 50% cache hit rate:

This creates a logical link; it isn’t a directed or concrete relationship. In Datadog and Prometheus it’s expressed as just an arithmetic operation. Since this occurs at the query level, any two metrics could be correlated.
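A small Python sketch of the ratio as plain arithmetic, assuming a single request metric with a hypothetical cache_hit label and rates chosen to match the roughly 160 requests per second described above:

```python
# (labels, per-second rate). The cache_hit label and values are assumptions.
samples = [
    ({"cache_hit": "true"}, 80.0),
    ({"cache_hit": "false"}, 80.0),
]


def rate_where(samples, **wanted):
    """Sum the rate of every series whose labels match the requested values."""
    return sum(
        value
        for labels, value in samples
        if all(labels.get(key) == val for key, val in wanted.items())
    )


hits = rate_where(samples, cache_hit="true")
total = rate_where(samples)  # no filter: every series counts
hit_ratio = hits / total

print(f"{hit_ratio:.0%}")
```

Nothing in the data declares that hits and total belong together. The link exists only in the division, which is why the same approach works for any pair of metrics.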

It’s all in the questions

To help guide the structure, think of the questions that the data should be able to answer. In the very first example the data can’t accurately answer “How many requests per second are all instances handling?”, but the namespace tree could. Another frequently seen approach is namespacing client metrics with the name of the consuming service and not by the client library itself. Namespacing by the client library supports answering: What is the total number of requests that all clients are generating?

General questions that I’ve found helpful follow Google’s four golden signals. Each question starts out broad and then traverses along the dimensions:

  • How many requests are all clients making in total?
  • How many requests is each client making?
  • How many requests is each client making per machine?
  • What is the rate of successful requests per machine, per RPC?

The same strategy applies for latencies, error rates, and resource saturation.
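The progression of questions above maps onto one query shape with progressively more dimensions. The sketch below assumes hypothetical label names (client, machine_id, rpc, status), made-up rates, and a helper called rollup that is not part of any library.

```python
from collections import defaultdict

# (labels, per-second rate) for one client request metric. All illustrative.
samples = [
    ({"client": "checkout", "machine_id": "m1", "rpc": "GetCart", "status": "ok"}, 20.0),
    ({"client": "checkout", "machine_id": "m2", "rpc": "GetCart", "status": "error"}, 5.0),
    ({"client": "search", "machine_id": "m1", "rpc": "Query", "status": "ok"}, 50.0),
]


def rollup(samples, by=(), **where):
    """Filter series by label values, then sum them grouped by the `by` labels."""
    totals = defaultdict(float)
    for labels, value in samples:
        if all(labels.get(key) == val for key, val in where.items()):
            group = tuple(labels[name] for name in by)
            totals[group] += value
    return dict(totals)


# How many requests are all clients making in total?
total = rollup(samples)
# How many requests is each client making?
per_client = rollup(samples, by=("client",))
# How many requests is each client making per machine?
per_client_machine = rollup(samples, by=("client", "machine_id"))
# What is the rate of successful requests per machine, per RPC?
ok_per_machine_rpc = rollup(samples, by=("machine_id", "rpc"), status="ok")

print(total, per_client, per_client_machine, ok_per_machine_rpc)
```

Each question adds a label to the grouping or a filter on a label value; the underlying metric never changes.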

Generic Metrics Enriched With Tags

From what I’ve read, Datadog and Prometheus generally guide users toward optimal querying and storage through their promoted best practices. In order to achieve a global view that supports drilling down into specific segments, start with a generic top-level namespace and enrich it with tags and labels (start broad, then refine by specific dimensions), but…

Beware of cardinality

Both Datadog and Prometheus have recommended limits on cardinality. Quoting Prometheus directly:

Do not overuse labels
Each labelset is an additional time series that has RAM, CPU, disk, and network costs. Usually the overhead is negligible, but in scenarios with lots of metrics and hundreds of labelsets across hundreds of servers, this can add up quickly.
As a general guideline, try to keep the cardinality of your metrics below 10, and for metrics that exceed that, aim to limit them to a handful across your whole system. The vast majority of your metrics should have no labels.
If you have a metric that has a cardinality over 100 or the potential to grow that large, investigate alternate solutions such as reducing the number of dimensions or moving the analysis away from monitoring and to a general-purpose processing system.
To give you a better idea of the underlying numbers, let’s look at node_exporter. node_exporter exposes metrics for every mounted filesystem. Every node will have in the tens of timeseries for, say, node_filesystem_avail. If you have 10,000 nodes, you will end up with roughly 100,000 timeseries for node_filesystem_avail, which is fine for Prometheus to handle.
If you were to now add quota per user, you would quickly reach a double digit number of millions with 10,000 users on 10,000 nodes. This is too much for the current implementation of Prometheus. Even with smaller numbers, there’s an opportunity cost as you can’t have other, potentially more useful metrics on this machine any more.
If you are unsure, start with no labels and add more labels over time as concrete use cases arise.
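To make the arithmetic in the quoted guideline concrete, here is a rough Python sketch. It assumes the worst case, where every combination of label values actually occurs, and the label names node, mountpoint, and user are illustrative rather than node_exporter’s real label set.

```python
from math import prod


def worst_case_series(label_cardinalities):
    """Upper bound on time series for one metric: the product of the number
    of distinct values per label, assuming every combination occurs."""
    return prod(label_cardinalities.values())


# Roughly ten mounted filesystems on each of 10,000 nodes.
filesystem_series = worst_case_series({"node": 10_000, "mountpoint": 10})

# Adding a per-user dimension multiplies rather than adds. This is an upper
# bound, since not every user necessarily appears on every node.
quota_series = worst_case_series({"node": 10_000, "user": 10_000})

print(f"{filesystem_series:,} vs {quota_series:,}")
```

Each new label multiplies the series count by its number of distinct values, which is why a single high-cardinality dimension such as a user ID can dominate the cost of a metric.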

Observability at the user level is often better accomplished through distributed tracing, which has its own metric space and best practices.

Conclusion

It’s important to consider which questions can be answered when structuring a metrics space. The wrong structure may prevent some questions from being answered easily, or at all. While structuring a metric space isn’t difficult, it does require up-front planning in order to get the most out of the data. In on-call situations it’s sometimes crucial to be able to expand a metric in order to see all of its states or dimensions, so it’s important that the metric namespace doesn’t prevent that from happening.

References