High-Cardinality Monitoring for Microservices and Containers

Alberto Farronato, SignalFx
November 2, 2018

As organizations of all sizes continue on the journey to cloud-native, the volume of operational data they must deal with has expanded significantly. Metrics from cloud infrastructure, open-source software, and ephemeral application components have not just increased datapoint volume — they have also added to the cardinality of metrics that monitoring systems store and query against.

In this post, we’ll review the concept of cardinality and explore several aspects of SignalFx that address challenges with handling high cardinality metrics, while keeping our data model rich and flexible.

What Does Cardinality Mean in Monitoring Systems?

Cardinality in the context of monitoring systems is defined as the number of unique metric time series stored in your monitoring system’s time series database (TSDB). Generally, a metric time series (MTS) is the unique combination of a metric name and any number of key-value pairs known as dimensions or labels. Below is an example of how metrics are represented in SignalFx:

{2017-08-21_17:32:00} {metric: cpu.utilization} {dimensions [server:sjc001] [service:login]} {value:80}
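
To make the structure concrete, here is a minimal sketch in Python of how such a datapoint could be modeled. The class and method names are illustrative only and are not part of the SignalFx API.

    # Minimal, illustrative model of a datapoint and the MTS it belongs to.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Datapoint:
        timestamp: str      # e.g. "2017-08-21_17:32:00"
        metric: str         # e.g. "cpu.utilization"
        dimensions: tuple   # e.g. (("server", "sjc001"), ("service", "login"))
        value: float        # e.g. 80

        def series_key(self):
            # The metric name plus the full set of dimension key-value pairs
            # uniquely identifies the metric time series (MTS).
            return (self.metric, tuple(sorted(self.dimensions)))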

Dimensions are useful in monitoring because the additional context they provide enables filtering of time series. To illustrate, imagine if you had three time series measuring cpu utilization:

{2017-08-21_17:32:00} {metric: cpu.utilization} {dimensions [server:sjc001] [service:login] } {value:80} - MTS 1
{2017-08-21_17:32:00} {metric: cpu.utilization} {dimensions [server:sjc001] [service:logout]} {value:90} - MTS 2
{2017-08-21_17:32:00} {metric: cpu.utilization} {dimensions [server:sjc001] } {value:70} - MTS 3

Running an unfiltered query against cpu.utilization without specifying a dimensional key-value pair runs the query against all three time series. However, if you were to filter this query down to service:log*, you would only need to run it against two of your time series (MTS 1 and MTS 2).
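
To illustrate what that filter does conceptually, here is a small hypothetical sketch (the helper below is not a SignalFx query API): a dimension filter narrows the set of series the query has to touch.

    # Illustrative only: filtering the three example series by a dimension value.
    import fnmatch

    series = [
        {"metric": "cpu.utilization", "dimensions": {"server": "sjc001", "service": "login"}},   # MTS 1
        {"metric": "cpu.utilization", "dimensions": {"server": "sjc001", "service": "logout"}},  # MTS 2
        {"metric": "cpu.utilization", "dimensions": {"server": "sjc001"}},                       # MTS 3
    ]

    def matches(mts, metric, **dim_patterns):
        # A series matches when the metric name and every requested dimension pattern match.
        if mts["metric"] != metric:
            return False
        return all(fnmatch.fnmatch(mts["dimensions"].get(key, ""), pattern)
                   for key, pattern in dim_patterns.items())

    unfiltered = [s for s in series if s["metric"] == "cpu.utilization"]              # all three MTS
    filtered = [s for s in series if matches(s, "cpu.utilization", service="log*")]   # MTS 1 and MTS 2 only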

To further aid grouping and filtering, some systems also allow you to assign tags (occasionally referred to as properties) to dimensions and metrics. Typically, the use of tags and dimensions in metrics-based monitoring systems enables more direct queries to the time series database. They also allow monitoring of dynamic populations and groups of assets (e.g. a group of hosts with a ‘production’ tag). As hosts are added to or removed from the production cluster, the monitoring system detects the change automatically, and charts and alerts update based on the population being observed.

At SignalFx, for example, tagging our AWS EC2 metrics allows us to group our compute instances by the microservice that they back.
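
As a toy illustration of that idea (the host names and values below are made up), a chart or alert defined against a tag recomputes over whichever hosts currently carry that tag, with nothing hard-coded:

    # Hypothetical example: an aggregation defined by tag membership, not a host list.
    hosts = {
        "sjc001": {"tags": {"production"}, "cpu.utilization": 80},
        "sjc002": {"tags": {"production"}, "cpu.utilization": 90},
        "sjc003": {"tags": {"staging"}, "cpu.utilization": 40},
    }

    def mean_cpu(tag):
        # Recomputed over whichever hosts carry the tag at query time.
        values = [h["cpu.utilization"] for h in hosts.values() if tag in h["tags"]]
        return sum(values) / len(values)

    print(mean_cpu("production"))  # 85.0; adding a new production host changes the result automatically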

Metrics with dimensions that have many different values are considered high cardinality because each individual combination of metric and dimension value is seen as a unique time series. For example, if you track mean request latency for a service using customer ID as a dimension, and you have 100 customer IDs, your monitoring system has to deal with 100 time series.
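
As a back-of-the-envelope sketch (the metric and dimension names here are hypothetical), cardinality is simply the count of unique metric-plus-dimensions combinations:

    # One metric tracked per customer ID: 100 customer IDs yield 100 unique MTS.
    customer_ids = [f"cust-{i:03d}" for i in range(100)]

    series_keys = {
        ("request.latency.mean", (("customer_id", cid),))
        for cid in customer_ids
    }
    print(len(series_keys))  # 100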

Why Should You Care?

As infrastructure and development practices have evolved, it’s now possible for an organization to have hundreds or thousands of different services, emitting upwards of millions of data points per second. Monitoring practices have also changed — instead of collecting only infrastructure metrics, people are instrumenting their application workflows to measure higher-level indicators of performance. They also monitor according to populations or groups (customers), as opposed to constantly looking at individual components (servers). This use of non-standard tags and dimensions to slice and dice data causes high cardinality.

What are some use cases that create high cardinality metrics? Below are a few examples.

Immutable Infrastructure

Your team practices ‘immutable infrastructure’, where infrastructure components are never modified after they’re deployed. As a result, the company’s entire container fleet is replaced for every code push — with new (but equivalent) containers provisioned using a common image that contains the relevant changes. Each new container is represented by a unique time series.

Phased Deployments

Your team wants to compare different code versions during a canary or blue/green deployment, so that changes can be automatically rolled back if performance is adversely affected.

Capacity Planning

Your company is trying to forecast infrastructure capacity needs, and requires historical data on resource utilization for each service to see trends over long periods of time.

Customer Experience

You have a set of APIs that you expose to users, and have to maintain SLAs to specific customers for the uptime and performance of those APIs. To make sure that the company is delivering on its promise, you monitor request rate, errors, and duration for each of your services, and track whether performance varies by user.

When you perform these kinds of tasks in any monitoring system, your time series footprint steadily grows — and the number of time series that you simultaneously query at any given moment grows as well. One of our customers practices immutable infrastructure as described in the first example, with approximately 1 million containers supporting their production services. This means that for every deployment, 1 million new MTS are created in SignalFx.

What Makes High Cardinality Hard to Manage?

High cardinality is hard to manage because it increases both the number of time series that need to be stored by your time series database, and the size of queries that have to be made to it on a regular basis. Queries to the database become more computationally expensive, because there are now more time series, and any significant event (e.g. a code push, burst of user traffic) will result in a flood of simultaneous writes to the database as well.

The documentation for many monitoring systems actually warns users not to send in dimensions with a high number of potential values, or to keep dimension values below a hard limit to avoid performance penalties.

For instance, the documentation for Prometheus cautions against “overusing” labels, noting that every unique combination of label key-value pairs represents a new time series and that labels should not be used to store high-cardinality dimensions such as user IDs or email addresses.

Datadog also encourages users to keep the number of tag/dimension values per metric below 1,000, specifically warning that exceeding this limit will incur performance penalties.

This problem isn’t unique to metrics-based monitoring systems. New Relic states that using its APM Agent to collect too many unique MTS will automatically trigger limits on how data appears in its user interface.

In all of these cases, a larger time series footprint leads to slower-loading charts, delayed alerts, and less reliable monitoring.

SignalFx is Built for High Cardinality Metrics

SignalFx allows you to query over 50,000 metric time series in a single job without incurring performance penalties.

Looking back at the examples above, this means that SignalFx lets you query:

  • 25x more metric time series than New Relic
  • 50x more metric time series than Datadog

The solution is designed around a number of features aimed at reducing the computational overhead of fetching data across high cardinality metrics, both in how we perform queries against metric time series and in how we store them:

Separating Time Series Datapoint Storage from Metadata Storage

Because databases tend to be optimized for either storage or retrieval, or around handling a particular type of data, most TSDBs impose limitations on aggregation, cardinality, and querying to preserve stability at the expense of performance.

SignalFx was designed with two separate backends for storing metric time series — one specifically optimized to handle metric values, and one for human-readable metadata. Each backend is tailored to a specific use case (handling metadata is a search problem, while datapoint storage requires optimizing for bulk reads and writes), and scales independently of the other.

Currently, our time series store runs on Cassandra, while metadata in SignalFx is stored in Elasticsearch. We’ve spoken and written on several occasions about the engineering work that we’ve done to make each of these technologies perform in SignalFx, as well as how we scale and operate them on an ongoing basis.
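
The split can be pictured with a conceptual sketch like the one below. This is not SignalFx's actual ingestion code; the classes are generic stand-ins for a datapoint store and a metadata search index.

    # Conceptual sketch only: route values and metadata to separately scaled backends.
    class DatapointStore:      # stand-in for a store optimized for bulk reads/writes (e.g. Cassandra)
        def write(self, series_id, timestamp, value): ...

    class MetadataIndex:       # stand-in for a search backend (e.g. Elasticsearch)
        def index(self, series_id, metric, dimensions): ...

    def ingest(point, datapoints, metadata):
        series_id = hash((point["metric"], tuple(sorted(point["dimensions"].items()))))
        # Metadata changes rarely and is queried as a search problem;
        # values arrive constantly and are appended in bulk.
        metadata.index(series_id, point["metric"], point["dimensions"])
        datapoints.write(series_id, point["timestamp"], point["value"])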

Optimizing Query Performance

SignalFx does not treat ‘source’ or ‘host’ dimensions as special. Other TSDBs require the user to specify a primary filter condition on a source/host dimension; any further filtering by tags and other dimensions happens on the result of that primary filter. This makes certain queries highly inefficient: if there is no condition on source/host, searching just by a tag requires those systems to scan through all time series to find the ones that match.

SignalFx treats all dimensions and tags the same, which means any search by any combination of dimensions is equally efficient and fast. This improves query response times and usability for environments with high cardinality metrics.
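
One way to picture this (a sketch of the general approach, not SignalFx internals) is an inverted index over every dimension key-value pair, so that a query on any combination of dimensions becomes a set intersection rather than a scan:

    # Illustrative inverted index: every dimension is equally queryable.
    from collections import defaultdict

    index = defaultdict(set)   # (dimension key, dimension value) -> set of series ids

    def index_series(series_id, dimensions):
        for key, value in dimensions.items():
            index[(key, value)].add(series_id)

    def find(**dimensions):
        postings = [index[(key, value)] for key, value in dimensions.items()]
        return set.intersection(*postings) if postings else set()

    index_series("mts-1", {"server": "sjc001", "service": "login"})
    index_series("mts-2", {"server": "sjc001", "service": "logout"})
    print(find(service="login"))    # {'mts-1'}: no mandatory filter on server/host needed
    print(find(server="sjc001"))    # {'mts-1', 'mts-2'}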

High Cardinality Metrics in Real-Time, at Any Scale

High cardinality metrics are becoming increasingly common as people move away from monitoring that purely focuses on system metrics, and towards higher-level indicators of customer experience and application health.

Using them makes your data more actionable and unlocks valuable insights, and SignalFx lets you easily query tens of thousands of metric time series at once. Because of this, we can provide richer charts and more analytics-driven alerting, helping you better understand the behavior of your environment and notifying the right members of your team when issues emerge. In a real-time world, speed and scale make all the difference.

For in-depth guidance on naming metrics and dimensions in SignalFx, see our documentation on naming conventions. If you’re not using SignalFx yet, get started today with a 14-day trial.

Originally published at www.signalfx.com.
