EXPEDIA GROUP TECHNOLOGY — DATA

Working around Datadog’s cardinality limitations

How to get custom metrics for ID-based entities while keeping your expenses under control

Lorenzo Dell'Arciprete
Expedia Group Technology

--

Datadog’s best practices discourage using IDs as tags for custom metrics. Doing so would cost you a lot of money. But what if you really, really need that?

Man standing on ice

Background

If you have read Datadog’s documentation about custom metrics, you have probably come across this caveat: each tag combination you attach to a custom metric counts as a separate metric, so you have to choose your tags carefully. Datadog allows a limited number of free custom metrics per host, and everything beyond that is pay-per-use.

In practice, this means you are strongly discouraged from using any attribute with unbounded cardinality as a tag, e.g. any kind of incremental ID.

Let’s consider, for example, the IDs of A/B experiments. What if we want to define a Datadog monitor based on a per-experiment metric? Thousands of experiments are run every year at Expedia Group™, and we have hundreds running at any given time. Collecting a custom metric and tagging it by experiment ID is out of the question, since that would be too expensive.
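To make the cost issue concrete, this is roughly the pattern that advice rules out, as a minimal Micrometer sketch (metric and tag names are purely illustrative). Every distinct experiment ID creates a new tag combination, and therefore a new billable custom metric:

// Anti-pattern: tagging a published metric with an unbounded ID.
// Every distinct experimentId value becomes its own time series in Datadog,
// and therefore its own billable custom metric.
public void recordExposure(MeterRegistry datadogRegistry, long experimentId) {
    datadogRegistry.counter("experiment.exposures", "experimentId", String.valueOf(experimentId))
            .increment();
}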

Publish only what you need

The approach to use depends heavily on the specific problem to solve, but the underlying principle is the same.

Coming back to the experiments example, the aim is to build a monitor that triggers whenever we are getting too many exposures for a single experiment. How do you measure that without keeping track of the number of exposures each experiment is producing at any given time?

The first thing to consider is that we only care about data that exceeds the threshold. If the experiment with the highest producing rate does not exceed the threshold, we can safely state that the monitor should not be triggered. This leads to the key point of our approach: instead of tracking metrics for all the experiments, we can track a single aggregate metric that reports the current highest producing rate among all experiments.

Building a monitor on that single metric is good enough to alert us about unusual activity.

What if you want to know which experiment is the culprit? What if you want to know whether there is more than one exceeding the threshold? Those are good questions, and we will come back to them later. But first, some details on the design.

Manipulating metrics

Using a single metric whose value may represent different underlying entities over time can be tricky. Unless you want to start calculating real-time derivatives, some approximation is needed.

For the purpose of determining “the experiment with the highest producing rate” at any given time, we split the timeline into discrete intervals. During each interval, we count the number of exposures produced per experiment, and once per interval we determine the most “chatty” experiment.

If that sounds like a problem best solved with metrics, that’s because it is! The fundamental difference is that we will not be publishing those metrics to Datadog. Rather, we will collect them internally and report to Datadog a single metric that gets updated after each interval.

It is important to pick metric types carefully. Whilst the internal metrics are counters, you may want to use a different metric type for the aggregate metric that you publish to Datadog. Using a counter for that too might let you leverage Datadog’s metric manipulation features, but you will be left with some strange-looking data to work with. For example:

An ugly sawtooth graph

Instead, I picked a gauge for the aggregate metric. Note that the gauge has a different meaning: rather than measuring a raw count, it measures the current producing rate. It is actually the rate measured in the last completed interval, but that should be a good enough approximation as long as the intervals are short enough.

A segmented graph, not smooth but much better

When transforming the raw count into a rate, I recommend picking the time unit according to the threshold you want to set, i.e. if you want to alert on a rate >10M per hour, it’s easier to publish your aggregate gauge as a rate per hour. If your intervals last 5 minutes, the gauge value should then be computed as the event count divided by 5/60, i.e. multiplied by 12. Then you can easily create a Datadog monitor that looks like this:

A monitor that compares the current rate with a threshold
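For reference, the resulting monitor can be a plain metric threshold monitor. Assuming the aggregate gauge is published under the metric name used in the code below, and a threshold of 10M per hour, the monitor query would look roughly like avg(last_5m):avg:aggregateMetricName{*} > 10000000 (metric name and numbers are purely illustrative).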

Hands on with Spring Boot

How to implement all that depends on the specific technology stack, but let me briefly describe how I did it in a Spring Boot based Java application.

If you are publishing metrics to Datadog already, you should have something like this in your application’s configuration:

management:
  metrics:
    export:
      statsd:
        flavor: datadog
        host: ${DD_AGENT_HOST}
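This assumes the micrometer-registry-statsd dependency is on the classpath, which is what enables the statsd export and makes a StatsdMeterRegistry bean available to the code below.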

For our purposes, we need to instantiate another MeterRegistry that we use exclusively for collecting data internally, without publishing to Datadog. Just declare a bean such as this:

@Bean
public SimpleMeterRegistry internalMeterRegistry() {
    final SimpleConfig config = new SimpleConfig() {
        @Override
        public String get(String key) {
            return null;
        }

        @Override
        public Duration step() {
            return Duration.ofMillis(5 * 60 * 1000);
        }

        @Override
        public CountingMode mode() {
            return CountingMode.STEP;
        }
    };
    return new SimpleMeterRegistry(config, Clock.SYSTEM);
}

Note how this is configured to use the step-based counting mode. This means that any counter created by this meter registry will be a StepCounter. That's exactly what we need, as the docs state:

count() will report the number of events in the last complete interval rather than the total for the life of the process.

Now, rather than injecting a generic MeterRegistry, we should explicitly inject a SimpleMeterRegistry for internal metrics, or a StatsdMeterRegistry for metrics destined for Datadog.

Next, we need to collect data in the internal counters and schedule a process that looks at the data from the last interval and updates the published gauge:

@Service
@EnableScheduling
public class AggregateMetricsService {

    private final MeterRegistry internalMeterRegistry;
    private final AtomicLong topmostProducedRate = new AtomicLong();
    // Thread-safe set: mark() is called from request threads, publish() from the scheduler
    private final Set<Long> publishingExperiments = ConcurrentHashMap.newKeySet();

    @Autowired
    public AggregateMetricsService(SimpleMeterRegistry internalMeterRegistry, StatsdMeterRegistry publishingMeterRegistry) {
        this.internalMeterRegistry = internalMeterRegistry;
        publishingMeterRegistry.gauge("aggregateMetricName", topmostProducedRate);
    }

    // This should be invoked whenever an exposure is detected
    public void mark(Long experimentId) {
        internalCounterByExperiment(experimentId).increment();
        publishingExperiments.add(experimentId);
    }

    @Scheduled(fixedRate = 5 * 60 * 1000)
    public void publish() {
        double topmostProducedCountInInterval = 0;
        for (final Long experimentId : publishingExperiments) {
            final double count = internalCounterByExperiment(experimentId).count();
            if (count > topmostProducedCountInInterval) {
                topmostProducedCountInInterval = count;
            }
        }
        // Convert the 5-minute count into a per-hour rate
        this.topmostProducedRate.set((long) (topmostProducedCountInInterval * 12));
    }

    private Counter internalCounterByExperiment(Long experimentId) {
        return internalMeterRegistry.counter("internalMetricName", "experimentIdTag", String.valueOf(experimentId));
    }
}
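For completeness, wiring this in is just a matter of calling mark() wherever an exposure is detected. A minimal, hypothetical caller might look like this:

@Service
public class ExposureListener {

    private final AggregateMetricsService aggregateMetricsService;

    @Autowired
    public ExposureListener(AggregateMetricsService aggregateMetricsService) {
        this.aggregateMetricsService = aggregateMetricsService;
    }

    // Hypothetical entry point, called once per detected exposure
    public void onExposure(long experimentId) {
        // ... handle the exposure itself ...
        aggregateMetricsService.mark(experimentId);
    }
}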

Fine tuning

Size of the step interval

Imagine an experiment has short-lived spikes of exposures that would not cause the monitor to trigger when looking at a larger time frame. Now imagine a few experiments experience the same. There is a risk that the aggregate gauge flips between the different experiments, registering several spikes within the monitor window. Unlikely as this scenario might be, it could lead to a false alarm. Picking a larger step interval helps prevent this.

On the other hand, the aggregate gauge always reports data about the latest completed interval. This means that the data we see is delayed by one interval. The smaller the step interval is, the smaller the delay before we notice something is wrong.

You need to define a good compromise according to your use case.

Finding the culprit

Assume you have set everything up, and your monitor gets triggered. What now? The first step is to identify which experiment crossed the threshold. We can run manual queries to find that out, but can we do it more conveniently? Yes and no.

I tried to solve this problem by publishing an additional gauge whose value is not really a metric, but rather the ID of the experiment that produced the most exposures in the last interval. Displaying this gauge in a dashboard has the nice advantage of letting you quickly detect the scenario mentioned in the previous section: if you see a broken line, you know that the most “chatty” experiment has been changing, so alarms might be false positives.
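Concretely, this can be done by extending the service shown earlier with a second gauge that holds the ID of the top experiment. A sketch of the additions (the extra metric name is illustrative):

// Additional field, next to topmostProducedRate
private final AtomicLong topmostProducingExperimentId = new AtomicLong();

// Additional gauge registration in the constructor
publishingMeterRegistry.gauge("aggregateExperimentIdMetricName", topmostProducingExperimentId);

// Revised publish(): remember which experiment had the highest count
@Scheduled(fixedRate = 5 * 60 * 1000)
public void publish() {
    double topmostProducedCountInInterval = 0;
    long topmostExperimentId = 0;
    for (final Long experimentId : publishingExperiments) {
        final double count = internalCounterByExperiment(experimentId).count();
        if (count > topmostProducedCountInInterval) {
            topmostProducedCountInInterval = count;
            topmostExperimentId = experimentId;
        }
    }
    this.topmostProducedRate.set((long) (topmostProducedCountInInterval * 12));
    this.topmostProducingExperimentId.set(topmostExperimentId);
}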

Unfortunately, this is not good enough for identifying the culprit at a glance. Datadog dashboards report rounded numbers, so that e.g. experiment ID 11092 would be displayed as “11.1k”. If you know how to force displaying the exact value, please leave a comment!

Clearly, this “trick” can only be used if you are handling numeric IDs. It does not apply if you have something like GUIDs.

But I want more!

As I mentioned at the beginning, we only care about the experiment producing the most exposures. We don’t expect multiple experiments to cross the threshold at the same time, and even if that happens, it’s not a big deal.

But maybe you expect your threshold to be crossed by multiple entities at the same time, and you want more detail. Something you can do is apply the same pattern, but publish metrics for the top 3 (or 5, or 10) entities. This might get messy, but it should work well enough in most cases.
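If you do go down that route, one way to extend the service shown earlier is to register one gauge per rank (tagging by rank keeps the metric name fixed and the cardinality bounded at N) and fill the gauges from the sorted per-experiment counts. A rough sketch, with illustrative names:

// Gauges for the top N producing rates, registered once in the constructor
private static final int TOP_N = 3;
private final List<AtomicLong> topProducedRates = new ArrayList<>();

// In the constructor
for (int rank = 1; rank <= TOP_N; rank++) {
    final AtomicLong rate = new AtomicLong();
    publishingMeterRegistry.gauge("aggregateMetricName", Tags.of("rank", String.valueOf(rank)), rate);
    topProducedRates.add(rate);
}

// In publish(): sort the per-experiment counts and fill the gauges by rank
final List<Double> counts = publishingExperiments.stream()
        .map(experimentId -> internalCounterByExperiment(experimentId).count())
        .sorted(Comparator.reverseOrder())
        .collect(Collectors.toList());
for (int rank = 0; rank < TOP_N; rank++) {
    final double count = rank < counts.size() ? counts.get(rank) : 0;
    topProducedRates.get(rank).set((long) (count * 12));
}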

Learn more about technology at Expedia Group
