Effectively measuring execution times with Micrometer & DataDog

Joaquín Martín · Clarity AI Tech · Jan 11, 2023

Observability is one of the key concerns that we as software engineers must keep in mind; without good observability we are blind. There are a lot of tools with almost unlimited possibilities, but in this post we want to share our experience and what we have learned while collecting execution time metrics.

At Clarity AI, we started to worry about gathering timing metrics from specific methods when we needed to refactor a shared library in our code. This library executes a complex algorithm over a high volume of data. Since this is a critical part of our product, we wanted the most accurate data possible to be sure that we were actually improving performance.

Infrastructure overview

We use Micrometer to collect the metrics and DataDog as our cloud monitoring service, so let's take a quick look at how the metrics are collected and published to DataDog.

From DataDog documentation

On one side we have our applications, which use Micrometer to gather and expose metrics. On the other side, the DataDog Agent works as an intermediary between our applications and DataDog. Considering this infrastructure, here is a quick summary of how the metrics are gathered:

  1. Metrics are gathered and processed locally by Micrometer.
  2. Micrometer publishes the data using the DogStatsD format (a statsd flavor defined by DataDog).
  3. DogStatsD receives the metrics and applies its own processing when necessary, such as data aggregation and percentile calculations.
  4. Metrics are published to DataDog, so we can query them later.
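
For reference, DogStatsD metrics travel as plain-text datagrams over UDP. A hypothetical sample for a timer could look like the following sketch; the metric name, value, and tags are made up for illustration, and the exact metric type marker Micrometer emits depends on its configuration:

# <METRIC_NAME>:<VALUE>|<TYPE>|#<TAG_1>,<TAG_2>
example.metric.name:327|h|#service:markets,env:prod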

Throughout this post we are going to assume that all the DataDog-specific infrastructure is already in place; we will not go into details such as the DataDog Agent configuration.

Setting up Micrometer

Configuring Micrometer to connect with DataDog when using Spring Boot is really easy: you just need to add the micrometer-registry-statsd Maven dependency. When this dependency is present, Spring Boot will pick it up and everything will work out of the box. Note that there is a Spring Boot setting to specify the statsd flavor, but it defaults to the Datadog flavor.

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-statsd</artifactId>
    <version>${micrometer.version}</version>
</dependency>
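
No extra configuration is needed for the default setup, but if you ever want to set the flavor or point to a specific agent explicitly, it can be done through configuration properties. A minimal sketch, assuming Spring Boot 2.x property names and the default DogStatsD port:

management.metrics.export.statsd.flavor=datadog
management.metrics.export.statsd.host=localhost
management.metrics.export.statsd.port=8125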

Just by adding this dependency, metrics collected by Micrometer will be published to DogStatsD. For timer metrics, although you can use Micrometer straight away, we recommend configuring the TimedAspect with Spring AOP so that you can use the @Timed annotation.

import io.micrometer.core.aop.TimedAspect;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class TimedConfiguration {

    // Registers the aspect that intercepts @Timed methods and records their timings in the MeterRegistry
    @Bean
    public TimedAspect timedAspect(MeterRegistry registry) {
        return new TimedAspect(registry);
    }
}

That’s all; we are now ready to use all of Micrometer’s timer capabilities.

Gathering metrics the simple way

The first and easiest approach is just to tell Micrometer that you want to gather metrics for a specific method. You can use the @Timed annotation on its own for that:

@Timed("example.metric.name")
public void socialImpactToMarkets() { ... }

With this approach, Micrometer will gather metrics for every execution of this method and later publish them so DogStatsD can pick up this information. Once DogStatsD has this data, it aggregates it in 10-second buckets and then pushes the metrics to DataDog. From then on, the metrics are available in DataDog for querying.

Once those metrics reach DataDog, we can see them as a histogram. You can think of a DataDog histogram as a group of metrics composed of the following:

  • example_metric.histogram.count: Number of times this metric was sampled during the interval
  • example_metric.histogram.avg: Average of the sampled values
  • example_metric.histogram.median: Median sampled value
  • example_metric.histogram.max: Maximum sampled value
  • example_metric.histogram.95percentile: 95th percentile sampled value

The key point here is that the metrics are aggregated in time buckets on the DogStatsD side, so whatever metrics DogStatsD processes in that slot of time are aggregated before they are sent to DataDog. This means that in DataDog we can't query metrics for custom slots of time. We can take an average, a median, a sum or other operations, but always over data that was previously aggregated. If you come from a Prometheus background, this is similar to the scrape interval, but in a push-based metric system.

For example, if you want the 95th percentile for the last week, you can take an average of all the 95th percentile values for every time bucket of that week. But this will be an approximation, not the real value. Not being the exact value doesn't mean it can't be helpful; it can still do the job of comparing how execution times evolve over time.
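
As an illustration of that approximation, a DataDog query along these lines averages the pre-aggregated 95th percentile buckets. This sketch assumes the timer reaches DataDog as example.metric.name with the standard histogram suffixes, and the env:prod tag is made up:

avg:example.metric.name.95percentile{env:prod}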

Calculating percentiles locally

A similar option for getting percentile metrics is to aggregate the data locally in the application with Micrometer. We can do this by setting the percentiles parameter of the @Timed annotation:

@Timed(value = "example.metric.name", percentiles = { 0.95, 0.75 })
public void socialImpactToMarkets() { ... }

When using percentiles, Micrometer will calculate the given percentiles locally before publishing the data to the statsd server. By default, it aggregates the data over a fixed period of time, one minute at the moment of writing this post (check the source code to see the current value).

The result on DataDog will be the same histogram we saw before, plus a new gauge metric for the percentiles calculated by Micrometer. This metric will have a phi tag to filter by the percentile values. In our example, this new metric will be named example.metric.name.percentile with tag values phi:0.95 and phi:0.75. Note that Micrometer only aggregates the data locally for the specified percentiles; it will keep pushing the rest of the metrics as in the previous approach.

This is similar to what DogStatsD does on its side, but it gives you more control over which percentiles to use and the size of the time bucket. It also lets you configure specific percentiles for each method you are measuring. Since this is done locally in the application, it can add a little extra load to your CPU and memory consumption. For the majority of cases this shouldn't be a problem, but if you are worried about it, I encourage you to review the Micrometer documentation.
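
If the annotation doesn't fit a particular piece of code, the same local percentiles can also be configured programmatically with Micrometer's Timer builder. A minimal sketch; the class and method names are just for illustration:

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

public class SocialImpactService {

    private final Timer socialImpactTimer;

    public SocialImpactService(MeterRegistry registry) {
        // Roughly equivalent to @Timed(value = "example.metric.name", percentiles = { 0.95, 0.75 })
        this.socialImpactTimer = Timer.builder("example.metric.name")
                .publishPercentiles(0.95, 0.75)
                .register(registry);
    }

    public void socialImpactToMarkets() {
        // Records the execution time of the wrapped code in the timer
        socialImpactTimer.record(() -> {
            // ... the actual work ...
        });
    }
}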

Using DataDog distributions

This last approach can really change the game. Enabling the histogram feature changes how the metrics are sent to DataDog; let's take a look:

@Timed(value = "example.metric.name", histogram = true)
public void socialImpactToMarkets() { ... }

In this case, Micrometer will publish the metrics as the Distribution type, which is a specific DogStatsD metric type. From the DataDog documentation:

The DISTRIBUTION metric submission type represents the global statistical distribution of a set of values calculated across your entire distributed infrastructure in one time interval.

That means no aggregation at all, neither in Micrometer nor on the DogStatsD side. This unleashes the tools to make powerful queries in DataDog. Once a distribution is in DataDog, we can apply a series of aggregation functions: count, sum, min, max, avg, p50, p75, p90, p95, and p99. This is where the power of Distributions resides: it allows you to aggregate the metrics in whatever way best suits each case. But it also means more data stored and indexed in DataDog, and in the cloud world that will probably have a monetary cost.

Advantages of distribution metrics

This kind of metric is very powerful in DataDog: we can use it to calculate percentiles and other values for any given time range. For example, using Distributions we could do things like the following (see the query sketches after the list):

  • Get the 95th percentile for the example.metric.name in the last 2 weeks.
  • Raise an alert if the 90th percentile is greater than 300 ms for the last hour.
  • Know how many executions took more than 500 ms.
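
Queries along these lines could cover the first two items. The metric name comes from our example and the service:markets tag is made up, so treat them as sketches rather than exact syntax for your setup:

p95:example.metric.name{service:markets}   (graphed over the last 2 weeks)
p90:example.metric.name{service:markets}   (used as the query behind a monitor with a 300 ms threshold)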

Using distributions allows us to run queries on the raw data, and therefore gives us metric values that are closest to reality. This is something we can't achieve with the other methods explained above, where we could only take the various 95th percentile values for fixed time buckets and apply operations like average, median or sum to get an approximation.

Decisions about metric details, like which percentiles to use or which aggregation windows, are postponed until we query the metrics, instead of being taken in advance when writing the code.

Furthermore, the different widgets that DataDog provides can take advantage of those queries, allowing you to fine-tune your dashboards. The distribution widget is particularly worth mentioning, since it can only be used with distribution metrics.

Distribution widget with the different percentiles

Enabling advanced query functionality

To be able to make advanced queries on distribution metrics in DataDog, it's necessary to enable them for every metric.

With great power comes great responsibility! Enabling this feature will add indexes in DataDog, and that could have a monetary cost. As usual in the cloud world, you should review your DataDog plan before enabling this feature widely across your system.
