Are Your Dropwizard Latency Metrics Misleading You?

How Rolling-Metrics and HdrHistogram Can Help

Will Tomlin
Expedia Group Technology
5 min read · Dec 5, 2017


If you use the Dropwizard metrics library, the chances are you’re using at least one timer or histogram to surface and report latency information. For those not familiar, a timer usefully combines a histogram of an event’s duration with a meter of its rate.
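
As a minimal sketch of a timer in use (handleRequest is a hypothetical stand-in for the work being timed):

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

MetricRegistry registry = new MetricRegistry();
Timer timer = registry.timer("requests");

// Closing the context records the elapsed time into the timer's histogram
// and marks its meter, updating both duration and rate in one step.
try (Timer.Context context = timer.time()) {
    handleRequest(); // hypothetical work being timed
}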

Internally, histograms use reservoirs to store a period of metric data upon which standard statistical information and percentiles are calculated. By default, a histogram uses an exponentially decaying reservoir (EDR), which:

…produces quantiles which are representative of (roughly) the last five minutes of data. It does so by using a forward-decaying priority reservoir with an exponential weighting towards newer data. Unlike the uniform reservoir, an exponentially decaying reservoir represents recent data, allowing you to know very quickly if the distribution of the data has changed.
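
For reference, the default EDR can be constructed explicitly; a sketch (1028 samples and an alpha of 0.015 are the library defaults behind registry.histogram(...)):

import com.codahale.metrics.ExponentiallyDecayingReservoir;
import com.codahale.metrics.Histogram;

// Equivalent to the default: 1028 samples, alpha 0.015, which biases the
// sample set towards roughly the last five minutes of data.
Histogram histogram = new Histogram(new ExponentiallyDecayingReservoir(1028, 0.015));
registry.register("latency", histogram);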

While this sounds perfectly reasonable, there are several potential shortcomings:

  • EDRs are lossy by design; they don’t store every sample (they’re statistically representative).
  • By default, EDRs store a fixed 1028 samples, weighted towards the past 5 minutes.
  • The rate at which samples decay within EDRs is influenced by how frequently the histogram is updated.

Combined, these shortcomings mean your reported metrics could be misleading: inaccuracies can result from discarded samples, and measurements can be calculated over potentially much older samples. It’s worth noting that Dropwizard provides alternative reservoir types; however, they’re typically not suitable for real-time reporting.

January 2019 update: Dropwizard 3.2.3 introduced SlidingTimeWindowArrayReservoir, a lossless, sliding time window-based reservoir implementation.
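
On a recent enough Dropwizard version, using it looks like this (a sketch; the 15-second window is an arbitrary choice here):

import java.util.concurrent.TimeUnit;
import com.codahale.metrics.Histogram;
import com.codahale.metrics.SlidingTimeWindowArrayReservoir;

// Losslessly retains every sample from exactly the last 15 seconds. Unlike
// EDR's fixed sample count, memory use grows with the update rate.
Histogram histogram = new Histogram(
    new SlidingTimeWindowArrayReservoir(15, TimeUnit.SECONDS));
registry.register("latencySltw", histogram);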

HdrHistogram (High Dynamic Range Histogram) is a lossless histogram implementation with configurable value precision that addresses the shortcomings of EDRs. HdrHistogram (HDRH) isn’t currently included as part of Dropwizard metrics, but thankfully there’s the very decent rolling-metrics library that integrates the two.

As part of an initiative to provide standardisation around metrics and reporting, Hotels.com’s Performance and Reliability Engineering Team performed a side-by-side comparison of EDR and HDRH (with rolling-metrics) by running several test scenarios. We built a Java-based test harness that simulates and reports configurable, millisecond-precision, Gaussian-distributed latencies. Here’s how the two histograms are configured:

import java.time.Duration;
import com.codahale.metrics.Histogram;
import com.codahale.metrics.Reservoir;
import com.github.rollingmetrics.histogram.HdrBuilder;
import com.github.rollingmetrics.histogram.OverflowResolver;

// 4 significant digits over 1ms-120,000ms (values above reduced to the max);
// the reservoir rolls over via 3 chunks of 15 seconds each.
HdrBuilder builder = new HdrBuilder();
Reservoir reservoir = builder
    .withSignificantDigits(4)
    .withLowestDiscernibleValue(1)
    .withHighestTrackableValue(120000, OverflowResolver.REDUCE_TO_HIGHEST_TRACKABLE)
    .resetReservoirPeriodicallyByChunks(Duration.ofSeconds(15), 3)
    .buildReservoir();
histogramHdr = new Histogram(reservoir);
registry.register("histogramHdr", histogramHdr);
histogramEdr = registry.histogram("histogramEdr"); // default EDR for comparison

HDRH is configured with 4 significant digits, giving the following range precision:

  • 0ms-9,999ms: 1ms resolution
  • 10,000ms-99,999ms: 10ms resolution
  • 100,000ms-999,999ms: 100ms resolution
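
To make the precision concrete, here’s a small sketch against HdrHistogram directly (standalone, not part of the harness code): 4 significant digits means values are tracked to within one part in 10,000 of their magnitude, which is where the per-decade resolutions above come from.

// A hypothetical, standalone example of HdrHistogram's precision guarantee.
org.HdrHistogram.Histogram h = new org.HdrHistogram.Histogram(1, 120000, 4);
h.recordValue(12345); // falls in the 10ms-resolution band above
System.out.println(h.getValueAtPercentile(100.0)); // ~12345, within precision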

To report on current data only, we also opt to use a sliding window-based resetting mechanism composed of a configurable number of chunks. The parameters of this strategy should be tuned according to reporting frequency and sensitivity requirements; see the rolling-metrics documentation for more information on resetting strategies.
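
For illustration, the other resetting strategies the rolling-metrics builder offers look roughly like this (a sketch; only one strategy applies per reservoir):

// Alternatives to the chunked reset used above (choose one per reservoir):
new HdrBuilder().neverResetReservoir();                  // accumulate since startup
new HdrBuilder().resetReservoirOnSnapshot();             // clear after every report
new HdrBuilder().resetReservoirPeriodically(Duration.ofMinutes(1)); // full periodic reset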

It’s worth noting that Dropwizard timers use nanosecond precision internally by default, which is not suitable for HDRH, as at most 5 significant digits are permitted. Unless you provide your own millisecond-precision timer, you’ll need to create histograms and meters directly and update them independently.
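
A sketch of what that looks like in practice (handleRequest is again a hypothetical stand-in):

import com.codahale.metrics.Meter;

// Measure in milliseconds, then update the HDRH-backed histogram for latency
// and a separate meter for rate: together these replace a Timer.
Meter meter = registry.meter("requestsRate");
long start = System.currentTimeMillis();
handleRequest(); // hypothetical work being measured
histogramHdr.update(System.currentTimeMillis() - start);
meter.mark();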

We ran 4 scenarios to highlight the shortcomings of EDR described earlier in this post and show how HDRH overcomes them.
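
All scenarios draw latencies from a Gaussian distribution. As a rough sketch of the idea (the harness itself isn’t shown here), each simulated sample might be generated and recorded like this:

import java.util.Random;

// Gaussian latency with a configurable mean and standard deviation,
// clamped to at least 1ms and recorded into both reservoirs.
Random random = new Random();
long latencyMs = Math.max(1, Math.round(350 + 75 * random.nextGaussian()));
histogramHdr.update(latencyMs);
histogramEdr.update(latencyMs);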

Scenario 1: Decay

  • Purpose: illustrate EDR’s decay effect in percentiles.
  • Constant 50 histogram updates/s.
  • Phase 1 (60s): latency distribution parameters: 350ms mean, 75ms standard deviation.
  • Phase 2 (60s): latency distribution parameters: 850ms mean, 75ms standard deviation.
  • Phase 3 (5m): latency distribution parameters per phase 1.
  • Conclusion: EDR’s decay leads to misleading, long-lasting reported measurements, which worsens at higher percentiles. HDRH reflects the change quickly thanks to its 15s window, and reacts faster still at lower percentiles.

Scenario 2: Spikes

  • Purpose: illustrate EDR’s lack of sensitivity in percentiles when faced with latency spikes.
  • Constant 50 histogram updates/s.
  • Phase 1 (60s): latency distribution parameters: 350ms mean, 75ms standard deviation.
  • Phase 2 (1s): latency distribution parameters: 850ms mean, 75ms standard deviation.
  • Phase 3 (30s): latency distribution parameters per phase 1.
  • Phases 2 and 3 repeated a further 4 times.
  • Phase 4 (5m): latency distribution parameters per phase 1.
  • Conclusion: EDR registers spikes at the 75th percentile, but they are indistinguishable or misleading at higher percentiles due to decay. In some instances EDR also reports less accurate measurements due to its lossy nature.

Scenario 3: Linger

  • Purpose: illustrate EDR’s longer decay effect when the histogram is updated less frequently.
  • Phase 1 (60s): 50 histogram updates/s, latency distribution parameters: 850ms mean, 75ms standard deviation.
  • Phase 2 (10m): 2 histogram updates/s, latency distribution parameters: 350ms mean, 75ms standard deviation.
  • Conclusion (reduced to 2 graphs as you probably get the picture now!): EDR’s decay effect is more pronounced, and in the case of the 99th percentile it takes over 5 minutes to report a value similar to HDRH’s.

Scenario 4: Lost Outliers

  • Purpose: illustrate how EDR can discard samples so that they never appear in reporting. For this test, we inject single min/max outliers and determine whether they are evident in the reported min/max statistics.
  • Constant 100 histogram updates/s.
  • Phase 1 (60s): latency distribution parameters: 850ms mean, 75ms standard deviation.
  • Phase 2: inject 60,000ms outlier and wait 30 seconds.
  • Phase 3: inject 1ms outlier and wait 30 seconds.
  • Phases 2 and 3 repeated a further 4 times.
  • Phase 4 (5m): latency distribution parameters per phase 1.
  • Conclusion: EDR occasionally fails to report outliers; when it does report them, the decay effect means they can be reported for longer periods. HDRH always reports them correctly. NB this is intended to illustrate that samples can go missing due to EDR’s lossy nature, not that they will in all circumstances; see Gil Tene’s discussion on this topic for more.

One word of warning: the required JVM heap footprint for an HDRH can be much larger than the fixed requirement of an EDR (depending on configuration), so test carefully before performing a mass change of reservoir implementations. On the flip side, our tests showed that JVM object allocation rates and CPU utilisation are both lower with HDRH.

Hopefully this post goes some way towards demonstrating how your EDR-based metrics could be misleading, and how HDRH (in conjunction with rolling-metrics) can give you greater accuracy and clarity on your metrics.
