Using Go 1.13’s new ReportMetric API to benchmark t-digest implementations

5 min readSep 11, 2019

Histograms are an important part of metrics and monitoring. Unfortunately, it is computationally expensive to accurately calculate a histogram or quantile value. A workaround is to use an approximation and a great one to use is t-digest. Unfortunately again, there’s no such thing as an accurate approximation. There are various t-digest implementations in Go but they do things slightly differently and there’s no way to track how accurate their approximations are over time and no great way for consumers of quantile approximation libraries to benchmark them against each other.

Using Benchmark.ReportMetric

Luckily, Go 1.13’s new ReportMetric API allows us to write traditional Go benchmarks against custom metrics, such as quantile accuracy. The API is very simple and pretty intuitive, taking a value and unit.

func (b *B) ReportMetric(n float64, unit string)

Values that scale linearly per number of operations, like time to do something or memory allocation, should divide the reported number by b.N. Since correctness isn’t a value that scales linearly, I calculate it individually for a specific number of values and report each as individual benchmarks.

Creating benchmark dimensions

The core problem is “How correct are tdigest implementations in various scenarios”. To quantify this problem I need to break it down into a dimension set and report all data along each dimension.

Dimension set: Implementation

I benchmark the following t-digest implementations for correctness:

To create a dimension out of implementations, I convert each implementation to a base interface

The type digestRun will be my dimension for the digest I’m testing against.

Dimension set: numeric series

How quantiles perform depends a lot on the order and nature of the input data. This makes the number series another dimension of my benchmark.

I decided to benchmark against the following patterns of numbers

linearly growing sequence
random values
a repeating sequence
an exponentially distributed sequence
tail spikes: mostly small values and a few large ones

The type sourceRun becomes my dimension for the source of the data.

Dimension set: size and quantile

The last and simplest dimension sets are how many values I add before I evaluate correctness and which quantile I evaluate correctness against. These become my dimensions size and quantile.

Combining the dimensions together

From all of these dimensions I can use SubBenchmarks to create my final benchmark. Notice how each dimension becomes a subbenchmark.

As I add dimensions to my benchmarks, I create another b.Run. I use the pattern key=value as recommended on the benchmark data format proposal. At the most inner loop of my subbenchmarks, I compare the actual quantile result, calculated with allwaysCorrectQuantile , to the quantile result for the given digest. To hide ns/op from my benchmark results, I report the special value 0 for b.ReportMetric(0, “ns/op”).

Running the benchmarks

Since these are go benchmarks, I can run them just like normal Go benchmarks. One thing I do for my benchmark runs is filter out all the tests so just the benchmarks run. Here is what my Makefile has.

Here is what one line of the benchmark output looks like

BenchmarkCorrectness/size=1000000/source=exponential/digest=influxdata/quantile=0.999000-8          1000000000          0.118 %difference

This lines of benchmark output gives us a datapoint.

Value=.118 %difference
benchmark=BenchmarkCorrectness
size=1000000
source=exponential
digest=influxdata
quantile=0.999

Using benchdraw

There is a large amount of benchmark output. To visualize it I use benchdraw. Benchdraw is a simple CLI made for Go’s benchmark output format and allows drawing “good enough” graphs and bar charts from dimensional data.

Here is an example command of benchdraw that visualizes the ns/op graph below. This benchdraw command plots the source dimension on the X axis and ns/op on the Y axis, leaving the digest dimension as bars.

benchdraw --filter="BenchmarkTdigest_Add" --x=source < benchresult.txt > pics/add_nsop.svg

You can find lots more documentation on the benchdraw github page.

Correctness results

The correctness of an implementation’s approximation algorithm depends a lot on the nature of the data and the quantile we’re looking at. Plotting all the data at once, without axis modifiers, shows there are clear outliers in the segmentio and influxdata implementation that make it difficult to compare. For all correctness results, lower (a smaller % difference) is better.

benchdraw --filter="BenchmarkCorrectness/size=1000000" --x=digest --y=%difference < benchresult.txt > pics/correct_allquant.svg

We can filter this down to just the 90% quantile. Here we see the tailspike distribution causes problems for both the segmentio and influxdata implementation.

benchdraw --filter="BenchmarkCorrectness/size=1000000/quantile=0.900000" --x=source --y=%difference < benchresult.txt > pics/correct.svg

Things look a lot better on the 99th percentile (the Y axis goes from 120% to .6%), but segmentio and influxdata’s correctness implementation is still outperformed by caio’s.

benchdraw --filter="BenchmarkCorrectness/size=1000000/quantile=0.990000" --x=source --y=%difference < benchresult.txt > pics/correct_99.svg

We can break down the graphs per tdigest implementation to see how they perform at each quantile. We see segmentio’s tailspike distribution performs poorly on the 90th percentile.

benchdraw --filter="BenchmarkCorrectness/size=1000000/digest=segmentio" --v=3 --x=source --y=%difference < benchresult.txt > pics/correct_segment.svg

Influxdata's implementation has a much smaller Y axis (cap at 30%), but still performs worse than normal on the 90th percentile.

benchdraw --filter="BenchmarkCorrectness/size=1000000/digest=influxdata" --v=3 --x=source --y=%difference < benchresult.txt > pics/correct_influx.svg

Caio's implementation does a reasonably good job at all data sources and quantiles, with a Y axis that caps at .03%

benchdraw --filter="BenchmarkCorrectness/size=1000000/digest=caio" --v=3 --x=source --y=%difference < benchresult.txt > pics/correct_caio.svg

Since none of the implementation’s do particularly poorly on the exponential distribution, we can narrow down into just that series of data to get results that may be more typical. Here we see segmentio’s implementation has a much larger % difference than either of caio or influxdata’s.

benchdraw --filter="BenchmarkCorrectness/size=1000000/source=exponential" --v=3 --x=digest --y=%difference < benchresult.txt > pics/correct_exponential_all.svg

Looking at influxdata’s implementation we can see it perform progressively poorly as the quantiles increase.

benchdraw --filter="BenchmarkCorrectness/size=1000000/source=exponential/digest=influxdata" --v=3 --x=quantile --y=%difference < benchresult.txt > pics/correct_exponential_influxdb.svg

For caio’s implementation, it performs well across many quantiles.

benchdraw --filter="BenchmarkCorrectness/size=1000000/source=exponential/digest=caio" --v=3 --x=quantile --y=%difference < benchresult.txt > pics/correct_exponential_caio.svg