# Using Go 1.13’s new ReportMetric API to benchmark t-digest implementations

Histograms are an important part of metrics and monitoring. Unfortunately, it is computationally expensive to accurately calculate a histogram or quantile value. A workaround is to use an approximation and a great one to use is t-digest. Unfortunately again, there’s no such thing as an accurate approximation. There are various t-digest implementations in Go but they do things slightly differently and there’s no way to track how accurate their approximations are over time and no great way for consumers of quantile approximation libraries to benchmark them against each other.

# Using Benchmark.ReportMetric

Luckily, Go 1.13’s new ReportMetric API allows us to write traditional Go benchmarks against custom metrics, such as quantile accuracy. The API is very simple and pretty intuitive, taking a value and unit.

`func (b *B) ReportMetric(n float64, unit string)`

Values that scale linearly per number of operations, like time to do something or memory allocation, should divide the reported number by `b.N`

. Since correctness isn’t a value that scales linearly, I calculate it individually for a specific number of values and report each as individual benchmarks.

# Creating benchmark dimensions

The core problem is “How correct are tdigest implementations in various scenarios”. To quantify this problem I need to break it down into a dimension set and report all data along each dimension.

## Dimension set: Implementation

I benchmark the following t-digest implementations for correctness:

To create a dimension out of implementations, I convert each implementation to a base interface

The type `digestRun`

will be my dimension for the digest I’m testing against.

## Dimension set: numeric series

How quantiles perform depends a lot on the order and nature of the input data. This makes the number series another dimension of my benchmark.

I decided to benchmark against the following patterns of numbers

- linearly growing sequence
- random values
- a repeating sequence
- an exponentially distributed sequence
- tail spikes: mostly small values and a few large ones

The type `sourceRun`

becomes my dimension for the source of the data.

## Dimension set: size and quantile

The last and simplest dimension sets are how many values I add before I evaluate correctness and which quantile I evaluate correctness against. These become my dimensions `size `

and `quantile`

.

# Combining the dimensions together

From all of these dimensions I can use SubBenchmarks to create my final benchmark. Notice how each dimension becomes a subbenchmark.

As I add dimensions to my benchmarks, I create another `b.Run`

. I use the pattern `key=value`

as recommended on the benchmark data format proposal. At the most inner loop of my subbenchmarks, I compare the actual quantile result, calculated with `allwaysCorrectQuantile`

, to the quantile result for the given digest. To hide `ns/op`

from my benchmark results, I report the special value 0 for `b.ReportMetric(0, “ns/op”)`

.

# Running the benchmarks

Since these are go benchmarks, I can run them just like normal Go benchmarks. One thing I do for my benchmark runs is filter out all the tests so just the benchmarks run. Here is what my Makefile has.

Here is what one line of the benchmark output looks like

`BenchmarkCorrectness/size=1000000/source=exponential/digest=influxdata/quantile=0.999000-8 1000000000 0.118 %difference`

This lines of benchmark output gives us a datapoint.

- Value=.118 %difference
- benchmark=BenchmarkCorrectness
- size=1000000
- source=exponential
- digest=influxdata
- quantile=0.999

# Using benchdraw

There is a large amount of benchmark output. To visualize it I use `benchdraw`

. Benchdraw is a simple CLI made for Go’s benchmark output format and allows drawing “good enough” graphs and bar charts from dimensional data.

Here is an example command of benchdraw that visualizes the ns/op graph below. This benchdraw command plots the `source`

dimension on the X axis and `ns/op`

on the Y axis, leaving the `digest`

dimension as bars.

`benchdraw --filter="BenchmarkTdigest_Add" --x=source < benchresult.txt > pics/add_nsop.svg`

You can find lots more documentation on the benchdraw github page.

# Correctness results

The correctness of an implementation’s approximation algorithm depends a lot on the nature of the data and the quantile we’re looking at. Plotting all the data at once, without axis modifiers, shows there are clear outliers in the segmentio and influxdata implementation that make it difficult to compare. For all correctness results, lower (a smaller % difference) is better.

`benchdraw --filter="BenchmarkCorrectness/size=1000000" --x=digest --y=%difference < benchresult.txt > pics/correct_allquant.svg`

We can filter this down to just the 90% quantile. Here we see the tailspike distribution causes problems for both the segmentio and influxdata implementation.

`benchdraw --filter="BenchmarkCorrectness/size=1000000/quantile=0.900000" --x=source --y=%difference < benchresult.txt > pics/correct.svg`

Things look a lot better on the 99th percentile (the Y axis goes from 120% to .6%), but segmentio and influxdata’s correctness implementation is still outperformed by caio’s.

`benchdraw --filter="BenchmarkCorrectness/size=1000000/quantile=0.990000" --x=source --y=%difference < benchresult.txt > pics/correct_99.svg`

We can break down the graphs per tdigest implementation to see how they perform at each quantile. We see segmentio’s tailspike distribution performs poorly on the 90th percentile.

`benchdraw --filter="BenchmarkCorrectness/size=1000000/digest=segmentio" --v=3 --x=source --y=%difference < benchresult.txt > pics/correct_segment.svg`

Influxdata's implementation has a much smaller Y axis (cap at 30%), but still performs worse than normal on the 90th percentile.

`benchdraw --filter="BenchmarkCorrectness/size=1000000/digest=influxdata" --v=3 --x=source --y=%difference < benchresult.txt > pics/correct_influx.svg`

Caio's implementation does a reasonably good job at all data sources and quantiles, with a Y axis that caps at .03%

`benchdraw --filter="BenchmarkCorrectness/size=1000000/digest=caio" --v=3 --x=source --y=%difference < benchresult.txt > pics/correct_caio.svg`

Since none of the implementation’s do particularly poorly on the exponential distribution, we can narrow down into just that series of data to get results that may be more typical. Here we see segmentio’s implementation has a much larger % difference than either of caio or influxdata’s.

`benchdraw --filter="BenchmarkCorrectness/size=1000000/source=exponential" --v=3 --x=digest --y=%difference < benchresult.txt > pics/correct_exponential_all.svg`

Looking at influxdata’s implementation we can see it perform progressively poorly as the quantiles increase.

`benchdraw --filter="BenchmarkCorrectness/size=1000000/source=exponential/digest=influxdata" --v=3 --x=quantile --y=%difference < benchresult.txt > pics/correct_exponential_influxdb.svg`

For caio’s implementation, it performs well across many quantiles.

`benchdraw --filter="BenchmarkCorrectness/size=1000000/source=exponential/digest=caio" --v=3 --x=quantile --y=%difference < benchresult.txt > pics/correct_exponential_caio.svg`

# Explore the data yourself

You can explore the exact data used and benchmarks at https://github.com/cep21/tdigestbench. Learn more about benchdraw itself at https://github.com/cep21/benchdraw.