Using Go 1.13’s new ReportMetric API to benchmark t-digest implementations
Histograms are an important part of metrics and monitoring. Unfortunately, it is computationally expensive to accurately calculate a histogram or quantile value. A workaround is to use an approximation and a great one to use is t-digest. Unfortunately again, there’s no such thing as an accurate approximation. There are various t-digest implementations in Go but they do things slightly differently and there’s no way to track how accurate their approximations are over time and no great way for consumers of quantile approximation libraries to benchmark them against each other.
Luckily, Go 1.13’s new ReportMetric API allows us to write traditional Go benchmarks against custom metrics, such as quantile accuracy. The API is very simple and pretty intuitive, taking a value and unit.
Values that scale linearly per number of operations, like time to do something or memory allocation, should divide the reported number by
b.N. Since correctness isn’t a value that scales linearly, I calculate it individually for a specific number of values and report each as individual benchmarks.
Creating benchmark dimensions
The core problem is “How correct are tdigest implementations in various scenarios”. To quantify this problem I need to break it down into a dimension set and report all data along each dimension.
Dimension set: Implementation
I benchmark the following t-digest implementations for correctness:
To create a dimension out of implementations, I convert each implementation to a base interface
digestRun will be my dimension for the digest I’m testing against.
Dimension set: numeric series
How quantiles perform depends a lot on the order and nature of the input data. This makes the number series another dimension of my benchmark.
I decided to benchmark against the following patterns of numbers
- linearly growing sequence
- random values
- a repeating sequence
- an exponentially distributed sequence
- tail spikes: mostly small values and a few large ones
sourceRun becomes my dimension for the source of the data.
Dimension set: size and quantile
The last and simplest dimension sets are how many values I add before I evaluate correctness and which quantile I evaluate correctness against. These become my dimensions
Combining the dimensions together
From all of these dimensions I can use SubBenchmarks to create my final benchmark. Notice how each dimension becomes a subbenchmark.
As I add dimensions to my benchmarks, I create another
b.Run. I use the pattern
key=value as recommended on the benchmark data format proposal. At the most inner loop of my subbenchmarks, I compare the actual quantile result, calculated with
allwaysCorrectQuantile , to the quantile result for the given digest. To hide
ns/op from my benchmark results, I report the special value 0 for
Running the benchmarks
Since these are go benchmarks, I can run them just like normal Go benchmarks. One thing I do for my benchmark runs is filter out all the tests so just the benchmarks run. Here is what my Makefile has.
Here is what one line of the benchmark output looks like
BenchmarkCorrectness/size=1000000/source=exponential/digest=influxdata/quantile=0.999000-8 1000000000 0.118 %difference
This lines of benchmark output gives us a datapoint.
- Value=.118 %difference
There is a large amount of benchmark output. To visualize it I use
benchdraw. Benchdraw is a simple CLI made for Go’s benchmark output format and allows drawing “good enough” graphs and bar charts from dimensional data.
Here is an example command of benchdraw that visualizes the ns/op graph below. This benchdraw command plots the
source dimension on the X axis and
ns/op on the Y axis, leaving the
digest dimension as bars.
benchdraw --filter="BenchmarkTdigest_Add" --x=source < benchresult.txt > pics/add_nsop.svg
You can find lots more documentation on the benchdraw github page.
The correctness of an implementation’s approximation algorithm depends a lot on the nature of the data and the quantile we’re looking at. Plotting all the data at once, without axis modifiers, shows there are clear outliers in the segmentio and influxdata implementation that make it difficult to compare. For all correctness results, lower (a smaller % difference) is better.
benchdraw --filter="BenchmarkCorrectness/size=1000000" --x=digest --y=%difference < benchresult.txt > pics/correct_allquant.svg
We can filter this down to just the 90% quantile. Here we see the tailspike distribution causes problems for both the segmentio and influxdata implementation.
benchdraw --filter="BenchmarkCorrectness/size=1000000/quantile=0.900000" --x=source --y=%difference < benchresult.txt > pics/correct.svg
Things look a lot better on the 99th percentile (the Y axis goes from 120% to .6%), but segmentio and influxdata’s correctness implementation is still outperformed by caio’s.
benchdraw --filter="BenchmarkCorrectness/size=1000000/quantile=0.990000" --x=source --y=%difference < benchresult.txt > pics/correct_99.svg
We can break down the graphs per tdigest implementation to see how they perform at each quantile. We see segmentio’s tailspike distribution performs poorly on the 90th percentile.
benchdraw --filter="BenchmarkCorrectness/size=1000000/digest=segmentio" --v=3 --x=source --y=%difference < benchresult.txt > pics/correct_segment.svg
Influxdata's implementation has a much smaller Y axis (cap at 30%), but still performs worse than normal on the 90th percentile.
benchdraw --filter="BenchmarkCorrectness/size=1000000/digest=influxdata" --v=3 --x=source --y=%difference < benchresult.txt > pics/correct_influx.svg
Caio's implementation does a reasonably good job at all data sources and quantiles, with a Y axis that caps at .03%
benchdraw --filter="BenchmarkCorrectness/size=1000000/digest=caio" --v=3 --x=source --y=%difference < benchresult.txt > pics/correct_caio.svg
Since none of the implementation’s do particularly poorly on the exponential distribution, we can narrow down into just that series of data to get results that may be more typical. Here we see segmentio’s implementation has a much larger % difference than either of caio or influxdata’s.
benchdraw --filter="BenchmarkCorrectness/size=1000000/source=exponential" --v=3 --x=digest --y=%difference < benchresult.txt > pics/correct_exponential_all.svg
Looking at influxdata’s implementation we can see it perform progressively poorly as the quantiles increase.
benchdraw --filter="BenchmarkCorrectness/size=1000000/source=exponential/digest=influxdata" --v=3 --x=quantile --y=%difference < benchresult.txt > pics/correct_exponential_influxdb.svg
For caio’s implementation, it performs well across many quantiles.
benchdraw --filter="BenchmarkCorrectness/size=1000000/source=exponential/digest=caio" --v=3 --x=quantile --y=%difference < benchresult.txt > pics/correct_exponential_caio.svg