Approximate Counting and Statistical Significance: Two Great Ideas That Don’t Play Nice
Approximate counting refers to a range of algorithms designed to count a large number of events using less space and computation, at the cost of some error in the count itself. A subset of these algorithms solves the distinctElements problem — estimating the cardinality of a set.
Statistical significance is a fundamental concept used to infer the difference between signal and noise. The idea is to determine if your data could have been generated by fluctuation unrelated to the signal you want to measure. If the likelihood is low, then chances are your data is a good representation of the underlying signal.
In this blog post, we’ll show that while approximate counting for distinctElements (hereafter just ‘approximate counting’) does provide good estimates of quantities like conversion rates and reach, it fails in a pretty big way on statistical significance calculations such as p-values.
Background on Approximate Counting
Popular approximate counting algorithms for distinctElements include HLL (Hyper-Log-Log), K’th Minimum Value, and Adaptive Sampling. They all rely on properties of a randomized hash function — in particular that the hash is uniformly distributed — to infer a measurement of the population (the whole set) from a carefully chosen sample.
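To make the uniform-hash primitive concrete, here is a minimal Python sketch. The function name `uniform_hash` and the choice of SHA-256 are my own illustration, not taken from any particular library:

```python
import hashlib

def uniform_hash(item: str) -> float:
    """Map an item to a pseudo-random point in [0, 1).

    Approximate-counting algorithms treat these values as i.i.d. uniform
    draws, so order statistics of the hashes stand in for a random sample
    of the whole population.
    """
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a 64-bit integer, then scale into [0, 1).
    return int.from_bytes(digest[:8], "big") / 2**64

# The same item always hashes to the same point, so duplicates collapse:
assert uniform_hash("visitor-42") == uniform_hash("visitor-42")
```

Because the hash is deterministic, repeat visits by the same element never add new sample points, which is exactly what makes distinct counting possible.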
They also all have similar error properties. Approximate counting estimates tend to be unbiased and normally distributed, with variance proportional to the count squared times a tuning parameter:

Var(n_hat) ≈ C_k n^2

Here, C_k is almost always proportional to 1/k, and k is set to trade off space and complexity on the one hand against error on the other.
For example, the K’th Minimum Value algorithm keeps track of the smallest hash codes seen among the unique elements and estimates cardinality by

n_hat = k / h_(k+1)

where h_(k+1) is the (k+1)’th smallest hash code. This estimate has variance

Var(n_hat) = n (n - k) / (k - 1)

which is roughly n^2 / k for large n and k.
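A small simulation sketch makes the estimator concrete. Since the hash codes of n distinct elements behave like n i.i.d. Uniform(0, 1) draws, we can stand in `random.random()` for the hash (the helper name `kmv_estimate` is my own):

```python
import random

def kmv_estimate(hashes, k):
    # n_hat = k / h_(k+1), where h_(k+1) is the (k+1)'th smallest hash in [0, 1).
    return k / sorted(hashes)[k]   # index k holds the (k+1)'th smallest

random.seed(0)
n, k = 100_000, 1_000
# The hashes of n distinct elements look like n i.i.d. Uniform(0, 1) draws.
hashes = [random.random() for _ in range(n)]
estimate = kmv_estimate(hashes, k)
# The relative error should be on the order of sqrt(1/k), here about 3%.
```

Note that only k + 1 numbers need to be stored to produce the estimate, which is the whole point of the method.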
The algorithms also differ on their ability to
- transition from exact counts to approximations once cardinality crosses some threshold
- combine multiple streams of data into one estimate
- do ad-hoc slicing to estimate cardinality of subpopulations
- perform more complex calculations than the cardinality of the union of streams, such as the cardinality of intersections.
Generally speaking, while HLL has good tradeoffs between space, complexity and error rates, it does not allow transitioning from exact counts to approximate, ad-hoc slicing, or more complex set operations.
In this report we focus on the K’th Minimum Value (KMV) which does allow all these things. For instance, while the true cardinality of the set is below k, exact counts are returned. It’s also pretty simple to implement.
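Here is a toy KMV sketch in Python illustrating the exact-to-approximate transition. This is an illustrative implementation of the general idea, not Optimizely’s production code; the class and method names are my own:

```python
import hashlib
import heapq

class KMV:
    """Toy K'th Minimum Value sketch: exact counts below k, approximate above."""

    def __init__(self, k: int):
        self.k = k
        self._heap = []      # max-heap (via negation) of the k+1 smallest hashes
        self._kept = set()   # hashes currently in the heap, for deduplication

    @staticmethod
    def _hash(item: str) -> float:
        # Map each item to a pseudo-uniform point in [0, 1).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item: str) -> None:
        h = self._hash(item)
        if h in self._kept:
            return                          # duplicate of a kept element
        if len(self._heap) <= self.k:       # still collecting the k+1 smallest
            heapq.heappush(self._heap, -h)
            self._kept.add(h)
        elif h < -self._heap[0]:            # smaller than the current (k+1)'th min
            evicted = -heapq.heappushpop(self._heap, -h)
            self._kept.discard(evicted)
            self._kept.add(h)

    def estimate(self) -> float:
        if len(self._heap) <= self.k:       # fewer than k+1 distinct items seen
            return float(len(self._heap))   # exact count
        return self.k / -self._heap[0]      # n_hat = k / h_(k+1)
```

With k = 100 the sketch stores at most 101 hashes, yet it estimates cardinalities in the tens of thousands to within roughly 1/sqrt(k) ≈ 10%.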
More algorithms, implementation notes, and general theory can be found in a fantastic recent paper from Yahoo labs.
We simulated randomized controlled trials (RCTs) with baseline and variation conversion rates both equal to 0.5, in other words A/A tests, and visitor counts, n = 1K, 10K, and 100K. Visitor counts were then approximated using KMV with k = 100 for each n, and also k = 1K and 10K for n = 100K. Each combination of k and n values was repeated 1,000 times.
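One arm of this simulation can be sketched as follows, with the KMV step idealized by drawing uniform hashes directly (`kmv_count` is a hypothetical helper for illustration, not the code used to produce the plots):

```python
import random

def kmv_count(n_true, k, rng):
    """Hypothetical helper: idealized KMV estimate of n_true distinct visitors."""
    if n_true <= k:
        return float(n_true)           # KMV returns exact counts below k
    hashes = sorted(rng.random() for _ in range(n_true))
    return k / hashes[k]               # n_hat = k / h_(k+1)

rng = random.Random(1)
n, k, p = 10_000, 100, 0.5             # an A/A test: both arms convert at 0.5
conversions = sum(rng.random() < p for _ in range(n))
cr_true = conversions / n              # conversions over the exact visitor count
cr_approx = conversions / kmv_count(n, k, rng)
pct_diff = 100 * (cr_approx - cr_true) / cr_true
```

Repeating this 1,000 times per (k, n) pair and histogramming `pct_diff` gives the plots described below.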
The first two plots below show histograms of the percent difference between the true observed conversion rate,

CR = (# conversions) / n,

and the approximated one,

CR_hat = (# conversions) / n_hat.
As k increases the approximation error decreases, and for fixed k it remains roughly constant as n increases.
This coincides well with standard theory, which shows

SD(CR_hat / CR) ≈ sqrt(C_k) + O(C_k)

that is, the standard deviation of the approximate counting conversion rate, relative to the actual one, drops like the square root of C_k, plus terms that go to zero at a rate proportional to C_k. We see that for KMV, by the time we look at the 10K’th minimum value (out of 100K visitors), we are no more than a few percent off in approximating conversion rates.
Similar results hold for quantities like reach, variance estimates, etc, but unfortunately, significance calculations do not enjoy the same positive results.
In addition to conversion rates, we also calculated p-values for each simulated RCT. A p-value for an RCT is the probability of seeing a difference between variation and baseline conversion rates at least as extreme as the one observed, assuming the true difference is zero and any observed difference is random fluctuation. The particular p-values we use are the sequential ones computed by Optimizely’s Stats Engine, but as we show in the next section, the following behavior is typical of any p-value.
Below is a scatter plot comparing approximate p-values (replacing n by n_hat in the formulas) to regular p-values on the same battery of simulated tests. Ideally, you’d expect points to line up on the dotted identity line, or at least be close. Instead, there is practically no relationship. Running standard linear regressions of regular p-values on approximate ones supports this.
Why is significance so impacted?
We can understand what’s going on by looking at the t statistic, defined as

t = (CR_v - CR_b) / SD

where the v and b subscripts denote variation and baseline, and SD is the estimated standard deviation of the difference in conversion rates. The t statistic is a fundamental quantity in calculating significance for A/B tests because it measures the signal (difference in conversion rates) relative to the noise (standard deviation). All A/B tests make a significant call when they decide there is a large amount of signal relative to the anticipated noise in the test. Indeed, Stats Engine calculates significance as an increasing function of the t statistic: holding everything else fixed, an increase in t brings us closer to high significance. The full formula can be found by re-ordering equation (2) in Optimizely’s Stats Engine whitepaper.
Focusing on conversion rate differences, it’s well known that without approximate counting,

SD = sqrt( CR_b (1 - CR_b) / n_b + CR_v (1 - CR_v) / n_v ), with CR = n_e / n

where n_e denotes the number of conversions. The key feature of the formula is that SD gets smaller and smaller as both n_b and n_v increase. Combining this with the t statistic and significance equations, we find that we can detect smaller differences with more visitors.
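A quick numeric sketch of that behavior, assuming the usual unpooled standard error for a difference of proportions (the function name is my own):

```python
import math

def t_statistic(conv_b, n_b, conv_v, n_v):
    """Two-sample t statistic for a difference in conversion rates,
    using the unpooled standard error."""
    cr_b, cr_v = conv_b / n_b, conv_v / n_v
    sd = math.sqrt(cr_b * (1 - cr_b) / n_b + cr_v * (1 - cr_v) / n_v)
    return (cr_v - cr_b) / sd

# Same observed lift (50% vs 53%), ten times the traffic:
t_small = t_statistic(500, 1_000, 530, 1_000)
t_large = t_statistic(5_000, 10_000, 5_300, 10_000)
# The t statistic grows by a factor of sqrt(10), so the same lift
# that was ambiguous at 1K visitors is clearly significant at 10K.
```

This is the sense in which more visitors let us detect smaller differences: the signal stays put while the noise estimate shrinks like 1/sqrt(n).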
But when we use approximate counting, n_b and n_v are themselves approximations, and so they contribute to the overall noise. A standard method for approximating more complicated variances, known as the delta method, lets us see the impact of this additional noise:

SD_approx^2 ≈ SD^2 + C_k (CR_b^2 + CR_v^2)

With approximate counting, the noise in our measurement of the effect size is higher than the t statistic anticipates. We might consider the most obvious fix and inflate the noise estimate:

t = (CR_hat_v - CR_hat_b) / sqrt( SD^2 + C_k (CR_b^2 + CR_v^2) )
But this leads to a pretty big problem: the extra noise from approximate counting does not shrink as visitors accumulate, so there is now a lower bound on how small an effect we can detect, and that bound is directly linked to how sensitive we make our approximate counting algorithm.
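We can see the lower bound numerically. The sketch below assumes the delta-method inflation described above with C_k = 1/k; `adjusted_sd` is a hypothetical helper for illustration, not Stats Engine code:

```python
import math

def adjusted_sd(cr_b, n_b, cr_v, n_v, c_k):
    """Hypothetical helper: delta-method SD when visitor counts are approximated."""
    sampling = cr_b * (1 - cr_b) / n_b + cr_v * (1 - cr_v) / n_v
    counting = c_k * (cr_b ** 2 + cr_v ** 2)   # does not shrink with n
    return math.sqrt(sampling + counting)

c_k = 1 / 100                                  # e.g. KMV with k = 100
noise_floor = math.sqrt(c_k * (0.5 ** 2 + 0.5 ** 2))
for n in (1_000, 100_000, 10_000_000):
    print(n, round(adjusted_sd(0.5, n, 0.5, n, c_k), 4))
# The SD decreases toward a floor of about 0.0707 but never below it,
# so effects much smaller than the floor can never reach significance.
```

No amount of additional traffic drives the adjusted SD below `noise_floor`, which is why the sensitivity of the test is capped by the counting parameter k rather than by n.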
An additional cost of approximate counting is decreased sensitivity of statistical significance to detect effect sizes below the approximation noise.
In fact, since C_k is generally proportional to 1/k, we could achieve the same trade-off between space, complexity and signal detection by simply setting a maximum sample size of k for A/B tests.
While approximate counting is a great method to cheaply estimate quantities like conversion rates and reach, as soon as these quantities are used for statistical inference — trying to distinguish signal from noise — the usefulness of such methods is diminished. Once again we are faced with the classic adage, there is no free lunch.
We also plotted the percent difference histograms for the t statistic (before “fixing” it):
The histogram is of the absolute value of the percent difference, so values to the left (towards 0 percent difference) are good, and values to the right are bad. Notice that the x-axis is on a logarithmic scale, meaning it is not uncommon to see approximate t statistics more than a factor of 10 off from the actual value.
Second, increasing k does not visibly reduce the approximation error, and there seems to be a variable amount of bias as either k or n changes. This is another point of evidence for caution in using approximate counting to calculate significance.
Psst. Like solving problems like this? We’re hiring!