The law of large numbers is really the law of small weights

Allison Bishop
Published in Proof Reading
May 4, 2023

TL/DR: We present a framework for heuristically calculating what kinds of sample sizes are needed to precisely evaluate broker performance at the parent order level. We observe that for computing weighted averages, the number of orders is an inadequate metric of data quality, and that the inverse of the sum of the squared weights is more relevant. Performing back-of-the-envelope calculations, we argue that this metric probably needs to be above 3000 to get good estimates of aggregate stats at the parent order level for orders lasting nearly a full trading day, and this may require sample sizes of around 10000 or above. And all of this is after filtering down to orders that are similar enough to be reasonably aggregated.

There are few easy questions in mathematics. Even seemingly easy ones can lead down deep rabbit holes: why are there infinitely many prime numbers? Why is dividing by something the same as raising it to a power of negative one? But here is an easy one I was asked by my friend in college who was perusing a scientific paper. “Is it valid to apply the Central Limit Theorem to a sample of 5 dolphins?” she asked. I looked at her in shock.

“No,” I said.

“Oh, and one of the dolphins is pregnant,” she added. “So soon it will be six?”

“Still no,” I said.

She held up the text of the paper to show me where the scientists were invoking the Central Limit Theorem to interpret their “statistical” results. I sighed. The Central Limit Theorem is a bedrock of statistics and a crowning achievement of mathematics. It was painful to see it abused in this way. Like watching your favorite actor forced to play bit parts in B-movies after a tax scandal.

The name itself, “The Central Limit Theorem,” is unapologetically proud. Brazen, even, in its assertion of centrality, but also so generic-sounding that one should be forgiven for intermittently forgetting what it means. If you need a helpful cue, I like to sing the lyrics “Everything is Gaussian” to the tune of “Everything is Awesome” from the Lego movie.

A more technical explanation is the following. Let’s imagine we take many independent samples of a probability distribution: playing the lottery over and over again, repeatedly rolling a die, continually betting on sports, trading stocks, etc. Then we average the results. The authoritative-sounding “Law of Large Numbers” says that as our sample size gets bigger and bigger, our average gets closer and closer to whatever the ground truth is. In other words, time will separate skill from luck. The Central Limit Theorem goes even further, and tells us something about the shape of the error distribution of our calculated average, asserting that it becomes the familiar bell shape of a Gaussian distribution.
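If you’d rather see this numerically than take it on faith, here is a minimal simulation sketch. The exponential distribution, the sample sizes, and the number of trials below are arbitrary illustrative choices, nothing to do with trading data:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# A decidedly non-Gaussian starting point: an exponential distribution with mean 1.
for n in [5, 50, 5000]:
    # Draw 10,000 independent samples of size n and average each one.
    sample_means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    # The averages cluster around the true mean of 1 (law of large numbers),
    # their spread shrinks like 1/sqrt(n), and their histogram looks
    # increasingly bell-shaped (central limit theorem).
    print(f"n={n:5d}  mean of averages={sample_means.mean():.3f}  "
          f"std of averages={sample_means.std():.3f}  (theory: {1/np.sqrt(n):.3f})")
```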

It has always been surprising to me, this common shape — given that independent random processes can model a wide range of vastly different things. I personally make sense of this by thinking about random processes as raw materials, and about the averaging as a sculpting process. The broad shape of the outcome says more about our sculpting process than it does about the raw material, perhaps. But however we conceptualize this phenomenon, it is crucial to remember that it only applies eventually. Just as it may be impossible to tell what a sculpture will look like from the first few chisel strikes, the Gaussian shape does not arise until we have enough samples.

In some sense, we know this. We know that a terrible baseball team will still win a few games. We know that one flirty text does not mean they like you. But the human hunger for meaning is so strong, it’s nearly impossible to resist. We text back. We become Cubs fans. We try to apply the Central Limit Theorem to a sample of 5 dolphins.

When we do exercise the discipline to wait for more data, a natural question arises: how much data would be enough? Sometimes this question has obvious units, like how many baseball games have to be played until we can tell if it will be a good season? But for evaluating trading performance of an execution algo, even the choice of units is unclear. Is it the number of orders that matters most as a measure of data quality? The number of shares traded? The number of dollars? To what extent does it matter how these things are divided over orders?

These questions are extremely relevant to Proof’s efforts to evaluate our own trading performance. We have strong robustness checks in place to prevent us from drawing spurious conclusions from small data sets, and these checks tell us that our current data set is insufficient for producing reliable precision estimates of headline stats like parent order slippage vs. arrival. Nonetheless, we can begin to estimate how much data we would need to produce these kinds of stats robustly.

To get a sense of what level of data we would need to get a reliable evaluation of trading performance, we have to go back to reasons why the law of large numbers and the central limit theorem hold in other circumstances. For data like a season of baseball games or a set of patients in a clinical trial, there are two key things that set us up for statistical success. One is that the data points can be treated as independent but similar events. Another is that we care equally about each data point, and so we weight each data point equally in computing an average outcome.

Putting this into math-ese, we model our data as a series of independent, identically distributed random variables X₁, …, Xₙ. We’ll use the symbol μ to represent the real answer for the expectation of the distribution for each Xᵢ, and this is what we are trying to estimate by computing (X₁+ … + Xₙ)/n. We’ll use the symbol σ² to represent the variance of the distribution:
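σ² := E[(Xᵢ − μ)²], where μ := E[Xᵢ]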

The law of large numbers says that as n gets larger and larger, the computed average gets closer and closer to μ. Also, we can compute the expected squared error in our estimate of μ as a vanishing function of n. One trick to make this cleaner is to define centered variables whose expectations are equal to 0:
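Yᵢ := Xᵢ − μ, so that E[Yᵢ] = 0 for each i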

Then we have:
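E[((X₁ + … + Xₙ)/n − μ)²] = E[((Y₁ + … + Yₙ)/n)²] = (1/n²) · Σᵢ Σⱼ E[Yᵢ Yⱼ]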

Whenever i and j are not equal, the expected product here is 0 because the variables are independent and centered. Thus we have:
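(1/n²) · Σᵢ E[Yᵢ²] = (1/n²) · n · σ² = σ²/n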

This shows that the expected squared error decays like 1/n. So if we want our expected squared error to be less than some bound B, all we have to do is estimate the variance σ² and require n to be large enough so that σ²/n is at most B. In other words,
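n ≥ σ²/B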

Easy! We know now how much data we should wait for. [Aside: a statistics fan could object that the expected squared error being bounded is not convincing on its own; we’d also want concentration bounds around it. Fine. I love your enthusiasm. Feel free to interpret this as a lower bound on the value of n you would want and go nuts deriving additional bounds.] [Aside to the aside: wouldn’t “the Statistics” be a good name for a baseball team? Then you could be a “Statistics fan” because you are a statistics fan! Ok, I’ll stop.]

If we start applying this logic to parent order performance in algo trading, however, several things can go wrong. Let’s work this out for the example of computing average slippage vs. arrival in bps. We’ll let Xᵢ here represent the slippage for the i-th order in our data set. Right away we must confront some tough questions. Is it reasonable to treat the orders as independent from each other? Maybe, but maybe not. If the same stock is traded in several orders at the same time, for example, or in close succession, that might create impactful dependencies. Let’s suppose we deal with this by filtering down to a set of orders that seem plausibly independent.

Next, is it reasonable to think of the Xᵢ values as being sampled from the same underlying distribution? Again, maybe, but maybe not. We might expect the distribution to vary based on the size of the order compared to the average daily volume in that stock, for example, and to vary based on order duration and other factors. We might need to filter our data further so that we are averaging over smaller groups of similar orders, rather than computing one average over everything.

Let’s suppose we account for this, and we get down to a set of orders that are plausibly independent and coming from the same distribution. We still probably don’t care equally about each of these orders: some may represent much more notional value than others, for example. We may reflect this by taking a weighted average in our computation of slippage:
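weighted average = w₁X₁ + … + wₙXₙ, where the weights wᵢ ≥ 0 sum to 1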

In the case where we do weight each order equally, we would set wᵢ = 1/n for each i and the formula above would be identical to our previous computation. But typically, we would set the wᵢ values proportionally to the notional value that each order represents. This will result in weights that may differ quite dramatically from each other, even by orders of magnitude. Let’s take a look at how such weights affect the expectation of the squared error in our estimate of μ:
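E[(w₁X₁ + … + wₙXₙ − μ)²] = E[(w₁Y₁ + … + wₙYₙ)²] = Σᵢ Σⱼ wᵢ wⱼ E[Yᵢ Yⱼ] = σ² · (w₁² + … + wₙ²)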

It turns out that the sum of the squares of the weights is smallest when all of the weights are equal to 1/n. The reason for this is easiest to see if we think geometrically: a square of side length 3 and a square of side length 1 together cover more area than two squares of side length 2 each, for example. So if we are fixing a total sum of side lengths (or analogously fixing that fractional weights must sum to 1), we cover the least area when we make things balanced. At the opposite extreme, if a single weight is equal to 1 and all of the others are equal to 0, then the sum of the squared weights is equal to 1, which is not decreasing as a function of n.

Intuitively, this makes sense. If a smaller subset of our orders dominates in terms of weight, our expected error is going to reflect that essentially smaller sample size. This means that the quantity
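1/(w₁² + … + wₙ²)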

is a more relevant measure of our data quality than n. So for weighted averages at least, the Law of Large Numbers should really be the Law of Small Weights!

All of this raises the question: what sort of values for σ² and the inverse of the sum of the squared weights do we expect to see in practice? For σ² at least, we can look to historical market data for a back-of-the-envelope guess. In reality, we would expect σ² to be affected by choices we make in our trading algorithms, but at a minimum, it’s hard to believe that we would achieve significantly less variance in our outcomes than if we perfectly matched the market’s volume-weighted average price (vwap) over the duration of our trading activity for each parent order.

To roughly estimate the variance inherent in the market vwap for a specified order duration of t seconds, we’ll break the regular trading day into disjoint time intervals of t seconds each. For each of the top 500 stocks in terms of notional value traded, we’ll compare the vwap over each interval to the price of the first trade in the interval. Subtracting the first trade price from the vwap and dividing by the first trade price will give us a hypothetical slippage value in bps. We’ll aggregate over time intervals and over stocks by weighting each data point proportionally to the notional value traded to get an estimate of σ².
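For concreteness, here is a rough sketch of that estimation in code. It assumes a per-stock trades table with columns named timestamp, price, and size; those names, and the pandas-based approach, are illustrative placeholders rather than our actual pipeline:

```python
import pandas as pd

def interval_slippage_bps(trades: pd.DataFrame, t_seconds: int) -> pd.DataFrame:
    """For one stock, compute hypothetical vwap slippage (in bps) for each disjoint
    t-second interval. Assumes columns 'timestamp' (datetime), 'price', and 'size'."""
    trades = trades.sort_values("timestamp").copy()
    trades["notional"] = trades["price"] * trades["size"]
    # Bucket each trade into a disjoint t-second interval.
    # (A real version would align buckets to the 9:30 open; this naive floor is a simplification.)
    bucket = trades["timestamp"].dt.floor(f"{t_seconds}s")
    grouped = trades.groupby(bucket)
    vwap = grouped["notional"].sum() / grouped["size"].sum()
    first_price = grouped["price"].first()  # price of the first trade in each interval
    slippage_bps = (vwap - first_price) / first_price * 10_000
    return pd.DataFrame({"slippage_bps": slippage_bps,
                         "notional": grouped["notional"].sum()})

def estimate_sigma_squared(per_stock_intervals: list[pd.DataFrame]) -> float:
    """Aggregate intervals across stocks, weighting each data point by its notional,
    to get a rough estimate of sigma^2 in squared bps."""
    data = pd.concat(per_stock_intervals)
    w = data["notional"] / data["notional"].sum()
    weighted_mean = (w * data["slippage_bps"]).sum()
    return float((w * (data["slippage_bps"] - weighted_mean) ** 2).sum())
```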

Doing this for the first quarter of 2023, here are the estimates we get for t = 60 (one minute), t = 600 (ten minutes), t = 1800 (thirty minutes), and t = 23400 (one full regular trading day). Since the units of “squared bps” are a little hard to internalize, we also include σ by taking a square root:

As a plausible example then, let’s suppose we have a set of nearly full day orders in somewhat similar or calmer market conditions, e.g. imagine σ² = 30000. Then our expected squared error is going to be 30000 times the value of the sum of the squared weights. So if we want to get our expected squared error below 10 for example, we’ll need a data set with
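w₁² + … + wₙ² ≤ 10/30000 = 1/3000, i.e. 1/(w₁² + … + wₙ²) ≥ 3000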

Naturally, we’ll next want to know how the values of the weights might behave in practice. In particular, how much smaller than n will this quantity be? This depends, of course, on the distribution of orders, but we can consider some examples. Let’s suppose, for instance, that our weights are proportional to a range from 1 to 10, with roughly 1/10 of the weights being proportional to each value. In this case, when n = 3000, the inverse of the sum of the squared weights is more like 2350. If we want to reach 3000 with weights like this, we’ll need something like n = 3800, which is not that much worse.

Things can get much worse, however, with different weights. For example, let’s suppose that a third of our weights are proportional to 1, a third are proportional to 10, and a third are proportional to 100. In this case, when n = 3000, the inverse of the sum of the squared weights is more like 1200. If we want to reach 3000 with weights like this, we’ll need something more like n = 7500, which is more than double. It is not too far-fetched to imagine that weight distributions in practice could add factors of 3, 5, or even 10 to the amount of data we will need to get good estimates.
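To sanity-check these numbers, here is a small sketch that computes the inverse of the sum of the squared weights for both examples (the weight mixes are just the illustrative ones described above):

```python
import numpy as np

def effective_sample_size(raw_weights: np.ndarray) -> float:
    """Inverse of the sum of squared weights, after normalizing the weights to sum to 1."""
    w = raw_weights / raw_weights.sum()
    return 1.0 / np.sum(w ** 2)

n = 3000
# Example 1: weights proportional to 1 through 10, in equal proportions.
mild = np.repeat(np.arange(1, 11), n // 10)
# Example 2: a third of the weights proportional to 1, a third to 10, a third to 100.
skewed = np.repeat([1, 10, 100], n // 3)

print(effective_sample_size(mild))    # roughly 2350
print(effective_sample_size(skewed))  # roughly 1200
```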

Given that our actual performance is likely to display more variance than market vwap prices relative to arrival, and our weights are likely to vary considerably, and we haven’t even done concentration bounds here, it is probably safe to say that we would need values of n in the range of 10000 or higher to compute high quality slippage estimates in units of bps. And we must keep in mind, this is after we have filtered our data set down to a subset of orders that are plausibly independent in behavior and plausibly apples-to-apples in terms of their performance distribution. Ideally this would mean all of these orders happen relatively close in time, under similar market conditions. This back-of-the-envelope calculation is consistent with our current experience of having considerably less data than this for things like TCA reports, and seeing non-robust results at the parent order level.

Waiting for this level of data as a relatively new firm is frustrating, and we understand why potential clients can be similarly frustrated that we can’t yet provide these kinds of performance estimates. But if you simply can’t stand the wait, there’s an interesting paper out there somewhere about a sample of five dolphins you can read to keep you occupied. Please don’t believe the part that uses the central limit theorem though. Poor Gauss would be rolling over in his grave.
