Statistics, noise, and the lessons of baseball

Allison Bishop
Proof Reading
Apr 5, 2021

When I was fourteen, my father took me to my first major league baseball game. It was at Yankee Stadium in that magical year of 1998, when the Yankees set a franchise record of 114 wins and swept the World Series. It happened to be August 13th: the day Orlando “El Duque” Hernandez set the rookie strikeout single-game record at 13. I was hooked. I closely followed the rest of the season and came to know all of the lovable quirks of the 1998 Yankees squad: Tino Martinez’s inability to steal bases, and the constant presence of a jar of red hot candies sitting on the bench in their dugout that never seemed to attract any takers. The announcers were obsessed with it. Where did it come from? Whose job was it to bring it out for each game? And why was it there if no one ever ate any?

The sport of baseball makes strange bedfellows of a love of numbers and a love of superstition. It is home to evangelists of statistics and devotees of tradition. And as skillful announcers fill its many loaded silences, they lay bare our hunger for meaningful stories with just the right amount of suspense. We desperately want the odds to mean enough but not too much — we want them to shape the action but not predetermine it. Everything about the game is calibrated to this goal — any individual hit is unlikely, but so is going too long without at least a few occurring. Each game is long enough for pitchers to grow tired, forcing in fresh blood and fresh plot points. And each season is crafted to pit the same teams against each other again and again, creating individually coherent arcs that thread together into a satisfying whole.

In its insistent repetition, baseball provides a fertile ground for analytics. So fertile in fact, that it tends to sprout a few weeds. “This pitcher tends to strike out left-handed batters when pitching at home on high humidity days, at least in 5 out of his last 7 appearances,” the announcer says. But whatever subset of the stats we may dismiss as cherry-picked or silly, it doesn’t shake our core belief in the meaning of the overall exercise.

The allure of “Moneyball”:

This frame of mind is so appealing that we try to carry it over into many other domains. “It’s like Moneyball for …” has become a convenient shorthand for data science enthusiasts. But baseball is a delicately crafted stage — very few settings in real life have the same degree of repetition of nearly identical conditions. It is not surprising that statistics that behave quite well in baseball may behave quite poorly in politics, for example, where we have small sample sizes of elections and external conditions that change dramatically over time. Overextending the statistical lessons of baseball ironically reinforces the philosophical lessons of baseball: we should believe data, yes. But we should not believe it too much. In particular, we cannot afford to neglect the story of how it is collected, by whom, and for what purpose. That story is, after all, much more interesting. What’s the fun of watching the World Series if you haven’t followed the journey along the way?

It is perhaps appropriate that baseball season has arrived once again as Proof begins trading for pilot customers. All of this looms closely in my mind as we pore eagerly over the data of our early trades to learn what we can about our performance. If you zoom in narrowly enough on the stock market, it looks like baseball: repeating interactions between various types of traders, with microsecond-level dynamics that can be measured as cleanly as the speed of a fastball. You can stretch the analogy even further perhaps — viewing an executing broker like Proof as a pitcher, pitching orders under the direction of a client who acts like a manager, taking them in and out of the game. Like a pitcher, a broker’s actions are not the sole determinant of the outcome. You can do everything well and still lose because of the actions of the players around you, on both your team and the opposing team. A stat like win-loss record can be a fairly noisy metric of pitcher quality, as it doesn’t reflect the variables outside of the pitcher’s control, and there can be bias in the kinds of situations different pitchers happen to be brought into. Nonetheless, we tend to view such statistics as informative on a holistic level for pitchers who have pitched a sufficiently large number of innings. Perhaps we do the same with trading execution stats like slippage vs. arrival and slippage vs. VWAP.
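For readers less familiar with these benchmarks, here is a minimal sketch of how the two stats are commonly computed for a single order. The sign convention and the example prices are illustrative assumptions, not Proof’s reporting methodology.

```python
# Illustrative sketch of two common execution-quality stats for a parent order.
# Sign convention (an assumption here): positive slippage = worse than the benchmark.

def slippage_vs_arrival_bps(avg_fill_price: float, arrival_price: float, side: str) -> float:
    """Execution price vs. the price when the order arrived, in basis points."""
    sign = 1.0 if side == "buy" else -1.0
    return sign * (avg_fill_price - arrival_price) / arrival_price * 10_000

def slippage_vs_vwap_bps(avg_fill_price: float, interval_vwap: float, side: str) -> float:
    """Execution price vs. the market's volume-weighted average price over the order's interval."""
    sign = 1.0 if side == "buy" else -1.0
    return sign * (avg_fill_price - interval_vwap) / interval_vwap * 10_000

# Example: a buy order filled at 100.05 when the stock was 100.00 on arrival
# and the interval VWAP was 100.02.
print(slippage_vs_arrival_bps(100.05, 100.00, "buy"))  # 5.0 bps worse than arrival
print(slippage_vs_vwap_bps(100.05, 100.02, "buy"))     # ~3.0 bps worse than VWAP
```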

The noisier reality:

But if you zoom out a bit, the stock market is a lot less like baseball. It’s not one game happening at a time, it’s roughly 10,000 games, all happening at once on the same field and interfering with one another! And there are powerful external forces in the stands and beyond affecting the games — like torrential rains that quickly come and go, powerful winds that gust up specifically on certain portions of the field, and wildly varying levels of crowd noise. The view from the mound is dramatically more complex. Now the action in your game is constantly facing spillover effects from the other games and the environment, and your manager is pulling you in and out of different parts of different games at a dizzying pace. You start to suspect that your wins and losses may have more to do with the luck of when and where you get put in than with your technique and strategy.

In the heavy noise of the market environment, what can brokers and clients do to mitigate the role of chance and draw meaningful lessons from broker performance? A basic starting point is to inventory the primary sources of noise in our metrics, and then to take steps to mitigate and estimate each of them.

The sources of bias and variance

One obvious source of bias and variance is the environment and the spillover effects from trading activity in other symbols. This includes trends in market activity that are happening generally or across sectors, factors, etc. These trends can drive prices up and down in ways that obscure the impact of narrow and specific trading activity in a stock, much as a blinding sun can obscure the talent of an outfielder.

A second source of variance is natural fluctuation in the trading activity and prices of a particular symbol, independent of wider market effects. The roughly contemporaneous actions of other traders in the same stock are neither wholly a reaction to our trading nor wholly independent of it. This can make cause and effect nearly impossible to tease apart at some time scales. Did that error in the fifth inning really “change the momentum of the game,” or would a similar thing have happened anyway? We can never fully know.

Another considerable source of bias is the selection of which symbols, time periods, and total share amounts to trade. Especially when sample sizes are small, it’s very difficult to account for the idiosyncrasies of context, much as a relief pitcher’s record is heavily influenced by exactly what kinds of situations they are typically brought into.

Finally, a source of confusion in the interpretation of performance statistics is the potential for over-measuring. We can calculate so many different stats, over so many different time periods with different parameters, that some of them are bound to look good and some of them are bound to look bad. An even more insidious problem can arise even when we measure a small number of things but allow what we measure to be influenced by the data itself. This may happen subconsciously when we tweak the parameters of our measurements because the results don’t yet match our expectations. This is what I suspect is happening when an announcer tells us something oddly specific like: “he’s had 11 hits in his last 32 at-bats…” The announcer may start with a vague impression that is influenced by observation (this hitter seems to be doing well these days) and then go looking for a set of parameters for the stat that validate that impression. This is not wrong per se; the stat is true. But the process can be misleading. The same batter could perhaps have been portrayed as having 13 hits in his last 53 at-bats or 1 hit in his last 5 at-bats, each of which would convey a different impression.
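To see how much the choice of window matters, here is a small toy example with a made-up sequence of at-bats. The point is only that picking the lookback after seeing the data lets the same record tell very different stories.

```python
# A made-up sequence of recent at-bats (1 = hit, 0 = out), most recent last.
# The same underlying data supports very different-sounding stats depending on
# which lookback window you choose after peeking at the results.
import random

random.seed(7)
at_bats = [1 if random.random() < 0.26 else 0 for _ in range(60)]  # roughly a .260 hitter

for window in (5, 10, 20, 32, 53):
    recent = at_bats[-window:]
    hits = sum(recent)
    print(f"{hits} hits in his last {window} at-bats ({hits / window:.3f})")
```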

But isn’t all of this solved by “statistical significance”?

You’ve probably heard of these problems before, and many other problems with statistics. You may have also heard the phrase “statistical significance” mumbled as an incantation to wish such problems away. Unfortunately, it doesn’t really work.

If something is “statistically significant,” that’s supposed to mean it is “unlikely to be due to chance.” But there’s a lot of fine print in the technical definition. To calculate the probability that something happens by chance, you have to define a fairly detailed probabilistic model of the world, and that model is going to diverge from reality in lots of ways. So really, when we say something is statistically significant, we’re saying that it had a relatively small probability of happening by chance in some particular imaginary world that we’ve defined. Also, the over-measurement problem still applies — if we go looking for enough “unlikely” events, it becomes likely that we will start finding some by coincidence, especially if we allow ourselves to reformulate our queries as we look at the data. These phenomena have led to a replication crisis in several statistically-based research disciplines [see e.g. here and here]. But they have also spurred the development of a field of research known as “robust statistics” that has been advancing by leaps and bounds recently. [Some recent papers can be found here and here.]
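As a quick illustration of the over-measurement point, here is a small simulation (with arbitrary parameters) that tests many pure-noise metrics at the conventional 5% level and counts how many come out looking “significant.”

```python
# Simulate the over-measurement problem: test many metrics that are pure noise
# at the 5% significance level and count how many look "significant" anyway.
# All numbers are arbitrary; the point is that false positives are guaranteed at scale.
import random
import statistics

random.seed(42)
num_metrics, samples_per_metric = 200, 50
false_positives = 0

for _ in range(num_metrics):
    data = [random.gauss(0, 1) for _ in range(samples_per_metric)]
    mean = statistics.mean(data)
    stderr = statistics.stdev(data) / samples_per_metric ** 0.5
    if abs(mean / stderr) > 1.96:  # roughly a two-sided 5% test of "the true mean is zero"
        false_positives += 1

print(f"{false_positives} of {num_metrics} pure-noise metrics look 'significant'")
# Expect roughly 10 (about 5%) by chance alone.
```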

So what can we do?

There are many strategies we can employ simultaneously to mitigate the effect of noise on our execution quality stats. And for the amount of noise that stubbornly remains, we can at least try to get a sense of its severity so that we are less prone to jump to false conclusions. Let’s try to tackle our sources of bias and variance one by one.

In terms of the environment and cross-symbol trends, we can try to model the correlations between various symbols and ETFs, which can serve as reasonable proxies for the general market or for various sectors and factors. We have developed a rudimentary technique for doing this, which we call “distilling.” [Our prior blog post on this can be found here.] This kind of modeling can be used to subtract out some fraction of the noise in our metrics that is due to broader market fluctuations.
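To make the flavor of this concrete, here is a minimal sketch of one simple version of the idea: estimate a stock’s sensitivity (beta) to a market proxy like an ETF from historical returns, and subtract the proxy-implied move from the stock’s move. This is an illustration of the general concept, not the actual distilling methodology described in the linked post, and all of the return numbers are made up.

```python
# Minimal sketch of beta-adjusting a stock's return against a market proxy (e.g. an ETF).
# Illustration only, not Proof's actual distilling model.
# (statistics.covariance requires Python 3.10+)
import statistics

def estimate_beta(stock_returns: list[float], etf_returns: list[float]) -> float:
    """Ordinary least-squares slope of stock returns on ETF returns."""
    cov = statistics.covariance(stock_returns, etf_returns)
    var = statistics.variance(etf_returns)
    return cov / var

def residual_return(stock_return: float, etf_return: float, beta: float) -> float:
    """The portion of the stock's move not explained by the market proxy."""
    return stock_return - beta * etf_return

# Example with made-up historical daily returns:
hist_stock = [0.010, -0.004, 0.007, -0.012, 0.003]
hist_etf   = [0.008, -0.003, 0.005, -0.010, 0.002]
beta = estimate_beta(hist_stock, hist_etf)

# During our trading interval the stock moved +30 bps while the ETF moved +20 bps.
print(residual_return(0.0030, 0.0020, beta))  # market-adjusted move
```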

For natural variation due to independent activity in the stock we are trading, we can try to get a rough sense of the magnitude of this source of variance by looking at comparable samples of trading activity in that stock on days when we were not active. For example, if we want to know how much fluctuation to expect in a statistic like slippage versus arrival, we can compare volume-weighted average prices to arrival prices for the same stock on days/times closely preceding our trading activity. Computing this on top of “distilling” techniques can give us some sense of how much noise we expect to be injected into our stats as a result of contemporaneous and relatively independent trading activity in the same stocks.
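Here is a toy version of that baseline estimate. The price pairs are made up, and a real analysis would match the windows to the time of day and duration of our actual orders.

```python
# Sketch of estimating the "natural" noise in a slippage-vs-arrival-style stat by
# comparing interval VWAP to a start-of-window "arrival" price on windows from days
# when we were NOT trading the stock. The price pairs below are fabricated.
import statistics

def baseline_noise_bps(price_pairs: list[tuple[float, float]]) -> list[float]:
    """For each (arrival_price, interval_vwap) pair, the drift in basis points."""
    return [(vwap - arrival) / arrival * 10_000 for arrival, vwap in price_pairs]

# Each pair: (price at window start, VWAP over the window) on an inactive day.
inactive_windows = [(50.00, 50.04), (50.20, 50.13), (49.85, 49.91), (50.10, 50.02)]
samples = baseline_noise_bps(inactive_windows)
print(f"mean drift {statistics.mean(samples):.1f} bps, "
      f"stdev {statistics.stdev(samples):.1f} bps")
```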

If we were a buyside firm splitting our orders over several brokers and trying to compare their performance to each other, we could attempt to control for selection bias in stocks/sizes/times by trying to equalize various characteristics of our orders over large enough sample sizes of flow that we send to each broker. Doing this well is tricky. As an agency broker, Proof does not control what orders we receive from clients, and we have to live with whatever distribution of stocks/shares/times we happen to get. Like a relief pitcher, we have to take the mound whenever we’re called upon, and we don’t get to pick and choose the circumstances. We nonetheless can get a sense of how much parent order selection may be affecting our performance stats by seeing how the stats change if we remove various small subsets of orders. For example, we could see how much lower or how much higher we can make a given stat by removing say 1% of the orders in our data set [see here for a more thorough technical development of this concept]. However much we can make the stat move by this manipulation is a rough measure of how much order selection may be driving our numbers: after all, a client randomly splitting their orders between different brokers could have easily sent that 1% of orders to someone else instead, and our performance stat would have moved accordingly.
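A rough sketch of that sensitivity check might look like the following. The per-order slippage numbers are fabricated, and a real version would weight orders (e.g. by notional) and use whatever stat definition is actually being reported.

```python
# Order-selection sensitivity check: how far can a performance stat move if we
# drop the worst (or best) 1% of orders? Positive slippage = worse, as above.
import random

random.seed(1)
order_slippage_bps = [random.gauss(2.0, 15.0) for _ in range(1_000)]  # fake per-order stats

def mean_without_extremes(values: list[float], drop_fraction: float, drop_high: bool) -> float:
    """Average after removing the highest (or lowest) `drop_fraction` of values."""
    n_drop = max(1, int(len(values) * drop_fraction))
    ordered = sorted(values, reverse=drop_high)
    kept = ordered[n_drop:]
    return sum(kept) / len(kept)

full_mean = sum(order_slippage_bps) / len(order_slippage_bps)
best_case = mean_without_extremes(order_slippage_bps, 0.01, drop_high=True)    # drop worst 1%
worst_case = mean_without_extremes(order_slippage_bps, 0.01, drop_high=False)  # drop best 1%
print(f"full sample: {full_mean:.1f} bps; range under 1% removal: "
      f"[{best_case:.1f}, {worst_case:.1f}] bps")
```

The width of that range is the rough measure described above: the wider it is, the more the headline number may be a story about which orders happened to land with us rather than about how we traded them.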

Finally, we can counteract the tendency towards over-measurement by imposing a disciplined structure on what we measure and when. We will choose a set of fundamental time units, e.g. 1 day, 1 week, 1 month, 1 quarter, 1 year, and only group data by those units. For example, we will never run analyses sometimes on 8 days worth of trading data, and sometimes on 11 days, etc. For certain stats, a minimum sample size of orders will be imposed to give our noise mitigation and estimation measures a fighting chance. There is very little hope, for example, of removing enough noise to draw conclusions from the performance of any single order (assuming things do not go obviously horribly wrong). This is the same basic logic behind the requirement that a baseball player needs at least 502 plate appearances to be eligible for the season batting title. [Aside: the seemingly strange number of 502 comes from taking 3.1 appearances per game and multiplying it by a 162-game regular season schedule.]
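In code, this kind of discipline can be as simple as a gate that refuses ad hoc groupings and undersized samples. The allowed units below mirror the list above, while the 200-order floor is a hypothetical threshold, not Proof’s actual rule.

```python
# Sketch of a discipline gate for analyses: only run a stat if the requested grouping
# uses one of a fixed set of fundamental time units and the sample is large enough.
ALLOWED_UNITS = {"day", "week", "month", "quarter", "year"}
MIN_ORDERS_FOR_STAT = 200  # hypothetical floor, in the spirit of the 502 plate-appearance rule

def can_run_analysis(time_unit: str, num_orders: int) -> bool:
    """Refuse ad hoc windows (e.g. '8 days') and under-sized samples."""
    return time_unit in ALLOWED_UNITS and num_orders >= MIN_ORDERS_FOR_STAT

print(can_run_analysis("week", 1_000))    # True
print(can_run_analysis("8 days", 1_000))  # False: not a fundamental unit
print(can_run_analysis("month", 50))      # False: sample too small
```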

We will also have stateful analysis reports instead of stateless ones. For example, suppose we have a filter that looks for a “bad” event that we believe will happen by chance for 0.001% of orders (i.e. 1 out of every 100,000 times). Let’s say we run the filter as part of a periodic analysis on orders that is performed weekly, and we have roughly 1,000 orders each week to analyze. [Note: all of these numbers are made up as a hypothetical.] Suppose that on the 42nd week, we find one instance of the bad event. Should we be worried?

A stateless report only considers the current context of that week’s data, and can only tell us: you saw one bad event out of 1,000 orders. The chance of the bad event happening in at least one of these 1,000 orders was about 1%, so yeah, this sounds kind of bad!

A stateful report, however, can remember how many times it has run this same filter and what the cumulative results have been. And so it can give us more nuanced guidance: this week you saw one bad event out of 1,000. Over the lifetime of this report, you’ve seen 1 bad event out of 42,000 orders. Seems much less bad in this context.
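Here is a sketch of what such a stateful report could look like for this hypothetical filter. The class, its method names, and the weekly volumes are all illustrative, matching the made-up numbers above.

```python
# Stateful version of the hypothetical filter: it remembers cumulative counts across
# runs and reports the chance of having seen at least one bad event by luck alone.
P_BAD = 1e-5  # assumed per-order chance of the "bad" event (1 in 100,000)

class StatefulFilterReport:
    def __init__(self) -> None:
        self.total_orders = 0
        self.total_bad_events = 0

    def record_week(self, orders: int, bad_events: int) -> str:
        self.total_orders += orders
        self.total_bad_events += bad_events
        p_weekly = 1 - (1 - P_BAD) ** orders              # chance of >=1 this week by luck
        p_lifetime = 1 - (1 - P_BAD) ** self.total_orders  # chance of >=1 over the report's lifetime
        return (f"this week: {bad_events}/{orders} (chance by luck: {p_weekly:.1%}); "
                f"lifetime: {self.total_bad_events}/{self.total_orders} "
                f"(chance by luck: {p_lifetime:.1%})")

report = StatefulFilterReport()
for week in range(1, 43):
    bad = 1 if week == 42 else 0  # the scenario above: one bad event, in week 42
    summary = report.record_week(1_000, bad)
print(summary)  # lifetime chance of at least one bad event by luck is roughly 34%
```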

All combined, these enhancements to our analyses of order performance can help protect our time and energy from being misdirected into chasing phantoms from noise. They will probably not be enough to allow us to have a clear view of performance in the early stages, while our sample sizes are small and our techniques for distilling are still rudimentary. But we are hopeful that they represent a solid scientific foundation on which we can continue to innovate and build.
