How well can one statistic predict runs scored in college baseball?

…surprisingly well, if you avoid von Neumann’s elephant!

Adam Maloof
SABR Tooth Tigers
Jun 9, 2024


Professional baseball has kept track of player success with statistics like batting average (AVG) for about 150 years. John Thorn wrote a great piece about Henry Chadwick and the almost-origin of slugging percentage (SLG) in the context of 19th-century baseball. Others have described the collaboration between Dodgers executive Branch Rickey and statistician Allan Roth in the late 1940s that led to the invention of on-base percentage (OBP), a statistic later popularized in Michael Lewis’ 2003 book Moneyball.

H: Hit; AB: At bat; 1B: Single; 2B: Double; 3B: Triple; HR: Home Run; BB: Walk; HBP: Hit by Pitch; SF: Sac Fly
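For reference, the standard formulas behind that legend are:

OBP = (H + BB + HBP) / (AB + BB + HBP + SF)

SLG = (1B + 2*2B + 3*3B + 4*HR) / AB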

In 1980, Bill James coined the term sabermetrics, and since then the development of new statistics has accelerated. In this article, I take a look at two of the more popular and accessible advanced stats in the context of Division 1 college baseball.

Let’s start with on-base plus slugging (OPS = OBP + SLG). Implicit in this statistic is that SLG (which ranges from 0–4) carries more weight than OBP (which ranges from 0–1), simply because SLG spans a larger scale. You might ask whether that relative weighting makes sense. The main currency in baseball is runs, so let’s see how well a team’s OPS can predict how many runs they score per game.

Figure 1. I build the simplest possible linear regression model. I ask how well I can tune one weight (w1) in the equation w1*OBP + SLG = Predicted Runs/Game to minimize the squared difference between actual Runs/Game and predicted Runs/Game. (a) depicts actual versus model-predicted Runs/Game for each D1 college baseball team in 2024, and you can see that this model works remarkably well, explaining 91% of the variance with a slope of 1. (b) depicts the result of 1000 simulations where 60% of the data are used to fit OPS and JOPS models, and the remaining 40% of the data are used to test model predictions. You can see that while OPS is also a superb model, it has 7.7% lower R² and 26% higher prediction error than JOPS. The Kolmogorov-Smirnov test (kstest) reveals that neither distribution of residual Runs/Game is normally distributed, suggesting that some of the remaining 9% of unexplained variance may come from a non-random process we have not accounted for.
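For the curious, here is a minimal sketch of that one-weight fit. One way to read the setup is to regress Runs/Game on OBP and SLG without an intercept and report the ratio of the two coefficients as the single weight w1; the sketch below follows that reading, and the file name and column names (OBP, SLG, RPG) are placeholders, not the actual pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical team-level table: one row per 2024 D1 team, with columns
# OBP, SLG, and RPG (runs per game). Path and column names are placeholders.
teams = pd.read_csv("d1_team_batting_2024.csv")

X = teams[["OBP", "SLG"]].to_numpy()
y = teams["RPG"].to_numpy()

# Least-squares fit of RPG ~ b1*OBP + b2*SLG (no intercept).
(b1, b2), *_ = np.linalg.lstsq(X, y, rcond=None)

# The single relative weight: how heavily OBP counts per unit of SLG.
w1 = b1 / b2
teams["JOPS"] = w1 * teams["OBP"] + teams["SLG"]

# Variance explained by the fitted model.
resid = y - X @ np.array([b1, b2])
r2 = 1 - np.sum(resid**2) / np.sum((y - y.mean()) ** 2)
print(f"w1 = {w1:.2f}, R^2 = {r2:.2f}")
```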

Figure 1 shows that, with a modified OPS equation, we can do a remarkable job of predicting how many runs a team will score. We call this new statistic JOPS = 3.27*OBP + SLG, after Princeton University pitching coach Joe Haumacher, who first devised the stat. The first thing that should jump out at you is that OBP is weighted much more heavily in JOPS than it is in OPS. In D1 simulations with unseen data, I find that JOPS predicts Runs/Game with 18% less error than OPS (Figure 1b). In contrast, OBP and SLG deserve equal weights in MLB (Figure 2), making OPS the appropriate MLB statistic. One explanation for the dominance of OBP in D1 compared to MLB is that D1 fielding is less reliable, so if a runner gets on, they are more likely to score without the need for an extra-base hit. The second thing you might notice in Figure 2 is that the normalized weighting of OBP (nwOBP) varies year to year with run environment (what happened in D1 after the 2021 season, roughly coincident with the widespread use of barrel testing?).

Figure 2. Inter-annual variability in the relative weight of OBP versus SLG in the JOPS statistic. The linear regression is less precise for MLB because there are just 30 teams (compared to ~300 teams in NCAA Division 1). In MLB, we can never be 95% confident that OBP and SLG should not be equally weighted (i.e., the OPS statistic). In contrast, in D1, we can always be 99% confident that OBP should be weighted 2.4–3.4 times more heavily than SLG when predicting team runs scored.

One small increase in complexity has the potential to improve OPS and JOPS. Both include SLG, which intuitively (but arbitrarily) weights each type of hit by its total bases. What if each on-base outcome had its own unique weight? Weighted On-Base Average (wOBA), featured in The Book by Tango, Lichtman, and Dolphin (2006), is a statistic that assigns a unique weight to each possible on-base outcome.

uBB: unintentional walks
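The version used here follows the standard published template, with six tunable weights in the numerator (IBB denotes intentional walks; I assume the denominator matches the usual form):

wOBA = (w1*uBB + w2*HBP + w3*1B + w4*2B + w5*3B + w6*HR) / (AB + BB - IBB + SF + HBP)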

These weights, w1–w6, are published every year for MLB. Let’s see how well wOBA does for D1 college baseball. Figure 3 shows that, in D1, JOPS is a remarkably robust statistic with fairly consistent explanatory power from year to year, and it gets no improvement from wOBA. In fact, while adjusted R² is similar for JOPS and wOBA, simulations with unseen data (like those in Figure 1b) reveal that JOPS is a slightly better predictor of Runs/Game. The same can be said in MLB, where JOPS ≈ OPS (Figure 2) performs just about as well as wOBA, except perhaps in 2021 (ignore 2020 and 2024 because they are COVID-shortened and unfinished seasons, respectively).

Figure 3. For each year, separately in D1 and MLB, JOPS and wOBA weights are calculated to best predict runs scored per game. I use adjusted R² to partially account for model complexity when comparing two models. Symbol size scales inversely with the Bayesian Information Criterion (BIC), another attempt at measuring the generalizability of different models applied to the same dataset (e.g., JOPS vs. wOBA for MLB only). BIC helps to choose the model that best explains the data without overfitting to the training data (Figure 4). You can see that in 2020, both D1 and MLB have very high BIC (bad; small symbol size) because the COVID-shortened season reduced the amount of available data.
n: samples; p: tunable parameters (weights).
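For reference, with n samples and p tunable weights, the two model-comparison scores in Figure 3 take their usual forms (the BIC expression is the common least-squares version, up to an additive constant, where RSS is the residual sum of squares):

Adjusted R² = 1 - (1 - R²)*(n - 1)/(n - p - 1)

BIC = n*ln(RSS/n) + p*ln(n)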

The nuance here is that each time you add a tunable parameter to a model, the variance explained (R²) will always go up. So, when comparing models, and in this case estimating how successful they will be at predicting runs scored per game, one must penalize a model based on the number of tunable parameters, and one must design simulations to see how well the model makes predictions with unseen data (i.e., data not used to train the model). This concept is exemplified by von Neumann’s elephant: “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” Von Neumann’s point is that it is not hard to generate a strong model fit if you have enough parameters, but that model will be overfit to the training data and incapable of making accurate predictions from data it has not seen. You can imagine how important this problem becomes when using neural networks with millions of tunable parameters!
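A toy illustration of that point, using synthetic data rather than the baseball tables: as you add polynomial terms, training R² climbs monotonically, while error on held-out data eventually stops improving and typically worsens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a gentle quadratic trend plus noise.
x = rng.uniform(-1, 1, 80)
y = 1.0 + 0.5 * x - 0.8 * x**2 + rng.normal(0, 0.3, x.size)

# Hold out 40% of the points, mimicking the train/test splits above.
idx = rng.permutation(x.size)
train, test = idx[:48], idx[48:]

for degree in range(1, 9):
    coeffs = np.polyfit(x[train], y[train], degree)
    fit_train = np.polyval(coeffs, x[train])
    fit_test = np.polyval(coeffs, x[test])
    ss_tot = np.sum((y[train] - y[train].mean()) ** 2)
    r2_train = 1 - np.sum((y[train] - fit_train) ** 2) / ss_tot
    rmse_test = np.sqrt(np.mean((y[test] - fit_test) ** 2))
    print(f"degree {degree}: train R^2 = {r2_train:.3f}, test RMSE = {rmse_test:.3f}")
```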

Figure 4. One of the most under-appreciated problems in the elaborate regressions of baseball analytics is overfitting. Perhaps the most famous legend of overfitting is encapsulated in a von Neumann quotation: “With four parameters (in the legend above) I can fit an elephant, and with five (the fifth is the black eye) I can make him wiggle his trunk.” The point is, model fit will always improve with each additional tunable parameter, but model ‘fitness’ might not. In other words, when you go from JOPS to wOBA, you are adding four tunable parameters and you are improving R², but are you actually improving how well the model will make predictions with unseen data?

So I am going to perform one more experiment to evaluate the relative skill of JOPS and wOBA at predicting runs scored. The common practice here is called cross-validation — training the model with some fraction of the data, and then testing the model’s accuracy using the remaining data. I agree with Samuele Mazzanti’s nice piece about the pitfalls of cross-validation, and I will proceed with caution.

There is a lot to unpack in Figure 5, so I will try not to get too lost in the weeds. In addition to standard random trials (gray), I set up some annual trials to determine how bad the model can be (red, purple). The idea is to find the best fit model JOPS or wOBA for any given year, and then to use that model to predict Runs/Game in the other years (Figure 5; XVal by Year). We already know from Figure 2 that this test might not go so well, especially for D1, because the model weights vary year to year. However, that is exactly the problem we are up against: armed with data from one year, we need to predict the outcomes for the next year!

Figure 5. In this experiment, I combine years 2018, 2019, 2021, 2022, and 2023. For each trial, I randomly select 60% of the data, perform the linear regression, and then use the best-fit model to predict Runs/Game for the remaining 40% of the data. I perform 250 trials, each with a different random subset of data. Then I plot the distribution of misfits (residuals; gray curves) between predicted Runs/Game and actual Runs/Game. I also perform annual trials (XVal by Year), where the model is tuned with data from one year and then tasked with predicting Runs/Game in other years. The thicker orange and blue curves depict the average of all the trials for D1 and MLB, respectively. The p-value reported in each legend is the result of a D’Agostino-Pearson test of normality; the p-values < 0.01 reveal that all of these distributions of residuals are significantly different from a normal distribution.
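Here is a minimal sketch of the repeated 60/40 holdout described above for the JOPS-style model, again with a hypothetical table and placeholder column names (OBP, SLG, RPG); the real experiment also fits wOBA and runs the by-year variant, which I omit here.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical pooled table of team-seasons (2018, 2019, 2021, 2022, 2023)
# with columns OBP, SLG, RPG; the path and column names are placeholders.
teams = pd.read_csv("d1_team_batting_2018_2023.csv")
X = teams[["OBP", "SLG"]].to_numpy()
y = teams["RPG"].to_numpy()

rng = np.random.default_rng(42)
n = len(y)
n_train = int(0.6 * n)
residuals = []

for _ in range(250):  # 250 random 60/40 splits, as in Figure 5
    idx = rng.permutation(n)
    train, test = idx[:n_train], idx[n_train:]
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    residuals.append(y[test] - X[test] @ beta)

residuals = np.concatenate(residuals)
stat, pval = stats.normaltest(residuals)  # D'Agostino-Pearson normality test
print(f"residual std = {residuals.std():.2f} runs/game, normality p = {pval:.2g}")
```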

The first thing you might notice in Figure 5 is that the distributions of residuals for D1 are about twice as wide as those for MLB. One interpretation of this result would be that in MLB, JOPS and wOBA predict Runs/Game with about half the absolute error that they do in D1. However, when standardized by each league’s variability in Runs/Game, the relative error is 0.33/1.26 = 26% for D1 and 0.15/0.51 = 29% for MLB, virtually the same.

D1 Runs/Game: range = 1.67–11.1; μ = 6.03; σ = 1.26

MLB Runs/Game: range = 2.82–5.85; μ = 4.54; σ = 0.51

A second thing you can see in Figure 5 is that, despite being a more complex model, wOBA does not outperform JOPS in either league. In other words, you can go from one tunable parameter with JOPS to five tunable parameters with wOBA (von Neumann’s elephant; Figure 4) and see no appreciable improvement in your ability to predict Runs/Game. Simple models often have more predictive power than you might think!

So, on one hand, JOPS is a remarkably simple and successful statistic in both D1 and MLB, explaining >90% of the variance in runs scored per game with just one tunable parameter (Figures 1–3). On the other hand, whether you use JOPS or wOBA, about 68% of the time your prediction of a team’s Runs/Game will only be good to within 25–30% of the league’s 1σ spread in runs per game. If you ask me which statistic I would use to build a D1 lineup, I’ll say JOPS every day.

But I am looking forward to putting other advanced stats to the test in D1 college baseball. Next up: accounting for park factor and batted ball metrics.
