How well can one statistic predict runs scored in college baseball?
…surprisingly well, if you avoid Von Neumann’s elephant!
Professional baseball has kept track of player success with statistics like batting average (AVG) for about 150 years. John Thorn wrote a great piece about Henry Chadwick and the almost-origin of slugging percentage (SLG) in the context of 19th-century baseball. Others have described the collaboration between Dodgers executive Branch Rickey and statistician Allan Roth in the late 1940s that led to the invention of on-base percentage (OBP), a statistic later popularized in Michael Lewis's 2003 book Moneyball.
In 1980, Bill James coined the term sabermetrics, and since then the development of new statistics has accelerated. In this article, I take a look at two of the more popular and accessible advanced stats in the context of Division 1 college baseball.
Let's start with on-base plus slugging (OPS = OBP + SLG). Implicit in this statistic is the assumption that SLG (which ranges from 0–4) matters more than OBP (which ranges from 0–1), simply because SLG spans a larger range. You might ask whether that relative weighting makes sense. The main currency in baseball is runs, so let's see how well a team's OPS can predict how many runs they score per game.
Figure 1 shows that, with a modified OPS equation, we can do a remarkable job predicting how many runs a team will score. We call this new statistic JOPS = 3.27*OBP + SLG, after Princeton University pitching coach Joe Haumacher, who first devised the stat. The first thing that should jump out at you is that OBP is weighted much more heavily in JOPS than it is in OPS. In D1 simulations with unseen data, I find that JOPS predicts Runs/Game with 18% less error than OPS (Figure 1b). In contrast, OBP and SLG deserve equal weights in MLB (Figure 2), making OPS the appropriate MLB statistic. One explanation for the dominance of OBP in D1 compared to MLB is that D1 fielding is less reliable, so if a runner gets on, they are more likely to score without the need of an extra-base hit. The second thing you might notice in Figure 2 is that the normalized weighting of OBP (nwOBP) varies year to year with run environment (what happened in D1 after the 2021 season, roughly coincident with the widespread use of barrel testing?).
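To make the fitting idea concrete, here is a minimal sketch of how an OBP weight like the one in JOPS could be derived: regress Runs/Game on OBP and SLG, then take the ratio of the fitted coefficients. The data below is synthetic (real D1 team-season stats would replace it), and the 3.5 weight ratio baked into the toy model is an assumption for illustration, not the actual fit behind JOPS.

```python
import numpy as np

# Synthetic team-season data; real use would load D1 OBP, SLG, and Runs/Game.
rng = np.random.default_rng(0)
n = 300
obp = rng.normal(0.36, 0.03, n)
slg = rng.normal(0.45, 0.06, n)
# Toy run model with an OBP:SLG weight ratio of 14/4 = 3.5 (illustrative).
runs = 14.0 * obp + 4.0 * slg + rng.normal(0.0, 0.1, n)

# Fit Runs/Game = a*OBP + b*SLG + c by least squares; a/b is the JOPS-style
# weight on OBP once SLG's coefficient is normalized to 1.
X = np.column_stack([obp, slg, np.ones(n)])
a, b, c = np.linalg.lstsq(X, runs, rcond=None)[0]
print(f"fitted OBP weight relative to SLG: {a / b:.2f}")
```

The fitted ratio recovers the value planted in the toy model, which is the whole game: with real data, the same regression tells you how much more an on-base event is worth than a total base.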
One small increase in complexity has the potential to improve OPS and JOPS. Both include SLG, which intuitively (but arbitrarily) weights each type of hit by its total bases. What if each on-base outcome had its own unique weight? Weighted On-Base Average (wOBA), featured in The Book by Tango, Lichtman, and Dolphin (2006), is a statistic that attempts to assign a unique weight to each possible on-base outcome.
These weights, w1-w6, are published every year for Major League Baseball (MLB). Let’s see how well wOBA does for D1 college baseball. Figure 3 shows that, in D1, JOPS is a remarkably robust statistic with fairly consistent explanatory power from year to year, and gets no improvement from wOBA. In fact, while adjusted R² is similar for JOPS and wOBA, simulations with unseen data (like those in Figure 1b) reveal that JOPS is a slightly better predictor of Runs/Game. The same can be said in MLB, where JOPS ≈ OPS (Figure 2) performs just about as well as wOBA, except perhaps for 2021 (ignore 2020 and 2024 because they are COVID shortened and unfinished seasons, respectively).
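For readers who have not seen it written out, one common formulation of wOBA looks like the sketch below. The six weights here are illustrative placeholders in the ballpark of recent published MLB values, not the official w1–w6 for any particular season.

```python
def woba(bb, hbp, singles, doubles, triples, hr, ab, ibb, sf):
    """Weighted On-Base Average with illustrative (not official) weights."""
    # Each on-base outcome gets its own weight, unlike SLG's total-bases rule.
    w_bb, w_hbp, w_1b, w_2b, w_3b, w_hr = 0.69, 0.72, 0.89, 1.27, 1.62, 2.10
    numerator = (w_bb * bb + w_hbp * hbp + w_1b * singles
                 + w_2b * doubles + w_3b * triples + w_hr * hr)
    # Denominator counts the plate appearances the weights apply to.
    denominator = ab + bb - ibb + sf + hbp
    return numerator / denominator

# A plausible team-season line; wOBA typically lands near the 0.300-0.400 range.
season_woba = woba(bb=50, hbp=5, singles=100, doubles=30,
                   triples=5, hr=20, ab=500, ibb=5, sf=5)
print(round(season_woba, 3))
```

Note that the weights are the five-plus tunable parameters at issue below: every one of them must be re-derived from data each year.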
The nuance here is that each time you add a tunable parameter to a model, the variance explained (R²) will always go up. So, when comparing models, and in this case, estimating how successful they will be at predicting runs scored per game, one must penalize a model based on the number of tunable parameters, and one must design simulations to see how well the model makes predictions with unseen data (i.e., data not used to train the model). This concept is exemplified by Von Neumann's elephant: "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk." Von Neumann's point is that it is not hard to generate a strong model fit if you have enough parameters, but that model will be overfit to the training data and not capable of making accurate predictions from data it has not seen. You can imagine how important this problem becomes when using neural networks with millions of tunable parameters!
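One standard way to apply that parameter penalty is adjusted R², which discounts raw R² by the number of tunable parameters relative to the sample size. A quick sketch (the sample size and R² values here are made up for illustration, not taken from my fits):

```python
def adjusted_r2(r2, n_samples, n_params):
    """Penalize raw R^2 for model complexity."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_params - 1)

# A 5-parameter model with slightly higher raw R^2 can still lose to a
# 1-parameter model once the penalty is applied (n = 60 team-seasons here).
simple = adjusted_r2(0.900, 60, 1)    # JOPS-like: one tunable parameter
complex_ = adjusted_r2(0.905, 60, 5)  # wOBA-like: five tunable parameters
print(f"adjusted R2: simple = {simple:.4f}, complex = {complex_:.4f}")
```

The penalty is mild, which is exactly why the second line of defense, testing on unseen data, matters so much.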
So I am going to perform one more experiment to evaluate the relative skill of JOPS and wOBA at predicting runs scored. The common practice here is called cross-validation — training the model with some fraction of the data, and then testing the model’s accuracy using the remaining data. I agree with Samuele Mazzanti’s nice piece about the pitfalls of cross-validation, and I will proceed with caution.
There is a lot to unpack in Figure 5, so I will try not to get too lost in the weeds. In addition to standard random trials (gray), I set up some annual trials to determine how bad the model can be (red, purple). The idea is to find the best fit model JOPS or wOBA for any given year, and then to use that model to predict Runs/Game in the other years (Figure 5; Xval by Year). We already know from Figure 2 that this test might not go so well, especially for D1, because the model weights vary year to year. However, that is exactly the problem we are up against — armed with data from one year, we need to predict the outcomes for the next year!
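The "Xval by Year" procedure can be sketched as a leave-one-year-out style loop: fit on one season, score on every other season, and pool the residuals. Everything below is synthetic and the JOPS-style linear fit is a stand-in for the actual models; real use would load yearly D1 team stats in place of the random draws.

```python
import numpy as np

# Build synthetic seasons: each year maps to (design matrix, runs per game).
rng = np.random.default_rng(1)
seasons = {}
for year in range(2015, 2020):
    obp = rng.normal(0.36, 0.03, 100)
    slg = rng.normal(0.45, 0.06, 100)
    runs = 14.0 * obp + 4.0 * slg + rng.normal(0.0, 0.3, 100)
    seasons[year] = (np.column_stack([obp, slg, np.ones(100)]), runs)

# Train on each year in turn, predict every other year, pool the residuals.
residuals = []
for train_year, (X_train, y_train) in seasons.items():
    coef = np.linalg.lstsq(X_train, y_train, rcond=None)[0]
    for test_year, (X_test, y_test) in seasons.items():
        if test_year != train_year:
            residuals.extend(y_test - X_test @ coef)
sigma = np.std(residuals)
print(f"cross-validated residual sigma = {sigma:.2f} runs/game")
```

In this toy version the model weights are stable across years, so the cross-year residuals stay near the noise floor; the interesting D1 result is precisely that real weights drift year to year, widening those residuals.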
The first thing you might notice in Figure 5 is that the distributions of residuals for D1 are about twice as wide as those for MLB. One interpretation of this result would be that in MLB, JOPS and wOBA predict Runs/Game with about half the absolute error that they do in D1. However, when standardized by the league's variability in Runs/Game, the relative error is 0.33/1.26 = 26% for D1 and 0.15/0.51 = 29% for MLB — virtually the same.
D1 Runs/Game: range = 1.67–11.1; μ = 6.03; σ = 1.26
MLB Runs/Game: range = 2.82–5.85; μ = 4.54; σ = 0.51
A second thing you can see in Figure 5 is that despite being a more complex model, wOBA does not outperform JOPS in either league. In other words, you can go from one tunable parameter with JOPS, to five tunable parameters in wOBA (or Von Neumann’s elephant; Figure 4), and yet see no appreciable improvement in your ability to predict Runs/Game. Simple models often have more relative predictive power than you think!
So, on one hand, JOPS is a remarkably simple and successful statistic in both D1 and MLB, explaining >90% of the variance in runs scored per game with just one tunable parameter (Figures 1–3). On the other hand, whether using JOPS or wOBA, you will only predict a team’s Runs/Game to within 25–30% of 1σ of the actual runs per game, 68% of the time. If you ask me which statistic I would use to build a D1 lineup, I’ll say JOPS every day.
But I am looking forward to putting other advanced stats to the test in D1 college baseball. Next up: accounting for park factor and batted ball metrics.