Regressing to the mean in College Baseball

Adam Maloof
Published in SABR Tooth Tigers
Jun 19, 2024 · 14 min read

How do you estimate the skill of college batters without much data?

Recently, I watched as our coaches voted for Ivy League awards. They struggled to compare players with 200 at bats (AB) to players with 100 AB in the short Division 1 (D1) college season. Would a batter with amazing stats at 100 AB keep it up for 200 AB, or would their hot streak come to an end? At the same time, a student of mine was applying to work with a Major League Baseball (MLB) organization and was given a similar question during his interview in the context of choosing which players to draft. What assumptions should these coaches and front offices make about ‘regressing to the mean’ (RTM) when selecting players?

Schall & Smith (2000) describe the concept as follows: ‘Players who do exceptionally well in any particular season typically do not do as well the subsequent season–they regress toward the mean.’ What people have in mind here is that every player has an intrinsic skill (p), but will perform at a different level than skill alone predicts because of changing conditions (everything from random noise (Figure 1a) to different competition or changing skill (Figure 1b)).

Figure 1. These plots depict a binomial process called a Bernoulli trial (like a coin flip). Instead of a probability p = 0.5 of flipping a tails with a standard coin, I simulate batting average (AVG) with the success of the batter getting a hit set at pₕ = 0.3 (equivalent to the batter’s intrinsic skill, and an AVG of 0.300). The simulation in (a) is simple: for each successive AB, I add a 0 or 1, and divide by the total number of AB. The blue envelope depicts 1000 such simulations. (b) considers a single external forcing — in this case a step increase in pₕ at AB=100, simulating the abrupt change in competition from Spring Training with ACC teams to the main season with Ivy League teams. You can imagine any number of additional external forcings.

In Figure 1a, the blue 95% envelope of possible AVG trajectories is quite wide at first, and then narrows with increasing AB. It is important to realize that this AVG-narrowing is completely unrelated to the likelihood of a batter getting hot or descending into a slump; it is just that each successive outcome has a smaller impact on AVG = Hits/AB when AB gets large. In fact, think again of the coin flip — each successive flip still has p = 0.5 for tails, regardless of whether the previous hundred flips were all tails or not — the process has no memory of previous events, and regressing to the mean does not imply a reduction in variability from at bat to at bat.
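To build intuition for this narrowing, here is a minimal sketch (assuming NumPy; the 1000-batter setup mirrors the simulations behind Figure 1a) that tracks many pₕ = 0.3 batters and measures the spread of their running AVG at different AB counts:

```python
import numpy as np

rng = np.random.default_rng(0)
p_h = 0.3          # intrinsic skill (probability of a hit)
n_sims = 1000      # number of simulated batters
n_ab = 450         # at bats per simulation

# Each row is one batter's sequence of hit (1) / out (0) outcomes
hits = rng.binomial(1, p_h, size=(n_sims, n_ab))

# Running AVG = cumulative hits / cumulative AB
avg = hits.cumsum(axis=1) / np.arange(1, n_ab + 1)

# The spread across simulations narrows as AB grows, even though
# each individual AB is still a fresh p = 0.3 coin flip
for ab in (10, 150, 450):
    print(ab, avg[:, ab - 1].std())
```

The printed standard deviations shrink roughly like 1/√AB, which is exactly the envelope narrowing in Figure 1a.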

Okay, back to the question at hand. Let’s say a coach or general manager understands the Bernoulli trial behind Figure 1a, and has information on a batter after some number of AB. If the decision maker assumes there are no external factors (like the sudden increase in pₕ depicted in Figure 1b), they can calculate the standard deviation (σ) of the binomial distribution for any n (AB) and p (probability of a hit):

Equation 1. Standard deviation from random binomial noise: σ = √(p(1−p)/n). n = AB; p = skill.

So plug in p = 0.3 and n = 150 and you get σ = 0.037, matching the spread of the 1000 simulations in Figure 1a. In plain English, this result means that after 150 AB, you can be 68% sure that the AVG you measure is within ±0.037 of the batter’s intrinsic skill, pₕ. You should be thinking to yourself…that is not very good! The difference between a 0.300 batter and a 0.263 or 0.337 batter is huge, and even that range only represents 68% of outcomes. If you wait and measure their AVG after 450 AB, then σ = 0.022, which is a bit better. Another way to look at this concept is summarized in Figure 2.
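As a quick check, Equation 1 can be evaluated directly; a sketch:

```python
import math

def binomial_sigma(p, n):
    """Standard deviation of AVG after n AB for a batter with skill p (Equation 1)."""
    return math.sqrt(p * (1 - p) / n)

print(binomial_sigma(0.3, 150))  # ~0.037
print(binomial_sigma(0.3, 450))  # ~0.022
```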

Figure 2. (a) Each curve is a cross-section along the y-axis of Figure 1a at a different AB. The red dot shows that 21% of batters with AVG=0.263 after 450 AB actually have a true skill of pₕ=0.300. It gets worse if you only can observe the batter once at AB=150 (blue dot) and find AVG=0.263, because here there is a 50% chance that the batter’s true skill is pₕ=0.300. Even if you measure AVG=0.300 after some number of ABs, you would only be correct that the hitter has pₕ=0.300 68% of the time (the apex of these curves). (b) is the same as (a), but I focus on the AB=150 cross section and compare the distribution of a pₕ=0.300 with a pₕ=0.263 batter. As in (a), the blue dot marks where the mean for the pₕ=0.263 distribution intersects the pₕ=0.300 distribution (i.e., where the pₕ=0.300 batter is misinterpreted as a pₕ=0.263 batter ≈50% of the time). While measuring virtually any AVG between 0.170 and 0.370 could be slightly ambiguous for determining pₕ, the overlap distribution (gray area) highlights the range of AVGs (0.283±0.035) that would be strongly ambiguous for deciphering between pₕ=0.300 and pₕ=0.263.

There are many articles (e.g., Staude, 2013) talking about how many AB you need to observe for any particular batting statistic to be an accurate measure of the player’s intrinsic skill. I don’t think there is a single correct AB cutoff that works for every question. Instead, a decision maker needs to be aware of the uncertainty associated with any claim of one batter being more skilled than another. I will explore the concept of uncertainty in more depth later on. For now, let’s indulge this idea of a statistic stabilizing after a certain number of AB.

So far, I have presented you with theory that has been well understood since the work of Bernoulli nearly 300 years ago. Hopefully this review gives you some intuition, and illustrates what people [should] have in mind when they talk about regressing toward the mean. But I still want to know how to vote for Ivy League awards when all the candidates had a different number of ABs! It’s time to look at some real data!

Let’s start simple and examine end-of-season D1 batting statistics. I ask the question, how much of a player’s year-to-year performance change is due to regression-toward-the-mean (RTM)? Will a particularly good stretch always be followed by an average stretch? Are there other processes at work, like changes in skill (p)? In most MLB articles I have read, RTM is treated as a given, and the primary challenge authors set out to solve is how to see through RTM to determine a player’s skill. We will get there, but first let’s evaluate to what extent RTM is at work.

I will use Z-score to compare statistics between years because this form of standardization helps adjust for league-wide changes in run environment (e.g., changes in ball manufacturing) — the z-score tells us how much better or worse a batter is compared to his peers.

Equation 2. Z-score: z = (x − μ)/σ. x: data (e.g., AVG, SLG, OBP, etc.); μ: mean of data; σ: standard deviation of data.
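The standardization in Equation 2 is a one-liner; a sketch with made-up SLG values (not real league data):

```python
import numpy as np

def z_scores(x):
    """Standardize a stat so it is comparable across run environments."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Hypothetical league SLG values for four batters
slg = np.array([0.350, 0.420, 0.500, 0.610])
print(z_scores(slg))  # mean 0, std 1 by construction
```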

Figure 3a depicts what fraction of players improve their slugging percentage (SLG) in year 2, when starting from a particular SLG Z-score in year 1 (Z(1)). If the change in Z-score is entirely due to RTM, then we can predict Z(2) if we know (i) the player’s Z(1), and (ii) the coefficient of regression (a = R*(σ₂/σ₁) = R, where R is the Pearson correlation coefficient and σ₂=σ₁=1) between Z(1) and Z(2) for all players:

Equation 3. Theoretical regression-to-the-mean: E[Z(2)] = R·Z(1). R: Pearson correlation coefficient; Z(1): Z-score in year 1; Z(2): Z-score in year 2.

In most circumstances, |R|<1, so the expected value of Z(2) given Z(1) is proportionally reduced towards the mean of Z=0. One complication is that this theoretical expression assumes that you can make infinite measurements to zero out the noise. In practice, I can simulate what this relationship should look like when I have a limited number of data points and noise contributes to the signal:

Equation 4. Practical regression-to-the-mean with a finite data set: Z(2) = R·Z(1) + N·√(1−R²). I introduce noise with N, the standard normal distribution (μ=0 and σ=1), scaled to have σ = √(1−R²).
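Equation 4 can be simulated directly; a sketch assuming NumPy, with an assumed year-over-year correlation R (not fit to the actual D1 data):

```python
import numpy as np

rng = np.random.default_rng(1)
R = 0.5            # assumed year-1 to year-2 correlation (illustrative)
n_players = 5000

z1 = rng.standard_normal(n_players)                   # year-1 Z-scores
noise = rng.standard_normal(n_players) * np.sqrt(1 - R**2)
z2 = R * z1 + noise                                   # Equation 4

# Pure RTM keeps Z(2) standardized (sigma ~ 1) while shrinking its
# expected value toward 0 by a factor of R
print(z2.std(), np.corrcoef(z1, z2)[0, 1])
```

Because Var(R·Z(1)) + Var(noise) = R² + (1 − R²) = 1, the simulated second-year Z-scores stay standardized while still correlating with year 1 at level R.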

As predicted by RTM theory, Figure 3a shows that few players with high Z(1) improved in year 2 (i.e., the sophomore slump), while most players with low Z(1) improved in year 2. If we compare the actual data to the simulation of pure RTM in Figure 3a, we find a good fit, but we get a non-normal distribution of residuals with non-zero mean (Figure 3b). Figure 3b shows that, amongst all D1 batters, SLG improved by 0.006±0.004 due to a non-RTM process. AVG, OBP, and JOPS all show similar increases from year 1 to year 2, and all stats continue to increase at similar rates in years 3, 4, and 5 (although this analysis becomes more difficult because D1 careers are short and there are fewer data available for years 3, 4 and 5). To summarize, D1 players keep increasing their intrinsic skill throughout their college careers, making regression to the mean a moving target.

Figure 3. I take all end-of-season data from Division 1 NCAA Baseball between 2017–2024. For every qualified hitter (≥100 plate appearances), I compute their Z-score in each batting statistic for their year 1, year 2, year 3, etc. seasons. (a) shows what fraction of players with a given Z-score in their first season will improve in their second season. The size of the orange circle depicts how many players (n) contributed to each calculation. Overlain on the actual data is the theoretical distribution if 100% of the variance came from regression-to-the-mean (RTM), as well as an example simulation of the distribution from RTM limited by the sample size. (b) shows the distribution of residuals when the actual data are compared to the simulated data from a pure RTM process. The K-S test demonstrates that the residuals do not form a normal distribution, and the T-test indicates that the mean of the residuals (μ=0.006) is distinct from μ=0 at a >99% confidence level. I infer from the distribution of residuals that this 0.006±0.004 improvement in SLG from year 1 to year 2 across the league is not from changes in run environment, nor is it from RTM — college players get better every year!

We know from Figure 3 that most players likely will continue to improve during their college careers. But how can we know how many plate appearances it will take to regress, and to what mean?

A typical approach in the MLB blogosphere (e.g., Kincaid, 2011; Birnbaum, 2011) is to lump all hitters together for a particular year, and search for the plate appearance (PA) when variance in a statistic (σ²) due to expected random binomial noise (Equation 1; or multinomial noise for SLG and JOPS) is equal to the variance in skill. The variance in skill is taken as σ²(data) − σ²(binomial); or in plain English, the variance in batter skill is the difference between the total variance in the data and the expected variance from random noise (luck). The idea here is that the PA at which σ²(data) − σ²(binomial) = σ²(binomial) marks when we should be able to see through the noise and distinguish between players with different skill. In Figure 4b, you can see that this cross-over point for BB/PA occurs at about 100 PA in 2023. For other statistics, like AVG, the cross-over does not occur until PA≈210. And for other statistics, like SLG, there is no cross over at all during the short D1 season (Figure 5b).
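The crossover calculation can be sketched as follows. The player pool here is simulated (a hypothetical league whose true BB/PA skill varies around a 0.142 mean), not real D1 data, but the variance decomposition is the one described above:

```python
import numpy as np

rng = np.random.default_rng(2)
n_players, pa = 300, 100

# Simulate players whose true BB/PA skill varies around the league mean
true_skill = rng.normal(0.142, 0.04, n_players).clip(0.01, 0.5)
bb = rng.binomial(pa, true_skill)
observed = bb / pa

var_data = observed.var()
p_bar = observed.mean()
var_binom = p_bar * (1 - p_bar) / pa    # Equation 1, squared
var_skill = var_data - var_binom        # variance attributable to skill

# The stat has 'stabilized' once var_skill exceeds var_binom
print(var_binom, var_skill, var_skill > var_binom)
```

Repeating this at each PA checkpoint traces out the two variance curves in Figures 4b and 5b; the crossover is the first PA where `var_skill` exceeds `var_binom`.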

Figure 4. Stabilization of walks per plate appearance (BB/PA) in Division 1 (SEC, ACC, PAC-12, Big-12, and Ivy) during 2023. n describes how many batters reach a particular value of PA. (a) depicts how the mean (m) and standard deviation (s) of league-wide BB/PA evolve with each increasing PA. Also shown is the expected standard deviation (e) from random binomial noise (Equation 1). (b) illustrates how the variance from random binomial noise (e²) and from skill (s² − e²) evolve with increasing PA. The crossover point, where the variance from noise drops below the variance from skill, marks the point where the statistic has ‘stabilized’ enough to allow you to distinguish between the skill of different players (Carleton, 2012).
Figure 5. Stabilization of slugging percentage (SLG) in Division 1 (SEC, ACC, PAC-12, Big-12, and Ivy) during 2023. n describes how many batters reach a particular value of PA. (a) depicts how the mean (m) and standard deviation (s) of league-wide SLG evolve with each increasing PA. Also shown is the expected standard deviation (e) from random binomial noise (Equation 1). (b) illustrates how the variance from random binomial noise (e²) and from skill (s² − e²) evolve with increasing PA. The crossover point, where the variance from noise drops below the variance from skill, marks the point where the statistic has ‘stabilized’ enough to allow you to distinguish between the skill of different players (Carleton, 2012).

A decision maker might use the results in Figures 4 & 5 to decide whether one player is better than another in a particular stat if they think those players have had enough plate appearances to stabilize that stat. Tango, Lichtman, & Dolphin (2014), Birnbaum (2011) and many others take this approach further and attempt to regress a player’s stat to the mean by adding a cross-over-point’s worth of PAs of league-average performance to the actual performance. For example, let’s say player A had 0.200 BB/PA after 60 PAs (12/60). The league average BB/PA that year is 0.142 (14.2/100), where the 100 comes from the cross-over point found in Figure 4. So, to predict the true BB/PA skill of player A, you would make the following calculation

Equation 5. Common method of regressing a statistic to the mean: (12 + 14.2)/(60 + 100) = 0.164. 12/60 is Player A’s BB/PA after 60 PAs. 14.2/100 is 100 PAs of league-average BB/PA=0.142. The sum is the predicted true BB/PA talent of Player A.

and find that player A’s true BB/PA skill of 0.164 is still above league average, but not quite as stellar as their first 60 PAs would have led you to believe. Obviously you cannot perform this analysis without having enough data to accurately constrain the cross-over point. Further, this type of analysis assumes no change in skill or competition like that depicted in Figure 1b.
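The arithmetic behind Equation 5 is simple enough to wrap in a helper; a sketch (the function name is mine, not from the cited references):

```python
def regress_to_mean(successes, trials, league_rate, stabilization_pa):
    """Regress an observed rate toward the league mean by padding the
    player's record with stabilization_pa worth of league-average PAs."""
    padded = successes + league_rate * stabilization_pa
    return padded / (trials + stabilization_pa)

# Player A: 12 BB in 60 PA; league BB/PA = 0.142; crossover at 100 PA
print(round(regress_to_mean(12, 60, 0.142, 100), 3))  # 0.164
```

Notice that the fewer PAs a player has relative to the stabilization point, the harder this pulls the estimate toward the league mean, which is exactly the behavior a RTM correction should have.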

If I were a decision maker, I would be unsatisfied with this approach that so heavily depends on measuring a player’s stats at just one moment in time. Can’t we learn more by studying the evolution of a player’s stats with each PA? What if a player is improving, as is so often the case in college baseball (Figure 3)? What if our understanding of the league mean is incomplete or biased? And rather than computing one number to represent the prediction of a player’s true skill, wouldn’t it be more informative to calculate a distribution of possible true skills, and thus add a measure of confidence to any conclusion? Shouldn’t I be able to say something like: ‘Based on Player A’s evolution over 100 PAs, I can predict that his true BB/PA skill is 0.165±0.013, and that Player A is likely to be better than Player B 78% of the time.’ I just made up those numbers, but I think this kind of outcome is where we need to be to make our knowledge of RTM useful. We can achieve this goal with a Bayesian approach.

Figure 6. Simulations of the same Bernoulli trial with pₕ=0.3, as in Figure 1, but with Bayesian inference. (a) I start by pretending I know nothing about baseball except that AVG can be between 0 and 1. So my prior is a uniform distribution, which can be described as a beta distribution with α=1 and β=1. (b-f) I generate a new posterior beta distribution by increasing α or β, depending on whether the batter gets a hit or not, respectively. μAVG refers to the mean of the posterior beta distribution, plus/minus the 95% confidence interval. aAVG refers to the actual batting average of the hitter.

Let me illustrate what a Bayesian approach means with the following experiment. I take 1000 batters and put them in a room. Each batter has a skill (in this example, AVG) that I’ll call pₕ (the probability of a hit), just like in Equation 1 and Figure 1. Let’s say I know nothing about baseball, and think those batters are equally likely to have any pₕ in the range of possible AVG values [0,1]. Then I could arrange all those batters from low pₕ to high pₕ and end up with a uniform distribution called the prior distribution (Figure 6a). This prior can be described using a Beta distribution with parameters α = 1 and β = 1.

Equation 6. Mean of the Beta distribution: μ = α/(α+β).
Equation 7. Variance of the Beta distribution: σ² = αβ/((α+β)²(α+β+1)).

Now, let’s say a new batter walks in (I’ll call him J) with no experience, and the only way to fit him into the lineup (i.e., to know how good he is) is to let him and everyone else have some plate appearances. After his first plate appearance (PA), he does not get a hit (Figure 6b), so we update our belief about his pₕ. Mathematically, this update is done by incrementing the β parameter, which represents the number of failures (outs), while keeping the α parameter, which represents the number of successes (hits), the same. This update gives us a new posterior Beta distribution with α = 1 and β = 2 (Figure 6b). To help your intuition for the linearity of this Beta[1,2] distribution, think of removing batters from the room who can’t have the same pₕ as J as you try to zero in on a measure of J’s skill. You remove 100% of the pₕ=1 batters because J did not get a hit, so he can’t have pₕ=1. You remove 90% of the pₕ=0.9 batters from the room, and so on until you remove none of the batters with pₕ=0 because J did not get a hit and could be like any of them. Check out the excellent applet by Bognar (2021) to help you visualize how the beta distribution changes for different α and β.

After the second PA, if J gets a hit, we update our belief again. This time, the α parameter is incremented by 1, representing a success, while β remains the same. Now, the updated posterior Beta distribution has α = 2 and β = 2 (Figure 6c). The distribution becomes more symmetric, reflecting the increased probability that J has a higher pₕ. This code snippet illustrates how simple these calculations are:

# Import required libraries
import numpy as np
import pandas as pd

# Function to simulate batting average (binomial process)
def simulate_batting_average(p, n):
    return np.random.binomial(1, p, n)

# Function to perform the Bayesian update
def bayesian_update(alpha_prior, beta_prior, success, trial):
    # Update the posterior parameters
    alpha_post = alpha_prior + success           # Increment alpha by number of successes
    beta_post = beta_prior + trial - success     # Increment beta by number of failures
    return alpha_post, beta_post

# Initialize parameters for the experiment
p = 0.3                         # True probability of success (getting a hit)
n = 50                          # Number of plate appearances (PAs)
alpha_prior, beta_prior = 1, 1  # Uniform prior (assuming we know nothing)

# Simulate data (1 for hit, 0 for no hit)
successes = simulate_batting_average(p, n)

# DataFrame to store results
results = pd.DataFrame(columns=['PA', 'Mean', 'StdDev'])

# Perform Bayesian updates and store results
for trial in range(n + 1):
    if trial == 0:
        # Use prior distribution parameters before any observations
        alpha_post, beta_post = alpha_prior, beta_prior
    else:
        success = successes[trial - 1]  # Success (hit) in the current PA
        alpha_post, beta_post = bayesian_update(alpha_post, beta_post, success, 1)

    # Calculate mean and standard deviation of the posterior (Equations 6 & 7)
    mean = alpha_post / (alpha_post + beta_post)
    stddev = np.sqrt((alpha_post * beta_post) /
                     ((alpha_post + beta_post) ** 2 * (alpha_post + beta_post + 1)))

    # Append the results to the DataFrame
    results = pd.concat([results, pd.DataFrame({'PA': [trial], 'Mean': [mean], 'StdDev': [stddev]})],
                        ignore_index=True)

#### Note that most statistics are 0-or-1, coin-flip-like processes:
# AVG: did you get a hit or not; OBP: did you get on base or not.
# However, SLG and JOPS are weighted statistics that are linear combinations
# of 0-or-1 processes (did you get a single or not? Did you get a double
# or not? etc.). I model the Bayesian evolution of all these 0-or-1 processes
# separately, and then combine their individual probabilities into a joint
# probability.

Continuing this process for each subsequent PA, we update the Beta distribution parameters based on whether J gets a hit or not. Each update refines our belief about J’s pₕ (Figures 6d-f). By the end of the experiment, after multiple PAs, our posterior distribution closely approximates J’s true pₕ, balancing the prior knowledge with observed data.

Figure 7. Simulation of the same Bernoulli trial with pₕ=0.3 as Figures 1 and 6. μAVG refers to the mean of the posterior distribution, plus/minus the 95% confidence interval. aAVG refers to the actual batting average of the hitter.

This procedure is an example of Bayesian inference because I inferred the belief about pₕ from a combination of my prior belief and new data. Figure 7 is an animation simulating the evolution of the posterior distribution after each PA for a batter with true pₕ=0.3. You can see that in this particular simulation, even after 200 plate appearances, the true value of pₕ=0.3 is just barely within the 95% confidence interval of my inferred value for pₕ = μAVG = 0.248±0.059. Without updating confidence intervals from sequential plate appearances, I would have underestimated this player’s skill, and probably overestimated my confidence in my estimate of the player’s skill!
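Intervals like the 0.248±0.059 quoted above come straight from the posterior Beta distribution. Here is a sketch of the interval calculation, assuming SciPy; the hit/out counts are illustrative, not the exact counts behind Figure 7:

```python
from scipy.stats import beta as beta_dist

# Posterior after, say, 50 hits in 200 PA on top of a uniform Beta(1,1) prior
alpha_post = 1 + 50    # prior alpha + hits
beta_post = 1 + 150    # prior beta + outs

mean = alpha_post / (alpha_post + beta_post)          # Equation 6
lo, hi = beta_dist.interval(0.95, alpha_post, beta_post)
print(mean, lo, hi)
```

The width of this interval is the honest answer to "how sure are we?" after any number of PAs, and it keeps shrinking as α + β grows.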

Perhaps you can already guess how we are going to use Bayesian inference to decide whether one player is more skilled than another. Instead of adding some number of league-average plate appearances to a hitter’s year-end statistics, let’s leverage the knowledge we gain from the outcome of each PA. Figure 8 shows an example of this Bayesian approach to real data, and illustrates how the evolution and volatility of the statistic with each PA contributes to the mean and uncertainty of the Bayesian inference.

Here I turn to JOPS = 3.3*OBP + SLG, which I have introduced previously as a superb single statistic for predicting runs scored in Division 1 college baseball.

I convert JOPS to a plus-statistic: JOPS+ = (JOPS/μJOPS)*100, where μJOPS is the league average JOPS, and a JOPS+=100 corresponds to the league average.

Figure 8. Instead of an uninformed uniform prior (Figures 6 & 7), I take the league-wide distribution of JOPS from the previous season (the darkest purple curve in (a), μ~1.5) as the informed prior. Think of the prior as the distribution of possible JOPS a player might achieve, weighted toward league average. Then, after each PA, I update the posterior distribution in (a) as the probability density function, and in (b) as the mean and 95% confidence interval of the distribution. Eventually, after Jake Koonin (2B, Princeton) has his last PA, I infer my best guess at the batter’s true skill. The beauty of this analysis is that I end up with a distribution of possible true skills, not just a single number with unknown uncertainty.

I can perform a simulation like the one in Figure 8 for each player nominated for an All-Ivy award. I can take the posterior distribution for any group of players and ask, what percentage of the time does player A have higher true skill than player B — an approach that decision makers could take as well. Figure 9 depicts exactly this analysis. I take the four third basemen nominated for All-Ivy honors, and for each one I use their season-long evolutions of JOPS to infer their true skill.
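The "what percentage of the time is player A more skilled than player B" question reduces to sampling both posteriors. A sketch with hypothetical posterior parameters (not the real All-Ivy numbers, and using Beta posteriors for simplicity rather than the JOPS+ joint distributions):

```python
import numpy as np
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(3)
n_seasons = 10_000

# Hypothetical posteriors for two players (alpha, beta)
player_a = beta_dist(60, 140)   # mean 0.30, more PAs -> tighter posterior
player_b = beta_dist(28, 72)    # mean 0.28, fewer PAs -> wider posterior

# Draw one plausible 'true skill' per simulated season for each player
a = player_a.rvs(n_seasons, random_state=rng)
b = player_b.rvs(n_seasons, random_state=rng)

print((a > b).mean())  # fraction of simulated seasons where A beats B
```

This is the same Monte Carlo comparison behind the 81% figure for Wyatt Henseler: simulate many seasons from each posterior and count how often each player comes out on top.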

Figure 9. I perform a simulation with Bayesian inference (like in Figure 8) for each All-Ivy nominated third-baseman (3B). The posterior distributions represent my belief in the skill of each player. The legend reports, from left to right, the number of plate appearances (PA), the actual JOPS+, and the inferred mean±standard deviation of JOPS+. Not surprisingly, for many of these All-Ivy nominated players performing higher than the league mean (JOPS+ = 100), the Bayesian inference predicts declines in their JOPS+ (regression to the mean). That said, the full evolution and volatility with each PA in a player’s JOPS+ contributes to the final estimate of the player’s skill. If I simulate 10,000 seasons for each of these players, Wyatt Henseler is the best player in the group 81% of the time.

In the case of third base, Wyatt Henseler is a clear choice, ranking best in his player group 81% of the time (Figure 9 & Table 1). However, it is not always so easy, and these probabilistic outcomes make it possible to support a choice with quantitative levels of confidence. For example, at catcher, the leader squeaked out the win, ranking first in his group just 32% of the time. Table 1 compares my modeling results to the election results for the nominated player group at each position.

Table 1. For each position, I can repeat the analysis of Figures 8 and 9 to rank the best players at each position. JOPS+ 1st Team represents the player most frequently ranked the best by my Bayesian simulation. Best Player % indicates what fraction of the time that player ranked number 1 in his group after 10,000 simulated seasons.

In Table 1, my rankings are entirely based on JOPS+. On the one hand, JOPS+ is an exceptional predictor of runs generated, and my analysis accounts for the entire trajectory of a player’s season. On the other hand, I do not account for stolen bases or fielding prowess. Nevertheless, if you rank players by their batting statistics, I think 50% of the 1st team All-Ivy selections were correct, and only 10% of the 2nd team All-Ivy selections were correct (Table 1).

To sum up: (1) While there is a strong tendency to regress to the mean in college baseball, there also is a significant tendency to improve each year. (2) For that reason, I would not want to predict a player’s future performance or true skill by applying standard regression to the mean procedures to statistics that are perceived to have stabilized. (3) Fortunately, there is a simple probabilistic framework out there that can utilize a player’s entire season of data to infer a distribution of possible skill levels. This Bayesian inference allows you to quantitatively compare groups of players and predict with what likelihood one player is more skilled than another.


Adam Maloof

Prof. of Geosciences, studies the coevolution of life & climate in layers of rock, works on baseball analytics, shags flies, farms figs & flowers, plants trees