Baseball Analytics

A Beginner-Friendly Intro to Data Science in Baseball

… with a beginner.

Lucas Goldfarb
20 min read · Jan 12, 2024
DD5_3198, DD5_9774, DD5_2661 by All-Pro Reels, available under a Creative Commons License 2.0 at https://www.flickr.com/people/joeglo/

Table of Contents

Introduction
Preparing the Data
Setting Up The Model
Training the Model
Evaluating the Model
Improving the Model
Introducing New Metrics
Model Observations
Further Investigation
Looking to 2024
Final Thoughts

Introduction

With the offseason off to a hot start, lifelong fans and casual observers alike have presumably heard the news of Japanese superstar Shohei Ohtani’s stunning $700 million, 10-year deal with the LA Dodgers (per MLB.com).

For anybody still unfamiliar with Shohei Ohtani, he is a generational talent known for singlehandedly overpowering opponents with both his bat and arm. You don’t need to look beyond baseball’s most basic metrics to behold Ohtani’s immense value. Hence the record-breaking sum he’s been promised.

However, for the majority of Major League Baseball teams, spending hundreds of millions of dollars on top talent isn’t an option. Instead, building a team via successful draft picks, calculated trades, and targeted acquisitions is the only way to compete with big market teams. Utilizing advanced analytics becomes crucial here, as it helps these teams to make the right decisions and project future outcomes effectively.

This analysis involves delving into metrics that extend beyond the basics such as ERA (Earned Run Average), HR (Home Runs), or SV (Saves). One such metric, and one of the most highly touted ways of determining player value, is Wins Above Replacement, or WAR.

It’s important to define ‘Replacement.’ This just refers to the output expected from a deep bench player or minor leaguer, which we set at 0. For example, a player with a WAR of 2 contributed approximately 2 additional wins to their team over the season. While this metric isn’t exact, it offers a quick gauge of a player’s overall value.

I won’t go into extreme detail as WAR is calculated differently for different positions, but here’s a general outline for position players and starting pitchers:

Image by Author (Per Fangraphs)

In this project, I’ll develop a machine learning model to predict WAR for pitchers in the 2024 season. Additionally, I want to see what features the model values highly for its predictions.

My goal is to apply my budding CS education to the realm of baseball, and perhaps in doing so, even assist my favorite team, the Baltimore Orioles, in making some savvy pitching acquisitions this offseason.

I want to thank the Dataquest YouTube Channel for providing the majority of the underlying code and the basic idea for this project. The Channel’s videos are great for people just trying to get into machine learning and data science.

Preparing the Data

First things first: we need some data. We’re going to get it from an extremely helpful Python library called pybaseball, which handles all the web scraping for us and wraps it in an easily installable package that we’re going to use extensively throughout the project.

We’ll go ahead and install some other necessary libraries as well. I won’t get into the details of each one, but pandas and numpy are staples of data science in Python, and os is part of the standard library.
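
If you’re following along, everything used in this project can be installed with a single terminal command:

pip install pybaseball pandas numpy scikit-learn seaborn matplotlib statsmodels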

Next, we want to use the pitching_stats function to pull statistics for every player who pitched at least 50 innings in a season (specified by the qual parameter) within the time range I picked below.

After we store the data in a CSV file for easy access throughout the project, we group our data by IDfg, a unique ID given to each player by Fangraphs, and then filter out the players who have only one season. Since we’re looking to make predictions about future seasons, we need players with more than one season to train the model.

import os
import pandas as pd
import numpy as np
from pybaseball import pitching_stats

START = 2002
END = 2023

pitching = pitching_stats(START, END, qual=50)
pitching.to_csv("pitching.csv")
pitching = pitching.groupby("IDfg", group_keys=False).filter(lambda x: x.shape[0] > 1)

Displaying our data frame, we see nearly 6,500 rows of data and 393 columns (each one accounting for a different statistic).

Since the point of our project is to predict WAR, we need each player’s WAR from the following season attached to the current season’s row; this gives the model a target value to train against.

In the code below, we create a new column, Next_WAR, by shifting each player’s following-season WAR back onto the current season’s row, and then apply this function to every player group in our data frame.

def next_season(player):
    player = player.sort_values("Season")
    player["Next_WAR"] = player["WAR"].shift(-1)
    return player

pitching = pitching.groupby("IDfg", group_keys=False).apply(next_season)

We can see above that our data is filled with NaN (Null/Not a Number) values, which appear because earlier seasons didn’t track the more advanced metrics that have been introduced in recent years. Our ML model will not work with these null values, so we need to remove them. We can get the number of NaN values in each column with the code below.

null_count = pitching.isnull().sum()

We go ahead and create a list of the statistics that don’t have null values anywhere in the data and call them complete_cols. Our Next_WAR column is removed in the process, as expected (think about what happens to a player’s most recent season when we shift the following season’s WAR back: there is no next season, so the value is NaN), so we need to add it back.

complete_cols = list(pitching.columns[null_count == 0])
pitching = pitching[complete_cols + ["Next_WAR"]].copy()

Much better. We’ve cut our original number of features by 2/3.

The next issue is the presence of columns that can’t help us because of their data type: they hold text rather than numbers. We can find and remove these, as they’ll only confuse the model during training.

pitching.dtypes
pitching.dtypes[pitching.dtypes == "object"]
del pitching["Dollars"]
del pitching["Age Rng"]
del pitching["Team"]

Finally, the last step in cleaning our data is to drop the seasons where Next_WAR is null. This would just be the most recent season or the last season of a player’s career. We’ll keep a copy of the original data frame for the future but in order to train we need to get rid of these rows.

pitching_full = pitching.copy()
pitching = pitching.dropna()

Setting Up The Model

We’re going to use a ridge regression. In the simplest terms, ridge regression is a form of linear regression (remember putting a line through a bunch of points?) in which we try to prevent a common problem called overfitting. Think of overfitting like memorizing the specifics of practice questions without really understanding the subject, leading to mistakes on a real test.

Ridge regression tackles this by applying an ‘adjustment’ to the model, preventing it from focusing too much on any single feature in the training data. This adjustment is a bit like adding a balancing weight to keep our model from leaning too heavily on one feature, which helps it perform better on new, unseen data.

The strength of this adjustment is controlled by something called the alpha value. A higher alpha adds more of this balancing weight, making our model less likely to overfit by reducing the influence of any single feature. For example, an alpha of zero is simply regular linear regression. But as we increase alpha, the model becomes more balanced and less likely to get tripped up by oddities in the training data.

Image by Author
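
For anyone who wants the math spelled out, this is the standard ridge objective (the textbook form, not anything specific to our data): the usual squared error plus a penalty on the size of the coefficients, scaled by alpha.

$$\min_{\beta}\ \sum_{i}\left(y_i - x_i^{\top}\beta\right)^2 + \alpha \sum_{j}\beta_j^2$$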

We import the model from another ML library, scikit-learn, where we’ll also grab SequentialFeatureSelector and TimeSeriesSplit.

from sklearn.linear_model import Ridge
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import TimeSeriesSplit

rr = Ridge(alpha=1)

split = TimeSeriesSplit(n_splits=3)

sfs = SequentialFeatureSelector(rr, n_features_to_select=20, direction="forward", cv=split, n_jobs=4)

The feature selector works by trying different combinations of features to find the best combination for the model.

In our case, we do this by starting with zero features and moving ‘forward’ to slowly pick up one feature at a time. At the end, we will hopefully have the 20 most relevant statistics. This will also play a role in keeping the regression model from overfitting (relying too heavily) on less helpful features.

Image by Author

The TimeSeriesSplit function divides our dataset into training and test segments in a way specifically tailored to chronological data, and it does this several times, with each split reaching further forward in time. In each training set, we provide both the inputs (features) and their corresponding outcomes (Next_WAR), allowing the model to learn patterns. The test set, composed of data points the model hasn’t seen before, is used to evaluate the model’s predictive accuracy by seeing how well it can forecast outcomes.

Unlike most ML projects where data can be randomly split, the TimeSeriesSplit method is essential here; it ensures the model is trained on earlier data and tested on later data to maintain the chronological integrity of the dataset.
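
To make the splitting behavior concrete, here’s a toy illustration (not part of the project pipeline) of how TimeSeriesSplit walks forward through six made-up seasons:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

seasons = np.arange(2015, 2021).reshape(-1, 1)  # six toy "seasons"
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(seasons):
    print("train:", seasons[train_idx].ravel(), "test:", seasons[test_idx].ravel())

# train: [2015 2016 2017] test: [2018]
# train: [2015 2016 2017 2018] test: [2019]
# train: [2015 2016 2017 2018 2019] test: [2020]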

Image by Author

Our dataset presents a challenge due to the different scales across each feature. For instance, while the typical ERA is around 4.00, a pitcher might record 100+ IP (Innings Pitched) in a season, creating a significant disparity in scale. This can skew the model’s performance, as larger values may disproportionately influence the predictions.

To address this, we standardize the data by scaling all values to a range of 0 to 1. In our process, we selectively exclude certain columns that need to maintain their original scale and only apply this scaling to the remaining columns.

removed_columns = ["Next_WAR", "Name", "IDfg", "Season"]
selected_columns = pitching.columns[~pitching.columns.isin(removed_columns)]
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
pitching.loc[:, selected_columns] = scaler.fit_transform(pitching[selected_columns])

Now that everything is ready, we want to use the SequentialFeatureSelector to grab the 20 most important statistics for predicting the Next_WAR column. We put the results in a list and call them our predictors.

sfs.fit(pitching[selected_columns], pitching["Next_WAR"])

predictors = list(selected_columns[sfs.get_support()])

Training the Model

The next step in our process is to actually train and test the model, which we’ll do by creating a new function. The idea behind backtest is that we want the model to train on prior seasons, and then use that training to predict the results of the very next season. We then include that season in the training and move forward one.

At each iteration of the loop we have our ridge regression make a prediction. We can then store these results.

def backtest(data, model, predictors, start=9, step=1):
    all_predictions = []
    years = sorted(data["Season"].unique())

    for i in range(start, len(years), step):
        current_year = years[i]

        train = data[data["Season"] < current_year]
        test = data[data["Season"] == current_year]

        model.fit(train[predictors], train["Next_WAR"])
        preds = model.predict(test[predictors])  # returns a numpy array
        preds = pd.Series(preds, index=test.index)  # convert it into a Series
        combined = pd.concat([test[["Season", "Next_WAR"]], preds.rename("predictions")], axis=1)  # combine the Series
        combined["actual"] = combined["Next_WAR"]
        combined = combined.drop("Next_WAR", axis=1)

        all_predictions.append(combined)

    return pd.concat(all_predictions)

predictions = backtest(pitching, rr, predictors)

And there we have it, predictions for over 2,700 player seasons and right next to each one is the true value.

Evaluating the Model

Great. We’ve trained our model and made some predictions. The next logical question is, “So did it work?” We determine this by first getting some details about the values we are trying to predict, most importantly the standard deviation (roughly, the typical distance of values from the mean) of the dependent variable.

pitching["Next_WAR"].describe()

One way of telling if our model is effective is to find the root mean squared error (RMSE) and compare it to this standard deviation.

RMSE measures the typical size of the prediction errors: to keep positive and negative errors from canceling out, we square each error, average them, and take the square root. If our error is smaller than the standard deviation, the model is doing better than the naive baseline of simply guessing the average Next_WAR for every player.
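
In formula form (the standard definition, where y_i is the actual Next_WAR and ŷ_i is the prediction):

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$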

We compute this with scikit-learn’s mean_squared_error and then take the square root to undo the squaring.

from sklearn.metrics import mean_squared_error

mse = mean_squared_error(predictions["actual"], predictions["predictions"])
rmse = mse ** 0.5

OUTPUT: 1.1693872742080313

This leaves us with an error 25% less than the standard deviation. Decent. Now we have two options: take what the model delivered and move on OR try some different techniques to improve its accuracy.

Improving the Model

If you were leaning towards the former, I have some bad news for you and it starts with two words: data visualization.

What we have so far doesn’t give us much of an idea of what is happening in the model, and without more information it’ll be difficult to improve it.

In the Python data science stack, the two most common libraries for visualizing data are seaborn and matplotlib. The first thing we want to do is get a visual of how our model’s predictions compare to the actual WAR over time.

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))

sns.lineplot(x='Season', y='actual', data=predictions, label='Actual WAR', marker='o')

sns.lineplot(x='Season', y='predictions', data=predictions, label='Predicted WAR', marker='o')

plt.title('Actual WAR vs Predicted WAR Over Time')
plt.xlabel('Season')
plt.ylabel('WAR')
plt.legend()

plt.show()

The result is a bit different than we might have expected. The mean WAR from 2011 to 2018 is fairly consistent, right around 1.3–1.5. We then see a modest dip in 2019 before a significant jump in 2020.

This seems counter-intuitive. If you look at pitcher stats in the COVID-shortened season, the maximum WAR was Shane Bieber’s 3.1, compared to Corbin Burnes’ 7.5 the following season. Why is the average WAR for 2020 so much higher?

Image by Author

The answer lies in how we gathered our data. Since we set the minimum at 50 IP to be included, the shortened season produced significantly fewer qualified pitchers: roughly 300 fewer were included for 2020, even though the maximum WAR values were much lower.

As a result, the smaller range of qualified players leaves out a lot of “inning-eater” pitchers who would typically meet the inning threshold throughout the slog of a normal 162-game season. This becomes clear with a histogram comparing the 2020 and 2021 seasons.
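
I won’t reproduce the exact chart here, but a quick sketch of that comparison (reusing the seaborn and matplotlib imports from above and the unscaled pitching_full frame) looks something like this:

plt.figure(figsize=(10, 5))
sns.histplot(pitching_full[pitching_full["Season"] == 2020]["WAR"], color="red", label="2020")
sns.histplot(pitching_full[pitching_full["Season"] == 2021]["WAR"], color="blue", label="2021")
plt.title("WAR Distribution of Qualified Pitchers: 2020 vs 2021")
plt.xlabel("WAR")
plt.legend()
plt.show()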

In general, the 2020 season is not good for our model. A shortened season is more susceptible to hot or cold streaks from pitchers who would typically regress, or improve, over the course of a full year.

Additionally, we’ve seen how stats such as WAR are distorted for 2020 and are potentially throwing off the training.

There are a couple potential solutions to our problem.

The first is to try to account for the average WAR discrepancy by lowering the innings requirement. The problem is that lowering the threshold far enough would also pull in pitchers whose stats are insignificant because of how little they played.

Despite not wanting to lose more data points, it is best that we omit the 2020 season entirely. Let’s try it.

pitching = pitching[pitching['Season'] != 2020]

We make the adjustment by filtering out all data from the 2020 season. With nothing else changed, the graph now looks more like what we expected.

Now that’s all well and good, but you may be wondering what this did for our predictions.
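
Re-running the same backtest and RMSE calculation on the filtered data gives the updated number:

predictions = backtest(pitching, rr, predictors)
rmse = mean_squared_error(predictions["actual"], predictions["predictions"]) ** 0.5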

RMSE WITHOUT 2020: 1.1455012871254722

By training again, we see that our predictions improved a bit. The average error dropped from 1.169 to 1.145. This is a good start, but we have more factors to look at when it comes to improving and understanding the model.

Introducing New Metrics

The first is a different statistical metric we can use to evaluate the validity of the system.

R-squared is a measure of how well the independent variables explain the variation in the dependent variable.

Imagine you’re trying to figure out why your dog barks so much.

First, you plot the amount of barking against the number of times the doorbell rings throughout the day. We find that the more the doorbell goes off, the more Fido barks. This demonstrates that the two are probably related, and therefore this variable will have an R-squared value closer to 1.

Next, we plot the amount of barking against the number of clouds in the sky each day. These two things are quite unrelated, so you should expect a lower R-squared value. Overall, R-squared tells us how well our specific variables explain the outcome.

Image by Author
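
For completeness, r2_score computes the standard definition, where ȳ is the mean of the actual values; 1 means perfect predictions, and 0 means we do no better than always guessing the mean.

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$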
from sklearn.metrics import r2_score

r_squared = r2_score(predictions['actual'], predictions['predictions'])

OUTPUT: 0.39469760522024366

Now, when we calculate the R-squared for our results, we find that our R-squared is right around 0.39. This means that our features and their respective weights account for roughly 40% of the variation in the data.

While this might be considered low in some fields, sports are inherently difficult to predict due to factors beyond player stats — from injuries and strategy shifts to weather.

Regardless, we should use RMSE in tandem with R-squared to see both prediction accuracy and to assess the validity of our predictors.

Next, I want to get a clearer picture of these different features and how much of an effect each one had. We can do this by plotting a horizontal bar chart including each coefficient, once again using seaborn and matplotlib.

feature_names = predictors

coefs_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': rr.coef_
})

coefs_df = coefs_df.reindex(coefs_df.Coefficient.abs().sort_values(ascending=False).index)

plt.figure(figsize=(10, 8))
sns.barplot(x='Coefficient', y='Feature', data=coefs_df)
plt.title('Coefficient Plot of Ridge Regression Model')
plt.xlabel('Coefficient Value')
plt.ylabel('Feature')
plt.show()

The coefficients represent the ‘weight’ given to each stat by the model. The greater the absolute value, the more of an impact it had.

Positive values mean the feature is positively related to Next_WAR. For example, we see that a higher SO (Strikeouts) count indicates a greater predicted WAR, while the opposite is true for a feature like Age, suggesting that WAR decreases as players get older.

This is a good start to see how important each feature is to the model.

A lot of it checks out immediately.

We see that Soft%+ (the percentage of balls put in play that are hit softly) positively influences WAR. This makes sense, as pitchers who induce weak contact are more likely to get outs, leading to more wins. The opposite is of course true for Hard%+, which has a negative impact on the WAR prediction.

These are the features that “make sense” but as we look closer, some features may not exhibit the behaviors we expect (more on some specifics of this later).

Overall, certain aspects of these results may lead to some confusion when it comes to interpreting them.

Oftentimes in predictive modeling, features that are highly correlated (collinear) can inflate certain patterns and end up masking the individual impact of each variable involved, similar to having multiple people talking to you at once while you try to follow just one of them.

This leads us to the next statistical measure we should look at.

One way to spot problematic features is by calculating the Variance Inflation Factor (VIF). VIF puts a number behind what we’re talking about: the more severe the multicollinearity, the higher the VIF. We can use a new library called statsmodels to give us these values for our data.

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = pitching[predictors]
X = sm.add_constant(X)

vif = pd.DataFrame()
vif["variables"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

The initial results present some concern. If our goal is to keep the VIF below 8 for each variable, our major focus should be on GS (Games Started) and SO with scores of 17.5 and 14.2, respectively. Variables like WAR and MD (Meltdowns) are a bit high but not the immediate worry.

There are a few ways to go about this, but the most straightforward solution is to remove the problem variables. It’s worth weighing each feature’s effect on the model against its VIF: SO has a high VIF but is weighted heavily in the regression, so it makes more sense to take GS out instead. Let’s try it.

By adjusting the parameters for removed_columns, we can re-run our system and get our new VIF results.

#we need to go back in our code to do this
removed_columns = ["Next_WAR", "Name", "IDfg", "Season", "GS"] #add GS
selected_columns = pitching.columns[~pitching.columns.isin(removed_columns)]
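
Concretely, that means refitting the feature selector with the trimmed column list and recomputing the VIF (a condensed re-run of code we’ve already seen):

sfs.fit(pitching[selected_columns], pitching["Next_WAR"])
predictors = list(selected_columns[sfs.get_support()])
predictions = backtest(pitching, rr, predictors)  # refresh the predictions with the new feature set

X = sm.add_constant(pitching[predictors])
vif = pd.DataFrame()
vif["variables"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]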

Much better.

We can see that nearly all scores up and down the board decreased to within the desired range.

While we could continue to experiment with removing or editing different features, this is a substantial improvement, and stripping out more variables could end up confusing the model or discarding genuinely useful signal.

Additionally, our RMSE and R-squared have remained stable post-adjustment, allowing us to feel a bit more confident in our chosen features and our interpretation of them.

Now, let’s go back and take a look at some features that aren’t making much sense at first glance, especially now that we’ve made some changes.

Model Observations

One of the first things we notice when looking at the bar chart is the prominence of HR/FB%+ (Home Run to Fly Ball Percentage Plus), a derivation of HR/FB, a stat that represents the ratio of fly balls given up by a pitcher that become home runs.

For those unfamiliar with advanced baseball statistics, stat trackers have somewhat recently started using ‘+’ stats. To quote from the Fangraphs website on the topic:

“ … these stats have a baseline of 100, where the number above or below 100 is the percentage above or below average a player is. For instance, Pedro Martinez’s 1999 K%+ is 239, that means he was 139% above the league average.”

This type of metric has gained favor with analysts because it packs more context into a single number than the traditional version of the stat. It also puts everything on the same scale, making it easier for a machine learning model to pick up on patterns with less confusion.

Putting this back into context, a HR/FB%+ value exceeding 100 indicates that the pitcher surrendered home runs per fly ball at a rate higher than the league average by that specific percentage.

If you’re like me, you hear this and think “Why would giving up more homers relate to a higher predicted WAR?” I originally overlooked this result and chalked it up to there being some sort of confusion in the model. However, upon further research we can uncover some secrets that this metric holds.

HR/FB is not just seen as a way to track how often pitchers are giving up a big hit — analysts see it more as an indicator of luck.

The league-average HR/FB rate is said to sit right around 10%, and despite some pitchers posting consistently high numbers throughout their careers, most pitchers end up right around that average in giving up homers.

Image by Author

The left table shows the lowest HR/FB ratios for the 2019 season, where quite a few players were able to sustain rates below the league average. But the table on the right, showing the lowest HR/FB values from 2015 to 2019, suggests that time brings everyone back toward the average.

The rationale behind this is that the difference between a deep fly ball caught for an out and a game-changing home run is highly dependent on various factors — whether this is different park dimensions, weather conditions, quality of defense, or something else.

Although a pitcher might experience a season with an above average HR/FB ratio, consistent player data suggests that over time this value will regress toward the mean.

Let’s see if, by comparing HR/FB%+ for pitchers over multiple seasons, we can uncover any revealing trends.

If we pull the 20 worst HR/FB%+ scores from the 2022 season and line them up against each pitcher’s WAR in 2022 and 2023, we can get a better picture.

# the 2022 and 2023 values live in separate rows, so we merge each pitcher's two seasons
# on IDfg, using pitching_full (unscaled, and it still contains the 2023 rows)
pitching_2022 = pitching_full[pitching_full['Season'] == 2022]
pitching_2023 = pitching_full[pitching_full['Season'] == 2023]
pitching_2022_sorted = pitching_2022[['IDfg', 'Name', 'HR/FB%+', 'WAR']].merge(
    pitching_2023[['IDfg', 'HR/FB%+', 'WAR']], on='IDfg', suffixes=(' 2022', ' 2023')
).sort_values(by='HR/FB%+ 2022', ascending=False)
pitching_2022_sorted.head(20)

Looking at the results, only 3 players didn’t see an increase in WAR, 2 of whom held the same HR/FB%+, and the third was the only one who decreased their ratio and didn’t improve WAR. Every other player improved on both of these metrics.

Notably, young players such as Kyle Bradish and Yusei Kikuchi had breakout seasons in 2023.

Then, of course, veteran staples such as Gerrit Cole and Nathan Eovaldi were able to bounce back in 2023 after less impressive performances in 2022 — both lowering their HR/FB%+ significantly.

While there are certainly other factors that we didn’t consider in these hasty observations (IP for example), the results are still intriguing.

In the future, when analyzing pitchers who have struggled and trying to predict who will improve in future seasons, this metric may warrant some investigation to see if a player might just be down on their luck.

Further Investigation

Continuing in our pursuit to understand this model, we have another major issue with the content of our data. In baseball there are two main types of pitchers:

  1. Starters, who begin the game, and
  2. Relievers who come in after them.

Relievers and starters are treated differently in their WAR calculation; notably, relief pitchers are typically allotted less WAR.

The first problem stems from the concept of G (Games played) vs. GS. In our bar chart, we saw that G had a negative effect on WAR. This is because, despite having lower WAR, relief pitchers appear in many more games than starters.

As we can see in the tables below, this could mislead the algorithm into thinking that playing in more games leads to a lower WAR.

Top 15 in Games Played for Starters and Relievers
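
A rough way to see the same pattern in our own data (a quick check using the unscaled pitching_full frame and treating any pitcher with zero games started as a reliever) is to compare average games and WAR by role:

pitching_full["Role"] = np.where(pitching_full["GS"] > 0, "Starter", "Reliever")
pitching_full.groupby("Role")[["G", "WAR"]].mean()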

The second problem is that ‘reliever’ stats such as SV, SD (Shut Downs), and HLD (Holds) are typically indicative of a good reliever; however, every starting pitcher is going to have flat zeroes for these metrics. This could throw off our model by using these stats to make predictions about the wrong pitcher type.

The best solution I could think of for this problem, given our data, is to include a new feature that keeps track of pitcher type, giving the model more information it could hopefully pick up on. The catch is that a ridge regression lacks the complexity needed to exploit a flag like this on its own.
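
For illustration, the flag itself is easy to add (a hypothetical sketch; I didn’t feed this into the final model):

# treat any season with at least one game started as a "starter" season,
# pulling the raw GS values from the unscaled pitching_full frame
pitching["Is_Starter"] = (pitching_full.loc[pitching.index, "GS"] > 0).astype(int)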

Models such as XGBoost or Random Forest might be able to better capture these intricacies. However, looking forward, it would likely be best to separate starters and relievers into different models to avoid these issues altogether.

Looking to 2024

Well there we have it. We did some data preprocessing, prepared our model, trained it, tested it, took a look at some statistical measures, made some adjustments, tuned some features, and made a few observations.

Using all of this information, I want to go ahead and make some predictions for next season.

Since we’re currently in the middle of the offseason, I thought it would be a good idea to see which pitchers on the market right now are slated to perform the best, according to the regression model.

For time’s sake, I won’t include the code here (check out the GitHub link below).

And there we have it. My completely foolproof predictions for the 2024 season. I’ll consider it a success if the direction (either negative or positive) is correct for the pitchers on this list, but I certainly wouldn’t put money on young stars such as George Kirby regressing next season.

FGDC scores per Fangraphs

Here are my predictions for each player next to the predictions by Fangraphs’ own model (FGDC: FanGraphs Depth Chart) for next season to give some better perspective.

FGDC uses a combination of two of the most widely respected projections models, Steamer and ZiPS, to give its own predictions.

Comparing the two, we have the same idea for 7 of the 9. FGDC feels quite differently, however, for Corbin Burnes and Shane Bieber, foreseeing improved seasons for both.

Final Thoughts

While analytics are a huge part of the game, at the end of the day, sports are largely unpredictable.

Stuff happens.

Regardless, analysts (and ‘wannabe’ analysts such as myself) will continue using data to attempt to forecast the game in search of those under-the-radar players who might just give your team the edge needed to win against the modern Goliaths of the sport.

If you’re interested, feel free to take a look at my GitHub which has all the files (including graphics) from the project :)


Written by Lucas Goldfarb

Georgia Tech | Computer Science 2026 | Keep Up the Good Journey