Stepping Up To The Plate — Predicting MLB Player Value & Team Wins with Machine Learning

12 min read · Aug 31, 2022

If you’re a baseball fan like me, you are more than likely familiar with living and dying by your favorite team each season. Like any other business, a major league franchise has to weigh many factors to succeed.

For many teams, success begins and ends with constructing a roster under a budget and turning that roster into a championship-caliber team. So how does a team decide how to allocate its budget across a roster of players? What are the keys to winning?

Does a higher team salary usually correspond to more wins? Ask the TB Rays and MIL Brewers how it has been working out for them (visualization created using 2021 data).

Can the smaller market teams even compete when they can’t afford the superstar players who help with winning? These are just some of the questions raised in Michael Lewis’s Moneyball: The Art of Winning an Unfair Game, which inspired this project.

The Goal:

Construct a machine learning model that can predict the salaries of MLB hitters and pitchers using historical baseball data. Additionally, I wanted to better understand which statistics in baseball contribute the most to winning.

Using these models, we can better understand the value of the baseball player and identify which players are potentially undervalued or overvalued when constructing a roster of players.

TLDR: I ended up developing models that predict batter and pitcher salaries with an RMSE of about $1.8M and $1.6M, respectively, and team wins with an RMSE of about 1 win.
I also deployed Streamlit web applications for predicting batter and pitcher salaries; they can be found on my Github.

By The Numbers — The Data

Baseball player and team data was sourced using the PyBaseball package which allows users to easily scrape Baseball Reference, Baseball Savant, and FanGraphs data. For this analysis, data was collected from Baseball Reference and FanGraphs.

A total of 6 datasets were scraped, cleaned, and formatted for modeling and consist of the following Basic and Advanced tables:

*Batters:

Basic: 9100 x 28 Features (2000–2021)
Advanced: 3200 x 321 Features (2014–2021)

*Pitchers:

Basic: 9400 x 34 Features (2000–2021)
Advanced: 3900 x 334 Features (2014–2021)

Teams:

Basic: 1600 x 61 Features (1960–2021)
Advanced: 210 x 634 Features (2014–2021)

*Batter and pitcher data was collected for players who had a minimum of 100 plate appearances and 30 innings, respectively.
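As a rough illustration of that playing-time filter, the sketch below keeps only the player-seasons that meet the thresholds. The record layout and the field names (PA, IP) are assumptions made for the example, not the actual column names in the scraped tables.

```python
# Minimal sketch of the playing-time filter; field names are assumed.
MIN_PA = 100  # minimum plate appearances for a batter season
MIN_IP = 30   # minimum innings pitched for a pitcher season

def qualified(rows, field, minimum):
    """Keep only the rows whose `field` value meets the minimum."""
    return [row for row in rows if row[field] >= minimum]

# Made-up rows purely for illustration.
batters = [
    {"player": "Batter A", "PA": 612},
    {"player": "Batter B", "PA": 45},
]
pitchers = [
    {"player": "Pitcher A", "IP": 180.0},
    {"player": "Pitcher B", "IP": 12.0},
]

qualified_batters = qualified(batters, "PA", MIN_PA)
qualified_pitchers = qualified(pitchers, "IP", MIN_IP)
print([row["player"] for row in qualified_batters])
print([row["player"] for row in qualified_pitchers])
```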

Basic data consisted of traditional baseball statistics that have been collected for decades, such as HRs, RBIs, and SOs. Advanced data consisted of over 300 features collected by Statcast, a state-of-the-art tracking technology that allows for the collection and analysis of a massive amount of baseball data in ways that were never possible in the past.

Since Statcast is a relatively new data collection technology, data was limited to 2014–2021.

When looking at the datasets, it seems like you can measure almost everything in the game of baseball. I’d even venture to say that the average baseball fan wouldn’t recognize most of the advanced statistics (I know I didn’t!).

For a comprehensive list describing all the features and statistics, refer to the following documentation links:

Data Visualization

Before performing any data modeling, one of the first steps is to understand the structure of the data, starting with a glimpse at some of the highest paid players in the game today.

In 2021 alone, some of the highest paid batters received salaries in excess of $20M, with Mike Trout topping the list at about $37M. There are also some older players like Albert Pujols, who made $29M that year after signing a massive long-term contract in 2012.

These current and former superstars of the game can also be interpreted as statistical outliers.

Some of the highest paid batters in 2021 earned massive contracts due to their historical performances.

Similarly, when we look at the highest paid pitchers in 2021, some of the largest markets stand out above the rest. The Los Angeles Dodgers alone were paying 4 pitchers on their roster in excess of $30M!

Big market cities like Los Angeles, New York, and Boston are handing out massive contracts to pitchers.

However, the average batter and pitcher salary is nowhere near the top echelon of player salaries in baseball. In 2021, the average batter salary was $4.8M while the average pitcher salary was about $3M. Granted, even these average salaries are skewed by the outliers in the datasets.

What is interesting to note is that there appears to be an increasing gap between the average pitcher and batter salary. This gap can likely be explained by a number of factors, such as how often each position plays: batters, or position players, tend to play every day while pitchers do not.

Average MLB batter and pitcher salaries have been on the rise each year.

If You Build It … Data Processing & Understanding

The target variable was player salary for the batter and pitcher datasets, and wins for the team datasets.

Feature Engineering and Handling

Missing Salaries: Salaries were collected for each individual year an individual player was active.

There were many instances of missing salary values. Missing values were imputed based on prior salaries available for each player.

For example, if Derek Jeter made $15M in 2010, and a missing value existed for 2011, the missing value would be filled with Jeter’s previous year salary of $15M.
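The carry-forward rule can be sketched in a few lines. The dictionary layout is an assumption, and the figures simply reuse the hypothetical Jeter example from the text.

```python
def fill_missing_salaries(salary_by_year):
    """Fill each missing salary (None) with the most recent known salary.

    Years before a player's first known salary stay None, because there is
    no prior value to carry forward.
    """
    filled = {}
    last_known = None
    for year in sorted(salary_by_year):
        salary = salary_by_year[year]
        if salary is None:
            salary = last_known   # reuse the previous season's figure
        else:
            last_known = salary   # remember the latest real value
        filled[year] = salary
    return filled

jeter = {2010: 15_000_000, 2011: None}
print(fill_missing_salaries(jeter))
```

Leaving early gaps empty is a design choice in this sketch; how the project handled players with no prior salary on record is not stated.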

Missing Advanced Values: The advanced Statcast data consisted of mostly sparse data (or many data points with zeros) for very specific data collection columns.

For example, the wKN statistic measures how well a batter performed against, or a pitcher performed using, a knuckleball. Knuckleballs are rarely thrown in baseball, so this statistic has many missing values. In this instance, the missing feature values were filled with zeros.

Adjusting for Inflation: While the dataset was limited to players from the last 21 years, a dollar in 2000 was not worth the same as a dollar in 2021, so salaries from different seasons are not directly comparable. To account for the disparity, player salaries were adjusted for inflation using the national Consumer Price Index (CPI) for the years between 2000 and 2021.
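A standard CPI adjustment multiplies a salary by the ratio of the base-year index to the index of the year it was paid. The sketch below assumes 2021 as the base year, which the project does not state, and uses placeholder index values rather than real CPI figures.

```python
def adjust_for_inflation(salary, year, cpi_by_year, base_year=2021):
    """Express a salary paid in `year` in `base_year` dollars."""
    return salary * cpi_by_year[base_year] / cpi_by_year[year]

# Placeholder index values chosen to keep the arithmetic easy to follow.
# They are NOT real CPI figures; substitute the published annual averages.
cpi_by_year = {2000: 100.0, 2021: 150.0}

adjusted = adjust_for_inflation(10_000_000, 2000, cpi_by_year)
print(f"${adjusted:,.0f}")
```

With these placeholder values, prices rose by half between the two years, so a $10M salary from 2000 is treated as $15M in 2021 dollars.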

Average Salaries: The batter and pitcher datasets were aggregated to one row per player, using each player’s average salary across all the years that player played. This aggregation effectively removed categorical features such as position and team, since many players played multiple positions and for multiple teams across their careers.

It was discovered early on in the modeling process that categorical variables such as player position and team explained very little of the variance in player salaries.

Salary Difference Between Years: To further account for variability in player contracts between years, the average difference between player salaries across an entire career was calculated as a feature in the batter and pitcher datasets. This partially addresses the large salary difference for players who make a significant amount more entering free agency.

This engineered feature is called the average salary difference. For example, the average salary difference for a player would be $1M if his salary was $14M in 2012 and $15M in 2013.
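A minimal sketch of that calculation follows. It assumes signed year-over-year differences and a chronologically ordered list of salaries; whether the project used signed or absolute differences is not stated.

```python
def average_salary_difference(salaries):
    """Mean year-over-year change for a chronologically ordered salary list."""
    if len(salaries) < 2:
        return 0.0  # assumption: a single season has no difference to average
    diffs = [later - earlier for earlier, later in zip(salaries, salaries[1:])]
    return sum(diffs) / len(diffs)

# The example from the text: $14M in 2012, then $15M in 2013.
print(average_salary_difference([14_000_000, 15_000_000]))
```

With signed differences the result reduces to (last salary - first salary) / (seasons - 1), so one large free-agency jump is spread across the whole career.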

Feature Selection: To reduce the complexity of the model, scikit-learn’s feature_selection module was used. Its SelectKBest selector performed best at identifying the most important features and explaining the variance in the target.
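Univariate selection of this kind scores each feature against the target on its own and keeps the k highest scorers. The scoring function used in the project is not stated, so the standard-library sketch below assumes a squared Pearson correlation score, which should rank features in the same order as scikit-learn's f_regression. The function select_k_best here is a hypothetical stand-in, not the scikit-learn class, and the numbers are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def select_k_best(features, target, k):
    """Return the names of the k features with the highest squared correlation."""
    scores = {name: pearson(values, target) ** 2 for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Made-up numbers purely for illustration (salary in $M).
features = {
    "HR": [5, 12, 20, 31, 40],
    "Age": [23, 25, 27, 30, 33],
    "Noise": [3, 1, 4, 1, 5],
}
salary = [0.6, 1.5, 4.0, 9.0, 15.0]
print(select_k_best(features, salary, k=2))
```

Note that a univariate score says nothing about redundancy: HR and Age are both kept here even though they are strongly correlated with each other.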

Domain knowledge about the game of baseball also came in handy here when selecting features and removing multicollinear features (i.e., features that are strongly correlated with each other). Multicollinearity was expected in the datasets and required careful thought when selecting the final features for modeling.

Preprocessing

A 75%-25% train-test split and 5-fold cross-validation were used for assessment on the batting, pitching, and team datasets.
A standard scaler was applied to each dataset to prepare for modeling. The target variables, Salary and Wins, were log-transformed, and predictions were converted back to the original scale after modeling.

Transforming the data was fairly straightforward since there were no categorical variables and all the missing values had been imputed.
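The preprocessing steps can be sketched with the standard library alone. This is an illustration under assumptions, not the project's code: the helper names are hypothetical, the natural log is assumed for the target, the scaler is fit on the training rows only (common practice, not confirmed by the text), and the data is made up. In scikit-learn the usual counterparts are train_test_split and StandardScaler.

```python
import math
import random

def split_train_test(rows, test_fraction=0.25, seed=42):
    """Shuffle the rows and hold out `test_fraction` of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_scaler(values):
    """Learn the mean and (population) standard deviation of the values."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, std or 1.0  # avoid dividing by zero for constant columns

def scale(values, mean, std):
    """Standardize values with previously learned statistics."""
    return [(v - mean) / std for v in values]

# Made-up (home runs, salary in dollars) pairs purely for illustration.
rows = [(5, 600_000), (12, 1_500_000), (20, 4_000_000), (31, 9_000_000),
        (40, 15_000_000), (8, 750_000), (25, 6_000_000), (17, 2_500_000)]

train, test = split_train_test(rows)
mean, std = fit_scaler([hr for hr, _ in train])
x_train = scale([hr for hr, _ in train], mean, std)
x_test = scale([hr for hr, _ in test], mean, std)  # reuse training statistics

# Log-transform the target for modeling, then invert to get dollars back.
y_train_log = [math.log(salary) for _, salary in train]
y_train_dollars = [math.exp(value) for value in y_train_log]
print(len(train), len(test))
```

Reusing the training mean and standard deviation on the test rows keeps information about the held-out data from leaking into the model.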

Modeling Workflow

As you probably expected, I had a lot of modeling to do! There were 6 total datasets and having an organized workflow was crucial.

To simplify the process, a helper function model_results was constructed to get individual model results consisting of training, testing, and validation scores.

Example of model_results helper function giving me model results in an instant!
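The helper itself is only shown as a screenshot, so the following is a hypothetical reconstruction of what such a function might look like, not the author's implementation. It assumes a model object with fit and predict methods (the convention scikit-learn estimators follow), uses a stand-in one-feature line model, and builds five interleaved validation folds without shuffling. The data is made up.

```python
import math

def r_squared(actual, predicted):
    """Coefficient of determination."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

class SimpleLine:
    """Stand-in one-feature least-squares model with a fit/predict interface."""
    def fit(self, xs, ys):
        mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        self.slope = sxy / sxx
        self.intercept = mean_y - self.slope * mean_x
        return self

    def predict(self, xs):
        return [self.intercept + self.slope * x for x in xs]

def model_results(make_model, x_train, y_train, x_test, y_test, folds=5):
    """Hypothetical helper: fit a model and report train, test, and CV scores."""
    model = make_model().fit(x_train, y_train)
    cv_scores = []
    for i in range(folds):
        # Every `folds`-th training row, starting at offset i, is held out.
        val_idx = set(range(i, len(x_train), folds))
        fit_x = [x for j, x in enumerate(x_train) if j not in val_idx]
        fit_y = [y for j, y in enumerate(y_train) if j not in val_idx]
        val_x = [x_train[j] for j in sorted(val_idx)]
        val_y = [y_train[j] for j in sorted(val_idx)]
        fold_model = make_model().fit(fit_x, fit_y)
        cv_scores.append(r_squared(val_y, fold_model.predict(val_x)))
    return {
        "train_r2": r_squared(y_train, model.predict(x_train)),
        "test_r2": r_squared(y_test, model.predict(x_test)),
        "test_rmse": rmse(y_test, model.predict(x_test)),
        "cv_r2_mean": sum(cv_scores) / len(cv_scores),
    }

# Made-up data purely for illustration.
x_train = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_train = [3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.8, 17.1, 19.0, 21.2]
x_test, y_test = [11, 12, 13], [22.9, 25.1, 27.0]
results = model_results(SimpleLine, x_train, y_train, x_test, y_test)
print(results)
```

Passing a factory such as SimpleLine rather than a fitted model lets each fold start from a fresh, unfitted model, and comparing the train and test scores side by side is what exposes underfitting or overfitting.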

Metrics used for determining model performance are the coefficient of determination (R2), and the root mean square error (RMSE).

For further context, the RMSE can best be interpreted as the typical size of a model’s prediction error. In other words, an RMSE of $3M means that the model’s predictions miss the actual values by roughly $3M on average, with larger misses weighted more heavily.
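For reference, the standard definitions of the two metrics are given below, where y_i is the actual value, ŷ_i is the model's prediction, ȳ is the mean of the actual values, and n is the number of observations.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```

Because the errors are squared before they are averaged, a few large misses raise the RMSE more than many small ones do.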

A Triple Play — Modeling Process & Results

The modeling process involved establishing a baseline linear regression model for each dataset and then building successive models to improve upon the baseline.

Models utilized in this analysis include support vector machine (SVM), gradient boosting, random forest, CatBoost, XGBoost, and neural network (MLP Regressor), each of which approaches regression differently.

In general, the training and testing datasets were passed through the preprocessing pipeline described above, and each model was fit to the training set. Hyper-parameter tuning of each model was performed using GridSearchCV to find the parameter combination that maximized the coefficient of determination (R2).
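The idea behind an exhaustive grid search is simple: try every parameter combination and keep the one with the best score. The sketch below is a simplified stand-in that assumes a single validation split and a toy ridge-style model, whereas GridSearchCV scores each combination with cross-validation. The parameter grids used in the project are not given, so the grid here is made up.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every parameter combination and return the best one by score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def r_squared(actual, predicted):
    """Coefficient of determination."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Toy one-feature data, made up for illustration.
x_fit, y_fit = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]
x_val, y_val = [5, 6], [10.1, 11.9]

def validation_r2(alpha):
    """R2 on the validation rows for a ridge-penalized slope through the origin."""
    slope = sum(x * y for x, y in zip(x_fit, y_fit)) / (sum(x * x for x in x_fit) + alpha)
    return r_squared(y_val, [slope * x for x in x_val])

best_params, best_score = grid_search({"alpha": [0.0, 1.0, 10.0]}, validation_r2)
print(best_params, round(best_score, 3))
```

The cost grows multiplicatively with each added parameter, which is worth keeping in mind when tuning several models across six datasets.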

The following summarizes the results for each dataset with the corresponding best performing model based on R2, RMSE, and performance related to underfitting & overfitting:

Batting:

Basic — Gradient Boost Regression (R2 = 0.75, RMSE = ±$1.8M)
Advanced — Gradient Boost Regression (R2 = 0.78, RMSE = ±$2.8M)

Pitching:

Basic — Gradient Boost Regression (R2 = 0.73, RMSE = ±$1.6M)
Advanced — Gradient Boost Regression (R2 = 0.76, RMSE = ±$2.4M)

Teams:

Basic — Linear Regression (R2 = 0.92, RMSE = ±3 Wins)
Advanced — Linear Regression (R2 = 0.98, RMSE = ±1 Win)

Evaluation

Predicting Salaries

When comparing advanced metric tables to basic metric tables, there is no significant difference in R2. However, the advanced tables have a larger RMSE, likely because fewer data points are available for 2014–2021. As a reminder, the basic tables incorporate more data points (2000–2021), which likely makes their error estimates more stable.

When predicting batter salaries in 2022, I am able to easily identify players who may be under or over-valued. Deployed through Streamlit.

The best RMSE was less than $2M for both batter and pitcher salaries, and it was achieved on the basic datasets. Having worked through the datasets and the analysis, I see the following possible explanations for the variance the models leave unexplained:

Superstar Outliers: The vast majority of major league players do not make nearly as much as the top 25% of players in baseball. This is illustrated by the average salary of pitchers and batters over the years.

The spread between the salary of the average MLB batter and that of the top players in baseball is so large that the model struggles to make accurate predictions for these outliers. As a result, the models learn mostly from typical salaries and pull their predictions toward the average player’s salary. When assessing the residual plots, the models tend to under-estimate a player’s actual salary more than they over-estimate it.

Linear regression plot shown as visual example of predicted and actual salaries for pitchers. (Gradient boost was best model for pitcher salaries). Residual plot on right demonstrates greater positive residuals than negative residuals.
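A quick way to check that tendency numerically is to count the residuals by sign. The sketch below assumes the convention residual = actual - predicted, so a positive residual means the model under-estimated the salary. The numbers are made up.

```python
def residual_summary(actual, predicted):
    """Count under- and over-estimates using residual = actual - predicted."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return {
        "under_estimated": sum(1 for r in residuals if r > 0),
        "over_estimated": sum(1 for r in residuals if r < 0),
        "mean_residual": sum(residuals) / len(residuals),
    }

# Made-up salaries in $M purely for illustration.
actual = [0.7, 2.0, 5.0, 12.0, 30.0]
predicted = [0.9, 2.2, 4.5, 9.0, 18.0]
summary = residual_summary(actual, predicted)
print(summary)
```

In this toy data the misses on the highest salaries are both positive and large, which is the pattern the superstar outliers would produce.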

Age Heading Into Free Agency: Throughout the modeling, age was consistently a top contributing factor in determining player salaries. As anyone familiar with sports contracts knows, players typically make more money as they get older.

Once a player becomes a free agent (i.e., the player can sign with any team for any amount), many other factors that are not captured by baseball statistics influence his salary.

These factors could include the marketability of the player, his social media presence, and the general market demand for certain free agents each year.

Superstars like Aaron Judge are expected to command massive contracts entering free agency as there are very few players like him in baseball today.

Multicollinearity: As discussed earlier, many of the features in each dataset are heavily collinear with each other. In order to reduce multicollinearity and complexity of the model, the most important features were selectively chosen at the risk of reducing R2 for a simpler model.

Predicting Wins

Not surprisingly, a simple multiple linear regression model performed especially well when determining team wins, with a margin of error of 1 win. This was expected as there are strong correlational relationships between simple team statistics.

To win games, a team needs to score more runs than the other team, which also means giving up fewer runs to the opponent. This is evidenced in the model’s feature importance, where ER_p and H/9_p represent team pitching statistics related to a team's ability to limit opposing team runs.

Pitching is a dominant key feature to predicting team wins in a season.

It is also clear that timely pitching and hitting wins games as evidenced by the WPA advanced statistic. Win Probability Added (WPA) captures the change in Win Expectancy from one plate appearance to the next and credits or debits the player based on how much their action increased their team’s odds of winning.
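The arithmetic behind WPA is a running sum of win expectancy changes. The sketch below assumes each play is recorded as a pair of win expectancies (before and after) from the player's team's point of view. The values are made up and do not come from a real win expectancy table.

```python
def win_probability_added(plays):
    """Sum the change in win expectancy across a player's plate appearances.

    Each play is a (win_expectancy_before, win_expectancy_after) pair from the
    perspective of the player's team, expressed as probabilities in [0, 1].
    """
    return sum(after - before for before, after in plays)

# Made-up win expectancies purely for illustration.
plays = [
    (0.50, 0.62),  # a go-ahead hit raises the team's chances
    (0.62, 0.58),  # a strikeout lowers them slightly
    (0.45, 0.80),  # a late home run swings the game
]
wpa = win_probability_added(plays)
print(round(wpa, 2))
```

The same event is worth more in a close, late-game situation than in a blowout, which is why WPA captures timely hitting and pitching rather than raw production.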

It should also be noted that while the model does exceptionally well at predicting team wins based on previous team data, predictions for future years may not be as accurate. This is due to a multitude of factors such as variability in a player’s production each year.

There is no guarantee that the New York Yankees will collectively hit 222 home runs again in 2022 after hitting that many in 2021. Players will get hurt and miss time, or other factors may influence a roster’s production.

The expectation is that, if all other variables are held constant, the same inputs will lead to the expected team results and wins for a season.

Takeaways & Conclusions

“Every strike brings me closer to a home run” — Babe Ruth

Having performed analysis on advanced and basic data for batters, pitchers, and teams, there is no significant difference between using basic and advanced data to predict player salaries.

However, there is a stronger relationship between advanced team data and wins suggesting that building a team around players that excel in advanced metrics can be beneficial.

The batter and pitcher prediction models perform better on players who have not yet reached free agency, the point at which a player would likely garner a massive contract. The models struggle with predicting salaries of superstar outliers and older players who have already signed a large contract. The models are most effective in predicting salaries for players who are likely to go through their first years of arbitration or are playing early in their careers.

When it comes to winning games, the saying goes that you can never have enough pitching. So it is no surprise that for our advanced model, simple pitching statistics such as earned runs allowed (ER) and opposing team hits per 9 innings (H/9) are among the strongest features. Correspondingly, the wFB (fastball) feature stands out as the pitch-type statistic that outperforms all others.

Fastballs are the most common pitch thrown in baseball, and we are already seeing more pitchers throwing harder, which provides a competitive edge against opponents. Faster pitches tend to be more difficult to hit (i.e., the batter has less time to react). The takeaway is to target pitchers whose fastballs are better than the average pitcher’s.

Timely hitting and pitching are other features that have a strong relationship with wins. For example, Shohei Ohtani and Corbin Burnes had the highest WPA (Win Probability Added) amongst all batters and pitchers in 2021.

Building a team around players who excel at advanced metrics such as WPA, along with pitchers who have above-average fastballs, is likely to produce a winning product.

Next Steps

To explain the massive salary outliers, there are many other features to account for, as discussed above:

A player’s agent can have a big impact on what type of salary he will command on the market.
Marketability of the player is another factor that would need to be accounted for. While a baseball player can prove his worth on the field, there are other immeasurable factors such as leadership and fan favorability that play a role as well.
Simple market demand for a player can greatly influence a player’s contract due to supply and demand.

Explaining these outlier contracts can be further explored by gathering additional data for players having already reached free agency and analyzing the top 25% of player contracts.

For more information and the details on this project, feel free to check out my Github! I can also be reached with any questions at eric8395@gmail.com.