Predicting MLB Arbitration with TensorFlow

Josh Salce
7 min read · Aug 10, 2023


Introduction

One of the stages of an MLB player’s career is salary arbitration. Since the 1970s, players with between three and six years of service time have been able to attend arbitration hearings against their team should both parties fail to agree on a new contract. Though most cases settle beforehand, each year a small but non-trivial number end at a hearing. There, the player and the team each argue for their desired salary, and an appointed panel makes the final decision.

In recent years, players have been able to receive an extra year of arbitration through “Super Two” status, granted to players who fall above a set service-time threshold but remain under three years. On the other hand, teams can control whom they negotiate with through the ability to “non-tender” arbitration-eligible players. A non-tendered player becomes a free agent before completing their six years of team control and can sign a contract to play in the majors or the minor leagues. A major-league contract guarantees a salary, while a minor-league contract pays the minimum salary for time spent in the majors.

Data Gathering, Cleaning, Preprocessing

To predict arbitration salaries, I built a neural network with the Keras API that returns an annual salary based on in-game and career-related factors. The training and test data include arbitration, gameplay, and career-related statistics for all arbitration-eligible players from 2011–2023. Descriptions of all variables used in both models can be found here.

The main steps of the data cleaning process were as follows:

Cleaning

  • Changed position of multi-position players to primary position
  • Standardized player names to match Fangraphs name formatting for merging
  • Reduced service time from “years.days” format to years only
  • Converted innings pitched (IP) decimals from .1 to ⅓ and .2 to ⅔
  • Converted strings with percent signs to floats (a sketch of these conversions follows this list)
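
As an illustration, here is a minimal pandas sketch of the numeric conversions, assuming hypothetical column names such as `Service` and `IP` in place of the actual export headers:

```python
import pandas as pd

def clean_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the service-time, IP, and percent-string conversions described above."""
    df = df.copy()

    # Service time: keep only the years portion of the "years.days" format
    df["Service"] = df["Service"].astype(str).str.split(".").str[0].astype(int)

    # Innings pitched: a trailing .1 means 1/3 of an inning, .2 means 2/3
    ip_parts = df["IP"].astype(str).str.split(".", expand=True)
    outs = ip_parts[1].fillna("0").map({"1": 1 / 3, "2": 2 / 3}).fillna(0.0)
    df["IP"] = ip_parts[0].astype(float) + outs

    # Percent-sign strings (e.g. "23.4%") to floats
    pct_cols = [c for c in df.columns if df[c].astype(str).str.endswith("%").all()]
    for col in pct_cols:
        df[col] = df[col].str.rstrip("%").astype(float) / 100.0

    return df
```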

Merging

  • Merged arbitration salaries and Fangraphs statistics by name and previous season (a pandas sketch follows this list)
  • Added the previous year’s salary as a feature
  • Handled missing salary values by looking them up; remaining missing values were imputed with the league minimum salary for that season
  • Created the training and test sets, split into position players and pitchers
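
A minimal pandas sketch of the merge and imputation steps, with hypothetical file names, column names, and league-minimum values standing in for the actual sources:

```python
import pandas as pd

# Hypothetical file and column names; the real inputs are the arbitration and
# Fangraphs exports described above.
arb = pd.read_csv("arbitration_salaries.csv")   # Name, Season, Salary
stats = pd.read_csv("fangraphs_stats.csv")      # Name, Season, performance stats

# Join each arbitration salary to the player's previous-season statistics
arb["StatSeason"] = arb["Season"] - 1
merged = arb.merge(
    stats,
    left_on=["Name", "StatSeason"],
    right_on=["Name", "Season"],
    how="left",
    suffixes=("", "_stats"),
)

# Previous year's salary as a feature; impute what is still missing with the
# league minimum for that season (values here are illustrative)
merged = merged.sort_values(["Name", "Season"])
merged["PrevSalary"] = merged.groupby("Name")["Salary"].shift(1)
league_min = {2022: 700_000, 2023: 720_000}
merged["PrevSalary"] = merged["PrevSalary"].fillna(merged["Season"].map(league_min))
```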

Training: 2011–2022 arbitration and performance data

  • Dimensions, Position Players: 875x79
  • Dimensions, Pitchers: 1120x83

Test: 2023 arbitration and performance data

  • Dimensions, Position Players: 89x79
  • Dimensions, Pitchers: 140x83

Preprocessing was guided by visualizing the distribution of every continuous variable. I created histograms for each variable to determine how it should be scaled. In short, service time and position were one-hot encoded, Pareto-like and otherwise non-normally distributed variables were MinMax scaled, and the rest were standardized with z-scores. A sketch of this pipeline follows the histograms below.

Histograms for Continuous Features, Position Players Training Data
Histograms for Continuous Features, Pitcher Training Data
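
A minimal scikit-learn sketch of this preprocessing; the column groupings shown are hypothetical stand-ins for the actual split derived from the histograms:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Hypothetical column groupings; the real split came from inspecting the histograms above
onehot_cols = ["Position", "ServiceYears"]
minmax_cols = ["HR", "SB", "PrevSalary"]        # skewed / Pareto-like distributions
zscore_cols = ["Age", "AVG", "OBP", "SLG"]      # roughly normal distributions

preprocessor = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False), onehot_cols),
    ("minmax", MinMaxScaler(), minmax_cols),
    ("zscore", StandardScaler(), zscore_cols),
])

# train_df / test_df are the position-player or pitcher splits from the merge above.
# Fit on training data only, then apply the same transform to the 2023 test set.
feature_cols = onehot_cols + minmax_cols + zscore_cols
X_train_scaled = preprocessor.fit_transform(train_df[feature_cols])
X_test_scaled = preprocessor.transform(test_df[feature_cols])
```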

Building The Model

The network I chose to build was simple in its design: a pyramid structure with an input layer, three hidden layers of 100, 200, and 100 neurons respectively, and a single-neuron output. Each hidden layer used the ReLU activation function and was regularized with 20% dropout. Given the small number of training samples, pruning or fine-tuning was unnecessary.
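
A sketch of this architecture using the Keras Sequential API, with the input dimension left as a parameter:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_features: int) -> keras.Model:
    """Pyramid MLP: 100-200-100 ReLU hidden layers with 20% dropout, one linear output."""
    return keras.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(100, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(200, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(100, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(1),  # predicted annual salary
    ])
```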

Each model was trained at a learning rate of 0.0005 for 300 epochs and evaluated with Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE). While Mean Squared Error (MSE) or Root Mean Squared Error (RMSE) is standard for regression problems, I preferred MAE and MAPE. Because only a few players receive salaries far larger than the rest, MAE and MAPE are less distorted by those outliers than squared-error metrics. In addition, MAE is easier to interpret as the average deviation of predicted salary from true salary.
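
A sketch of the training setup; the optimizer (Adam), the choice of MAE as the training loss, and the validation split are assumptions not specified above:

```python
# y_train: arbitration salaries (in dollars) for the training rows
model = build_model(n_features=X_train_scaled.shape[1])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.0005),  # optimizer is an assumption
    loss="mae",                                             # training loss is an assumption
    metrics=["mean_absolute_percentage_error"],
)
history = model.fit(
    X_train_scaled, y_train,
    validation_split=0.2,   # hold-out fraction is an assumption
    epochs=300,
    verbose=0,
)
```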

Model Results

The combination of learning rate, epochs, and hidden layers seemed to be a good fit. Both networks’ learning curves show loss beginning to plateau without quite overfitting: validation loss continued to decrease alongside training loss, and the gap between the two remained small throughout training.

Model Learning Curves, Position Players (Left) and Pitchers (Right)
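
For reference, a minimal matplotlib sketch that reproduces curves like these from the `history` object returned by `model.fit`:

```python
import matplotlib.pyplot as plt

# Training vs. validation loss across epochs
plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("Epoch")
plt.ylabel("MAE loss")
plt.legend()
plt.show()
```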

To understand feature importance, I computed Shapley values for each feature, which measure its relative contribution to the model’s final predictions. For both position players and pitchers, the two most positively important features were previous salary and whether a player had four or five years of service time. Other positive features included age, measures of playing time, and specific statistics that measure skill. For position players, plate-discipline stats such as Swing%, Contact%, and BB% returned higher values, while significant features for pitchers included GB%, SwStr%, and Hard%.

Model Feature Shapley Values, Position Players (Left) and Pitchers (Right)
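
A sketch of how such values can be computed with the `shap` package; the use of `KernelExplainer`, the background sample, and the subset sizes are assumptions made for tractability rather than the exact setup used here:

```python
import shap

# Model-agnostic Shapley estimates; a small background sample keeps KernelExplainer tractable
background = shap.sample(X_train_scaled, 100)
explainer = shap.KernelExplainer(lambda x: model.predict(x, verbose=0).flatten(), background)
shap_values = explainer.shap_values(X_test_scaled[:50])  # explain a subset of the test set

# Summary plot of per-feature contributions, labeled with the preprocessor's feature names
feature_names = preprocessor.get_feature_names_out()
shap.summary_plot(shap_values, X_test_scaled[:50], feature_names=feature_names)
```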

Both models were tested by predicting salaries for the 2023 arbitration class, with each prediction rounded to the nearest multiple of $10,000 for clarity. I also created scatterplots of true versus predicted salaries, grouping each data point by position or service time. Most data points clustered in the $1 million to $6 million range for both true and predicted salary, while the few players outside the cluster carried four or five years of service time. By position, those same players were starting pitchers, first basemen, or outfielders. Predictions for relievers, second basemen, and catchers, on the other hand, fell within the general cluster with few exceptions.

True vs. Predicted Salary Scatterplots, Position Players, Grouped by Position (Left), Service Time (Right)
True vs. Predicted Salary Scatterplots, Pitchers, Grouped by Position (Left), Service Time (Right)
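
A sketch of the prediction, rounding, and plotting steps; the test-set column names (`Salary`, `ServiceYears`) are assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

# Predict 2023 salaries and round to the nearest $10,000
preds = model.predict(X_test_scaled, verbose=0).flatten()
preds_rounded = np.round(preds / 10_000) * 10_000

# True vs. predicted salary, colored by service time
plt.scatter(test_df["Salary"], preds_rounded, c=test_df["ServiceYears"], cmap="viridis")
plt.colorbar(label="Years of service time")
plt.plot([0, test_df["Salary"].max()], [0, test_df["Salary"].max()],
         linestyle="--", color="gray")  # reference line: perfect predictions
plt.xlabel("True 2023 salary ($)")
plt.ylabel("Predicted salary ($)")
plt.show()
```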

Inspecting both models’ predictions shows that a handful of players account for the largest differences between true and predicted salary. These include older players and journeymen such as reliever Jorge Alcala (28, 3 years), Jose Castillo (27, 3 years), and backup catcher Tom Murphy (32, 5 years), all of whom were predicted higher salaries than they actually received. Conversely, players in their first or second arbitration year who have performed well in their short careers were predicted lower salaries, including catcher Will Smith (28, 3 years) and first baseman Vladimir Guerrero Jr. (24, 3 years). In these cases, age inflates the predictions for the older players, while fewer years of service time leads to under-predictions for players like Smith and Guerrero Jr.

Finally, both models treat Shohei Ohtani as an outlier, projecting him north of $10 million despite his $30 million salary. This makes sense: Ohtani is a two-way player, so each model sees only half of his input data. However, even if the individual predictions were summed, the difference between true and predicted salary would still be over $6 million.

Excel Tables, Highest Absolute Differences of True and Predicted Salary, Position Players (Left), Pitchers (Right)

Limitations, Conclusion

In sum, predicting MLB arbitration salaries is achievable for many players. However, the limitations become clear when predicting specific players from only one year of performance. From this exercise, it could be concluded that arbitration is better suited to time-series methods, where each player in each arbitration year would have all previous salaries and performance statistics as input data. Currently, each data point includes salary and statistics from only the previous season, so neither model has knowledge of the career trajectory that leads to a target salary. Remedying this may correct some of the prediction flaws outlined above, but it requires revisions to both the input data and the model selection.

In addition, more and better-quality data are necessary to understand arbitration as a process. There is currently no single source for arbitration data from 2010 and earlier. More work is needed to aggregate past arbitration data, with the ideal product being data for every arbitration-eligible player in every year since arbitration’s inception.

The full code, data, and visualizations can be found in my GitHub repo.
