Using Machine Learning to Predict MLS Salaries

My journey through data science

Juan Ramirez
Technology Hits
6 min read · Apr 23, 2021


Photo by Mitch Rosen on Unsplash

As I continue my journey through sports analytics, I decided to put some of the skills I’ve learned in the past year to the test. Using the MLS data available at American Soccer Analysis, I tried to use machine learning to predict the optimal salary for an MLS player based on his Goals Added (g+) actions for the season. The model aims to evaluate players solely on their actions on the field and to help teams and players make data-driven decisions when evaluating players in the MLS. Players could also use it during contract negotiations to quantify their contributions on the pitch.

Data Collection

The data was extracted from American Soccer Analysis’ web application, which contains Goals Added data for MLS players from 2017 to 2021. To understand where the Goals Added metric comes from, click here for a detailed explanation; in essence, it captures a player’s total on-ball contribution across all of their attacking and defending actions per game. These actions are classified as Dribbling, Fouling, Interrupting, Passing, Receiving, and Shooting, and Goals Added is the sum of all of them. From the same web application, the salaries (both base and guaranteed) for MLS players from 2013 to 2019 were extracted and loaded into Python.
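The extraction step itself was shown as screenshots in the original post. As a minimal sketch, assuming the two tables were exported from the ASA site as CSV files (the file names here are hypothetical):

import pandas as pd

# Hypothetical exports from the American Soccer Analysis web app
goals_added = pd.read_csv("asa_goals_added_2017_2021.csv")  # g+ actions per player-season
salaries = pd.read_csv("asa_salaries_2013_2019.csv")        # base and guaranteed salaries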

Cleaning and Exploring the Data

The first obvious hurdle was that the two datasets had different ranges. My Goals Added data frame consisted of 4498 rows × 12 columns, while my salaries data frame consisted of only 1447 rows × 7 columns. The discrepancy came not only from the different date ranges but also from the fact that goalkeepers were not part of the g+ data. To fix this, I joined both data frames on ‘player name’ and ‘season’ and dropped any NaN rows from the pandas data frame, resulting in a 725 rows × 14 columns data frame. I was also debating whether to use a player’s position as a feature in the model, so I decided to build two different models and compare their MAE and RMSE. To do this, positions were converted into dummy variables and added to the main data frame.
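In pandas terms, that step might look roughly like this (the column names player_name, season, and position are my assumptions; the real export may label them differently):

# Inner join keeps only player-seasons present in both tables
df = goals_added.merge(salaries, on=["player_name", "season"], how="inner")
df = df.dropna()  # resulting shape: 725 rows × 14 columns

# For the second model: one-hot encode position as dummy variables
df = pd.concat([df, pd.get_dummies(df["position"])], axis=1)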

I then wanted to understand the relationship between my target variable (Base Salary in millions) and the possible features that would go into the model. To do so, I created scatter plots for the features I wanted to use, as well as a heat map of each feature’s correlation with Base Salary (inspired by this great article!).
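A correlation heat map like that can be built in a few lines with seaborn; a sketch, assuming the target column is named base_salary:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation of every numeric column with the target, sorted and annotated
corr = df.corr(numeric_only=True)[["base_salary"]].sort_values("base_salary", ascending=False)
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()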

Intuitively enough, Guaranteed Compensation was almost perfectly correlated with Base Salary, but it was not included in the model since it essentially contains the target itself. None of the other features showed a significant correlation.

While the lack of significant correlation meant the model might not be very accurate, I decided to keep working on it and maybe tune my features in the future (when I pick up those skills!).

Data Modeling

I decided to start with two different models: a Decision Tree and a Random Forest. The reasoning behind this was… those are the only modelling techniques I’ve learned (so far!). The first model would exclude position as a feature, to test whether a player’s position made a significant difference to their valuation even when their goal-creating actions in a season were the same. To find the optimal number of leaf nodes my tree should grow to, I used the following function, which compared the Mean Absolute Error (MAE) of each tree:
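The function appeared as an image in the original post; it follows the familiar scikit-learn pattern, so a reconstruction might look like this (feature and variable names are my own assumptions):

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

feature_columns = ["dribbling", "fouling", "interrupting", "passing", "receiving", "shooting"]  # hypothetical names
X = df[feature_columns]
y = df["base_salary"]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Fit a tree capped at max_leaf_nodes and score it on held-out data
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))

# Compare candidate tree sizes and keep the one with the lowest MAE
scores = {n: get_mae(n, train_X, val_X, train_y, val_y) for n in [5, 25, 50, 100, 250, 500]}
best_size = min(scores, key=scores.get)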

This function allowed me to find the number of leaf nodes that minimized the MAE for both this model and my Random Forest Regressor. Here is a snapshot of how I kept improving my first model’s MAE, along with its evaluation:

MAE when only using sample data: 0.0
MAE when using validation data with no max leaf nodes: 971,768.42
MAE when using validation data with optimal max leaf nodes: 730,165.95
MAE Random Forest with optimal leaf nodes on training data: 721,551.35
Final Model MAE: 644,774.47
R-Squared: 0.2557
RMSE: 802.9785
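For reference, all three of those metrics come straight out of sklearn.metrics; a sketch of the evaluation step, reusing the pieces above:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Random forest capped at the leaf-node count found earlier
rf = RandomForestRegressor(max_leaf_nodes=best_size, random_state=1)
rf.fit(train_X, train_y)
preds = rf.predict(val_X)

print("MAE:", mean_absolute_error(val_y, preds))
print("R-Squared:", r2_score(val_y, preds))
print("RMSE:", np.sqrt(mean_squared_error(val_y, preds)))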

Evidently, this is not the best model. While the predicted and actual salaries share a similarly skewed distribution and behaviour, my initial predictions were too high compared to the actual MLS salaries. While I was discouraged by this, I tried to see how the model changed when I included a player’s position as a feature. Here is how those initial correlations looked:

Again, there was no significant change in the correlations with a player’s base salary. However, there were some improvements in the accuracy of this model!

MAE when using validation data with no max leaf nodes: 869,241.23 (10.5% improvement)
MAE when using validation data with optimal max leaf nodes: 713,996.39 (2.21% improvement)
MAE Random Forest with optimal leaf nodes on training data: 701,343.45 (2.8% improvement)
Model MAE: 574,693.52 (10.86% improvement)
R-Squared: 0.3986
RMSE: 758.0854

Again, not the best model, but an improvement is an improvement! One thing to consider for future models is taking the log of the target in order to get a more normally distributed dataset.
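As a rough illustration of that idea (not something this version of the model did):

import numpy as np

# log1p compresses the long right tail of the salary distribution;
# expm1 maps predictions back to dollars
log_y = np.log1p(y)
# ...train on log_y instead of y, then:
# preds_dollars = np.expm1(model.predict(val_X))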

Understanding that I still have miles to go before this model is actually usable, I wanted to test how it behaved when new data was input. Here is the function I created and intend to use as my models continue to evolve:
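Like the earlier snippets, the original appeared as an image; a hypothetical reconstruction, assuming the model expects the same g+ feature columns used in training:

def predict_salary(model, player_name, season, gplus_values):
    # gplus_values must be ordered like the training columns, e.g.
    # [dribbling, fouling, interrupting, passing, receiving, shooting]
    predicted = model.predict([gplus_values])[0]
    print(f"Predicted Salary for {player_name} in {season}: ${predicted:,.2f}")
    return predicted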

When inputting the data from American Soccer Analysis’ website, I found the following results for Carlos Vela and Darwin Quintero:

Predicted Salary for Carlos Vela in 2019: $3,546,980.54
Carlos Vela’s Actual Salary in 2019: $4,500,000

Predicted Salary for Darwin Quintero in 2018: $1,639,027.58
Darwin Quintero’s Actual Salary in 2018: $1,650,000

Differences between actual and predicted salaries for other players

Conclusions & Next Steps

With all the limitations and skepticism I explained before, the model seems to kinda work… sometimes. When predicting salaries that are around the league’s average, the model does an acceptable job (e.g. Darwin Quintero and Sebastian Blanco), but it struggles with outliers like Carlos Vela and Zlatan. To better understand why this happens, I believe I need to continue to test and learn to improve the model by:

-Learning about more effective feature engineering techniques

-Dealing with outliers, either by removing them or through other methods like OCC or taking the log

-Keeping track of which features improve the model and which ones hinder it

-Learning new modelling techniques and relevant algorithms like LassoCV and RidgeCV (see the sketch after this list)

-Continuing to learn!
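For the LassoCV/RidgeCV item, a minimal sketch on the same train/validation split (purely illustrative, not something this post tested):

from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.metrics import mean_absolute_error

# Cross-validated regularized linear models; regularization strength is chosen automatically
lasso = LassoCV(cv=5, random_state=1).fit(train_X, train_y)
ridge = RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(train_X, train_y)
print("Lasso MAE:", mean_absolute_error(val_y, lasso.predict(val_X)))
print("Ridge MAE:", mean_absolute_error(val_y, ridge.predict(val_X)))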

While this is far from perfect, it is still my first ever machine learning model. It is a great milestone in my sports analytics journey, one I will continue to improve and look back on fondly. Thank you for reading, and if you have any advice or comments, please share them!

You can always reach me on LinkedIn too!
Cheers,
Juan
