Soccer Analytics: Prediction of salary and market value using machine learning (3/3).

Jorge Montaño Casillas
Published in Analytics Vidhya · Mar 2, 2020
Photo by ALEXANDER IGREVSKY from Pexels

PART III

In this final part, I will explain why I used the XGBoost (Extreme Gradient Boosting) model to predict the salary and market value of professional female players. The article is organized as follows: the first section explains what XGBoost is; the second describes how the model was set up; finally, the results are presented.

XGBoost Model

Boosting is a Machine Learning approach that combines many simple rules which, on their own, have little predictive power, but which together form a much stronger predictor.

Imagine the following scenario: you live in a very hot city. You’ve heard that planting trees reduces the temperature of the environment, so you decide to plant a tree on your street. Do you notice any change? Probably not.

What would happen if you could fill the street with more trees? You would surely begin to notice more animals and, of course, shade. This greater coverage of shade provided by all the trees will make you feel less heat.

This analogy summarizes what XGBoost does when it combines weak rules (individual trees) into a robust rule (an ensemble of trees).

Source: Journey to Data Science

In practical terms, Gradient boosting consists of 3 elements:
• A loss function to optimize.
• A weak learning algorithm to make predictions.
• An additive model that adds weak learners so as to minimize the loss function.

The loss function depends on the type of problem we face, but its main requirement is that it be differentiable. In a regression problem we can use the squared error, while in classification problems we can use the logarithmic loss (cross-entropy).
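To make the “gradient” in Gradient boosting concrete, consider the squared-error case: the negative gradient of the loss with respect to the current prediction is exactly the residual, which is why each new tree is trained on the residuals of the ensemble built so far.

```latex
L(y, \hat{y}) = \tfrac{1}{2}\,(y - \hat{y})^2,
\qquad
-\frac{\partial L}{\partial \hat{y}} = y - \hat{y}
```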

The weak learner used in Gradient boosting is the decision tree. Decision trees produce real values at their leaves, so the output of each tree can be added to that of subsequent models, with every new tree correcting the errors of the predictions made so far.

It is common to restrict the decision trees in terms of the maximum number of layers, nodes, splits, or leaves, specifically to ensure that each learner remains weak, as seen in the following image.

Source: Lucky’s Notes

Decision trees are added one at a time, and the existing trees in the model are not changed. The parameters of each new tree are determined by gradient descent, so that adding the tree minimizes the loss function.

In this way, trees with different parameters are added so that their combination minimizes the loss of the model and improves the prediction.
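As an illustration of this additive scheme, here is a minimal sketch of gradient boosting for squared error, using scikit-learn’s DecisionTreeRegressor as the weak learner. This is a conceptual sketch, not XGBoost’s actual implementation; all names and values are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_trees=100, eta=0.1, max_depth=3):
    """Minimal gradient boosting for squared error: each tree fits residuals."""
    prediction = np.full(len(y), y.mean())  # start from the mean prediction
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction                         # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=max_depth)  # a restricted, weak learner
        tree.fit(X, residuals)
        prediction += eta * tree.predict(X)                # shrunken additive update
        trees.append(tree)
    return trees
```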

Gradient-boosted decision trees are very powerful models for supervised learning problems, but one of their main disadvantages is that they require careful parameter tuning and a lot of training time.

XGBoost (Extreme Gradient Boosting) is the algorithm that has recently been dominating Machine Learning competitions with structured data on Kaggle (https://www.kaggle.com/).

XGBoost is an implementation of Gradient boosting with decision trees designed to minimize execution time and maximize performance. Working only with numerical data is part of what makes this library so efficient.

Since our final dataset consists of numerical data on all the players present in the FIFA 19 database, XGBoost was a natural choice for predicting the salary and market value of female players.

Source: EA Games

Setting up the model

As mentioned previously, Gradient boosting involves creating decision trees and adding them to an ensemble sequentially. New trees are created to correct the residual errors in the predictions of the existing ensemble.

Combining several models into one large, complicated model carries the risk of overfitting.

Source: GeeksforGeeks

XGBoost requires that the data be presented in a particular format called DMatrix, an internal data structure optimized for memory efficiency and training speed.

The parameters it uses include max_depth (the maximum depth of the decision trees being trained), objective (the loss function to use) and, in classification tasks, num_class (the number of classes in the data set). The eta parameter deserves special mention because it helps us avoid overfitting: it can be thought of as a learning rate.

Instead of simply adding the predictions of new trees to the ensemble at full weight, eta scales each new tree’s contribution to the residuals down before it is added. This effectively limits the complexity of the overall model.

It is common to use small values in the range of 0.1 to 0.3. This lower weighting of each tree’s contribution still lets us train a powerful model, but it keeps the model from drifting into the kind of complexity where overfitting is more likely to occur.
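Putting the parameters above together, a configuration might look like the following. The exact values used in the project are not shown in the article, so these are illustrative; num_class is omitted because salary and market value are regression targets.

```python
# Illustrative parameter values; the article does not list the exact ones used.
params = {
    "max_depth": 5,                   # limit tree depth so each learner stays weak
    "objective": "reg:squarederror",  # squared-error loss for a regression target
    "eta": 0.1,                       # shrink each new tree's contribution
}
```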

We will start by creating a train-test split where training takes 80% of the data and testing the remaining 20%, for both the market value model and the salary model.
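A sketch of that split and the DMatrix conversion might look as follows, assuming X holds the numerical features and y one of the two targets (the names and the random_state are illustrative):

```python
import xgboost as xgb
from sklearn.model_selection import train_test_split

# 80/20 split, as described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Convert to XGBoost's optimized internal format
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
```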

Once the parameters have been defined, it is necessary to train the model:
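With the native API, training could look like this (the number of rounds and the early-stopping window are illustrative choices, not taken from the article):

```python
# Train while monitoring the held-out set, stopping once it stops improving
model = xgb.train(
    params,
    dtrain,
    num_boost_round=500,
    evals=[(dtest, "test")],
    early_stopping_rounds=20,
)
preds = model.predict(dtest)
```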

Once we get the results of the model, we then search for the best parameters with which to retrain the model and obtain a better prediction.
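The article does not say which search procedure was used; one common option is a cross-validated grid search over xgboost’s scikit-learn wrapper, sketched below with an illustrative grid:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Small illustrative grid; the real search space is not given in the article
search = GridSearchCV(
    XGBRegressor(objective="reg:squarederror"),
    param_grid={
        "max_depth": [3, 5, 7],
        "learning_rate": [0.1, 0.2, 0.3],  # the scikit-learn alias for eta
        "n_estimators": [100, 300, 500],
    },
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)
best_params = search.best_params_
```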

With the XGBoost model retrained on the best parameters, it is now possible to predict the market value of any player.

Because XGBoost works only with numerical data, the players’ names are not visible inside the model. However, the model lets us identify the 10% of our dataset whose market values are most overvalued or undervalued.
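One way to surface those players is to rank the test set by the gap between actual and predicted market value; the 5% thresholds below are an illustrative way to get 10% of the base in total, and mapping indices back to names happens outside the model:

```python
import pandas as pd

# Positive residual: the market values the player above the model's estimate
results = pd.DataFrame(
    {"actual": y_test, "predicted": preds}, index=X_test.index
)
results["residual"] = results["actual"] - results["predicted"]

n = int(len(results) * 0.05)  # 5% on each side, 10% of the base in total
overvalued = results.nlargest(n, "residual")
undervalued = results.nsmallest(n, "residual")
```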

This type of information is relevant for trying to find out which factors the market takes into account when paying for a player. Similarly, the model for predicting salary is presented below.

Again, we will use a model trained with the best parameters found:
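A sketch of that retraining for the salary target, reusing the best parameters found above (the names with the _s suffix, for the salary split, are assumptions, not from the article):

```python
from xgboost import XGBRegressor

# Same pipeline, now with the salary target
salary_model = XGBRegressor(objective="reg:squarederror", **best_params)
salary_model.fit(X_train_s, y_train_s)
salary_preds = salary_model.predict(X_test_s)
```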

We can also analyze those players with extremely high salaries and those who earn well below the level they should.

Results

The model performed 32 iterations before finding the best possible values. Subsequently, the model was retrained with these parameters to obtain the best possible prediction.

The power of XGBoost showed in how it strengthened each of the weak learners, giving me the best possible forecast of the target variables.

One of the advantages of working with XGBoost is that it allows you to work with a large number of variables, which will be of great help when, in a second stage, this project has a larger and richer DataFrame.

Once I had the trained model and the final forecaster, I ran the 15 players through it to obtain their theoretical salary values.

Before describing the results, it is important to mention that in Mexico, a professional female player receives, on average, 4,200 pesos per month, while a male player earns 635,000 pesos per month.

The results for the 15 players regarding their salary are:

On the other hand, the market value of these players should be around 377,810,915 pesos (€17 million), 463,100,000 pesos (€22 million), and 189,450,000 pesos (€9 million).

Unfortunately, there is no market value information with which to know how close to or far from reality this exercise is for female players in Mexico. If this exercise had been carried out for men, the immediate comparison would be to look up salaries and market values on Transfermarkt.

The next step for this project is to scrape the Liga MX Femenil web page to gather as much information as possible and then predict the physical characteristics of each of the female players.

Thus, matching these values with the salaries and market values of each player could serve as a guide for negotiating better working conditions.

Source: AS Mexico

Finally, I would like to thank my teachers at Ironhack: Yonatan, Oscar, and Vania, who always pushed us to the limit. I also cannot fail to mention and thank the friends I made during the course: Reynaldo, Ernesto, and Roger. Thank you for everything, brothers. To the rest of my classmates, I extend my recognition and appreciation for so many good times.

To all my family, Pita, Henry, Grace, and Colmillo, and to my friends Carlos, Luisito, and Pedro, for their support and motivation. To Pamela, for being an inexhaustible source of inspiration.

If you missed the first part, you can read it here: http://bit.ly/31Ft45d, the second part here: http://bit.ly/2Tb1PwR, and the final project here: https://jmcass.github.io/SportsAnalytics/index.html

Thanks for reading and sharing!


Jorge Montaño Casillas

Economist passionate about sports, numbers, finance, music and Python