Soccer Analytics: Prediction of salary and market value using machine learning (1/3).

Jorge Montaño Casillas
Published in Analytics Vidhya
10 min read · Feb 3, 2020

PART I

In mid-December 2019, the women’s soccer league final between Rayadas and Tigres took place. For both teams it was a rematch for the title. The first match ended 1–1, so the champion would be decided in the last 90 minutes.

Rayadas won with a lone goal and were, finally, champions of the league. That same week, it was reported that the players would be rewarded with an iPad for winning the title. Two weeks later, the men’s team won the men’s league title after beating America, although the prize they would receive was a check for 2 million dollars.

Source: Liga MX Women

This project arises from the need to understand which factors professional clubs in Mexico consider when they pay their professional female players an average monthly salary of 4.2K pesos (223 dollars). It may sound trivial, but these are women winning a league championship that has television rights and sponsors committed to promoting a new category of the football market.

Source: @Hisweaty_

The money is on the table, so why not develop and promote the women’s league, given the success each tournament has had? This is an important question to analyze.

Beyond the difference in the prizes clubs award their players, Mexico also has a wage gap in sport. A men’s first division player receives an average monthly salary of 635K pesos (34K dollars), while a player in a lower category such as Sub 20 earns 30K pesos (1.6K dollars) and a Sub 17 player earns 3K pesos (160 dollars), both as monthly averages.

Source: Deporte Inc.

This highlights that a professional female player in Mexico is valued about the same as a semi-professional male player. The issue is not only the huge wage gap but also how a club makes the football show profitable for men and for women, from wages to income generation, to name a few examples.

During my research on the subject some questions arose:

  • How does a semi-professional player earn almost the same as a professional female player?
  • How is it that a development league that does not have television rights or ticket sales is a profitable business for the teams?
  • How does a development league have almost unlimited funding, and a professional women’s league does not, despite having sponsors and merchandise revenue?
  • How is it possible that the champions of the women’s league receive an iPad as a prize for their effort and the men’s league players 2 million dollars?

The idea of the project was simple: work with a database, train a model and predict the market value and salary of the players. If I can determine how much they should earn monthly and what their market valuation should be, then this exercise could be replicated for other sports to see how wide the wage gap is.

This is the first of three installments in which I will explain in detail the process I followed to forecast market values and salaries for 15 professional female players.

In particular, this article covers the database, the manipulation of the variables, the basic statistical analysis and how I geolocated 651 clubs using Folium.

Extract, Transform and Load (ETL) the FIFA 19 database

The first step of the entire exercise was to download the Kaggle FIFA 19 database (https://www.kaggle.com/karangadiya/fifa19).

Source: EA Sports

Once downloaded, I imported it into a Jupyter Notebook to start analyzing the type of data I would work with and how many missing values the database had.
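A minimal sketch of that first inspection, assuming pandas. The rows below are made up so the snippet runs on its own. In the notebook the table would come from the downloaded CSV instead, and the column names here are assumptions about the Kaggle file.

```python
import io
import pandas as pd

# Tiny made-up stand-in for the downloaded file; with the real data this
# would be a pd.read_csv(...) call on the Kaggle CSV.
sample_csv = io.StringIO(
    "Name,Age,Club,Position,Value,Wage\n"
    "Player A,24,Club X,ST,€10.5M,€50K\n"
    "Player B,31,,GK,€900K,€8K\n"
    "Player C,19,Club Y,,€0,€1K\n"
)
df = pd.read_csv(sample_csv)

# Data type of every column.
print(df.dtypes)

# Number of missing values per column, largest first.
missing = df.isna().sum().sort_values(ascending=False)
print(missing)
```

Sorting the missing-value counts puts the columns that need attention at the top.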

The next step was to clean up the column names and populate some empty fields in the database.
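This step could look roughly like the sketch below. The column names and the fill values are illustrative assumptions, not necessarily the ones used in the project.

```python
import pandas as pd

# Made-up rows with gaps, for illustration only.
df = pd.DataFrame({
    "Preferred Foot": ["Right", None, "Right", "Left"],
    "Club": ["Club X", None, "Club Y", "Club Z"],
    "Jersey Number": [9.0, None, 1.0, 4.0],
})

# Normalise column names: lower case, underscores instead of spaces.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Fill the gaps. These choices are assumptions made for the example.
df["club"] = df["club"].fillna("No Club")
df["preferred_foot"] = df["preferred_foot"].fillna(df["preferred_foot"].mode()[0])
df["jersey_number"] = df["jersey_number"].fillna(0).astype(int)

print(df)
```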

Then I performed some unit conversions and data cleanup.
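A sketch of what such conversions might look like. It assumes that money is stored as text such as "€110.5M" or "€565K", height as feet and inches ("5'7") and weight in pounds ("159lbs"). The helper names are hypothetical.

```python
import pandas as pd

def money_to_number(text: str) -> float:
    """Convert strings such as '€110.5M' or '€565K' to euros."""
    text = text.replace("€", "").strip()
    multipliers = {"M": 1_000_000, "K": 1_000}
    if text and text[-1] in multipliers:
        return float(text[:-1]) * multipliers[text[-1]]
    return float(text) if text else 0.0

def height_to_cm(text: str) -> float:
    """Convert feet'inches (for example 5'7) to centimetres."""
    feet, inches = text.split("'")
    return round((int(feet) * 12 + int(inches)) * 2.54, 1)

def weight_to_kg(text: str) -> float:
    """Convert pounds (for example '159lbs') to kilograms."""
    return round(float(text.replace("lbs", "")) * 0.45359237, 1)

# Made-up rows in the assumed raw format.
df = pd.DataFrame({
    "Value": ["€110.5M", "€565K", "€0"],
    "Height": ["5'7", "6'2", "5'9"],
    "Weight": ["159lbs", "183lbs", "154lbs"],
})
df["value_eur"] = df["Value"].map(money_to_number)
df["height_cm"] = df["Height"].map(height_to_cm)
df["weight_kg"] = df["Weight"].map(weight_to_kg)
print(df)
```

Keeping each conversion in its own small function makes it easy to check on a handful of values before applying it to the whole column.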

Finally, I created some new variables and established the final database for all the analyses we will carry out.
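The article does not list which variables were created. As a purely hypothetical illustration of derived columns, the sketch below adds a body mass index and a coarse age band to a cleaned table.

```python
import pandas as pd

# Made-up cleaned rows; the column names are assumptions.
df = pd.DataFrame({
    "age": [19, 27, 34],
    "height_cm": [170.2, 188.0, 175.3],
    "weight_kg": [72.1, 83.0, 69.9],
})

# Body mass index: weight in kg divided by height in metres squared.
df["bmi"] = (df["weight_kg"] / (df["height_cm"] / 100) ** 2).round(1)

# Coarse age bands; the cut points are arbitrary choices for the example.
df["age_group"] = pd.cut(
    df["age"], bins=[15, 20, 30, 45], labels=["<=20", "21-30", "31+"]
)
print(df)
```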

The database contains two columns of special interest, salary and market value, for 18,207 players. In addition, it includes information about the physical characteristics and abilities of these players, so performing an “extrapolation” to female players could be something “simple”. Once the database was cleaned, the information was used to generate charts that give a general overview.

Insights

The FIFA 19 game database contains information on 651 clubs and 18,207 professional players. It is important to mention that it does not cover every player worldwide, but it does let us study the physical and personal characteristics of a group of outstanding players.

The first chart shows the distribution of player positions in the game. The most frequent position is striker (ST), followed by goalkeeper (GK) and center back (CB).
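A frequency count like this one can be obtained with `value_counts`. The positions below are made up for the example, and the column name is an assumption.

```python
import pandas as pd

# Made-up sample of positions, for illustration only.
positions = pd.Series(
    ["ST", "GK", "CB", "ST", "CM", "ST", "GK", "CB", "GK", "ST"],
    name="position",
)

# Count how often each position appears, most frequent first.
counts = positions.value_counts()
print(counts)
```

In a notebook, `counts.plot(kind="bar")` would turn the same counts into a bar chart if matplotlib is installed.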

This graph shows us that the majority of players make, on average, the same effort attacking or defending. This makes sense because each player covers a particular position; only when the team has an injured or sent-off player does that effort increase to cover the gap. Otherwise, no player would run around the field trying to cover more than one position.

The diversity of countries present in the game coincides with what we would observe in reality. However, it is important to note that the game is biased because, for commercial reasons, the teams it includes are concentrated in England, Germany, Spain, Argentina, France, Brazil and Italy, to name a few.

The wage bill is concentrated in a few players, whom we will call “Super Stars”. Regardless of country, position or physical characteristics, players in this group have a salary almost 6 times higher than that of an average player.
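The article does not say how the “Super Stars” group is defined. As a sketch, assuming the group is everyone at or above the 90th percentile of wages, the comparison with the average player could be computed like this on made-up numbers:

```python
import pandas as pd

# Made-up monthly wages in euros, for illustration only.
wages = pd.Series(
    [1_000, 2_000, 2_000, 3_000, 3_000, 4_000, 5_000, 6_000, 10_000, 44_000]
)

# Assumption: a "Super Star" is anyone at or above the 90th percentile.
threshold = wages.quantile(0.90)
stars = wages[wages >= threshold]

# How the group compares with the average, and its share of all wages paid.
ratio_to_average = stars.mean() / wages.mean()
share_of_wage_bill = stars.sum() / wages.sum()
print(f"{len(stars)} star(s), {ratio_to_average:.1f}x the average wage, "
      f"{share_of_wage_bill:.0%} of the wage bill")
```

The percentile is the knob to turn here: a stricter cut-off yields a smaller group with a larger ratio.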

As for weight, we can see that on average the players are in a range of 70 to 80 kg. Relative to their height, being lean could be an advantage in this sport, since speed and the ability to dribble are key to taking a pass, anticipating a play or scoring a goal.

As for height, we can see that most of the players are less than 1.60 m tall. Few players are close to 2 meters tall, and those are generally defenders and goalkeepers.

The age of the players is mainly concentrated between 20 and 30 years, although the game database also includes some very young players (16 years) and some older ones (43 years) who are still active.

The following graph shows the correlation between the variables contained in the DataFrame. We can observe that the characteristics considered as skills have a high degree of correlation with one another.
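A correlation matrix of this kind can be computed directly in pandas. The sketch below uses made-up skill ratings, and the column names are assumptions.

```python
import pandas as pd

# Made-up skill ratings, for illustration only.
df = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C", "Player D"],
    "dribbling": [60, 70, 80, 90],
    "ball_control": [62, 71, 83, 88],
    "gk_diving": [15, 12, 10, 8],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = df.select_dtypes("number").corr()
print(corr.round(2))
```

If seaborn is available, `seaborn.heatmap(corr)` is a common way to draw the matrix as a heatmap.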

When we analyze the nationalities in the major leagues, we observe a large number of foreign players. This generates strong competition with local players and, in some cases, creates a dynamic where the competitive level is quite high and the leagues offer extremely interesting matches.

Due to this competition, most clubs analyze certain variables that indicate the potential performance a player can have on the field before deciding to bet on them.

FIFA 19 has a pre-programmed player potential, which determines how fast a player’s attributes should grow and when they should stop growing over their professional career. This is how we know which players have everything it takes to be the next Super Star.

Although it only serves as a guide, it can be an accurate one, but there is no guarantee that any given player will reach their full potential.

Lots of different variables in the game can stunt player growth, such as limited game time, little to no training, bad shape or continuous injuries. It can also be changed when a player transfers to a new club, and can easily go down because of a lack of rhythm, personal circumstances or the level of competition of the new league.

On the other hand, we can observe that players with a high overall rating command a high market value, which peaks between 26–27 years old and then begins to fall.

This is a reality: clubs hope to take advantage of the best years of their players and then renegotiate a lower salary or even a sale. These circumstances force many players to decide between getting minutes of play with a lower salary or thinking about their retirement.
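One way to look for the age at which market value peaks is to group players by age and compare the typical value in each group. The sketch below does this on made-up numbers, and the column names are assumptions.

```python
import pandas as pd

# Made-up ages and market values in euros, for illustration only.
players = pd.DataFrame({
    "age": [22, 22, 26, 26, 27, 31, 31],
    "value_eur": [2e6, 4e6, 10e6, 14e6, 13e6, 5e6, 7e6],
})

# Median market value for each age, then the age with the highest median.
median_value_by_age = players.groupby("age")["value_eur"].median()
peak_age = median_value_by_age.idxmax()
print(median_value_by_age)
print("Peak age in this toy sample:", peak_age)
```

The median is used instead of the mean so that a handful of very expensive players does not dominate an age group.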

Finally, the last graph shows us that, as the years pass, the overall rating and the potential converge at what we could call the professional peak, close to 29 years.

From this point on, physical condition stabilizes and then begins a gradual decline that is normal and perfectly observable in the players.
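A sketch of how such a convergence could be checked, assuming the data has a current rating and a potential rating for each player. The column names and the numbers are made up for the example.

```python
import pandas as pd

# Made-up ratings, for illustration only.
players = pd.DataFrame({
    "age": [18, 18, 23, 23, 29, 29, 33],
    "overall": [58, 62, 70, 72, 75, 77, 74],
    "potential": [78, 82, 77, 79, 75, 77, 74],
})

# Average rating and potential per age, and the gap between the two.
by_age = players.groupby("age")[["overall", "potential"]].mean()
by_age["gap"] = by_age["potential"] - by_age["overall"]

# First age at which the average gap drops below one rating point.
convergence_age = by_age.loc[by_age["gap"] < 1].index.min()
print(by_age)
print("Gap closes at age:", convergence_age)
```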

Folium

The way to geolocate the clubs present in the database was not easy.

The club names were often abbreviated, missing some acronyms or, in the worst case, identical to the name of a city, a continent or some mythological figure.

This prevented me from using web scraping. Why?

I needed to find the latitude and longitude of each club. Ideally, I would use a web scraping tool that loops over the list of clubs, searches for each one on Wikipedia and extracts the information. Easy, right?

As I mentioned before, the way clubs are written in the database made the process of obtaining latitude and longitude a headache.

Here is an example

Source: FIFA 19

The name of the club in the database is Guadalajara, but in Mexico it is known as Chivas de Guadalajara or just Chivas. If you type the database name into Google, you will notice the result is about the city and not the club.

Source: Google

If we now look for the club’s nickname, the result is the one we want.

Since it is impossible to know in advance how people refer to their club, the latitude and longitude were recovered manually.

Once the information was recovered, the next step was to create a new database with key variables. The variables were:

  • Name of the club
  • Name of the stadium
  • Country
  • Name of the league
  • Continent
  • Latitude
  • Longitude

The result in Folium looks like this

This is the first part of my exercise. If you want to know more about the project, I invite you to visit it at https://jmcass.github.io/SportsAnalytics

Do not miss the second part of this article, where I will talk about the recommendation system that I used to estimate an initial level of wages and market values for the players.

Thanks for reading and sharing!

Jorge Montaño Casillas
Analytics Vidhya

Economist passionate about sports, numbers, finance, music and Python