Soccer Analytics: Prediction of salary and market value using machine learning (2/3).

Published in

Analytics Vidhya

8 min read · Feb 10, 2020

PART II

In this second part, I explain why I used a recommendation system to obtain an initial value against which to compare the output of the machine learning model that predicts the salary and market value of professional players.

The article is divided as follows: the first section explains what a recommendation system is; the second describes how the data was obtained; the last presents the results.

Recommendation System

A recommendation system works, in essence, in the same way as in daily life. Imagine that you want to invite your crush to dinner, but you don’t know which place would be ideal for a date with her.

What do you do? Naturally, the first thing you think about is asking your friends or family.

Why? Because you and your friends tend to share tastes, and their experience gives you “confidence” in the recommendation.

That is, if one of your friends had a good experience, it gives you, a priori, a sense of validity about something completely unknown to you.

This remains true even though the data we generate daily allows search engines like Google to make recommendations based on our interactions with various social networks.

It is important to mention that the recommendations we get from our friends or family will always be limited, because it is not possible, without technological tools, to know all the options available in the world.

Fortunately, there is a technique called collaborative filtering.

Collaborative filtering addresses the information overload faced by consumers of any good or service.

Many companies and websites incorporate a tool where the consumers themselves “build” a “collective recommendation”: people with similar preferences are grouped together and receive “targeted” information or advertising based on the products they have previously clicked on.

Other types of filters are:

• Content-based filter: recommendations are based on the characteristics of the items the consumer has already liked or shown interest in

• Demographic filter: recommendations are based on the characteristics of the users, taking into account age, education level, location, gender, etc.

• Hybrid filter: the result of combining two or more of the filters mentioned above to enrich the user experience

These filters are applied through metrics or distance measures, chosen according to the type of data used. The systems built on these measures are often referred to as “recommenders.”

A recommender collects and analyzes the preferences of the users of a website (online stores, social networks, music or movie sites, etc.).

The main idea of the recommendations is that users with similar activity or tastes will continue to share their preferences in the future.

When a new user is recommended products or activities that other users with similar tastes have previously chosen, the recommendations will tend to match their preferences more and more accurately.

Finding the most closely related users and using this information to predict their preferences is called clustering: finding an optimal subdivision of a data set so that similar data points belong to the same group.

One of the metrics used to calculate this affinity is the “Euclidean distance,” which is simply the generalization of the Pythagorean theorem to N dimensions.
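The idea of recommending what similar users have already chosen can be sketched in a few lines. This is only a toy illustration, not the system used in this project: the ratings matrix, the `recommend` function and the `min_rating` threshold are all hypothetical.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, 0 means "not rated yet".
ratings = np.array([
    [5, 4, 0, 0],   # new user: has not tried items 2 and 3
    [5, 5, 4, 1],   # tastes close to the new user
    [1, 2, 1, 5],   # very different tastes
], dtype=float)


def recommend(ratings, user, min_rating=4):
    """Return unseen items that the most similar other user rated highly."""
    target = ratings[user]
    seen = target > 0
    best_dist, best_other = np.inf, None
    for other in range(len(ratings)):
        if other == user:
            continue
        # Compare the two users only on items the target user has rated.
        dist = np.linalg.norm(ratings[other, seen] - target[seen])
        if dist < best_dist:
            best_dist, best_other = dist, other
    neighbour = ratings[best_other]
    # Recommend items the target has not seen and the neighbour rated highly.
    return [int(i) for i in np.where(~seen & (neighbour >= min_rating))[0]]


recommended = recommend(ratings, user=0)
```

Here the new user is closest to the second user, so the only item suggested is the one that user rated highly and the new user has not tried.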

Other distance measures include:

• Minkowski distance
• Manhattan distance
• Chebyshev distance
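To make these measures concrete, here is a minimal sketch of the four distances applied to two players described by two attributes. The attribute values and function names are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import numpy as np


def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


def manhattan(a, b):
    """Manhattan distance: Minkowski with p = 1 (sum of absolute differences)."""
    return minkowski(a, b, 1)


def euclidean(a, b):
    """Euclidean distance: Minkowski with p = 2 (Pythagorean theorem in N dimensions)."""
    return minkowski(a, b, 2)


def chebyshev(a, b):
    """Chebyshev distance: the largest absolute difference on any single attribute."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.max(np.abs(a - b)))


# Hypothetical attribute vectors (for example pace and shooting) for two players.
player_a = [90, 85]
player_b = [87, 89]

d_euclidean = euclidean(player_a, player_b)   # sqrt(3^2 + 4^2)
d_manhattan = manhattan(player_a, player_b)   # 3 + 4
d_chebyshev = chebyshev(player_a, player_b)   # max(3, 4)
```

Manhattan and Euclidean are the special cases p = 1 and p = 2 of the Minkowski distance, and Chebyshev is the limit as p grows without bound.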

Some companies that use this type of tool are:

• Facebook, Instagram, Twitter, and LinkedIn recommend people based on the people you already know and their connections, so that the information you see is related to your social profile
• Amazon recommends products based on your past purchases and ratings, and on the purchases or ratings of other users similar to you
• Netflix generates its recommendations based on the films you have seen, the ratings you gave, and the movies that users similar to you have watched

However, the key to making a good recommendation is to know how similar two users are.

In this case, we will measure similarity through the physical attributes and skills of the players.

The first thing we will do is find the matches between the players that are in the FIFA 19 database.

Once we have these distances, we will gather the information for 15 professional female players and, finally, obtain the initial values.

Information Extraction

Because of the 10-day time limit I had for my final project at Ironhack, I could only process the information for 15 professional players, 3 of whom are Mexican female soccer players.

It is important to mention that FIFA 19 includes 22 of the 24 women’s national teams, as part of a game update that gave football fans a chance to simulate the Women’s World Cup.

However, no female player’s information is included in the available database. At first, I had no way of using a recommendation system without the players’ data.

My first decision was to search the Transfermarkt portal (https://www.transfermarkt.com/) for these professional players. The problem is that this famous portal, for all its player information, forgot that women also play football.

Let’s see an example

Luckily, there is the FIFA Index portal (https://www.fifaindex.com/es-mx/) where I could check the physical characteristics and skills of the players in question.

Since I didn’t have a list of all the players that participated in the World Cup, I only took a group of players.

Maybe later I will collect the information for all the female players, but for now these 15 players give us quite a powerful insight.

I stored the information of the players in a new database created specifically with the information I needed.

That is, since it was not possible to recover fields such as “Club”, “Position”, “Jersey”, “Loaned from”, “Joined”, “Contract valid”, among others, adding the female players to the original database would have reintroduced empty values, an issue I had already resolved.

Results

The first step was to import the final database we saw in the last article, which is clean and complete.

Once imported, only the numerical fields were kept in order to calculate the Euclidean distances.

These distances were then transformed so that they read like a correlation matrix, where the diagonal is always 1 and the remaining values vary between 0 and 1. The results are shown in the following table.
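The steps just described can be sketched with pandas and NumPy. This is a hedged illustration rather than the exact code of the project: the sample data is hypothetical, and the transformation `1 / (1 + distance)` is an assumption, picked because it gives 1 on the diagonal and values between 0 and 1 elsewhere.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; names and values are illustrative, not the real FIFA 19 data.
players = pd.DataFrame(
    {
        "Name": ["Player A", "Player B", "Player C"],
        "Club": ["X", "Y", "Z"],
        "Pace": [90, 88, 60],
        "Shooting": [92, 90, 55],
    }
).set_index("Name")

# Keep only the numerical fields, as described above.
numeric = players.select_dtypes(include="number")

# Pairwise Euclidean distances between every pair of players.
values = numeric.to_numpy(dtype=float)
diff = values[:, None, :] - values[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=2))

# Assumed transformation: distance 0 maps to similarity 1, large distances approach 0.
similarity = pd.DataFrame(
    1.0 / (1.0 + distances), index=numeric.index, columns=numeric.index
)
```

With attributes on very different scales, it may also be worth standardizing the columns before computing distances, so that no single attribute dominates.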

With these results, I sought to verify that the recommendation system worked.

In particular, I looked for the player most similar to ‘L. Messi’ and, naturally, the player most similar to him is ‘Cristiano Ronaldo’.

Once I checked that this worked, the next step was to include the information of the female players in the recommendation system and calculate the distances again.
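A lookup like this one can be sketched as follows, assuming the similarities are stored in a square pandas DataFrame indexed by player name. The matrix values, the player names and the helper `most_similar` are hypothetical.

```python
import pandas as pd

# Hypothetical similarity matrix; values are illustrative only.
names = ["Player A", "Player B", "Player C"]
similarity = pd.DataFrame(
    [[1.0, 0.8, 0.3],
     [0.8, 1.0, 0.4],
     [0.3, 0.4, 1.0]],
    index=names,
    columns=names,
)


def most_similar(similarity, name, n=5):
    """Return the n players with the highest similarity to `name`, excluding the player."""
    return similarity.loc[name].drop(name).nlargest(n)


top = most_similar(similarity, "Player A", n=2)
```

The player is dropped from their own row first, because the diagonal value of 1 would otherwise always come out on top.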

The way to enter the data of the players was as follows:

The procedure is the same: calculate the distances and regenerate a matrix of values ranging from 0 to 1.
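As a rough sketch of this step (not the original code of the project), new players can be appended to the existing table and the matrix recomputed. The DataFrames, the attribute values and the helper `similarity_matrix` are hypothetical, and `1 / (1 + distance)` is an assumed transformation.

```python
import numpy as np
import pandas as pd


def similarity_matrix(df):
    """Euclidean distances on numeric columns, mapped to (0, 1] with 1 / (1 + d)."""
    values = df.select_dtypes(include="number").to_numpy(dtype=float)
    diff = values[:, None, :] - values[None, :, :]
    distances = np.sqrt((diff ** 2).sum(axis=2))
    return pd.DataFrame(1.0 / (1.0 + distances), index=df.index, columns=df.index)


# Hypothetical existing database and new players, sharing the same numeric columns.
existing = pd.DataFrame(
    {"Pace": [90, 60], "Shooting": [92, 55]}, index=["Player A", "Player B"]
)
new_players = pd.DataFrame({"Pace": [85], "Shooting": [88]}, index=["New Player"])

# Align the columns explicitly, append the new rows, then recompute everything.
combined = pd.concat([existing, new_players[existing.columns]])
similarity = similarity_matrix(combined)
```

Restricting both tables to the same numeric columns avoids the empty values that fields such as “Club” or “Position” would otherwise introduce.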

Finally, the 3 Mexican players selected were: ‘Kenti Robles’, ‘Stephany Mayor’ and ‘Charlyn Corral’.

The values shown refer to the 5 players most similar to each of them.

For Kenti Robles, the closest player is ‘S. Phillips’. Based on her physical characteristics and skills, her salary should correspond to 1000K euros per year or 20.5K pesos. This is almost 5 times what they receive on average per month in Mexico.

The result for Stephany Mayor is similar to that of Kenti Robles.

Finally, the analysis for Charlyn Corral shows that her salary should be close to 41.1K pesos, which is a huge difference from the wages paid in Mexico.
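Reading an initial value from the closest player can be sketched as below. The similarity scores and wages are hypothetical numbers, not results from this project, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical similarity of one new player to players already in the database.
similarity_to_new = pd.Series({"Player A": 0.82, "Player B": 0.35, "Player C": 0.60})

# Hypothetical wages (in euros) of those same players.
wages_eur = pd.Series({"Player A": 1000, "Player B": 4000, "Player C": 2500})

# The initial value is simply the wage of the most similar player.
closest = similarity_to_new.idxmax()
initial_wage = int(wages_eur[closest])
```

This gives a reference value only; it inherits whatever the closest player happens to earn, which is why it is later compared with the prediction of the machine learning model.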

From this exercise, it is possible to understand why it is necessary to study the salary differences between men and women in sport.

An analysis of specific elements such as physical characteristics and attributes provides a large amount of relevant information that may well be of interest to any football club that seeks to attract the best talent possible to compete to win a title.

In particular, in the case of Mexico, it is clear that if this exercise is extended to all female players, the results will be more than revealing, and that it is urgent to review the working conditions of all professional female players.

I invite you to read the last part of this project with the machine learning model, where the final salary and market value prediction is made and compared with the values found in the recommendation system.

If you missed the first part, you can read it here: http://bit.ly/31Ft45d. The final project is available at https://jmcass.github.io/SportsAnalytics/index.html

Thanks for reading and sharing!

Written by Jorge Montaño Casillas