Building a Soccer Player Recommendation System
Update:
The Streamlit Application for this project has been deployed. You can access it by clicking the following link. The recommendations are based on player stats at the end of the 2022–23 Season.
Enjoy the rest of the blog.
Introduction
With so many soccer (or football) fans across the globe, the world tunes in whenever a major event in the sport takes place. Be it at the club or international level, fans throng in support of their favorite teams, and a big part of these teams are the players who represent them. It is therefore no surprise that when the transfer window opens, the world is curious to know what the new player-team pairings will be.
But what if a team isn’t able to sign their primary target? What alternative options do they have that are similar to that player? It is in light of this that I have built a simple player recommendation system that suggests the Top 10 most similar players for a given input player, using Principal Component Analysis (PCA) and a Self Organizing Map (SOM), with Cosine Distance and Bray-Curtis Distance as the similarity measures.
Let’s see how this can be built.
Data Collection
The data used to build this Recommendation System comes from fbref, and was obtained using the worldfootballR library, which extracts data from the fbref website. The data I used consists of player statistics from the 2022–23 season for players in Europe’s Top 5 leagues, the Liga Primeira, and the Eredivisie. The players’ stats include their Standard Stats, Shooting, Passing, Defensive Actions, Pass Types, Goal and Shot Creation, Possession, and Miscellaneous Stats. All of these can be seen on the fbref website.
The worldfootballR library, as the name suggests, is specific to R, and the R code to obtain the data for Europe’s Top 5 Leagues, the Liga Primeira, and the Eredivisie is in my GitHub Repository (link at the end of the article). However, let’s go over the functions in this library that are used to obtain the data.
For Europe’s Top 5 Leagues, the function ‘fb_big5_advanced_season_stats’ is used. Its description and usage are shown in the picture below.
For the Liga Primeira and the Eredivisie, we use 3 functions. The first one is ‘fb_league_urls’, which we use to get the URL of the league’s fbref page for a particular season. Its usage and description are given below.
Thereafter, we need to get the URLs to the web pages of each team that took part in that particular season for that particular league. For this, the function ‘fb_teams_urls’ is used and its description and usage are as follows.
And finally, to obtain the stats of the players of each team in that league in a given season, we use the ‘fb_team_player_stats’ function. Its description and usage are as follows.
The code for obtaining the stats can be found in my GitHub Repository as R-Markdown files. However, to give you a gist of what is happening — we scrape the stats for each category and for each player into dataframes, then perform inner joins to get a single dataframe with the stats of all the required categories, and finally we remove those players who are Goalkeepers as we are only concerned with Outfielders. We then save the final dataframe into a .CSV file that we can use later.
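The actual joining and filtering happens in R, but to make that gist concrete, here is a hypothetical pandas equivalent of the same idea; the file names, join keys, and the ‘Pos’ column are assumptions rather than the exact ones used in the repository.

```python
import pandas as pd
from functools import reduce

# Hypothetical per-category exports from the scraping step
category_files = ["standard.csv", "shooting.csv", "passing.csv", "defense.csv"]
frames = [pd.read_csv(f) for f in category_files]

# Keys assumed to uniquely identify a player's row within a season
keys = ["Player", "Squad", "Comp", "Nation", "Born", "Age"]

# Inner-join all stat categories into a single dataframe
players = reduce(lambda left, right: pd.merge(left, right, on=keys, how="inner"), frames)

# Keep outfielders only (drop goalkeepers) and save for later steps
players = players[players["Pos"] != "GK"]
players.to_csv("players_2022_23.csv", index=False)
```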
Now let’s move on to Data Preprocessing.
Data Preprocessing
As part of the Data Preprocessing step, the most important task is to sum up the stats of players who have played for multiple clubs within the same season, so that every observation is a unique player. This takes care of players who were transferred within the same league or across leagues in a given season. To do this, we create a new variable that holds the concatenated value of the Player’s name, Birth Year, Nation, and Age, and then group by this variable while summing up the stats.
The second important step is specific to the Liga Primeira and the Eredivisie. Here, we need to rename the columns so that they match the column names of the dataframe containing Europe’s Top 5 Leagues. Additionally, the Liga Primeira and Eredivisie dataframes don’t have the league names for all observations, so these have to be filled in.
Finally, prior to modeling, we need to make sure that every player statistic except Matches Played and Minutes Played is converted to its Per 90 form. Here is the formula: Per 90 value = (total value of the stat ÷ minutes played) × 90.
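A minimal pandas sketch of this aggregation, assuming hypothetical column names (‘Player’, ‘Born’, ‘Nation’, ‘Age’):

```python
import pandas as pd

df = pd.read_csv("players_2022_23.csv")

# Composite key so that the same player at two clubs collapses into one row
df["player_id"] = (
    df["Player"].astype(str) + "_" + df["Born"].astype(str) + "_"
    + df["Nation"].astype(str) + "_" + df["Age"].astype(str)
)

# Sum every numeric stat per unique player (Born and Age are part of the key, so leave them out)
stat_cols = [c for c in df.select_dtypes(include="number").columns if c not in ("Born", "Age")]
df_unique = df.groupby("player_id", as_index=False)[stat_cols].sum()
```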
The Python code for this is provided in the GitHub Repository. A point to be noted is that if a player does not have a value for a specific stat, we impute that missing value to 0.
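For illustration, here is a rough pandas sketch of the zero imputation and the per-90 conversion; the CSV name and the ‘MP’ and ‘Min’ column names are assumptions.

```python
import pandas as pd

df = pd.read_csv("players_unique_2022_23.csv").fillna(0)  # missing stats imputed as 0

# Columns to leave untouched: Matches Played and Minutes Played
exclude = {"MP", "Min"}
stat_cols = [c for c in df.select_dtypes(include="number").columns if c not in exclude]

# Per 90 value = stat * 90 / minutes played
minutes = df["Min"].replace(0, pd.NA)
df[stat_cols] = df[stat_cols].mul(90).div(minutes, axis=0).fillna(0)
```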
Before we go ahead, here’s a brief description of the variables in the dataset.
Principal Component Analysis (PCA)
As of now, the final dataset has around 103 dimensions that we wish to base the recommendation system on. Since the system works on the principle of content-based filtering, it is important to find points, or in this case players, that are close to each other. However, the number of dimensions is too large to compute the neighborhood of a point meaningfully. Hence, we need to perform dimensionality reduction, which is achieved using Principal Component Analysis (abbreviated as PCA).
Essentially, what PCA does is find new dimensions that are orthogonal to one another. These new dimensions (the eigenvectors) are ordered in descending order by their respective eigenvalues, where each eigenvalue measures the variance explained along its eigenvector. So the dimension associated with the largest eigenvalue explains the largest share of variance in the dataset, and so on. If you need a refresher on eigenvectors and eigenvalues, I suggest this article.
To briefly explain how PCA works (a short NumPy sketch follows this list):
- The covariance matrix of the dataset is calculated
- Eigenvectors and their respective eigenvalues are found. Eigenvectors are sorted in descending order of their respective eigenvalues.
- We take only the k eigenvectors that explain the required amount of variance in the data. In our case, we choose the eigenvectors that explain 95% of the variance in the data.
- Multiply the standardized data by the matrix of these k eigenvectors to get the new dataset with k features.
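Here is a minimal NumPy sketch of those steps, with the standardization discussed in the next paragraph included so the snippet is self-contained (the project itself uses scikit-learn, shown afterwards):

```python
import numpy as np

def pca_project(X, variance_threshold=0.95):
    # Standardize (zero mean, unit variance) so no feature dominates
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # Covariance matrix and its eigendecomposition (eigh: covariance is symmetric)
    cov = np.cov(X_std, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Sort eigenvectors by descending eigenvalue
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Keep the smallest k whose eigenvalues explain at least 95% of the variance
    explained = np.cumsum(eigvals) / eigvals.sum()
    k = np.searchsorted(explained, variance_threshold) + 1

    # Project the data onto the top-k eigenvectors
    return X_std @ eigvecs[:, :k]
```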
However, prior to performing PCA, we need to standardize the data. Through standardization, we ensure that the principal components obtained are not biased toward the variables that have larger scales. We create a pipeline to perform standardization and then PCA (Code in the link at the end of the article).
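A minimal sketch of such a pipeline with scikit-learn, assuming the per-90 stats (without the position encodings) sit in a matrix called X_stats:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize first, then keep enough components to explain 95% of the variance
pca_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
])

# X_stats: the per-90 stats matrix (position encodings are kept aside and appended after PCA)
X_reduced = pca_pipeline.fit_transform(X_stats)
print(X_reduced.shape)  # roughly (n_players, 42) per the text
```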
Once PCA is completed, we end up with a dataset containing 42 variables (this count excludes the encoded variables for player position). We then use these 42 variables together with the position encodings in the Self Organizing Map, which we’ll look at next.
Self Organizing Map (SOM)
As the name says, a SOM is a map or a lattice of neurons, where each neuron may end up becoming a cluster centroid for a certain set of input observations.
The best way to understand how a SOM works is to visualize it. Imagine a set of points on a 2D flat surface and look at the surface from below. Each of these points is a neuron, and each neuron is connected to the input layer, where the connection is nothing but a vector of weights. The number of weights in the connection is equal to the number of features in the dataset, and every observation in the dataset is presented to the map through these connections.
The goal of a SOM is three-fold:
a) An input observation is assigned to the neuron that is most similar to it with respect to a certain distance metric. (Competition)
b) Find the weight values of a neuron such that neighborhood neurons have similar values. (Collaboration)
c) Each unit becomes a cluster centre where a cluster contains a certain set of observations. (Weight Update)
Let’s look at the steps involved in SOM.
- First we initialize the weights in the connections randomly. Remember that a connection is a weight vector representing a neuron.
- Next is the competition step. For every input observation, we find the neuron that is most similar to it using a specific distance metric. In the case of this project, I’m using a combination of Cosine Distance and Bray-Curtis Distance. Remember that this step will most likely result in a specific neuron being similar to more than one observation. This most-similar neuron is also called the Best Matching Unit (BMU).
- Now we arrive at the collaboration step. Here we have to understand that every neuron has a neighborhood of neurons identified by a specific function. You can visualize this as a circle centred on the BMU with a radius extending from the centre; whatever neurons fall inside that circle are in the neighborhood of the BMU. As the word collaboration suggests, the aim is to make every neighborhood neuron similar to the input vector, and this is where the weight update step comes into play. A point to note is that, over multiple iterations, the size of the neighborhood gets smaller. Moreover, the weight update involves a learning rate which also decreases as the number of iterations increases.
- Step 4 is to repeat steps 2 and 3 for a certain number of iterations.
At the end of it, we will have clusters of input observations where each cluster’s centre is a neuron in the map. I won’t go through all the details here to save time, but you can read this article for a step-by-step idea of what happens in a SOM; a stripped-down sketch of the training loop is also shown below.
Some important hyperparameters in the SOM are the number of neurons, the parameter that controls the size of the BMU’s neighborhood, and the learning rate.
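Here is that stripped-down sketch of the training loop in NumPy, using Euclidean distance and a Gaussian neighborhood for simplicity (the actual project uses minisom with a cosine plus Bray-Curtis metric):

```python
import numpy as np

def train_som(X, rows=2, cols=3, n_iter=1000, lr0=0.5, sigma0=1.0):
    rng = np.random.default_rng(42)
    weights = rng.random((rows, cols, X.shape[1]))          # step 1: random weight init
    grid = np.array([[r, c] for r in range(rows) for c in range(cols)]).reshape(rows, cols, 2)

    for t in range(n_iter):
        lr = lr0 * np.exp(-t / n_iter)                      # learning rate decays
        sigma = sigma0 * np.exp(-t / n_iter)                # neighborhood radius shrinks
        x = X[rng.integers(len(X))]                         # pick a random input observation

        # Competition: the Best Matching Unit is the neuron closest to x
        dists = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(dists.argmin(), dists.shape)

        # Collaboration: Gaussian neighborhood centred on the BMU
        grid_dist = np.linalg.norm(grid - grid[bmu], axis=2)
        h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))

        # Weight update: pull neighborhood neurons toward the input
        weights += lr * h[..., None] * (x - weights)

    return weights
```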
For this project, I have made use of the ‘minisom’ library. The link here provides the details of the library. One more point to note in the case of this project is the grid size I have chosen: 2x3, which is basically a plane of 6 neurons arranged in 2 rows and 3 columns. This resulted in 6 clusters of reasonable sizes. Moreover, I noticed that even if I increased the grid size, the topographic error and the quantization error were small and did not change much.
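A minimal minisom sketch under the settings described above, using the library’s built-in cosine distance for simplicity (combining it with Bray-Curtis would need a custom distance function); X_som and the training parameters other than the 2x3 grid are placeholders:

```python
import numpy as np
from minisom import MiniSom

# X_som: PCA components plus encoded positions, one row per player (assumed to exist)
som = MiniSom(x=2, y=3, input_len=X_som.shape[1],
              sigma=1.0, learning_rate=0.5,
              activation_distance='cosine', random_seed=42)

som.random_weights_init(X_som)
som.train_random(X_som, num_iteration=10000)

# Diagnostics mentioned in the text
print("quantization error:", som.quantization_error(X_som))
print("topographic error:", som.topographic_error(X_som))

# Cluster label = the (row, col) of each player's Best Matching Unit
clusters = np.array([som.winner(x) for x in X_som])
```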
Now that we have our observations clustered, we come to the final step — the Recommendations.
Recommendations
Our aim is to recommend the top 10 most similar players for a given player. For this, we basically use the content-based filtering concept.
- Prepare a distance matrix for every cluster obtained from SOM. The distance metric is the sum of the Cosine Distance and Bray-Curtis Distance.
- For a given input player, identify the player’s cluster and thereby its distance matrix. Once we have the distance matrix, we simply look up the 10 closest observations (i.e. players) to the input player using the distances in that matrix (see the sketch after this list).
- Once we have the 10 closest players for the given input player, we can visualize them using radar plots. I have built the radar plots using a few of the features from the original dataset, with each plot containing 2 radars: one for the input player and one for the recommended player. This results in 10 radar plots. For this, I made use of the ‘mplsoccer’ and ‘matplotlib’ libraries. The link here points to the mplsoccer documentation that explains how to build the radar plot.
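As an illustration of the first two steps, here is a rough sketch of the per-cluster distance matrix and the top-10 lookup using SciPy; the function and variable names are placeholders:

```python
import numpy as np
from scipy.spatial.distance import cdist

def top_10_similar(player_name, cluster_players, cluster_X):
    """cluster_players: list of names in one cluster; cluster_X: their feature matrix."""
    # Combined metric: sum of cosine distance and Bray-Curtis distance
    dist_matrix = cdist(cluster_X, cluster_X, metric="cosine") + \
                  cdist(cluster_X, cluster_X, metric="braycurtis")

    idx = cluster_players.index(player_name)
    order = np.argsort(dist_matrix[idx])           # smallest distance first
    closest = [i for i in order if i != idx][:10]  # drop the input player themselves
    return [cluster_players[i] for i in closest]

# Example usage: top_10_similar("Luka Modric", names_in_cluster, features_in_cluster)
```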
Here’s an example of one of the recommendations for Luka Modrić, the star midfielder of Real Madrid.
Code
The R code required for scraping the data, and the Python code required for cleaning the data, data pre-processing, and the model itself are all in the following GitHub Repository.
URL: https://github.com/sameerprasadkoppolu/Soccer-Player-Recommendation-System
Finally, I hope you enjoyed your read. I encourage you to try it out for yourselves and have fun with it! Maybe this could help you in Manager Mode on FIFA.