Building a Player Recommender Tool

Avneesh Singh Saini
Published in Analytics Vidhya · 13 min read · Jun 30, 2021
image credits: Real Python

I will take you through the design of a simple, end-to-end nearest-neighbors recommendation system built on the latest football data available.

Contents:

  1. Problem statement
  2. Data gathering
  3. Data pre-processing
  4. EDA (Exploratory Data Analysis)
  5. Curse of Dimensionality
  6. Dimensionality reduction using PCA
  7. Recommendation system
  8. Deployment using Streamlit
  9. Future improvements
  10. Final thoughts

1. Problem statement

The goal (no pun intended) is to build an application that lets you find players similar to a selected football player. The latency requirement is that results are fetched and displayed in real time, with minimal loading times.

Existing solutions: StatsBomb is the industry-leading organization when it comes to football analytics. StatsBomb IQ’s similar-player search tool can safely be regarded as the state-of-the-art solution, but its implementation is not public.

My solution is an independent attempt at designing an application with the basic functionality of getting similar players. This is an unsupervised learning problem and hence no performance metrics are used.

2. Data gathering

Sports Reference provides open data for some of the major sports in the world. In our case we’ll be using one of its offerings, Football Reference (known as FBref). StatsBomb supplies advanced statistics to FBref, so I would like to credit them both for the data. We’ll be getting 2020–21 season data for all players in the big 5 European leagues:

  1. England (Premier League)
  2. Spain (La Liga)
  3. Italy (Serie A)
  4. Germany (Bundesliga)
  5. France (Ligue 1)
There are different types of player stats available

To get data from the ‘Standard Stats’ section, go to the page, scroll down to the Players table and click ‘Toggle Per90 Stats’ to convert all the statistics to per-90-minute values, so that all players are on a level playing field despite playing different amounts of minutes over the season. Hover over ‘Share & Export’ and click ‘Get table as CSV’. Copy the highlighted text as shown in the picture below, paste it into a text editor and save it as a CSV. A similar process applies to the remaining stat types.

3. Data pre-processing

After getting all types of stats as CSVs, it’s time to transform them into quality data and prepare them for the upcoming stages. Note that I will be segregating the data into two datasets: outfield players and goalkeepers. The goalkeepers dataset will contain the ‘Goalkeeping’ and ‘Advanced Goalkeeping’ stats, while all the remaining stats will go into the outfield players dataset. Keep in mind that the steps shown from here on are carried out for outfield players only; the same steps have also been performed for the goalkeepers dataset.

3.1 Combining into a single data frame

Reading CSVs

The redundant features, which are repeated in every CSV, are removed after reading each CSV (except the first one).

Next up, some features have the same names across different CSVs but actually stand for different things. After concatenating the data frames, we would not be able to tell these features apart. To tackle this, I append the table number to the feature names using a custom function. This also makes it easy to recognize which type of stat a particular feature belongs to.

Renaming features
grand data frame
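Here is a minimal sketch of that combining step (the file names, the list of repeated columns and the suffix format are illustrative, not the exact ones used in the project):

```python
import pandas as pd

# Illustrative file names for the different stat tables saved from FBref.
csv_files = ['standard.csv', 'shooting.csv', 'passing.csv', 'defense.csv']

# Player-info columns that repeat in every CSV; kept only from the first file.
repeated_cols = ['Player', 'Nation', 'Pos', 'Squad', 'Comp', 'Age', 'Born', '90s']

frames = []
for i, path in enumerate(csv_files):
    df = pd.read_csv(path)
    if i > 0:
        # Drop the redundant player-info columns from every CSV except the first.
        df = df.drop(columns=[c for c in repeated_cols if c in df.columns])
    # Append the table number so identically named stats stay distinguishable.
    df = df.rename(columns={c: f'{c}_{i + 1}' for c in df.columns
                            if c not in repeated_cols})
    frames.append(df)

# Concatenate column-wise into one "grand" data frame.
outfield = pd.concat(frames, axis=1)
```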

3.2 Some filtering

In this step, we’ll be selecting players with at least three 90s played (i.e. 270+ minutes) and excluding goalkeepers from the outfield dataset. Also, as you can observe from the previous picture, the ‘Player’ and ‘Comp’ features are not in the appropriate format. As these features will be used to filter results in the application, we extract clean player names and keep only the league competition name, removing the country short codes.
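A rough sketch of this filtering, assuming the FBref-style formats for ‘Player’ ("Name\Name-Slug") and ‘Comp’ ("eng Premier League"):

```python
# Keep players with at least three 90s and move goalkeepers to their own dataset.
outfield = outfield[outfield['90s'] >= 3]
outfield = outfield[~outfield['Pos'].str.contains('GK')]

# Keep only the readable player name and drop the country prefix from 'Comp'.
outfield['Player'] = outfield['Player'].str.split('\\').str[0]
outfield['Comp'] = outfield['Comp'].str.split(' ', n=1).str[1]
outfield = outfield.reset_index(drop=True)
```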

3.3 Dealing with null values

Upon checking using df.isnull().sum().sum(), there are 747 null/missing values in the dataset. This is because players of different positions might not have values for some features. We’ll be replacing these with 0 using df = df.fillna(0).

3.4 Dealing with duplicate names

Two entries for Morgan Sanson

If you’re not familiar, some players go out on loan or are transferred permanently to other clubs in the summer or winter transfer window, so they appear in the dataset more than once. As I want to include player name as a drop-down filter in the application later, I need player names to be unique. To resolve this, I simply created a dictionary that maps ‘Player + Squad’ to the corresponding row index, since combining the player name and squad (club name) creates unique keys.

Player-ID hash table
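A small sketch of how such a hash table could be built (the exact key format, here "Name Squad", is an assumption):

```python
# Map "Player Squad" -> row index; name + squad is unique even for players who
# appear twice after a mid-season move.
player_ID = {
    f"{row['Player']} {row['Squad']}": idx
    for idx, row in outfield.iterrows()
}

player_ID['Morgan Sanson Aston Villa']  # e.g. returns that entry's row index
```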

3.5 Adding new feature: Preferred foot

In the final application, I would also like to include preferred foot as a filter, which is missing from our dataset. I will use the passes made with the left and right foot to infer the preferred foot by taking the ratio of these two features. If the ratio (left over right) is greater than 1, the player is most likely left-footed; otherwise right-footed.
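A possible sketch of this feature, where 'Passes_Left' and 'Passes_Right' are placeholder column names for the left- and right-foot pass counts:

```python
import numpy as np

# If a player attempts more left-foot than right-foot passes, label them left-footed.
ratio = outfield['Passes_Left'] / outfield['Passes_Right']
outfield['Foot'] = np.where(ratio > 1, 'Left', 'Right')
```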

3.6 Final datasets

Outfield dataset

Outfield players = 2040 players, 177 features (164 statistical features)

Goalkeepers = 173 players, 52 features (40 statistical features)

4. Exploratory Data Analysis

With EDA, we can visualize, interpret and draw insights from our dataset. This is made possible with the help of statistical graphics and data visualization techniques.

4.1 Position analysis

Serie A (Italy) has the highest number of defenders (DF), midfielders (MF) and forwards (FW), but it lags behind in the other variations of these positions. The number of defenders is significantly higher than any other position, and players become scarcer as we move further up the pitch.

4.2 Age analysis

It is interesting to note that the age distribution is close to a bell curve or normal distribution (with a dipping tail), with a mean of 25.78 and a standard deviation of 4.22. More than 50% of the players are between 23 and 29 years old.

Ligue 1 (France) has the youngest crop of players among the big 5 European leagues, with a median age of 24. Interestingly, Ligue 1 also contains the oldest player in the dataset, as we can observe from the above plot: 42-year-old Brazilian centre-back Vitorino Hilton, who plays for Montpellier.

4.3 t-SNE

t-distributed Stochastic Neighbor Embedding (t-SNE) is a tool primarily used to visualize high dimensional data and for data exploration. In simpler terms, t-SNE gives us a feel or intuition of how the data is arranged in a high-dimensional space. In our case, we’ll be converting 164 features into two features and plotting the two obtained dimensions as a scatter plot.
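For reference, a minimal t-SNE sketch with scikit-learn, assuming 'stats' holds the numeric statistical features of the outfield dataset (the exact settings used for the plot in the post may differ):

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns

# Project the high-dimensional stats down to 2D for visualization only.
stats = outfield.select_dtypes('number')  # roughly the 164 statistical features
embedding = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(stats)

# Color each point by the player's position to see how the groups separate.
sns.scatterplot(x=embedding[:, 0], y=embedding[:, 1], hue=outfield['Pos'], s=15)
plt.title('t-SNE projection of outfield players')
plt.show()
```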

The positions DF (blue), MF (orange) and FW (red) are fairly separated from each other. DFMF (brown) is in and around DF, while players with position FWMF (green) are near the FW group. Intriguingly, there are two separate groups for DF (bottom and right): one for central defenders and the other for fullbacks (who mostly operate in the wide areas of the pitch).

5. Curse of dimensionality

Curse of dimensionality refers to a set of problems that arise when working with high-dimensional data. The common theme of these problems is that as the dimensionality increases, the volume of the space grows so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance. In simpler terms, beyond a certain point, increasing the dimensionality of the problem by adding new features actually degrades the performance of our solution. There are two main facets of the curse of dimensionality:

Data sparsity: As the dimensionality of the dataset increases, the data points cover an ever smaller fraction of the space. This can lead to high variance or overfitting in supervised learning problems.

Distance concentration: Distance concentration refers to the problem of all the pairwise distances between different samples/points in the space converging to the same value as the dimensionality of the data increases. Due to this, the concept of proximity or similarity of the samples may not be qualitatively relevant in higher dimensions. We’ll tackle this issue in the next section.

6. Dimensionality reduction using PCA

credits: this wonderful answer on StackExchange

Principal Component Analysis (PCA) is a dimensionality-reduction method that is often used to reduce the dimensionality of large datasets by transforming a large set of features into a smaller one that still contains most of the information in the original set. The above graphic is a nice illustration of data points in 2D space being projected onto 1D in a way that preserves the maximum variance or spread in the data. Observe when exactly the red points (projections) on the rotating line are the furthest apart: it is when the line aligns with the pink segments on the sides, and that direction is called the first principal component.

PCA vs t-SNE: One of the major differences between PCA and t-SNE is that t-SNE preserves only small pairwise distances or local similarities whereas PCA is concerned with preserving large pairwise distances to maximize variance. In our case we’ll be using t-SNE as just a visualization tool while PCA will be carried out for dimensionality reduction and the selected components will be used in the final solution.

As we can observe, just 40 components already explain 90% of the variance. I’ll be selecting the first 90 principal components, which explain 99.5% of the variance. But what does this actually mean? It shows that many of the original features were correlated: even after transforming the samples into a lower-dimensional space, we preserve almost all of the variance while also tackling the curse of dimensionality, thanks to PCA.

selecting 90 components
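A minimal PCA sketch with scikit-learn, assuming the statistical features are standardized first (a common practice; the post's exact preprocessing may differ):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scale the statistical features, then keep the first 90 principal components.
scaled = StandardScaler().fit_transform(stats)
pca = PCA(n_components=90)
components = pca.fit_transform(scaled)

print(pca.explained_variance_ratio_.cumsum()[-1])  # ≈ 0.995 of the variance retained
```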

7. Recommendation system

Our ultimate objective is to find the most similar players to a search query, but how can we accomplish that? Euclidean or Manhattan distance immediately springs to mind, but these are not the most appropriate metrics for this problem, especially when dealing with a real-world dataset of football players who operate in different roles and are diversely rich in terms of their statistical output. In high-dimensional spaces, Euclidean distance can misrepresent the similarity between two vectors. The cosine similarity metric helps overcome this problem.

Cosine similarity: Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction. Its value ranges between -1 and 1, where -1 is perfectly dissimilar and 1 is perfectly similar.

Cosine similarity formulation
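For reference, the standard formulation for two vectors A and B is:

cos(θ) = (A · B) / (‖A‖ ‖B‖) = Σᵢ AᵢBᵢ / ( √(Σᵢ Aᵢ²) · √(Σᵢ Bᵢ²) )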

The figure below is a nice illustration of the difference between Euclidean distance and cosine similarity. As you can see, the distance d2 is smaller than d1, which would suggest that the agriculture and history corpora are more similar than the agriculture and food corpora, which is intuitively wrong. Cosine similarity instead measures the angle between the two vectors and takes its cosine. If the angle is small, the cosine tends to 1; as the angle approaches 90 degrees, the cosine tends to 0. In other words, the larger the angle between two vectors, the lower the similarity. In this case, cosine similarity shows the agriculture corpus to be more similar to the food corpus than to the history corpus, which intuitively makes sense.

Image credits: aman.ai

Code implementation: First, the getStats function returns a specific player’s feature vector using the player-ID hash table implemented before. This function will be used when computing cosine similarity.

Cosine Similarity = 1 - Cosine Distance

This formulation is used in the similarity function: as the cosine distance decreases, the similarity value increases (tends to 1), and vice versa.
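A sketch of these two helpers, using SciPy's cosine distance and assuming 'components' holds the 90 PCA features per player (the function names mirror the post; the exact implementation may differ):

```python
from scipy.spatial.distance import cosine

def getStats(name):
    """Return a player's PCA feature vector via the player-ID hash table."""
    return components[player_ID[name]]

def similarity(player1, player2):
    """Cosine similarity = 1 - cosine distance between two players' vectors."""
    return 1 - cosine(getStats(player1), getStats(player2))
```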

Next up, we iterate through all 2040 players and, in an inner loop, compute similarity values with every other player. After collecting the values in a list, we want to make them a bit more presentable and interpretable, so I normalize the similarity values to the range 0 to 100. Now, if you didn’t notice: won’t the similarity value of the query (outer loop) with the player (inner loop) be 1 when both are the same player? (Think diagonal elements.) Yes, the diagonal elements will be 1, so after normalizing and sorting by similarity, we will always get the selected player itself as a 100% match. To fix this, I simply drop the top result and display the rest. As a trade-off, no player will have a 100% match with another player, which intuitively makes sense as well.
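A rough sketch of this precomputation step (the normalization and the data structure are assumptions based on the description above):

```python
# For every player, compute similarity with all players, rescale to 0-100 and
# keep a ranked list. This is the quadratic-time part of the pipeline.
engine = {}
for query in player_ID:
    sims = [similarity(query, other) for other in player_ID]

    # Min-max normalize to a 0-100 "percentage match".
    lo, hi = min(sims), max(sims)
    scaled = [100 * (s - lo) / (hi - lo) for s in sims]

    ranked = sorted(zip(player_ID.keys(), scaled), key=lambda x: x[1], reverse=True)
    engine[query] = ranked[1:]  # drop the 100% self-match
```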

Latency requirement: The above algorithm runs in quadratic time and takes around 3 min 45 sec to complete. We can’t afford this computational cost at query time in the application. Hence, I precompute everything once and save the resulting dictionary as a pickle file that can be loaded later (pickle is a way to store Python objects locally for later use). Retrieving a value by key from a hash table is a constant-time operation, i.e. O(1). After computing all similarity values for a specific player, I store the normalized list under the appropriate key (name + squad, as implemented earlier). The speed issue is now fixed for displaying results in real time.
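A small sketch of the pickle round trip (the file name and the example key are illustrative):

```python
import pickle

# Save the precomputed engine once...
with open('outfield_engine.pickle', 'wb') as f:
    pickle.dump(engine, f)

# ...and load it at app start-up; each lookup is then an O(1) dictionary access.
with open('outfield_engine.pickle', 'rb') as f:
    engine = pickle.load(f)

engine['Lionel Messi Barcelona'][:5]  # e.g. the five most similar players
```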

8. Deployment using Streamlit

Streamlit is a very convenient resource to turn data scripts into web applications with no front-end experience required. But there are some basic Streamlit concepts that you should get familiarized with:

st.cache() - This function decorator memoizes the function it wraps. In simpler terms, when the getData function is called for the first time, it loads the pickle files; whenever getData is called again (with the same arguments), Streamlit returns the already loaded objects. So by just using st.cache, we prevent the application from reloading all the data from scratch every time a parameter/filter is changed, and hence avoid long loading times.

st.cache usage
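A sketch of what such a cached loader could look like (the file names are illustrative):

```python
import pickle
import streamlit as st

@st.cache(allow_output_mutation=True)
def getData():
    # Runs only on the first call; subsequent calls return the cached objects.
    with open('outfield_data.pickle', 'rb') as f:
        outfield_data = pickle.load(f)
    with open('gk_data.pickle', 'rb') as f:
        gk_data = pickle.load(f)
    return outfield_data, gk_data

outfield_data, gk_data = getData()
```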

Interactive widgets - With widgets, Streamlit allows us to bake interactivity directly into the app with buttons, sliders, text inputs, and more. We can simply assign a widget to a variable and read the user input from it, as can be seen in the code below. If radio equals ‘Outfield players’, we load the data from outfield_data, else from gk_data. st.beta_columns is another feature that lets us align widgets in columns, taking a list of column widths as its argument.

Interactive widgets
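An illustrative widget layout (variable names and widget choices are assumptions, not the exact app code):

```python
import streamlit as st

# outfield_data / gk_data are assumed to be (data frame, similarity engine) pairs
# returned by the cached getData() above.
radio = st.radio('Player type', ['Outfield players', 'Goal Keepers'])
df, engine = outfield_data if radio == 'Outfield players' else gk_data

col1, col2 = st.beta_columns([2, 1])   # two columns with a 2:1 width ratio
player = col1.selectbox('Player name', sorted(engine.keys()))
count = col2.slider('Number of results', 5, 25, 10)
```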

Similarly, using interactive widgets functionality of Streamlit, following filters/parameters have been added:

  1. Player type: outfield or goalkeepers data set
  2. Player name: get similar players for this selected player
  3. Preferred foot: preferred foot of the players in result
  4. Comparison with: comparison with the same position or all positions
  5. League: league competition to get results from
  6. Age bracket: results will be returned based on a particular age window
  7. Number of results

Now let’s dive into the code which will enable the application to return and display results corresponding to any change in the filters mentioned above.

After taking the user input from the widget variables, they are passed to the getRecommendations function. The basic features to be displayed in the results table are then selected (including foot type for outfield results only). The ranked similarity values for the selected player are added to the data frame, and the top result, which is the player itself with a 100% match, is dropped as discussed earlier. After this, the results are filtered according to comparison type, league, age bracket and foot type. Finally, the results are trimmed to the requested count and displayed as a table using st.table.
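A condensed sketch of getRecommendations (column names, filter defaults and the merge key are illustrative; the real function handles more cases, such as same-position comparison):

```python
import pandas as pd
import streamlit as st

def getRecommendations(df, engine, player, league='All', foot='All',
                       age_bracket=(15, 45), count=10):
    # Precomputed (name, score) pairs; the self-match was already dropped
    # when the engine was built.
    ranked = engine[player]
    res = pd.DataFrame(ranked, columns=['Player + Squad', 'Similarity'])

    # Assumes df carries a 'Player + Squad' column matching the engine keys.
    res = res.merge(df, on='Player + Squad', how='left')

    if league != 'All':
        res = res[res['Comp'] == league]
    if foot != 'All':
        res = res[res['Foot'] == foot]
    res = res[res['Age'].between(*age_bracket)]

    return res.head(count)

st.table(getRecommendations(df, engine, player, count=count))
```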

Final application

You can access the application here. Now it’s time for you to play around with the tool and search for similar players to your favorite ones. Who knows, you may unearth some hidden gems like this gentleman below:

9. Future improvements

  1. Automation: The data gathering process can be automated using Selenium/Requests and Beautiful Soup. As the regular season has concluded, I didn’t feel the need to dig into automation for this project but it will be very convenient during the season when the data gets updated after every matchday.
  2. Team bias: This is more of a football problem than a data problem. The issue is that a specific player’s statistical output is heavily influenced by the side he plays in. Think style of play: a more ball-dominant team’s midfielders will have high values for passing and other possession-based features. Due to this, in some cases, the midfielders of that team can have high similarity values with each other. Minimizing or eradicating this bias is a non-trivial problem and may require some research.
  3. Weighted features: Players have different roles within a team, which require them to excel in the relevant aspects of the game (features). Thus it would make sense to assign different weights to some features when computing results for a specific player. Finding those weights is again a non-trivial football problem. One alternative would be One-Hot encoding labels that describe the role of a player, which would be more descriptive than position labels, but we’d have to build them manually for each player!

10. Final thoughts

This project was simply born out of curiosity and then countless experiments. If you learnt anything new of value, I’ll have accomplished the objective of this blog post. You can reach out to me on Twitter and connect with me on LinkedIn here.
