Premier League 2021–2022: Player similarity analysis

Luis Alvaro
INST414: Data Science Techniques
3 min read · Mar 13, 2023

This assignment explores and measures similarities between Premier League football players for the 2021–2022 season. The code takes three target players and, for each one, ranks the ten most similar players based on a set of football stats from that season. For instance, we may want to find the players who were most similar to the best player of the season, or a team may want to scout a more affordable player who is similar to its target player.

I used FootyStats’s football data API to extract the raw data for players in the Premier League in the 2021–2022 season. The raw data came in JSON format, and from it I collected specific stats for each player. I then created a data frame from the collected data, with players as the index and each column representing a feature, that is, a specific statistic we want to compare. The seven features used to determine similarity are total minutes, appearances, goals, assists, average playing time, yellow cards, and red cards. The similarity metric is Euclidean distance.
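As a rough illustration of this step, here is a minimal sketch of turning a parsed JSON response into a player-indexed data frame. The response layout, the field names (`full_name`, `minutes_played`, and so on), and the helper `build_player_frame` are assumptions made up for this example and may not match what the FootyStats API actually returns; the sample rows are placeholders, not real statistics.

```python
import pandas as pd

# Hypothetical sample shaped like a parsed JSON response.
# Field names and values are placeholders, not real FootyStats output.
sample_response = {
    "data": [
        {"full_name": "Player A", "minutes_played": 2700, "appearances": 30,
         "goals": 10, "assists": 5, "avg_minutes": 90.0,
         "yellow_cards": 3, "red_cards": 0},
        {"full_name": "Player B", "minutes_played": 1200, "appearances": 20,
         "goals": 2, "assists": 7, "avg_minutes": 60.0,
         "yellow_cards": 1, "red_cards": 1},
        {"full_name": "Player C", "minutes_played": 450, "appearances": 10,
         "goals": 0, "assists": 1, "avg_minutes": 45.0,
         "yellow_cards": 0, "red_cards": 0},
    ]
}

# The seven features, under the assumed field names.
FEATURES = ["minutes_played", "appearances", "goals", "assists",
            "avg_minutes", "yellow_cards", "red_cards"]


def build_player_frame(response):
    """Return a data frame with players as the index and one column per feature."""
    frame = pd.DataFrame(response["data"]).set_index("full_name")
    # Keep only the feature columns and store them as floats for later scaling.
    return frame[FEATURES].astype(float)


players = build_player_frame(sample_response)
print(players)
```

Each row is one player and each column is one statistic, which is the layout the normalization and distance steps expect.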

I did the analysis in Python. The following packages and methods helped me gather the data and calculate the distances.

We can see that the first and fifth columns are on a different scale from the other columns, so we need to rescale them to [0, 1]. Either min/max or L2 normalization would be applicable here, since the goal is to bring the columns into a more consistent range. Applying min/max normalization gives us the following.
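For reference, min/max normalization maps each value x in a column to (x - min) / (max - min), so the smallest value in the column becomes 0 and the largest becomes 1. A minimal sketch with pandas follows; the helper name `min_max_normalize`, the column names, and the toy numbers are assumptions for illustration, not the real player data.

```python
import pandas as pd


def min_max_normalize(frame):
    """Rescale every column of `frame` to the [0, 1] range."""
    col_min = frame.min()
    col_range = frame.max() - col_min
    # A constant column has a range of 0; divide by 1 instead so it maps to 0
    # rather than producing NaN.
    col_range = col_range.where(col_range != 0, 1.0)
    return (frame - col_min) / col_range


# Toy data (placeholder values).
toy = pd.DataFrame(
    {"minutes": [900, 1800, 2700], "assists": [0, 5, 10], "red_cards": [0, 0, 0]},
    index=["Player A", "Player B", "Player C"],
)

normalized = min_max_normalize(toy)
print(normalized)
```

After this step a column measured in thousands of minutes and a column measured in single-digit cards contribute on the same scale to the distance.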

Next, we can calculate the distances after min/max column normalization. Our three target players, or queries, are Paul Pogba, Gabriel Martinelli, and Leandro Trossard. Using scipy.spatial.distance.cdist() to get the distances from each target player to the other players, we get the following output. The value next to each name is the Euclidean distance: a value approaching 0 means the player is highly similar to the target, and a larger value means the player is less similar. Because the distance is computed over several features that each range over [0, 1], it is not capped at 1.
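The ranking step can be sketched as follows. This is a minimal example under stated assumptions: the helper `rank_similar`, the feature names, and the toy values are hypothetical, while `cdist` with `metric="euclidean"` is the SciPy function named above.

```python
import pandas as pd
from scipy.spatial.distance import cdist


def rank_similar(normalized, target, top_n=10):
    """Return the `top_n` players closest to `target` by Euclidean distance."""
    query = normalized.loc[[target]]           # 1 x n_features frame
    others = normalized.drop(index=target)     # exclude the target itself
    # cdist returns a (1, n_others) array of pairwise distances.
    distances = cdist(query.to_numpy(), others.to_numpy(), metric="euclidean")[0]
    return pd.Series(distances, index=others.index).sort_values().head(top_n)


# Toy, already-normalized data (placeholder values).
toy_normalized = pd.DataFrame(
    {"feature_1": [0.0, 0.1, 1.0, 0.5], "feature_2": [0.0, 0.0, 1.0, 0.5]},
    index=["Player A", "Player B", "Player C", "Player D"],
)

closest = rank_similar(toy_normalized, "Player A", top_n=2)
print(closest)
```

Dropping the target before ranking keeps it from appearing as its own nearest neighbor at distance 0.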

Cleaning the data was a relatively easy process since the data came from the FootyStats.org API, which is well maintained. Most of the work was simply parsing the JSON response and transferring the data into a data frame; there was no need to reformat the data or clean it further. One issue with this API is that, for unknown reasons, it only provides 199 players and their statistics for the 2021–2022 season. The total number of players in the Premier League is much larger, so there could be some margin of error in our ranking. Furthermore, the similarity ranking for these three players could be different if we picked different statistics as features/columns. For this case, I decided to focus only on the seven statistics from above. This means that the ten most similar players to Pogba/Martinelli/Trossard are similar in terms of total minutes played, appearances, goals, assists, average playing time, and red/yellow cards.

The code for the analysis can be found here
