Discovering the Play-Style Doppelgängers of the World’s Greatest Soccer Players

Eitan Zavorin
INST414: Data Science Techniques
May 1, 2024

In the rapidly evolving world of soccer, identifying similar players provides valuable insights for coaches, talent scouts, club executives, and fans. By measuring similarities between players based on their play styles and skill attributes, we can address hotly debated questions about player comparisons and inform talent recruitment strategies. In this report, I walk through the process of building a model that pinpoints the players most similar to three selected individuals. Using cosine similarity calculations on a dataset of players and their attributes, I aim to uncover patterns that can inform decision-making in the world of soccer. Since one of the most contested questions among soccer fans is whether Lionel Messi or Cristiano Ronaldo is the best player in the world, I've chosen the following research question: With big names like Messi, Ronaldo, and Wayne Rooney stealing the spotlight over recent decades, who are the most comparable players that deserve more attention?

Before diving into my investigation, it’s crucial to understand my stakeholders so that I can bring the most value to them in the creation of my model and the discoveries I make along the way. One primary category of stakeholders invested in knowing player similarities is team recruitment staff and talent scouts. These professionals seek to proactively identify potential talent prospects for transfer or recruitment to improve team performance. Scouts can make educated decisions regarding transfer targets by accurately assessing player similarities, ensuring that new recruits complement pre-existing team dynamics. Additionally, coaches and team managers could benefit from insights into player similarities, utilizing them to plan training sessions, tactics, and formations based on player profiles and play styles. Lastly, fans will always be a stakeholder in any soccer-related investigation such as this one. Fans in soccer communities around the world could benefit from deepening their understanding of player performance and the game, enhancing their viewing experience, fostering engaging discussions about the sport, and even inspiring them to learn from the greats.

The dataset used in this analysis comprises the Player and Player_Attributes tables from Kaggle's public European Soccer Database. These tables contain detailed attribute ratings for soccer players, such as finishing, dribbling, ball control, sprint speed, passing, and more. Each record in the cleaned dataset represents a soccer player and includes numeric ratings that describe their play style and skill level. Using these attributes allows us to measure similarity across multiple dimensions at once.

To collect the dataset, I converted the SQLite database published on Kaggle into a CSV file. Using the pandas library, I loaded the data and reshaped it into a structured dataset suitable for analysis. I cleaned the data by removing duplicates, removing rows with significant missing values, and selecting only the quantitative attributes relevant to similarity measurement. In the end, I was left with a dataset of about 600 player rows, containing only the attributes needed for my investigation.
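The cleaning steps described above could be sketched roughly as follows. This is a minimal illustration, not the exact code behind the article: the function name `build_player_table`, the join key `player_api_id`, the `date` column, the choice to keep each player's most recent snapshot, and dropping every row with any missing feature are all assumptions made for the example.

```python
import pandas as pd


def build_player_table(players, attributes, feature_cols):
    """Return one row per player, indexed by name, with only numeric features.

    Assumptions (not confirmed by the article): both tables share a
    `player_api_id` key, and `attributes` has a sortable `date` column.
    """
    # Keep a single attribute snapshot per player (here: the most recent one),
    # which removes duplicate rows for the same player.
    latest = attributes.sort_values("date").drop_duplicates(
        subset="player_api_id", keep="last"
    )
    # Attach player names and keep only the selected quantitative attributes.
    merged = players[["player_api_id", "player_name"]].merge(
        latest[["player_api_id", *feature_cols]], on="player_api_id", how="inner"
    )
    # Drop players with any missing feature value (a stricter rule than
    # "significant missing values", chosen to keep the sketch simple).
    cleaned = merged.dropna(subset=feature_cols)
    return cleaned.set_index("player_name")[feature_cols]


# Hypothetical usage with CSV exports of the two tables (file names assumed):
# players = pd.read_csv("Player.csv")
# attributes = pd.read_csv("Player_Attributes.csv")
# table = build_player_table(players, attributes, ["finishing", "dribbling"])
```

Keeping one snapshot per player matters because similarity is computed between player vectors; several rows for the same player would otherwise show up as that player's own nearest neighbours.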

In my analysis, I measure the similarity between soccer players using cosine distance, which is simply one minus the cosine similarity: the smaller the distance between two players, the more alike their attribute profiles. Each player is represented as a vector of attributes (finishing, dribbling, ball control, and so on), and the cosine distance is computed between pairs of player vectors to quantify how similar they are.
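To make the calculation concrete, here is a small sketch of pairwise cosine similarity and a nearest-neighbour lookup written with NumPy and pandas. The helper names `cosine_similarity_matrix` and `most_similar` are hypothetical, the three toy players and their ratings are invented purely for illustration, and the code assumes player names in the index are unique.

```python
import numpy as np
import pandas as pd


def cosine_similarity_matrix(features: pd.DataFrame) -> pd.DataFrame:
    """Cosine similarity between every pair of rows (players)."""
    X = features.to_numpy(dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # guard against all-zero rows
    unit = X / norms  # scale each player vector to length 1
    sim = unit @ unit.T  # dot products of unit vectors = cosine similarity
    return pd.DataFrame(sim, index=features.index, columns=features.index)


def most_similar(features: pd.DataFrame, name: str, k: int = 10) -> pd.Series:
    """Top-k players ranked by cosine similarity to `name` (excluding `name`)."""
    sim = cosine_similarity_matrix(features)
    return sim.loc[name].drop(name).nlargest(k)


# Invented ratings for three made-up players, for illustration only.
toy = pd.DataFrame(
    {"finishing": [90, 88, 30], "dribbling": [95, 90, 40], "strength": [40, 45, 90]},
    index=["Player A", "Player B", "Player C"],
)
top = most_similar(toy, "Player A", k=2)
cosine_distance = 1.0 - top  # distance and similarity carry the same ranking
print(top)
```

In this toy table, Player B ranks ahead of Player C for Player A because B's ratings point in almost the same direction as A's, while C's profile is weighted toward strength. Ranking by largest similarity or by smallest distance gives the same order.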

For three selected players — Lionel Messi, Cristiano Ronaldo, and Wayne Rooney — I identify the top 10 most similar players in the dataset based on cosine distance scores on all their relevant play style attributes. By discovering players who exhibit similar skill profiles and playing styles, we gain insights into some very influential but possibly less appreciated players in the game. Here are my findings:

The analysis revealed interesting patterns of player similarity across various dimensions. For example, Ronaldo, known for his exceptional finishing and long-shot ability, shows high similarity scores with players who excel in the same attributes. Messi, renowned for his outstanding dribbling, ball control, passing, and vision, is closest to players with comparable playmaking abilities. Finally, Wayne Rooney's standout strength, aggression, and shooting pair him closely with players known for the same characteristics. These findings provide an answer to our research question, as well as valuable insights for talent scouts and team managers seeking players who match specific skill profiles that only some of the best players are known to have.

Despite the insights gained from my investigation, it's crucial to acknowledge the limitations and biases inherent in my approach. One limitation is the complete dependence on quantitative attributes, which may not capture the full spectrum of player abilities and characteristics. Some qualitative measures, such as attacking and defensive work rates (assigned as low, medium, or high), are important to take into account when comparing player profiles, yet they are not part of the similarity calculation. Additionally, the model's calculations may be biased towards skill profiles or playing styles that are prevalent in the dataset, possibly overlooking players with unique sets of attributes or unconventional play styles. Lastly, this data was last updated eight years ago, so it lacks updates to many player profiles and completely misses any players who have emerged since then. Regardless, the same model would work on an up-to-date dataset, and the insights that stakeholders can gather from it remain valid and valuable.
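As one possible way to address the work-rate limitation, the low/medium/high labels could be mapped onto a numeric scale so they can sit alongside the other attributes. This sketch is not part of the analysis above: the column names `attacking_work_rate` and `defensive_work_rate`, the helper `encode_work_rates`, and the 0/50/100 scale are assumptions, and the choice of scale would directly affect how much weight work rate carries in the cosine calculation.

```python
import pandas as pd

# Assumed mapping onto a 0-100 scale, roughly matching the range of a rating.
WORK_RATE_SCALE = {"low": 0.0, "medium": 50.0, "high": 100.0}


def encode_work_rates(
    df: pd.DataFrame,
    columns=("attacking_work_rate", "defensive_work_rate"),
) -> pd.DataFrame:
    """Return a copy of `df` with work-rate labels replaced by numbers.

    Labels outside low/medium/high become NaN so they can be dropped
    or imputed before computing similarities.
    """
    out = df.copy()
    for col in columns:
        out[col] = out[col].str.lower().map(WORK_RATE_SCALE)
    return out
```

Rows whose work-rate label is unrecognised end up as NaN and would need to be handled before the vectors are compared.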

In conclusion, my analysis demonstrates the power of data-driven approaches in uncovering player similarities and informing decision-making across the world of soccer. By applying cosine distance calculations to a dataset of player attributes, we can gain valuable insights into player comparisons and talent recruitment strategies. Moving forward, additional research and refinement of the model could improve its accuracy, relevance, and applicability, ultimately empowering stakeholders to make better-informed decisions in the dynamic world of soccer.

GitHub Repository Link: https://github.com/eitanzav/INST414Module3Assignment
