Finding Player Similarity Based on Skill Ratings in EA Sports FC

Luke Walsh
INST414: Data Science Techniques
Apr 19, 2024

EA Sports FC (EA FC), formerly known as EA Sports FIFA, is one of the most-played soccer video games. It offers a variety of game modes, including quick play, career mode, and ultimate team. Ultimate team is the mode this analysis applies to best. Players in EA FC are given ratings based on their skills and success at the beginning of the season, and these ratings can fluctuate depending on how they do week by week. In ultimate team, users build their own team with virtual money that they earn by playing games against other users and by completing challenges, so knowing players and their ratings is important.

For a beginner, it can be hard to know who to buy, so having a way to find similar players would be helpful. This is where my similarity analysis comes in. The question it answers is: given a player and their skill ratings, what are the top ten most similar players to that player? The stakeholders for this question are players of EA FC. If a person wants to replace a player on their team or add a similar one, this analysis gives them a way to make an informed decision.

The data that I need to answer this question is a recent snapshot of all of the players and their ratings in EA FC. The features that I need are each player's name, country, club/team, and all of their skill ratings. These ratings are measured from 0 to 100 and cover things such as passing, shooting, agility, and strength. The more skill ratings we have, the more detail the similarity analysis has to compare. This data is relevant to my question because it provides the numerical data needed to do a similarity analysis.

The final dataset that I used came from a GitHub user who wrote a web scraper to collect the player ratings and information. The original data source is a website called sofifa.com, which has all of the player information that is used in the actual EA FC game. I needed a recent snapshot of the players' ratings, and this dataset includes data from March 20th, 2024, so it is still fairly recent and accurate. The features included categorical information like each player's name, country, club/team, position, and play styles. The numerical data included more than 30 ratings of the players' abilities, such as dribbling, volleys, crossing, and reactions. The data came as a CSV file that can be loaded straight into a pandas DataFrame for analysis.
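As a rough illustration of that loading step, here is a minimal sketch. The three rows and the column subset are made up so the snippet runs on its own; with the real file you would pass its path to `pd.read_csv` instead of the in-memory text.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the scraped CSV; the real file has
# many more columns and rows.
sample_csv = io.StringIO(
    "name,club_name,positions,overall_rating,dribbling,volleys\n"
    "Player A,Club X,ST,85,80,82\n"
    "Player B,Club Y,CB,78,55,40\n"
    "Player C,Club Z,CM,81,79,70\n"
)

player_df = pd.read_csv(sample_csv)

# Quick checks worth running on any fresh snapshot.
print(player_df.shape)
print(player_df.dtypes)
```

Checking the dtypes early is useful because a rating column that loads as text usually means some values in it are not numbers.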

To measure similarity between players, I used cosine similarity. Cosine similarity compares the angle between two players' rating vectors rather than the distance between them, which makes it insensitive to the overall size of the ratings. This is important in player comparisons because there are players like Erling Haaland and Kylian Mbappe, who have very high ratings. Instead of just finding other players with very high ratings, we can find players whose skills are distributed in a similar way, even if their ratings are lower across the board. To do this, I used the cosine similarity function from sklearn, which made the calculations easy to perform.
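To make the scale argument concrete, here is a small sketch using sklearn's `cosine_similarity`. The cosine similarity of two vectors is their dot product divided by the product of their lengths. The four ratings below are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up ratings (passing, shooting, agility, strength), illustration only.
target = np.array([[80, 90, 70, 85]])

# Same skill profile, but every rating is 20% lower.
same_shape_lower = target * 0.8

# Same set of numbers, assigned to different skills.
different_profile = np.array([[85, 70, 90, 80]])

sim_scaled = cosine_similarity(target, same_shape_lower)[0, 0]
sim_other = cosine_similarity(target, different_profile)[0, 0]

print(f"scaled-down copy:  {sim_scaled:.4f}")
print(f"different profile: {sim_other:.4f}")
```

The scaled-down copy scores higher than the player with a different profile, even though the second player's ratings are numerically closer to the target. Because ratings are never negative, all scores fall between 0 and 1, so the ranking is more informative than the raw value.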

It took several steps to clean this dataset before I could do the analysis on it. First, I had to drop all of the columns that I didn’t need. There were 33 columns that I didn’t need for the analysis, so I dropped those from the pandas DataFrame. These included pictures, logos, body type, and other variables that wouldn’t be useful to the analysis.

# Drop the columns that are not needed for the similarity analysis
final_df = player_df.drop(columns=[
    'full_name', 'version', 'description', 'image', 'potential', 'wage', 'preferred_foot',
    'international_reputation', 'work_rate', 'body_type', 'real_face', 'release_clause', 'specialities',
    'club_id', 'club_logo', 'club_rating', 'club_kit_number', 'club_joined', 'club_contract_valid_until',
    'country_id', 'country_league_id', 'country_league_name', 'country_flag', 'country_rating', 'country_position',
    'country_kit_number', 'play_styles', 'gk_diving', 'gk_handling', 'gk_kicking', 'club_position',
])

Next, I changed some of the values in the player name and club columns. I dropped duplicates in the player name column and made all of the names lowercase to make them easier to search. I then created a list of accented and special characters that are used in other languages and replaced them with their corresponding English letters.

# Make the player names lowercase so they are easier to search
final_df['name'] = final_df['name'].str.lower()

# Keep one row per player name. drop_duplicates returns a new DataFrame,
# so the result has to be assigned back.
final_df = final_df.drop_duplicates(subset=['name'])


# Remove accented and other special characters from the player and club names.
# These were some of the most common ones I saw.
mapping = {
    'á': 'a',
    'é': 'e',
    'í': 'i',
    'ó': 'o',
    'ú': 'u',
    'ñ': 'n',
    'ã': 'a',
    'ë': 'e',
    'ş': 's',
    'ă': 'a',
}

# regex=True makes replace() match inside each string instead of whole values
final_df['name'] = final_df['name'].replace(mapping, regex=True)
final_df['club_name'] = final_df['club_name'].replace(mapping, regex=True)
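A hand-written mapping only covers the characters it lists, so letters such as ć or ü pass through unchanged. One alternative, sketched here rather than taken from the original analysis, is Unicode normalization from Python's standard library. The helper name `strip_accents` and the sample names are illustrative.

```python
import unicodedata
import pandas as pd

def strip_accents(text: str) -> str:
    """Decompose accented letters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Illustrative names containing characters both inside and outside the mapping.
names = pd.Series(["marcelo brozović", "enes ünal", "raúl de tomás"])
cleaned = names.map(strip_accents)
print(cleaned.tolist())
```

This handles any letter that decomposes into a base letter plus an accent mark. Letters with no such decomposition, for example ø or ß, are left as they are and would still need an explicit mapping.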

Lastly, I noticed that the web scraping process may have made mistakes in the gk_reflexes and gk_positioning columns. Some rows in these columns contained text pulled from the neighboring play_styles column instead of a numeric rating. So I filtered out the rows whose values in these columns were not numbers, which dropped the goalkeepers with corrupted values.

final_df = final_df[pd.to_numeric(final_df['gk_positioning'], errors='coerce').notnull()]
final_df = final_df[pd.to_numeric(final_df['gk_reflexes'], errors='coerce').notnull()]
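The filter above relies on `pd.to_numeric(..., errors='coerce')` turning anything that is not a number into NaN. The sketch below shows the same idea on made-up rows and also stores the converted values, since a column that once held text keeps its text dtype after filtering. The column list `gk_cols` and the sample data are assumptions for illustration.

```python
import pandas as pd

# Made-up rows: the second one mimics a scrape error where text from a
# neighbouring column landed in a rating column.
df = pd.DataFrame({
    "name": ["keeper a", "keeper b", "striker c"],
    "gk_positioning": ["78", "Long Throw", "12"],
    "gk_reflexes": ["81", "70", "9"],
})

gk_cols = ["gk_positioning", "gk_reflexes"]

# Convert each column; anything non-numeric becomes NaN.
numeric = df[gk_cols].apply(pd.to_numeric, errors="coerce")

# Keep rows where every goalkeeper column parsed, and store the parsed values.
valid = numeric.notna().all(axis=1)
clean = df.loc[valid].copy()
clean[gk_cols] = numeric.loc[valid]

print(clean)
```

Storing the parsed numbers matters later, because similarity calculations expect numeric input rather than strings.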

This gave me the final dataset that I needed to do the analysis. An example of this dataset can be seen below:

The first player I decided to analyze is Harry Maguire. He is an English defender who plays for the English team Manchester United. In most situations, people would play him in a center back role. Users may want to replace him or find a player similar to his play style. When I ran his name through the cosine similarity analysis code, I got the following players:

Top 10 Players Similar to harry maguire are:
      name                   club_name            overall_rating  positions
584   joachim andersen       Crystal Palace       79              CB
1852  steven nzonzi          Konyaspor            74              CDM,CM
2128  ibrahima sissoko       Strasbourg           74              CM,CDM
353   danilo pereira         Paris Saint Germain  81              CB,CDM
467   david lopez            Girona               80              CB,CDM
2715  tiemoue bakayoko       Lorient              73              CDM,CM
471   oriol romeu            FC Barcelona         80              CDM,CM
823   eric dier              FC Bayern München    78              CB
158   guido rodriguez        Real Betis           83              CDM
340   william carvalho       Real Betis           81              CDM,CAM
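A lookup that produces a list like the one above could be sketched as follows. This is a minimal version under stated assumptions, not the code from the linked repository: the function name `top_similar`, its parameters, and the toy data are all hypothetical, and it assumes names are stored in lowercase in a `name` column.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def top_similar(df, player_name, rating_cols, k=10):
    """Return the k rows whose rating vectors are most similar to player_name's."""
    matches = df.index[df["name"] == player_name.lower()]
    if len(matches) == 0:
        raise ValueError(f"No player named {player_name!r}")
    target_idx = matches[0]

    ratings = df[rating_cols].astype(float)
    # One row of similarities: the target against every player.
    sims = cosine_similarity(ratings.loc[[target_idx]], ratings)[0]

    ranked = (
        df.assign(similarity=sims)
          .drop(index=target_idx)  # a player always matches themselves
          .sort_values("similarity", ascending=False)
    )
    return ranked.head(k)

# Made-up mini dataset for illustration.
toy = pd.DataFrame({
    "name": ["defender a", "defender b", "striker c", "striker d"],
    "shooting": [40, 44, 88, 90],
    "tackling": [85, 80, 30, 35],
    "strength": [86, 82, 70, 75],
})

result = top_similar(toy, "Defender A", ["shooting", "tackling", "strength"], k=2)
print(result[["name", "similarity"]])
```

Dropping the target row before ranking keeps the queried player out of their own top ten, since their similarity to themselves is always the highest.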

The second player I decided to analyze is Joshua Kimmich. He is a German midfielder who plays for the German team Bayern Munich. Generally, he plays in a central defensive midfielder role. When I ran his name through the code, it returned the following players:

Top 10 Players Similar to joshua kimmich are:
      name                   club_name            overall_rating  positions
1889  pavel bucha            Cincinnati           74              CDM,CM,CAM
702   stephen eustaquio      Porto                78              CDM,CM
316   oleksandr zinchenko    Arsenal              81              LB
1698  nicolas raskin         Rangers              75              CM,CDM
119   koke                   Atletico Madrid      84              CDM,CM
165   marcelo brozović       Al Nassr             83              CDM,CM
944   kiernan dewsbury-hall  Leicester City       77              CM
139   alexis mac allister    Liverpool            83              CM,CAM,CDM
105   ismael bennacer        Milan                84              CDM,CM,CAM
110   rodrigo de paul        Atletico Madrid      84              CM

The third and last player I decided to analyze is Robert Lewandowski. He is a Polish striker who plays for the Spanish team Barcelona. He usually plays in an attacking striker role. The top ten most similar players that I found were:

Top 10 Players Similar to robert lewandowski are:
      name                   club_name            overall_rating  positions
361   alexandre lacazette    Olympique Lyonnais   81              ST
423   andre silva            Real Sociedad        80              ST
487   marko arnautović       Inter                80              ST
322   enes ünal              AFC Bournemouth      81              ST
1373  danny ings             West Ham United      76              ST
1060  raul de tomas          Rayo Vallecano       77              ST
15    karim benzema          Al Ittihad           89              CF,ST
4     harry kane             FC Bayern München    90              ST
722   mergim berisha         TSG Hoffenheim       78              ST,CF
331   serhou guirassy        VfB Stuttgart        81              ST

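In each of the comparisons above, the results track the role of the player being analyzed. Harry Maguire's matches are all center backs or defensive midfielders. Nine of Joshua Kimmich's ten matches are central or defensive midfielders, and the exception, Oleksandr Zinchenko, is listed as a left back. Every one of Robert Lewandowski's matches is a striker. The overall ratings within a list also vary widely. Lewandowski's matches range from 76 to 90, which is what we would expect from a measure that compares the shape of a player's ratings rather than their overall level.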

There were several limitations that came with this analysis. One of the biggest is that the goalkeeper data in this dataset was not accurate. It seemed that when the website was scraped, several of the goalkeeper ratings were not captured correctly, so I filtered out the goalkeepers whose columns were wrong. In the future, it would be beneficial to have these goalkeepers included. There weren’t a significant number of rows dropped, but every dropped row is a player who can no longer appear in the similarity results. The data I used is also a month old at this point, so the player ratings may have shifted. The differences wouldn’t be big, but they could affect the analysis.

GitHub Link: https://github.com/ltwalsh/walshINST414Module3

Data Sources: https://github.com/prashantghimire/sofifa-web-scraper?tab=readme-ov-file and sofifa.com
