Finding Player Similarity Based on Skill Ratings in EA Sports FC

Published in INST414: Data Science Techniques · 7 min read · Apr 19, 2024

EA Sports FC (EA FC), formerly known as EA Sports FIFA, is one of the most-played soccer video games. The game offers several modes, including quick play, career mode, and Ultimate Team, which is the mode this analysis applies to best. Players in EA FC are given ratings at the beginning of the season based on their skills and success, and these ratings can fluctuate week to week depending on how they perform. In Ultimate Team, users build their own team with virtual money that they earn by playing games against other users and by completing challenges, so knowing players and their ratings is important. For a beginner, though, it can be hard to know who to buy, so a way to find similar players would be helpful. This is where my similarity analysis can help. The question it answers is: given a player and their skill ratings, who are the ten most similar players? The stakeholders for this question are the people who play EA FC. If someone wants to replace a player on their team, or add a similar one, this analysis gives them a way to make an informed decision.

The data that I need to answer this question is a recent snapshot of all of the players and their ratings in EA FC. The features that I need are each player's name, country, club/team, and all of their skill ratings. These ratings are measured on a 0–100 scale and cover things such as passing, shooting, agility, and strength. The more skill ratings we have, the more dimensions the similarity analysis has to compare, and the more detailed the comparison can be. This data is relevant to my question because it provides the numerical features that a similarity analysis requires.

The final dataset that I used came from a GitHub user who built a web scraper to collect the player ratings and information. The original data source is a website called sofifa.com, which has all of the player information that is used in the actual EA FC game. I needed a recent snapshot of the players' ratings, and this dataset includes data from March 20th, 2024, so it is still fairly recent and accurate. The features include categorical information like the player's name, country, club/team, position, and play styles. The numerical data includes more than 30 ratings of the players' abilities, such as dribbling, volleys, crossing, and reactions. The data came as a CSV file that can be loaded straight into a pandas DataFrame for analysis.

To measure similarity between players, I used cosine similarity. Cosine similarity treats each player's ratings as a vector and measures the angle between two vectors rather than the distance between them. This means that it is robust to differences in scale. This is important in player comparisons because there are players like Erling Haaland and Kylian Mbappe who have very high ratings. Instead of just finding other players with very high ratings, we can find players whose ratings are distributed across skills in a similar way to the given player's ratings. To do this, I used the cosine similarity functions from sklearn, which made the calculations easy to perform.
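To make the scale argument concrete, here is a small sketch. Cosine similarity between two rating vectors a and b is their dot product divided by the product of their lengths. The three vectors below are made-up ratings for four skills, not values from the dataset, and the variable names are only for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings for (passing, shooting, agility, strength); not from the dataset.
star = np.array([[90, 80, 70, 60]])
same_profile = star * 0.5                   # same skill balance, half the ratings
other_profile = np.array([[60, 70, 80, 90]])  # similar rating level, reversed balance

# Cosine similarity looks only at the direction of each vector.
cos_same = cosine_similarity(star, same_profile)[0, 0]
cos_other = cosine_similarity(star, other_profile)[0, 0]

# Euclidean distance also reacts to how high the ratings are.
dist_same = np.linalg.norm(star - same_profile)
dist_other = np.linalg.norm(star - other_profile)

print(f"cosine:    same profile {cos_same:.3f}, other profile {cos_other:.3f}")
print(f"euclidean: same profile {dist_same:.1f}, other profile {dist_other:.1f}")
```

In this toy case cosine similarity ranks the lower-rated player with the same skill balance as the closer match, while Euclidean distance prefers the player whose ratings are numerically nearer but balanced differently.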

It took several steps to clean this dataset before I could do the analysis on it. First, I had to drop all of the columns that I didn’t need. There were 33 columns that I didn’t need for the analysis, so I dropped those from the pandas DataFrame. These included features like pictures, logos, body type, and other variables that wouldn’t be useful to the analysis.

final_df = player_df.drop(columns=['full_name', 'version','description', 'image', 'potential', 'wage', 'preferred_foot', 'international_reputation', 'work_rate',
                                    'body_type', 'real_face', 'release_clause', 'specialities', 'club_id', 'club_logo', 'club_rating', 
                                    'club_kit_number', 'club_joined', 'club_contract_valid_until', 'country_id', 'country_league_id', 'country_league_name',
                                    'country_flag', 'country_rating', 'country_position', 'country_kit_number', 'play_styles', 'gk_diving', 'gk_handling', 'gk_kicking', 'club_position'])

Next, I changed some of the values in the player name and club columns. I dropped duplicates in the player name column and made all of the names lowercase to make them easier to search. I then created a list of accented and special characters that are used in other languages and replaced them with their corresponding English letters.

#make the player names lowercase to make it easier to search
final_df['name'] = final_df['name'].str.lower()

#drop rows with duplicate player names, keeping the first occurrence
final_df = final_df.drop_duplicates(subset=['name'])


#remove accented and other special characters from the players names. These were some of the most common ones I saw. 
mapping = {'á': 'a',
           'é': 'e',
           'í': 'i',
           'ó': 'o',
           'ú': 'u',
           'ñ': 'n',
           'ã': 'a',
           'ë': 'e',
           'ş': 's',
           'ă': 'a'}

final_df['name'] = final_df['name'].replace(mapping, regex=True)
final_df['club_name'] = final_df['club_name'].replace(mapping, regex=True)
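The mapping above only covers the characters listed in it, so names such as "marcelo brozović" and "enes ünal" keep their accents in the results. One possible alternative, which is not what I used in the original analysis, is to let Python's standard `unicodedata` module decompose accented characters and then drop the combining marks. The helper name `strip_accents` is my own for this sketch.

```python
import unicodedata

import pandas as pd


def strip_accents(text: str) -> str:
    """Split accented characters into base letter + mark, then drop the marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Names taken from the result tables in this article.
names = pd.Series(["marcelo brozović", "enes ünal", "marko arnautović"])
cleaned = names.map(strip_accents)
print(cleaned.tolist())
```

Letters that have no decomposed form, such as "ø" or "ł", pass through unchanged, so a small explicit mapping may still be needed for those.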

Lastly, I noticed that the web scraping process may have made mistakes in the gk_reflexes and gk_positioning columns. In some rows, these columns had pulled in text from the neighboring play_styles column instead of a number. So I filtered out the rows where these columns did not contain numbers, which dropped the goalkeepers with corrupted values.

final_df = final_df[pd.to_numeric(final_df['gk_positioning'], errors='coerce').notnull()]
final_df = final_df[pd.to_numeric(final_df['gk_reflexes'], errors='coerce').notnull()]
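To show what this filter does, here is a small sketch on a made-up frame. The rows and the text values are invented for illustration. Passing `errors='coerce'` to `pd.to_numeric` turns anything that cannot be parsed as a number into NaN, so those rows can then be dropped. This variant also keeps the converted numeric values, since the filter alone leaves the columns stored as strings.

```python
import pandas as pd

# Made-up rows: two of them have text where a rating should be.
toy = pd.DataFrame({
    "name": ["keeper a", "keeper b", "keeper c"],
    "gk_positioning": ["85", "Long Throw", "78"],
    "gk_reflexes": ["88", "80", "Rush Out"],
})
gk_cols = ["gk_positioning", "gk_reflexes"]

# Convert each column; unparseable text becomes NaN.
converted = {col: pd.to_numeric(toy[col], errors="coerce") for col in gk_cols}

# Replace the string columns with numeric ones, then drop rows with NaN in either.
clean = toy.assign(**converted).dropna(subset=gk_cols)
print(clean)
```

Only "keeper a" survives here, because each of the other two rows has a non-numeric value in one of the two columns.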

This gave me the final dataset that I needed to do the analysis. An example of this dataset can be seen below:

The first player I decided to analyze is Harry Maguire. He is an English defender who plays for the English club Manchester United. In most situations, people would play him in a center back role. Users may want to replace him or find a player with a similar play style. When I ran his name through the cosine similarity analysis code, I got the following players:

Top 10 Players Similar to harry maguire are:
                  name            club_name  overall_rating positions
584   joachim andersen       Crystal Palace              79        CB
1852     steven nzonzi            Konyaspor              74    CDM,CM
2128  ibrahima sissoko           Strasbourg              74    CM,CDM
353     danilo pereira  Paris Saint Germain              81    CB,CDM
467        david lopez               Girona              80    CB,CDM
2715  tiemoue bakayoko              Lorient              73    CDM,CM
471        oriol romeu         FC Barcelona              80    CDM,CM
823          eric dier    FC Bayern München              78        CB
158    guido rodriguez           Real Betis              83       CDM
340   william carvalho           Real Betis              81   CDM,CAM
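A lookup like the one that produced this list can be sketched as follows. This is a minimal version under my own assumptions, not necessarily the exact code in the linked repository: the function name `top_similar`, the `feature_cols` argument, and the toy data are all hypothetical.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def top_similar(df, player_name, feature_cols, n=10):
    """Return the n rows most cosine-similar to the named player."""
    matches = df.index[df["name"] == player_name.lower()]
    if len(matches) == 0:
        raise ValueError(f"player not found: {player_name}")
    target_idx = matches[0]

    # Compare the target's rating vector against every row at once.
    target = df.loc[[target_idx], feature_cols]
    scores = cosine_similarity(target, df[feature_cols])[0]

    # Attach the scores, remove the player themselves, and rank.
    ranked = df.assign(similarity=scores).drop(index=target_idx)
    return ranked.sort_values("similarity", ascending=False).head(n)


# Toy data with invented ratings, only to show the call.
toy = pd.DataFrame({
    "name": ["defender a", "defender b", "striker a", "striker b"],
    "tackling": [85, 70, 30, 25],
    "strength": [88, 72, 60, 55],
    "finishing": [35, 30, 90, 80],
})
result = top_similar(toy, "Defender A", ["tackling", "strength", "finishing"], n=2)
print(result[["name", "similarity"]])
```

The lookup lowercases the input because the names in the cleaned dataset are lowercase, and it drops the queried player from the ranking so they do not appear as their own best match.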

The second player I decided to analyze is Joshua Kimmich. He is a German midfielder who plays for the German club Bayern Munich. Generally, he plays in a central defensive midfielder role. When I ran his name through the code, I got the following players:

Top 10 Players Similar to joshua kimmich are:
                       name        club_name  overall_rating   positions
1889            pavel bucha       Cincinnati              74  CDM,CM,CAM
702       stephen eustaquio            Porto              78      CDM,CM
316     oleksandr zinchenko          Arsenal              81          LB
1698         nicolas raskin          Rangers              75      CM,CDM
119                    koke  Atletico Madrid              84      CDM,CM
165        marcelo brozović         Al Nassr              83      CDM,CM
944   kiernan dewsbury-hall   Leicester City              77          CM
139     alexis mac allister        Liverpool              83  CM,CAM,CDM
105         ismael bennacer            Milan              84  CDM,CM,CAM
110         rodrigo de paul  Atletico Madrid              84          CM

The third and last player I decided to analyze is Robert Lewandowski. He is a Polish striker who plays for the Spanish team Barcelona. He usually plays in an attacking striker role. The top ten most similar players that I found were:

Top 10 Players Similar to robert lewandowski are:
                     name           club_name  overall_rating positions
361   alexandre lacazette  Olympique Lyonnais              81        ST
423           andre silva       Real Sociedad              80        ST
487      marko arnautović               Inter              80        ST
322             enes ünal     AFC Bournemouth              81        ST
1373           danny ings     West Ham United              76        ST
1060        raul de tomas      Rayo Vallecano              77        ST
15          karim benzema          Al Ittihad              89     CF,ST
4              harry kane   FC Bayern München              90        ST
722        mergim berisha      TSG Hoffenheim              78     ST,CF
331       serhou guirassy       VfB Stuttgart              81        ST

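The three comparisons above show that the results track the role of the player being analyzed. Harry Maguire's matches are all center backs or defensive midfielders, nine of Joshua Kimmich's ten matches are central midfielders (the exception is left back Oleksandr Zinchenko), and every one of Robert Lewandowski's matches is a striker. The overall ratings within each list also vary widely, from 76 to 90 for Lewandowski's matches, which is consistent with cosine similarity comparing the shape of a player's skill profile rather than how high the ratings are.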

There were several limitations that came with this analysis. One of the biggest is that the goalkeeper data in this dataset was not accurate. It seems that when the website was scraped, several of the goalkeeper ratings were not captured correctly, so I filtered out the goalkeepers whose columns were wrong. In the future, it would be beneficial to have these goalkeepers included. The number of rows dropped was not significant, but every dropped row is a player who can no longer appear in the similarity results. The data I used is also a month old at this point, so the player ratings may have shifted. The differences wouldn’t be big, but they could still affect the analysis.

Github Link: https://github.com/ltwalsh/walshINST414Module3

Data Sources: https://github.com/prashantghimire/sofifa-web-scraper?tab=readme-ov-file and sofifa.com

Written by Luke Walsh