Clustering EA Sports FC Players

Luke Walsh
INST414: Data Science Techniques
8 min read · Apr 26, 2024

The intersection of sports analytics and data science lets us ask and answer fascinating questions with data. In soccer, it lets us investigate patterns that are hard to see just by looking at the raw numbers. Using player skill ratings taken from the EA Sports FC soccer video game, we can find out which players are similar to each other and group them into clusters. The question we are aiming to answer is: can clear clusters of players emerge from a comparison of player skill ratings? The main stakeholders of this analysis are people playing one of the game’s several team-building modes. In these modes, users try to build the best team possible with the best ratings. With these clusters, users will be able to make informed decisions about which players to put on their team. Each cluster should represent a position or skill level, so users should be able to choose a player from one of these clusters based on their needs.

To answer this question, we need a dataset that contains player names and their skill ratings. In EAFC, players are rated 0–100 on many different attributes, including passing, shooting, and speed. Many more skills are rated, and we need these ratings to do a thorough comparison. Having all of these rated attributes lets us answer our question because the numbers are what we use to create clusters of players. The clusters will then show us whether there is any pattern or common trait among the players in each cluster.

The data that fulfilled these requirements came from GitHub. I found a web scraper someone had created that pulled all of the players’ information from a website listing each player’s rating numbers and descriptions. This dataset included thousands of players and over 30 different ratings for each of them. These ratings included attributes like dribbling, sprint speed, passing, and stamina. It also had the player’s team, home country, league, and several other categorical variables. These categorical variables would not be helpful in the cluster analysis, but they become helpful at the end when we try to figure out what each cluster represents. The dataset was last updated on March 20th, 2024, so it is fairly recent and reflects how players are currently rated. All of this information was captured in a CSV file that would be used for the analysis.

Several steps were taken to clean this data. First, we had to filter out all of the columns that weren’t needed. This included categorical information like player league, country, and skill sets. This left the information needed for the calculations and end analysis. There were also errors in a few of the goalkeeper columns, so these were removed to make the analysis more accurate. This finally gave us the dataset we needed to perform the cluster analysis.
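A minimal sketch of this kind of cleaning step in pandas is shown here. It uses a tiny made-up table in place of the real CSV, and every column name (`league`, `country`, `gk_broken`, and so on) is a hypothetical stand-in, not one of the scraper's actual headers.

```python
import pandas as pd

# Toy stand-in for the scraped CSV; all column names and values are made up.
players = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "league": ["League X", "League Y", "League X"],
    "country": ["Country 1", "Country 2", "Country 3"],
    "overall": [80, 72, 85],
    "finishing": [78, 40, 15],
    "gk_reflexes": [10, 12, 88],
    "gk_broken": [None, "??", None],  # stands in for an unreliable goalkeeper column
})

# Columns assumed to be unnecessary for clustering (categorical or unreliable).
drop_cols = ["league", "country", "gk_broken"]
cleaned = players.drop(columns=drop_cols)

# The numeric rating columns are what a clustering step would consume.
rating_cols = cleaned.select_dtypes(include="number").columns.tolist()
print(rating_cols)
```

The player name is kept in `cleaned` so that clusters can be matched back to players later, while only the numeric columns are passed on for clustering.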

To group similar players, I will be using KMeans clustering, which measures similarity as the distance between points. This type of clustering assigns each data point to the cluster whose centroid, or center point, is nearest. The number of clusters is set by the k value, which is how many clusters you want to have.
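To make the idea of "nearest centroid" concrete, here is a small NumPy sketch of a single k-means iteration on four made-up 2D points. The function name `kmeans_step` and the data are illustrative assumptions; the analysis itself relied on a library implementation, not this code.

```python
import numpy as np

def kmeans_step(points, centroids):
    """One k-means iteration: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    # Euclidean distance from every point to every centroid, shape (n_points, k).
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    new_centroids = np.array([
        # Keep the old centroid if no point was assigned to it.
        points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(len(centroids))
    ])
    return labels, new_centroids

# Made-up data: two points near the origin, two near (10, 10).
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[1.0, 1.0], [9.0, 9.0]])

labels, centroids = kmeans_step(points, centroids)
print(labels)     # cluster index for each point
print(centroids)  # updated centroid positions
```

Repeating this step until the labels stop changing is the core loop of the algorithm.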

To find this k value, I decided to use the elbow method. With this method, we plot a measure of how tightly the clusters fit (in the usual version of the method, the within-cluster sum of squares) against a range of k values. The curve drops rapidly at first and then flattens, and the elbow where this change happens suggests the k value. The elbow graph that I created with this data can be seen below:
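For readers who want to produce a curve like this themselves, here is a minimal sketch using scikit-learn. It runs on synthetic blobs, not the player data, and the range of k values and the `n_init` and `random_state` settings are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled ratings: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

ks = list(range(1, 9))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squared distances to the centroids.
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting `ks` against `inertias` (for example with matplotlib) gives the elbow graph. On this toy data the steep drop ends at k = 3, because the data was generated with three blobs.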

As you can see, there is a defined elbow in the graph starting around a k value of 4. In this case, I chose a k value of 6, slightly past the start of the elbow. With this value, we will create 6 clusters of players. The hope is that 6 clusters will give us the most accurate representation of similar groups of players.

Our next step is to do the actual clustering. First, we scaled the data to put every rating on a standard scale. Then, we set the number of clusters to 6 and fit the clusters on the data with the KMeans method. Finally, we created the labels and attached them to their corresponding rows. Now we have the full dataset with each player’s cluster. But this doesn’t show us much. To get a real look at the clusters, I computed, for each cluster, the average of all of its players’ ratings for a few different attributes. The attributes included things like overall rating, attacking skills, and defending skills. Below, you can see the clusters and their average skill values:
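The steps just described (scale, fit, label, average) can be sketched as follows. This is a hedged illustration on randomly generated ratings, so the numbers it prints are not the results of this analysis. The column names and the use of `StandardScaler` are assumptions; the source only says the data was scaled.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in ratings; column names are hypothetical.
rng = np.random.default_rng(42)
n = 60
ratings = pd.DataFrame({
    "overall": rng.integers(50, 95, size=n),
    "finishing": rng.integers(20, 95, size=n),
    "defending": rng.integers(20, 95, size=n),
})

# 1. Put every rating on a standard scale (mean 0, variance 1).
X = StandardScaler().fit_transform(ratings)

# 2. Fit KMeans with 6 clusters and get a label for every row.
kmeans = KMeans(n_clusters=6, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# 3. Attach the labels to their corresponding rows.
ratings["cluster"] = labels

# 4. Average each attribute within each cluster.
cluster_means = ratings.groupby("cluster").mean().round(1)
print(cluster_means)
```

Scaling first matters because k-means is distance-based, so an attribute with a wider spread would otherwise dominate the clustering.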

This gives us a good idea of what types of players may be in each cluster. I also found the top five players in each cluster, which gives us examples of who is in each one. Below are the clusters, what they may show, and the top five players in each.
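One way to pull the top five players per cluster in pandas is sketched here. It assumes "top" means highest overall rating, which the analysis does not state explicitly, and it uses made-up names, ratings, and cluster labels.

```python
import pandas as pd

# Made-up players with cluster labels already attached.
clustered = pd.DataFrame({
    "name": [f"Player {i}" for i in range(12)],
    "overall": [70, 88, 65, 91, 77, 82, 60, 85, 79, 73, 90, 68],
    "cluster": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Sort by rating, keep the first five rows of each cluster,
# then order the result by cluster for readability.
top_five = (
    clustered.sort_values("overall", ascending=False)
    .groupby("cluster")
    .head(5)
    .sort_values(["cluster", "overall"], ascending=[True, False])
)
print(top_five)
```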

Cluster 1: Low-Value Midfielders

If you’re just starting out, these low value players may be for you. Players in this cluster are average all around, with better offensive stats like shooting or shot power. The top players in this cluster play positions in the midfield and have low values compared to the rest of the clusters. These would be perfect if you are new and lack money, but need well rounded players.

Cluster 2: Average Defensive Players

Players in this cluster have relatively strong defensive ratings, but offensive skills like finishing and shot power are low. This tells us that these are more defensive players. When we look at the top five players from this cluster, we see that four of them are center backs and one is a goalkeeper. Other than the goalkeeper, who is high value, these players have relatively low values. Overall, this shows us that the players in cluster 2 are defensive players that may cost less than the powerhouse defensive players.

Cluster 3: Goalkeepers

Players in this cluster have high goalkeeper attributes. The main goalkeeper skills we used in this analysis were goalkeeper reflexes and goalkeeper positioning. When looking at the top players in this cluster, we see the best goalkeepers. These are high value players, but they are the best in the world. People like Alisson and Ederson are who you want guarding your goal.

Cluster 4: Defensive Stars

If you are looking for players with strong defensive skills, cluster 4 is where you want to look. These players showcase strong defensive awareness and tackling. They also boast powerful stamina and strength. If we look at the top five players in this cluster, we see people like Ruben Dias, Eder Militao, and Marquinhos, who are all known to be some of the best in the defensive half. These players have high value, so if you have money and want a player who can stop the ball, people from this cluster are your best bet.

Cluster 5: All-Around Beasts

These players have strong offensive attributes and decently strong defensive attributes. The averages show us that this cluster has high overall ratings and strong ratings in almost all of the other skill areas. This means that these players are most likely some of the best players in the world. This is shown to be true when looking at the top five players. Players like Kylian Mbappe, Erling Haaland, and Kevin De Bruyne are three of the best players in the world. They have very high ratings in most attributes. These players are expensive, but can carry any team to a win.

Cluster 6: Average Offensive Players

If price is a problem for you, and you’re looking for an offensive player, this cluster may be for you. The players in this cluster have higher offensive skill ratings, but they cost less than the powerhouse players found in cluster 5. Attributes like shooting, shot power, and sprint speed are high. These players are less well rounded, so they come at a lower price.

These clusters did answer my main question. From this data, we were able to get clear clusters that showed distinct classes of players. This information could be beneficial to someone who may want to look for a certain type of player, but they aren’t sure where to start looking. Users of EA FC could use this in a game mode where they are building a team. Maybe they want a strong attacking player and average defensive player but they don’t know who to choose or where to look. They could look at these clusters and choose the players that work best for them.

The main limitation of this study was that not all of the goalkeeper data was accurate. It seems that during web scraping, some of the values for two of the goalkeeper attributes came out wrong. I decided to drop these two attributes so that the affected players could still be included in the calculation. In the end, I was still able to get a good result, but these features could have helped produce the most accurate clusters.

GitHub Link: https://github.com/ltwalsh/walshINST414module4

Data Sources: https://github.com/prashantghimire/sofifa-web-scraper?tab=readme-ov-file and sofifa.com
