Using Classification to Predict the Cluster of EA Sports FC Players

Luke Walsh
INST414: Data Science Techniques
12 min read · May 14, 2024

The intersection of sports analytics and data science lets us ask and answer fascinating questions with data. In soccer, it lets us investigate patterns that are hard to see just by looking at the raw numbers. Using player skill ratings taken from the EA Sports FC soccer video game, we can find which players are similar to each other and group them into clusters. This analysis answers two questions. First: can clear clusters of players emerge from a comparison of player skill ratings? Second: given these clusters, can we predict which cluster a player is in using classification?

The main stakeholders of the first question are players of the game who are playing one of the game’s several team building modes. In these game modes, users attempt to build the best team possible with the best ratings. With these clusters, users will be able to make informed decisions about which players to put into their team. Each cluster should represent a position or skill level of the player, so users should be able to choose a player from one of these clusters based on their needs.

The second question will be helpful to developers who are adding new players to the game or building the next EA Sports FC game. The game places players into different play styles, and being able to predict which style a player belongs to could reduce the effort needed to develop the game. It could also help developers decide which clusters to use and how to assign players to them. Users benefit as well: if a player is added to the game, or if their rating increases, users could see which cluster that player belongs to.

Answering these questions requires a dataset that contains player names and their skill ratings. In EAFC, players are rated from 0 to 100 on many different attributes, including passing, shooting, and speed. A new game is released every year, so players get new ratings yearly. Most active players around the world are given ratings, and these ratings are associated with a player’s card, which looks like this:

We only see six ratings on the card, but there are many more skills that are rated, and we would need these ratings to be able to do a thorough comparison. Having all of these rated attributes would answer our question because with these numbers, we will be able to create clusters of players. These clusters will then show us if there is any pattern or common trait between the players in each different cluster. After we assign these clusters to the players, this should give us what we need to perform classification and see how well a model can predict what cluster a player is part of.

The data that fulfilled these requirements came from GitHub. I was able to find someone who had created a web scraper that pulled all of the players’ information from a website listing each player’s ratings and descriptions. This dataset included thousands of players and over 30 different ratings for each of them. These ratings included attributes like dribbling, sprint speed, passing, and stamina. It also had the player’s team, home country, league, and several other categorical variables. These categorical variables would not be helpful in the cluster analysis or classification, but they become helpful in the end when we try to figure out what each cluster represents. This dataset was last updated on March 20th, 2024, so it is fairly recent and accurate to how players are rated. All of this information was captured in a CSV file that would be used for the analysis.

Clustering

To measure the similarity between points, I will be using KMeans clustering. This type of clustering assigns each data point to a cluster based on its distance to the nearest centroid, the center point of a cluster. The centroids are then recomputed and the assignment repeats until the clusters settle. The number of clusters is set by the k value, which you choose in advance.
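The assignment step can be illustrated with a small, dependency-free sketch. The two-attribute players and the centroid positions below are made-up values for illustration, not figures from the dataset, and the helper name `nearest_centroid` is hypothetical.

```python
import math


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


# Hypothetical (shooting, tackling) ratings, invented for illustration.
players = [(85, 30), (80, 35), (40, 82), (35, 88)]
# Assumed cluster centers: one attack-minded, one defense-minded.
centroids = [(82, 32), (38, 85)]

labels = [nearest_centroid(p, centroids) for p in players]
print(labels)  # [0, 0, 1, 1]
```

A full k-means run would alternate this assignment with recomputing each centroid as the mean of its assigned points.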

To find this k value, I decided to use the elbow method. With this method, we fit the clustering for a range of k values and plot how tightly the points sit around their centroids for each one. The curve drops steeply at first and then flattens out, and the bend, or elbow, between the two marks a reasonable k value. The elbow graph that I created with this data can be seen below:

As you can see, there is a defined elbow in the graph starting around a k value of 4. In this case, I chose a k value of 6, slightly past the start of the elbow, which gives us 6 clusters of players. The hope is that 6 clusters will give us the most accurate representation of similar groups of players.
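As a rough sketch of how an elbow curve like this can be computed, the snippet below fits k-means for k from 1 to 10 and records the inertia (the sum of squared distances from each point to its centroid). It assumes scikit-learn and uses synthetic blobs as a stand-in for the player ratings, so the numbers it prints are not the ones behind the graph above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the ratings table: 600 "players", 8 "attributes".
X, _ = make_blobs(n_samples=600, n_features=8, centers=6, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Fit one model per candidate k and keep its inertia.
k_values = list(range(1, 11))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(f"k={k:2d}  inertia={inertia:10.1f}")
```

Plotting `inertias` against `k_values` gives the elbow graph; the k where the drop levels off is the candidate value.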

Our next step is to do the actual clustering. First, I scaled the data so that every attribute is on a standard scale. Then, I set the number of clusters to 6 and fit the KMeans model on the scaled data. Finally, I created the cluster labels and attached them to their corresponding rows. Now we have the full dataset with each player’s cluster. But this doesn’t show us much on its own. To get a real look at the clusters, I took the average of all players’ ratings in each cluster for a few different attributes, including overall rating, attacking skills, and defending skills. Below, you can see the clusters and their average skill values:
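The steps in this paragraph might look roughly like the following sketch. It assumes pandas and scikit-learn, and it runs on synthetic data with hypothetical column names rather than the scraped ratings, so it shows the shape of the workflow and not the article's actual table.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the column names are hypothetical.
columns = ["overall", "finishing", "shot_power", "tackling", "gk_reflexes"]
X, _ = make_blobs(n_samples=600, n_features=len(columns), centers=6, random_state=0)
players = pd.DataFrame(X, columns=columns)

# 1. Put every attribute on a standard scale (mean 0, variance 1).
X_scaled = StandardScaler().fit_transform(players[columns])

# 2. Fit k-means with 6 clusters on the scaled data.
kmeans = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X_scaled)

# 3. Attach each row's cluster label to the original table.
players["cluster"] = kmeans.labels_

# 4. Average the unscaled attributes within each cluster.
cluster_means = players.groupby("cluster")[columns].mean().round(1)
print(cluster_means)
```

Averaging the unscaled columns keeps the summary in the familiar rating units, which makes the clusters easier to interpret.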

This gives us a good idea of what types of players may be in each cluster. But to have a better look, let’s look at the top five players based on overall rating in each cluster.
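One way to pull the top five players per cluster is sketched below with pandas. The player names, ratings, and cluster labels are invented for illustration, and the column names are assumptions.

```python
import pandas as pd

# Hypothetical labelled table; names, ratings, and clusters are invented.
players = pd.DataFrame({
    "name": [f"Player {i}" for i in range(1, 13)],
    "overall": [91, 88, 74, 69, 90, 85, 77, 72, 66, 80, 83, 70],
    "cluster": [0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1],
})

# Sort by overall rating, then keep the first five rows of each cluster.
top_five = (
    players.sort_values("overall", ascending=False)
    .groupby("cluster")
    .head(5)
    .sort_values(["cluster", "overall"], ascending=[True, False])
)
print(top_five)
```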

Cluster 1: Low value midfielders

If you’re just starting out, these low value players may be for you. Players in this cluster are average all around, with better offensive stats like shooting or shot power. The top players in this cluster play positions in the midfield and have low values compared to the rest of the clusters. These would be perfect if you are new and lack money, but need well rounded players.

Cluster 2: Average Defensive Players

Players in this cluster have relatively strong defensive ratings, but offensive skills like finishing and shot power are low. This tells us that these are more defensive players. When we look at the top five players from this cluster, we see that four of them are center backs, and one is a goalkeeper. Other than the goalkeeper, who is high value, these players have relatively low values. Overall, this shows us that cluster 2 contains defensive players that may cost less than the powerhouse defensive players.

Cluster 3: Goalkeepers

Players in this cluster have high goalkeeper attributes. The main goalkeeper skills we used in this analysis were goalkeeper reflexes and goalkeeper positioning. When looking at the top players in this cluster, we see the best goalkeepers. These are high value players, but they are the best in the world. People like Alisson and Ederson are who you want guarding your goal.

Cluster 4: Defensive Stars

If you are looking for players with strong defensive skills, cluster 4 is where you want to look. These players showcase strong defensive awareness and tackling. They also boast powerful stamina and strength. If we look at the top five players in this cluster, we see people like Ruben Dias, Eder Militao, and Marquinhos, who are all known to be some of the best in the defensive half. These players have high value, so if you have money and want a player who can stop the ball, people from this cluster are your best bet.

Cluster 5: All around beasts

These players have strong offensive attributes and decently strong defensive attributes. The averages show us that this cluster has high overall ratings and strong ratings in almost all of the other skill areas. This means that these players are most likely some of the best players in the world. This is shown to be true when looking at the top five players. Players like Kylian Mbappe, Erling Haaland, and Kevin De Bruyne are three of the best players in the world. They have very high ratings in most attributes. These players are expensive, but can carry any team to a win.

Cluster 6: Average offensive players

If price is a problem for you, and you’re looking for an offensive player, this cluster may be for you. The players in this cluster have higher offensive skill ratings, but they cost less than the powerhouse players found in cluster 5. Attributes like shooting, shot power, and sprint speed are high. These players are less well rounded, so they come at a lower price.

Classification

With these assigned clusters, we can now perform classification to predict which cluster a player may be part of. This builds on the clustering methods from module 4 and uses the classification methods learned in module 6. The distinct clusters found above give us the perfect foundation for predicting the cluster that a player belongs to.

In this case, I will be using K-Nearest Neighbors (KNN) classification. In this model, we take a single point and look at its nearest neighbors. The number of neighbors considered is denoted by the letter k. Each of these neighbors has an assigned cluster label, and the most common label among them is assigned to the new point. I chose KNN because it makes no assumption about whether the data is linear. The soccer player ratings may not have a linear relationship, so dropping this assumption altogether is helpful. It also remained workable on our dataset, which is quite large.
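The voting rule described above fits in a few lines of plain Python. This is only an illustrative sketch: the function name `knn_predict`, the two-attribute points, and the labels are all made up, and a library implementation would normally be used on real data.

```python
import math
from collections import Counter


def knn_predict(query, points, labels, k):
    """Label `query` with the most common label among its k nearest points."""
    # Rank the labelled points by Euclidean distance to the query.
    ranked = sorted(zip(points, labels), key=lambda pair: math.dist(query, pair[0]))
    nearest_labels = [label for _, label in ranked[:k]]
    # Majority vote among the k nearest neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical (shooting, tackling) ratings with invented cluster labels.
points = [(85, 30), (80, 35), (78, 40), (40, 82), (35, 88), (45, 80)]
labels = ["attacker", "attacker", "attacker", "defender", "defender", "defender"]

print(knn_predict((75, 45), points, labels, k=3))  # attacker
```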

After I perform the classification, I will want to know how accurate the model is. To test this, I will use the model’s f-score. The f-score combines precision and recall into a single balanced measure of how the model performed. It ranges from 0 to 1, where values closer to 1 are better.
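To make the definition concrete, here is a small hand computation of the f-score for a single cluster, treating that cluster as the "positive" class. The labels are made up, and the helper `f_score` is hypothetical. With six clusters, the per-cluster scores have to be averaged into one number; which averaging was used for the result reported here is not stated, so that part is left out of the sketch.

```python
def f_score(y_true, y_pred, positive):
    """F1 for one class: the harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up true and predicted cluster labels for eight players.
y_true = [5, 5, 5, 5, 4, 4, 6, 6]
y_pred = [5, 5, 5, 4, 4, 4, 5, 6]

print(round(f_score(y_true, y_pred, positive=5), 3))  # 0.75
```

For cluster 5 there are three correct predictions, one false positive, and one false negative, so precision and recall are both 0.75 and so is the f-score.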

The code to run this classification and compute the f-score is shown below:
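A minimal sketch of such a workflow follows, assuming scikit-learn. It is not the article's original code: synthetic blobs stand in for the scaled ratings and cluster labels, and the train/test split and the "weighted" averaging of the f-score are assumptions.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: X plays the role of the scaled ratings,
# y the role of the k-means cluster labels (6 clusters).
X, y = make_blobs(n_samples=1200, n_features=8, centers=6,
                  cluster_std=2.0, random_state=0)

# Hold out a test set so the score reflects players the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit KNN with 13 neighbors and predict the held-out labels.
knn = KNeighborsClassifier(n_neighbors=13).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# "weighted" averaging across clusters is an assumption made for this sketch.
score = f1_score(y_test, y_pred, average="weighted")
print(f"f-score: {score:.2f}")
```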

After running the KNN classification, we can see that the f-score is 0.96. This means that the model is fairly accurate at predicting which cluster a player belongs to. To get this f-score, I had to adjust the model’s k value several times. At a k value of 13, the f-score flattened out around 0.96, so I decided to stay with 13. Given this f-score, we can assume that the model can mostly predict which cluster a player is part of. But with a dataset of thousands of players, even a small error rate adds up to a sizeable number of misclassified players. To understand these errors, let’s look at some examples of what the model predicted wrong.
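Both steps in this paragraph, trying several k values and pulling out the wrong predictions, can be sketched as follows. This assumes scikit-learn and NumPy and uses overlapping synthetic blobs, so the scores and rows it prints are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for scaled ratings (X) and cluster labels (y).
# A large cluster_std makes the clusters overlap so some errors can occur.
X, y = make_blobs(n_samples=1200, n_features=4, centers=6,
                  cluster_std=3.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Sweep odd k values and record the f-score for each.
scores = {}
for k in range(1, 22, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = f1_score(y_test, knn.predict(X_test), average="weighted")
    print(f"k={k:2d}  f-score={scores[k]:.3f}")

# Refit with the chosen k and collect the rows the model got wrong.
chosen_k = 13
knn = KNeighborsClassifier(n_neighbors=chosen_k).fit(X_train, y_train)
y_pred = knn.predict(X_test)
wrong = np.flatnonzero(y_pred != y_test)

# Show the first few misclassified rows: predicted vs. actual label.
for i in wrong[:3]:
    print(f"row {i}: predicted {y_pred[i]}, actual {y_test[i]}")
```

Choosing k on a separate validation split or with cross-validation, rather than on the test set, is the safer way to avoid an optimistic score.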

1. Predicted to be in cluster 4 (Defensive Stars), but was actually in cluster 5 (All Around Beasts).

In this example, we can see that this player has high skill ratings all around. This was most likely classified wrong because of the player’s highly rated defensive skills. Most players who don’t play defensive positions don’t have particularly good defensive skills. This player had high defensive awareness and tackling, so a case could be made to group them with the defensive players. But when you look at all of the other ratings, you see that this is a well rounded player with relatively high ratings. This means that they would most likely fit best in the all around beasts cluster.

2. Predicted to be in cluster 5 (All Around Beasts), but was actually in cluster 6 (Average Offensive Players).

When looking at the offensive skills for this example, we can see high ratings in most attributes. This would make them a good candidate for the all around beasts category. There are relatively high ratings that cover a range of types of skills. But when we look closer, we see some low attributes that make them stand out. This player has very low rated defensive stats and low ratings in things like strength, aggression, and interceptions. It makes sense that this player was predicted to be in cluster 5, but due to certain low ratings, they belong in the average offensive players category.

3. Predicted to be in cluster 5 (All Around Beasts), but was actually in cluster 1 (Low Value Midfielders).

In this example, we see that this player has a mix of high and low ratings. The high ratings are spread across different types of skills: physical ratings, passing, and shooting each have a few higher values. This means that the player could be seen as a good all around player. But the low ratings drag the player down too much to belong in the all around beasts category. They do possess good midfielder qualities, which include passing and ball control. These low ratings and adequate midfielder ratings make them a better fit for cluster 1 (Low Value Midfielders).

What we can gather from these examples is that there are some combinations of ratings that the model may classify wrong. In all of the examples, a case can be made for both clusters, but there are usually key features that the model misses or doesn’t consider. There are also players that could fit into more than one cluster, so in the end the model sometimes gets this choice wrong. Future models could work on reducing these classification errors.

Conclusion

These clusters did answer my first question. From this data, we were able to get clear clusters that showed distinct classes of players. This information could be beneficial to someone who may want to look for a certain type of player, but they aren’t sure where to start looking. Users of EA FC could use this in a game mode where they are building a team. Maybe they want a strong attacking player and average defensive player but they don’t know who to choose or where to look. They could look at these clusters and choose the players that work best for them.

My second question was also answered. Once we did the clustering, classification was successfully used to predict what cluster a player belonged to. We used K-Nearest Neighbors classification, which yielded an f-score of 0.96. This high f-score means that the model can accurately predict the cluster of a given player. Users or developers of the game could take new players in the game, predict which cluster they are in, and make in-game decisions based on this cluster. If a player gets a new, higher rating, will they change clusters or stay in the same one? This classification model could help with these questions.

The main limitation of this study was that not all of the goalkeeper data was accurate. When web scraping, it seems that some of the values for two of the goalkeeper attributes were wrong. In the end, I decided to drop these two attributes so I could still include the goalkeepers in the calculation. Even with them removed, I was able to get good results. But these features could have helped produce more accurate clusters, and their absence could also have affected the classification. If all of the data had been there, the model could have done a better job of predicting the players’ cluster labels.

GitHub Link: https://github.com/ltwalsh/INST414FinalProject

Data Sources: https://github.com/prashantghimire/sofifa-web-scraper?tab=readme-ov-file and sofifa.com
