Cluster analysis on FIFA 23 players (K-Means Clustering)

Luis Alvaro
INST414: Data Science Techniques
Apr 1, 2023

EA SPORTS FIFA is undoubtedly one of the biggest and most popular games in the world. Much of its massive success comes from the sport itself, since soccer is by far the most popular sport globally. The game's constant yearly improvements and realism keep fans eagerly awaiting each new release.

What also makes FIFA interesting is that we can see each player's stats (though these ratings can sometimes be controversial). This analysis therefore aims to find soccer players with similar skill sets or features in FIFA 23. The method we will use is K-Means clustering, with Python as our tool and the following packages/libraries: sklearn, pandas, and matplotlib.

Data Collection/Cleaning

Before starting the clustering analysis, we need to download a pre-existing dataset from Kaggle, which provides information about each player's age, wage, position, overall and potential ratings, league/club name, and many other features across the editions from FIFA 15 through FIFA 23.

Once we download the dataset, we read it with pandas. The dataset contains 110 columns, but we want to focus on a few relevant features: age, overall, potential, wage_eur, and value_eur. After filtering to the columns of interest, we checked for missing data: applying the isnull() function to the subsetted data frame and counting the nulls in each column gave the following output:
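The subset-and-count step can be sketched as follows. The column names match the Kaggle dataset, but the frame here is a tiny synthetic stand-in so the snippet runs on its own; in the real analysis you would start from pd.read_csv() on the downloaded file.

```python
import pandas as pd

# Tiny synthetic stand-in for the Kaggle frame (the real one has 110 columns;
# in practice you would load it with pd.read_csv() instead).
data = pd.DataFrame({
    "short_name": ["A", "B", "C"],
    "age": [24, 31, 19],
    "overall": [85, 78, 70],
    "potential": [90, 78, 84],
    "wage_eur": [100000.0, None, 5000.0],
    "value_eur": [50e6, 2e6, None],
    "league_name": ["X", "Y", "Z"],  # one of the many columns we drop
})

# Keep only the five features of interest
features = ["age", "overall", "potential", "wage_eur", "value_eur"]
subset = data[features]

# Count missing values per column
null_counts = subset.isnull().sum()
print(null_counts)
```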

We see that 89 rows are missing data; for whatever reason, some players have no wage or value information, so we decided to drop those rows. After some further subsetting and re-indexing, we obtained our final matrix. Beyond that, the data posed no other issues.
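Dropping the incomplete rows is a one-liner with pandas. A small sketch, again on a toy frame with the same missing-wage pattern:

```python
import pandas as pd

# Toy feature subset with one row missing wage information
subset = pd.DataFrame({
    "age": [24, 31, 19],
    "overall": [85, 78, 70],
    "potential": [90, 78, 84],
    "wage_eur": [100000.0, None, 5000.0],
    "value_eur": [50e6, 2e6, 1e6],
})

# Drop any row with a missing value, then rebuild a clean 0..n-1 index
clean = subset.dropna().reset_index(drop=True)
print(clean)
```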

Clustering

Notice that the 'value_eur' and 'wage_eur' columns have very large values relative to the other columns we are working with. This is a problem because those features would dominate the distance calculation and be weighted more heavily in our clustering. We want every feature to be treated equally, so we rescale the matrix with min/max normalization: (data.subtract(data.min()).divide(data.max()-data.min())*9) + 1 rescales each column to a 1–10 range.
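The rescaling expression maps each column's minimum to 1 and its maximum to 10, with everything else interpolated linearly in between. A quick check on small numbers:

```python
import pandas as pd

# Three players; note how value_eur dwarfs the other columns before scaling
data = pd.DataFrame({
    "age": [19, 25, 31],
    "overall": [70, 80, 90],
    "value_eur": [1e6, 10e6, 100e6],
})

# Min/max normalization to [0, 1], then stretched onto a 1-10 scale
scaled = data.subtract(data.min()).divide(data.max() - data.min()) * 9 + 1
print(scaled)
```

After scaling, every column spans exactly 1 to 10, so no single feature dominates the Euclidean distances K-Means relies on.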

Since we are applying K-Means clustering, the similarity metric is Euclidean distance. We will run the clustering with sklearn and choose K=5. We are clustering players on the 5 features mentioned above, so as a first guess we match the number of clusters to the number of features. After creating our model and fitting it to our data, we get the following distribution of players across clusters.
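Fitting the model and tallying cluster sizes looks like this. The feature matrix here is random data on the same 1–10 scale, standing in for the real scaled player matrix:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled player matrix: 300 "players", 5 features
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.uniform(1, 10, size=(300, 5)),
    columns=["age", "overall", "potential", "wage_eur", "value_eur"],
)

# K-Means with K=5; n_init controls how many random restarts sklearn tries
km = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = km.fit_predict(X)

# Distribution of players per cluster
counts = pd.Series(labels).value_counts().sort_index()
print(counts)
```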

We can see that cluster 2 has the most players, followed by cluster 5, while cluster 3 has the fewest. However, what does this really tell us about the data? Generating the centroid of each cluster lets us better understand the underlying patterns in each cluster and its players. The matrix below shows each cluster's average value per feature. Cluster 0 holds older players who have generally reached their potential. Clusters 1 and 2 are similar, containing younger players who haven't yet reached their potential; however, cluster 1 appears to have players with better overall ratings, higher potential, and greater market/wage value. Cluster 2 seems to be where the star players lie: players in their prime who make the most money and sit very close to their potential rating. Lastly, cluster 4 seems to hold the average players, whose overall and potential ratings are close together and who are also older.
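The centroid matrix comes straight from the fitted model's cluster_centers_ attribute: each row is a cluster, each column the average (scaled) feature value of its members. A sketch on synthetic data:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

features = ["age", "overall", "potential", "wage_eur", "value_eur"]

# Synthetic stand-in for the scaled player matrix
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.uniform(1, 10, size=(200, 5)), columns=features)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# One row per cluster: the average feature values of its players
centroids = pd.DataFrame(km.cluster_centers_, columns=features)
print(centroids.round(2))
```

Reading across a row tells you what a "typical" player of that cluster looks like, which is exactly how the age/overall/potential interpretations above are derived.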

Overall, the main limitation of this analysis is that we picked k=5 as a best guess based on the number of features. This may not be the optimal number of clusters for this data; to find that, we would need to apply the elbow method, but for simplicity we analyzed with k=5. Furthermore, bias could influence our clustering, as we did not account for potential outliers in the dataset.
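For reference, the elbow method mentioned above amounts to fitting K-Means for a range of k values and plotting the inertia (within-cluster sum of squares); the k where the curve stops dropping sharply is a reasonable choice. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled feature matrix
rng = np.random.default_rng(2)
X = rng.uniform(1, 10, size=(200, 5))

# Inertia for k = 1..8; with matplotlib you would plot k against inertia
# and look for the "elbow" where the curve flattens
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 9)
]
print(inertias)
```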

You can find the code for K-Means clustering here.
