K-means for player clustering

Oct 12, 2022

In this story, we are going to train an unsupervised machine learning algorithm called K-means.

K-means clustering is a method of vector quantization that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid).

In other words, K-means is able to find relationships in the data and group together observations with similar characteristics.
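To make the idea concrete, here is a minimal NumPy sketch of the classic K-means loop (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points. The function name `kmeans_sketch` and the toy data are illustrative assumptions; later in the article we use scikit-learn's `KMeans` rather than this code.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm (illustrative sketch, not scikit-learn's implementation)."""
    rng = np.random.default_rng(seed)
    # start from k distinct observations chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # squared distance from every point to every centroid, shape (n, k)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its points (keep it if the cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# two toy blobs of made-up points (not the player dataset)
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels_toy, centroids_toy = kmeans_sketch(X_toy, k=2)
print(centroids_toy)
```

The result depends on the random starting centroids, which is why the scikit-learn calls later in the article pass a `random_state`.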

We will use a dataset (‘FB_stats.xlsx’) that I scraped from Fbref with game variables of different Big 5 league players to train K-means and group players based on similarity factors.

We start by importing the necessary libraries and loading the Excel file (‘FB_stats.xlsx’).

# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_excel('/content/FB_stats.xlsx')
df.head()

Let’s see all the position names to filter the dataset per position.

In this case, I want to train the K-means algorithm just for full-backs.

Full-backs are classified as DF, but the problem is that center backs are classified as DF as well, so I can’t filter for just full backs.

We will filter the dataset on the positions ‘DF’ and ‘DFFW’ in order to keep defenders and more offensive defenders.

positions = ['DF','DFFW']
df = df[df['Positions'].isin(positions)]

For our clustering, we will only consider players with at least five 90s played (roughly 450 minutes), so let’s filter the dataset accordingly.


df = df[df['90s played'] >=5]

To simplify and not get an output with too many players, we will only analyze Premier League players.

After filtering for Premier League players only, we are left with 71 players to analyze with the K-means algorithm.

Next, let’s filter the columns that are of interest to us for clustering.

df_eng = df[df['Competition'] == 'eng Premier League']

df_eng = df_eng[['Player','Shot Creating Actions',
       'Goal Creating Actions', 'Aerial duels Won%', 'xG/-90', 'On-Off',
       'Touches Mid 3rd', 'Touches Att 3rd', 'Touches Attacking Area',
       'Dribbles Succ', 'Success dribble%', 'Player dribbled past', 'Carries',
       'Total Distance', 'Progressive Distance', 'Progressive Carries',
       'Carries into final 3rd', 'Carries into Penalti Area',
       'Progressive passes received', 'Tackled dribbled%', 'Dribbled Past',
       'Sw (+40m width pass)', 'Ground passes', 'Low passes', 'High passes',
       'Completed passes into space between defenders', 'Completion total%',
       'TotDistance', 'PrgDistance', 'Completion short passes%',
       'Completion medium passes %', 'Completion long passes%', 'Assists',
       'Key Passes', 'Passes Penalti Area', 'Crosses Penalti Area',
       'Progressive passes', '90s played', 'Shot on target', 'npxG']]

All the metrics that I considered in this model are calculated per 90 minutes played. Below, we have the name of the column for each metric in the dataset and its definition.

’90s played’ — Minutes played divided by 90
‘TotDistance’ — Total distance in yards that completed passes have traveled in any direction
‘PrgDistance’ — Progressive passing distance — total distance, in yards, that completed passes have traveled towards the opponent’s goal. Note: Passes away from the opponent’s goal are counted as zero progressive yards.
‘Key Passes’ — passes that directly lead to a shot (assisted shots)
‘Passes Penalti Area’ — completed passes into the 18-yard box, not including set pieces
‘Crosses Penalti Area’ — completed crosses into the 18-yard box, not including set pieces
‘Progressive Passes’ — completed passes that move the ball towards the opponent’s goal at least 10 yards from its furthest point in the last six passes, or any completed pass into the penalty area. Excludes passes from the defending 40% of the pitch
‘Completed passes into space between defenders’ — Completed pass sent between back defenders into open space
‘Sw (+40m width pass)’ — Passes that travel more than 40 yards of the width of the pitch
‘Ground passes’
‘Low passes’ — Passes that leave the ground but stay below shoulder-level
‘High passes’ — Passes that are above shoulder-level at the peak height
‘Assists’
‘Progressive passes received’
‘Completion total%’ — Pass Completion Percentage
‘Completion short passes%’ — Pass Completion Percentage, passes between 5 and 15 yards
‘Completion medium passes %’ — Pass Completion Percentage, passes between 15 and 30 yards
‘Completion long passes%’ — Pass Completion Percentage, passes longer than 30 yards
‘Aerial duels Won%’
‘Tackled dribbled%’ — Percentage of dribblers tackled: dribblers tackled divided by dribblers tackled plus times dribbled past
‘Dribbled Past’ — Number of times dribbled past by an opposing player
‘Touches Mid 3rd’ — Touches in middle 1/3
‘Touches Att 3rd’ — Touches in attacking 1/3
‘Touches Attacking Area’ — Touches in attacking penalty area
‘Dribbles Succ’ — nº successful dribbles
‘Success dribble%’ — Percentage of dribbles completed successfully
‘Player dribbled past’ — Number of players dribbled past
‘Carries’ — Number of times the player controlled the ball with their feet
‘Total Distance’ — Total distance, in yards, a player moved the ball while controlling it with their feet, in any direction
‘Progressive Distance’ — Carrying progressive distance — Total distance, in yards, a player moved the ball while controlling it with their feet towards the opponent’s goal
‘Progressive Carries’ — Carries that move the ball towards the opponent’s goal at least 5 yards, or any carry into the penalty area. Excludes carries from the defending 40% of the pitch
‘Carries into final 3rd’ — Carries that enter the 1/3 of the pitch closest to the goal
‘Carries into Penalti Area’ — Carries into the 18-yard box
‘Shot on target’
‘npxG’ — non-penalty expected goals per 90 minutes played
‘Shot Creating Actions’ — The two offensive actions directly leading to a shot, such as passes, dribbles, and drawing fouls.

Note: A single player can receive credit for multiple actions, and the shot-taker can also receive credit.

‘Goal Creating Actions’ — the same as the previous metric, but for the actions directly leading to a goal
‘On-Off’ — xG Plus/Minus Net — Net expected goals per 90 minutes by the team while the player was on the pitch minus net expected goals per 90 minutes by the team while the player was off the pitch.
‘xG+/-90’ (column ‘xG/-90’ in the code above) — xG Plus/Minus — Expected goals scored minus expected goals allowed by the team while the player was on the pitch per 90 minutes played.

We keep a copy of df_eng to use later for the 3D clustering.

df_3d = df_eng.copy()

Remove player identifiers and normalize our data with MinMaxScaler.

from sklearn import preprocessing

player_names = df_eng['Player'].tolist() 

df_eng = df_eng.drop(['Player'], axis = 1) 

x = df_eng.values 
scaler = preprocessing.MinMaxScaler()
x_scaled = scaler.fit_transform(x)
X_norm = pd.DataFrame(x_scaled)
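With its default settings, MinMaxScaler rescales every column to the range 0 to 1 using (value - column minimum) / (column maximum - column minimum). This matters because K-means relies on distances, so a metric measured in hundreds of yards would otherwise outweigh one measured in fractions of a goal. The sketch below reproduces that formula by hand on a few made-up numbers (they are not taken from the dataset).

```python
import numpy as np

# made-up per-90 values for three players and two metrics (illustrative only)
x_demo = np.array([[10.0, 0.2],
                   [30.0, 0.5],
                   [20.0, 0.8]])

# column-wise minimum and maximum
col_min = x_demo.min(axis=0)
col_max = x_demo.max(axis=0)

# same formula MinMaxScaler applies with its default feature_range=(0, 1)
x_demo_scaled = (x_demo - col_min) / (col_max - col_min)
print(x_demo_scaled)
```

After scaling, the smallest value of each column becomes 0 and the largest becomes 1, so both metrics contribute on the same scale.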

We use PCA here as a way to reduce our dimensions. Let’s scale it down to 2 dimensions so we can see the data in a 2D plot with the K-means results.

from sklearn.decomposition import PCA

pca = PCA(n_components = 2)
reduced = pd.DataFrame(pca.fit_transform(X_norm))
reduced.head()
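Before relying on a 2D projection, it can be worth checking how much of the original variance the two components keep. The sketch below does this with the `explained_variance_ratio_` attribute of scikit-learn's PCA on a random stand-in matrix; `X_demo` and `pca_demo` are illustrative names, and on the real data you would inspect the fitted `pca` object instead.

```python
import numpy as np
from sklearn.decomposition import PCA

# random stand-in for the normalized player matrix (50 "players", 6 "metrics")
rng = np.random.default_rng(0)
X_demo = rng.random((50, 6))

pca_demo = PCA(n_components=2)
pca_demo.fit(X_demo)

# fraction of the total variance captured by each component, and by both together
print(pca_demo.explained_variance_ratio_)
print(pca_demo.explained_variance_ratio_.sum())
```

If the sum is low, players that look close in the 2D plot may still differ noticeably in the original metrics.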

Now that we’ve reduced the dimensions using PCA, we can plot our players and look for relationships in a 2D analysis.

However, this plot only shows similarity between players: PC 1 and PC 2 are combinations of the original metrics, so they are not directly interpretable variables for a technical team or a scouting team.

Now that we have a 2D similarity metric, we can apply K-means to create the similarity clusters, but first we need to know how many clusters we want.

To find an adequate number of clusters (K) we use the elbow method, applied below. We create a loop that trains 10 K-means models (for K from 1 to 10) and records the WCSS (within-cluster sum of squares) of each one.

WCSS is defined as the sum of the squared distance between each member of the cluster and its centroid.
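As a quick check of this definition, the sketch below computes WCSS by hand on random toy points and compares it with the `inertia_` attribute that scikit-learn's KMeans exposes for the same quantity. The data and the variable names (`X_demo`, `km_demo`, `wcss_manual`) are assumptions made for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# random toy points (not the player dataset)
rng = np.random.default_rng(0)
X_demo = rng.random((60, 2))

km_demo = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo)

# centroid assigned to each point, then the sum of squared distances to it
assigned_centroids = km_demo.cluster_centers_[km_demo.labels_]
wcss_manual = ((X_demo - assigned_centroids) ** 2).sum()

print(wcss_manual, km_demo.inertia_)
```

The two numbers should agree up to floating-point error, which is why the elbow loop can simply read `kmeans.inertia_`.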

from sklearn.cluster import KMeans

wcss = [] 
for i in range(1, 11): 
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(reduced) 
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 11), wcss)
plt.xlabel('Number of clusters (K)')
plt.ylabel('WCSS')

Using the elbow method, we choose K=6 (the elbow of our WCSS graph), as it is roughly the value from which WCSS stops dropping so fast.
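Reading the elbow by eye is subjective. One common heuristic, which is not the method used in this article, is to pick the point of the curve that lies farthest from the straight line joining its first and last points. The sketch below applies it to made-up WCSS values; `elbow_by_distance` is a hypothetical helper name and the numbers are not the ones from our plot.

```python
import numpy as np

def elbow_by_distance(ks, wcss):
    """Return the K whose point is farthest from the line joining the curve's endpoints."""
    ks = np.asarray(ks, dtype=float)
    wcss = np.asarray(wcss, dtype=float)
    # rescale both axes to [0, 1] so neither dominates the distance
    x = (ks - ks.min()) / (ks.max() - ks.min())
    y = (wcss - wcss.min()) / (wcss.max() - wcss.min())
    # perpendicular distance from each point to the line through the endpoints
    x0, y0, x1, y1 = x[0], y[0], x[-1], y[-1]
    dist = np.abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0)
    dist /= np.hypot(x1 - x0, y1 - y0)
    return int(ks[dist.argmax()])

# made-up WCSS values for K = 1..10 (illustrative only)
ks_demo = range(1, 11)
wcss_demo = [100, 60, 35, 20, 12, 7, 6, 5.5, 5, 4.8]
k_elbow = elbow_by_distance(ks_demo, wcss_demo)
print(k_elbow)
```

A heuristic like this is only a starting point; the final choice of K should still be checked against whether the resulting clusters make football sense.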

So let’s train a k-means for K=6 and see how it behaves.

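from sklearn.cluster import KMeans

# fixed seed (an addition here) so the cluster numbering is repeatable between runs
kmeans = KMeans(n_clusters=6, init='k-means++', random_state=42)
kmeans = kmeans.fit(reduced)

# cluster label (0 to 5) assigned to each player
labels = kmeans.predict(reduced)
clusters = kmeans.labels_.tolist()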

After training the model, each player in the dataset has a number from 0 to 5 that identifies their cluster. Let’s put this information in our reduced dataset and then plot it with the results of K-means.

reduced['cluster'] = clusters
reduced['name'] = player_names
reduced.columns = ['x', 'y', 'cluster', 'name']
reduced.head()
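Besides plotting, it is often handy to list which players ended up in each cluster. The sketch below shows one way to do that with pandas on a tiny made-up frame that has the same columns as `reduced`; the player names are placeholders, not results from the model.

```python
import pandas as pd

# tiny made-up frame with the same columns as `reduced` (names are placeholders)
demo = pd.DataFrame({
    'x': [0.9, 0.8, -0.5, -0.6, 0.1],
    'y': [0.2, 0.3, -0.1, 0.0, 0.7],
    'cluster': [1, 1, 0, 0, 2],
    'name': ['Player A', 'Player B', 'Player C', 'Player D', 'Player E'],
})

# list of player names for each cluster label
members = demo.groupby('cluster')['name'].apply(list)
print(members)

# number of players in each cluster
sizes = demo['cluster'].value_counts().sort_index()
print(sizes)
```

On the real data, the same two lines applied to `reduced` give the squad list and size of every cluster.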

In this plot, each color represents a cluster.

from datetime import datetime

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set(style="white")

# one point per player, colored by cluster ('height' replaced the old 'size' argument)
ax = sns.lmplot(x="x", y="y", hue='cluster', data=reduced, legend=False,
                fit_reg=False, height=15, scatter_kws={"s": 250})

# label every point with the player's name
texts = []
for x, y, s in zip(reduced.x, reduced.y, reduced.name):
    texts.append(plt.text(x, y, s, fontweight='heavy'))

ax.set(ylim=(-2, 2))
plt.tick_params(labelsize=15)
plt.xlabel("PC 1", fontsize=20)
plt.ylabel("PC 2", fontsize=20)
plt.title('KMeans clustering - Defenders', size=25, weight='heavy')

# signature and date below the plot
s = "@ricardoandreom\n"
d = datetime.today().strftime('%Y-%m-%d')
plt.text(-1.1, -2.3, s, fontsize=12, fontweight='heavy')
plt.text(-1.1, -2.35, d, fontsize=12, fontweight='heavy')

plt.tight_layout()

Cluster 5

This is a personal opinion, but this cluster contains the best full-backs in the Premier League at the start of the season.

Offensive, progressive and technically very good players.

Cluster 4

This cluster clearly represents a small set of the top center backs in the league, although some of them may be missing.

Personally I’d expect to see Saliba in this cluster.

Interestingly, Kyle Walker, who is still a full-back, is in this cluster. However, Guardiola over the years has converted him into more of a center back, occupying this role several times.

Cluster 0

Above-average performing full backs

Cluster 3

Cluster 2

Cluster 1

3D K-means clustering

Now, let’s use df_3d and, in an analogous way, reduce to 3 dimensions using PCA. This way, we can plot our players and look for relationships in a 3D analysis.

from sklearn import preprocessing

player_names_3d = df_3d['Player'].tolist() 

df_3d = df_3d.drop(['Player'], axis = 1) 

x = df_3d.values 
scaler = preprocessing.MinMaxScaler()
x_scaled = scaler.fit_transform(x)
X_norm = pd.DataFrame(x_scaled)

from sklearn.decomposition import PCA

pca = PCA(n_components = 3)
df1 = pd.DataFrame(pca.fit_transform(X_norm))

import seaborn as sns
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.colors import ListedColormap


%matplotlib inline
df1.columns = ['PC1','PC2', 'PC3']

# create 3D axes (gca(projection=...) no longer works in recent Matplotlib)
ax = plt.figure().add_subplot(projection='3d')

ax.scatter(df1['PC1'], df1['PC2'], df1['PC3'], s=40)

plt.show()

wcss = [] 
for i in range(1, 15): 
  kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
  kmeans.fit(df1) 
  wcss.append(kmeans.inertia_)

plt.plot(range(1, 15), wcss)
plt.xlabel('Number of clusters (K)')
plt.ylabel('WCSS')

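# K=9 for the 3D case; fixed seed (an addition here) keeps the labels repeatable
kmeans = KMeans(n_clusters=9, init='k-means++', random_state=42)
kmeans = kmeans.fit(df1)

labels = kmeans.predict(df1)
clusters = kmeans.labels_.tolist()

# attach the cluster label and the player name to each 3D point
df1['cluster'] = clusters
df1['name'] = player_names_3d
df1.columns = ['x', 'y', 'z', 'cluster', 'name']
df1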


from matplotlib import pyplot as plt
from matplotlib import animation

fig = plt.figure()
# Axes3D(fig) no longer attaches itself to the figure in recent Matplotlib
ax = fig.add_subplot(projection='3d')

def init():
    # draw the players once, colored by cluster
    ax.scatter(df1['x'], df1['y'], df1['z'], marker='o', s=20, c=df1['cluster'], cmap=plt.get_cmap('Set1'), alpha=0.6)
    return fig,

def animate(i):
    # rotate the camera by one degree per frame
    ax.view_init(elev=10., azim=i)
    return fig,

# blit=False because the whole figure is redrawn on every frame
anim = animation.FuncAnimation(fig, animate, init_func=init,
                               frames=360, interval=20, blit=False)

# saving to mp4 requires ffmpeg to be installed
anim.save('clusters_animation.mp4', fps=60,
          extra_args=['-vcodec', 'libx264'], dpi=300)

It’s really interesting to look at the outputs and see that most of them make perfect sense. This is a nice way to search the market for a player with a similar profile to the one we intend to replace.
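For the replacement use case, one simple approach is to rank players by their distance to a target player in the reduced space. The sketch below assumes a frame with `name`, `x` and `y` columns like our `reduced` dataset; the helper `most_similar` and the player names are hypothetical.

```python
import numpy as np
import pandas as pd

# made-up coordinates in PCA space (names are placeholders)
demo = pd.DataFrame({
    'name': ['Player A', 'Player B', 'Player C', 'Player D'],
    'x': [0.0, 0.1, 1.0, -1.0],
    'y': [0.0, 0.1, 1.0, 0.5],
})

def most_similar(df, target, n=2):
    """Return the n players closest to `target` by Euclidean distance in (x, y)."""
    ref = df.loc[df['name'] == target, ['x', 'y']].to_numpy()[0]
    dist = np.linalg.norm(df[['x', 'y']].to_numpy() - ref, axis=1)
    out = df.assign(distance=dist)
    # drop the target itself and keep the n smallest distances
    return out[out['name'] != target].nsmallest(n, 'distance')[['name', 'distance']]

closest = most_similar(demo, 'Player A')
print(closest)
```

The same idea works with the three PCA components, or with the full normalized metrics if you prefer not to lose information in the projection.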

Now, it is part of our job to analyze the clusters and discover what these players have in common to be in the same cluster.
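A practical starting point for that analysis is to average the original metrics inside each cluster and compare them with the overall average. The sketch below uses a few made-up values and two of the column names from our dataset; in practice you would add the cluster labels to the per-90 metrics frame and run the same groupby.

```python
import pandas as pd

# made-up per-90 metrics and cluster labels (placeholders, not real results)
stats_demo = pd.DataFrame({
    'Key Passes': [2.0, 1.8, 0.4, 0.6],
    'Aerial duels Won%': [40.0, 45.0, 70.0, 66.0],
    'cluster': [0, 0, 1, 1],
})

# average of every metric inside each cluster
profile = stats_demo.groupby('cluster').mean()

# difference from the overall average shows what makes a cluster distinctive
overall = stats_demo.drop(columns='cluster').mean()
diff = profile - overall
print(diff)
```

In this toy example cluster 0 stands out for key passes and cluster 1 for aerial duels, which is the kind of reading we want to make for the real clusters.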

I really hope you enjoyed reading this and found it useful!


Written by Ricardo André