There are 3 different types of attackers in Ligue 1

Remi Awosanya
5 min read · Feb 28, 2023

…at least that’s what an application of K-Means clustering tells us. Based on player output in the 2022/23 season, all forwards in the French top flight can be classified into one of three groups. The following walks you through an application of K-Means clustering in Python (snippets of code included). After defining the clusters, I will look at the characteristics of each cluster and identify which cluster each player falls into. Finally, I will build a classification model to predict which cluster a player is likely to fall into. The data used can be found on FBref.com.

Once the data has been scraped from FBref.com and cleaned, the process starts by identifying the optimal number of clusters based on the player data. The players included have played in a forward position for more than the average number of minutes in Ligue 1 this season (the plan is to use forwards who have played below the average number of minutes as test data in our classification model later on). Columns in the dataset include goals per 90, assists per 90, progressive carries, touches in the box and key passes. Note that Python’s scikit-learn package can speed up the process, but I have chosen to write the clustering algorithm manually.
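As a rough sketch of that minutes filter (the file name and column names here are hypothetical, not FBref’s actual headers):

import pandas as pd

# hypothetical file and column names, for illustration only
df = pd.read_csv("ligue1_forwards.csv")
avg_min = df["minutes"].mean()
forw = df[df["position"].str.contains("FW") & (df["minutes"] > avg_min)]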

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from kneed import KneeLocator

features = list(forw.columns[5:])
forw = forw.dropna(subset=features)
data = forw[features].copy()

# find the optimal number of clusters using the elbow method
# wcss = within-cluster sum of squares
wcss = []

for i in range(1, 7):
    kmeans = KMeans(n_clusters=i, random_state=0)
    kmeans.fit(data)
    wcss.append(kmeans.inertia_)

sns.set()
plt.plot(range(1, 7), wcss)
plt.title("Selecting the Number of Clusters using the Elbow Method")
plt.xlabel("Clusters")
plt.ylabel("WCSS")
plt.show()

kl = KneeLocator(range(1, 7), wcss, curve="convex", direction="decreasing")
elbow = kl.elbow
print(elbow)

Using Python’s kneed package, the optimal number of clusters is 3. The rest of the process is as follows:

1. Scaling the data points, as the columns have different ranges. I have used Min-Max scaling to map all columns into the range [1, 10]:

# rescale every column to [1, 10]
data = (data - data.min()) / (data.max() - data.min()) * 9 + 1

2. Selecting random centroids (cluster centres); by definition, the number of centroids equals the number of clusters. A sketch of this step is shown below.
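The random_centroids helper called in step 4 isn’t shown in the article, so the following is a minimal sketch under my own assumptions: each centroid samples one observed value per feature, and centroids are stored as the columns of a DataFrame:

def random_centroids(data, k):
    # assumed implementation: each centroid samples one observed value per feature
    centroids = []
    for i in range(k):
        centroids.append(data.apply(lambda x: float(x.sample().iloc[0])))
    return pd.concat(centroids, axis=1)

This layout (features as rows, centroids as columns) matches how get_labels below iterates over the centroids.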

3. Calculating the Euclidean distance from each point to each centroid, and assigning each point to its nearest centroid:

def get_labels(data, centroids):
    # distance from every player to each centroid; label = index of the nearest
    distances = centroids.apply(lambda x: np.sqrt(((data - x) ** 2).sum(axis=1)))
    return distances.idxmin(axis=1)

4. Iterating, reassigning labels and recomputing centroids, until the centroids stop changing or a maximum number of iterations is reached, thereby minimising the Sum of Squared Errors (SSE):

max_iterations = 100
k = 3
centroids = random_centroids(data, k)
old_centroids = pd.DataFrame()
iteration = 1

while iteration < max_iterations and not centroids.equals(old_centroids):
    old_centroids = centroids
    labels = get_labels(data, centroids)
    centroids = new_centroids(data, labels, k)
    iteration += 1

print(centroids)
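The new_centroids helper isn’t shown in the article either. The standard K-Means update recomputes each centroid as the mean of the points currently assigned to it; a minimal sketch, assuming the features-as-rows, centroids-as-columns layout used above:

def new_centroids(data, labels, k):
    # standard update: each new centroid is the mean of its cluster's points,
    # transposed so centroids are columns, matching random_centroids' layout
    return data.groupby(labels).mean().T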

Now that we have our clusters, we can begin to analyse their characteristics.

Initially, we can see very little difference in the number of players in each cluster and their respective average ages. Nonetheless, players in cluster 1 are slightly older and play more minutes, suggesting a propensity to trust experienced forwards.
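A summary table like the one described can be produced with a groupby; the player, age and minutes column names below are assumptions, not necessarily FBref’s headers:

# hypothetical column names, for illustration only
summary = forw.groupby(labels).agg(
    players=("player", "count"),
    avg_age=("age", "mean"),
    avg_minutes=("minutes", "mean"),
)
print(summary)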

Taking a deeper look:

Here we can see further cluster characteristics. Players in cluster 1 record more progressive passes, progressive carries and touches in the box per game, performing well in both progressive actions and box presence. Cluster 2 players record relatively few progressive actions but a high number of touches in the box.

Forwards are judged on goals, so let’s take a look:

Cluster 2 has the widest inter-quartile range, indicating that the spread of the middle 50% of its values is the largest of the three. Cluster 1 has the highest maximum value, barring the outlier in cluster 2.

Next we take a look at which players fall into each of the clusters, again looking at the distribution of goals per game:

Any surprises?

Finally, we can build a classification model that predicts which cluster a player not included in the dataset is likely to fall into, based on output.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# add a clusters column to the dataset
forw["clusters"] = labels

# define target (y) and features (x)
feat = list(forw.columns[5:31])
x = forw[feat]
y = forw["clusters"]

# split the data
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=1)

# random forest model creation
rfc = RandomForestClassifier()
model = rfc.fit(X_train, y_train)

# predictions
y_pred = rfc.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy: %.2f' % (accuracy * 100))
print(classification_report(y_test, y_pred))

Due to the randomness in both the train/test split and the random forest itself, the steps above are run five times, giving accuracy scores of [77.42, 83.87, 83.87, 83.87, 80.65], an acceptable average accuracy of 81.9%.
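A more systematic alternative to repeated manual runs (not used here, but standard practice) is scikit-learn’s cross-validation, which averages accuracy over several different splits:

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# 5-fold cross-validation: train/evaluate on five different splits
scores = cross_val_score(RandomForestClassifier(random_state=1), x, y, cv=5)
print("Mean accuracy: %.2f" % (scores.mean() * 100))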

Recall that our initial dataset was filtered to players who have played above the average number of minutes in the French top flight. A player I like is Bradley Barcola of Lyon, who hasn’t played a lot this season. Let’s see which cluster he is assigned to:

def get_player(i):
    # look the player up in the full, unfiltered dataset and predict his cluster
    df = full_data[full_data["player"] == i]
    num_cols = df.columns[5:31]
    df = df[num_cols]
    print(rfc.predict(df))

get_player("Bradley Barcola")

Bradley Barcola falls into cluster 1: progressive players who get lots of touches in the box and have a relatively high goals-per-game ratio.

If you’ve made it this far, any and all scrutiny is not only welcome but encouraged. For contact, email me: remiawosanya8@gmail.com
