Can a Machine Learning algorithm determine the key components of an NBA championship team?

Published in CodeX · 6 min read · Sep 10, 2023

Introduction

In each of the last five years, basketball fans were lucky enough to see a different team win the NBA championship. We have also witnessed blockbuster trades meant to build the next championship team. Some of these trades were a success; others ended up becoming memes. As basketball becomes more global, today’s NBA has a remarkable depth of talent, which makes it difficult to predict the next team that will lift the Larry O’Brien trophy. The question that comes to my mind is: what are the components of a championship team?

With today’s technology, sports data has become more extensive and accessible. Beyond the regular per-game stats, there are also advanced stats on a player’s effectiveness on the court and shooting efficiency. Such stats can help us understand the profile of each player, whether an All-Star or a role player. With the right techniques, insights from this data can help GMs and coaches understand what their team currently needs, whether it’s a sharpshooter, a tall rim-protector, a bench scorer, a veteran playmaker, or an all-around superstar. Considering the amount of available data, my second question is: can a machine learning algorithm help determine the components of a championship team?

In this article, I will share my attempt to use an unsupervised machine learning model to explore the composition of each NBA champion of the past 5 years, in the hope of finding a pattern that reveals a winning formula other teams can adopt. I limited the project to 5 years for two reasons: each of those seasons produced a different champion, and the window excludes the Durant-era Warriors, whose overwhelming talent would have skewed the picture.

Methodology

Data collection

The data used for this project is sourced entirely from basketball-reference.com, as it provides complete statistics (both regular season and postseason) and player attributes. The main data points are:

Physical attributes: the player’s height and weight
Background and experience: age, country of origin, and years in the NBA
Basic per game stats: minutes, points, rebounds, assists, blocks, steals, etc
Advanced stats: efficiency
Shooting stats: shooting percentage by range

Model selection

Considering both personal bandwidth and familiarity, I chose k-means as the model for this project. The algorithm identifies k centroids and then assigns players to k groups based on similarity. After applying the model to the data frame, I interpret each cluster as a player segment by extracting its top features. In other words, the clusters should describe player profiles and how many players fit each profile.
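To make the idea concrete, here is a minimal from-scratch sketch of the k-means loop. It is not the code used in this project (that relies on scikit-learn), and the function name `kmeans_sketch` and the six made-up stat lines are purely illustrative assumptions.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=20, seed=0):
    """Toy k-means (Lloyd's algorithm). Returns (centroids, labels)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct rows picked at random (fancy indexing copies them)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Made-up (points, rebounds) per game for six hypothetical players
X = np.array([[25, 5], [27, 6], [26, 4], [5, 10], [6, 11], [4, 12]])
centroids, labels = kmeans_sketch(X, k=2)
print(labels)
```

The three high scorers should land in one group and the three rebounders in the other; which group gets label 0 depends on the random start.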

Finding the pattern

The next step is visualizing each team’s player distribution by cluster. In other words, we will see how many players each team has in each cluster. The expectations are:

If there is a similarity, we can determine the components of a championship team
If not, we can see different strategies used by each team
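Counting players per team and cluster can be sketched with a pandas cross-tabulation. The `Team` column and the rows below are hypothetical placeholders, not the project’s data; only the `Cluster` column name matches the code later in this article.

```python
import pandas as pd

# Hypothetical rows: one per player, with a team name and a cluster label
players = pd.DataFrame({
    "Team": ["A", "A", "A", "B", "B", "B"],
    "Cluster": [0, 1, 1, 0, 0, 2],
})

# Rows are teams, columns are clusters, values are player counts
composition = pd.crosstab(players["Team"], players["Cluster"])
print(composition)
```

Comparing the rows of such a table is what reveals whether the champions share a composition or not.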

Limitations

The main limitations of this project are as follows:

The cluster definitions might be vague and debatable since they are inferred purely from data features
The data does not contain features that can only be seen by actually watching basketball, such as a player’s unique playing style.

Implementation and Analysis

Model Development

After creating and cleaning the data frame, I ran a k-means model using scikit-learn with the code below:

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Standardize the data
df_scaled = (df - df.mean()) / df.std()

# Perform k-means clustering
kmeans = KMeans(n_clusters=5, random_state=0)
kmeans.fit(df_scaled.fillna(0))

# Add the cluster labels to the DataFrame
df['Cluster'] = kmeans.labels_

# Print the number of samples in each cluster
print(df['Cluster'].value_counts())

Ideally, we would run an elbow test to determine the number of clusters and to check whether k-means is a good fit. Due to limitations, the number was instead chosen based on how intuitive the resulting distribution looked. For this case, the magic number is 5.
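For reference, an elbow test could look roughly like the sketch below. It was not run for this project; the synthetic blobs from `make_blobs` stand in for the player data, and the range of k values is an arbitrary assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with three well-separated groups (not NBA data)
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [6, 0], [0, 6]],
    cluster_std=0.5,
    random_state=0,
)

# Fit k-means for several k and record the inertia
# (sum of squared distances from each point to its centroid)
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

The "elbow" is the k after which inertia stops dropping sharply. On this synthetic data that should be k = 3; on real player stats the bend is often much less obvious.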

The next step is understanding the top features of each cluster for interpretation, which were generated using the code below:

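import numpy as np

# Get the cluster centers (one row per cluster, in standardized units)
centers = kmeans.cluster_centers_

# Get the top features for each cluster: the features with the highest
# centroid values. The 'Cluster' column was appended last to df, so the
# feature indices still line up with df.columns.
n_top_features = 6
top_features = []
for center in centers:
    top_feature_indices = np.argsort(center)[-n_top_features:][::-1]
    top_features.append(df.columns[top_feature_indices])

# Print the top features for each cluster
for i, cluster_features in enumerate(top_features):
    print(f"Cluster {i} top features: {', '.join(cluster_features)}")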
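Another way to read the centroids is to lay them out as a table, one row per cluster and one column per feature. The sketch below uses two made-up standardized features (`PTS`, `BLK`) for six hypothetical players, so the values are illustrative only.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up standardized features for six hypothetical players
features = pd.DataFrame({
    "PTS": [1.2, 1.0, 1.1, -1.0, -1.2, -1.1],
    "BLK": [-0.9, -1.1, -1.0, 1.0, 1.1, 0.9],
})
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# One row per cluster, one column per feature, values in z-score units
center_table = pd.DataFrame(model.cluster_centers_, columns=features.columns)
print(center_table.round(2))

# Feature with the highest centroid value in each cluster
print(center_table.idxmax(axis=1))
```

Because the data is standardized, a positive value means the cluster sits above the league-wide average on that feature, which makes the table easier to interpret than raw indices.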

Based on the results and rechecking the data frame, the 5 clusters are interpreted as follows:

Cluster 1: Strong role players for defence and offensive setup
Cluster 2: All-Star to Superstar level players
Cluster 3: Efficient play-off veterans
Cluster 4: Defensive off-the-bench big men
Cluster 5: Benchwarmers

Visualization and Analysis

Using plotly, I created a simple bar chart that shows the distribution of players by cluster as displayed below.

Each team’s player composition by cluster (Source: Author)

As we can see from the figure above, each team used a different composition to win the championship. Denver and LA were quite evenly distributed, Golden State had a strong supporting cast for their only all-star (for that season), Milwaukee had more veterans, and Toronto seemed to make the most use of their core players.

Nevertheless, there are some patterns to be seen:

At least 3 play-off veterans (cluster 3),
At least 4 strong supporting players (cluster 1),
At least 2 off-the-bench defensive big men (cluster 4),
And obviously, at least 1 superstar (cluster 2)

The main reason this model does not show a clearer pattern is that every player in cluster 2 has a unique playing style. Therefore, each team used a different composition based on the attributes of those players.

A better way of interpreting this model output is by using “domain expertise”. As a huge basketball fan, I use my knowledge of each superstar’s unique playing style to analyze how the cluster distribution helped him win a ring. Based on my “domain expertise”, I interpret the model output as follows:

Denver Nuggets: Nikola Jokic is an all-around centre with a pass-first mindset. He needs an all-star level guard like Jamal Murray who can score. Having 4 strong role players and 2 efficient veterans complements the duo’s dynamic.
Golden State Warriors: Stephen Curry is probably the best shooting point guard of all time. He needs a strong supporting cast to give him space for high-percentage shots, finish his assists, and take care of defense. (Note: Klay Thompson had just recovered from injuries that season, so he ended up in cluster 2.)
Milwaukee Bucks: Giannis Antetokounmpo’s strength is his powerful drive, so he needs Jrue Holiday and Khris Middleton as alternative options. The presence of many veterans also helped guide the then relatively young superstar.
Los Angeles Lakers: LeBron James is arguably the greatest all-around player of all time. Having Anthony Davis gave him an excellent alternative when he was tightly covered. The other components helped strengthen the defense and gave more scoring opportunities when they were both on the bench.
Toronto Raptors: Kawhi Leonard, a dominant two-way player, joined with only one year left on his contract, so the team had to focus on their core players. This probably explains the high number of benchwarmers.

Conclusion

Building a machine learning model is easy; understanding it requires domain expertise. For the NBA, this comes from watching games and highlights. I would not call myself an NBA expert, but the reason I managed to interpret the output (feel free to argue in the comments) was years of watching games.

The model may not have given a straightforward answer on the winning formula of a championship team, but it gives us a base for further analysis. This project is far from ideal, as the data was limited to 5 teams and the number of clusters was not validated with an elbow test. Since NBA teams today have highly skilled data analytics teams, there are surely several interesting approaches to finding the formula of a winning team. It is exciting to see how technology continues to develop and solve more challenges for our favorite NBA teams!