Segmenting NBA Statistics with K-Means (Who are the G.O.A.T.s?)

Madison John
Apr 16, 2020 · 10 min read

This article summarizes the process and findings of a study that uses clustering algorithms to group NBA player performance statistics and analyzes the resulting clusters.

01 Introduction


Topics

NBA

  • The National Basketball Association is a professional men’s basketball league that has 30 member teams, 29 in the United States and 1 in Canada.
  • The NBA held its inaugural season in 1946, and was in the midst of its 74th season before being suspended due to the COVID-19 pandemic.

Clustering

  • Also known as cluster analysis, this is the process of dividing a set of observations into groups called clusters.
  • The goal of clustering is to sort observations such that the observations within a cluster are more similar to one another than observations outside the cluster.
  • Cluster analysis is one of the methods of unsupervised learning, a type of machine learning that searches for patterns in a dataset with no labels.

Project Goals

The goal of cluster analysis for the NBA dataset is to segment the player statistics such that we are able to identify tiers of performance in order to potentially answer such subjective questions as:

  • Which NBA seasons are the greatest of all time?
  • Which NBA players are the greatest of all time?

Dataset

Source

The dataset is a merger of two sets of player performance statistics from Basketball Reference.

  1. Data from 1950–2017 was downloaded from Kaggle, uploaded by Omri Goldstein (username: drgilermo)
  2. Data from 2018–2019 was sourced directly from Basketball Reference.

Features & Observations

  • The dataset contains 25,694 observations (after dropping blank rows)
  • The dataset contains 50 columns (after dropping blank columns)

02 Data Cleaning


The following steps were taken to rid the dataset of inconsistencies and to organize the variables in a meaningful way.

  • 3 blank columns were removed.
  • 67 blank rows were removed.
  • Player names were cleaned of non-standard / accented characters.
  • The 50 variables were sorted into 46 continuous variables and 4 categorical variables.
  • Columns with missing values were not removed nor interpolated at this time. These variables coincided with statistics that were not collected or calculated in the earlier years of the NBA.
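The cleaning steps above can be sketched in pandas; this is a minimal illustration, assuming a `Player` column holds the names (the actual column names in the notebook may differ).

```python
import unicodedata

import pandas as pd

def strip_accents(name: str) -> str:
    """Replace accented characters with their ASCII equivalents."""
    return (unicodedata.normalize("NFKD", name)
            .encode("ascii", "ignore")
            .decode("ascii"))

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Drop columns, then rows, that are entirely blank.
    df = df.dropna(axis=1, how="all").dropna(axis=0, how="all")
    # Standardize player names (e.g. 'Nikola Jokić' -> 'Nikola Jokic').
    df["Player"] = df["Player"].map(strip_accents)
    return df
```

Note that rows with *some* missing values survive this step, matching the decision above to keep statistics that simply weren't tracked in the NBA's early years.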

03 Data Exploration


Alright, let’s get to visualizing these statistics. Since there are so many, we’ll view them as clusters (pun totally intended) of similar items. For details on the metrics, you may refer to Basketball Reference’s glossary, which will be especially useful for the more advanced metrics.

Basic Metrics

First off, we have the simple count and record metrics. No advanced formulas required here, just raw numbers.

Basic metrics pair plot.
  • There is generally a positive correlation between the 5 basic metrics; however, BLK (blocks) appears to correlate the least with the others.
  • BackCourt players on average appear to record more AST (assists) and STL (steals) whereas FrontCourt players have an edge in TRB (rebounds) and BLK.
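Correlation observations like these can be checked numerically with a correlation matrix (the pair plots themselves were likely drawn with seaborn's `pairplot`). A sketch with toy stat lines; real data would have thousands of rows:

```python
import pandas as pd

# Toy per-season stat lines for the five basic counting metrics.
stats = pd.DataFrame({
    "PTS": [30.1, 25.3, 12.0, 8.4],
    "AST": [5.5, 10.2, 2.1, 1.9],
    "TRB": [6.0, 7.8, 9.5, 3.2],
    "STL": [1.8, 2.1, 0.6, 0.5],
    "BLK": [0.5, 0.3, 2.2, 0.4],
})

# Pairwise Pearson correlations between the basic metrics.
corr = stats.corr()
print(corr.round(2))
```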

Shooting Percentages

Next we have the various shooting percentages, which are simply shots made divided by shots attempted.

Shooting percentages pair plot.
  • There is generally a positive correlation between the various shooting percentages except for FT% (free-throw percentage), which appears to have no correlation with the others.
  • BackCourt players on average appear to have higher FT% while FrontCourt players have an edge in FG% (overall field-goal percentage)

Win Shares

Here we have the various win shares, which estimate the number of wins contributed by a player.

Win shares pair plot.
  • There is a positive correlation between WS and its offensive (OWS) and defensive (DWS) components; however, WS correlates better with OWS than DWS.
  • There is no indication that BackCourt and FrontCourt players perform differently on average with respect to WS.

Player Ratings

Finally we have here the advanced player ratings: Box Plus/Minus, Value Over Replacement Player, and Player Efficiency Rating.

Player ratings pair plot.
  • BPM correlates well with OBPM (its offensive component), as well as VORP and PER.
  • DBPM does not correlate well with the other player ratings, though FrontCourt players appear to perform better with respect to DBPM.

04 Feature Engineering


In this section, we discuss any new variables we generated and which variables we selected as features for the model with which we will attempt to cluster.

Position Grouping

Number of players per position designation.

There are 5 standard positions in basketball, though the dataset includes several combined designations (e.g. SG-SF).

Additionally, there are two commonly accepted position groups, based on their similarity in physical measurements and roles on the team.

  • BackCourt (Guards)
  • FrontCourt (Forwards, Centers)

The Pos variable was grouped into these two position groups, generating the new variable PosGrp.
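The grouping can be sketched as below; note that mapping combined designations by their first listed position is an assumption here, and the notebook may handle them differently.

```python
import pandas as pd

# Guard positions belong to the BackCourt group; everything else
# (forwards and centers) belongs to the FrontCourt group.
BACKCOURT = {"PG", "SG", "G"}

def pos_group(pos: str) -> str:
    # Combined designations like 'SG-SF' are grouped by the
    # first listed position (an assumption for this sketch).
    primary = pos.split("-")[0]
    return "BackCourt" if primary in BACKCOURT else "FrontCourt"

df = pd.DataFrame({"Pos": ["PG", "C", "SG-SF", "PF"]})
df["PosGrp"] = df["Pos"].map(pos_group)
```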

Normalization

Because the player performance metrics have widely varying ranges, each variable was normalized in two ways so they can all be compared on an even playing field.

  • Normalized against each season’s max (across all players)
  • Normalized against each player’s max (across all seasons)

The two sets of normalized variables were then used to generate new variables by taking their means.

  • SsnScr (‘season score’) — the mean of variables normalized against the season max
  • CareerScr (‘career score’) — the mean of variables normalized against the player max
  • These new variables were then normalized to produce SsnScr_norm and CareerScr_norm.

To avoid having to shift ranges due to negative values, the following variables were excluded from normalization.

['PER', 'OWS', 'DWS', 'WS', 'WS/48', 'OBPM', 'DBPM', 'BPM', 'VORP']
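The two normalizations and the derived scores can be sketched with pandas `groupby` + `transform`; the `Year` and `Player` column names are assumptions about the dataset's schema.

```python
import pandas as pd

# Variables excluded from normalization due to negative values.
SKIP = ['PER', 'OWS', 'DWS', 'WS', 'WS/48', 'OBPM', 'DBPM', 'BPM', 'VORP']

def add_scores(df: pd.DataFrame, stat_cols: list) -> pd.DataFrame:
    cols = [c for c in stat_cols if c not in SKIP]
    # Each stat divided by that season's league-wide maximum.
    ssn = df[cols] / df.groupby("Year")[cols].transform("max")
    # Each stat divided by that player's own career maximum.
    car = df[cols] / df.groupby("Player")[cols].transform("max")
    df["SsnScr"] = ssn.mean(axis=1)
    df["CareerScr"] = car.mean(axis=1)
    # Min-max normalize the new scores to the [0, 1] range.
    for col in ["SsnScr", "CareerScr"]:
        lo, hi = df[col].min(), df[col].max()
        df[col + "_norm"] = (df[col] - lo) / (hi - lo)
    return df
```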

Sample Visualizations with Normalized Variables

With normalization, we are able to generate the two heatmaps below.

The first shows the normalized season statistics for the top 10 players, as sorted by SsnScr_norm, during the 2009 season. For each statistical category, we can compare how well each player performed relative to the season’s best.

Normalized season statistics for 2009, sorted by SsnScr_norm, top 10 players only.

The second heatmap shows the normalized career statistics for a single player, James Harden, spanning his entire career. For each statistical category, we can compare how well Harden performed relative to his career best.

Normalized career statistics for James Harden, sorted by CareerScr_norm.

Model Feature Selection

We started off with 50 variables in total, 4 categorical and 46 continuous. After exploring the data and generating new variables, we are ready to select the variables that will serve as the model’s features.

First, we have our 5 derived features:

  • SsnScr, SsnScr_norm — a measure of a player’s performance for each season relative to his peers in the NBA.
  • CareerScr, CareerScr_norm — a measure of a player’s performance for each season relative to his own career best.
  • PosGrp — a grouping of the positions of NBA players contained in the Pos variable

Finally, we have our 9 variables that were excluded from normalization:

  • Player Efficiency Rating — PER
  • Win shares — WS, OWS, DWS, WS/48
  • Player ratings — BPM, OBPM, DBPM, VORP

05 Clustering Execution

Clusters generated with K-Means using random data.

K-Means Clustering

As mentioned in the title, we will be using the K-Means algorithm to group the observations into clusters. K-Means is an iterative clustering algorithm that follows the steps below to converge on a solution.

  1. Choose k points at random. These will be used as the initial centroids or cluster centers.
  2. Assign each remaining point to its nearest centroid, forming k clusters.
  3. Calculate the means of all points assigned to each centroid.
  4. Replace the old centroid values with these k means.
  5. Repeat steps 2–4 until the difference between successive results of centroid calculations is lower than a given threshold.
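The five steps above can be sketched from scratch with NumPy (in practice the article almost certainly used scikit-learn's `KMeans`, but the bare-bones version makes the iteration explicit):

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, tol: float = 1e-4, seed: int = 0):
    rng = np.random.default_rng(seed)
    # 1. Choose k points at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # 2. Assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3-4. Replace each centroid with the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])
        # 5. Stop once successive centroids barely move.
        if np.linalg.norm(new_centroids - centroids) < tol:
            return labels, new_centroids
        centroids = new_centroids
```

(This sketch omits production concerns such as empty clusters and multiple random restarts, which scikit-learn handles for you.)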

Choosing k — Elbow Method

Great, so now we know how K-Means works from a high level, but how do we know what value of k to choose?

One popular method to choose an optimal k value is the Elbow Method. It is so named because the optimal k is at the bend in the plot, which looks like a bent arm as shown below.

In other words, the optimal value of k is the value at which the inertia begins to decrease in a more linear fashion.

Inertia is the sum of the squared distances between the samples and their centroids. For the sample plot below, we would choose 3 for the value of k.

Sample elbow method plot

Re-creating the above plot with the NBA dataset we get the plot below. It is not as obvious where the optimal value is, though we can narrow it down to 5, 6, or 7.

Elbow method plot with NBA dataset
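An elbow plot is built by fitting K-Means for a range of k values and recording the inertia of each fit. A sketch on synthetic blob data (standing in for the NBA feature matrix):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: three well-separated blobs of 50 points each.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Inertia always decreases as k grows; the 'elbow' is where the
# drop flattens out -- for this toy data, at k = 3.
```

Plotting `range(1, 9)` against `inertias` (e.g. with matplotlib) reproduces the bent-arm shape shown above.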

Choosing k — Silhouette Coefficient

With the Elbow Method, we were able to narrow down the possible optimal values for k to three values, but we need to finalize on a single value.

Another means of evaluating clusters is by calculating the silhouette coefficient, which is a measure of intra-cluster data point similarity.

Silhouette coefficient

The silhouette coefficient of a data point is the difference between the mean distance from that point to the points of the nearest other cluster (bᵢ) and the mean distance from that point to the other points in its own cluster (aᵢ), divided by whichever of the two is larger: sᵢ = (bᵢ − aᵢ) / max(aᵢ, bᵢ).

The silhouette coefficient for a cluster is simply the average of the silhouette coefficients of its data points. Similarly, the silhouette coefficient for a k-cluster solution is the average over all data points.

Silhouette plots for clustering candidates identified by Elbow Method

Though the plots’ axes aren’t quite aligned, we can see that the average silhouette coefficient (the vertical red line) for the 5-cluster solution is the highest of the three. Therefore, we will use 5 clusters for the NBA dataset.
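Scoring the three candidate k values can be sketched with scikit-learn's `silhouette_score`; the synthetic five-blob data below stands in for the NBA feature matrix.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Toy data: five well-separated blobs of 60 points each.
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(60, 2))
               for c in ([0, 0], [4, 0], [2, 4], [6, 4], [4, 8])])

# Average silhouette coefficient for each candidate k; higher is better.
scores = {}
for k in (5, 6, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```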

06 Clusters Evaluation


Now that we finally have our clusters defined, we can analyze them with respect to our model features. Below we have the season and career score features grouped by their cluster designation.

With the season scores, we see 5 distinct and well-separated IQRs (the colored boxes), each associated with a different cluster. This is not the case for the career scores: 3 of the clusters share a similar range, and the other 2 are relatively close as well.

Season and career scores grouped by clusters

Using the pair plots below as a visual aid we can define the clusters as follows:

  • Cluster 0: The best of the best. G.O.A.T. candidates. The cream of the crop. Not only are these performances among the best of their respective seasons, the players in this cluster are performing at or near their career peaks.
  • Cluster 1: The players in this cluster are also performing near their career-best and above-average relative to their peers, though not as well as the elites in cluster 0 during their respective seasons.
  • Cluster 2: Though these players are performing similarly to their career-best, their best is only average relative to the rest of the league.
  • Clusters 3 & 4: The players in these clusters are performing below-average during their respective seasons and are also performing much worse than their career peaks, whether that be due to injuries, rookie years, or general decline.
Season and career scores pair plots

So, circling back to our project goals, can we answer which season and player were the greatest? Let’s find out!

Greatest Seasons

The table below shows the number of players in each cluster for a given year, sorted by cluster 0. According to this ranking, the top 3 seasons are 2008, 1997, and 2001.

Per-season cluster distribution of players
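A table like this can be built with a pandas cross-tabulation of season against cluster label; the `Year` and `Cluster` column names are assumptions about the notebook's schema.

```python
import pandas as pd

# Hypothetical per-player-season cluster labels (one row per stat line).
df = pd.DataFrame({
    "Year":    [2008, 2008, 2008, 1997, 1997, 2001],
    "Cluster": [0, 0, 1, 0, 2, 0],
})

# Count players in each cluster per season, then rank by cluster 0.
counts = pd.crosstab(df["Year"], df["Cluster"]).sort_values(by=0,
                                                            ascending=False)
```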

Another way to rank seasons is to take the average of the normalized season scores (SsnScr_norm) for each cluster, again sorting by cluster 0. This yields a different set of seasons: 1999, 1984, 1992.

Per-season average of SsnScr_norm, by cluster

Greatest Players

The table below has been sorted by the cluster 0 column to show which players have spent the most seasons in the elite cluster.

Per-player cluster distribution of seasons

Ignoring clusters altogether, the table below is sorted by the average of the season and career scores, indicating players who performed consistently excellently relative to both their peers and their own career bests.

Per-player average of normalized season and career scores

07 Final Thoughts

So, what have we learned?

First, I fully acknowledge that the results of this project will not settle any debates regarding who is the greatest player or which was the greatest NBA season. If anything, the rankings above may ignite fiery new discussions, which, as a fan of the game, I most definitely welcome.

Rather than calling the above tables “greatest” lists, we can instead be more specific in defining the rankings. Both sets of tables are a debate of quantity vs. quality:

  • Seasons: Which seasons had the most players performing at the elite level vs. which seasons had the better-performing players performing at the elite level.
  • Players: Which players spent the most seasons performing at the elite level vs. which players spent those elite seasons playing at their own career best.

Whether you agree with the rankings generated by this analysis, are preparing to report this article for inciting a riot, or are somewhere in the middle, I hope you either learned something, had a laugh, or both!

You can find the Jupyter notebook with the code and additional plots and analysis on GitHub.


Madison John

husband. father. enginerd. not necessarily in that order.