Clustering NBA Players into Offensive Roles Based on What They Did, Rather Than How They Did

An Unsupervised Machine Learning Model for Clustering and a Method to Evaluate the Quality of Clustering Heuristically

Xu Lian

19 min read · Jul 26, 2022

About the Project

Clustering: This project categorizes NBA players into various offensive roles with clustering algorithms, based on what they did on offense rather than how they performed.
Evaluation: This project also relies on a non-linear dimensionality reduction technique to evaluate the quality of clustering.
Comparison: Combined with performance-related data, this project discusses which groups of players are more valuable than their peers.
Familiarity: The features used for clustering are play-type data from Synergy. Hence, such a model could be conveyed to coaches, scouts, and executives with little difficulty thanks to the familiarity they have with the terms.
Applicability: Since Synergy labels play-type data under the same standard globally, the techniques used in this project could be applied to other leagues in the world.

Motivation

Below are some of the motivations that prompted me to start this project.

I want to build something that can classify players’ offensive roles in a non-arbitrary way. It all starts from the notion that conventional position terms (point guard/shooting guard/small forward/power forward/center) provide less information about a player than ever. Over the years, NBA basketball has evolved drastically. The increasing popularity of pick-and-roll (PnR) plays makes it easy to classify players as ballhandlers, bigs, and wings (or “guys standing in the corners”) on offense. For example, it is easier to see Jayson Tatum as the car-key holder of the Celtics’ offense than to assign him the right number (1–5), because he has had so many reps as a pick-and-roll ball handler. Therefore, I think it would be cool to have an unsupervised machine learning model that could see that as well.
I intend to build on top of the literature I have read. In the recent past, many have attempted to group players into various roles with clustering algorithms, based on stats of their choice. One great research paper, by Samuel Kalman and Jonathan Bosch, was featured at the 2020 MIT Sloan Sports Analytics Conference [1]. The methodology was nice, as they used soft labeling techniques such as the Gaussian Mixture Model to classify players. However, I have a different opinion regarding their feature selection, which I discuss further in the “Data” section. In terms of approach, I am a fan of Todd Whitehead’s work [2]. His idea of classifying players’ offensive roles based on the types of shots they took, rather than the efficiency they achieved, makes tons of sense to me. In his study, Whitehead used hierarchical clustering for classification and gave descriptive insights into players’ offensive roles, but information regarding the quality of the procedure was not disclosed. With this project, I attempt to use an alternative clustering algorithm, along with a visualization that helps to evaluate the quality of clustering heuristically.
I like the idea of having something that can be applied not only to the NBA but also to other leagues in the world. For a good number of the basketball leagues outside of the US, keeping quality data is indeed a luxury: box scores may be recorded with inaccuracies, and game logs (key ingredients for determining the actual number of possessions) may not be available in the public domain. As a result, data provided by third-party organizations, such as Synergy and InStat, could be the only resources that teams can rely on (if they are willing to pay the subscription fees). It would be great to have a way to extract more information/insights out of the data they provide.

Data

What data were used for clustering?

For clustering, I used the individual play-type data provided by Synergy.

Synergy Sports, now a division of Sportradar, provides coaches, scouts, and executives around the world with a large-scale library of basketball data that links video clips to play-type labels. Though not everyone is able to pay the subscription fees, luckily for NBA fans, the league’s play-type data are publicly available [3].

Synergy categorizes possessions into the following 11 play types:

Pick-and-roll (PnR) Ball Handler
Pick-and-roll (PnR) Roll Man
Isolation
Post-Up
Spot Up
Transition
Off Screen
Hand Off
Cut
Offensive Rebounds (Putbacks)
Miscellaneous

For this study, I took out the “Miscellaneous” category. Then, I re-calculated the weights of each player’s individual possessions on each play type.
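The re-weighting step can be sketched as follows. This is a minimal illustration rather than the project's actual code: the function name `play_type_weights`, the "Putbacks" shorthand, and the possession counts are all made up for the example. It drops “Miscellaneous” and rescales the remaining ten play types so that their weights sum to 1.

```python
import numpy as np

# The 11 Synergy play types listed above, in a fixed order.
PLAY_TYPES = [
    "PnR Ball Handler", "PnR Roll Man", "Isolation", "Post-Up", "Spot Up",
    "Transition", "Off Screen", "Hand Off", "Cut", "Putbacks", "Miscellaneous",
]


def play_type_weights(possessions, drop=("Miscellaneous",)):
    """Return (kept_names, weights); weights sum to 1 over the kept play types."""
    keep = [i for i, name in enumerate(PLAY_TYPES) if name not in drop]
    counts = np.asarray(possessions, dtype=float)[keep]
    total = counts.sum()
    if total == 0:
        raise ValueError("player has no possessions in the kept play types")
    return [PLAY_TYPES[i] for i in keep], counts / total


# Hypothetical possession counts for one player-season (made-up numbers).
example = [300, 0, 150, 20, 200, 120, 60, 50, 40, 10, 50]
names, weights = play_type_weights(example)

for name, w in zip(names, weights):
    print(f"{name:<18}{w:.3f}")
```

Because the weights are shares of a player's own possessions, two players with very different workloads but the same mix of play types end up with the same feature vector.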

Since the game of basketball is distinctly different from what it was a decade ago, it is probably not a great idea to include players from different “eras” into the same pool. Therefore, I determined a 5-year window is probably the right time frame for balancing between consistency of the style of the game and the sample size.

I included players who had a minimum of 200 possessions with a team in a season, starting from the 2017–18 regular season. After filtering, there were 1,774 qualified samples left for me to work with.

What are the benefits of using Synergy’s data?

Familiarity

Every coach is familiar with the play-type terms. They let players with shot-making ability get more iso reps, and ask players with more limited skill sets to stick to spot-up shots and look for opportunities in transition. Therefore, a clustering study based on play-type data only could lead to easier conversations with coaches, who already possess strong knowledge of the X’s and O’s.

Below is an example of shooting distributions of four players on the same team, with star players on the left (Joel Embiid and James Harden) and role players on the right (Georges Niang and Matisse Thybulle).

Shooting Distributions of Four Players | Philadelphia 76ers | 2021–22

The difference between left and right should be fairly obvious. Star players are given more opportunities to attempt high-difficulty shots, such as “Isolation”. The role players found their shots in “Spot Up” and “Transition” more often.

Below are the distributions of individual weights and points per possession for each play type. Other than “Spot Up” and “Transition”, the distributions of the other play types are right-skewed, indicating that privileges do exist for those who are deemed capable.

Applicability

Over the past years, tracking data, such as those generated by Second Spectrum, have provided much richer insights into the game than ever, such as how a team fares when it chooses to switch on pick-and-roll plays. However, the NBA, which is a revenue-generating monster in its own right, is probably the only basketball league in the world that enjoys that luxury. Most leagues outside the US just do not have the resources to enter the era of tracking data.

The good thing about Synergy is that it tags play-type data globally. A uniform standard applies to hundreds of basketball leagues. Therefore, teams with access to Synergy data can apply the same methods to their own leagues as well.

Why not use boxscore stats and other advanced metrics for clustering?

As I stated in the previous section, I have a different opinion regarding the feature selection in the paper by Kalman and Bosch, who used “a combination of advanced statistics, per 100 possession statistics, and shot distribution statistics” [1]. In the end, they chose 23 variables, including box-score stats, such as offensive/defensive rebound rate, and advanced metrics like player efficiency rating (PER), which is actually a per-minute rating.

I understand that the authors intended to analyze lineup efficiencies on both ends of the floor, so both offensive and defensive elements were needed at the clustering stage. However, I prefer not to throw offensive and defensive stats into one pot. I think it is better to classify a player’s offensive and defensive roles separately, just as we rate offensive and defensive units separately in football (the only difference being that basketball players have to play on both sides). For example, both Robert Williams III and Chet Holmgren are bigs that can drop and switch on defense. On offense, however, Williams is more of a roll-and-cut type, while Holmgren can really stretch the floor with his shooting ability. Hence, for this project, I address the classification of offensive roles only. (Side talk: In a perfect world, I would like to have two metrics that summarize a player’s offense and defense separately and add them up, instead of something like DBPM that equals BPM minus OBPM.)

In addition, many of the features used for clustering in the Kalman and Bosch paper are performance-related [1]. Therefore, the labels could be performance-dependent. For example, a player’s label could change if he experiences a decline in 3-point field goal percentage over the course of a season. That could lead to confusion during in-season meetings with coaches. If we use play-type data only, a player’s role (label) should settle relatively quickly as the number of his individual possessions increases, since a player’s skill set does not suddenly expand during the season.

Moreover, I would stay away from PER and per-possession metrics (e.g., Box Plus-Minus). Besides being performance-related, they make for a harder conversation: it is just easier to tell a coach “We put him in this bucket based on his play-type distribution” than “We put him in this bucket because his PER falls into a certain range”. I also think that introducing advanced metrics after clustering can provide quality insights. Say players in cluster A have a certain play-type distribution; we can then get an OBPM distribution (value range) for this group of players and compare it against other groups.

What are the other data sources I used?

For evaluating the quality of clustering and analyzing the differences among clusters, I scraped basic data (e.g. height and weight) and advanced metrics (e.g. usage rate and true shooting percentage) from Basketball-Reference.com.

As for the later stage of analysis regarding lineup configurations, I scraped lineup efficiency stats from CleaningTheGlass.com.

Clustering Methods

Step 1: Principal Component Analysis (PCA)

Principal component analysis was applied before clustering for the following two reasons:

PCA is a linear dimension reduction method, and dimension reduction is used to counter the curse of dimensionality [4]. In this project, I was able to reduce the number of features to seven while retaining 95% of the variance with PCA.
There are high correlations existing among some play types. That is understandable. Bigs tend to roll, cut, and make putbacks at a higher rate than smaller players. Therefore, it is better not to treat all play-types equally, especially when you run a Euclidean-distance-based clustering algorithm like k-Means.
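A minimal sketch of this step with scikit-learn is shown below. The data are a synthetic stand-in for the play-type weight matrix, and standardizing the features before PCA is my assumption here rather than something stated above. Passing a float to `n_components` asks PCA to keep the smallest number of components whose cumulative explained variance reaches that fraction.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 300 player-seasons x 10 play-type weights, each row sums to 1.
raw_weights = rng.dirichlet(alpha=np.ones(10), size=300)

# Assumption: put every play type on the same scale before PCA.
scaled = StandardScaler().fit_transform(raw_weights)

# Keep the fewest components that retain at least 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
X_pca = pca.fit_transform(scaled)

retained = pca.explained_variance_ratio_.sum()
print(f"components kept: {pca.n_components_}, variance retained: {retained:.3f}")
```

On random data like this, the number of components kept will not match the seven reported above; the point is only the mechanics of the 95% threshold. The resulting components are uncorrelated, which addresses the second reason in the list.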

Step 2: Spectral Clustering

To classify players into groups (roles), I used spectral clustering, which is a connectivity-based clustering algorithm [5]. Such an algorithm is implemented under the notion that “two houses can be called neighbors if they live close”. Compactness-based clustering algorithms, such as k-Means clustering, make a stronger assumption: “on top of the proximity to each other, two houses can be called neighbors if they share the same town center (centroid)”.

To determine the ideal number of clusters, I used the silhouette score method. For this project, I ended up with seven clusters.
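The selection of the number of clusters can be sketched as below. This is an illustration under assumptions, not the project's code: the data are synthetic blobs in seven dimensions (mirroring the seven principal components), and the `nearest_neighbors` affinity, `n_neighbors=15`, and the candidate range 2 to 8 are choices made for the example.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the post-PCA feature matrix.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6,
                  n_features=7, random_state=0)

scores = {}
for k in range(2, 9):
    model = SpectralClustering(
        n_clusters=k,
        affinity="nearest_neighbors",  # connectivity graph built from nearest neighbors
        n_neighbors=15,
        random_state=0,
    )
    labels = model.fit_predict(X)
    # Mean silhouette over all points: higher means tighter, better-separated clusters.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("best k:", best_k)
```

scikit-learn may warn that the neighbor graph is not fully connected when the synthetic blobs are far apart; the fit still completes. In practice the silhouette curve is a guide rather than a verdict, and the final choice can also weigh how interpretable the resulting roles are.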

Meanwhile, since there are players (e.g., Andre Roberson, 2017–18, Oklahoma City Thunder) who logged heavy minutes with fewer than 200 individual scoring possessions, a model that can make out-of-sample predictions is needed. A k-Means clustering model was built for the later stage of analysis regarding lineup configurations.
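The reason a second model is needed is that scikit-learn's `SpectralClustering` has no `predict` method for unseen samples, while `KMeans` does. A rough sketch, assuming synthetic data and seven clusters; `X_new` is a hypothetical stand-in for low-possession players passed through the same preprocessing:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the qualified (200+ possession) samples after PCA.
X_train, _ = make_blobs(n_samples=300, centers=7, n_features=7, random_state=0)

kmeans = KMeans(n_clusters=7, n_init=10, random_state=0).fit(X_train)

# Hypothetical out-of-sample players: here just slightly perturbed training rows.
X_new = X_train[:5] + 0.01

# Each new player is assigned to the nearest learned centroid.
new_labels = kmeans.predict(X_new)
print(new_labels)
```

New samples must go through exactly the same re-weighting and PCA transform that was fitted on the qualified players, otherwise the distances to the centroids are not comparable.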

Quality Evaluation Methods

Step 1: 2-D Plots for Certain Play-Types

One of the criteria I used for quality evaluation was an eye-ball test of whether the algorithm was able to split points into groups nicely. To be more specific, I wanted the algorithm to do a solid job on certain play types.

For example, for perimeter players, I would like to see clear lines “drawn” in skilled/high-difficulty play types such as “Isolation” and “PnR Ball Handler”. It is acceptable to see a less clear pattern in terms of “Spot Up” weights.

Step 2: t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-Distributed Stochastic Neighbor Embedding is a popular non-linear dimension-reduction technique for visualizing high-dimensional data. With t-SNE, I reduced the dimensionality of the post-PCA data to two.

Perplexity is a hyper-parameter of t-SNE that needs to be tuned at this step. It is “in a sense, a guess about the number of close neighbors each point has” [6]. I plotted ten figures with perplexity from 5 to 50, then chose the one that yields a number of visible groups similar to the number of clusters suggested by the silhouette score method.
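The perplexity sweep can be sketched as follows. The data are synthetic, and the step of 5 between perplexity values, the PCA initialization, and the fixed random seed are assumptions for the example. Each embedding would then be drawn as a scatter plot for visual comparison; plotting is left out to keep the sketch short.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic stand-in for the post-PCA data (perplexity must stay below n_samples).
X, _ = make_blobs(n_samples=120, centers=4, n_features=7, random_state=0)

perplexities = list(range(5, 55, 5))  # 5, 10, ..., 50: ten settings
embeddings = {}
for p in perplexities:
    tsne = TSNE(n_components=2, perplexity=p, init="pca", random_state=0)
    embeddings[p] = tsne.fit_transform(X)  # array of shape (n_samples, 2)

for p, emb in embeddings.items():
    print(f"perplexity={p}: embedding shape {emb.shape}")
```

Fixing `random_state` matters here because t-SNE is stochastic; without it, differences between plots could come from the random initialization rather than from the perplexity.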

t-SNE Plots with Perplexity from 5 to 50

To evaluate the quality of clustering heuristically, I finally plotted the post-t-SNE data in 2-dimensions with the tuned perplexity number, then colored the data points based on the cluster labels generated by spectral clustering. It seems most of the data points live closely with those sharing the same cluster label.
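The visual check above could be complemented with a simple number. The sketch below is not part of the original analysis: it computes, for each point in the 2-D embedding, the share of its nearest neighbors that carry the same cluster label. The function name `neighbor_agreement`, the choice of 10 neighbors, and the synthetic data are all assumptions.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the post-PCA data.
X, _ = make_blobs(n_samples=150, centers=4, n_features=7, random_state=0)

labels = SpectralClustering(
    n_clusters=4, affinity="nearest_neighbors", random_state=0
).fit_predict(X)
embedding = TSNE(
    n_components=2, perplexity=30, init="pca", random_state=0
).fit_transform(X)


def neighbor_agreement(points, labels, k=10):
    """Mean share of each point's k nearest neighbors that share its label."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(points)
    _, idx = nn.kneighbors(points)
    # Column 0 is the point itself (distance zero), so skip it.
    neighbor_labels = labels[idx[:, 1:]]
    return float((neighbor_labels == labels[:, None]).mean())


agreement = neighbor_agreement(embedding, labels)
print(f"neighbor label agreement: {agreement:.3f}")
```

A value near 1 means points mostly sit among same-labeled neighbors in the plot. It remains a heuristic, since t-SNE distorts distances and the number depends on the perplexity and on `k`.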

With the visualization generated by t-SNE, I was also able to compare traditional position labels and clustered labels side-by-side.

In addition, I can compare the results between spectral clustering and k-Means clustering, to see whether the latter produces results that are close in terms of similarity. If they are sufficiently close, then I could use the k-Means model for out-of-sample predictions in the later stage of lineup analysis.
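One way to put a number on "close in terms of similarity" is the adjusted Rand index (ARI), which compares two labelings regardless of how the cluster IDs are numbered. Using ARI is my suggestion for illustration, not necessarily the comparison made in this project, and the data below are synthetic.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the post-PCA data.
X, _ = make_blobs(n_samples=300, centers=7, n_features=7, random_state=0)

spectral_labels = SpectralClustering(
    n_clusters=7, affinity="nearest_neighbors", random_state=0
).fit_predict(X)
kmeans_labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(X)

# ARI is 1.0 for identical partitions and close to 0 for unrelated ones.
ari = adjusted_rand_score(spectral_labels, kmeans_labels)
print(f"adjusted Rand index: {ari:.3f}")
```

Where to draw the line for "sufficiently close" is a judgment call; coloring the same t-SNE plot by each set of labels shows where the two models disagree.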

Results

Cluster Overview

For NBA players, I obtained seven clusters after clustering. Below is an overview of the clusters, ordered by the average single-season OBPM within each cluster.

According to the results above, each cluster could be described as the following:

Cluster 2 (+1.91 OBPM; 197 single-season samples): Ballhandlers; Ball-dominant (high mean usage) attackers with the most reps in “Isolation” and the second-most reps in “PnR Ball Handler”.

Cluster 6 (+1.18 OBPM; 180 single-season samples): Bigs, as the high proportions in “PnR Roll Man” suggest; Skilled with the most “Post-Up” opportunities; Could shoot a bit in “Spot Up” plays.

Cluster 0 (+0.27 OBPM; 161 single-season samples): Wings; “Off Screen” specialists; Could shoot out of PnR sets as secondary ballhandlers.

Cluster 5 (-0.20 OBPM; 165 single-season samples): Bigs; Roles are limited to roll and cuts; A major portion of shots comes from putbacks; Rarely shoot.

Cluster 1 (-0.45 OBPM; 448 single-season samples): Ballhandlers; Primary PnR handlers on the court; The second-most populous cluster.

Cluster 4 (-0.64 OBPM; 43 single-season samples): Wings; “Hand Off” specialists; The least populous cluster.

Cluster 3 (-1.19 OBPM; 580 single-season samples): Wings; Roles are limited to “Spot Up” and “Transition” only; Bigs who have limited reps in shooting and PnR sets can also fall into this category; The most populous cluster.

Below are some of the insights summarizing the results above:

Ball-dominant attackers (Cluster 2) and skilled bigs (Cluster 6) are the two most coveted offensive roles. Teams are willing to feed them reps in play types in which it is more difficult to score, such as “Isolation”, “PnR Ball Handler”, and “Post-Up”.
Primary ballhandler (Cluster 1), or point guard in common literature, is a role with fierce competition. In other words, players in this cluster could be vulnerable to being replaced if they see a decline in performance.
Despite having a limited role, roll and cut bigs (Cluster 5), on average, fared just fine on the offense in terms of OBPM, compared to other action-heavy perimeter players, such as primary ballhandlers (Cluster 1).
Players who showed limited capabilities of knocking down shots in high-difficulty play types (Cluster 3) would be given limited roles on offense. Obviously, they would face heavy competition since the supply is huge with the role being basically “Spot Up” and “Transition”.

Distributions

Other than the average numbers, it is also important to check the distributions of metrics for each cluster. Such a practice could lead to more insights and serve as a way to evaluate the quality of clustering procedures.

Height & Weight

It is worth pointing out that only play-type features are used in the clustering procedures. Physical attributes such as height and weight are not included as inputs.

However, the model is able to assign players with similar heights and weights into the same buckets. Other than ball-dominant attackers (Cluster 2), it is easy to see the distribution shift from left to right as we go from the ballhandlers to the wings, then to the bigs.

Height & Weight Distribution by Cluster | NBA | 2017–2022

Points Scored Per Possession & True Shooting Percentage

To have an idea of how efficiently a player is producing points per shot attempt, true shooting percentage is one of the go-to advanced metrics to look at. Also, Synergy provides PPP (points scored per possession) that attempts to provide similar information regarding shot-making efficiency. Therefore, it would be interesting to plot distributions by cluster on these two metrics.

Based on the plots, roll and cut bigs (Cluster 5) stand out by quite a margin in these two categories. It makes sense since the majority of their shots are close to the rim.

Under the “Ballhandler” category, ball-dominant attackers (Cluster 2) fared better than the primary ballhandlers (Cluster 1). It is also understandable since Cluster 2 players 1) have the physical attributes to get shots close to the rim, and 2) have to perform to maintain their roles on offense in the long run.

PPP & TS% Distribution by Cluster | NBA | 2017–2022

Usage Percentage & Offensive Box Plus-Minus (OBPM)

Finally, let’s take a look at the distributions of usage and OBPM (one of the most popular plus-minus metrics that are publicly available) for each cluster.

Obviously, the two “ballhandler” clusters (Clusters 2 & 1) are on the higher end in terms of average usage. Players within these two groups could also reach a higher ceiling. Given the secondary ballhandling role, a small number of “Off Screen” specialists (Cluster 0) could also achieve a high number in usage. Conversely, it seems almost impossible for players with limited offensive roles (Clusters 3 & 5) to get past 30% in usage.

In terms of offensive output, ball-dominant attackers (Cluster 2) and skilled bigs (Cluster 6) are the two groups of players that could produce a bona fide star-level single-season performance (+7.0 OBPM). It is pretty challenging for players from other groups to achieve the same feat. For example, if we discard all of the single-season samples from Steph Curry, only one player in Cluster 0 managed to achieve a +5.0 OBPM season. A limited role on offense also means a limited ceiling. For example, the highest single-season OBPM for a roll and cut big is +3.3 (Clint Capela, Houston Rockets, 2018–19).

USG% & OBPM Distribution by Cluster | NBA | 2017–2022

t-SNE Visualization

Team Makeup

With the help of t-SNE visualization, it is convenient to plot a team’s roster and check the makeup of a roster at one glance. Below are the examples of the two Finals teams this year (regular season).

t-SNE Plot | Golden State Warriors | 2021–22

Individual Players

t-SNE visualizations also allow us to follow players’ career paths year-to-year. Below are three examples.

Ex. 1: Harrison Barnes

Harrison Barnes signed with the Dallas Mavericks to get an expanded role on offense after the 2015–16 season with the Golden State Warriors. He got his wish and experienced career years as he played like a Cluster 2 player.

However, once Dallas found the magic in Luka Doncic, the team acted quickly and shipped Barnes to the Sacramento Kings in the middle of the 2018–19 season. With the new team, Barnes experienced a role change, seeing his “Isolation” weights being cut by half and playing more like a skilled big (Cluster 6). Starting 2020–21, Barnes saw fewer looks in “Post-Up”, settling to a limited role (Cluster 3).

Ex. 2: Shai Gilgeous-Alexander

As one of the young risers in the league, Shai Gilgeous-Alexander saw his offensive role only expand over the past four years. His weight of shot attempts coming from “Isolation” rose from 5.2% in Year 1 with the Los Angeles Clippers to 29.0% in Year 4 with the Oklahoma City Thunder.

It is easy to identify that ascension from the t-SNE plot, too. After being traded as a part of the package for Paul George, SGA switched from Cluster 1 to Cluster 2 in his very first season with the Thunder. And he made another jump after the departure of his single-season mentor Chris Paul.

t-SNE Plot | Shai Gilgeous-Alexander | 2018–22

Ex. 3: Bruce Brown

Over the past five years, Bruce Brown probably had one of the most interesting careers in the league, and his changes in offensive roles were well recorded by the clustering model.

Brown was drafted 42nd overall by the Detroit Pistons in 2018 and played under a limited role (Cluster 3) in his first season in the NBA. He was given a larger role to run the team’s offense (Cluster 1) in Year 2, with 30+% of his shots coming out of pick-and-roll as a ball handler.

Then, the wild ride began.

Brown was traded to the Brooklyn Nets in the three-team trade on the 2020 NBA Draft night. He still participated in the PnR sets a lot with his new team, but as a roller. At 6-foot-4, Brown was inserted into the Nets squad as a short roll weapon by Steve Nash during the 2020–21 season. He rolled more often than the league’s bigs (12.6% of his shot possessions came from rolling), and he went on a roll. That switch did not go unnoticed by the model, which did not include height as a feature for clustering, as Brown’s 2020–21 season campaign was assigned to the roll and cut “island” (Cluster 5) of the t-SNE plot.

Limitations

Of course, there is room to improve.

First, despite t-SNE agreeing on the majority of the labeling work done by my clustering model, there are a few cases where they had different opinions.

For example, Zion Williamson’s lone All-Star season (2020–21) was classified as a ball-dominant attacker (Cluster 2) by the model, but the t-SNE plot suggested he still belongs in the group of skilled bigs (Cluster 6).

From a macro perspective, there is a part of the “common wing (Cluster 3)” group that seems to be closer to the two “big” clusters (5 & 6).

Use Wendell Carter Jr. as an example. His 2021–22 campaign was assigned to Cluster 3 by the clustering model but seems to still live in the “big” neighborhood (circled in red) according to the t-SNE plot.

t-SNE Plot | Wendell Carter Jr. | 2019–22

It is definitely worth exploring in the future, since I also found similar patterns in other leagues. Below are examples from the Euroleague and the NBL during the same period.

t-SNE Plots | Left: Euroleague | Right: NBL (Australia) | 2017–22

Second, models could be improved with more features added.

In this project, all the features used are shot-attempt-related possessions. Some players, who actually are heavily involved in the pick-and-roll offense as ball handlers, would not be classified as ballhandlers (Cluster 1 & 2) since they either chose or were forced to pass out the ball after the picks more often than their peers.

For example, Steph Curry was the №1 pick-and-roll initiator for the 2017–18 Warriors with 818 possessions, according to Synergy. However, he either attempted a shot or committed a turnover in only 414 of those possessions, roughly half. That phenomenon likely came from other teams’ commitment to either trap or switch on him in hopes of taking away his lethal pull-ups.

The ideal solution would be to have Synergy add a stat that shows the share of all PnR actions, while a player is on the floor, that he starts himself. However, that would probably require many hours of work from Synergy since it runs operations globally (all leagues are treated in the same way). A more straightforward solution could be to add a stat that shows how many times a player either shoots or commits a turnover out of all the pick-and-roll actions he starts, given that Synergy already has the numerators and denominators.
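The more straightforward stat is just a ratio, illustrated below with the Curry numbers quoted above. The function name `shot_or_turnover_rate` is made up for the example; Synergy does not publish a stat under that name as far as I know.

```python
def shot_or_turnover_rate(ended_by_player, initiated):
    """Share of initiated pick-and-rolls that end with the initiator's shot or turnover."""
    if initiated <= 0:
        raise ValueError("initiated must be positive")
    if not 0 <= ended_by_player <= initiated:
        raise ValueError("ended_by_player must be between 0 and initiated")
    return ended_by_player / initiated


# Steph Curry, 2017-18: 818 pick-and-rolls initiated, 414 ended with his shot or turnover.
curry_rate = shot_or_turnover_rate(414, 818)
print(f"{curry_rate:.1%}")  # 50.6%
```

A low rate alongside a high number of initiated pick-and-rolls would flag players like Curry, who run the action often but are frequently forced to give the ball up.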

Conclusions

This project demonstrated a method to classify and analyze the offensive roles of the NBA players over the past five seasons.

With PCA and Spectral Clustering, I was able to categorize NBA players from 2017–22 into seven clusters/roles based on play-type features only. To evaluate the quality of clustering, I used t-SNE to visualize the results in 2-dimensions and found that most of the data points live close to those with the same cluster assigned.

After clustering, I combined the results with performance-related metrics, which allowed me to make comparisons among the clusters. Ball-dominant attackers (Cluster 2) and skilled bigs (Cluster 6) brought more value to NBA teams, on average. With opportunities limited to “Spot Up” and “Transition”, the common wing (Cluster 3) is a role with the most fierce competition due to a surplus in supply, and the ceiling of this role is capped as well.

Because only play-type data are used for clustering, and Synergy labels its data under the same guideline globally, the methods introduced in this project could be applied to other leagues in the world. Thanks to their familiarity with the terms, coaches, scouts, and executives should have little trouble gaining information/insights from this project.

References

[1]: Samuel Kalman and Jonathan Bosch. NBA Lineup Analysis on Clustered Player Tendencies: A new approach to the positions of basketball & modeling lineup efficiency
https://www.sloansportsconference.com/research-papers/nba-lineup-analysis-on-clustered-player-tendencies-a-new-approach-to-the-positions-of-basketball-modeling-lineup-efficiency

[2]: Todd Whitehead. Nylon Calculus: Ranking the best and worst scorers in every offensive role
https://fansided.com/2017/08/09/nylon-calculus-ranking-best-worst-scorers-every-offensive-role/

[3]: NBA Advanced Stats
https://www.nba.com/stats/players/transition/?SeasonType=Regular%20Season

[4]: Tony Yiu. The Curse of Dimensionality
https://towardsdatascience.com/the-curse-of-dimensionality-50dc6e49aa1e

[5]: Neerja Doshi. Spectral Clustering
https://towardsdatascience.com/spectral-clustering-82d3cff3d3b7

[6]: Tyler Folkman. Why You Are Using t-SNE Wrong
https://towardsdatascience.com/why-you-are-using-t-sne-wrong-502412aab0c0

Thank you for reading! I hope you enjoyed the article.

Here is a link to the GitHub repo of this project.

You can also find me on LinkedIn if you would like to discuss this project or just want to have a chat about basketball.