Clustering NBA Players Using K-Means

Published in Nerd For Tech · 8 min read · Mar 26, 2021

Basketball today has changed completely from a decade ago. The game is more versatile, with more versatile players and positions. The old style of play focused on synchronized rotation and player movement, while in today's game players run faster and are more agile. Rotation is still there, but at a different speed of movement. There are also far more 3-point shots in today's game.

The development of the game has also affected other areas, especially players' positions. Basketball has five conventional positions: point guard (PG), shooting guard (SG), small forward (SF), power forward (PF), and center (C). Center and power forward are usually filled by big men who are highly skilled at rebounding. In today's game, the big men are also capable shooters, even from 3-point range. They do not always need to stay in the paint; they can easily step out, ask for the ball, and shoot a 3-pointer. So the restrictions of each position have changed completely.

Interestingly, Brad Stevens, the Celtics head coach, has said something similar. According to Bleacher Report, Stevens said that the Celtics no longer play with five positions. Instead, the team now plays with three: a ball-handler, a wing, and a big.

I don’t have the five positions anymore. It may be as simple as three positions now, where you’re either a ball-handler, a wing or a big. It’s really important. We’ve become more versatile as the years have gone on. — Brad Stevens, per Kareem Copeland of the Associated Press

The Golden State Warriors also brought a new style of play with their small-ball lineups built around Steph Curry and Klay Thompson. Their superstar Steph Curry is considered a point guard, but he is also a highly skilled scorer, like Klay Thompson and Kevin Durant.

This phenomenon caught my interest. Does Brad Stevens's statement make sense, or is he just bluffing? As a data scientist, I wanted to analyze it from a data perspective: how do Brad's new basketball positions look when interpreted through statistics?

From another perspective, this article also shows a simple example of how to apply data science in sports, especially basketball.

To meet this objective, I used an unsupervised machine learning approach: clustering. Since the dataset does not contain any labels, clustering is the most suitable method. You may have heard of clustering methods such as KMeans, hierarchical clustering, or Density-Based Spatial Clustering of Applications with Noise (DBSCAN). I chose KMeans because the algorithm is quite simple and fast.

Brad's statement was made in July 2017, so the best statistics for the objective are those from around that time. Luckily, a dataset of NBA statistics for the 2017–2018 season has been published on Kaggle. You can also access the dataset here.

The dataset contains a total of 59 columns, or attributes. However, not all of them were used to train the KMeans algorithm. I manually chose the attributes that I think are the most representative. In general, these attributes describe either the offensive or the defensive skill of a basketball player.

The offensive attributes are Field Goals Made (FG), Field Goal Attempts (FGA), Field Goal Percentage (FG%), 3-Point Shots Made (3P), 3-Point Shot Attempts (3PA), 3-Point Percentage (3P%), 2-Point Shots Made (2P), 2-Point Shot Attempts (2PA), 2-Point Percentage (2P%), Free Throws Made (FT), Free Throw Attempts (FTA), Free Throw Percentage (FT%), Points (PTS), Assists (AST), and Offensive Rebounds (ORB).

The defensive attributes are Defensive Rebounds (DRB), Blocks (BLK), Steals (STL), and Turnovers (TOV). Based on both the offensive and defensive attributes, the KMeans algorithm tries to separate the NBA players into 3 groups.
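The attribute selection can be sketched in a few lines of pandas. This is only an illustration: the column names follow the abbreviations listed above, and whether the Kaggle file uses exactly these headers is an assumption. The helper name `select_features` and the toy frame are hypothetical.

```python
import pandas as pd

# Abbreviations as listed in the article; the real file's headers may differ.
OFFENSIVE = ["FG", "FGA", "FG%", "3P", "3PA", "3P%", "2P", "2PA", "2P%",
             "FT", "FTA", "FT%", "PTS", "AST", "ORB"]
DEFENSIVE = ["DRB", "BLK", "STL", "TOV"]
FEATURES = OFFENSIVE + DEFENSIVE


def select_features(stats: pd.DataFrame) -> pd.DataFrame:
    """Keep only the clustering attributes and fail loudly if one is missing."""
    missing = [col for col in FEATURES if col not in stats.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return stats[FEATURES].copy()


# Toy one-row frame with an extra, unused column ("Tm" is a hypothetical name).
demo = pd.DataFrame([[1.0] * len(FEATURES) + ["BOS"]], columns=FEATURES + ["Tm"])
selected = select_features(demo)
```

Checking for missing columns up front makes a header mismatch obvious before any clustering is attempted.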

Before applying KMeans clustering, I first pre-processed the dataset to remove unused data. The pre-processing steps are summarized in the picture below.

The first pre-processing step, selecting the attributes, was explained above. After that, I aggregated the dataset by player. This was done because some players played for more than one NBA team in a single season and, as a result, have multiple rows of performance statistics.
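A minimal sketch of this aggregation, assuming the file holds one row of season totals per player and team. The column names `Player`, `Tm`, and `MP` and the numbers are made up for illustration, and the article does not say exactly how the rows were combined; summing the counting stats and recomputing percentages from the totals is one reasonable choice.

```python
import pandas as pd

# Hypothetical rows: player "A" was traded mid-season and appears twice.
rows = pd.DataFrame({
    "Player": ["A", "A", "B"],
    "Tm": ["BOS", "LAL", "GSW"],
    "MP": [600, 700, 2000],
    "FG": [50, 120, 450],
    "FGA": [100, 400, 1000],
})

# Sum the counting stats so each player ends up with a single row.
counting = ["MP", "FG", "FGA"]
per_player = rows.groupby("Player", as_index=False)[counting].sum()

# Recompute the percentage from the summed totals instead of averaging the
# per-team percentages, which would ignore how many shots each stint had.
per_player["FG%"] = per_player["FG"] / per_player["FGA"]
```

For player "A" the recomputed FG% is 170 / 500 = 0.34, while a plain average of the two per-team percentages (0.5 and 0.3) would give 0.4.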

Lastly, I removed players who did not play many minutes in the season, because I want to focus on NBA players who play regularly. Players who did not play more than 25% of the total games in a season were removed from the dataset. As a rough approximation, I only kept players who played more than 1,000 minutes in the 2017–2018 season.
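The minutes filter might look like the sketch below. The column name `MP` and the sample values are assumptions. One possible reading of the 25% figure, which is also an assumption, is a quarter of the regulation minutes in an 82-game season: 0.25 × 82 × 48 = 984, which is close to the 1,000-minute cutoff.

```python
import pandas as pd

MIN_MINUTES = 1000  # cutoff stated in the article

# One possible reading of "25%": a quarter of 82 games x 48 regulation minutes.
quarter_of_season = 0.25 * 82 * 48

# Hypothetical per-player minute totals.
players = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "MP": [350, 1000, 1001, 2900],
})

# Keep only players with strictly more than the cutoff.
regulars = players[players["MP"] > MIN_MINUTES].reset_index(drop=True)
```

Note that the comparison is strict, so a player with exactly 1,000 minutes is dropped in this sketch.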

Below is a comparison of histograms of total minutes played for the whole dataset versus the dataset after removing those players. The left chart shows a significant spike for players with fewer than 1,000 total minutes. After removing those rows, I get the right chart, which looks closer to a Gaussian distribution, even if it is slightly tilted to the left. KMeans tends to be more suitable for data that is roughly Gaussian.
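A numeric companion to the histograms is the sample skewness, which is near zero for a symmetric distribution. The sketch below uses made-up minute totals, not the real dataset, and only illustrates how such a check could be done with pandas.

```python
import pandas as pd

# Made-up totals: a cluster of low-minute players plus a group of regulars.
minutes = pd.Series([100, 150, 200, 250, 300, 1200, 1600, 2000, 2400, 2800])
regulars = minutes[minutes > 1000]

# Series.skew() returns the sample skewness; 0 means symmetric,
# a positive value means a longer tail on the right.
skew_all = minutes.skew()
skew_regulars = regulars.skew()
```

In this toy sample the full list is positively skewed, while the filtered list is symmetric around 2,000 minutes and has a skewness of zero.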

In addition, here are the scatter and histogram plots of each offensive attribute in the dataset. Most of the attributes also appear to have a Gaussian distribution, except for 3P%, FT%, and AST.

The Distribution of Offensive Attributes

Below are the scatter and histogram plots of the defensive attributes in the dataset. As with the offensive attributes, most of them have a Gaussian distribution, except for DRB and BLK.

The Distribution of Defensive Attributes

You can check the correlation between each pair of attributes below.
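A correlation table like this can be produced with `DataFrame.corr()`, which computes pairwise Pearson correlations by default. The three columns and their values below are invented to keep the example small.

```python
import pandas as pd

# Invented values: PTS is exactly 1.2 x FGA, BLK is unrelated to shooting volume.
demo = pd.DataFrame({
    "FGA": [5, 10, 15, 20],
    "PTS": [6, 12, 18, 24],
    "BLK": [4, 1, 3, 2],
})

# Square, symmetric matrix with 1.0 on the diagonal.
corr = demo.corr()
```

Highly correlated pairs, such as attempts and points in this toy table, carry much of the same information, which is worth keeping in mind when many shooting attributes are fed into a distance-based algorithm.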

The next step is applying the KMeans clustering algorithm. This algorithm requires us to define the number of clusters. To do that, I used the Elbow Method, a heuristic for determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use. Below is the graph of the Elbow Method's result.

Elbow Method Graph to define the Number of Clusters

The proper number of clusters is the point where the line in the graph starts to flatten. In the graph above, the line starts to flatten at either 2 or 3 clusters. I chose 3 as the number of clusters for the KMeans algorithm, since this is in line with Brad's statement.
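The elbow idea can be sketched with scikit-learn. This example uses the within-cluster sum of squares (`inertia_`) as the quantity on the y-axis, which is a common choice but an assumption about what the graph above shows. The data is three synthetic, well-separated blobs rather than the NBA dataset, and the article does not say whether the features were scaled before clustering, so no scaling is applied here.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit KMeans for k = 1..6 and record the within-cluster sum of squares.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

# How much each extra cluster reduces the inertia; the elbow is where
# this improvement suddenly becomes small.
drops = {k: inertias[k - 1] - inertias[k] for k in range(2, 7)}
```

On this toy data the improvement from 2 to 3 clusters is large and the improvement from 3 to 4 is tiny, so the elbow sits at 3. On real player statistics the bend is usually less sharp, which is why the choice between 2 and 3 above involves some judgment.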

Here are the scatter plots of the attributes, grouped by the KMeans cluster result.

KMeans Cluster Result seen from the Offensive Attributes

KMeans Cluster Result seen from the Defensive Attributes
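Assigning each player to one of the 3 clusters could look like the sketch below. The four columns and the three player "profiles" are invented for illustration, and `n_init` and `random_state` are arbitrary choices, not settings taken from the article.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Three invented player profiles (PTS, AST, DRB, BLK), 20 noisy players each.
rng = np.random.default_rng(1)
profiles = np.array([
    [12.0, 2.0, 4.0, 0.5],
    [24.0, 7.0, 5.0, 0.5],
    [10.0, 1.5, 9.0, 1.8],
])
X = np.vstack([p + rng.normal(scale=0.3, size=(20, 4)) for p in profiles])
columns = ["PTS", "AST", "DRB", "BLK"]
players = pd.DataFrame(X, columns=columns)

# Fit KMeans with 3 clusters and store the label of each player.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
players["cluster"] = kmeans.fit_predict(players[columns])
```

The label numbers themselves are arbitrary: which group ends up as 0, 1, or 2 depends on the initialization, so the clusters have to be interpreted from their statistics rather than from their IDs.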

The results are quite similar across attributes, both offensive and defensive. The final step is interpreting each KMeans cluster. You may want to check this bar graph first.

The Comparison of Each Cluster by Attributes
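A per-cluster comparison like this bar graph can be derived by averaging each attribute within a cluster. The labeled rows below are invented and are not the real cluster statistics; they only show the mechanics.

```python
import pandas as pd

# Invented players that already carry a cluster label.
labeled = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, 2],
    "PTS": [900, 1100, 1900, 2100, 700, 900],
    "AST": [150, 250, 500, 600, 80, 120],
    "DRB": [350, 450, 300, 400, 600, 700],
    "BLK": [40, 60, 30, 50, 110, 130],
})

# Mean of every attribute per cluster: one row per cluster.
profile = labeled.groupby("cluster").mean()

# For each attribute, the cluster with the highest mean.
leaders = profile.idxmax()
```

Calling `profile.plot.bar()` on a table like this would give one group of bars per cluster, provided matplotlib is installed.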

Roughly, the players in cluster-1 overpower the other clusters. Let's take a more detailed look.

Interpretation Cluster-0

From the graph above, the players in cluster-0 have quite good numbers in ORB, DRB, BLK, and PTS. They still trail the players in cluster-1, but the difference is not that large. I assume the players in this cluster are quite skilled at scoring and also have a good sense for defense, since their ORB and DRB numbers are not bad. Overall, these are players who are good on both offense and defense. Let's see some players in this cluster with their statistics.

Interpretation Cluster-1

You may also notice from the graph above that the players in this cluster have exceptional numbers in some of the offensive attributes. They are the best scorers in the league and create the most shot attempts, both 2PA and 3PA. I assume these players are the floor generals of their teams. They create chances and sometimes finish them as well. They are not typically big men, since their ORB and DRB numbers are below those of the cluster-0 players. Some players in this cluster are:

This makes sense, since the cluster includes famous names like LeBron James and Russell Westbrook, whose skills match the characteristics of cluster-1 perfectly.

Interpretation Cluster-2

Lastly, there are the players in cluster-2. These players have the highest numbers in ORB, DRB, and BLK. Clearly, this group is mostly filled with big men. I assume they are the best defenders in the league, the ones who protect the rim. Let's take a look at some players from this group.

The results of the KMeans clustering and their interpretation are quite interesting. To some extent, they reflect Brad's statement. Based on Brad Stevens's statement and our KMeans result, ball-handlers appear to correspond to the players in cluster-1. This kind of player is the floor general type, able to distribute the ball and to score as well. Wings correspond to the players in cluster-0, since these players are more versatile and can play on both offense and defense. Cluster-2 represents the bigs in Brad Stevens's scheme. These players have the highest defensive awareness, with an exceptional number of blocks.

I am personally satisfied with the result. Even though KMeans is one of the simplest algorithms, the result gives us a fair understanding of the objective. I hope this article was entertaining and also gave a brief example of how to apply data science to basketball.

Thanks!

Written by Rio Rizki Aryanto