Clustering NBA Players using Python, Scikit-Learn, Pandas

Wilson Xie
Data Science Student Society @ UC San Diego
May 26, 2020

People say the NBA has changed. You might have heard phrases like “the players are soft nowadays”, “everybody just shoots 3s”, or “too many fouls”. But did the game really change? To answer this question, I developed a clustering method that divides NBA players into categories for two seasons a decade apart: 2010 and 2020. Doing this lets us compare player types and see whether they changed between the two years. To cluster the players, I use two unsupervised machine learning techniques, PCA and K-means clustering, to re-categorize NBA players into six different roles. Why six? You’ll see.

This article assumes elementary knowledge about PCA and K-means clustering. If you are not familiar with PCA, take a look at this article for an overview of the algorithm. If you are not familiar with clustering, also check out this article for a walk-through of hierarchical clustering.

Table of Contents

Imports

Data Cleaning

Modeling

Analysis

Conclusion

Imports

All the imports

In this project, we use Python in a Jupyter Notebook. We use NumPy and Pandas for data wrangling and Matplotlib to visualize the results. Lastly, we use scikit-learn (Sklearn) for PCA and K-means clustering.
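As a rough sketch, the imports for a notebook like this could look as follows. The exact list is an assumption based on the libraries named above; StandardScaler in particular is an assumed choice for the normalization step.

```python
import numpy as np                  # numerical arrays
import pandas as pd                 # loading and wrangling the tables
import matplotlib.pyplot as plt     # plotting the results

# scikit-learn pieces used later in the workflow
from sklearn.preprocessing import StandardScaler   # assumed choice for normalization
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
```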

NBA Player Data from 2020, PerGame vs Advanced

In this tutorial, we use NBA player data from 2010 and 2020, taken from basketball-reference.com. The reason is that basketball-reference.com already publishes the data as HTML tables that load directly into a DataFrame, so there is no need for custom web scraping. However, if you want to learn about web scraping, check out this article on web scraping.

So, how do we import our data? Since the data is already in table form, we can just use pd.read_html. I use two datasets: one shows the players’ stats per game, and the other shows the advanced stats per player. The advanced stats describe players with more complicated measures such as True Shooting Percentage (TS%) and Usage Rate (USG%).
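Here is a minimal sketch of how the loading step might look. The URL pattern and the helper names `season_url` and `load_season_table` are assumptions for illustration, not the code used in this article. `pd.read_html` returns a list of DataFrames, one per table found on the page.

```python
import pandas as pd

# Assumed URL pattern for the season pages on basketball-reference.com.
BASE_URL = "https://www.basketball-reference.com/leagues"


def season_url(year: int, table: str) -> str:
    """Build the page URL for one season; `table` is 'per_game' or 'advanced'."""
    return f"{BASE_URL}/NBA_{year}_{table}.html"


def load_season_table(year: int, table: str) -> pd.DataFrame:
    """Download one stats page and return its first HTML table as a DataFrame."""
    # pd.read_html parses every <table> on the page into a list of DataFrames.
    tables = pd.read_html(season_url(year, table))
    return tables[0]


# Usage (needs network access and an HTML parser such as lxml):
# per_game_2020 = load_season_table(2020, "per_game")
# advanced_2020 = load_season_table(2020, "advanced")
```

The downloaded tables may contain repeated header rows or several rows for a player who changed teams, so it is worth inspecting the result before moving on.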

Data Cleaning

Data Cleaning Process

The next step is to clean the data. Here is the summary of my data cleaning.

● Concatenating two datasets (players’ stats per game & advanced stats per player)

● Dropping columns that are not useful for clustering (rank, age, etc.)

● Filtering players (avg. 16 minutes or above)

● Normalizing the data
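The four steps above can be sketched on a tiny made-up table. The column names follow basketball-reference conventions (Rk, Age, MP), the numbers are invented, and z-score scaling is an assumption since the exact normalization is not specified here. The sketch also assumes player names are unique.

```python
import pandas as pd

# Tiny made-up stand-ins for the per-game and advanced tables (not real stats).
per_game = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Rk": [1, 2, 3, 4],
    "Age": [25, 31, 22, 28],
    "MP": [34.0, 12.5, 20.0, 16.0],
    "PTS": [25.0, 3.0, 9.0, 11.0],
})
advanced = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "TS%": [0.60, 0.48, 0.55, 0.57],
    "USG%": [30.0, 12.0, 18.0, 20.0],
})


def clean(per_game, advanced, min_minutes=16.0):
    # 1. Concatenate the two tables side by side, aligned on the player name.
    merged = pd.concat(
        [per_game.set_index("Player"), advanced.set_index("Player")], axis=1
    )
    # 2. Drop columns that do not describe how a player plays.
    merged = merged.drop(columns=["Rk", "Age"])
    # 3. Keep players averaging at least `min_minutes` per game.
    merged = merged[merged["MP"] >= min_minutes]
    # 4. Normalize each column to zero mean and unit variance (assumed scaling).
    return (merged - merged.mean()) / merged.std(ddof=0)


cleaned = clean(per_game, advanced)
print(cleaned.round(2))
```

Scaling matters here because PCA and K-means are both sensitive to the units of each column; without it, high-magnitude stats such as minutes would dominate.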

Cleaned and Normalized DataFrame

Modeling

To apply PCA on the data, I first kept as many components as there are columns, because each column represents an important measurement of a player. The reason for doing PCA this way first is to calculate the explained variance ratio: the share of the total variance captured by the components we keep. The higher the ratio, the more information is retained. We want a ratio close to 100%, but not exactly 100%, because keeping every component performs no reduction at all and risks fitting noise. Based on the elbow graph below, a number of components between 10 and 15 seems reasonable. So, I will use 15 as the component number.
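To make the variance ratio concrete, here is a small sketch on synthetic data standing in for the normalized player table. The 0.95 threshold is an assumption used for illustration; in this article the number of components is read off the elbow graph instead.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 300 "players" and 20 "stats" driven by 5 hidden factors,
# so the columns are correlated like real box-score stats tend to be.
latent = rng.normal(size=(300, 5))
X = latent @ rng.normal(size=(5, 20)) + 0.3 * rng.normal(size=(300, 20))

# First pass: keep every component to see how the variance is distributed.
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the target.
# The 0.95 target is an assumed threshold, not a value from the article.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_keep, round(float(cumulative[n_keep - 1]), 3))

# Second pass: project the data onto the chosen number of components.
X_reduced = PCA(n_components=n_keep).fit_transform(X)
```

Plotting `cumulative` against the component index gives the kind of elbow curve used to pick the component count.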

PCA result

After PCA, the next step is to cluster the data points. We use Sklearn again, this time for K-means clustering. I decided to use six clusters based on the silhouette scores, which I computed with the silhouette-score code from THE VI5ION.

Silhouette Score for 2010 Data
Silhouette Score for 2020 Data

In the second graph, the number of clusters with the highest score (the highest y-value) would normally be the one to choose. Although two and three clusters score highest, it would not be realistic to separate 375 players into only two or three groups. Instead, I will use six clusters for both datasets since their scores are still high.
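A silhouette comparison of this kind can be sketched as follows. This is not the code from THE VI5ION; it is a minimal version using scikit-learn's `silhouette_score` on synthetic blobs that stand in for the PCA-reduced player data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the PCA-reduced player data.
X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=0)

scores = {}
for k in range(2, 11):
    # Fit K-means with k clusters and score how well separated the clusters are.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")

# The purely numerical winner; domain knowledge may still justify another k.
best_k = max(scores, key=scores.get)
```

The silhouette score ranges from -1 to 1, and higher values mean tighter, better separated clusters. As in the reasoning above, the top-scoring k is a starting point rather than a rule.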

Here is what the graph looks like after the player datasets have been reduced to a lower-dimensional space and clustered.
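A plot like this can be produced roughly as follows. The data is random noise standing in for the normalized stats (375 rows to match the player count mentioned earlier), so the picture will not resemble the real clusters; the point is only the sequence of calls.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(375, 20))   # synthetic stand-in for the normalized stats

# Reduce to 15 components, then cluster into six groups.
X_reduced = PCA(n_components=15).fit_transform(X)
kmeans = KMeans(n_clusters=6, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_reduced)

# Plot only the first two principal components; the clustering used all 15.
fig, ax = plt.subplots()
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=labels, cmap="tab10", s=15)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_title("K-means clusters (synthetic data)")
```

In a notebook the figure renders automatically; in a script, call `plt.show()` afterwards. Because the plot shows two of fifteen dimensions, clusters that overlap on screen may still be well separated in the full space.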

Clusters for 2010 and 2020 Players

Analysis

Code Example

Now, there are six clusters of players for both 2010 and 2020, but the cluster types are not all the same. Let us break down the data based on the mean of each statistic for each cluster.
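A per-cluster breakdown like this is a one-line groupby in Pandas. The sketch below uses a made-up table of normalized stats with an assumed `cluster` column holding the K-means labels.

```python
import pandas as pd

# Made-up normalized stats for five players, with their K-means labels.
stats = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1],
    "BLK": [1.5, 0.5, -0.5, -1.0, 0.0],
    "3P": [-1.0, -1.0, 1.0, 0.5, 0.0],
})

# Mean of every statistic within each cluster.
cluster_means = stats.groupby("cluster").mean()
print(cluster_means)
```

With z-scored columns, a positive cluster mean reads as "above league average" for that stat and a negative one as "below average", which makes it easier to give each cluster a name.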

Proportion of NBA Players
Proportion of NBA Players in Pie Chart, 2010 vs 2020
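The proportions behind charts like these can be computed with `value_counts(normalize=True)`. The labels below are invented for illustration.

```python
import pandas as pd

# Invented cluster labels for ten players in each season.
labels_2010 = pd.Series([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
labels_2020 = pd.Series([0, 1, 1, 1, 1, 2, 2, 2, 2, 2])

# Share of players in each cluster, one column per season.
share = pd.DataFrame({
    "2010": labels_2010.value_counts(normalize=True),
    "2020": labels_2020.value_counts(normalize=True),
}).sort_index()
print(share)
```

One caveat: K-means assigns cluster numbers arbitrarily, so cluster 0 in one season is not automatically the same player type as cluster 0 in another. The clusters have to be matched by inspecting their statistics before the shares are compared.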

Player Type 1: Center

2010
2020

I would categorize these players as centers because they protect the rim quite often. They have a high shooting percentage, but they don’t usually play a big role in the offense; their skills show on defense. Grabbing rebounds and blocking shots are what they’re good at. They’re the “board man” players. In 2010, these rim protectors made up about 12% of the league. This year, they make up 8%. As you can see, there is a major decrease in traditional centers in the league today. This is because teams today build their offense around shooting 3-pointers and driving to the basket. Centers like these would take up space in the interior, leaving less room for rim slashers or for pick & rolls with shooters.

Player Type 2: 3D/Role Player

2010
2020

These are players who do not score often, but they are excellent at catch & shoot as well as at stopping the opponents’ offense. Thus, they’re called 3D, short for “3-and-D” (3-point shooting and defense). The proportions of 3D players in 2010 and 2020 are about the same, meaning that these players are still essential in the league. They are crucial in part because they don’t have a high usage rate: they rarely need the ball on offense, which lets the star players on the team score comfortably.

Player Type 3: 3-point Sharpshooters

2010
2020

This cluster contains the most players of all clusters in both years. They are 3-point sharpshooters. Their job is to make as many 3-pointers as possible in a game. What distinguishes them from 3D players is their role on defense. Since their main skill seems to be shooting from downtown, they are not so good at defending, and their defensive statistics are all below average. These shooters generally make more 3s than 3D players, but they allow opponents to score more easily.

Player Type 4: All-stars

2010
2020

This is the elite group. They are the stars, the efficient scorers, and the leaders. They usually attack the rim or pull up in the mid-range. Comparing the stars of 2010 and 2020, the proportion is about the same. This cluster does not have any particular stats to talk about since everyone knows their games. However, I do want to point out that LeBron James is in this cluster in both years.

Player Type 5: Shot Creators

2010
2020

These are the scorers on a team. Some of them are their team’s star players, and all of them contribute good offense. They can shoot 3s and 2s, and their percentages are quite good. They’re very well-rounded, and teams always need them right next to the star player. When a star player is struggling in a game, these players step up and help the team.

Player Type 6: Power Forwards and 3-point Shooters

Here comes the tricky part. In the 2010 graph, there is a cluster called “Power Forward”. This group contains players like Tim Duncan, Kevin Garnett, and Chris Bosh. They can play like centers, but they are also great on defense. They’re the traditional power forwards, posting up, dunking, and staying mostly near the paint. Looking at the 2020 graph, the power-forward group is missing; it is replaced by a group of less efficient players. Their stats are below the league average, and their only way of scoring seems to be shooting 3-pointers. However, their 3-point percentage is not as good as that of the 3-point sharpshooters.

In this player group, about 66% of players are wing players in the 2010 data, and about 90% of players are big men in the 2020 data. Based on our clustering result, the only change in player types from 2010 to 2020 is this group of players. We could say that today’s NBA teams favor an extra wing player in the lineup over a power forward.
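Position shares within a cluster can be checked with a normalized cross-tabulation. In this sketch the players are invented, and the `GROUP` mapping from listed positions to guard, wing, and big is a hypothetical grouping rather than the one used for the numbers above.

```python
import pandas as pd

# Invented players with their cluster label and listed position.
players = pd.DataFrame({
    "cluster": [5, 5, 5, 4, 4],
    "Pos": ["SF", "SG", "PF", "C", "PG"],
})

# Hypothetical grouping of listed positions into broader roles.
GROUP = {"PG": "guard", "SG": "wing", "SF": "wing", "PF": "big", "C": "big"}
players["group"] = players["Pos"].map(GROUP)

# Each row sums to 1: the share of every role within that cluster.
mix = pd.crosstab(players["cluster"], players["group"], normalize="index")
print(mix.round(2))
```

This requires keeping the position column alongside the cluster labels, even if it is excluded from the features used for clustering.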

2010
2020

Conclusion

Using PCA and K-means clustering, I was able to distinguish players in the league and categorize them. I applied this method to data from 2010 and 2020, which allowed me to compare players from two seasons a decade apart. In today’s NBA, players mostly fit the same archetypes as before. In both years, there are similar proportions of 3D players, 3-point shooters, well-rounded scorers, and all-star players. The difference between the two years is the decrease in big men and power forwards: teams today do not need big men to protect the rim as much and rely instead on more effective shooters who can make 3-pointers and create space. Based on these findings, I would conclude that the game did change from 2010 to 2020. This clustering method could also be applied to any other NBA season, allowing us to compare players from different years.
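One way to make the method reusable across seasons is to wrap the steps in a scikit-learn pipeline. The function name `cluster_season`, its defaults, and the use of StandardScaler are assumptions for this sketch, and the data is random noise standing in for real season tables.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cluster_season(X, n_components=15, n_clusters=6, seed=0):
    """Scale, reduce with PCA, and cluster one season's stats; returns labels."""
    pipeline = make_pipeline(
        StandardScaler(),                      # assumed normalization
        PCA(n_components=n_components),
        KMeans(n_clusters=n_clusters, n_init=10, random_state=seed),
    )
    return pipeline.fit_predict(X)


rng = np.random.default_rng(0)
# Random stand-ins for three seasons: 200 "players" by 20 "stats" each.
seasons = {year: rng.normal(size=(200, 20)) for year in (2010, 2015, 2020)}
labels_by_year = {year: cluster_season(X) for year, X in seasons.items()}
```

Each season is fitted independently, so the cluster numbers still need to be matched across years by looking at the cluster means.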
