Using Machine Learning to Find the 8 Types of Players in the NBA

Published in Fastbreak Data · 10 min read · Mar 2, 2017

***What do the axes mean?*** Using Linear Discriminant Analysis, I reduced over 50 dimensions (e.g. USG%, TS%, 3P%, AST%, BLK%) into 2 dimensions (the linear discriminants), which make up the (x,y) coordinates. These 2 discriminants are linear combinations of the original dimensions, and can be plotted on an XY plane for clustering and analysis.

Introduction

Inspired by the topological data analysis of Muthu Alagappan in “From 5 to 13: Redefining the Positions in Basketball” and the “Periodic Table of NBA Elements” by Stephen Shea in Basketball Analytics: Spatial Tracking, this study proposes an alternative method for classifying players in today’s NBA.

Traditional 5 positions in basketball (https://www.myactivesg.com/sports/basketball/how-to-play/basketball-rules/basketball-positions-and-roles)

Both Alagappan and Shea presented a problem statement that still holds true today. The traditional five player positions incorrectly oversimplify the skill sets of NBA players. Simply pigeon-holing players into one of five positions does not accurately define a player’s specific skill set. Moreover, the misclassification of a player’s position may lead teams to waste resources on developing draft picks that do not fit their systems.

Coaches and scouts in the NBA already recognize players as having skills that exceed their predefined positions and have even come up with alternative position names to describe players such as the combo guard (i.e. a player that combines the attributes of a point guard and shooting guard) or the swingman (i.e. a player that can play both the shooting guard and small forward positions). Using machine learning, my goal is to uncover the positions that are intrinsic to today’s NBA players and classify players with a position that best encapsulates their skill sets.

Methods

The methods described below attempt to summarize the variety of applications and machine learning algorithms in layman’s terms. All programming was done in Python 2.7.

Step 1: Data Acquisition with Selenium

Selenium is a popular web automation application that can be used for web scraping purposes. For this study, I used Selenium to scrape Basketball-Reference.com for my data.

Can you guess who this player is? (http://www.basketball-reference.com/players/w/westbru01.html)

To best define a player, I collected each player’s career statistics from the Per-100 Possessions, Advanced Metrics and Shooting Metrics tables. Using per-game statistics (e.g. points per game) or cumulative statistics (e.g. total points) can be misleading when it comes to analysis because these statistics tend to inflate players with lengthier careers. To deal with outliers, I instituted a minimum threshold of 40 games played.
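As a rough illustration of the two ideas above, the sketch below scales a counting statistic to a per-100-possessions rate and applies the 40-game threshold. Basketball-Reference already publishes Per-100 Possessions tables, so this is not a step the study had to perform by hand; the function names, field names, and numbers are all hypothetical.

```python
def per_100_possessions(stat_total, possessions):
    """Scale a counting stat to a per-100-possessions rate."""
    if possessions <= 0:
        raise ValueError("possessions must be positive")
    return 100.0 * stat_total / possessions


def filter_by_games(players, min_games=40):
    """Keep only players who meet the minimum games-played threshold."""
    return [p for p in players if p["games"] >= min_games]


# Hypothetical career totals, made up purely for illustration.
players = [
    {"name": "Player A", "games": 210, "points": 4200, "possessions": 14000},
    {"name": "Player B", "games": 12, "points": 150, "possessions": 300},
]

eligible = filter_by_games(players)
rates = {
    p["name"]: per_100_possessions(p["points"], p["possessions"])
    for p in eligible
}
```

Player B would post an eye-catching 50 points per 100 possessions, but on only 12 games, which is the kind of small-sample outlier the threshold is meant to exclude.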

Prior to analysis, my data consisted of 547 players and 56 features (or dimensions) from 2014 to 2017. While this was definitely a small sample size, my goal was to uncover the various positions in today’s NBA rather than to compare today’s NBA players with those from stylistically different generations.

Note: The 2016–2017 data includes everything up until the NBA All-Star Break.

Step 2: Dimensionality Reduction with Linear Discriminant Analysis

As dimensions increase, the available data becomes more and more sparse.(http://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/)

In high-dimensional data, the volume of the space grows so quickly that the available data becomes sparse. This is known as the “Curse of Dimensionality” and is problematic for any method that requires statistical significance. In this study, each dimension is one of a player’s feature statistics (e.g. PER, TS%, 3P%), and in order to obtain a statistically sound result, the number of dimensions must be reduced to a small set of components.

Linear Discriminant Analysis (LDA) is a method used in statistics and machine learning to find a linear combination of features that characterizes or separates classes of objects. Put simply, LDA attempts to find a feature subspace that maximizes class separability. In this case, I used a player’s current position (i.e. point guard, shooting guard, small forward, power forward, or center) as the class label. Next, LDA found the linear combinations of features that best separated the five classes and reduced the data to two dimensions. While Principal Component Analysis (PCA) is also a method for dimensionality reduction, two components captured 71.85% of the variance with LDA but only 54.46% with PCA.
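To make the idea of "maximizing class separability" concrete, here is a minimal sketch of Fisher's LDA written with NumPy. It is not the code used in this study: the function name `lda_project`, the synthetic data, and the way the ratio is computed are all assumptions for illustration, and a library implementation may report explained variance differently.

```python
import numpy as np


def lda_project(X, y, n_components=2):
    """Project X onto the top linear discriminants (Fisher's LDA)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    n_features = X.shape[1]

    S_w = np.zeros((n_features, n_features))  # within-class scatter
    S_b = np.zeros((n_features, n_features))  # between-class scatter
    for label in np.unique(y):
        X_c = X[y == label]
        mean_c = X_c.mean(axis=0)
        centered = X_c - mean_c
        S_w += centered.T.dot(centered)
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_b += len(X_c) * diff.dot(diff.T)

    # Directions that maximize between-class scatter relative to
    # within-class scatter are eigenvectors of pinv(S_w) * S_b.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w).dot(S_b))
    eigvals, eigvecs = eigvals.real, eigvecs.real
    order = np.argsort(eigvals)[::-1][:n_components]
    W = eigvecs[:, order]
    ratio = eigvals[order] / eigvals.sum()  # share of class separation kept
    return (X - overall_mean).dot(W), ratio


# Synthetic stand-in for the player table: 5 "positions", 6 features.
rng = np.random.RandomState(0)
centers = rng.normal(scale=5.0, size=(5, 6))
X = np.vstack([c + rng.normal(size=(30, 6)) for c in centers])
y = np.repeat(np.arange(5), 30)

Z, ratio = lda_project(X, y)  # Z holds the (x, y) coordinates to plot
```

With five classes there are at most four discriminants, so keeping two of them necessarily discards some of the separation between positions.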

Step 3: Cluster the Data with KMeans Clustering

Here is some R code which generates a data set and implements the algorithm (http://rossfarrelly.blogspot.com/2012/12/k-means-clustering.html)

KMeans Clustering is a simple and popular clustering algorithm that finds the cluster centers that best represent certain regions of the data. The algorithm alternates between assigning each data point to the closest cluster center and then setting each cluster center as the mean of the data points that are assigned to it. The algorithm finishes when the assignment of instances to clusters no longer changes.
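The alternating procedure described above can be sketched in a few lines of plain Python. This is a bare-bones version for illustration, not the implementation used in the study; the function names, the random initialization from data points, and the toy coordinates are assumptions.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Alternate between an assignment step and a mean-update step."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each center to the mean of its points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


# Toy 2-D coordinates forming two obvious groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 5.2), (5.2, 4.9)]
labels, centers = kmeans(points, k=2)
```

Because the result depends on the starting centers, library implementations typically rerun the algorithm from several initializations and keep the best result.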

The decision to have 8 clusters was based on the best silhouette score, which is a measure of how similar an object is to its own cluster compared to other clusters. The silhouette score ranges from -1 to 1, where a high value indicates that an object is well matched to its own cluster and poorly matched to neighboring clusters. If the clusters have a high value, then the clustering configuration is appropriate.
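For readers who want to see what the score measures, the following is a from-scratch sketch of the standard silhouette definition: for each point, a is the mean distance to its own cluster and b is the smallest mean distance to any other cluster, and the point's score is (b - a) / max(a, b). The function names and toy data are assumptions; the study most likely relied on a library routine rather than code like this.

```python
from math import sqrt


def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # by convention a singleton cluster scores 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 5.2), (5.2, 4.9)]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])  # matches the groups
bad = silhouette_score(points, [0, 1, 0, 1, 0, 1])   # cuts across them
```

Choosing the number of clusters then amounts to running the clustering for a range of k values and keeping the k with the highest mean score.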

Step 4: Feature Extraction with Principal Component Analysis

Principal Component Analysis (PCA) is a common feature extraction method in machine learning. The algorithm finds the eigenvectors of the data’s covariance matrix with the highest eigenvalues and then uses those eigenvectors to project the data into a new subspace of equal or fewer dimensions. In feature extraction, PCA reduces the number of features by constructing a smaller number of variables that capture a significant portion of the variance found in the original features. Using PCA, I identified the most important features in order to define each cluster.
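The eigenvector description above translates fairly directly into NumPy. The sketch below is an illustration under stated assumptions, not the study's code: the function name `pca`, the synthetic data, and the idea of reading feature importance from the largest absolute loadings on the first component are one plausible interpretation of the step described.

```python
import numpy as np


def pca(X, n_components=2):
    """PCA via eigendecomposition of the covariance matrix."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order].T  # one row per principal component
    explained = eigvals[order] / eigvals.sum()
    return X_centered.dot(components.T), components, explained


# Synthetic data: feature 0 varies far more than the other three.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4)) * np.array([10.0, 1.0, 1.0, 1.0])

scores, components, explained = pca(X, n_components=2)

# The feature with the largest absolute loading drives the first component.
top_feature = int(np.argmax(np.abs(components[0])))
```

In practice the features would usually be standardized first, since otherwise a statistic measured on a larger scale dominates the components simply because of its units, as feature 0 does here.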

Step 5: Data Visualization with Tableau

Tableau is a powerful application that renders data in a clean and concise way. In the plots below, I map out the total number of clusters in the NBA and highlight each cluster in detail as well as explore some Advanced Metrics for each cluster.

Results

Note: Due to technological constraints I was unable to include a table of each cluster’s feature importance, but would be happy to provide it. Please message me at @fastbreakdata.

Defensive Centers

Notable Players (in no particular order): DeAndre Jordan, Rudy Gobert, Hassan Whiteside, Andre Drummond, Karl-Anthony Towns, DeMarcus Cousins

3-and-D Wings

Notable Players (in no particular order): Kevin Durant, Klay Thompson, Kawhi Leonard, Jae Crowder, Trevor Ariza, Nicolas Batum

Scoring Wings

Notable Players (in no particular order): LeBron James, James Harden, Dwyane Wade, Lou Williams, Eric Gordon, Manu Ginobili, Joe Johnson

Versatile Forwards

Notable Players (in no particular order): Dirk Nowitzki, Giannis Antetokounmpo, Draymond Green, Ryan Anderson, Boris Diaw

Floor Generals

Notable Players (in no particular order): Chris Paul, John Wall, Rajon Rondo, Ricky Rubio, Jeff Teague, Dennis Schroder

Shooting Wings

Notable Players (in no particular order): Avery Bradley, Kentavious Caldwell-Pope, Devin Booker, J.J. Redick, Gary Harris

Combo Guards

Notable Players (in no particular order): Russell Westbrook, Stephen Curry, Jrue Holiday, Mike Conley, Kyrie Irving, Damian Lillard, Jeremy Lin

Offensive Centers

Notable Players (in no particular order): Anthony Davis, Kristaps Porzingis, Brook Lopez, Al Horford, Kevin Love, Blake Griffin

Advanced Metrics by Position

Now that we have classified NBA players into their natural positions, let’s take a look at some of the Advanced Metrics by position.

Average Win Shares by Position

Offensive Win Shares is an estimate of the number of wins contributed by a player due to his offense. Defensive Win Shares is an estimate of the number of wins contributed by a player due to his defense. Finally, Win Shares is an estimate of the total number of wins contributed by a player.

What’s quite revealing about this graph is the huge drop-off in Win Shares for Offensive Centers. A recent article by FiveThirtyEight questioned whether Jahlil Okafor, an Offensive Center, had a place in today’s NBA as a starting center. So far, it seems as though today’s NBA teams are, in fact, putting less and less value on Offensive Centers who cannot protect the rim.

Value Over Replacement Player (VORP) by Position

Value Over Replacement Player is a statistic spearheaded by Keith Woolner. VORP uses a “replacement player” performing at a “replacement level” as a baseline against which to compare how much a player contributes to his team’s success. As in the Average Win Shares graph, Offensive Centers suffer the most in the VORP metric, while Floor General-type point guards excel, with Scoring Wings following right behind.

Player Efficiency Rating (PER) by Position

ESPN.com columnist John Hollinger first developed the Player Efficiency Rating (PER), a metric that measures a per-minute rating. In John’s words:

“The PER sums up all a player’s positive accomplishments, subtracts the negative accomplishments, and returns a per-minute rating of a player’s performance.”

Top 5-Man Lineups

This classification method can also determine which 5-man combinations excel in today’s NBA. While exploring the best 5-man combinations deserves more time and consideration, let’s just take a quick look at this season’s top 5-man lineups and see what type of players are in each lineup.

NBA Lineups (Source: http://stats.nba.com/lineups/traditional/#!?sort=PLUS_MINUS&dir=1)

Cleveland Cavaliers

Kyrie Irving (Combo Guard)
LeBron James (Scoring Wing)
DeAndre Liggins (3-and-D Wing)
Kevin Love (Offensive Center)
Tristan Thompson (Defensive Center)

Utah Jazz

Boris Diaw (Versatile Forward)
Dante Exum (Scoring Wing)
Rudy Gobert (Defensive Center)
Gordon Hayward (Shooting Wing)
Rodney Hood (Shooting Wing)

Golden State Warriors

Stephen Curry (Combo Guard)
Kevin Durant (3-and-D Wing)
Zaza Pachulia (Offensive Center)
Klay Thompson (3-and-D Wing)
Draymond Green (Versatile Forward)

Washington Wizards

Bradley Beal (Shooting Wing)
Marcin Gortat (Defensive Center)
Markieff Morris (Versatile Forward)
Otto Porter (3-and-D Wing)
John Wall (Floor General)

Los Angeles Clippers

Blake Griffin (Offensive Center)
DeAndre Jordan (Defensive Center)
Luc Mbah a Moute (Versatile Forward)
Chris Paul (Floor General)
J.J. Redick (Shooting Wing)

Quick Observations

Just glancing over these lineups, I noticed that teams with Offensive Centers compensate by pairing them with Defensive Centers or other defensive players to protect the rim. This season, the Warriors have begun using Kevin Durant as a rim protector to compensate for Zaza’s inability to protect the rim. Both Kevin Love and Blake Griffin have Defensive Centers behind them, and both have 3-point shooting in their game to keep them on the floor.

Another thing I noticed was the presence of Versatile Forwards in four of the five lineups. This invaluable group of players was by far the most interesting position that I uncovered in the classification. They ranged from players that can be referred to as Stretch-4s, like Ryan Anderson and Channing Frye, to players like Boris Diaw, Giannis Antetokounmpo, and Draymond Green. But again, further investigation will need to be done to learn more about these Versatile Forward players.
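These lineup observations can be checked with a quick tally. The snippet below simply transcribes the cluster labels from the five lineups listed earlier; the dictionary layout and the helper name `lineups_with` are assumptions for illustration.

```python
# Cluster labels for each lineup, transcribed from the lists above.
lineups = {
    "Cleveland Cavaliers": ["Combo Guard", "Scoring Wing", "3-and-D Wing",
                            "Offensive Center", "Defensive Center"],
    "Utah Jazz": ["Versatile Forward", "Scoring Wing", "Defensive Center",
                  "Shooting Wing", "Shooting Wing"],
    "Golden State Warriors": ["Combo Guard", "3-and-D Wing", "Offensive Center",
                              "3-and-D Wing", "Versatile Forward"],
    "Washington Wizards": ["Shooting Wing", "Defensive Center",
                           "Versatile Forward", "3-and-D Wing", "Floor General"],
    "Los Angeles Clippers": ["Offensive Center", "Defensive Center",
                             "Versatile Forward", "Floor General",
                             "Shooting Wing"],
}


def lineups_with(player_type):
    """Return the teams whose lineup includes the given player type."""
    return sorted(team for team, types in lineups.items()
                  if player_type in types)


versatile = lineups_with("Versatile Forward")
offensive = lineups_with("Offensive Center")
# Teams that field an Offensive Center without a Defensive Center.
unpaired = [team for team in offensive
            if team not in lineups_with("Defensive Center")]
```

Here `unpaired` contains only the Warriors, which is the lineup where Kevin Durant is described as covering for the missing rim protector.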

Conclusion

Using machine learning, I classified NBA players by their feature statistics into natural clusters that best match their skill sets. The clusters that the algorithms have constructed identify which features are most important to a player and group them in such a way that is easily interpretable and inherently understood by players, coaches, and fans alike.

While further investigation into play-by-play data and an analysis of every player in NBA history may also enrich and improve the results, it is evident that the natural clusters in just a three-year timespan immediately define more than five positions in basketball. In addition, a more in-depth exploration into each cluster may reveal insight into how teams can scout and develop the next Manu Ginobili or Draymond Green.

Cork Gaines/Business Insider (http://www.businessinsider.com/nba-three-point-shooting-2016-3)

While both Alagappan and Shea brought unique approaches to player classification, I argue that there is no absolute or correct number of positions in basketball. Trends in the NBA are constantly changing and this study was intended to provide just a snapshot of today’s NBA players.

The clusters that the machine learning algorithms created are specifically fit to today’s NBA players and, as recent trends suggest, there will always be new players that completely defy prior expectations of positional roles.

Conducting this type of analysis at the collegiate level may also present telling results. With collegiate player data, scouts and front office executives may have a better understanding of which NBA players a collegiate player most resembles.

It is my hope that teams incorporate the results of this study to maximize the skill sets of their players and design offensive and defensive schemes that best put players in positions to succeed. I envision the results of this study having the most impact for teams defensively. Teams indubitably are cognizant of the strengths and weaknesses of their players, but during the season, innately knowing whether or not an opponent is going to pass, drive, or shoot may make all the difference in a late game possession.