Predicting Successful Three-Point Shooters for NBA Scouts

Actionable Insight

Published in

INST414: Data Science Techniques

14 min read · Dec 17, 2023

The project aims to solve a critical issue for basketball coaches and NBA general managers: identifying the ideal 3-point shooter. This endeavor is vital for player recruitment, team strategy formulation, and enhancing overall team performance. By analyzing the correlation between physical attributes and 3-point shooting efficiency, this project seeks to derive a predictive model that can guide decision-making in player selection.

The motivation for this project stems from the evolving dynamics of modern basketball, where the 3-point shot has become a game-changing element. In recent years, the NBA has shifted toward a greater emphasis on perimeter shooting, fundamentally altering how the game is played and how strategies are formulated. This transformation has elevated the importance of proficient 3-point shooters, making them crucial assets for any competitive team and underscoring the impact of long-range shooting on the game’s competitive landscape.

Furthermore, the motivation is fueled by the business implications of these decisions. Effective player recruitment and development directly impact a team’s performance, fan engagement, and financial success. Identifying the ideal 3-point shooter can lead to more victories, increased ticket sales, and greater merchandise revenue, making this not only a sports strategy question but also a significant business consideration. The rise of the 3-point shooter has fundamentally altered team dynamics, necessitating new approaches in offensive and defensive game plans.

In the evolving landscape of basketball, the importance of 3-point shooting cannot be overstated. Identifying players with the potential to excel in this aspect is crucial. The challenge lies in determining what physical attributes and playing statistics are most indicative of a successful 3-point shooter. The primary objective is to predict what makes a good 3-point shooter by examining correlations between physical attributes (such as height and weight) and 3-point shooting percentages. The hypothesis is that certain physical and performance metrics significantly influence a player’s ability to effectively score 3-point shots.

The analysis will involve a comprehensive review of current and historical NBA player data. Key metrics include:

3-point percentage (3P%)
3-point attempts (3PA)
Field goal percentage (FG%)
Free throw percentage (FT%)
Free throw attempts (FTA)
Height (cm) and weight (kg)

Preliminary analysis suggests that 3-point shooting efficiency is not solely dependent on physical attributes like height and weight. Instead, a combination of physical traits and practiced skills (evidenced by FG%, FT%, and attempts) plays a crucial role. For instance, taller players may have an advantage in shooting over defenders, but this does not inherently make them better 3-point shooters. Similarly, players with a high free throw percentage often exhibit strong shooting mechanics, which can translate to 3-point shooting success. Based on these insights, NBA teams should consider a holistic view of a player’s attributes and statistics when scouting for potential 3-point shooters. While height and weight are important, they should be weighed alongside shooting statistics to predict 3-point shooting success. Teams may also consider investing in player development programs focusing on shooting mechanics, especially for players showing potential in free throw accuracy.

The success of this project will be measured by the accuracy of the predictive model in identifying potential 3-point shooters. This can be quantitatively assessed by comparing predicted 3-point shooting efficiencies with actual performances in subsequent seasons. Additionally, the adoption rate of these insights by NBA teams and their impact on player recruitment and team performance will serve as qualitative success indicators.

This analysis provides a strategic framework for basketball teams to identify potential 3-point shooters. By combining physical attributes with performance statistics, teams can make more informed decisions in player selection and development, enhancing their competitive edge in the league. The project’s success will ultimately be reflected in its adoption by NBA teams and its impact on the evolution of basketball strategy.

Data

For this report, our team examined statistics from previous NBA seasons to help NBA scouts determine what makes a quality three-point shooter. We began by researching which datasets best covered NBA data, with the constraint that a usable dataset had to include player name, player height, player weight, 3-point shots attempted, 3-point shots made, and 3-point percentage. This led us to the NBA API package, which can be installed in Python and pulls data directly from NBA.com. The package contained every statistical measure we needed for a player but was missing players’ heights and weights, which pushed us toward more of a web-scraping approach to gathering data. Navigating several data sources presented its own challenges, particularly in ensuring that the data were compatible and complete.

To fill the gap, we found another dataset on Kaggle.com that contains players’ heights and weights and is stored as a CSV file. Our search led us to two specific datasets, selected for their comprehensive coverage and reliability. The first, “NBA Player Stats Dataset for the 2023–2024” (available here), provides detailed player statistics, including 3-point shooting stats. The second, “NBA Players Data” (found here), provides physical attributes such as height and weight. Both were downloaded in CSV format for ease of use. Using Pandas, we created a new dataframe from both sources, pulling only the height and weight from the CSV file and taking the remaining columns from the NBA API package.

From the API, the player stats table has roughly 600 rows, with columns for each player’s scoring statistics, such as team association, GP, MIN, FGM, FGA, FTM, FTA, PTS, FG3M, FG3A, and 3-point percentage. This table is larger than we initially estimated, at approximately 450 kilobytes, depending on how much data is required to match the other dataset. The physical-attributes dataset has 599 rows and 22 columns, of which we need only three: each player’s full name (first and last), height, and weight. The name is needed as the key for merging the physical attributes with the scoring stats. The refined three-column dataset is approximately 250 kilobytes.

The NBA API returns current data that we load directly into Python’s in-memory structures. We chose Pandas, known for its efficient data manipulation, to handle it, which lets us modify, filter, and transform the data quickly. Player height and weight come from the Kaggle dataset, formatted as a CSV, a format we preferred for its wide acceptance and ease of handling. Pandas is again our tool for selecting the player name, height, and weight columns. Once data from both sources has been retrieved, we merge them using player names as the common link, producing a unified dataset of player stats and physical metrics that is held as an in-memory dataframe. We also store the integrated dataset as a CSV file so that repeated analyses do not need to fetch live data. This static version is quicker to access and removes any dependency on continuous data sources.
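As a rough illustration of the merge described above, the following Pandas sketch joins a stats table to a physical-attributes table on player name. The column names (PLAYER_NAME, full_name, height_cm, weight_kg), the placeholder rows, and the choice of an inner join are assumptions made for the example; they are not the actual schemas returned by the NBA API or the Kaggle file.

```python
import pandas as pd

# Hypothetical stand-in for the stats table pulled from the NBA API.
stats = pd.DataFrame({
    "PLAYER_NAME": ["Player A", "Player B", "Player C"],
    "FG3M": [120, 45, 10],
    "FG3A": [300, 150, 40],
})
stats["FG3_PCT"] = stats["FG3M"] / stats["FG3A"]

# Hypothetical stand-in for the Kaggle CSV (the real file has 22 columns).
physical = pd.DataFrame({
    "full_name": ["Player A", "Player B", "Player D"],
    "height_cm": [198.0, 206.0, 190.0],
    "weight_kg": [95.0, 110.0, 88.0],
    "country": ["USA", "USA", "France"],
})

# Keep only the three columns needed from the physical-attributes table.
physical = physical[["full_name", "height_cm", "weight_kg"]]

# Inner join on the player name: only players present in both sources survive.
merged = stats.merge(
    physical, left_on="PLAYER_NAME", right_on="full_name", how="inner"
).drop(columns="full_name")

# Serialize a static snapshot; pass a file path to to_csv to write it to disk.
csv_text = merged.to_csv(index=False)
print(merged)
```

With an inner join, any player whose name is spelled differently in the two sources is silently dropped, so it is worth comparing the row counts before and after the merge.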

Data cleaning prior to analysis:

Through data exploration we identified two tables that together contain the data needed to predict successful three-point shooters. Our current objective is to merge them so that each player’s in-game statistics and physical attributes can be analyzed together. We plan to join the two datasets with the player’s name as the key. One way to do this in Pandas is to set the key as the index of both dataframes; either way, the join only succeeds for players whose names are written identically in both tables. We have therefore begun cleaning the datasets so that the names follow the same format. This cleaning is critical for maintaining the integrity of our analysis, since mismatched names would drop or misalign players and skew our conclusions. We are changing the letters so that only the first letter of each name is capitalized, and we are removing middle initials. We are also ordering the names in the format “first name, last name”; an example of a cleaned name is “Lebron, James”. Finally, we are removing rows with missing values in the columns of interest, filtering out players who are missing data in columns such as height, weight, or shots made. Each method was chosen for its ability to uncover distinct yet complementary insights, thereby enriching our overall understanding of what constitutes an effective 3-point shooter.
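The cleaning steps above could look roughly like the sketch below. The helper `normalize_name`, the column names, and the placeholder rows are hypothetical, and the helper joins the name parts with a single space; switching the separator to ", " would match the “first name, last name” format described above.

```python
import re

import pandas as pd


def normalize_name(name: str) -> str:
    """Capitalize each name part and drop single-letter middle initials."""
    parts = name.strip().split()
    kept = [
        part
        for i, part in enumerate(parts)
        # Always keep the first and last token; drop middle tokens like "Q" or "Q.".
        if i in (0, len(parts) - 1) or not re.fullmatch(r"[A-Za-z]\.?", part)
    ]
    # str.capitalize lowercases everything after the first letter of each part.
    return " ".join(part.capitalize() for part in kept)


# Placeholder rows, not real player data.
df = pd.DataFrame({
    "name": ["JOHN Q. SMITH", "  jane doe ", "Player X"],
    "height_cm": [201.0, 188.0, None],
    "weight_kg": [100.0, 84.0, 90.0],
})

df["name"] = df["name"].map(normalize_name)

# Remove rows that are missing any of the columns of interest.
clean = df.dropna(subset=["height_cm", "weight_kg"]).reset_index(drop=True)
print(clean)
```

Applying the same function to the name column of both tables before merging keeps the join key consistent across sources.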

Rationale

A key idea from the course that our project draws on is that successful three-point shooters should share similarities with other successful three-point shooters. This idea is closely related to homophily, the notion that similarities in key attributes often lead to similar skills or tendencies. Building on it, we can try to identify relationships and structures among players in the data through unsupervised machine learning methods such as clustering. We anchored our study on homophily in our effort to discern the common traits of elite three-point shooters in the NBA, and our project explores the idea within sports analytics, aiming to pinpoint the characteristics that define proficient three-point shooters.

Our analytical approach encompassed a variety of techniques:

Grouping Through Machine Learning: We leveraged unsupervised machine learning, particularly clustering algorithms, to categorize NBA players. By examining factors like their on-court performance metrics, physical characteristics, and three-point shooting proficiency, we could unearth patterns and commonalities shared by top-notch three-point shooters. In tailoring our approach, we went beyond basic clustering algorithms to adapt them specifically for basketball analytics. We evaluated many player characteristics, including assist-to-turnover ratios, defensive performance, and stamina as reflected by playing time. This comprehensive approach allowed us to construct detailed player profiles, leading to refined groupings that capture the multifaceted aspects of basketball performance.

Investigating Relationships Between Variables: A comprehensive analysis of correlations was undertaken to unravel the interplay between different player attributes, such as height, weight, and shooting percentages. This was a pivotal part of our research, helping us identify the traits most influential in a player’s capacity to execute three-point shots successfully. We expanded our investigation beyond simple correlation to include statistical techniques such as factor and principal component analysis. These methods enabled us to distill our dataset to its core components, revealing the primary factors contributing to variations in shooting ability. For instance, we looked into how a player’s ability to perform under high-pressure situations, as indicated by clutch performance metrics, relates to their proficiency in three-point shooting.
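To make the correlation step concrete, here is a minimal sketch that computes a Pearson correlation matrix with Pandas. The data are synthetic and built so that free throw percentage is related to 3-point percentage while height is not; they illustrate only how the computation works and say nothing about real NBA players. The column names are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic data only: FT% is constructed to drive 3P%, height is independent.
ft_pct = rng.uniform(0.55, 0.95, n)
height_cm = rng.normal(200, 8, n)
fg3_pct = 0.05 + 0.4 * ft_pct + rng.normal(0, 0.02, n)

players = pd.DataFrame(
    {"height_cm": height_cm, "FT_PCT": ft_pct, "FG3_PCT": fg3_pct}
)

# Pairwise Pearson correlation between every pair of columns.
corr = players.corr(method="pearson")

# Rank the attributes by how strongly they correlate with 3-point percentage.
print(corr["FG3_PCT"].drop("FG3_PCT").sort_values(ascending=False))
```

Pearson correlation only captures linear association, so a low value does not rule out a nonlinear relationship between an attribute and shooting percentage.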

Predictive Modeling with Regression Analysis: We employed regression analysis as a tool for prediction. These models were instrumental in delineating how a player’s physical features and on-court statistics correlate with their aptitude for three-point shooting, allowing us to forecast a player’s potential in this area. Our approach to regression modeling was comprehensive, encompassing both logistic and polynomial regressions to grasp our data’s complex, nonlinear dynamics. We integrated metrics like player efficiency ratings and their developmental trajectories over time to forecast their current and future capabilities in three-point shooting. This method yielded a dynamic model that adjusts to the changing trajectories of players’ careers.

Validation and Assessment of Our Models: We applied cross-validation to test the robustness and applicability of our models, partitioning the dataset into separate training and testing segments to evaluate predictive performance on new, unseen data. We combined cross-validation with bootstrapping to confirm that the models were robust. We also challenged our models with diverse scenarios, including playoff performances and debut-season data, to test their effectiveness in various career phases and under different stress levels.
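The partitioning idea behind cross-validation can be sketched in a few lines of NumPy. This example covers k-fold cross-validation only (not bootstrapping), fits an ordinary least-squares model on each training split, and uses synthetic data; the function name `k_fold_mse` and the feature choices are assumptions for illustration.

```python
import numpy as np


def k_fold_mse(X, y, k=5, seed=0):
    """Return the mean and per-fold test MSE of a least-squares fit over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Add an intercept column, then fit on the training folds only.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        residuals = y[test_idx] - A_test @ coef
        scores.append(float(np.mean(residuals ** 2)))
    return float(np.mean(scores)), scores


# Synthetic features (FT%, height in cm) and a synthetic 3P% target.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([rng.uniform(0.55, 0.95, n), rng.normal(200, 8, n)])
y = 0.05 + 0.35 * X[:, 0] + rng.normal(0, 0.02, n)

mean_mse, fold_scores = k_fold_mse(X, y, k=5)
print(f"mean test MSE over 5 folds: {mean_mse:.5f}")
```

Because every row is held out exactly once, the averaged score is less sensitive to one lucky or unlucky split than a single train/test partition.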

Data Visualization for Insight Communication: We utilized a range of data visualization techniques to convey our findings clearly. These graphical representations played a crucial role in making our analysis understandable and engaging, illustrating the intricate patterns and relationships we discovered. We enhanced our data visualization strategies by incorporating interactive features into our graphical representations. For example, we created interactive dashboards where scouts can simulate outcomes based on hypothetical player statistics. Additionally, we employed heat maps to effectively demonstrate the distribution of critical attributes among elite three-point shooters, offering an immediate visual interpretation of our data.

The interplay of these methods allows us to construct a multifaceted analysis, tapping into different dimensions of players’ performances and attributes.

Insight and Analysis

For the clustering analysis, we wanted to group players into clusters by their height and weight together with their stats. We chose clustering because grouping players of similar height and weight makes it easier to compare stats cluster by cluster and see which clusters perform best. We felt this would show whether height and weight play an important part in better stats more clearly than examining individual players. We used the elbow method to determine the K value for KMeans, testing from 1 to 30 clusters. The elbow method indicated that 8 was the best value of K, so we created the clusters dataframe and merged it with our feature matrix for visualization. For visualization, we drew a scatterplot of each stat by cluster to measure which clusters perform better on each stat in the matrix. We examined which cluster has the highest numbers per stat and determined which cluster performs best overall. Each cluster represents a group of players with similar heights and weights; clusters 5, 6, and 7 contain the tallest and heaviest players, and the rest fall in the shorter and lighter range. Based on the scatterplots, we conclude that cluster 6 has better overall stats than the other clusters and performs well in PTS (points scored). Players in cluster 6 are relatively heavy, at around 100 kg, and average about 200 cm in height.
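The elbow method described above can be sketched with scikit-learn as follows. This is an illustration under assumptions: the data are synthetic height and weight pairs drawn around three made-up centers, the scan covers K from 1 to 10 rather than 1 to 30, and standardizing the features first is our choice for the example rather than a documented step of the project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic (height_cm, weight_kg) pairs around three made-up centers.
centers = np.array([[188.0, 84.0], [200.0, 98.0], [211.0, 113.0]])
X = np.vstack([c + rng.normal(0, 2.0, size=(60, 2)) for c in centers])

# Scale features so height and weight contribute comparably to distances.
X_scaled = StandardScaler().fit_transform(X)

# Inertia (within-cluster sum of squared distances) for each candidate K.
ks = range(1, 11)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(f"k={k:2d}  inertia={inertia:.1f}")

# After reading the elbow off the curve, fit the chosen K and label each row.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
```

The elbow is the value of K after which inertia stops dropping sharply; on real data the bend is often less clear than on well-separated synthetic blobs, so the choice involves some judgment.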

Elbow method graph

Number of clusters:

Scatterplot of the clusters and PTS:

For linear regression, we wanted to test how well a linear model could predict three-point shooting percentage from player statistics. First we cleaned the data: relevant columns were selected, and rows with missing values in those columns were dropped. Next, the ‘3P%’ column, representing three-point shooting percentage, was converted to numeric format where it was stored as strings. After cleaning the data, we split it into training and testing sets, fit the model, and made predictions. We then computed the mean squared error and R-squared value to evaluate the model’s performance.
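The regression workflow described above might look like the following scikit-learn sketch. The feature list, the synthetic rows, the 80/20 split, and the random seed are all assumptions for illustration; the project's actual columns and settings may differ.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300

# Synthetic rows; '3P%' is stored as strings to mimic the conversion step.
df = pd.DataFrame({
    "FT%": rng.uniform(0.55, 0.95, n),
    "FG%": rng.uniform(0.38, 0.60, n),
    "height_cm": rng.normal(200, 8, n),
})
three_pct = 0.05 + 0.35 * df["FT%"] + rng.normal(0, 0.02, n)
df["3P%"] = three_pct.round(3).astype(str)
df.loc[df.index % 50 == 0, "3P%"] = None  # simulate a few missing values

# Drop rows with missing values, then convert the target to numeric.
features = ["FT%", "FG%", "height_cm"]
df = df.dropna(subset=features + ["3P%"]).copy()
df["3P%"] = pd.to_numeric(df["3P%"], errors="coerce")

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["3P%"], test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Report error on both splits so a train/test gap is visible.
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
test_r2 = r2_score(y_test, model.predict(X_test))
print(f"train MSE={train_mse:.5f}  test MSE={test_mse:.5f}  test R^2={test_r2:.3f}")
```

Because the target is a fraction between 0 and 1, the mean squared error is numerically small even for a mediocre model, so it is best read alongside R-squared and the gap between training and test error.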

From our results, we were able to identify key features in the dataset that could be relevant predictors of three-point percentage. However, judging by our mean squared error, it seems that our model has overfitted to the data. These findings give NBA scouts a more nuanced understanding of player profiles and can aid in identifying promising 3-point shooting talent. While our model offers valuable insights, it is important to acknowledge its potential limitations and biases, which pave the way for further refinement and validation.

Limitations

While our project provides valuable insights into what constitutes a quality three-point shooter in the NBA, it’s important to acknowledge its limitations in scope, data, methodology, and ethical implications. These limitations must be considered when interpreting the results and making practical applications in the real world of basketball scouting and analytics.

One limitation of our project, specific to the linear regression analysis, is that the mean squared error we observed was quite low. This could indicate overfitting, meaning that the model has learned the training data too well, capturing its noise and fluctuations, and will not generalize well to new, unseen data. A low error on its own does not settle the question, though: overfitting shows up as a gap between training error and test error, so comparing the two would be needed to confirm it.

Throughout this project we analyzed only three-point shooters and omitted other players. The project specifically targets the attributes that make a quality three-point shooter, and this narrow focus might overlook other important aspects of a player’s performance and overall contribution to the team. Additionally, the data reflect performance in previous NBA seasons and may not account for evolving strategies, player development, or changes in team dynamics.

Data Limitations

The initial dataset from the NBA API lacked critical physical attributes like height and weight, necessitating additional data sourcing. This piecemeal approach to data collection introduces potential inconsistencies. Merging datasets from different sources (NBA API and Kaggle) posed challenges in ensuring data compatibility, especially in data formats and naming conventions. The process of cleaning the data (e.g., standardizing player names, removing rows with missing values) might inadvertently have introduced some errors. Finally, we relied on some historical data, which may not capture current trends or emerging talents in the NBA.

Analysis Constraints

We treated only a few variables as important in our analysis, and this limited focus may have led to less accurate results. Focusing primarily on physical attributes and shooting statistics may overlook other significant factors that influence a player’s three-point shooting ability, such as mental resilience, team dynamics, or coaching style. The chosen analytical methods and algorithms could also influence the outcomes, since different techniques or models might yield varying interpretations of the same data.

Appendix

Arvind: Worked on linear regression code and questions 5 and 7
Mahat: Worked on questions 3 and 4
David: Worked on questions 2, 3, and 7
Graham: Worked on code and questions 5 and 6
Calvin: Worked on code and questions 5 and 6

Link to github repo:

https://github.com/AJ-comm/NBA-3pt-shooter/tree/main