Analyzing NBA Player Similarities for Recruitment Insights

Arshul Shaik
INST414: Data Science Techniques
May 15, 2024

In the NBA, teams face increasing competition as both players and organizations continuously strive to improve. Players dedicate themselves to honing their skills, while teams seek innovative strategies to gain an edge. This environment places greater importance on recruiting players who not only possess talent but also demonstrate potential for growth. As such, leveraging statistical performance for recruitment becomes crucial for teams aiming to remain competitive and achieve long-term success.

The motivating question we are addressing is: How can NBA teams recruit players who exhibit statistical similarities to high-performing players, even when those players are not available for trade or acquisition? This question is of utmost importance to stakeholders such as general managers (GMs), coaches, and talent scouts who are responsible for identifying and acquiring players that align with their team’s strategic objectives and playing style. By leveraging statistical analysis to evaluate player performance, teams can uncover hidden gems, identify players with skill sets that complement their existing roster, and make informed recruitment decisions. The actionable insight derived from this analysis enables stakeholders to build competitive teams, maximize player contributions, and ultimately enhance their chances of success on the court.

To effectively answer the question of how NBA teams can recruit players who exhibit statistical similarities to high-performing players, an ideal dataset would include comprehensive player statistics covering various aspects of their performance on the court. Key attributes would include player demographics (such as age), playing statistics (points scored, field goal percentage, rebounds, assists, steals, blocks), team performance metrics (games won and lost, minutes played), and fantasy points.

The dataset I found on Kaggle appears to align well with the ideal dataset for addressing this question. It contains a wide range of player statistics, including points scored (PTS), field goal percentage (FG%), three-point percentage (3P%), free throw percentage (FT%), rebounds (REB), assists (AST), steals (STL), blocks (BLK), and various other metrics. Additionally, it includes attributes such as player position (POS), team affiliation (Team), and player age (Age), which provide further context for analysis.

In terms of data collection, the dataset likely originated from official NBA sources or sports analytics databases, or was compiled by individual contributors. For cleaning the dataset, I relied on a combination of techniques. I parsed through the CSV file itself to identify relevant attributes and discard redundant or irrelevant columns. Drawing on my knowledge of the NBA and its statistical conventions, I decided which variables to retain and which to discard, keeping the dataset focused on key player performance metrics. I also applied that understanding to handle missing values, correct inconsistencies, and standardize formats across attributes. This process allowed me to tailor the dataset to the specific requirements of the analysis while maintaining data integrity and relevance.
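To make those cleaning steps concrete, here is a minimal sketch of the kind of handling described above, run on a tiny made-up sample standing in for the full Kaggle CSV (the column names mirror the dataset, but the players, values, and the `Rank` column are invented for illustration):

```python
import numpy as np
import pandas as pd

# Tiny made-up sample standing in for the full Kaggle CSV
data = pd.DataFrame({
    'PName': ['Player A', 'Player B', 'Player C'],
    'POS':   ['F', 'C', 'G'],
    'Team':  ['BOS', 'PHI', 'DAL'],
    'Age':   [25, 29, 24],
    'PTS':   [30.1, 33.1, 32.4],
    'FG%':   [0.466, 0.548, np.nan],  # missing value to clean
    'REB':   [8.8, 10.2, 8.6],
    'AST':   [4.6, 4.2, 8.0],
    'Rank':  [3, 1, 2],               # redundant column to drop
})

# Keep only the attributes used in the analysis
features = ['PTS', 'FG%', 'REB', 'AST']
data = data[['PName', 'POS', 'Team', 'Age'] + features]

# A missing shooting percentage typically means no attempts, so fill with 0
data[features] = data[features].fillna(0)
print(data['FG%'].tolist())  # [0.466, 0.548, 0.0]
```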

My project builds upon concepts from both Module 3 and Module 4 of the course. In Module 3, we explored similarity-based methods, particularly cosine similarity, as a means to quantify the resemblance between data points. I applied this concept by computing cosine similarity scores between NBA players based on their performance metrics, allowing me to identify players with similar playing styles or skill sets. Furthermore, in Module 4, we delved into clustering techniques as a method to group similar data points together. Leveraging this knowledge, I employed clustering algorithms to group NBA players based on their similarity scores, thereby identifying clusters of players who exhibit comparable performance profiles. By combining these methods, I was able to analyze the NBA player dataset, compute similarity scores, and subsequently cluster players based on their statistical performance. This approach provided a comprehensive framework for identifying and recruiting players who exhibit similarities to high-performing individuals, contributing to informed decision-making processes for NBA teams.

In analyzing NBA players based on their statistical performance, we identified key metrics that we believe capture the essence of a player's contribution on the court: points scored (PTS), field goal percentage (FG%), three-point percentage (3P%), free throw percentage (FT%), rebounds (REB), assists (AST), steals (STL), and blocks (BLK). Together, these metrics reflect the multifaceted nature of player performance and serve as crucial indicators of a player's impact on the game. To explore similarities between NBA athletes, we focused on two of the league's stars: Jayson Tatum and Joel Embiid, both renowned for their exceptional skills and their influence on the modern game of basketball.

We began by normalizing the data using the Min-Max Scaler, ensuring that each feature’s values were scaled to fall within the same range. This step was crucial for maintaining consistency across different metrics and facilitating meaningful comparisons between players. Subsequently, we computed the cosine similarity between the normalized data points, resulting in a similarity matrix that quantified the resemblance between each pair of players.
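As a self-contained illustration of these two steps, the sketch below runs the same Min-Max scaling and cosine similarity on three invented stat lines (only three features are used, for brevity):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity

# Three made-up stat lines: [PTS, REB, AST]
stats = np.array([
    [30.0, 8.0, 4.5],   # high-scoring wing
    [33.0, 10.0, 4.0],  # high-scoring big
    [12.0, 3.0, 9.0],   # pass-first guard
])

# Scale each feature to [0, 1], then compare rows by cosine similarity
scaled = MinMaxScaler().fit_transform(stats)
sim = cosine_similarity(scaled)

# The two scorers resemble each other far more than either resembles the guard
print(sim[0, 1] > sim[0, 2])  # True
```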

To identify the top similar players for a given NBA athlete, we designed a custom function named get_top_similar_players. This function takes as input the index of the player of interest, the similarity matrix, and the dataset containing player information. Leveraging the cosine similarity scores, the function sorts the indices of players by their similarity scores in descending order and retrieves the corresponding player names from the dataset, returning them as a list. Because a player's similarity to themselves is exactly 1.0, the player of interest appears first in the returned list.

from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity

# Normalize the data
scaler = MinMaxScaler()
data_normalized = scaler.fit_transform(data[features])

# Compute cosine similarity
similarity_matrix = cosine_similarity(data_normalized)

# Define function to get top similar players
def get_top_similar_players(player_index, similarity_matrix, data):
    # Get similarity scores for the player
    player_similarities = similarity_matrix[player_index]
    # Sort indices by similarity score, in descending order
    # (the player itself sorts first, with a similarity of 1.0)
    similar_player_indices = player_similarities.argsort()[::-1]
    # Get player names
    similar_players = data.iloc[similar_player_indices]['PName'].values
    return similar_players

In exploring player similarities within the NBA, intriguing parallels emerged among some of the league’s top athletes. For Jayson Tatum, a mix of established veterans like LeBron James and rising stars like Paolo Banchero highlighted a diverse range of skills. Meanwhile, Joel Embiid’s list showcased a blend of dominant big men like Kristaps Porzingis and agile forwards like Jayson Tatum, offering insights into the evolving landscape of NBA talent. These findings underscored the multifaceted nature of player resemblances, providing valuable insights for coaches and fans alike.

Top 10 similar players to Jayson Tatum:
['Jayson Tatum' 'Paolo Banchero' 'Pascal Siakam' 'Jaylen Brown'
'LeBron James' 'Julius Randle' 'Luka Doncic' 'Giannis Antetokounmpo'
'Joel Embiid' 'Zach LaVine']

Top 10 similar players to Joel Embiid:
['Joel Embiid' 'Kristaps Porzingis' 'Anthony Davis' 'Jayson Tatum'
'Paolo Banchero' 'Giannis Antetokounmpo' 'Evan Mobley' 'P.J. Washington'
'Aaron Gordon' 'Pascal Siakam']

We developed a function called `get_all_similar_players_with_scores` to facilitate the retrieval of all players similar to a designated NBA player alongside their corresponding similarity scores. This function takes the index of the target player, the previously computed similarity matrix, and the dataset containing player data as inputs. It then computes the similarity scores between the target player and all others, arranges them in descending order, and extracts the names of the similar players along with their respective similarity scores.

def get_all_similar_players_with_scores(player_index, similarity_matrix, data):
    # Get similarity scores for the player
    player_similarities = similarity_matrix[player_index]
    # Sort indices by similarity score, descending, excluding the player itself
    similar_player_indices = player_similarities.argsort()[::-1][1:]
    # Get player names
    similar_players = data.iloc[similar_player_indices]['PName'].values
    # Get similarity scores
    similarity_scores = player_similarities[similar_player_indices]
    return similar_players, similarity_scores

This functionality proves invaluable in the realm of NBA analytics, empowering teams and analysts to identify players who share similar performance profiles with a given player. For example, when applied to Jayson Tatum, the function generates a dictionary named `similar_players_dict`, which contains the names of players similar to Tatum and their associated similarity scores. These scores offer quantitative insights into the degree of resemblance between Tatum and each identified player, enabling informed decision-making processes in various aspects of team management and strategy formulation. Leveraging this similarity analysis, teams can make data-driven decisions regarding player acquisitions, team composition, and tactical approaches, thereby enhancing their competitive prowess within the league.
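The construction of `similar_players_dict` is not shown in the snippets above; a plausible sketch is below, re-using the function from earlier with a tiny made-up similarity matrix and roster so it runs on its own (the real call would pass the full `similarity_matrix` and `data`, and the names other than Tatum's are invented):

```python
import numpy as np
import pandas as pd

# Tiny stand-in similarity matrix and roster (values are made up)
similarity_matrix = np.array([
    [1.00, 0.95, 0.72],
    [0.95, 1.00, 0.68],
    [0.72, 0.68, 1.00],
])
data = pd.DataFrame({'PName': ['Jayson Tatum', 'Player B', 'Player C']})

def get_all_similar_players_with_scores(player_index, similarity_matrix, data):
    player_similarities = similarity_matrix[player_index]
    # Sort by similarity, descending, excluding the player itself
    similar_player_indices = player_similarities.argsort()[::-1][1:]
    similar_players = data.iloc[similar_player_indices]['PName'].values
    similarity_scores = player_similarities[similar_player_indices]
    return similar_players, similarity_scores

# Build the name -> score dictionary consumed by the clustering step
players, scores = get_all_similar_players_with_scores(0, similarity_matrix, data)
similar_players_dict = dict(zip(players, scores.tolist()))
print(similar_players_dict)  # {'Player B': 0.95, 'Player C': 0.72}
```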

We initiated a clustering process to group NBA players into distinct clusters based on their similarity scores, using threshold values to delineate the clusters. The thresholds of 0.9, 0.8, and 0.7 correspond to the score ranges [0.9, 1.0], [0.8, 0.9), and [0.7, 0.8) respectively.

# Define threshold values (descending)
thresholds = [0.9, 0.8, 0.7]

# Initialize dictionaries to store clusters for each threshold
clusters = {threshold: {} for threshold in thresholds}

# Assign each player to the highest threshold their score reaches, so the
# clusters correspond to the disjoint ranges [0.9, 1.0], [0.8, 0.9), [0.7, 0.8)
for player, score in similar_players_dict.items():
    for threshold in thresholds:
        if score >= threshold:
            if score in clusters[threshold]:
                if len(clusters[threshold][score]) < 10:  # cap players per score group
                    clusters[threshold][score].append(player)
            elif len(clusters[threshold]) < 10:  # cap score groups per threshold
                clusters[threshold][score] = [player]
            break  # stop at the first (highest) matching threshold

To form these clusters, we used a dictionary, `clusters`, keyed by threshold. Within each threshold, we capped the grouping at the top 10 score groups. Traversing the similarity scores obtained earlier, we sorted each player into the cluster whose threshold range contained their similarity score.

This clustering approach allowed us to categorize players into groups sharing similar performance characteristics, providing a structured framework for analyzing player similarities within the NBA dataset. By organizing players into these clusters, we gained valuable insights into the distribution of player resemblances across different similarity score ranges, facilitating a more nuanced understanding of player dynamics and contributing to informed decision-making processes for teams and analysts alike.
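The resulting structure can then be summarized per threshold, for example as below (using a toy `clusters` dictionary with invented players and scores, since the real contents depend on the full dataset):

```python
# Toy clusters dict in the shape produced above: threshold -> {score: [players]}
clusters = {
    0.9: {0.95: ['Player B'], 0.91: ['Player D']},
    0.8: {0.84: ['Player E', 'Player F']},
    0.7: {0.72: ['Player C']},
}

# Summarize how many players fall into each similarity range
for threshold, groups in sorted(clusters.items(), reverse=True):
    members = [p for players in groups.values() for p in players]
    print(f"Threshold {threshold}: {len(members)} players -> {members}")
```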

The analysis provides stakeholders in NBA team management with a systematic approach to identifying and recruiting players who exhibit statistical similarities to high-performing individuals like Jayson Tatum and Joel Embiid. By utilizing cosine similarity and clustering techniques, teams can efficiently assess player performance metrics and pinpoint candidates who complement their strategic objectives and playing style. Custom functions streamline the process of retrieving similar players, offering quantitative insights into the degree of resemblance between target players and potential recruits. The clustering analysis categorizes players into distinct groups based on their similarity scores, providing a structured framework for understanding player dynamics and distributions. Armed with these insights, stakeholders can make data-driven decisions regarding player acquisitions, team composition, and tactical approaches, ultimately enhancing their team’s competitive edge and prospects for success in the NBA.

There are several limitations to acknowledge in this project. Firstly, it relies on a single dataset from the 2023 NBA season, which may not capture the full spectrum of player performance or reflect recent developments in player skills and strategies. Additionally, the dataset’s accuracy may be compromised by factors such as injuries, changes in playing form, or player transfers mid-season, which could skew the statistical measurements and affect the reliability of the analysis.

Furthermore, the analysis primarily focuses on a limited set of performance metrics, including points scored (PTS), field goal percentage (FG%), three-point percentage (3P%), free throw percentage (FT%), rebounds (REB), assists (AST), steals (STL), and blocks (BLK). While these metrics offer valuable insights into player performance, they may not encompass all aspects of a player’s contribution to the team or fully capture their playing style.

While clustering techniques provide a useful framework for grouping players based on similarity scores, they are inherently subjective and reliant on the choice of similarity metric and clustering algorithm. Different methods may yield varying results, and the interpretation of clusters may be subjective, requiring expert judgment and domain knowledge for meaningful insights.

In conclusion, while this project offers valuable insights into player recruitment based on statistical performance, it is essential to acknowledge its limitations and exercise caution in interpreting the results. Future research could explore additional datasets, incorporate more comprehensive performance metrics, and address ethical considerations to enhance the robustness and validity of the analysis.

GitHub Repository: https://github.com/arshuls/INST414.git
