Regression Analysis: NBA Player Stats — Predicting Regular Season and Postseason PPG

Anaswar Jayakumar
27 min read · Jul 19, 2023

Overview

This project is one of several linear regression projects I have done in sports analytics. This one analyzes NBA regular season and postseason data, while the others analyze NFL data, specifically passing, rushing, and receiving data.

Python was the language of choice for this project, although R could certainly have been used as well. I personally find Python the better fit here because the regression analysis portion of the project relies on machine learning techniques that I find easier to apply in Python than in R.

Player stats for the 2021–2022 NBA regular season and postseason were analyzed on a per-game basis. The data was obtained from Basketball Reference, a website that publishes sports statistics. The following is the link to the page from which the data for this project was obtained: https://www.basketball-reference.com/leagues/NBA_2022_per_game.html.

Objective

The objective of this project is to predict a given player’s points per game (PPG) for both the regular season and postseason using a number of different variables of interest. In the dataset, PPG was already calculated using the following formula:

PPG = Total Points Scored / Games Played

PPG is an important metric because it measures both a player’s individual scoring performance and that player’s contribution to the number of points a team can score on a given night. When a player has a higher PPG, the team will naturally tend to score more points per game.

Moreover, given that teams now average 114.2 points per game (2022–2023 season), PPG is an even more important metric. Predicting PPG not only allows teams to better gauge player performance on the court but can also help them at the trade deadline and in free agency.

Teams are always looking for players who can score when needed, and as a result PPG is certainly a metric worth prioritizing.

Review of Data Sources

The data that was used for this assignment (2021–2022 NBA Player Stats — Regular.csv, 2021–2022 NBA Player Stats — Playoffs.csv) was provided by Basketball Reference and the pandas library in Python was used to load the data into the following dataframes: nba_data_reg (NBA Player Stats — Regular Season) and nba_data_post (NBA Player Stats — Postseason).

Both dataframes contain thirty columns, but the regular season dataframe contains 812 rows while the postseason dataframe contains only 217 rows. Neither dataframe contains any null/missing values, so imputation wasn’t required.
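As a rough sketch of the shape and missing-value check described above (the helper name `summarize_frame` and the tiny stand-in frame are hypothetical, not taken from the original project; in practice the frame would come from `pd.read_csv` on the downloaded files):

```python
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> dict:
    """Return the row count, column count, and total number of missing cells."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing": int(df.isna().sum().sum()),
    }


# Tiny stand-in frame for illustration; the real frames would be loaded
# with pd.read_csv(...) from the regular season and playoff CSV files.
sample = pd.DataFrame({"Player": ["A", "B"], "G": [10, 20], "PTS": [5.5, 12.1]})
summary = summarize_frame(sample)
print(summary)
```

A missing count of zero is what justifies skipping imputation; any positive count would call for a closer look at the affected columns.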

However, columns that weren’t necessary for the analysis were dropped and columns were renamed so that they made more sense. In particular, the column Rk was dropped from the dataframes and the following columns were renamed as well:

  • 2P (TwoPointersMade)
  • 2P% (TwoPointersPercentage)
  • 2PA (TwoPointersAttempted)
  • 3P (ThreePointersMade)
  • 3P% (ThreePointersPercentage)
  • 3PA (ThreePointersAttempted)
  • FG (FieldGoalsMade)
  • FG% (FieldGoalsPercentage)
  • FGA (FieldGoalsAttempted)
  • eFG% (EffectiveFieldGoalsPercentage)
  • FT (FreeThrowsMade)
  • FT% (FreeThrowsPercentage)
  • FTA (FreeThrowsAttempted).
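The drop-and-rename step can be sketched as follows. This is a minimal illustration rather than the project's actual code: the helper `clean_columns` is a hypothetical name, the demo row is made up, and the two-point percentage column is assumed to be keyed `2P%`, following the pattern of the other percentage columns.

```python
import pandas as pd

# Mapping from Basketball Reference abbreviations to the descriptive names.
RENAME_MAP = {
    "2P": "TwoPointersMade",
    "2P%": "TwoPointersPercentage",
    "2PA": "TwoPointersAttempted",
    "3P": "ThreePointersMade",
    "3P%": "ThreePointersPercentage",
    "3PA": "ThreePointersAttempted",
    "FG": "FieldGoalsMade",
    "FG%": "FieldGoalsPercentage",
    "FGA": "FieldGoalsAttempted",
    "eFG%": "EffectiveFieldGoalsPercentage",
    "FT": "FreeThrowsMade",
    "FT%": "FreeThrowsPercentage",
    "FTA": "FreeThrowsAttempted",
}


def clean_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the rank column (if present) and apply the descriptive names."""
    return df.drop(columns=["Rk"], errors="ignore").rename(columns=RENAME_MAP)


# Made-up single-row frame, only to show the effect on the column names.
raw = pd.DataFrame(
    {"Rk": [1], "Player": ["Example Player"], "FG": [4.0], "FG%": [0.5], "2P%": [0.55]}
)
cleaned = clean_columns(raw)
print(list(cleaned.columns))
```

Applying the same function to both dataframes keeps the regular season and postseason column names aligned.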

After dropping the unnecessary variables and renaming columns, exploratory data analysis was performed on both dataframes. The following table summarizes the variables that are present in the NBA Player Stats Regular Season and NBA Player Stats Postseason dataframes.

Exploratory Data Analysis (EDA)

EDA was the next step of this project, the goal being to get a better understanding of the data at large. The EDA comprised three components: descriptive statistics, histograms, and correlation analysis.

For the purposes of this article, I will focus the EDA more on the histograms and the correlation analysis since both were instrumental in the subsequent regression analysis portion of this project.

Histograms were generated to better understand the underlying distribution of the independent variables, while correlation analysis determined which predictor variables would ultimately be used to predict points per game (PPG).
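The original plots are not reproduced here. As a hedged sketch of what sits behind a histogram, the bin counts can be computed with `numpy.histogram`; the data below is synthetic (an exponential sample standing in for a right-skewed per-game stat), not the project's data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, right-skewed stand-in for a per-game stat such as field goals made.
values = rng.exponential(scale=3.0, size=500)

# Ten equal-width bins spanning the observed minimum to maximum.
counts, edges = np.histogram(values, bins=10)

# Crude text rendering: one '#' per five observations in the bin.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {'#' * (int(count) // 5)}")
```

Bars that are tall on the left and thin out to the right are what the article calls a positively (right) skewed distribution.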

In particular, the EDA focused on the following for both the regular season and postseason:

  • playing time (games, games started, minutes played)
  • field goal shooting performance (field goals made, field goals attempted, field goals percentage, effective field goals percentage)
  • three pointer shooting performance (three pointers made, three pointers attempted, three pointers percentage)
  • two pointer shooting performance (two pointers made, two pointers attempted, two pointers percentage)
  • free throw shooting performance (free throws made, free throws attempted, free throws percentage)
  • rebounding performance (offensive rebounds, defensive rebounds, total rebounds)
  • other metrics (assists, steals, blocks, turnovers, personal fouls drawn).

Histograms — Regular Season

Playing Time

The variable games played seems to mostly resemble a multimodal distribution, the variable games started seems to resemble a positively (right) skewed distribution, and the variable minutes played seems to mostly resemble a normal distribution.

The means of the variables games played, games started, and minutes played are 36.704433, 16.672414, and 18.265394 respectively, while the standard deviations are 25.899099, 23.817195, and 9.648292 respectively.
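Summary statistics like these can be produced directly from a dataframe. The snippet below is a minimal sketch on made-up values (the column names `G` and `GS` and the numbers are illustrative only, not the project's data). It also reports sample skewness, which gives a numeric check on the visual impression of a right-skewed histogram.

```python
import pandas as pd

# Hypothetical values for illustration only; not taken from the dataset.
demo = pd.DataFrame(
    {
        "G": [2, 5, 8, 20, 60, 65, 70],
        "GS": [0, 0, 0, 1, 3, 40, 70],
    }
)

# One row per variable, with mean, sample standard deviation (ddof=1), and skewness.
stats = demo.agg(["mean", "std", "skew"]).T
print(stats.round(3))
```

A clearly positive skewness value, as for `GS` here, corresponds to a long right tail.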

The distributions of the variables games played, games started, and minutes played imply the following:

  • In the distribution of games played, multiple peaks are present: one between approximately zero and ten games played, and others at approximately twenty and sixty games played.
  • In the distribution of games started, most players start approximately zero to ten games, with a long tail of players who start more. This probably indicates that the majority of players in the data are either predominantly bench players, who get very little playing time, or role players who play in spurts and don’t get consistent playing time.
  • In the distribution of minutes played, NBA players on average get approximately twenty minutes of playing time per game, with some players getting more and others getting less.

Shooting Performance (Field Goals)

The variables field goals made and field goals attempted seem to resemble a positively (right) skewed distribution, while the variables field goals percentage and effective field goals percentage seem to mostly resemble a normal distribution, albeit with some outliers on both the left and right sides.

The means of the variables field goals made, field goals attempted, field goals percentage, and effective field goals percentage are 2.869951, 6.386576, 0.426235, and 0.488293 respectively, while the standard deviations are 2.223988, 4.651121, 0.148525, and 0.155930 respectively.

The distributions of the variables field goals made, field goals attempted, field goal percentage, and effective field goal percentage imply the following:

  • In the distribution of field goals made, most NBA players make approximately zero to five field goals per game, with a long tail of players who make more.
  • In the distribution of field goals attempted, most NBA players attempt approximately zero to ten field goals per game, with a long tail of players who attempt more.
  • In the distribution of field goal percentage, NBA players on average have a field goal percentage of approximately 0.50 (50%), with some players higher and some lower. It’s worth noting that approximately forty players have a field goal percentage of zero (0%), while very few players have a field goal percentage of exactly one (100%).
  • In the distribution of effective field goal percentage, NBA players on average have an effective field goal percentage of approximately 0.50 (50%), with some players higher and some lower. It’s worth noting that approximately forty players have an effective field goal percentage of zero (0%), while very few players have an effective field goal percentage of exactly one (100%).

Shooting Performance (Three Pointers)

The variables three pointers made and three pointers attempted seem to resemble a positively (right) skewed distribution, while the variable three point percentage seems to mostly resemble a normal distribution, albeit with some outliers on both the left and right sides.

The means of the variables three pointers made, three pointers attempted, and three point percentage are 0.871305, 2.560591, and 0.276538 respectively, while the standard deviations are 0.841935, 2.205642, and 0.157579 respectively.

The distributions of the variables three pointers made, three pointers attempted, and three point percentage imply the following:

  • In the distribution of three pointers made, most NBA players make approximately zero to one three pointer per game, with a long tail of players who make more.
  • In the distribution of three pointers attempted, most NBA players attempt approximately zero to four three pointers per game, with a long tail of players who attempt more.
  • In the distribution of three point percentage, NBA players on average have a three point percentage of approximately 0.40 (40%), with some players higher and some lower. It’s worth noting that approximately 140 players have a three point percentage of zero (0%), while very few players have a three point percentage of exactly one (100%).

Shooting Performance (Two Pointers)

The variables two pointers made and two pointers attempted seem to resemble a positively (right) skewed distribution, while the variable two point percentage seems to mostly resemble a normal distribution, albeit with some outliers on both the left and right sides.

The means of the variables two pointers made, two pointers attempted, and two point percentage are 2.000123, 3.828695, and 0.488091 respectively, while the standard deviations are 1.762505, 3.192736, and 0.180538 respectively.

The distributions of the variables two pointers made, two pointers attempted, and two point percentage imply the following:

  • In the distribution of two pointers made, most NBA players make approximately zero to two two pointers per game, with a long tail of players who make more.
  • In the distribution of two pointers attempted, most NBA players attempt approximately zero to five two pointers per game, with a long tail of players who attempt more.
  • In the distribution of two point percentage, NBA players on average have a two point percentage of approximately 0.50 (50%), with some players higher and some lower. It’s worth noting that approximately fifty players have a two point percentage of zero (0%), while approximately twenty players have a two point percentage of exactly one (100%).

Shooting Performance (Free Throws)

The variables free throws made and free throws attempted seem to resemble a positively (right) skewed distribution, while the variable free throw percentage seems to mostly resemble a negatively (left) skewed distribution, albeit with some outliers present on the right hand side.

The means of the variables free throws made, free throws attempted, and free throw percentage are 1.204433, 1.575246, and 0.658267 respectively, while the standard deviations are 1.287991, 1.585894, and 0.283491 respectively.

The distributions of the variables free throws made, free throws attempted, and free throw percentage imply the following:

  • In the distribution of free throws made, most NBA players make approximately zero to two free throws per game, with a long tail of players who make more.
  • In the distribution of free throws attempted, most NBA players attempt approximately zero to three free throws per game, with a long tail of players who attempt more.
  • In the distribution of free throw percentage, most NBA players have a free throw percentage of approximately 0.8 (80%), with a long tail of players who have a lower free throw percentage. It’s worth noting that a little more than 100 players have a free throw percentage of 0 (0%).

Rebounding Performance (Offensive, Defensive, Total)

The variables offensive rebounds, defensive rebounds, and total rebounds all seem to resemble a positively (right) skewed distribution.

The means of the variables offensive rebounds, defensive rebounds, and total rebounds are 0.812931, 2.519828, and 3.331650 respectively, while the standard deviations are 0.744196, 1.790656, and 2.352818 respectively.

The distributions of the variables offensive rebounds, defensive rebounds, and total rebounds imply the following:

  • In the distribution of offensive rebounds, most NBA players get approximately zero to one offensive rebound per game, with a long tail of players who get more.
  • In the distribution of defensive rebounds, most NBA players get approximately zero to four defensive rebounds per game, with a long tail of players who get more.
  • In the distribution of total rebounds, most NBA players get approximately zero to six total rebounds per game, with a long tail of players who get more.

Other Metrics (Assists, Steals, Blocks, Turnovers, Personal Fouls Drawn)

The variables assists, steals, blocks, and turnovers seem to resemble a positively (right) skewed distribution while the variable personal fouls drawn seems to mostly resemble a normal distribution.

The means of the variables assists, steals, blocks, turnovers, and personal fouls drawn are 1.808251, 0.582759, 0.353571, 0.978695, and 1.564655 respectively, while the standard deviations are 1.838080, 0.425452, 0.360811, 0.817941, and 0.826783 respectively.

The distributions of the variables assists, steals, blocks, turnovers, and personal fouls drawn imply the following:

  • In the distribution of assists, most NBA players average approximately zero to four assists per game, with a long tail of players who get more.
  • In the distribution of steals, most NBA players get approximately zero to one steal per game, with a long tail of players who get more.
  • In the distribution of blocks, most NBA players get approximately zero to one block per game, with a long tail of players who get more.
  • In the distribution of turnovers, most NBA players have approximately zero to two turnovers per game, with a long tail of players who have more.
  • In the distribution of personal fouls drawn, NBA players on average draw approximately two personal fouls per game, with some players drawing more and some drawing fewer.

Histograms — Postseason

Playing Time

The variables games played and games started seem to resemble a positively (right) skewed distribution while the variable minutes played seems to resemble a multimodal distribution.

The means of the variables games played, games started, and minutes played are 8.714286, 4.009217, and 19.429032 respectively, while the standard deviations are 5.802412, 5.944178, and 12.879892 respectively.

The distributions of the variables games played, games started, and minutes played imply the following:

  • In the distribution of games played, most NBA players play approximately zero to ten games, with a long tail of players who play more.
  • In the distribution of games started, most NBA players start approximately zero to five games, with a long tail of players who start more.
  • In the distribution of minutes played, multiple peaks are present: one between approximately zero and five minutes, and others at approximately ten, twenty-five, and forty minutes.

Shooting Performance (Field Goals)

The variables field goals made and field goals attempted seem to resemble a positively (right) skewed distribution, while the variables field goals percentage and effective field goals percentage seem to mostly resemble a normal distribution, albeit with some outliers on both the left and right sides.

The means of the variables field goals made, field goals attempted, field goals percentage, and effective field goals percentage are 3.045161, 6.737788, 0.437516, and 0.504396 respectively, while the standard deviations are 2.699843, 5.865455, 0.184581, and 0.205770 respectively.

The distributions of the variables field goals made, field goals attempted, field goal percentage, and effective field goal percentage imply the following:

  • In the distribution of field goals made, most NBA players make approximately zero to five field goals per game, with a long tail of players who make more.
  • In the distribution of field goals attempted, most NBA players attempt approximately zero to ten field goals per game, with a long tail of players who attempt more.
  • In the distribution of field goal percentage, NBA players on average have a field goal percentage of approximately 0.50 (50%), with some players higher and some lower. It’s worth noting that approximately fifteen players have a field goal percentage of zero (0%), while a little more than five players have a field goal percentage of exactly one (100%).
  • In the distribution of effective field goal percentage, NBA players on average have an effective field goal percentage of approximately 0.50 (50%), with some players higher and some lower. It’s worth noting that approximately fifteen players have an effective field goal percentage of zero (0%), five players have an effective field goal percentage of exactly one (100%), and very few players have an effective field goal percentage of 1.5 (150%).
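An effective field goal percentage of 150% is not a data error. The standard definition, eFG% = (FG + 0.5 × 3P) / FGA, counts a made three pointer as 1.5 made field goals, so a player whose only attempts were made threes ends up at exactly 1.5. A small sketch of that arithmetic (the function name is illustrative, not from the original project):

```python
def effective_fg_pct(fg_made, three_made, fg_attempted):
    """eFG% = (FG + 0.5 * 3P) / FGA; returns None when there are no attempts."""
    if fg_attempted == 0:
        return None
    return (fg_made + 0.5 * three_made) / fg_attempted


# One attempt in total, and it was a made three pointer.
print(effective_fg_pct(fg_made=1, three_made=1, fg_attempted=1))  # 1.5

# Four made two pointers on eight attempts: identical to plain FG%.
print(effective_fg_pct(fg_made=4, three_made=0, fg_attempted=8))  # 0.5
```

Such extreme values come from players with very few attempts, which is more common in the smaller postseason sample.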

Shooting Performance (Three Pointers)

The variables three pointers made and three pointers attempted seem to resemble a positively (right) skewed distribution, while the variable three point percentage seems to mostly resemble a normal distribution, albeit with some outliers on both the left and right sides.

The means of the variables three pointers made, three pointers attempted, and three point percentage are 0.934562, 2.700461, and 0.273318 respectively, while the standard deviations are 0.969323, 2.538454, and 0.201061 respectively.

The distributions of the variables three pointers made, three pointers attempted, and three point percentage imply the following:

  • In the distribution of three pointers made, most NBA players make approximately zero to one three pointer per game, with a long tail of players who make more.
  • In the distribution of three pointers attempted, most NBA players attempt approximately zero to four three pointers per game, with a long tail of players who attempt more.
  • In the distribution of three point percentage, multiple peaks are present. In particular, peaks are present at approximately 0 (0%), 0.2–0.4 (20% to 40%), and 0.4 (40%).

Shooting Performance (Two Pointers)

The variables two pointers made and two pointers attempted seem to resemble a positively (right) skewed distribution, while the variable two point percentage seems to mostly resemble a normal distribution, albeit with some outliers on both the left and right sides.

The means of the variables two pointers made, two pointers attempted, and two point percentage are 2.111982, 4.037788, and 0.495000 respectively, while the standard deviations are 2.160235, 4.094277, and 0.222443 respectively.

The distributions of the variables two pointers made, two pointers attempted, and two point percentage imply the following:

  • In the distribution of two pointers made, most NBA players make approximately zero to two two pointers per game, with a long tail of players who make more.
  • In the distribution of two pointers attempted, most NBA players attempt approximately zero to five two pointers per game, with a long tail of players who attempt more.
  • In the distribution of two point percentage, NBA players on average have a two point percentage of approximately 0.50 (50%), with some players higher and some lower. It’s worth noting that a little more than twenty players have a two point percentage of zero (0%), while ten players have a two point percentage of exactly one (100%).

Shooting Performance (Free Throws)

The variables free throws made and free throws attempted seem to resemble a positively (right) skewed distribution, while the variable free throw percentage seems to mostly resemble a negatively (left) skewed distribution, albeit with some outliers present on the right hand side.

The means of the variables free throws made, free throws attempted, and free throw percentage are 1.442396, 1.833180, and 0.623249 respectively, while the standard deviations are 1.795919, 2.216793, and 0.344243 respectively.

The distributions of the variables free throws made, free throws attempted, and free throw percentage imply the following:

  • In the distribution of free throws made, most NBA players make approximately zero to two free throws per game, with a long tail of players who make more.
  • In the distribution of free throws attempted, most NBA players attempt approximately zero to three free throws per game, with a long tail of players who attempt more.
  • In the distribution of free throw percentage, most NBA players have a free throw percentage of approximately 0.8 (80%), with a long tail of players who have a lower free throw percentage. It’s worth noting that a little more than forty players have a free throw percentage of 0 (0%).

Rebounding Performance (Offensive, Defensive, Total)

The variables offensive rebounds, defensive rebounds, and total rebounds all seem to resemble a positively (right) skewed distribution.

The means of the variables offensive rebounds, defensive rebounds, and total rebounds are 0.773733, 2.626728, and 3.404147 respectively, while the standard deviations are 0.848965, 2.203811, and 2.842360 respectively.

The distributions of the variables offensive rebounds, defensive rebounds, and total rebounds imply the following:

  • In the distribution of offensive rebounds, most NBA players get approximately zero to one offensive rebound per game, with a long tail of players who get more.
  • In the distribution of defensive rebounds, most NBA players get approximately zero to four defensive rebounds per game, with a long tail of players who get more.
  • In the distribution of total rebounds, most NBA players get approximately zero to five total rebounds per game, with a long tail of players who get more.

Other Metrics (Assists, Steals, Blocks, Turnovers, Personal Fouls Drawn)

The variables assists, steals, blocks, and turnovers seem to resemble a positively (right) skewed distribution while the variable personal fouls drawn seems to mostly resemble a multimodal distribution.

The means of the variables assists, steals, blocks, turnovers, and personal fouls drawn are 1.828571, 0.584793, 0.361290, 1.085714, and 1.784332 respectively, while the standard deviations are 2.007120, 0.499629, 0.453458, 1.126790, and 1.175811 respectively.

The distributions of the variables assists, steals, blocks, turnovers, and personal fouls drawn imply the following:

  • In the distribution of assists, most NBA players average approximately zero to four assists per game, with a long tail of players who get more.
  • In the distribution of steals, most NBA players get approximately zero to one steal per game, with a long tail of players who get more.
  • In the distribution of blocks, most NBA players get approximately zero to one block per game, with a long tail of players who get more.
  • In the distribution of turnovers, most NBA players have approximately zero to two turnovers per game, with a long tail of players who have more.
  • In the distribution of personal fouls drawn, multiple peaks are present. In particular, peaks are present at approximately zero, two, and three personal fouls drawn.

Correlation Analysis

Correlation matrices were generated to better understand the relationship between the variables of interest and the dependent (response) variable, points per game (PPG).

The correlation matrices will also be crucial in determining which variables of interest best predict PPG. In other words, the correlation matrices will be used to determine which variables of interest will end up being the independent variables in the regression model.

It’s also worth noting that variables whose correlation with PPG is either greater than 0.3 or less than -0.3 were treated as suitable for predicting PPG, since a correlation of 0.3 indicates a moderate positive relationship while a correlation of -0.3 indicates a moderate negative relationship.
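The 0.3 screening rule can be sketched as follows. This is a minimal illustration on made-up data, not the project's code: `select_predictors` is a hypothetical helper, and the columns `Minutes` and `Unrelated` are invented for the demo.

```python
import numpy as np
import pandas as pd


def select_predictors(df, target, threshold=0.3):
    """Return columns whose Pearson correlation with `target` exceeds
    the threshold in absolute value, sorted from highest to lowest."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    return corr[corr.abs() > threshold].sort_values(ascending=False)


# Made-up frame: PTS rises with Minutes, while Unrelated just alternates sign.
demo = pd.DataFrame(
    {
        "Minutes": np.arange(10.0),
        "Unrelated": [1, -1] * 5,
    }
)
demo["PTS"] = 2 * demo["Minutes"] + 1

selected = select_predictors(demo, target="PTS")
print(selected)
```

Taking the absolute value covers both the greater-than-0.3 and less-than-negative-0.3 cases in one comparison. A screen like this looks at each variable on its own, so it does not account for predictors that are strongly correlated with one another.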

Regular Season

Playing Time

For playing time, the correlation between PPG and the following variables was determined: games played, games started, and minutes played.

As indicated by their respective correlation values, games played, games started, and minutes played all have a strong positive relationship with PPG.

As all three correlation values are greater than 0.3, games played, games started, and minutes played are suitable variables for predicting PPG.

Shooting Performance (Field Goals)

For shooting performance (field goals), the correlation between PPG and the following variables was determined: field goals made, field goals attempted, field goal percentage, and effective field goal percentage.

As indicated by their respective correlation values, field goals made and field goals attempted both have a strong positive relationship with PPG, while field goal percentage and effective field goal percentage have a moderate positive relationship with PPG.

As all four correlation values are greater than 0.3, field goals made, field goals attempted, field goal percentage, and effective field goal percentage are suitable variables for predicting PPG.

Shooting Performance (Three Pointers)

For shooting performance (three pointers), the correlation between PPG and the following variables was determined: three pointers made, three pointers attempted, and three point percentage.

As indicated by their respective correlation values, three pointers made and three pointers attempted both have a strong positive relationship with PPG, while three point percentage has a moderate positive relationship with PPG.

As all three correlation values are greater than 0.3, three pointers made, three pointers attempted, and three point percentage are suitable variables for predicting PPG.

Shooting Performance (Two Pointers)

For shooting performance (two pointers), the correlation between PPG and the following variables was determined: two pointers made, two pointers attempted, and two point percentage.

As indicated by their respective correlation values, two pointers made and two pointers attempted both have a strong positive relationship with PPG, while two point percentage has a weak positive relationship with PPG.

As the correlation values for two pointers made and two pointers attempted are both greater than 0.3, these two variables are suitable for predicting PPG. However, two point percentage is not a suitable variable for predicting PPG.

Shooting Performance (Free Throws)

For shooting performance (free throws), the correlation between PPG and the following variables was determined: free throws made, free throws attempted, and free throw percentage.

As indicated by their respective correlation values, free throws made, free throws attempted, and free throw percentage all have a strong positive relationship with PPG.

As the correlation values for free throws made, free throws attempted, and free throw percentage are all greater than 0.3, all three are suitable variables for predicting PPG.

Rebounding Performance (Offensive, Defensive, Total)

For rebounding performance (offensive, defensive, total), the correlation between PPG and the following variables was determined: offensive rebounds, defensive rebounds, and total rebounds.

As indicated by their respective correlation values, offensive rebounds and defensive rebounds both have a strong positive relationship with PPG, while total rebounds has a moderate positive relationship with PPG.

As the correlation values for offensive rebounds, defensive rebounds, and total rebounds are all greater than 0.3, all three are suitable variables for predicting PPG.

Other Metrics (Assists, Steals, Blocks, Turnovers, Personal Fouls Drawn)

For other metrics, the correlation between PPG and the following variables was determined: assists, steals, blocks, turnovers, and personal fouls drawn.

As indicated by their respective correlation values, assists, steals, blocks, turnovers, and personal fouls drawn all have a strong positive relationship with PPG.

As the correlation values for assists, steals, blocks, turnovers, and personal fouls drawn are all greater than 0.3, assists, steals, blocks, turnovers, and personal fouls drawn are suitable variables for predicting PPG.

Postseason

Playing Time

For playing time, the correlation between PPG and the following variables was determined: games played, games started, and minutes played.

As indicated by their respective correlation values, games started and minutes played both have a strong positive relationship with PPG, but games played has a weak positive relationship with PPG.

As the correlation values for games started and minutes played are both greater than 0.3, these two variables are suitable for predicting PPG. However, games played is not a suitable variable for predicting PPG.

Shooting Performance (Field Goals)

For shooting performance (field goals), the correlation between PPG and the following variables was determined: field goals made, field goals attempted, field goal percentage, and effective field goal percentage.

As indicated by their respective correlation values, field goals made and field goals attempted both have a strong positive relationship with PPG, while field goal percentage and effective field goal percentage have a negligible relationship with PPG.

As the correlation values for field goals made and field goals attempted are both greater than 0.3, these two variables are suitable for predicting PPG. However, field goal percentage and effective field goal percentage are not suitable variables for predicting PPG.

Shooting Performance (Three Pointers)

For shooting performance (three pointers), the correlation between PPG and the following variables was determined: three pointers made, three pointers attempted, and three point percentage.

As indicated by their respective correlation values, three pointers made and three pointers attempted both have a strong positive relationship with PPG, while three point percentage has a weak positive relationship with PPG.

As the correlation values for three pointers made and three pointers attempted are both greater than 0.3, three pointers made and three pointers attempted are suitable variables for predicting PPG. However, three point percentage is not a suitable variable for predicting PPG.

Shooting Performance (Two Pointers)

For shooting performance (two pointers), the correlation between PPG and the following variables was determined: two pointers made, two pointers attempted, and two point percentage.

As indicated by their respective correlation values, two pointers made and two pointers attempted both have a strong positive relationship with PPG, while two point percentage has a weak positive relationship with PPG.

As the correlation values for two pointers made and two pointers attempted are both greater than 0.3, two pointers made and two pointers attempted are suitable variables for predicting PPG. However, two point percentage is not a suitable variable for predicting PPG.

Shooting Performance (Free Throws)

For shooting performance (free throws), the correlation between PPG and the following variables was determined: free throws made, free throws attempted, and free throw percentage.

As indicated by their respective correlation values, free throws made, free throws attempted, and free throw percentage all have a strong positive relationship with PPG.

As the correlation values for free throws made, free throws attempted, and free throw percentage are all greater than 0.3, free throws made, free throws attempted, and free throw percentage are suitable variables for predicting PPG.

Rebounding Performance (Offensive, Defensive, Team)

For rebounding performance (offensive, defensive, team), the correlation between PPG and the following variables was determined: offensive rebounds, defensive rebounds, and team rebounds.

As indicated by their respective correlation values, offensive rebounds and defensive rebounds both have a strong positive relationship with PPG, while team rebounds has a moderate positive relationship with PPG.

As the correlation values for offensive rebounds, defensive rebounds, and team rebounds are all greater than 0.3, offensive rebounds, defensive rebounds, and team rebounds are suitable variables for predicting PPG.

Other (Assists, Steals, Blocks, Turnovers, Personal Fouls Drawn)

For other metrics, the correlation between PPG and the following variables was determined: assists, steals, blocks, turnovers, and personal fouls drawn.

As indicated by their respective correlation values, assists, steals, blocks, turnovers, and personal fouls drawn all have a strong positive relationship with PPG.

As all five correlation values are greater than 0.3, assists, steals, blocks, turnovers, and personal fouls drawn are suitable variables for predicting PPG.

Regression Analysis

Now that the EDA portion has been completed, the last step is to perform a regression analysis to determine which model best predicts points per game (PPG), the dependent (response) variable.

Based on the results of the correlation analysis, a total of twenty-three independent (predictor) variables were chosen for predicting regular season PPG:

  • games played, games started, minutes played
  • field goals made, field goals attempted, field goal percentage, effective field goal percentage
  • three pointers made, three pointers attempted, three point percentage
  • two pointers made, two pointers attempted
  • free throws made, free throws attempted, free throw percentage
  • offensive rebounds, defensive rebounds, team rebounds
  • assists, steals, blocks, turnovers, and personal fouls drawn.

Likewise, based on the results of the correlation analysis, the following variables were chosen as independent (predictor) variables for predicting postseason PPG:

  • games started, minutes played
  • field goals made, field goals attempted
  • three pointers made, three pointers attempted
  • two pointers made, two pointers attempted
  • free throws made, free throws attempted, free throw percentage
  • offensive rebounds, defensive rebounds, team rebounds
  • assists, steals, blocks, turnovers, and personal fouls drawn.
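The two predictor lists above can be written down as a small configuration, which also makes it easy to check the variable counts. This is a sketch under assumptions: the mapping from the article's variable names to Basketball Reference's column abbreviations (for example TRB for the article's "team rebounds" and PF for its "personal fouls drawn") is assumed, and the group names and the helper flatten are hypothetical.

```python
# Predictor groups for the regular season models. Column abbreviations are
# ASSUMED to follow Basketball Reference's per-game table headers.
REGULAR_SEASON_GROUPS = {
    "playing_time": ["G", "GS", "MP"],
    "field_goals": ["FG", "FGA", "FG%", "eFG%"],
    "three_pointers": ["3P", "3PA", "3P%"],
    "two_pointers": ["2P", "2PA"],
    "free_throws": ["FT", "FTA", "FT%"],
    "rebounding": ["ORB", "DRB", "TRB"],
    "other": ["AST", "STL", "BLK", "TOV", "PF"],
}

# Variables that appear in the regular season list but not the postseason list.
POSTSEASON_DROPPED = {"G", "FG%", "eFG%", "3P%"}

POSTSEASON_GROUPS = {
    name: [col for col in cols if col not in POSTSEASON_DROPPED]
    for name, cols in REGULAR_SEASON_GROUPS.items()
}

def flatten(groups):
    """Concatenate every group's columns into one flat predictor list."""
    return [col for cols in groups.values() for col in cols]

regular_season = flatten(REGULAR_SEASON_GROUPS)
postseason = flatten(POSTSEASON_GROUPS)

print(len(regular_season), "regular season predictors")
print(len(postseason), "postseason predictors")
```

Keeping the groups in one place means a single group (for example only the field goal columns) or the full set can be passed to a model without retyping column names.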

As part of the regression analysis, a total of fourteen initial models were created to predict PPG for both the regular season and postseason: seven models for predicting regular season PPG (Models 1–7) and seven models for predicting postseason PPG (Models 8–14).

Six further models were then created to evaluate whether the initial model performance could be improved: three models for predicting regular season PPG (Models 15–17) and another three models for predicting postseason PPG (Models 18–20).

In total, twenty such models were created to predict regular season and postseason PPG.

To evaluate model performance, two regression metrics were used: RMSE and R-Squared. Ideally, the model with the lowest RMSE and the highest R-Squared is the model of choice for predicting regular season and postseason PPG.

A low RMSE indicates that the model's predictions are close to the observed PPG values, while a high R-Squared indicates that the model explains a large share of the variation in PPG. Since many models were formulated as part of the analysis, only the best performing models with respect to RMSE and R-Squared will be highlighted.
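For reference, both metrics can be computed directly from observed and predicted values. The sketch below uses the standard definitions (RMSE as the square root of the mean squared error, R-Squared as one minus the ratio of residual to total sum of squares); the function names and the three sample values are made up for illustration and are not taken from the project.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: typical size of a prediction error, in PPG units."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))

def r_squared(actual, predicted):
    """Share of the variation in the actual values explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Invented values for illustration only.
actual_ppg = [10.0, 20.0, 30.0]
predicted_ppg = [12.0, 18.0, 30.0]

print(f"RMSE: {rmse(actual_ppg, predicted_ppg):.2f}")
print(f"R-Squared: {r_squared(actual_ppg, predicted_ppg):.2%}")
```

Because RMSE is expressed in the same units as the response, an RMSE of 0.07 means predictions are typically within about 0.07 points per game of the observed value.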

Regular Season PPG

With respect to R-Squared and RMSE, Model 15 (PPG vs Entire Shooting Performance) was the best performing model for predicting regular season PPG as it had the highest R-Squared (99.99%) and the lowest RMSE (0.07).

Model 17 (PPG vs Variables from Best Performing Models based on R-Squared) and Model 2 (PPG vs Shooting Performance (Field Goals)) also fit the data well, as indicated by their respective R-Squared and RMSE values: 98.67% and 0.71 (Model 17), 98.55% and 0.74 (Model 2).

On the other hand, the worst performing model for predicting regular season PPG was Model 3 (PPG vs Shooting Performance (Three Pointers)), as indicated by its respective R-Squared and RMSE values: 53.98% and 4.16.

The implications of the R-Squared values of Models 15, 16, 17, 2, and 3 are as follows:

  • Model 15: 99.99% of the variation in the dependent variable PTS can be explained in terms of the independent variables: field goals made, field goals attempted, field goal percentage, effective field goal percentage, three pointers made, three pointers attempted, three point percentage, two pointers made, two pointers attempted, free throws made, free throws attempted, free throw percentage.
  • Model 16: 99.99% of the variation in the dependent variable PTS can be explained in terms of the independent variables: games played, games started, minutes played, field goals made, field goals attempted, field goal percentage, effective field goal percentage, three pointers made, three pointers attempted, three point percentage, two pointers made, two pointers attempted, free throws made, free throws attempted, free throw percentage, offensive rebounds, defensive rebounds, team rebounds, assists, turnovers, blocks, personal fouls drawn, steals.
  • Model 17: 98.67% of the variation in the dependent variable PTS can be explained in terms of the independent variables: games played, games started, minutes played, field goals made, field goals attempted, field goal percentage, effective field goal percentage, two pointers made, two pointers attempted.
  • Model 2: 98.55% of the variation in the dependent variable PTS can be explained in terms of the independent variables: field goals made, field goals attempted, field goal percentage, effective field goal percentage.
  • Model 3: 53.98% of the variation in the dependent variable PTS can be explained in terms of the independent variables: three pointers made, three pointers attempted, three point percentage.
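A likely reason the models that include the full set of shooting variables reach an R-Squared of 99.99% is that points are an exact linear function of made shots: every made two pointer is worth 2 points, every made three pointer 3 points, and every made free throw 1 point. The article does not state this explicitly, so the sketch below is an interpretation rather than part of the original analysis; the function name and the toy stat lines are invented for illustration.

```python
def points_from_makes(two_pointers_made, three_pointers_made, free_throws_made):
    """Scoring identity: PTS = 2 * 2P + 3 * 3P + 1 * FT."""
    return 2 * two_pointers_made + 3 * three_pointers_made + free_throws_made

# Invented per-game stat lines for illustration (NOT the article's data).
toy_lines = [
    {"2P": 6.0, "3P": 2.0, "FT": 4.0, "PTS": 22.0},
    {"2P": 3.5, "3P": 1.5, "FT": 2.0, "PTS": 13.5},
    {"2P": 1.0, "3P": 0.5, "FT": 0.5, "PTS": 4.0},
]

# With all three made-shot columns available, nothing is left unexplained.
residuals = [
    row["PTS"] - points_from_makes(row["2P"], row["3P"], row["FT"])
    for row in toy_lines
]
print("Residuals with all made-shot columns:", residuals)

# A model restricted to three pointers only sees the 3 * 3P term,
# so the points from twos and free throws remain unexplained.
unexplained = [row["PTS"] - 3 * row["3P"] for row in toy_lines]
print("Points not attributable to three pointers:", unexplained)
```

Under this reading, a linear regression that has two pointers made, three pointers made, and free throws made among its predictors can recover the identity almost exactly, while a model limited to three point variables (such as Model 3) cannot. Any small remaining error would be consistent with rounding in the per-game values, though that is also an assumption.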

Postseason PPG

Likewise, with respect to R-Squared and RMSE, Model 18 (PPG vs Entire Shooting Performance), Model 19 (PPG vs Playing Time, Shooting, Rebounding, Other Metrics), and Model 20 (PPG vs Variables from Best Performing Models based on R-Squared) were the best performing models for predicting postseason PPG, as all three shared the highest R-Squared (99.99%) and had the lowest RMSE values (0.06 for Models 18 and 19, 0.07 for Model 20).

Model 9 (PPG vs Shooting Performance (Field Goals)) also fit the data well, as indicated by its respective R-Squared and RMSE values: 98.38% and 0.97.

On the other hand, the worst performing model for predicting postseason PPG was Model 10 (PPG vs Shooting Performance (Three Pointers)), as indicated by its respective R-Squared and RMSE values: 56.49% and 5.02.

The implications of the R-Squared values of Models 18, 19, 20, 9, and 10 are as follows:

  • Model 18: 99.99% of the variation in the dependent variable PTS can be explained in terms of the independent variables: field goals made, field goals attempted, three pointers made, three pointers attempted, two pointers made, two pointers attempted, free throws made, free throws attempted, free throw percentage.
  • Model 19: 99.99% of the variation in the dependent variable PTS can be explained in terms of the independent variables: games started, minutes played, field goals made, field goals attempted, three pointers attempted, three pointers made, two pointers made, two pointers attempted, free throws made, free throws attempted, free throw percentage, offensive rebounds, defensive rebounds, team rebounds, assists, turnovers, blocks, personal fouls drawn, steals.
  • Model 20: 99.99% of the variation in the dependent variable PTS can be explained in terms of the independent variables: field goals made, field goals attempted, two pointers made, two pointers attempted, free throws made, free throws attempted, free throw percentage.
  • Model 9: 98.38% of the variation in the dependent variable PTS can be explained in terms of the independent variables: field goals made and field goals attempted.
  • Model 10: 56.49% of the variation in the dependent variable PTS can be explained in terms of the independent variables: three pointers attempted and three pointers made.
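The selection rule used in this analysis (lowest RMSE, highest R-Squared) can be applied mechanically to the figures reported above. The sketch below copies the R-Squared and RMSE values quoted in this article; Model 16 is left out because its RMSE is not quoted here, and the helper name rank_models is hypothetical.

```python
# (R-Squared, RMSE) pairs as reported in the article.
reported = {
    "regular season": {
        "Model 15": (0.9999, 0.07),
        "Model 17": (0.9867, 0.71),
        "Model 2": (0.9855, 0.74),
        "Model 3": (0.5398, 4.16),
    },
    "postseason": {
        "Model 18": (0.9999, 0.06),
        "Model 19": (0.9999, 0.06),
        "Model 20": (0.9999, 0.07),
        "Model 9": (0.9838, 0.97),
        "Model 10": (0.5649, 5.02),
    },
}

def rank_models(results):
    """Order model names by lowest RMSE, breaking ties by highest R-Squared."""
    return sorted(results, key=lambda name: (results[name][1], -results[name][0]))

ranking = {season: rank_models(models) for season, models in reported.items()}

for season, names in ranking.items():
    print(f"{season}: best = {names[0]}, worst = {names[-1]}")
```

Models 18 and 19 tie on both metrics, so the sort keeps them in their listed order; a further criterion such as the number of predictors would be needed to separate them.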
