Lessons From NBA Play-By-Play Data — Part I (Basics)

Cooperchia
6 min read · Sep 26, 2020


For an avid NBA fan, the box score is by far the quickest and easiest way to understand what happened in a game. Points, rebounds, assists, and even slightly more advanced stats like +/- can be gleaned from it, for both individual players and teams. But what can’t the box score tell us?

Let’s start with Points Per Possession (PPP), a foundation of any league-wide, team, or individual analysis. As a measure of scoring ability, PPP is more accurate than points per game or points per minute because it accounts for differences in minutes played per game (at the individual level) and for variation in pace of play (at both the individual and team levels).

So how do we calculate PPP? Calculating points is easy, but possessions are more difficult to assess. There is still no official way of determining what constitutes a possession (seriously), so I decided to use a formula I saw used by Nylon Calculus a few years back:

Total Possessions = FGAs + FT Pairs and Triplets + TOVs - ORBs

*FGA = Field Goal Attempt

*FT = Free Throw

*TOV = Turnover

*ORB = Offensive Rebound

The first 3 terms of the formula increment the number of possessions, while an offensive rebound decrements the counter since it extends a team’s possession. Notice how only FT Pairs and Triplets are counted, as and-1 FTs would overcount a possession since FGAs are already accounted for. Additionally, our ORB counter must be tweaked slightly to avoid counting team-offensive-rebounds that occur on missed FTs and at the end of quarters (more details can be found here).

You’ve probably guessed that the formula for possessions contains too many subtleties to be computed using only the box score.

Enter NBA Play-By-Play (PBP) Data.

Play-by-play data records every single play that occurs in every basketball game. The dataset I used (via Kaggle) contains full play-by-play data for the last 4 seasons (2016–2017 to 2019–2020).

To calculate PPP, I implemented the formula above using Python and pandas; the code is available in the GitHub repository linked at the end of this article.

*Note: All analysis done for this article was league-wide analysis.
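
As a rough illustration of the possession formula (this is not the code from the repository), a minimal sketch might look like the following. It assumes a simplified, hypothetical event table with columns `event`, `points`, `ft_num`, and `ft_total`; the Kaggle dataset's actual schema would need to be mapped onto these columns first. It also assumes the `orb` rows already exclude the team rebounds on missed FTs and at the end of quarters mentioned above.

```python
import pandas as pd


def count_possessions(pbp: pd.DataFrame) -> int:
    """Total Possessions = FGAs + FT pairs and triplets + TOVs - ORBs."""
    fga = pbp["event"].eq("fga").sum()
    tov = pbp["event"].eq("tov").sum()
    orb = pbp["event"].eq("orb").sum()
    # Count each two- or three-shot trip once, on its first free throw.
    # Single free throws (such as and-1s) have ft_total == 1 and are skipped.
    ft_trips = (
        pbp["event"].eq("ft") & pbp["ft_num"].eq(1) & pbp["ft_total"].ge(2)
    ).sum()
    return int(fga + ft_trips + tov - orb)


def points_per_possession(pbp: pd.DataFrame) -> float:
    return float(pbp["points"].sum() / count_possessions(pbp))


# Hypothetical toy sequence for illustration, not real game data.
toy = pd.DataFrame(
    [
        ("fga", 2, 0, 0),  # made 2-pointer
        ("fga", 0, 0, 0),  # missed shot
        ("orb", 0, 0, 0),  # offensive rebound extends the possession
        ("fga", 3, 0, 0),  # made 3-pointer
        ("tov", 0, 0, 0),  # turnover
        ("ft", 1, 1, 2),   # first of two free throws, made
        ("ft", 0, 2, 2),   # second of two free throws, missed
        ("fga", 2, 0, 0),  # made 2-pointer with a foul
        ("ft", 1, 1, 1),   # and-1 free throw, not a new possession
    ],
    columns=["event", "points", "ft_num", "ft_total"],
)

print(count_possessions(toy))      # 4 FGAs + 1 FT pair + 1 TOV - 1 ORB
print(points_per_possession(toy))  # 9 points over those possessions
```

Counting a free-throw trip on its first attempt is just one way to avoid double counting; any rule that registers each pair or triplet exactly once would serve the same purpose.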

I got the league-wide points per possession to be as follows:

2019–2020: 1.1122

2018–2019: 1.1110

2017–2018: 1.0935

2016–2017: 1.0951

For reference, these numbers are slightly higher than PPP calculated on NBA.com or ESPN, both of which use estimates derived from box score (not PBP) stats.

After calculating PPP, I was curious as to how this number would vary depending on different types of field goals attempted within a possession. Are 3-pointers really more valuable than 2-pointers? Is the midrange dead? To determine this, I first used the play-by-play data to compute the number of FG attempts and number of FG makes from different distances from the basket (removing end-of-quarter heaves from consideration). After gathering the necessary shot data, I used Matplotlib to visualize the results from each of the previous 4 seasons:

First, we have the number of shots attempted from different distances, with 2-pointers in blue and 3-pointers in pink. We can see that, consistently, 2-pointers within 5 feet of the basket and 3-pointers in general are most popular (an interesting side note is the dramatic increase in the number of long 3s attempted, but we’ll leave that analysis for another day).
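
The attempt and make counts behind a chart like this could be tallied roughly as follows. This is only a sketch under assumptions: the column names `dist_ft` and `made` are hypothetical, and the 5-foot bands are my guess at a reasonable grouping based on the ranges discussed in this article. Shots beyond 30 feet fall outside every band and are dropped, which is a crude stand-in for removing end-of-quarter heaves.

```python
import pandas as pd

# Hypothetical distance bands in feet; pd.cut uses right-closed intervals,
# so integer distances 0-5 land in the first band, 6-10 in the second, etc.
BIN_EDGES = [-1, 5, 10, 15, 20, 25, 30]
BIN_LABELS = ["0-5", "6-10", "11-15", "16-20", "21-25", "26-30"]


def shot_summary(shots: pd.DataFrame) -> pd.DataFrame:
    """Attempts, makes, and FG% per distance band."""
    banded = shots.assign(
        band=pd.cut(shots["dist_ft"], bins=BIN_EDGES, labels=BIN_LABELS)
    )
    summary = banded.groupby("band", observed=False)["made"].agg(
        attempts="count", makes="sum"
    )
    # Empty bands give 0 / 0, which pandas reports as NaN.
    summary["fg_pct"] = summary["makes"] / summary["attempts"]
    return summary


# Hypothetical toy shots for illustration, not real data.
toy_shots = pd.DataFrame(
    {
        "dist_ft": [2, 3, 4, 8, 12, 24, 27, 45],
        "made": [True, True, False, False, True, True, False, False],
    }
)
summary = shot_summary(toy_shots)
print(summary)
```

To separate 2-pointers from 3-pointers as in the chart, a shot-type column could be added as a second grouping key.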

This would seem to support the theory that the midrange 2-point shot is becoming less and less relevant. Is there good reason for this? Let’s take a look at field goal percentages for each of the same classes of shots described above:

Perhaps these results are surprising to you, as they were to me. We can see that shots taken within 5 feet of the basket consistently yield around 60% accuracy, but from 6 feet to 30 feet, league-wide shooting percentages remain relatively constant (somewhere around 35–40%). When we account for the differences in point value for a 2-point make and 3-point make, we get the following expected values for each type of shot:

We can see that, with league-average shooting percentages, most 3-pointers indeed represent a much better value than all 2-pointers outside of 5 feet. Another interesting observation is the slightly better efficiency of 11–15 foot shots than 6–10 foot shots. I would assume there are a number of reasons for this, but, again, I’ll leave this analysis for a different day.
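
The expected value used here is simply FG% multiplied by the point value of the shot. A quick sketch of that arithmetic, using round percentages in the ranges quoted above rather than the exact measured values (the specific numbers below are illustrative assumptions):

```python
def expected_points(fg_pct: float, shot_value: int) -> float:
    """Expected points per attempt = probability of a make * points per make."""
    return fg_pct * shot_value


# Illustrative round numbers only, not the measured league percentages.
examples = {
    "2PT within 5 ft at 60%": expected_points(0.60, 2),
    "2PT midrange at 40%": expected_points(0.40, 2),
    "3PT at 35%": expected_points(0.35, 3),
}

for label, value in examples.items():
    print(f"{label}: {value:.2f} points per attempt")
```

With these inputs the three shots come out at 1.20, 0.80, and 1.05 points per attempt, which is why a 35% 3-pointer can beat a 40% midrange 2-pointer.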

Now that we have the expected value of each shot, I wondered, what if offensive rebounding rates varied for different shot distances? As we discussed earlier, an ORB extends a possession, so if more ORBs are grabbed for a certain type of shot, surely that would increase the shot’s value. So I wrote some more code to compute the percentage of misses that result in an ORB for each shot distance, yielding the following results:

The general trend seems to be that a miss closer to the basket is more likely to result in an offensive rebound. This seems to raise the value of layups even more, but wait. Let’s recall that a shot inside of 5 feet is much less likely to be missed than a shot outside of 5 feet. Instead, we should be looking at the percentage of shot attempts (not misses) that result in an ORB:

Surprised by the results? I was too. We can clearly see that 2-point attempts from 6–10 feet have the highest probability of generating an offensive rebound. Now how does this affect the expected values of shots we calculated earlier? To get an idea, I added the product of the per-attempt ORB% and league-wide PPP to the expected values determined previously, since an ORB extends/resets a possession. Of course, such a simple calculation misses some details. For example, 3-pointers most likely generate more long ORBs, which are probably less likely to lead to second-chance points than an ORB grabbed right under the basket. Nonetheless, this updated formula should still estimate how ORBs affect the value of a shot:

Nothing too surprising here. We can see that layups and 3-pointers are still the most valuable. However, I noticed that 6–10 foot shots are now slightly more valuable than 11–15 foot shots, contrary to the expected values computed without accounting for ORBs. This makes sense because we saw that 6–10 foot shot attempts are most likely to get offensively rebounded!
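
For readers who want to see the adjustment spelled out, here is a minimal sketch. It assumes every ORB follows a missed field goal, so that the per-attempt ORB rate equals the miss rate times the per-miss ORB rate. The FG% and ORB rate below are hypothetical placeholders, not the measured values; only the PPP figure (2019–2020) comes from this article.

```python
def orb_per_attempt(fg_pct: float, orb_per_miss: float) -> float:
    """Share of attempts that end in an ORB: P(miss) * P(ORB | miss)."""
    return (1.0 - fg_pct) * orb_per_miss


def adjusted_expected_points(
    fg_pct: float, shot_value: int, orb_per_miss: float, league_ppp: float
) -> float:
    """Expected points from the shot plus the value of a reset possession."""
    base = fg_pct * shot_value
    second_chance = orb_per_attempt(fg_pct, orb_per_miss) * league_ppp
    return base + second_chance


LEAGUE_PPP = 1.1122  # 2019-2020 league-wide PPP reported earlier

# Hypothetical shot profile: 40% FG, 30% of misses rebounded by the offense.
rate = orb_per_attempt(fg_pct=0.40, orb_per_miss=0.30)
value = adjusted_expected_points(
    fg_pct=0.40, shot_value=2, orb_per_miss=0.30, league_ppp=LEAGUE_PPP
)
print(f"ORB per attempt: {rate:.3f}")
print(f"Adjusted expected points: {value:.3f}")
```

Valuing every ORB at league-average PPP is the simplification noted above: it treats a long rebound off a 3-pointer the same as a putback chance under the rim.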

That’s it for Part I of my Play-By-Play Data Analysis. I hope I was able to demonstrate some of the basic but interesting things that can be uncovered using PBP data. The next steps of my analysis may include classifying shots by both distance AND location, team and individual analysis, and a more thorough investigation of points generated after an ORB. If you would like to check out the code I wrote for this analysis, you can find it here: https://github.com/crchia/nba-pbp-data-analysis.

If you have any questions or suggestions, I’d love to hear them in the comments. If you enjoyed the piece, go ahead and hit the clap button!
