Data Analysis on NBA comeback victories

Evangelos Zafeiratos
Jan 31, 2022

A data analysis framework with a complete ETL pipeline: Python for web scraping, Pandas for transforming and loading the data, and finally Visme for visualizing the key insights.

NBA Full Court View
Photo by Miltiadis Fragkidis on Unsplash

One of the few sports competitions I follow is the NBA, and besides enjoying the game of basketball I very often find myself digging into the stats of players or teams.
Last week, on January 25, there was an astonishing event, the kind statisticians call an outlier: the Los Angeles Clippers pulled off an amazing comeback from a 35-point deficit to win the game against the Washington Wizards.
As amazing as this result truly was, my gut feeling from watching the game for many years was that comeback victories are no longer as rare as they used to be. I decided to run a detailed analysis and put my theory under the microscope of exploratory analysis.

Gathering the Data

To get the data I needed for the analysis, I had to go to the source. https://www.basketball-reference.com/ is a website that hosts tons of basketball data, including play-by-play information for every single NBA game since the 1996–1997 season.

The next step was to create a script to pull all this data for further analysis. After carefully examining the website's sitemap, I created a Python function that builds a list of the game identifiers used in the game URLs for a single season:

import re
import requests
from bs4 import BeautifulSoup

def extractGameURLs(year):
    URLs_list = list()
    domain_URL = "https://www.basketball-reference.com/leagues/NBA_"
    year_URL = domain_URL + str(year) + "_games-"
    months = ['october', 'november', 'december', 'january', 'february', 'march', 'april', 'may', 'june', 'july']
    for month in months:
        # Each month of the season has its own schedule page
        URL = year_URL + month + ".html"
        page = requests.get(URL)
        soup = BeautifulSoup(page.content, "html.parser")
        schedule_div = soup.find(id="schedule")
        schedule_str = str(schedule_div)
        # We are looking to match the string "201410280SAS" that is used in creating the NBA boxscores URL
        regex = r'\d{9}\w{3}'
        match = re.findall(regex, schedule_str)
        # set() removes IDs that appear more than once on the page
        for item in set(match):
            URLs_list.append(item)
    URLs_list.sort()
    return URLs_list
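
To see what the regular expression picks up without hitting the website, here is a minimal offline sketch. The HTML snippet and the helper name extract_game_ids are made up for illustration; only the pattern itself comes from the function above.

```python
import re

# Hypothetical schedule snippet; the hrefs only mimic the ID format
# "201410280SAS" mentioned in the comment of extractGameURLs().
sample_html = """
<table id="schedule">
  <a href="/boxscores/201410280SAS.html">Box Score</a>
  <a href="/boxscores/201410280LAL.html">Box Score</a>
  <a href="/boxscores/201410280SAS.html">Box Score</a>
</table>
"""

# Nine digits (date plus a game counter) followed by three word characters (team code)
GAME_ID_PATTERN = r"\d{9}\w{3}"

def extract_game_ids(html):
    """Return the unique game IDs found in the given HTML string, sorted."""
    return sorted(set(re.findall(GAME_ID_PATTERN, html)))

game_ids = extract_game_ids(sample_html)
print(game_ids)
```

The duplicated link collapses into a single entry, which is why the original function passes the matches through set() before appending them.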

Next, BeautifulSoup and other Python libraries take over to convert all this HTML into a structured data table. I won't go into every detail of the implementation; I will just present the main function, which handles each game identifier returned by the previous extractGameURLs() function and calls further functions for each specific piece of information.
[ You can find the full source code in the corresponding GitHub repository ]

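def pbpRead(pbp_URL):
    # Build the play-by-play page URL from a game ID such as "201410280SAS"
    httpString = 'https://www.basketball-reference.com/boxscores/pbp/'
    finalURL = httpString + pbp_URL + '.html'
    page = requests.get(finalURL)
    soup = BeautifulSoup(page.content, "html.parser")
    a = soup.find_all(re.compile("game_summaries"))
    # Keep only the trailing part of the URL, e.g. "boxscores/pbp/201410280SAS.html"
    URL = finalURL[-31:]
    # The helper functions below are defined in the repository
    homeTeam, awayTeam = teamNames(soup)
    gameType = gameTypeDecider(soup)
    date, time, location = locationGameTime(soup)
    winningTeam = findWinner(soup, homeTeam, awayTeam)
    # Game-level fields that accompany every play of this game
    fixedList = [URL, gameType, location, date, time, winningTeam, awayTeam, homeTeam]
    playByPlay(soup, fixedList)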

Last but not least, a for loop automates the script and calls each of the previously mentioned functions, along with a writeCsvHeader(fileName) function, so that all the plays of each season are written to a separate CSV file.

After the script ran successfully (I didn't bother with any code optimization and the volume of data was quite large, so it took roughly a day), the following files were available in our directory:
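
The loop itself is not shown here, so the following is only a rough sketch of the one-file-per-season idea, not the code from the repository. The column names, the file naming scheme, and the helpers write_csv_header, append_rows, and export_seasons are all assumptions, and the scraping step is replaced by a stand-in function so that the example runs offline.

```python
import csv
import os
import tempfile

# Assumed column layout; the real files may use different columns.
HEADER = ["URL", "Quarter", "TimeLeft", "AwayScore", "HomeScore"]

def write_csv_header(file_name):
    """Create (or overwrite) the season file and write the header row."""
    with open(file_name, "w", newline="") as f:
        csv.writer(f).writerow(HEADER)

def append_rows(file_name, rows):
    """Append play rows to an existing season file."""
    with open(file_name, "a", newline="") as f:
        csv.writer(f).writerows(rows)

def export_seasons(seasons, fetch_rows, out_dir):
    """Write one CSV per season; fetch_rows(season) stands in for the scraping step."""
    paths = []
    for season in seasons:
        path = os.path.join(out_dir, f"NBA_{season}_pbp.csv")  # hypothetical file name
        write_csv_header(path)
        append_rows(path, fetch_rows(season))
        paths.append(path)
    return paths

def fake_fetch_rows(season):
    # Stand-in for extractGameURLs() + pbpRead(): returns one made-up play per season
    return [[f"boxscores/pbp/{season}-game1.html", 4, "11:42.0", 78, 81]]

out_dir = tempfile.mkdtemp()
paths = export_seasons([1997, 1998], fake_fetch_rows, out_dir)
print(paths)
```

Writing the header once per file and appending afterwards keeps memory usage flat, which matters when a season contains hundreds of thousands of plays.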

Pbp NBA data since 1996–1997 Season

Transforming the Data

My main area of interest in the play-by-play dataset of NBA games is comeback victories, so the data needs to be transformed accordingly.
Of course, I used the powerful Pandas library to achieve this.

Rather than going into the details of the Pandas source code (which you can also find in the project repository), I will explain the thinking behind my analysis, for which I opted for the following 3 metrics:

  • Ratio of Comeback victories from 10+ points
  • Ratio of Comeback victories from 20+ points
  • Average points on Comeback victories

[ Ratio is the number of games that meet the requirement divided by the number of games in the season. It is a safer metric than the absolute number of games, because some seasons had fewer games due to a lockout or Covid ].

I decided to extend the metrics above to detect comeback victories from a large deficit in the 4th quarter: overcoming a deficit of 20+ points that occurs during the 1st or 2nd quarter is a great athletic feat, but when a team achieves this during the 4th quarter, with less than 12' remaining on the game clock, it is extraordinary.
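
As an illustration of the ratio, here is a small Pandas sketch on made-up numbers. It assumes a per-game table with one row per game and a hypothetical column winner_max_deficit holding the largest deficit the eventual winner faced; the actual dataframe in the repository may be organized differently.

```python
import pandas as pd

# Toy data, not real results: one row per game.
games = pd.DataFrame({
    "season": ["2019-20", "2019-20", "2019-20", "2019-20", "2011-12", "2011-12"],
    "winner_max_deficit": [12, 0, 23, 4, 11, 3],
})

def comeback_ratio(games, threshold):
    """Share of each season's games won after trailing by at least `threshold` points."""
    is_comeback = games["winner_max_deficit"] >= threshold
    # The mean of a boolean column is the fraction of True values per season
    return is_comeback.groupby(games["season"]).mean()

ratio_10 = comeback_ratio(games, 10)
ratio_20 = comeback_ratio(games, 20)
print(ratio_10)
print(ratio_20)
```

Because the result is a fraction of the season's games, a shortened season with fewer games remains comparable to a full one.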

On top of the metrics, I wanted to extract the record comeback from each season as this is an interesting insight.
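
The core quantity behind all of these metrics is the largest deficit the winning team faced. The sketch below shows one way to derive it from a score progression, with an optional filter for the quarter in which the deficit occurred. The tuple layout (quarter, away score, home score) and the function name are assumptions for illustration, and the numbers are invented.

```python
def max_deficit_overcome(plays, winner, quarter=None):
    """Largest deficit the winner faced; 0 if the winner never trailed.

    plays   -- list of (quarter, away_score, home_score) tuples in game order
    winner  -- "away" or "home"
    quarter -- if given, only deficits occurring in that quarter are considered
    """
    worst = 0
    for q, away, home in plays:
        if quarter is not None and q != quarter:
            continue
        margin = (away - home) if winner == "away" else (home - away)
        worst = max(worst, -margin)  # a negative margin means the winner is trailing
    return worst

# Invented game: the home team trails by 20 in the 2nd quarter and still wins.
plays = [(1, 10, 2), (2, 40, 20), (3, 70, 52), (4, 80, 65), (4, 90, 92)]

whole_game = max_deficit_overcome(plays, "home")
fourth_quarter = max_deficit_overcome(plays, "home", quarter=4)
print(whole_game, fourth_quarter)

# The season record is then simply the maximum over all games of the season.
season_record = max(max_deficit_overcome(p, w) for p, w in [(plays, "home")])
```

A game counts toward the 10+ or 20+ metric when this value reaches the threshold, and toward the 4th-quarter variant when the filtered value does.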

The final version of the data that I use for the visualization consists of 2 dataframes (one for each set of metrics), which are created in this piece of code:

NbaData = {'Comeback 10Points+': plus10PointsRatio,
           'Comeback 20Points+': plus20PointsRatio,
           'Average Comeback': avgComeback,
           'YearRecord': yearlyComeback}
if quarter == 4:
    NbaSeasonQ4Df = pd.DataFrame(data=NbaData, index=seasonList)
    return NbaSeasonQ4Df
else:
    NbaSeasonDf = pd.DataFrame(data=NbaData, index=seasonList)
    return NbaSeasonDf

...and they look something like this:

Dataframe from NBA Comeback Victories metrics
Dataframe with NBA Comeback Victories Data

Loading & Visualizing Data

Now for the part that is most interesting to NBA fans and stat lovers: displaying our findings!
I decided to use the free version of Visme to visualize my data. It is a tool with limited data manipulation options, but I chose it for its design aesthetics, and because my dataset is not dynamic and I saw this as a one-off analysis:

Visualization 1 :

The first viz confirms that comeback victories are more common these days: not only is the trend increasing, but there is also a significant jump in the ratio of 10+ point comeback victories since the 2016–2017 season, when the ratio surpassed 25% for the first time, meaning that one in four games was won by a team that had trailed by 10 or more points!

10+ Points Comeback Victories Ratio

Visualization 2 :

The metric of 20+ point comeback victories tells the same story: in 3 consecutive seasons during 2017–2020 the ratio surpasses 2% for the first time. In absolute terms, the number of games with a 20+ point comeback victory in these seasons was 3 to 4 times the number in seasons of the late 90s and early 00s.

20+ Points Comeback Victories Ratio

Visualization 3 :

10+ point comeback victories in the 4th quarter confirm the trend. After multiple seasons with a ratio around 3% in the past, during the last few seasons this percentage has been well above 4% (although it has never surpassed the 5% threshold).

Visualization 4 :

The following is a tribute to the teams that achieved the almost impossible: a summary of the greatest comeback victories per NBA season since 96–97.

Greatest comebacks by NBA Season

Visualization 5 :

Finally, an overview of the greatest 4th-quarter comeback victories per season. Some of the entries in this table are simply amazing, with the Dallas Mavericks 103 @ Los Angeles Lakers 105 game being the most outstanding. The team from L.A. made history on December 6, 2002, when they were down by 27 in the 4th quarter and astonishingly came back to win the game.

More insights on NBA games soon.

Thanks for reading!
