Born to Assist: NBA Players Spreading the Love on their Birthdays

Analytics Exploration of NBA Performance on Players’ Birthdays

Charlie Samuels
13 min read · Aug 28, 2023

Growing up, playing on my middle school basketball team with NBA aspirations but without the physical prowess or skill set to justify them, I always loved playing on my birthday. Perhaps it was the presence of my parents cheering me on or maybe it was just the surge of confidence that comes from being the center of attention, if only for a day. Whatever the reason, some of my most memorable performances happened on those special days.

Throughout the years, researchers and sports enthusiasts alike have debated the impact of external, qualitative factors on the performance of professional athletes, and the differences between meaningful and trivial influences. While some elements, like home field advantage, seem noncontroversial in their psychological influence, others — famously the “hot hand” phenomenon, for example — remain the subjects of heated debate (pun intended).

I’ve often wondered if the enhanced performances I experienced (or just perceived?) on my birthday reflected a broader trend in professional sports. Put differently, does a personal connection to a particular day — even if it has no link to the sport itself — affect elite athletes who are trained to stay focused and filter out potential distractions just as it did for me? Or are they “immune,” so to speak, from such influences?

To take a crack at exploring this question, I’ve analyzed the performances of NBA players from the 2022–2023 season, investigating how each of the season’s 539 players performed on their respective birthdays (should they have had a game that day). What follows is a sketch of the Python and R steps used both to scrape the relevant data from basketball-reference.com (disclaimer: always web scrape responsibly and in line with sites’ robots.txt guidelines, and huge thanks to basketball-reference.com for being an invaluable resource!) and to analyze it for birthday trends. I used Python for the scraping because of its more robust scraping packages but R for the analysis because I’m partial to RStudio’s visual interface. I hope you enjoy this write-up and find it interesting and/or educational; I hope to look back in a year and only notice all of the things I should have done differently!

Scraping Data with Python

For this project, I had two distinct data needs. First, I needed to build a dataset of every player stat line from the 2022–2023 NBA season: given that each of the 30 teams plays 82 games and tends to use roughly 10–11 players a game (82 × 30 × 10.5 ≈ 25,800), this dataframe would ultimately be over 25,000 observations long. Second, I needed to build a dictionary of sorts to map players to their birthdays. Ultimately, I sought to merge these two datasets and investigate both the frequency and quality of “birthday games”: that is, games played by players on their respective birthdays.

To begin, I imported the following basic Python libraries, with requests and BeautifulSoup being the most critical for this task:

Building Database of Individual Statlines

The data I wanted here spanned many pages on basketball-reference.com. Specifically, basketball-reference.com organizes their game logs by month, with individual “box score” links to each game played that month.

A glimpse of https://www.basketball-reference.com/leagues/NBA_2023_games-october.html

As you can see, each month’s sub-link on the top of the page showcases that month’s games, and each game has a link that navigates to a site that looks something like this:

Box score page from October 18, 2022 — the first game of the season, played by the Sixers and the Celtics (https://www.basketball-reference.com/boxscores/202210180BOS.html)

Each box score page houses both basic and advanced stats for each player on each team. To assemble the total dataset of stat lines from the entire season, I needed to iterate through each month’s list of games and grab each team’s basic stats box score. Attached are the four functions I used to do this:

Using these functions, running gather_all_stats(get_all_box_score_links()) returned a dataframe of nearly 32,000 observations (as will be shown later, 6,000 of these corresponded with players who failed to log time on a specific day but were nonetheless included in their team’s box score, leaving about 26,000 meaningful observations — as expected).
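For readers following along without the embedded functions, here is a minimal sketch of just the link-gathering step. It is not the code used for this article: that relied on requests and BeautifulSoup, whereas this version uses only the standard library and parses HTML that has already been downloaded. The names `BoxScoreLinkParser` and `extract_box_score_links` are hypothetical, and the URL pattern is inferred from the example box score link shown above.

```python
import re
from html.parser import HTMLParser

BASE_URL = "https://www.basketball-reference.com"
# Assumed pattern, inferred from /boxscores/202210180BOS.html:
# nine digits (the date plus one more digit) and a three-letter team code.
BOX_SCORE_PATH = re.compile(r"^/boxscores/\d{9}[A-Z]{3}\.html$")


class BoxScoreLinkParser(HTMLParser):
    """Collects hrefs of <a> tags that look like box score pages."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and BOX_SCORE_PATH.match(href):
            self.links.append(BASE_URL + href)


def extract_box_score_links(month_page_html):
    """Return absolute box score URLs found in one month's schedule page."""
    parser = BoxScoreLinkParser()
    parser.feed(month_page_html)
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(parser.links))


# Hypothetical, heavily simplified stand-in for a monthly schedule page
sample_html = """
<table>
  <tr><td><a href="/boxscores/202210180BOS.html">Box Score</a></td></tr>
  <tr><td><a href="/teams/BOS/2023.html">Boston Celtics</a></td></tr>
  <tr><td><a href="/boxscores/">Box Scores</a></td></tr>
</table>
"""
print(extract_box_score_links(sample_html))
```

Running the same function over each month's page and concatenating the results would yield the full list of box score pages to visit.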

Building Birthday Mapping

To get the desired pairings of players and their birthdays, I needed to make use of basketball-reference.com’s “2022–23 NBA Player Stats: Per Game” page, on which the stats of each of the 539 players that played that season are displayed. Conveniently, each player’s name is hyperlinked to his respective player page, which, among other interesting biographical details, includes each player’s birthday.

Left: the list of last season’s 539 players on https://www.basketball-reference.com/leagues/NBA_2023_per_game.html (duplicates are the result of team changes); Right: an example player page for Precious Achiuwa, whose birthday was prior to the season’s start on September 19.

To assemble the birthday dictionary, I needed to iterate through the player hyperlinks and grab each player’s birthday from his personal page. Attached are the three functions I used to do this:

Calling get_all_players_birthdays(“https://www.basketball-reference.com/leagues/NBA_2023_per_game.html”) returned the desired birthday pairs.
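As a rough illustration of the per-player step, the sketch below pulls a birthday out of a player page's HTML. It is not the code used for this article and it rests on an assumption: that the page exposes the birth date as a `data-birth="YYYY-MM-DD"` attribute on some tag. Check the live markup before relying on it. `BirthdayParser` and `extract_birthday` are hypothetical names, and the sample player is made up.

```python
from datetime import date
from html.parser import HTMLParser


class BirthdayParser(HTMLParser):
    """Finds the first tag carrying a data-birth="YYYY-MM-DD" attribute.

    The data-birth attribute is an assumption about the page markup.
    """

    def __init__(self):
        super().__init__()
        self.birthday = None

    def handle_starttag(self, tag, attrs):
        value = dict(attrs).get("data-birth")
        if self.birthday is None and value:
            self.birthday = date.fromisoformat(value)


def extract_birthday(player_page_html):
    """Return the birthday as a date, or None if no data-birth tag exists."""
    parser = BirthdayParser()
    parser.feed(player_page_html)
    return parser.birthday


# Hypothetical player page fragment, for illustration only
sample_page = """
<div id="meta">
  <h1>Example Player</h1>
  <p>Born: <span data-birth="1995-01-15">January 15, 1995</span></p>
</div>
"""
print(extract_birthday(sample_page))
```

Mapping each player name to the result of `extract_birthday` on his page gives the name-to-birthday dictionary described above.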

Finally, having built both the stats and birthdays datasets, I downloaded the data as two CSVs, to be analyzed in R.

Analyzing Data with R

I loaded both the rio and tidyverse libraries, for easy importing and more robust analysis functions, respectively (note: I had difficulty displaying my R code via GitHub’s Gist functionality as I did with the above Python code, so I’ve relied upon Medium’s default code embedding, which lacks many of the visual color cues afforded by the GitHub display method but hopefully still does the job).

library(rio)
library(tidyverse)

Importing and Investigating Birthday Data

I loaded in the data, cleaning it and separating the birthday variable into three distinct year, month, and day variables in the process:

# importing bday data
bdays <- import("Downloads/player_bdays.csv") %>%
  # splitting date into year, month, and day variables
  separate(
    col = Birthday,
    into = c("bday.year", "bday.month", "bday.day"),
    remove = FALSE
  ) %>%
  mutate(across(starts_with("bday."), as.numeric)) %>%
  rename(player = Player, bday.full = Birthday) %>%
  mutate(bday.full = as.Date(bday.full))

The resulting bdays dataframe was, as expected, 539 observations long, and looked like this:

First eight birthday observations of 539

Then, I began exploring the data: 46 players shared an exact birthday, comprising 20 birthday “pairs” and two birthday “triplets”: Juancho Hernangómez, Caleb Martin, and Cody Martin were all born on September 28, 1995 (the latter two, of course, being actual twins), while Brandon Clarke, Dejounte Murray, and Chris Silva were all born on September 19, 1996.

# creating dataframe of players who share an exact birthday with at least
# one other player
dup.bdays <-
  bdays[
    duplicated(bdays %>% select(-player)) |
      duplicated(bdays %>% select(-player), fromLast = TRUE),
  ] %>% arrange(bday.full)
# returning number of observations in said dataframe:
nrow(dup.bdays) # this returns 46

# grouping the dataframe by birthday and filtering to only birthdays that
# were celebrated by more than 2 players
dup.bdays %>%
  group_by(bday.full) %>%
  summarise(
    bday.full = unique(bday.full),
    bday.year = unique(bday.year),
    bday.month = unique(bday.month),
    bday.day = unique(bday.day),
    num_players = length(player)
  ) %>%
  filter(num_players > 2) # this returns the birthdays 9/28/95 and 9/19/96

# seeing which six players shared those two birthdays
dup.bdays %>%
  filter(bday.full == "1995-09-28" | bday.full == "1996-09-19")
# this returns the aforementioned players

Colloquially speaking, though, when people share a birthday, we emphasize the “day” part — that is, we usually ignore year and care exclusively about month and day of birth. Expanding my parameters in this way, I discovered that of the 539 players, 406 (!) of them shared a birthday with at least one other player.

# creating dataframe of players who share a birth month and day
dup.bdays.any.year <-
  bdays[
    duplicated(bdays %>% select(-c(player, bday.full, bday.year))) |
      duplicated(bdays %>% select(-c(player, bday.full, bday.year)), fromLast = TRUE),
  ]

# calculating number of players in this dataframe
nrow(dup.bdays.any.year) # this returns 406
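The two duplicate checks above differ only in what counts as "the same birthday": the full date, or just the month and day. A small Python sketch of that idea, using made-up players rather than the real data (`players_sharing` is a hypothetical helper):

```python
from collections import Counter
from datetime import date

# Hypothetical players and birthdays, for illustration only
birthdays = {
    "Player A": date(1995, 9, 28),
    "Player B": date(1995, 9, 28),
    "Player C": date(1998, 9, 28),
    "Player D": date(2000, 3, 2),
}


def players_sharing(birthdays, key):
    """Return players whose key(birthday) is shared by at least one other."""
    counts = Counter(key(b) for b in birthdays.values())
    return sorted(p for p, b in birthdays.items() if counts[key(b)] > 1)


# Same year, month, and day
exact = players_sharing(birthdays, key=lambda b: b)
# Same month and day, any year
month_day = players_sharing(birthdays, key=lambda b: (b.month, b.day))

print(exact)
print(month_day)
```

Dropping the year from the key can only grow the set of matches, which is why the count jumps from 46 to 406 in the real data.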

Finally, I calculated the number of birthdays celebrated during the season, i.e. between October 18 and April 9. Of the 539 players, 229 of them (about 42%) had in-season birthdays. This is intuitive: the NBA season spans a little under half the year, so we would expect a little under half of last season’s players to celebrate their birthdays during it.

# creating variable for bday without year
bdays$month.day.of.bday <- format(bdays$bday.full, format = "%m-%d")

# counting how many bdays celebrated in-season
nrow(bdays %>%
  filter(month.day.of.bday >= "10-18" | month.day.of.bday <= "04-09"))
# this returns 229
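Because the season wraps around the new year, the in-season test is an "or" rather than a "between": a birthday counts if it falls on or after October 18, or on or before April 9. The R version works on zero-padded "MM-DD" strings, which sort the same way the dates do. A minimal Python sketch of the same check using (month, day) tuples (`is_in_season` is a hypothetical name):

```python
from datetime import date

SEASON_START = (10, 18)  # October 18
SEASON_END = (4, 9)      # April 9


def is_in_season(birthday):
    """True if the birthday's month/day falls between Oct 18 and Apr 9."""
    month_day = (birthday.month, birthday.day)
    # The season spans the new year, so the two conditions are joined by "or"
    return month_day >= SEASON_START or month_day <= SEASON_END


print(is_in_season(date(1999, 9, 19)))   # September 19
print(is_in_season(date(1990, 12, 25)))  # December 25
```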

Importing and Investigating 22–23 Season Stats Data

I began by making use of the rio package’s seamless importing capabilities again, reading in the scraped stats data.

# reading in stats data
stats <- import("Downloads/player_stats.csv")

This dataframe looked like this:

First 16 observations of 31,513

As suggested above, the data clearly includes players who did not play at all in certain games (highlighted by the reason variable). As well, the data suffers from many instances of missing or blank data. After removing the irrelevant players and filling in 0s for missing values, I was left with a dataframe of 25,895 clean observations.

# replacing blank or NA values with 0
stats[stats == "" | is.na(stats)] <- 0

# removing all players who didn't actually play
stats <- stats %>% filter(reason == "0") %>% select(-reason)

# counting the number of stat lines remaining after this
nrow(stats) # this returns 25,895

Merging Birthday and Stats Data and Investigating Birthday Performance

I merged the two dataframes very simply, relying on the shared player variable corresponding to player name.

# merging stats and bday data on "player" variable
stats.and.bdays <- inner_join(stats, bdays, by = "player") %>%
  # converting game date variable to "Date" format
  mutate(game.full = as.Date(date, "%m/%d/%y"), date = NULL) %>%
  # splitting game date variables into year, month, and day variables
  separate(
    col = game.full,
    into = c("game.year", "game.month", "game.day"),
    remove = FALSE
  ) %>%
  mutate(
    across(c("game.year", "game.month", "game.day"), as.numeric),
    # adding date of when bday was celebrated during season
    # (i.e. whether it was celebrated in 2022 or 2023)
    # note: update() on a Date comes from lubridate, which is attached
    # by tidyverse 2.0.0+ (otherwise, load it with library(lubridate))
    season.bday.full =
      update(bday.full,
        year = ifelse(
          (bday.month == 10 & bday.day >= 18) | bday.month > 10,
          2022,
          2023
        )
      )
  )

This added nine birthday-related or date-related variables to the prior stats dataframe, putting all the important and carefully procured data into one place!

Newly-created variables in the merged dataframe
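The season.bday.full variable answers the question "on which calendar date did this player's birthday fall during the season?" The sketch below shows the same idea in Python. It is a variation, not a translation: unlike the R code, it returns None for birthdays that fall outside the season and for February 29 birthdays, since neither 2022 nor 2023 is a leap year. `season_birthday` and its default arguments are hypothetical; the dates come from the season window used above.

```python
from datetime import date


def season_birthday(birthday, start=date(2022, 10, 18), end=date(2023, 4, 9)):
    """Return the date the birthday fell on within the season, or None."""
    for year in (start.year, end.year):
        try:
            candidate = birthday.replace(year=year)
        except ValueError:
            # February 29 does not exist in a non-leap year
            continue
        if start <= candidate <= end:
            return candidate
    return None


print(season_birthday(date(1995, 12, 25)))  # falls in the 2022 half
print(season_birthday(date(1990, 2, 14)))   # falls in the 2023 half
print(season_birthday(date(1996, 9, 19)))   # out of season
```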

With the data ready to be analyzed, I checked to see how many birthday games were played last season. In other words, I wanted to know how many players of the 229 that celebrated in-season birthdays played a game on their birthday as a result of scheduling luck. The data showed that 76 birthday games were played, by players across all thirty teams (the Celtics had the largest number of these games with 5!).

# filtering stats data to only include stat lines recorded on the same day
# as the corresponding player's birthday
bday.games <- stats.and.bdays %>%
  filter(bday.month == game.month & bday.day == game.day)

# counting the number of these games
nrow(bday.games) # this returns 76

# investigating the breakdown of these 76 birthday games by team
bday.games %>% count(team) %>% arrange(desc(n))

To determine (finally!) how players performed on their birthdays, I needed to decide on baselines with which to compare birthday performances. I decided that for each major stat line (points, rebounds, assists, steals, and blocks), I’d calculate for each player both their season average and their running average (i.e. their average up until their birthday). This latter category was particularly intriguing to me because I wondered if one could realize, in the future, that a player was due to play on their birthday and exploit sportsbooks’ prop lines with my potential findings. Thus, I grouped the merged data by player and calculated each player’s two averages for each of the five categories. Notably, in calculating these averages, I also removed any players who did not celebrate game-birthdays.

player.avgs <- stats.and.bdays %>%
  group_by(player) %>%
  summarise(
    # summarizing to player level
    bday.year = unique(bday.year),
    bday.month = unique(bday.month),
    bday.day = unique(bday.day),

    # points averages and birthday tallies
    # (players with no birthday game produce a zero-length on.bday value,
    # which is how they drop out of the summary)
    avg.pts = mean(pts),
    on.bday.pts = pts[bday.day == game.day & bday.month == game.month],
    avg.before.bday.pts = mean(pts[game.full < season.bday.full]),
    bday.vs.running.pts = on.bday.pts - avg.before.bday.pts,
    bday.vs.season.pts = on.bday.pts - avg.pts,

    # assists averages and birthday tallies
    avg.ast = mean(ast),
    on.bday.ast = ast[bday.day == game.day & bday.month == game.month],
    avg.before.bday.ast = mean(ast[game.full < season.bday.full]),
    bday.vs.running.ast = on.bday.ast - avg.before.bday.ast,
    bday.vs.season.ast = on.bday.ast - avg.ast,

    # rebounds averages and birthday tallies
    avg.trb = mean(trb),
    on.bday.trb = trb[bday.day == game.day & bday.month == game.month],
    avg.before.bday.trb = mean(trb[game.full < season.bday.full]),
    bday.vs.running.trb = on.bday.trb - avg.before.bday.trb,
    bday.vs.season.trb = on.bday.trb - avg.trb,

    # steals averages and birthday tallies
    avg.stl = mean(stl),
    on.bday.stl = stl[bday.day == game.day & bday.month == game.month],
    avg.before.bday.stl = mean(stl[game.full < season.bday.full]),
    bday.vs.running.stl = on.bday.stl - avg.before.bday.stl,
    bday.vs.season.stl = on.bday.stl - avg.stl,

    # blocks averages and birthday tallies
    avg.blk = mean(blk),
    on.bday.blk = blk[bday.day == game.day & bday.month == game.month],
    avg.before.bday.blk = mean(blk[game.full < season.bday.full]),
    bday.vs.running.blk = on.bday.blk - avg.before.bday.blk,
    bday.vs.season.blk = on.bday.blk - avg.blk
  )

The resulting dataframe is too wide to cleanly display here, so below is a glimpse of some of the many new variables calculated:

The first several new birthday comparison variables
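To make the two baselines concrete, here is a small Python sketch for a single player and a single stat, using made-up numbers. The running average covers only games strictly before the birthday game, while the season average covers every game, the birthday game included (as in the R code above). `birthday_comparison` is a hypothetical helper, not part of the original analysis.

```python
from datetime import date
from statistics import mean


def birthday_comparison(games, birthday_game_date):
    """Compare a birthday-game stat with running and season averages.

    games: list of (game_date, stat_value) tuples for one player.
    Returns None if there is no birthday game or no earlier game.
    """
    on_bday = [value for day, value in games if day == birthday_game_date]
    before = [value for day, value in games if day < birthday_game_date]
    if not on_bday or not before:
        return None
    season_avg = mean(value for _, value in games)  # includes birthday game
    running_avg = mean(before)                      # games before birthday
    return {
        "on_bday": on_bday[0],
        "vs_running": on_bday[0] - running_avg,
        "vs_season": on_bday[0] - season_avg,
    }


# Hypothetical assist totals for one player
games = [
    (date(2022, 12, 1), 4),
    (date(2022, 12, 3), 6),
    (date(2022, 12, 5), 9),  # birthday game
    (date(2022, 12, 7), 5),
]
result = birthday_comparison(games, date(2022, 12, 5))
print(result)
```

Here the running average is 5 and the season average is 6, so the same 9-assist birthday game sits 4 above one baseline and 3 above the other.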

To compare birthday stats to averaged stats, I used paired t-tests; I wanted to compare each birthday stat line in each of the five main stat categories with each of the two created average variables, totaling 10 possible tests. However, such tests assume that the differences between each tested pair of values are approximately normally distributed. To vet these differences for normality, I created normal QQ plots for each of the 10 difference variables.

# list of new comparison vars
diff.vars <- names(player.avgs)[startsWith(names(player.avgs), "bday.vs")]

# plotting all the comparison vars to vet for normality
par(mfrow = c(2, 5))
for (diff.var in diff.vars) {
  qqnorm(player.avgs[[diff.var]], main = diff.var)
  qqline(player.avgs[[diff.var]])
}

Normal QQ Plots for each of the 10 new comparison variables

Interpretations may vary here, but to my eye, only the points and assists comparisons appear normal. The others seem to diverge more prominently at the ends of the graphs, with the blocks comparisons not appearing even close to normality. Thus, I opted to conduct only four T-Tests: comparing birthday points and birthday assists to their respective running averages and season averages.

# listing stat categories to iterate thru (generalized code in case
# need to add more categories later on)
stat.cats <- c("pts", "ast")
results <- list()

for (stat in stat.cats) {
  # on.bday vs avg.before.bday
  test1 <-
    t.test(
      player.avgs[[paste0("on.bday.", stat)]],
      player.avgs[[paste0("avg.before.bday.", stat)]],
      paired = TRUE
    )
  results[[paste("on.bday", stat, "vs avg.before.bday", stat)]] <- test1

  # on.bday vs avg
  test2 <-
    t.test(
      player.avgs[[paste0("on.bday.", stat)]],
      player.avgs[[paste0("avg.", stat)]],
      paired = TRUE
    )
  results[[paste("on.bday", stat, "vs avg", stat)]] <- test2
}

# Print results
results

The following are the results printed to the console:

$`on.bday pts vs avg.before.bday pts`

Paired t-test

data: player.avgs[[paste0("on.bday.", stat)]] and player.avgs[[paste0("avg.before.bday.", stat)]]
t = 0.89615, df = 75, p-value = 0.373
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-0.7858421 2.0710063
sample estimates:
mean difference
0.6425821


$`on.bday pts vs avg pts`

Paired t-test

data: player.avgs[[paste0("on.bday.", stat)]] and player.avgs[[paste0("avg.", stat)]]
t = 0.4983, df = 75, p-value = 0.6197
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-1.076084 1.793999
sample estimates:
mean difference
0.3589577


$`on.bday ast vs avg.before.bday ast`

Paired t-test

data: player.avgs[[paste0("on.bday.", stat)]] and player.avgs[[paste0("avg.before.bday.", stat)]]
t = 2.5147, df = 75, p-value = 0.01405
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
0.1092889 0.9424413
sample estimates:
mean difference
0.5258651


$`on.bday ast vs avg ast`

Paired t-test

data: player.avgs[[paste0("on.bday.", stat)]] and player.avgs[[paste0("avg.", stat)]]
t = 1.8542, df = 75, p-value = 0.06764
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-0.02732554 0.76238942
sample estimates:
mean difference
0.3675319
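For reference, the t statistic in each block above is the mean of the per-player differences divided by its standard error, with degrees of freedom equal to the number of pairs minus one. A minimal standard-library sketch with made-up numbers (not the article's data; `paired_t` is a hypothetical helper, and it stops short of the p-value because the standard library has no t distribution):

```python
import math
from statistics import mean, stdev


def paired_t(x, y):
    """Return (t statistic, degrees of freedom, mean difference)."""
    if len(x) != len(y):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_diff = mean(diffs)
    # stdev() is the sample standard deviation (n - 1 in the denominator)
    std_err = stdev(diffs) / math.sqrt(n)
    return mean_diff / std_err, n - 1, mean_diff


# Hypothetical birthday values and baseline averages for four players
on_birthday = [5, 7, 3, 6]
baseline = [4, 5, 3, 4]
t_stat, dof, mean_diff = paired_t(on_birthday, baseline)
print(round(t_stat, 4), dof, mean_diff)
```

The same relationship can be read off the printed output: for assists against the running average, 0.5258651 / 2.5147 implies a standard error of roughly 0.21, and df = 75 corresponds to the 76 birthday games.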

According to my analysis and these t-tests, there seems to have been a statistically significant discrepancy between players’ birthday assist tallies and their running averages: the p-value of 0.014 falls below the conventional 0.05 threshold, and the players in this sample averaged roughly 0.5 more assists on their birthdays than they did in games played prior to their birthdays (95% confidence interval: about 0.11 to 0.94). The comparison against full-season assist averages (p = 0.068) and both points comparisons did not reach that threshold. While 0.5 may not seem like a lot of assists, it becomes more meaningful when considering that players in general record very few assists a game: fewer than 40 players averaged more than 5 assists a game last year.

This observed increase in assists on players’ birthdays certainly seems noteworthy. Perhaps the added excitement or drive to perform on their special day enhances players’ ability to facilitate on the court. Perhaps the teammates of these celebrating players might also be more driven to score off the birthday-players’ passes. There is a lot of room to speculate about the interpretation and cause of this unique finding, a point worth exploring in future analyses.

I would be remiss not to caveat my results by considering the Bonferroni correction. The correction proposes that when performing multiple statistical tests, the significance threshold should be tightened based on the number of tests, given that the likelihood of observing a spurious significant result increases with more tests. Using a common threshold of 0.05 and dividing by the four tests performed, the adjusted threshold becomes 0.0125. This is marginally lower than the p-value of 0.014 I observed, so the assists result narrowly misses significance under the correction. Were I to be more stringent still and account for the six other t-tests I originally considered conducting, the threshold would drop to a strict 0.005, which the assists result falls well short of.
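The arithmetic behind those thresholds is short enough to spell out. The sketch below applies it to the four p-values printed earlier; the helper names `bonferroni_threshold` and `significant` are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


def significant(p_values, threshold):
    """Names of the tests whose p-value falls below the threshold."""
    return [name for name, p in p_values.items() if p < threshold]


# p-values copied from the t-test output above
p_values = {
    "pts vs running avg": 0.373,
    "pts vs season avg": 0.6197,
    "ast vs running avg": 0.01405,
    "ast vs season avg": 0.06764,
}

for n_tests in (1, 4, 10):
    threshold = bonferroni_threshold(0.05, n_tests)
    print(n_tests, round(threshold, 4), significant(p_values, threshold))
```

With a single test the assists-versus-running-average result clears the 0.05 bar; with four or ten tests it does not.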

Although the Bonferroni Correction tempers the robustness of my findings, it’s worth mentioning that other less conservative correction methods exist. Thus, all things considered, the anomaly in players’ birthday assists when compared to their ongoing averages remains a fascinating observation.
