Predictive Modelling NBA Games using Excel and Python 🏀

Published in

Football Applied

29 min readJul 1, 2024

In my attempt to create an accurate prediction model of NBA games and playoff results, I used a player-based stats simulation model, overall team-rating based simulation model, and an ELO rating model via Python. Basketball’s smaller team size enables for the effective inclusion of individual player statistics. Furthermore, the binary structure of basketball outcomes — wins or losses without ties — allows for the use of various regression techniques such as OLS, logit, and probit without accounting for drawn games. Our interest in the various dynamics of the regular season and playoffs in basketball influenced our decision, raising concerns regarding the accuracy of our forecasting model and potential variations between regular season and postseason performance.

Raw Player Stats = Contains all relevant player stats for all teams
Sheet3 = Adjusted Defensive Ratings
Sheet4 = Simulates the expected performance metrics for players part of a 10-man rotation for a given team (eg: XLOOKUP + random simulate points, assists, rebounds, steals, blocks, etc...) for each player
Simulations = Runs 1000 simulations to give ample sample size to determine who would win more then the 2 teams
Play Working = Fetches the relevant players for a given team by minutes played and position

· The player-based simulation model
∘ Playoff permutations
∘ Future extensions of the model
· Team-based simulation model
∘ Play-in Tournament
∘ Setting up the Play-offs model
∘ My Results
∘ Conference Semi-Final Matchups
∘ Conference Finals Matchups
∘ Western Conference Champions: Oklahoma City Thunder (OKC wins series 4–2)
∘ Eastern Conference Champions: Boston Celtics (Boston Celtics wins series 4–2)
∘ NBA Champions: Boston Celtics
· ELO Ratings Model using Python
· Augmenting the ELO Model based on different k-factors

The player-based simulation model

Creating a player-based simulation model using Excel isn’t necessarily difficult, but rather requires plenty of attention, and can be quite time-consuming. This simulation model is better used for teams that are more superstar-centric such as the LA Lakers or the Phoenix Suns, and has very high usage rates for their star players.

Fetch all player stats from Basketball Reference and download stats as CSV file.
From the ‘Raw Player Stats’ sheet where you’ve pasted all the player stats from Basketball Reference, fetch the unique team names and create a list. (=UNIQUE(‘Raw Player Stats’!$E$2:$E$573))
From the Unique teams list, create a list call “teams”. This step is purely optional but better to do as it saves time.
In a new sheet, in my document is was Sheet4, fetch the Home and Away Teams based on a list. This allows you to change between teams for the model, rather then being limited to a select few. This step is extremely important as without the list, you won’t be able to run yield the players and stats based on a given team. You can see in the image below of how it should look like.

5. In a new sheet, in my document it was Play Working, we can take the home team and away team based on the teams we select from the list in Sheet4. So in my case, PlayWorking!A5=Sheet4!E2 for home team, and PlayWorking!A32=Sheet4!E2

6. Now this part is the trickiest as it requires the most complicated formula in this whole model. Essentially we need to fetch all the players of a given team we’ve selected from the list, either for home team or away team. In the example above, given the Home Team is Boston Celtics and the Away team is Dallas Mavericks, we would need to fetch all the players in the roster for both teams:

=IFERROR(INDEX('Raw Player Stats'!$B$2:$B$573, SMALL(IF('Raw Player Stats'!$E$2:$E$573 = XLOOKUP(Sheet4!$E$2, teams, team_id), ROW('Raw Player Stats'!$E$2:$E$573) - ROW('Raw Player Stats'!$E$2) + 1), ROWS($A$1:A1))), "")

'Raw Player Stats'!$B$2:$B$573= Looks at all players
'Raw Player Stats'!$E$2:$E$573 = Looks at the team they play for
'Raw Player Stats'!$E$2:$E$573 = XLOOKUP(Sheet4!$E$2, teams, team_id = Based on the team they play for, use XLOOKUP to fetch all the players that play for the team based on the list teams
ROW('Raw Player Stats'!$E$2:$E$573) - ROW('Raw Player Stats'!$E$2) + 1), ROWS($A$1:A1))), "") = The data is in row format and so will return the data in rows accordingly

Adjust the formula accordingly for your dataset

After writing the formula, it should return a list of the players, likely to be sorted from A-Z in terms of surname:

7. Once that has been sorted, you can run a simple XLOOKUP function to fetch the minutes per game each player has played. This is for later on where we sort the players that play in terms of minutes played. We also run a XLOOKUP function to fetch the positions that these players play. This is to ensure that beyond looking at who played the most minutes, some positions are rotated more then others, hence why we need to sort by position as well. Here’s what it should look like:

Minutes Played: =IFERROR(XLOOKUP(A5, 'Raw Player Stats'!$B$2:$B$573, 'Raw Player Stats'!$H$2:$H$573),"")
No. of Players: =COUNTIF(B5:B30,">=0")
Fetch Position: =IFERROR(XLOOKUP(A5,'Raw Player Stats'!$B$2:$B$573,'Raw Player Stats'!$C$2:$C$573),"")
Sorting in terms of Minutes Played: =SORT(A5:OFFSET(A5,$C$5-1,1),{2},-1)
Fetching the position based on the player: =IF(XLOOKUP(I5,'Raw Player Stats'!$B$2:$B$573,'Raw Player Stats'!$C$2:$C$573)=0,"",(XLOOKUP(I5,'Raw Player Stats'!$B$2:$B$573,'Raw Player Stats'!$C$2:$C$573)))

Now rinse and repeat the same process for the away team. The formulas should be the same, so the only adjustment should be the cell references when sorting, and fetching players based on team/team_id.

8. For the next part, we go back to Sheet4!, and now just fetch the specific stats for each player. The things we need to do for this part is:

Fetch the 10–12 man rotation in terms of minutes played and position in Sheet4!
Ensure the right players are being played in the right position. Ie: each team’s starting 5 ideally should have a PG, SG, PF, C, SF (Point Guard, Shooting Guard, Power Forward, Center, Small Forward)
Fetch all performance related metrics:

Assume A12= Position name; eg: A12=PG
B12: Player= =XLOOKUP(A12,'Play Working'!$K$5:$K$30, 'Play Working'!$I$5:$I$30)
C12: MP(Minutes Played)=XLOOKUP(B12, 'Raw Player Stats'!$B$2:$B$573, 'Raw Player Stats'!$H$2:$H$573)
D12: xMP (Expected Minutes Played)=IF(M12<=5, C12,C12/SUM('Play Working'!$J$10:$J$30)*(240-SUM('Play Working'!$J$5:$J$9)))
E12: FGA+FTA(Field Goals Attemped + Free Throws Attempted): =IF(D12>0, XLOOKUP(B12,'Raw Player Stats'!$B$2:$B$573, 'Raw Player Stats'!$J$2:$J$573)+XLOOKUP(B12, 'Raw Player Stats'!$B$2:$B$573, 'Raw Player Stats'!$T$2:$T$573),0)
F12: xFGA+FTA(Expected Field Goals Attemped + Free Throws Attempted): =((0.0182*(D12)^2-0.1186*(D12)+2.3528)/(0.0182*(C12)^2-0.1186*(C12)+2.3528))*E12
G12: 2PA Poss(What % of shots are 2P's): =XLOOKUP(B12, 'Raw Player Stats'!$B$2:$B$573, 'Raw Player Stats'!$P$2:$P$573) / (XLOOKUP(B12, 'Raw Player Stats'!$B$2:$B$573, 'Raw Player Stats'!$P$2:$P$573) + XLOOKUP(B12, 'Raw Player Stats'!$B$2:$B$573, 'Raw Player Stats'!$M$2:$M$573) + XLOOKUP(B12, 'Raw Player Stats'!$B$2:$B$573, 'Raw Player Stats'!$T$2:$T$573))
H12: 2P FG%: =XLOOKUP(B12,'Raw Player Stats'!$B$2:$B$573, 'Raw Player Stats'!$Q$2:$Q$573)
I12: 3PA Poss%: =1-G12-K12 (Could also use XLOOKUP for this)
J12: 3P%: =XLOOKUP(B12,'Raw Player Stats'!$B$2:$B$573, 'Raw Player Stats'!$N$2:$N$573)
K12: FT Poss = = XLOOKUP(B12, 'Raw Player Stats'!$B$2:$B$573, 'Raw Player Stats'!$T$2:$T$573) /
  (XLOOKUP(B12, 'Raw Player Stats'!$B$2:$B$573, 'Raw Player Stats'!$P$2:$P$573) +
   XLOOKUP(B12, 'Raw Player Stats'!$B$2:$B$573, 'Raw Player Stats'!$M$2:$M$573) +
   XLOOKUP(B12, 'Raw Player Stats'!$B$2:$B$573, 'Raw Player Stats'!$T$2:$T$573))
L12: FT% : =XLOOKUP(B12,'Raw Player Stats'!$B$2:$B$573, 'Raw Player Stats'!$U$2:$U$573)
M12: Rank (in terms of MPG): =RANK.EQ(XLOOKUP($B12,'Play Working'!$I$5:$I$30,'Play Working'!$J$5:$J$30),'Play Working'!$J$5:$J$30)
N12: Simulated Points =IF(D12>0, IFERROR(2*BINOM.INV(BINOM.INV(F12,G12,RAND()),H12,RAND())+3*BINOM.INV(BINOM.INV(F12,I12,RAND()),J12,RAND())+BINOM.INV(BINOM.INV(F12,K12,RAND()),L12,RAND()),ROUND((F12/E12)*XLOOKUP(B12,'Raw Player Stats'!$B$2:$B$573,'Raw Player Stats'!$AD$2:$AD$573),0)),0)

The provided Excel formulas are part of a comprehensive model designed to estimate a player’s performance in basketball based on their position, historical stats, and various in-game metrics. Starting with `A12`, which holds the position name (e.g., PG for point guard), `B12` uses the `XLOOKUP` function to fetch the player’s name from the ‘Play Working’ sheet based on the position. `C12` retrieves the minutes played (`MP`) from the ‘Raw Player Stats’ sheet using `XLOOKUP`.

In cell `D12`, the formula calculates the expected minutes played (`xMP`). If the value in `M12` is less than or equal to 5, `xMP` is set to the actual minutes played (`C12`). Otherwise, it adjusts `C12` based on the ratio of the sum of specific player minutes to the total possible minutes (240), excluding certain periods. This adjustment helps to normalize the minutes played for each player.

Cell `E12` computes the total number of field goals attempted plus free throws attempted (`FGA+FTA`). It does so by summing field goals attempted and free throws attempted using `XLOOKUP` to pull the relevant stats from the ‘Raw Player Stats’ sheet. In `F12`, the expected field goals attempted plus free throws attempted (`xFGA+FTA`) is estimated using a polynomial formula that considers the player’s expected minutes and adjusts it based on a function derived from historical data.

The formula in `G12` calculates the percentage of shots taken as two-point attempts (`2PA Poss`). This is derived by dividing the number of two-point attempts by the sum of two-point attempts, three-point attempts, and free throws attempted, all fetched via `XLOOKUP`. In `H12`, the 2-point field goal percentage (`2P FG%`) is retrieved from the ‘Raw Player Stats’ sheet. `I12` computes the percentage of shots taken as three-pointers (`3PA Poss%`) by subtracting the two-point percentage and free throw percentage from 1.

The three-point percentage (`3P%`) is fetched in `J12` using `XLOOKUP`, while `K12` calculates the free throw percentage (`FT Poss`) as the ratio of free throw attempts to the sum of two-point, three-point, and free throw attempts. The free throw percentage (`FT%`) is then retrieved in `L12` using `XLOOKUP`.

The rank of the player in terms of minutes per game (`Rank in MPG`) is computed in `M12` using the `RANK.EQ` function on the player’s minutes compared to others, fetched from the ‘Play Working’ sheet. Finally, `N12` calculates the simulated points using a combination of binomial distribution functions and error handling. It simulates the number of successful 2-point, 3-point, and free throw attempts and sums them up, adjusting for the player’s actual field goal attempts and points per game. If the binomial simulation fails, the formula defaults to a fallback calculation that uses the ratio of expected to actual field goals and the player’s points per game from the ‘Raw Player Stats’ sheet.

Repeat the process for all the other players in the roster list. It should look something like this:

At this stage, you should repeat the same process for the away team, and adjust the reference cells to yield the correct results.

9. In a new sheet called Simulations, first fetch the initial score from Sheet4!. In other words, take the sum of the simulated points for both teams and list them in their respective cells in the Simulations worksheet. It should look something like this:

=SUM(Sheet4!N12:N25)

Following that, to run a Monte-Carlo style simulation, you go to Data -> What-if Analysis -> Data Table, and put the relevant row and column data. Since in this case I ran 1000 trials, I dragged down my reference cells all the way to 1001. Your spreadhseet should look something like this:

Now going back to Sheet4!, to calculate the points scored by both teams, we simply take the average of the points scored by a given team throughout the 1000 trials: =ROUND(AVERAGE(Simulations!$B$3:$B$1002),0)

Given basketball is a zero-sum game and we don’t need to worry about draws/ties, to calculate the probability of a given team winning based on the simulations, we simply tally up the amount of times a give team scored more points then the other team. It should look like this: =(COUNTIF(Simulations!D3:D1002,”Win”)/1000)+(0.5*COUNTIF(Simulations!D3:D1002,”Tie”)/1000)

Playoff permutations

At this point, the model essentially tells you who’s more likely to win between 2 teams based on player performance metrics for a one-off game. However, we can extend this to the case of a playoff series. If we assumed that the home team and away team are playing in a playoff series, we can simply calculate the probability of either the home team or away team winning the playoff series based on who’s more likely to first win 4 games. Furthermore, given that in basketball, home court advantage historically does not have much of a say given most series don’t make it to a game 7, the ordering shouldn’t necessarily matter.

However, before we continue with determinign the probabilities, we first need to understand conditional probabilities and combinations. Conditional probability is known as the possibility of an event or outcome happening, based on the existence of a previous event or outcome. It is calculated by multiplying the probability of the preceding event by the renewed probability of the succeeding, or conditional, event.

For example, only 1 team in the NBA has come back from a 3–1 deficit to win the NBA finals, however, no team in the NBA has ever made a comeback from 3–0 down to win 4–3 in any playoff series. Thus, historically speaking, it’s extremely crucial for a team that’s 2–0 down to win game 3 as that determines how far that series can go.

In terms of how we apply the relevant formulas to predicting the probabilities in the playoff series, here’s an example 🏀. Heading into game 5, the Boston Celtics were 3–1 up against the Dallas Mavericks in the 2024 NBA Finals, however, there’s multiple different combinations into how the score could be 3–1. For example, Boston could’ve won the first 3 and lost game 4, won the first 2, lost 3, and and 4, and many other combinations. In fact, in total there’s 4 different ways as to how the series could be 3–1 to Boston . So, assuming that we’re trying to calculate the probability of the finals being 3–1 after 4 games:

O9: =(COUNTIF(Simulations!D3:D1002,"Win")/1000)+(0.5*COUNTIF(Simulations!D3:D1002,"Tie")/1000)
O27: =(COUNTIF(Simulations!D3:D1002,"Loss")/1000)+(0.5*COUNTIF(Simulations!D3:D1002,"Tie")/1000)
P(3-1 to Team1)==O9*O9*O9*O27*4

We would then repeat a similar thought process and extend it to all the other different combinations as to how a playoff series can go. However, it’s worth noting that we don’t need to multiply the conditional probabilities by the # of combinations for single possibility events. In other words, there’s only one way for a give team to be 1–0, 2–0, 3–0, or complete a clean sweep, so for those possibilities and those only, we would just need to multiple the probability of Team1 winning n amount of games.

After working and finding all the possibilities, the table should look something like this:

Green = Home Team wins series

Dark Orange = Tied series

Blue= Away Team win series

So based on this recent 2024 NBA Finals, the model predicted that the most probably outcome would be Boston Celtics winning it in 6, at 19.0%. However, the actual outcome, which was Celtics winning it in 5, wasn’t far off with a probability of 17.2% occurring. At the bottom of the table, the probability of the home team or away team winning is simply just taking the sum of the relevant cells.

Basketball game outcomes can be predicted using player-based performance measures, which provides a thorough understanding of individual player contributions and enables a sophisticated analysis of matchups and player interactions. This granularity improves the model’s accuracy in simulating different scenarios by helping to comprehend how individual players affect game dynamics. But depending just on player analytics can leave out many important aspects of the team dynamics, such coaching techniques and teamwork, which can also have a big impact on how games turn out. Furthermore, since precise and comprehensive player data is necessary for trustworthy predictions, maintaining complexity and assuring data quality present formidable obstacles. It is imperative to strike a balance between individual player performance and more comprehensive team elements when creating reliable and effective simulation models for basketball prediction.

Future extensions of the model

Going forward, I’ll be hoping to extend the accuracy of the model by including advanced player performance-based metrics, such as USG% (usage rate), WS (win share %), and VORP (Value-Over-Replacement-Player), which essentially measures the marginal utility of the individual players.

Team-based simulation model

On the contrary to the player-based model which takes the player’s performance metrics, the team-based simulation model predicts on the overall team’s performance metrics. This model would likely yield more accurate results for teams that run deeper rotations, or have a system in place that isn’t iso-based or star based. For example, for teams such as the Miami Heat and the Indiana Pacers, their style of play isn’t accurately reflected by the player’s box scores, and given they don’t have a primary scorer per se, such measures like offensive rating and defensive rating is a more accurate reflection of the team.

Here’s how we set up and run the model:

Fetch all team-related metrics from basketball reference for all teams, specifically adjusted/non-adjusted ratings. To generate the attacking rating, you simply calculate the average of the ratings, and divide the given team’s rating by the average rating. For example, given the Milwaukee Bucks had an adjusted offensive reating of 119 and the average is 116.185, their att_rating (attacking rating) is 1.065. Repeat the same process to generate defensive ratings.

2. In a new sheet, list all the NBA regular season fixtures throughout the entire season. For this model, let’s assume that we’re forecasting for the 2024–25 NBA season, assuming rosters/rotations have remain unchanged, and the fixtures remain the same. You should just have the home team and away team, as well as 2 empty columns to print the simulated results.

3. Once the sheet has been properly formatted, you can plug in this formula:

=BINOM.INV(10000,(XLOOKUP(C2,teams, att_rating)*XLOOKUP(E2,teams,def_rating)*Sheet1!$W$34)/10000, RAND())

The formula `=BINOM.INV(10000,(XLOOKUP(C2,teams, att_rating)*XLOOKUP(E2,teams,def_rating)*Sheet1!$W$34)/10000, RAND())` in Excel serves to simulate a binomial distribution, commonly used in statistical modeling to predict outcomes where there are two possible results per trial. Here’s how it works: the function `BINOM.INV` calculates the inverse of the cumulative binomial distribution for 10,000 trials (`10000`). Inside the function, `XLOOKUP(C2, teams, att_rating)` and `XLOOKUP(E2, teams, def_rating)` fetch offensive and defensive ratings from a structured data range (`teams`), while `Sheet1!$W$34` is a constant. These ratings are multiplied and divided by 10,000 to adjust for probabilities, ultimately generating a probabilistic outcome using `RAND()` for randomness.

4. On the same sheet or a separate sheet, list down the teams, and collect their win/loss tallies based on regular season results:

Wins: =SUMPRODUCT(((home_team=O3)*(home_scores>away_scores))+((away_team=O3)*(away_scores>home_scores)))
Losses: =N3-Q3
N3: Games Played= 82

At this point, it should look like this:

5. At this point, you can select the whole table, and sort it by win percentage (PCT) and Conference in order to get accurate seeding. :

It’s worth noting that given the results are simulated, your results should change depending on the # of trials you’ve run.

Play-in Tournament

In the NBA Play-in Tournament, the 7th seed plays the 8th seed, and the 9th seed plays the 10th seed. The winner of the 7th vs 8th seed game will qualify for the playoffs, whilst the loser of this game will play the winner of the 9th v 10th seed game; the loser of the 9th v 10th seed game will be eliminated from play-off contention.

Explaining the context is important as it gives us an idea of how we progress with the model after we collected the regular season data.

*Adjust your reference cells accordingly*

Eastern Conference:
7th vs 8th: O56 vs O57
9th vs 10th: O58 vs O59

Western Conference: 
7th vs 8th: 041 vs O42
9th vs 10th: O43 vs O44

Now after getting the teams, we need to predict the outcome and how much each team would score, of which we would simply use the same BINOM.INV formula we used for the regular season fixtures.

After getting the scores, we now need to determine which teams have gone through, are eliminated, or have to play in the eliminator.

W4= O56
X4 = O57
W5= O58
X5 = O59
Y4 = O56's score
Z4 = O57's score
Y5 = O58's score
Z5 = O59's score

*The eliminator is between the loser of the 7th vs 8th seed game and the winner of the 9th vs 10th seed game*

Thus, 
Loser of first game: =IF(Y4 < Z4, W4, X4)
Winner of second game: =IF(Y5 > Z5, W5, X5)

7th seed from tournament: =IF(Y4 > Z4, W4, X4)
8th seed for tournament: =IF(Y6 > Z6, W6, X6)

Repeat the same process for the other conference

Setting up the Play-offs model

After the play-in tournament for both conferences are complete, we now know the matchups for the first round:

1st seed vs 8th seed (Series 1)
2nd seed vs 7th seed (Series 2)
3rd seed vs 6th seed (Series 3)
4th seed vs 5th (Series 4)

Given that the top 4 seeds have home court advantage, we could simply reference top 4 teams from the regular season table (Eg: =Sheet2!O50:O53), and repeat the same process for the 5th and 6th team. For the 7th and the 8th seeds, we get that information from the winner of the play-in tournament.

Now, we repeat the same process as we did to simulate regular season results and the play-in results by running the BINOM.INV simulations (=BINOM.INV(10000,(XLOOKUP(I3,teams, att_rating)*XLOOKUP(J3,teams,def_rating)*Sheet1!$W$34)/10000, RAND()))

Since we know that the first team to win 4 games progresses to the next round, we need to set a cumulative counter that determines the amount of games both teams have won:

Once we get the series winners, we can match them up accordingly for the conference semi-finals:

Winner of Series 1 vs Winner of Series 4 (CSF1)
Winner of Series 2 vs Winner of Series 3 (CSF2)

Then repeat the same process as for the previous round, and you should be able to get your conference finals combination, which is the Winner of CSF1 vs CSF2.

After that, you should be able to get your conference finals winners, and thus your NBA Finals matchup, and then your NBA Champions.

Given that it’s a simulation, you’re results should not remain unchanged, however, here are my results from the latest simulations I ran:

My Results

Play-in Tournament Winners:

Cleveland Cavaliers (7th seed Eastern Conference)

Indiana Pacers (8th seed Eastern Conference)

Dallas Mavericks (7th seed Western Conference)

Los Angeles Lakers (7th seed Western Conference)

First- Round Matchups:

Results:

Conference Semi-Final Matchups

Results

Conference Finals Matchups

Western Conference Champions: Oklahoma City Thunder (OKC wins series 4–2)

Eastern Conference Champions: Boston Celtics (Boston Celtics wins series 4–2)

NBA Champions: Boston Celtics

Boston Celtics clean sweep the Oklahoma City Thunder and win 4–0.

All in all, whilst the first model is able to predict and simulate the outcomes of one-off games and playoff series, given the simplified nature of extracting team performance based metrics for this model, this model is the most reliable for simulating the entire NBA season, from October to June.

When considering whether to use player-based or team-based performance metrics for creating an ideal predictive model, examining usage rates becomes crucial in distinguishing between the two approaches. Player-based metrics, such as individual shooting percentages, assists per game, and defensive ratings, directly reflect how effectively each player contributes to their team’s performance. Usage rates provide insights into how involved each player is in offensive plays, highlighting their importance in scoring and playmaking roles. By focusing on usage rates, analysts can identify key players who have a significant impact on their team’s outcomes based on how frequently they handle the ball and engage in scoring opportunities.

In contrast, team-based metrics aggregate overall team statistics without emphasising individual player contributions as distinctly. While team metrics like points per game, field goal percentage, and defensive efficiency offer a broad view of collective performance, they may obscure the specific roles and influences of individual players within the team dynamics. Usage rates help in discerning which players are the primary offensive initiators, facilitators, or defensive anchors, thereby providing a more nuanced understanding of how player performances translate into team outcomes.

Integrating usage rates into a predictive model allows analysts to balance the insights from both player-based and team-based metrics effectively. By incorporating usage rates alongside individual and team statistics, analysts can better identify critical players who drive offensive strategies or defensive resilience. This comprehensive approach enhances the model’s ability to predict game outcomes accurately by capturing both the collective dynamics of team play and the impactful roles of key players. Ultimately, usage rates serve as a valuable tool in optimizing the balance between player-centric and team-centric perspectives in sports analytics, leading to more informed decision-making and strategic insights in coaching and player management.

ELO Ratings Model using Python

To understand the dynamics of the sport and how teams have performed in the NBA post-COVID, we’ve looked at the ELO ratings of all 30 NBA teams from October 2021-June 2024.

The Elo system was invented as an improved chess-rating system over the previously used Harkness system,[1] but is also used as a rating system in association football, American football, baseball, basketball, pool, various board games and esports, and more recently large language models.
The difference in the ratings between two players serves as a predictor of the outcome of a match. Two players with equal ratings who play against each other are expected to score an equal number of wins. A player whose rating is 100 points greater than their opponent’s is expected to score 64%; if the difference is 200 points, then the expected score for the stronger player is 76%.[2]

The ELO rating model is implemented in Python, leveraging historical NBA game data to calculate and update team ratings. The process begins by importing necessary libraries and reading the data from a CSV file. Unique teams are identified, and their match counts are calculated to understand the dataset’s structure.

ELO ratings are initialised at 1000 for all teams, and a K-factor of 10 is set to control the sensitivity of rating changes. The calculate_elo_rating function computes the expected outcome and ELO change for a given match, while the update_elo_ratings function iterates through the dataset to update ratings based on match outcomes.

In each iteration, the function fetches the teams involved in the match, determines the winner, and calculates the new ELO ratings. The expected outcomes and rating changes are updated in the DataFrame, ensuring that the ratings reflect the latest match results. This updated DataFrame is then saved to a CSV file for further analysis.

import os
import json
import pandas as pd
import pydash
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import numpy as np
import csv
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt # plotting
import seaborn as sns

nba = pd.read_csv("/Users/samyukth/Downloads/nbaelo.csv", index_col=0)

nba.columns

unique_teams = pd.concat([nba['team1'].str.strip(), nba['team2'].str.strip()]).unique()

# Print unique teams
print(unique_teams)

nba_count = {}
for team in unique_teams:
    nba_count[team] = ((nba['team1'].str.strip() == team) | (nba['team2'].str.strip() == team)).sum()

nba_count_df = pd.DataFrame(list(nba_count.items()), columns=['Team', 'MatchesPlayed'])

nba_count_df = nba_count_df.sort_values(by='MatchesPlayed', ascending=False)

print(nba_count_df)

unique_teams = pd.concat([nba['team1'], nba['team2']]).str.strip().unique()
elo_ratings = {team: 1000 for team in unique_teams}

# Set K-factor
k_factor = 10

def calculate_elo_rating(team1_rating, team2_rating, outcome, k_factor):
    expected1 = 1 / (1 + 10 ** ((team2_rating - team1_rating) / 400))
    elo_change = k_factor * (outcome - expected1)
    return expected1, elo_change

def update_elo_ratings(nba, elo_ratings, k_factor):
    for index, match in nba.iterrows():
        team1 = str(match['team1']).strip()
        team2 = str(match['team2']).strip()
        
        winner = match['Winner'] if pd.notna(match['Winner']) else ''
        
        if winner.strip() == team1:
            outcome_team1 = 1  # Win for team1
            outcome_team2 = 0  # Loss for team2
        elif winner.strip() == team2:
            outcome_team1 = 0  # Loss for team1
            outcome_team2 = 1  # Win for team2
        else:
            outcome_team1 = 0  # Draw
            outcome_team2 = 0  # Draw

        # Get current Elo ratings
        team1_rating = elo_ratings.get(team1, 1000)
        team2_rating = elo_ratings.get(team2, 1000)

        # Calculate Elo changes and expected outcomes
        expected1, elo_change1 = calculate_elo_rating(team1_rating, team2_rating, outcome_team1, k_factor)
        expected2, elo_change2 = calculate_elo_rating(team2_rating, team1_rating, outcome_team2, k_factor)

        # Update Elo ratings in the dictionary
        elo_ratings[team1] += elo_change1
        elo_ratings[team2] += elo_change2

        # Also update the Elo ratings and expected outcomes in the DataFrame
        nba.at[index, 'team1_rating'] = team1_rating
        nba.at[index, 'team2_rating'] = team2_rating
        nba.at[index, 'team1_newrating'] = elo_ratings[team1]
        nba.at[index, 'team2_newrating'] = elo_ratings[team2]
        nba.at[index, 'team1_expected'] = expected1
        nba.at[index, 'team2_expected'] = expected2
        nba.at[index, 'outcometeam1'] = outcome_team1
        nba.at[index, 'outcometeam2'] = outcome_team2
    
    return elo_ratings

# Extract unique teams
unique_teams = pd.concat([nba['team1'], nba['team2']]).astype(str).str.strip().unique()

# Initialize Elo ratings dictionary
elo_ratings = {team: 1000 for team in unique_teams}

# Set K-factor
k_factor = 10

# Initialize Elo ratings columns in the matches DataFrame
nba['team1_rating'] = nba['team1'].astype(str).map(lambda x: elo_ratings.get(x.strip(), 1000)).astype('float64')
nba['team2_rating'] = nba['team2'].astype(str).map(lambda x: elo_ratings.get(x.strip(), 1000)).astype('float64')
nba['team1_newrating'] = None
nba['team2_newrating'] = None
nba['team1_expected'] = None
nba['team2_expected'] = None
nba['outcometeam1'] = None
nba['outcometeam2'] = None

# Update Elo ratings based on matches data
elo_ratings = update_elo_ratings(nba, elo_ratings, k_factor)

# Display updated Elo ratings
for team, rating in elo_ratings.items():
    print(f"{team}: {rating}")

# Print the updated DataFrame
print(nba[['id', 'team1', 'team2', 'team1_rating', 'team2_rating', 'team1_newrating', 'team2_newrating', 'team1_expected', 'team2_expected', 'outcometeam1', 'outcometeam2']])

output_file = 'nbaeloo.csv'
nba.to_csv(output_file, index=False)

import matplotlib.pyplot as plt
import seaborn as sns

elonbaratings = {
    'MIL' : 1056.50377254145,
    'LAL' : 1055.7964880581887,
    'DET' : 773.9219013889455,
    'CHO' : 862.1614346800941,
    'NYK' : 1073.9829199229232,
    'TOR' : 894.7493896593153,
    'NOP' : 1033.0551963189826,
    'MIN' : 1099.1828060752855,
    'MEM' : 931.8066699441748,
    'SAS' : 859.0659176880046,
    'UTA' : 921.52522421674,
    'POR' : 842.3095817666201,
    'PHO' : 1064.1878777984355,
    'ATL' : 968.6283038152345,
    'MIA' : 1044.8228498752976,
    'GSW' : 1048.0299384614661,
    'CLE' : 1039.0889882254073,
    'ORL' : 1008.4059869645248,
    'WAS' : 824.3953595642017,
    'PHI' : 1063.5257584001104,
    'BOS' : 1209.247471991062,
    'CHI' : 992.5413334916867,
    'HOU' : 955.527698914977,
    'DEN' : 1138.2480025455193,
    'SAC' : 1033.3919076659977,
    'IND' : 1029.1673116386419,
    'LAC' : 1054.5531159239165,
    'BRK' : 941.0939370602629,
    'OKC' : 1090.5425454344106,
    'DAL' : 1090.540309968122
}

elo_df = pd.DataFrame(list(elonbaratings.items()), columns=['Team', 'Elo Rating'])
output_file = 'nbaelo5.csv'
elo_df.to_csv(output_file, index=False)

elo_df = elo_df.sort_values(by='Elo Rating', ascending=False)

print(elo_df)

The provided script is a comprehensive tool for analysing NBA Elo ratings using Python. Initially, the script imports several libraries, including pandas for data manipulation, Plotly for visualization, and Scikit-learn for standard scaling. It begins by loading an NBA Elo dataset from a CSV file and proceeds to extract unique team names from the team1 and team2 columns. This extraction ensures that all teams participating in the matches are accounted for in the subsequent analysis.

To assess the number of matches each team has played, the script concatenates and strips any whitespace from the team columns, then calculates the total matches for each team. This data is stored in a dictionary and subsequently converted into a DataFrame, which is sorted to display teams based on their total matches played.

The core of the analysis revolves around calculating and updating Elo ratings, which are a measure of team strength. Each team is initialized with a default Elo rating of 1000. The script defines functions to calculate the expected outcome of a match and the Elo rating changes based on the match results. The expected outcome is determined by the difference in current Elo ratings between two teams, while the change in Elo rating depends on the actual match outcome and a predefined K-factor, set to 10 in this case.

The update_elo_ratings function iterates through each match in the dataset, updating the Elo ratings of the teams involved based on whether they won or lost. This function also updates the DataFrame with the new ratings, expected outcomes, and actual match outcomes. By the end of this process, each match entry in the DataFrame includes comprehensive information on pre-match and post-match Elo ratings, enhancing the dataset's analytical value.

After updating the Elo ratings, the script prints the final ratings for each team and saves the updated DataFrame to a CSV file. Additionally, the script creates a bar plot to visualize the Elo ratings of the teams using Matplotlib and Seaborn. This visualization provides a clear and intuitive display of team strengths based on their Elo ratings.

The script concludes by demonstrating how to filter the DataFrame to visualize Elo ratings over time for specific teams, facilitating further detailed analysis of team performance trends. This comprehensive approach to calculating, updating, and visualising NBA Elo ratings offers valuable insights into team performance and dynamics throughout the season.

Given the snake-like shape of the model, the Indiana Pacers’ elo-rating essentially acts as a point of inflection, as it has one of the highest offensive ratings in the league, but also one of the worst defensive ratings. For teams worse then the Pacers, there seems to be a downward linear trend to the worst team, whereas for teams better then the Pacers, the trend is a very flat linear upward trend until it reaches the Phoenix Suns’ ELO rating. For most of the scatterplot, the gap between the teams seems to be reflective of their win percentage, given most of the higher seeded teams beat the lower seeded teams, however, there’s a clear competitive discrepancy between conferences.

The Orlando Magic, for example, finished as the 5th seed and despite a strong 24–12 run to end the regular season and a win percentage of 57.3%, they hover near the initial rating line of 1000, which implies that they matchup poorly with teams worse them then. On the contrary, the Dallas Mavericks finished as the 5th seed and reached the NBA finals having beaten 3 50+ wins teams, including a gentleman’s sweep against the best defense in the league, which explains why their ELO rating is higher then their win/loss record would initially suggest.

Furthermore, out of top 10 ELO-rated teams in the league, 7 of these teams are from the Western Conference, including 3 of those teams who got eliminated from the first round. In fact, if we solely took matches after the trade deadline as our data source, the Dallas Mavericks, for example, had the 2nd best ELO rating, whereas the LA Clippers, who finished above the Mavericks in terms of seeding, performed like a play-in team, largely due to injuries and key absentees. Ultimately, in high-level sports, where the margins are so so fine, momentum and form plays a huge factor, which can help explain the momentum these teams come with heading into the playoffs. For example, ranging from the Orlando Magic to the Cleveland Cavaliers, 7 of these teams made the playoffs, but 6 out of these 7 teams got eliminated in the first round.

Augmenting the ELO Model based on different k-factors

In the dataset with different k-factor, the script defines functions for Elo rating calculation (`calculate_elo_rating`) and determining the K-factor (`determine_k_factor`) based on the playoff status of the match. The K-factor adjustment ensures that playoff matches, where stakes are higher, contribute more significantly to Elo rating changes.

import pandas as pd

# Read the NBA data
nba = pd.read_csv("/Users/user/Downloads/filename.csv", index_col=0)

# Function to calculate Elo rating
def calculate_elo_rating(team1_rating, team2_rating, outcome, k_factor):
    expected1 = 1 / (1 + 10 ** ((team2_rating - team1_rating) / 400))
    elo_change = k_factor * (outcome - expected1)
    return expected1, elo_change

# Function to determine K-factor based on playoff value
def determine_k_factor(playoff):
    if playoff == 0:
        return 10
    elif playoff == 0.5:
        return 15
    elif playoff == 1:
        return 25
    elif playoff == 2:
        return 30
    elif playoff == 3:
        return 40
    elif playoff == 4:
        return 50
    else:
        return 10  # Default K-factor if playoff value is not in the specified range

# Function to update Elo ratings
def update_elo_ratings(nba, elo_ratings):
    for index, match in nba.iterrows():
        team1 = str(match['team1']).strip()
        team2 = str(match['team2']).strip()
        
        winner = match['Winner'] if pd.notna(match['Winner']) else ''
        
        if winner.strip() == team1:
            outcome_team1 = 1  # Win for team1
            outcome_team2 = 0  # Loss for team2
        elif winner.strip() == team2:
            outcome_team1 = 0  # Loss for team1
            outcome_team2 = 1  # Win for team2
        else:
            outcome_team1 = 0  # Draw
            outcome_team2 = 0  # Draw

        # Determine K-factor based on playoff value
        k_factor = determine_k_factor(match['playoff'])

        # Get current Elo ratings
        team1_rating = elo_ratings.get(team1, 1000)
        team2_rating = elo_ratings.get(team2, 1000)

        # Calculate Elo changes and expected outcomes
        expected1, elo_change1 = calculate_elo_rating(team1_rating, team2_rating, outcome_team1, k_factor)
        expected2, elo_change2 = calculate_elo_rating(team2_rating, team1_rating, outcome_team2, k_factor)

        # Update Elo ratings in the dictionary
        elo_ratings[team1] += elo_change1
        elo_ratings[team2] += elo_change2

        # Also update the Elo ratings and expected outcomes in the DataFrame
        nba.at[index, 'team1_rating'] = team1_rating
        nba.at[index, 'team2_rating'] = team2_rating
        nba.at[index, 'team1_newrating'] = elo_ratings[team1]
        nba.at[index, 'team2_newrating'] = elo_ratings[team2]
        nba.at[index, 'team1_expected'] = expected1
        nba.at[index, 'team2_expected'] = expected2
        nba.at[index, 'outcometeam1'] = outcome_team1
        nba.at[index, 'outcometeam2'] = outcome_team2
    
    return elo_ratings

# Extract unique teams
unique_teams = pd.concat([nba['team1'], nba['team2']]).astype(str).str.strip().unique()

# Initialize Elo ratings dictionary
elo_ratings = {team: 1000 for team in unique_teams}

# Initialize Elo ratings columns in the matches DataFrame
nba['team1_rating'] = nba['team1'].astype(str).map(lambda x: elo_ratings.get(x.strip(), 1000)).astype('float64')
nba['team2_rating'] = nba['team2'].astype(str).map(lambda x: elo_ratings.get(x.strip(), 1000)).astype('float64')
nba['team1_newrating'] = None
nba['team2_newrating'] = None
nba['team1_expected'] = None
nba['team2_expected'] = None
nba['outcometeam1'] = None
nba['outcometeam2'] = None

# Update Elo ratings based on matches data
elo_ratings = update_elo_ratings(nba, elo_ratings)

# Display updated Elo ratings
for team, rating in elo_ratings.items():
    print(f"{team}: {rating}")

# Print the updated DataFrame
print(nba[['id', 'team1', 'team2', 'team1_rating', 'team2_rating', 'team1_newrating', 'team2_newrating', 'team1_expected', 'team2_expected', 'outcometeam1', 'outcometeam2']])

# Save the updated DataFrame to a CSV file
output_file = 'nbaeloo.csv'
nba.to_csv(output_file, index=False)

The main function (`update_elo_ratings`) iterates through each match in the NBA DataFrame, determines the outcome (win, loss, or draw) for each team, calculates Elo rating changes using the previously defined functions, and updates the Elo ratings accordingly. It also updates the DataFrame with columns for current and updated Elo ratings, expected outcomes, and match results.

After updating the Elo ratings for all matches, the script prints out the final Elo ratings for each team and displays a subset of the updated DataFrame containing Elo-related columns. Finally, it saves the entire updated DataFrame to a new CSV file (`nbaeloo.csv`) for further analysis or reporting purposes.

This script is designed to provide a comprehensive analysis of NBA match data using Elo ratings, adjusting the ratings dynamically based on match outcomes and playoff significance, thus reflecting the teams’ performance over time accurately.

The comparison of the new Elo ratings to the old ratings for NBA teams reveals intriguing insights into their recent performances and overall competitiveness. Several teams, including the Boston Celtics (BOS), Denver Nuggets (DEN), and Dallas Mavericks (DAL), have notably increased their ratings, suggesting strong recent performances that have bolstered their standings. Conversely, teams like the Detroit Pistons (DET), Charlotte Hornets (CHO), and San Antonio Spurs (SAS) show relatively stable ratings, indicating consistent but not dramatically shifting performance levels over the analyzed period. This stability suggests these teams have maintained a competitive balance without significant fluctuations in their Elo ratings.

Among teams showing substantial increases, the Minnesota Timberwolves (MIN), Oklahoma City Thunder (OKC), and Phoenix Suns (PHO) stand out, indicating significant recent successes that have improved their Elo ratings markedly. This upward trend suggests these teams have performed well against opponents with higher ratings, potentially signalling their rising competitiveness or strategic improvements. In contrast, teams like the Washington Wizards (WAS) and Sacramento Kings (SAC) demonstrate more modest changes, reflecting consistent failures in recent years to make the playoffs.

The comparison also underscores the competitive balance among top-rated teams like the Los Angeles Lakers (LAL), Milwaukee Bucks (MIL), and Golden State Warriors (GSW), who maintain consistently high Elo ratings. These teams are positioned as strong contenders with proven track records of competitive performance. Conversely, middle-tier teams such as the Chicago Bulls (CHI), Houston Rockets (HOU), and New Orleans Pelicans (NOP) exhibit stable ratings, suggesting competitive play but with room for improvement to reach elite status.

Overall, the new Elo ratings offer valuable insights into team dynamics and playoff potential in the NBA. They provide a snapshot of recent team performances, indicating trends in competitiveness, strategic strengths, and areas for development. Coaches, analysts, and fans can use these ratings to assess team strategies, predict playoff potentials, and understand the evolving competitive landscape of professional basketball.

In summary, predicting NBA game outcomes using player-based performance measures offers a nuanced and detailed understanding of individual contributions and matchups. This approach provides a granular analysis, allowing for insights into how specific player interactions and roles impact the overall game. The use of raw player stats, defensive ratings, and various in-game metrics enables the model to capture the complexities of basketball dynamics. Moreover, simulations and Monte Carlo analysis enhance prediction accuracy by accounting for numerous game scenarios and series outcomes.

However, it’s important to consider the pros and cons of different models. The player-based model, while detailed, can be computationally intensive and requires extensive data on individual performances. It excels in providing insights into player contributions but might struggle with incorporating team chemistry and synergy effects comprehensively.

In contrast, the team-based model aggregates data at the team level, simplifying the analysis and focusing on team dynamics rather than individual performances. This model is less data-intensive and can be easier to implement. However, it might miss critical individual contributions that significantly impact game outcomes, especially in games where star players’ performances are pivotal.

The Elo rating model offers a straightforward and intuitive approach by assigning ratings to teams based on their performance and adjusting these ratings after each game. It’s less data-intensive and provides a quick way to gauge team strength. Nevertheless, it might oversimplify the complexities of basketball by not accounting for individual player performances and in-game events that can influence outcomes.

Through the comparison, it becomes evident that each model has its strengths and limitations. The player-based model offers depth and detail, the team-based model provides a broader overview, and the Elo rating model delivers simplicity and ease of use. By integrating elements from each approach, a more robust and comprehensive prediction model can be developed, leveraging the detailed insights from player-based analysis, the holistic view from team-based models, and the simplicity of the Elo rating system.

This comprehensive approach to predicting NBA game outcomes not only improves accuracy but also enriches our understanding of the game, highlighting the intricate balance between individual brilliance and team dynamics. As evidenced by its close alignment with actual NBA playoff results, this multi-faceted model demonstrates its effectiveness and potential for future applications in sports analytics.