Pythagorean Expectation: A Guide to Estimating Win Probabilities in any Sport

Div Tiwari
6 min read · Dec 19, 2022

Pythagorean expectation is a statistical model commonly used in baseball to estimate a team’s expected winning percentage based on their runs scored and runs allowed. It was initially developed by Bill James, a pioneer in the field of baseball analytics (aka Sabermetrics), and has since become a widely accepted tool in the sport for evaluating team performance.

However, the application of Pythagorean expectation is not limited to baseball. It can be generalized to estimate win probabilities in any sport where the result is decided by comparing the two teams’ points or goals. This includes basketball, hockey, soccer, and many others. With some caveats, it can also be applied to cricket.

In this post, we will explore the concept of Pythagorean expectation and how it can be applied to estimate win probabilities in any sport. In the spirit of sensible data science, I will present a mathematical derivation followed by a full example for data from ODI and T20 cricket. By the end of this post, you should have an understanding of how to use Pythagorean expectation to make informed predictions of the expected win percentage for sports teams.

The Idea

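Here is the original formula, which holds for baseball:

$$\text{Win Ratio} = \frac{(\text{Runs Scored})^2}{(\text{Runs Scored})^2 + (\text{Runs Allowed})^2}$$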

The exponent of 2 is what lends this model its name: it is reminiscent of Pythagoras’ theorem, which relates the side lengths of a right-angled triangle.

A team whose actual win ratio is greater than this prediction has won more than it “should” have, and vice versa. Over a long enough period, we can arguably expect a team’s win ratio to ‘regress to the mean’ that the formula predicts. The appeal of applying the formula to the sport of your choice is obvious. The question is: does it still work for another sport?
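To make the comparison between actual and predicted win ratios concrete, here is a minimal Python sketch. The function name `pythagorean_expectation` and the season totals are my own placeholders for illustration, not real data; the exponent is left as a parameter so the same function can be reused for other sports.

```python
def pythagorean_expectation(scored: float, allowed: float, exponent: float = 2.0) -> float:
    """Expected win ratio from points scored and points allowed."""
    if scored <= 0 or allowed <= 0:
        raise ValueError("scored and allowed must be positive")
    s = scored ** exponent
    a = allowed ** exponent
    return s / (s + a)


# Hypothetical season totals, invented for illustration only.
runs_scored, runs_allowed = 800, 700
wins, losses = 95, 67

expected = pythagorean_expectation(runs_scored, runs_allowed)
actual = wins / (wins + losses)
difference = actual - expected  # positive: the team won more than it "should" have

print(f"expected={expected:.3f} actual={actual:.3f} difference={difference:+.3f}")
```

A positive difference marks a team that has outperformed its run totals, which is the kind of team we might expect to drift back towards the prediction.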

I have good news and bad news. The bad news is that the formula shown above is a terribly poor predictor for a different sport like cricket. The good news is that it can be ‘fixed’ by simply changing the exponent. The question now is — how do we calculate the correct exponent for another sport?

The Math

Here, I will derive an equation that gives us the exponent to use in the Pythagorean expectation formula for a different sport. This section is optional; I recognize that not all share my enthusiasm for mathematics.

Assume that p is the exponent that correctly predicts the Win Ratio. I will use R to denote Runs scored and R_A for Runs allowed. We have

$$\text{Win Ratio} = \frac{R^p}{R^p + R_A^p}$$

Win Ratio is Wins/(Wins + Losses). Using W and L for Wins and Losses, respectively, a series of simple manipulations gives us:

$$\frac{L}{W} = \left(\frac{R_A}{R}\right)^p$$

The ratio of losses to wins has a power-law relationship with the ratio of runs allowed to runs scored. Invert the equality so it uses the more common wins-to-losses ratio and then take the logarithm:

$$\log\left(\frac{W}{L}\right) = p \, \log\left(\frac{R}{R_A}\right)$$

There it is! The desired exponent p can be determined from a linear fit of two quantities, both of which we can calculate easily from data.
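If you would rather check the algebra numerically than by hand, the following Python sketch does so. The exponent and run totals are arbitrary values I picked for the check, and `win_ratio` is a hypothetical helper name.

```python
import math


def win_ratio(scored: float, allowed: float, p: float) -> float:
    """Pythagorean win ratio with a general exponent p."""
    return scored ** p / (scored ** p + allowed ** p)


p = 8.0                          # arbitrary exponent chosen for the check
scored, allowed = 160.0, 150.0   # hypothetical run totals

w = win_ratio(scored, allowed, p)        # plays the role of W / (W + L)
lhs = math.log(w / (1 - w))              # log(W / L)
rhs = p * math.log(scored / allowed)     # p * log(R / R_A)

print(lhs, rhs)
assert math.isclose(lhs, rhs)
```

The two sides agree to floating-point precision, which is what the derivation says should happen for any positive R, R_A and p.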

The Data Analysis

In this section, I will apply the mathematics above to derive a Pythagorean expectation model for limited-overs international cricket, i.e. the ODI and T20I formats of cricket. This will be a three-step process:

  1. Data Wrangling — create a suitable dataset
  2. Model fitting — fit the data to the model derived above to determine the exponent p
  3. Assessment — evaluate how good the models’ predictions are

I’ve used MATLAB for this process and for creating the plots shown below.

Data Wrangling

Using international cricket data presents challenges, such as:

  • There is no ‘every team plays every other team’ league, which can introduce bias in the dataset. An associate member of the ICC like Oman would rarely, if ever, play a team like India or Australia.
  • Teams change over time, and so does their data. For example, Afghanistan has become much better in recent years than it used to be. Their data from 2012 shouldn’t be merged with that from 2022.

To get around the first one, I limit the dataset to matches involving the twelve full ICC members. For the second, we could split the dataset by calendar year. However, this can leave individual teams with very few matches, so the win-loss ratio becomes prone to statistical noise. Instead, I split the T20I data into one two-year period (Jan 2021-Dec 2022) and three-year periods before that, going back to 2008. This gave each team around 25–30 matches in each split. We end up with 60 data points in total.

For ODIs, I’ve used 4-year splits (this makes intuitive sense to me as the period between World Cups) and have 40 data points.

I excluded all ‘No Result’ matches, and Ties count 0.5 to both Wins and Losses.
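The wrangling rules above can be sketched in Python roughly as follows. This is not the code I used (that was MATLAB); the record layout, the `aggregate` and `period_of` names, and the match data are all invented for illustration, and only the 2021-2022 period boundary comes from the text above.

```python
from collections import defaultdict

# Hypothetical match records: (year, team_a, runs_a, team_b, runs_b, result),
# where result is "a", "b", "tie" or "no_result".
matches = [
    (2021, "India", 180, "Australia", 175, "a"),
    (2022, "India", 150, "Australia", 150, "tie"),
    (2022, "Australia", 160, "India", 140, "a"),
    (2022, "India", 90, "Australia", 0, "no_result"),
]

# (first_year, last_year) pairs; the 2018-2020 boundaries are an assumption.
PERIODS = [(2018, 2020), (2021, 2022)]


def period_of(year):
    """Return the period containing the year, or None if it falls outside all of them."""
    for first, last in PERIODS:
        if first <= year <= last:
            return (first, last)
    return None


def aggregate(matches):
    """Sum wins, losses and runs for each (team, period) pair."""
    totals = defaultdict(lambda: {"wins": 0.0, "losses": 0.0, "scored": 0, "allowed": 0})
    for year, team_a, runs_a, team_b, runs_b, result in matches:
        period = period_of(year)
        if result == "no_result" or period is None:
            continue  # 'No Result' matches are excluded entirely
        a = totals[(team_a, period)]
        b = totals[(team_b, period)]
        a["scored"] += runs_a
        a["allowed"] += runs_b
        b["scored"] += runs_b
        b["allowed"] += runs_a
        if result == "tie":
            for side in (a, b):  # a tie counts 0.5 towards both wins and losses
                side["wins"] += 0.5
                side["losses"] += 0.5
        elif result == "a":
            a["wins"] += 1
            b["losses"] += 1
        else:
            b["wins"] += 1
            a["losses"] += 1
    return dict(totals)


totals = aggregate(matches)
for key, row in sorted(totals.items()):
    print(key, row)
```

Each entry of `totals` corresponds to one data point in the fit: one team over one period.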

Model fitting

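Our model (derived above) is linear with just one coefficient, of the form y = p*x, where y = log(W/L) and x = log(R/R_A):

$$\log\left(\frac{W}{L}\right) = p \, \log\left(\frac{R}{R_A}\right)$$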

Here is the log-log plot of the runs scored-runs conceded ratios and win-loss ratios for T20Is, and the model fit statistics:

Scatterplot estimating the Pythagorean exponent for T20I, showing a best-fit line with a slope of 8.3
Linear model:
f(x) = p*x
Coefficient (with 95% confidence bounds):
p = 8.305 (6.813, 9.798)

Goodness of fit:
SSE: 9.269
R-square: 0.6774
Adjusted R-square: 0.6774
RMSE: 0.3964

The same for ODIs:

Scatterplot estimating the Pythagorean exponent for ODI, showing a best-fit line with a slope of 11.1
Linear model:
f(x) = p*x
Coefficient (with 95% confidence bounds):
p = 11.14 (9.732, 12.54)

Goodness of fit:
SSE: 6.025
R-square: 0.8587
Adjusted R-square: 0.8587
RMSE: 0.3931
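For readers without MATLAB, a single-coefficient fit of this kind can be sketched in plain Python. For a line through the origin, the least-squares slope is sum(x*y) / sum(x*x). The `fit_exponent` name and the records below are hypothetical; the RMSE divides by n - 1 (one fitted coefficient), which is consistent with the SSE and RMSE values reported above for 60 and 40 data points.

```python
import math


def fit_exponent(records):
    """Least-squares fit of y = p * x (no intercept) on the log ratios."""
    xs = [math.log(r["scored"] / r["allowed"]) for r in records]
    ys = [math.log(r["wins"] / r["losses"]) for r in records]
    sxx = sum(x * x for x in xs)
    p = sum(x * y for x, y in zip(xs, ys)) / sxx
    sse = sum((y - p * x) ** 2 for x, y in zip(xs, ys))
    rmse = math.sqrt(sse / (len(xs) - 1))  # n - 1 degrees of freedom
    return p, sse, rmse


# Hypothetical (team, period) aggregates, invented for illustration only.
records = [
    {"wins": 18, "losses": 9, "scored": 4300, "allowed": 3950},
    {"wins": 10, "losses": 15, "scored": 3700, "allowed": 3900},
    {"wins": 14, "losses": 13, "scored": 4100, "allowed": 4080},
]

p, sse, rmse = fit_exponent(records)
print(f"p={p:.3f} SSE={sse:.4f} RMSE={rmse:.4f}")
```

This sketch does not compute confidence bounds or R-square, and it needs every record to have at least one win and one loss so that the logarithm is defined.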

Assessment

Eyeballing the best-fit plots and the R-square values suggests that we have quite good models. Still, it is best to convert the predictions back to a metric we are familiar with (in this case the Win %) and perhaps do some statistical analysis of the residuals (the difference between the predicted and actual value for each data point).

Here are the Predicted vs Actual Win % plots for T20I and ODI, along with a y = x line:

Scatterplot assessing the Pythagorean prediction for T20I
Scatterplot assessing the Pythagorean prediction for ODI

I am quite happy with these predictions! Most predictions are no more than 0.10 from the actual value in terms of Win Ratio, i.e. 10 percentage points of Win %. The RMS (root mean square) of the residuals is 0.087 for T20I and 0.067 for ODI.
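Converting the fitted model back to a Win Ratio and summarising the residuals can be sketched in Python as below. The exponent 8.305 is the T20I value from the fit above; the function names and the records are hypothetical, and dividing by n inside the RMS is my assumption about the convention.

```python
import math


def predicted_win_ratio(scored: float, allowed: float, p: float) -> float:
    """Turn the predicted W/L ratio back into W / (W + L)."""
    ratio = (scored / allowed) ** p  # predicted wins-to-losses ratio
    return ratio / (1 + ratio)


def residual_rms(records, p: float) -> float:
    """Root mean square of (predicted - actual) win ratio over all records."""
    residuals = []
    for r in records:
        actual = r["wins"] / (r["wins"] + r["losses"])
        predicted = predicted_win_ratio(r["scored"], r["allowed"], p)
        residuals.append(predicted - actual)
    return math.sqrt(sum(e * e for e in residuals) / len(residuals))


P_T20I = 8.305  # fitted T20I exponent reported above

# Hypothetical (team, period) aggregates, invented for illustration only.
records = [
    {"wins": 18, "losses": 9, "scored": 4300, "allowed": 3950},
    {"wins": 10, "losses": 15, "scored": 3700, "allowed": 3900},
]

for r in records:
    print(round(predicted_win_ratio(r["scored"], r["allowed"], P_T20I), 3))
print("RMS of residuals:", round(residual_rms(records, P_T20I), 3))
```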

Closing Thoughts

We introduced the Pythagorean expectation formula used in baseball analytics and derived how it can be generalized to other sports. We applied that to construct Pythagorean expectation models for T20I and ODI cricket.

As good answers often do, this may have raised some more questions in your mind. Why are the exponent values what they are? What does it mean for the value to be bigger (or smaller)? In what other ways can we use this model? Are there other simple models for predicting Win% in cricket? I invite you to share your thoughts in the comments. I plan to explore these questions in later posts.

If you liked this post, you may also enjoy my other posts on data science applied to cricket (and sports in general).

I am Div Tiwari, an engineer who enjoys using mathematics and computation to better understand how the world works.