Revisiting the Pythagorean Expectations
According to Wikipedia, Pythagorean expectation is a sports analytics formula used to estimate the percentage of games a baseball team “should” have won based on the number of runs it scored and allowed. The assumption is that baseball teams win in proportion to their “quality”, and that their “quality” is measured by the ratio of their runs scored to their runs allowed. If this is true, the win probability can be written as:
Win% = RS^2 / (RS^2 + RA^2)
where RS is runs scored and RA is runs allowed. A slight simplification converts it to:
Win% = 1 / (1 + (RA / RS)^2)
The same idea can be extended to the NBA. We replace runs with points and re-tune the exponent. Historically, the best-fitting exponent for the NBA has been found to lie somewhere in the range of 13 to 17.
Firstly, what does a different exponent value mean?
Table 1 shows how the win percentage changes with the exponent. Moving down the table, the exponent increases. Moving away from the central column (score ratio = 1), the scoring ratio rises or falls. The further down the table we go, the more a small change in the scoring ratio away from 1 changes the win percentage. This is in line with our understanding of the two sports. In MLB, the spread of win percentages isn’t very wide: the best teams win around 105 of 162 games, which is only 65%. In the NBA, the best teams win close to 67 of 82 games, which is about 82%. The exponent for the NBA is also higher because of the nature of the two sports. Basketball teams score roughly 110 points per game, while baseball teams score close to 4.5 runs per game, yet a team still needs only one extra point to win a basketball game. Therefore, a 5% increase in scoring in basketball adds far more points and, potentially, a lot more wins.
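The effect of the exponent can be reproduced with a few lines of Python. This is a minimal sketch, not the code behind Table 1: the function name and the grid of scoring ratios are assumptions chosen for illustration, while the exponents are the ones quoted in this article for MLB (1.83, 2) and the NBA (14, 16.5).

```python
def pythagorean_win_pct(ratio, exponent):
    """Expected win fraction for a given scored/allowed ratio and exponent."""
    powered = ratio ** exponent
    return powered / (1.0 + powered)


# Illustrative grid (an assumption, not the article's Table 1).
RATIOS = [0.90, 0.95, 1.00, 1.05, 1.10]
EXPONENTS = [1.83, 2.0, 14.0, 16.5]


def print_table():
    # Header row: one column per scoring ratio.
    print("exponent " + " ".join(f"{r:>7.2f}" for r in RATIOS))
    for k in EXPONENTS:
        row = " ".join(f"{pythagorean_win_pct(r, k):>7.1%}" for r in RATIOS)
        print(f"{k:>8.2f} {row}")


print_table()
```

Every row is 50% at a ratio of 1, and the rows with larger exponents fan out much faster on either side of it.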
The purpose of this article is to improve the win expectation formulae by making them more specific, based on what we know about the quality of the divisions and conferences in each sport.
We know that scoring in the NBA has changed a lot in the last couple of years, so I’ll use only the two most recent seasons: 2016–17 and 2017–18. Computing the error at different exponent values shows that the root mean square error (RMSE) is minimized at 5.45% when the exponent is 14.23. For reference, basketball-reference.com uses 14 and ESPN uses 16.5.
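The search for the RMSE-minimizing exponent can be sketched as a simple grid search. This is an illustration under stated assumptions, not necessarily the procedure used for the numbers above: the function names, the search range, and the step size are all hypothetical, and the team records are synthetic (generated from the formula itself with a known exponent), so they only show that the search recovers the exponent it should.

```python
import math


def pythagorean_win_pct(scored, allowed, exponent):
    ratio = scored / allowed
    return ratio ** exponent / (1.0 + ratio ** exponent)


def rmse(teams, exponent):
    """teams: list of (points_scored, points_allowed, actual_win_pct)."""
    squared = [(pythagorean_win_pct(s, a, exponent) - w) ** 2 for s, a, w in teams]
    return math.sqrt(sum(squared) / len(squared))


def best_exponent(teams, low, high, step=0.01):
    """Return the exponent on a regular grid that minimizes the RMSE."""
    n_steps = int(round((high - low) / step))
    candidates = [low + i * step for i in range(n_steps + 1)]
    return min(candidates, key=lambda k: rmse(teams, k))


# SYNTHETIC data: per-game scoring lines invented for the demo, with
# "actual" win percentages produced by the formula at a known exponent.
TRUE_EXPONENT = 14.0
synthetic_points = [(112.0, 104.0), (108.5, 106.0), (105.0, 105.0),
                    (103.0, 107.5), (99.0, 108.0)]
synthetic_teams = [(s, a, pythagorean_win_pct(s, a, TRUE_EXPONENT))
                   for s, a in synthetic_points]

fitted = best_exponent(synthetic_teams, low=10.0, high=20.0)
print(f"fitted exponent: {fitted:.2f}, RMSE: {rmse(synthetic_teams, fitted):.6f}")
```

With real data, `synthetic_teams` would be replaced by one record per team-season holding its actual scoring totals and win percentage.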
The NBA has shifted its focus to per-100-possession statistics. Let us see if moving the computation to a per-100-possession basis improves the win projection. I don’t expect much of a difference, because most games end up with a similar number of possessions. Still, it is worth checking. Repeating the process with per-100-possession stats leads to almost the same RMSE: it improved by 0.005 wins, if that counts.
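Writing the per-100-possession ratio out suggests why so little changes. The symbols below are assumptions introduced for illustration: PTS and OPP are a team's points scored and allowed, and Poss_team and Poss_opp are the possessions used by the team and by its opponents.

```latex
\[
\frac{100 \cdot \mathrm{PTS} / \mathrm{Poss}_{\mathrm{team}}}
     {100 \cdot \mathrm{OPP} / \mathrm{Poss}_{\mathrm{opp}}}
= \frac{\mathrm{PTS}}{\mathrm{OPP}} \cdot
  \frac{\mathrm{Poss}_{\mathrm{opp}}}{\mathrm{Poss}_{\mathrm{team}}}
\approx \frac{\mathrm{PTS}}{\mathrm{OPP}}
\quad \text{when } \mathrm{Poss}_{\mathrm{opp}} \approx \mathrm{Poss}_{\mathrm{team}}
\]
```

Because possessions alternate, a team and its opponents finish with nearly equal possession counts, so the ratio fed into the formula barely moves.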
Additionally, the Eastern Conference has been much weaker than the Western Conference. In those two seasons, teams from the Western Conference beat their Eastern Conference counterparts in 53.44% of the games. Let us see if there is merit in having separate exponents for games within each conference and for inter-conference games. Table 2 shows the results.
Breaking the Pythagorean formula down by conference reduces the RMSE from 5.45% to 5.18%. Over an 82-game season, this is equivalent to about 0.22 wins. The exponents themselves make for interesting reading. The exponent is highest for games within the Eastern Conference. Since the teams in the East are more closely bunched, a small points advantage leads to more wins. The exponent is lowest for inter-conference games, where a small increase in points doesn’t translate into as many wins because of the difference in quality between the teams.
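One plausible way to set up a split like this, and not necessarily the procedure used here, is to apply a separate exponent to each type of game and weight the results by games played. In the sketch below, the function names, the team totals, and the exponent values are all hypothetical placeholders (they are not the fitted exponents from Table 2). Only the 52/30 split of an 82-game schedule and the two RMSE figures come from the NBA schedule and this article.

```python
def win_pct(scored, allowed, exponent):
    ratio = scored / allowed
    return ratio ** exponent / (1.0 + ratio ** exponent)


def segmented_win_pct(segments, exponents):
    """segments: game_type -> (games, points_scored, points_allowed).

    Each game type gets its own exponent; the overall expectation is the
    games-weighted average of the per-type expectations.
    """
    total_games = sum(games for games, _, _ in segments.values())
    expected_wins = sum(
        games * win_pct(scored, allowed, exponents[game_type])
        for game_type, (games, scored, allowed) in segments.items()
    )
    return expected_wins / total_games


def rmse_gap_in_wins(rmse_before, rmse_after, games):
    """Convert a drop in win-percentage RMSE into wins over a season."""
    return (rmse_before - rmse_after) * games


# HYPOTHETICAL team totals and PLACEHOLDER exponents, for illustration only.
team = {"intra": (52, 5720.0, 5564.0), "inter": (30, 3240.0, 3270.0)}
placeholder_exponents = {"intra": 15.0, "inter": 13.0}
print(f"expected win%: {segmented_win_pct(team, placeholder_exponents):.3f}")

# RMSE figures quoted above: 5.45% before the split, 5.18% after.
gap = rmse_gap_in_wins(0.0545, 0.0518, 82)
print(f"improvement: {gap:.2f} wins per 82 games")
```

The last two lines show where the 0.22-win figure comes from: a 0.27 percentage point drop in RMSE, multiplied by 82 games.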
As with the NBA, there is no consensus on the exponent to use in the MLB win expectation formula. For example, Baseball-Reference uses 1.83 while ESPN uses 2. To make the calculation of the exponent more granular, it would be ideal if we could somehow isolate the weaker divisions from the stronger ones. Fortunately, MLB does not have a perennially dominant division or a perennially weak one. It is hard to argue that any of the six divisions has been exceptionally strong or weak for a long time. For MLB teams that do not spend exorbitant amounts, competitiveness is cyclical. Also, teams play very few games against teams from any one division of the other league. Put these two facts together and it makes sense not to use exponents at the league-division level.
My next instinct was to take a route similar to the one we took for the NBA. Just as we had different exponents for the NBA conferences, I tried to see if different exponents for the NL and the AL would improve the calculations. For this, I used the previous 9 seasons (2010–2018) of game logs from Retrosheet.org. Over that period, the overall error is 2.47% with an exponent of 1.83 and 2.63% with an exponent of 2, so the projections at Baseball-Reference work better. Table 3 shows the results.
As you can see, the ploy is unsuccessful for MLB: we end up increasing the error. That is mostly because of the small sample of inter-league games. Only 11.7% of the games involved one NL team and one AL team. With neither league being consistently better than the other, it is probably for the best that this did not work. Even if the results had been favorable now, they might not hold in the near future, as the competitive balance continually swings from one side to the other.
But there is something interesting to note in Table 3. All of the exponents are below 1.83. This might indicate that even 1.83 is too high an exponent. I re-did the entire exercise at the overall level (the same as the league tables that we see) and found that the RMSE was minimized when the exponent was 1.74. Image 1 shows the RMSE-minimizing exponent for each year in that period.
The orange line represents 1.83, the exponent used by Baseball-Reference. The blue line, which represents the RMSE-minimizing exponent, has been below the orange line for the last 6 years. This almost exactly coincides with the widespread introduction of defensive shifts in MLB. Shifts have reduced run-scoring in MLB, which might have pushed the exponent even lower than where we thought it should be. The optimal exponent for that period is 1.72. Changing the exponent from 1.83 to 1.74 improves the predictions by 0.04 wins. This isn’t significant enough to demand a widespread change in the formula. Nevertheless, it does remind us to keep monitoring the value, especially if the sport takes a major turn in the near future.