Investigating Baseball Wins

Understanding what it takes to win baseball games using statistical techniques

Aadit Sheth
The Sports Scientist
Jun 8, 2020

Major League Baseball (MLB) is the oldest of the major professional sports leagues in North America. Since its founding in 1903, the game of baseball has changed significantly in several respects. Not only has the baseball itself changed since the dead-ball era (pre-1920), but the game has also evolved through rule changes, the influence of technology and the growth of baseball statistics, also known as sabermetrics. The quantitative approach to baseball gained recognition through Michael Lewis’ book “Moneyball” and its film adaptation, and is now a common avenue for evaluating team and player performance. This article focuses on team statistics and attempts to build a model that predicts a team’s performance from traditional on-field baseball statistics. All modelling is based on data from ‘Lahman’s Baseball Database’; the data was cleaned and all models were built on post-1950 data.

Runs buy Wins

Figure 1: Influence of run differential on winning baseball games

Just as in all team sports, assembling a winning team depends on the synergy between on-field player performance and off-field management decisions. In baseball, the on-field contribution is heavily dependent on run differential, which is simply the runs a team scores minus the runs it allows. As one would guess, the more runs a team scores, the better its offence, and the fewer it surrenders, the better its defence. Figure 1 plots winning percentage against run differential per game for all teams across 70 seasons. The crystal-clear message from the plot is that runs generate wins. Irrespective of league, the higher the run differential per game, the more wins a team collects in a season. Another intriguing observation is the intercept of the line that regresses winning percentage on run differential: when a team’s run differential is zero on average, the team plays par baseball, i.e. has an approximately .500 record.

With a correlation of approximately 0.95 between run differential and winning percentage, the covariates that explain run differential should also go a long way towards explaining winning percentage. Thus, run differential is modelled through two sub-models, one for runs scored and the other for runs allowed. This aligns closely with what the Oakland A’s assistant GM, Peter Brand, says in Moneyball: “Your goal shouldn’t be to buy players. Your goal should be to buy wins. In order to buy wins, you need to buy runs.”
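To make the regression behind Figure 1 concrete, here is a minimal sketch of a one-variable least squares fit of winning percentage on run differential per game. The five team-seasons are made-up numbers for illustration only (they are not Lahman data), and the helper names are hypothetical; the article’s own fit was run on the full post-1950 dataset.

```python
# Hypothetical team-seasons: (runs scored, runs allowed, games, wins).
# These rows are invented for illustration; they are NOT from the Lahman database.
seasons = [
    (800, 650, 162, 98),
    (750, 700, 162, 88),
    (700, 700, 162, 81),
    (680, 740, 162, 75),
    (620, 790, 162, 64),
]


def run_diff_per_game(runs_scored, runs_allowed, games):
    """Run differential (runs scored minus runs allowed) per game played."""
    return (runs_scored - runs_allowed) / games


def fit_line(xs, ys):
    """Simple least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [run_diff_per_game(rs, ra, g) for rs, ra, g, _ in seasons]
ys = [wins / g for _, _, g, wins in seasons]  # winning percentage

slope, intercept = fit_line(xs, ys)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

With data shaped like this, the intercept lands near .500 and the slope is positive, which is the pattern the plot describes: a team that breaks even on runs breaks even on wins.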

Swing harder, Miss more

While building the model for runs scored, one key observation stood out, one that has been talked about a lot since the 2019 season: the home run surge. On MLB’s all-time single-season team home run list, the top four teams set their marks in 2019 (Twins, Yankees, Astros and Dodgers), and fifth on the list are the Yankees from the 2018 season. As the number of home runs has escalated, so has the number of strikeouts. Batters are now swinging “more wildly”, and a batter at the plate is now more likely to strike out than to get a hit, which is a very recent trend.

Figure 2: Batters are striking out at an increasing rate in recent years

Figure 2 shows the rapid rise in the number of strikeouts per game over the last century; in fact, the strikeouts-per-game rate has risen for 14 straight seasons. The rise has also been drastic: in 1981, a team would on average strike out 4.7 times per game, and by 2019 this had risen to 8.8, a staggering increase of close to 90%. Figure 3 shows the relationship between home runs and strikeouts and illustrates the emerging trend of both rising together.
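The size of that jump is easy to check from the two per-game figures quoted above. A small sketch (the function name is just for illustration):

```python
def percent_change(old, new):
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100


# Strikeouts per team per game, as quoted in the text.
so_per_game_1981 = 4.7
so_per_game_2019 = 8.8

increase = percent_change(so_per_game_1981, so_per_game_2019)
print(f"Strikeouts per game rose by {increase:.1f}%")
```

The result is a little over 87%, which is where the "close to 90%" figure comes from.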

Figure 3: Shift in baseball strategy over the last 100 years

Modelling Runs

An ordinary least squares (OLS) regression model was constructed for the runs scored by a team. The following variables were significant and were included in the model: total bases (a weighted sum of hits, i.e. 1 for a single, 2 for a double, 3 for a triple and 4 for a home run), strikeouts, walks and stealing percentage (steals divided by steal attempts). Similarly, various regression models were built for the runs allowed by a team in a single season. The most significant variables were earned run average (ERA), strikeout-to-walk ratio, save percentage, errors, complete games and shutouts per season.
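The two derived offensive variables can be sketched as small helper functions. Two assumptions here are not spelled out in the text: that singles are obtained by subtracting extra-base hits from total hits, and that steal attempts are stolen bases plus times caught stealing. The season line at the bottom is hypothetical.

```python
def singles_from_hits(hits, doubles, triples, home_runs):
    """Singles are the hits that were not doubles, triples or home runs."""
    return hits - doubles - triples - home_runs


def total_bases(singles, doubles, triples, home_runs):
    """Weighted sum of hits: 1 per single, 2 per double, 3 per triple, 4 per home run."""
    return singles + 2 * doubles + 3 * triples + 4 * home_runs


def stealing_percent(stolen_bases, caught_stealing):
    """Steals divided by attempts, assuming attempts = steals + caught stealing."""
    attempts = stolen_bases + caught_stealing
    if attempts == 0:
        return 0.0  # no attempts: define the rate as zero rather than divide by zero
    return stolen_bases / attempts


# Hypothetical team season, for illustration only.
hits, doubles, triples, home_runs = 1400, 280, 30, 200
singles = singles_from_hits(hits, doubles, triples, home_runs)
tb = total_bases(singles, doubles, triples, home_runs)
sb_pct = stealing_percent(90, 30)
print(singles, tb, sb_pct)
```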

For the statistics nerds: candidate models were cross-validated on a 70% training dataset, and once the top couple of models had been selected, they were tested on the remaining 30% holdout dataset to guard against over-fitting. Variables were selected based on several model diagnostics:

  • statistical significance (p-values) of the variables included in the model
  • the Akaike Information Criterion (AIC), used to evaluate the parsimony of the model
  • Variance Inflation Factors (VIF), used to check for signs of multicollinearity between model variables
  • adjusted R-squared, used to evaluate the goodness-of-fit of the model adjusted for the number of predictors in the model
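For readers who want the formulas behind three of these diagnostics, here is a minimal sketch using their textbook definitions. The article does not say which software or exact variants were used, so treat this as an assumption: the AIC below is the common least squares form (up to an additive constant), and the VIF takes as input the R-squared from regressing one predictor on all the others.

```python
import math


def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """R-squared penalised for the number of predictors in the model."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


def aic_least_squares(rss, n_obs, n_params):
    """AIC for a Gaussian least squares fit, up to an additive constant.

    rss is the residual sum of squares; n_params counts estimated parameters.
    Lower is better, and each extra parameter costs 2 points.
    """
    return n_obs * math.log(rss / n_obs) + 2 * n_params


def variance_inflation_factor(r_squared_j):
    """VIF for predictor j, given R-squared of predictor j regressed on the others."""
    return 1 / (1 - r_squared_j)


# Illustrative inputs only; none of these numbers come from the article's models.
print(adjusted_r_squared(0.90, n_obs=101, n_predictors=10))
print(aic_least_squares(rss=2.0, n_obs=100, n_params=5))
print(variance_inflation_factor(0.80))
```

A common rule of thumb treats a VIF well above 5 to 10 as a warning sign of multicollinearity, though the threshold used for these models is not stated.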

Once the variables had been selected, the residuals were assessed and a normality assumption check was carried out. Finally, the model predictions were compared against the actual values via metrics such as the root mean squared error (RMSE) and mean absolute error (MAE).
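The 70/30 split and the two error metrics can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the author’s code: the shuffling, the seed and the function names are hypothetical, and the sample values are made up.

```python
import math
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and split it into training and holdout sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def rmse(actual, predicted):
    """Root mean squared error: penalises large misses more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: the average size of a miss."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Ten placeholder team-seasons, split 70/30.
train, holdout = train_test_split(list(range(10)))
print(len(train), len(holdout))

# Made-up observed and predicted winning percentages.
observed = [0.500, 0.600]
predicted = [0.520, 0.580]
print(rmse(observed, predicted), mae(observed, predicted))
```

Both metrics are in the units of the response, so for a winning-percentage model they read directly as "how many points of winning percentage the prediction is off by".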

The final model for team wins per season was composed by integrating the variables from the runs scored and runs allowed models. As mentioned earlier, baseball is more than just a sport played on the field; successful teams rely on various other factors such as management decisions, team payrolls, scouting efficiency and many others. Although the focal point of this article is on-field team performance, elements such as attendance, park factors and coaching staff do affect every team’s success. With this in mind, variables such as attendance per game and overall park factors were considered for the final model. When ballpark attendance was analyzed, there was a clear correlation between winning and attendance. In the last decade, the top three teams in winning percentage were the New York Yankees, Los Angeles Dodgers and St. Louis Cardinals; unsurprisingly, the top three in attendance were the Dodgers, Cardinals and Yankees: a different order, but the same teams.

Figure 4: Summary of the predictors used in the final regression model

Therefore, the final model for winning percentage was built by combining the offence and defence models with these miscellaneous variables. The summary of the final regression model is shown in figure 4. The adjusted R-squared is 0.889, meaning the model explains roughly 89% of the variation in winning percentage after adjusting for the number of predictors, which suggests that the variables used predict wins quite well. In addition, the root mean squared error (RMSE) and mean absolute error (MAE) are very low: the RMSE of the model is 0.024 and the MAE is 0.020. Such low values imply that the differences between observed and predicted winning percentages are small; over a 162-game season, an error of 0.024 corresponds to about four wins. It is worth interpreting a few of the significant predictors. From the summary, we can deduce the following (assuming a 162-game season):

  • all else equal, one extra total base per game can result in approximately 4.75 (0.02931 x 1 x 162) more wins in a season
  • all else equal, reducing the team ERA by 0.5 can result in about 5 (-0.06174 x -0.5 x 162) more wins in a season
  • all else equal, one more walk drawn per game can result in close to 3.7 more wins in a season
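The arithmetic in the first two bullets follows one pattern: coefficient times change in the predictor times games in a season. A short sketch, assuming (as the bullets imply) that the response is winning percentage and that the total bases predictor is measured per game; the function name is hypothetical.

```python
GAMES_PER_SEASON = 162


def extra_wins(coefficient, change, games=GAMES_PER_SEASON):
    """Convert a change in a predictor into wins over a season.

    coefficient: change in winning percentage per unit of the predictor
    change: how much the predictor moves, holding everything else equal
    """
    return coefficient * change * games


# Coefficients as quoted in the text from the figure 4 summary.
wins_from_total_bases = extra_wins(0.02931, 1)   # one extra total base per game
wins_from_era = extra_wins(-0.06174, -0.5)       # team ERA lowered by 0.5

print(f"{wins_from_total_bases:.2f} wins from one extra total base per game")
print(f"{wins_from_era:.2f} wins from lowering ERA by 0.5")
```

The negative sign on the ERA coefficient is what makes a reduction in ERA show up as a gain in wins.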

Building Dream Teams

One hypothetical way to use such a model is to predict the number of wins an all-star team would collect in a season, based on the accumulated statistics of the top players in the league. This is done using the all-star lineups of both the American and National Leagues: the starting hitters are chosen, along with the starting pitchers and relievers who played a role in the all-star game. For attendance and park factors, the league-wide average in each year is used.

Figure 5: Comparing hypothetical all-star team wins against the league

Figure 5 shows the results for the two leagues over the last few years; the bars represent the expected wins of the all-star lineup, the bold line depicts the most games won by any team in the respective league and the lighter line displays the average wins (close to .500). The plot illustrates that all-star teams are superior to the best teams in the league, as expected. Since the 1950s, only three teams have played over .700: the Cleveland Indians in 1954, the Yankees in 1998 and the Seattle Mariners in 2001. The Mariners’ 116 wins are the most by any team since the season was expanded to 162 games. Comparing those 116 wins with figure 5 emphasizes how well the Mariners must have played over 162 games in 2001; 116 wins is close to the number many all-star teams would achieve.
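To put the .700 mark in terms of wins, a quick sketch converts between win totals and winning percentage over a 162-game season. The helper names are hypothetical; the only inputs taken from the text are the 116 wins and the 162-game schedule.

```python
def winning_percentage(wins, games):
    """Share of games won."""
    return wins / games


def min_wins_above(threshold, games):
    """Smallest number of wins whose winning percentage exceeds the threshold."""
    for wins in range(games + 1):
        if wins / games > threshold:
            return wins
    return None  # threshold cannot be exceeded


mariners_2001 = winning_percentage(116, 162)
wins_for_700 = min_wins_above(0.700, 162)

print(f"116 wins in 162 games is a {mariners_2001:.3f} winning percentage")
print(f"A team needs {wins_for_700} wins in 162 games to play over .700")
```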

There are two towering bars that catch the eye: the American League all-star team in 2005 and the National League all-star team in 2015. Had these teams played 162 games, they would both have won a record-breaking 138 games (a 0.852 winning percentage). Investigating these two teams further suggests that they would reach such high win totals for different reasons. In 2005, the AL all-star team had four batters in the starting lineup who smashed more than 40 homers: Alex Rodriguez (48), David Ortiz (47), Manny Ramirez (45) and Mark Teixeira (43). More home runs lead to more total bases and thus a higher winning percentage. This is illustrated in figure 6, which shows that the 2005 AL all-star team had 291 home runs and 990 RBIs; the 990 runs batted in would be the most ever accumulated by an all-star team.

Figure 6: Key statistics from the various all-star teams

On the other hand, the 2015 NL all-star team was theoretically successful due to its great pitching rotation and bullpen. The 2015 NL all-star team had the lowest ERA of all the all-star teams at 2.24. The pitching rotation included the following star-studded names: Zack Greinke, Madison Bumgarner, Gerrit Cole, Jacob deGrom and Clayton Kershaw. During the 2015 season, Greinke finished with a 19–3 record and an ERA of 1.66, the lowest since Hall of Famer Greg Maddux in 1995. Despite a low earned run average and an impressive record, Greinke lost out to Jake Arrieta for the 2015 National League Cy Young Award in one of the greatest Cy Young fields ever. This suggests that the 2015 NL all-star team would, hypothetically of course, have pitched its way to 138 wins that season.

Despite a relatively good model fit and the experiment with all-star teams, there are other practical considerations such as team payroll: it would be impossible for any franchise to afford an all-star team. In addition, mental fortitude is key: player performance in extra innings is crucial, and a team’s approach with a runner in scoring position is pivotal to winning games. The major takeaway from this article is the influence run differential has on winning games. It is very unlikely that a franchise can afford to build a prominent all-star lineup; however, the regression modelling has accentuated the value of building a team around hitters who can hit for more bases and pitchers who are stingy with runs and keep their ERAs low.
