Stat Talk: What’s Wrong with WAR (Wins Above Replacement)?

Joel Kupfersmid, PhD

The WAR statistic has a laudable objective: to approximate the number of wins a team gains or loses when a player is in the line-up instead of his replacement. This statistic tries to meet this objective regardless of the player’s team, home ballpark, or league. Two popular computations are offered, one by FanGraphs (fWAR) and the other by Baseball Reference (rWAR). Each uses a slightly different formula, but for this discussion the differences are inconsequential.

I address problems with the WAR statistic for position players and conclude this measure is seriously flawed. The WAR is not team or ballpark neutral (i.e., context free) and the variables lack empirical support.

Three categories of variables are computed to generate a player’s WAR. One is his hitting proficiency. These measures include batting average (BA), runs batted in (RBIs), and on-base percentage (OBP). A second category is a player’s speed, derived from his success at stealing bases, getting on base via infield singles or errors, going from first to third, and tagging up on outfield flies. The third category is a player’s defensive ranking, a set of estimates of his ability to prevent the opponent from scoring runs. These include a judgment of a fielder’s range, the strength of his throwing arm, and the percentage of physical and mental errors (e.g., throwing to the wrong base). For many of these measures a player’s numbers are compared with the league average.

**WAR Statistics and Context**

In computing the WAR, sabermetricians try to statistically eliminate the influences of a player’s teammates and the ballparks they play in. Even if this is possible, the problem remains that measures are often not isolated or self-contained. Variables cannot be added together to produce an accurate outcome because many measures affect other measures (i.e., the variables are interactive).

Of the three categories, measures of players’ speed are the most objective. One’s speed does not change regardless of team or league. Going from first to third or second to home remains the same distance regardless of ballpark. What may change is the number of attempted steals depending on the manager’s strategy, those hitting after a player, and the player’s OBP.

To steal, one must get on base, and OBP is the best estimate of a player’s ability to do so. OBP is not context free. Once a player is on base, managers have their biases regarding the green light to steal. Sometimes their decision is based on the ability of the next batter to successfully bunt or hit-and-run. Other times it is due to a preference for scoring runs via the extra-base hit rather than risking a base-running out. Players with below average speed will rarely be given the steal sign or used in hit-and-run situations regardless of their teammates’ skills or their manager’s preferences. Only those with above average speed are usually affected by these parameters.

Several measures of defensive skills are judgment calls, including a player’s range and number of mental errors. These variables, however, are generally context free (i.e., not dependent on the player’s team or ballparks).

For many players, hitting proficiency is affected by context, including the team played for, position in the batting order, and home ballpark.

Let’s start with the obvious: ballpark effects. Sabermetricians have calculated this effect, designating some parks as pitcher friendly and others as hitter friendly. This is a clear acknowledgement that the measures of hitting, including BA, OBP, RBIs, runs scored (RS), on-base plus slugging (OPS), and home runs (HRs), are influenced by a player’s home field.

The effects of one’s position in the batting order on hitting were calculated by Tanner Bell of Smart Fantasy Baseball (October 24, 2014). For example, the opportunity to generate runs (either by scoring them or by driving in others) is affected by one’s plate appearances (PA). The typical first batter has 750 PAs per season, the fourth batter 700, and the seventh batter 650.

In 162 games the average first batter scores 95 runs, the fourth batter 80 runs, and the seventh batter 60 runs. Scoring runs is a product of a player’s opportunities (i.e., his PAs), ability to get on base (OBP), the hitting competencies of those batting after him, and his speed. It is no surprise that often the first batter in the line up scores the most runs, having the highest PA and the best hitters batting after him.

Also of no surprise is the paucity of RBIs generated by the typical first batter. These players are guaranteed no one will be on base the first time they bat. For the remainder of the game they rely on the weakest hitters in the line-up to get on base. The typical first batter generates 60 RBIs per season, the fourth batter 90, and the seventh batter 65.

It would be a statistician’s nightmare to calculate the simultaneous interactions in hitting that actually occur from, at the very least, ballpark effects, batting order effects, and the effects of teammates’ hitting on a player’s BA, OBP, RS, and RBIs.

As a simple example, consider that before the Steroid Era only two players hit 60 or more HRs in a season. Ruth and Maris shared several commonalities. Yankee Stadium was their home field. Both hit left-handed and faced the shortest right field fence in baseball. No Yankee right-handed batter, including Hall of Famers DiMaggio and Winfield, ever hit 50 HRs. This suggests a significant stadium effect. There is also a batting order effect for the two. Both hit third, and the fourth batter had a better season than either Ruth or Maris. In 1927, when Ruth hit 60 HRs (BA .356), batting behind him was Gehrig (BA .373, HR 47). The MVP that year was not Ruth; it was Gehrig. In 1961, hitting behind Maris (BA .269) was Mantle (BA .317, HR 54). In both cases pitchers had to pick their poison: avoid walking Ruth and Maris, but then give Gehrig and Mantle nothing good to hit; or give Ruth and Maris no good pitches, but then pitch to Gehrig and Mantle with Ruth or Maris on first. In 1961 pitchers chose to pitch to Maris and avoid Mantle. Playing in 161 games, Maris received no intentional walks (IBB) that year, whereas Mantle, in 153 games, received nine. This statistic was not recorded in 1927.

Additionally, there is the watered-down pitching of 1961. That year the American League expanded from 8 teams to 10 teams. In 1960 the 8 teams averaged 136 HRs; in 1961 this average climbed to 153 per team.

The possible combinations for any one measure of hitting, like RBIs, would include the number of ballparks played in and their respective effects (weighted by the number of games in each park), the player’s place in the batting order, and the hitting proficiency (measured by some weighted combination of BA, OPS, etc.) of the next two or three teammates in the line-up. Even if these variables are accurately measured individually, the interactions among them would be unknown. A statistician could not simply add up each measure but, instead, would have to consider its effects on the other measures. To use an example outside of baseball, four medications can interact in 11 possible combinations of two or more drugs (6 pairs, 4 triples, and 1 four-way combination). Your physician can tell you little about each medication’s interaction with the others. He or she can only say that, so far as he or she knows, these drugs in combination have not yet killed anybody.
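To make the counting concrete, here is a minimal Python sketch. It assumes that an "interaction" means any subset of two or more variables acting together; the function name `interaction_terms` is hypothetical and is not part of any WAR formula. Under that assumption four variables yield 11 interaction terms (6 pairs, 4 triples, 1 four-way), whereas 4 x 3 x 2 x 1 = 24 counts the possible orderings of the four.

```python
from math import comb, factorial


def interaction_terms(k: int) -> int:
    """Count the subsets of size 2..k among k variables.

    Assumption: each such subset is one potential interaction term.
    """
    return sum(comb(k, size) for size in range(2, k + 1))


if __name__ == "__main__":
    # The count grows quickly as variables are added.
    for k in (4, 6, 8):
        print(k, "variables:", interaction_terms(k), "interaction terms;",
              factorial(k), "orderings")
```

Whichever way one counts, the number of terms a statistician would have to estimate grows far faster than the number of variables.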

**WAR and Empirical Support**

Even if the flaws noted in the previous section are resolved, there remain bigger problems. Specifically, there is no gold standard by which to assess the accuracy of any single variable used to compute the WAR or the accuracy of its final number.

The WAR is concerned with *wins*. Yet the measures used, and the statistics computed, generate an estimate of *runs*. Runs are then transformed into an estimate of wins, typically with 10 runs equal to one win. While it is intuitively obvious that more runs scored and fewer runs allowed result in more wins, the contribution (i.e., statistical weight) of each WAR measure to runs scored and runs allowed is not intuitively known.
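The runs-to-wins step can be sketched in a few lines. The function below only illustrates the 10-runs-per-win rule of thumb mentioned above; the name `runs_to_wins` and the sample run total are hypothetical, and the published WAR formulas may use a different or variable conversion rate.

```python
def runs_to_wins(runs_above_replacement: float, runs_per_win: float = 10.0) -> float:
    """Convert an estimate of runs into an estimate of wins.

    Assumption: a fixed conversion rate, 10 runs per win by default.
    """
    if runs_per_win <= 0:
        raise ValueError("runs_per_win must be positive")
    return runs_above_replacement / runs_per_win


if __name__ == "__main__":
    # Hypothetical player estimated at 25 runs above replacement.
    print(runs_to_wins(25.0))
```

Note that nothing in this conversion tests whether the estimated runs actually turned into wins, which is the gap the following paragraphs describe.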

For example, how much influence (i.e., statistical weight in a formula) does the difference between a player’s BA and his replacement’s BA (his real replacement or the hypothetical replacement used in computing his WAR) have in generating team wins? No one knows, because the statisticians do not use wins as their criterion (i.e., dependent variable). To my knowledge they have not even taken all their predictor variables and regressed them against a criterion of runs scored minus runs allowed to determine the relative contribution (or lack of it) of each predictor. The variables used to calculate a player’s WAR have not been shown to have any relationship to runs, much less to wins.

A second problem is the *replacement* player used to calculate the WAR. Neither rWAR nor fWAR uses a player’s actual replacement. In 1924 the Yankee first baseman, Wally Pipp, had a WAR of 3.1. His actual replacement, who took over the next year, was Lou Gehrig. Pipp’s WAR of 3.1 does *not* mean that the Yankees would have won three more games that year with Pipp in the line-up than with Gehrig. Rather, it means that if some *hypothetical* player had been Pipp’s substitute, the Yankees would have had three fewer wins that season.

Who is this hypothetical player and why is a non-existent entity used to predict real baseball outcomes? This hypothetical player is a statistical avatar. Originally “he” was given league average numbers on the measures used to compute the WAR. More recently, “his” measures are in the league low average range. The rationale for using a chimera is to create a WAR that is context free. A hypothetical replacement player provides a common metric: each player is compared to the same standard substitute.

The problem with this approach is twofold. First, many measures used to generate a player’s WAR are context dependent, especially his hitting prowess (BA, RBIs, etc.). Thus, a context free non-existent replacement’s WAR is compared to a context dependent real player’s WAR. Second, when making trades, offering big contracts, and deciding line-ups, a team needs to know specifically how much a player will help it win compared with an alternative real player on its roster. The current WAR gives this appearance, but it is an illusion.

Why don’t statisticians use each player’s actual replacement to compute his WAR? Doing so creates a context dependent statistic, and statisticians want to avoid this. Also, there can be disagreement about who the replacement is. Often more than one replacement is used, as when a right-handed and a left-handed batter are platooned. The WAR of these replacements would need to be contrasted with the regular player’s WAR. But computing a replacement’s WAR would necessitate knowing his own replacement’s WAR. At this point the WAR goes from a questionable measure to a comical one.

Is there a way out of this morass? Needed is a measure of wins (not runs) when a player is in the line-up and when he is not. While not context free, this measure is based on what actually occurs in baseball.

**The jk-40WAR**

I computed a WAR by taking a team’s percentage of wins when a player is in the line-up and when he is not. Thus, actual wins are the criterion, measured against one’s actual replacement. I immodestly christened this statistic the jk-40WAR (i.e., Joel Kupfersmid, 40-Game Wins Above Replacement). To equalize comparisons between players (i.e., create a common metric), the jk-40WAR is based on missing 40 games. Forty games were selected as long enough for a player’s absence to be highly influential in a team’s wins or losses.

The jk-40WAR involves only players who (1) were position players, (2) missed between 25 and 75 games, (3) had at least 300 PAs, and (4) played on one team for the 2015 season. These requirements filter out most part-time players, yet allow enough missed games and enough games played to compute an accurate assessment of the differential percentage of wins in and out of the line-up for each player.
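The four eligibility rules can be expressed as a simple filter. This is a minimal sketch, not the procedure actually used to build Table 1: the `PlayerSeason` record, its field names, and the sample players are all hypothetical, and treating the 25 and 75 game bounds as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class PlayerSeason:
    """Hypothetical record of one player's 2015 season."""
    name: str
    is_position_player: bool
    games_missed: int
    plate_appearances: int
    teams_played_for: int


def is_eligible(p: PlayerSeason) -> bool:
    """Apply the four jk-40WAR inclusion rules (bounds assumed inclusive)."""
    return (
        p.is_position_player
        and 25 <= p.games_missed <= 75
        and p.plate_appearances >= 300
        and p.teams_played_for == 1
    )


if __name__ == "__main__":
    # Made-up players, for illustration only.
    sample = [
        PlayerSeason("Player A", True, 40, 450, 1),   # meets every rule
        PlayerSeason("Player B", True, 10, 600, 1),   # missed too few games
        PlayerSeason("Player C", True, 50, 350, 2),   # changed teams
    ]
    print([p.name for p in sample if is_eligible(p)])
```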

The jk-40WAR is team specific. Good teams win in the absence of one quality player. Bad teams lose with a quality player in their line up. This statistic is specific to the substitute player available. Two players of equal ability may have markedly different jk-40WARs if their substitutes have different abilities. Likewise, a good player can have a deflated jk-40WAR if he is out of the line up when another good player, including pitchers, is also injured. All these limitations are what actually occur in real baseball.

Table 1 presents the jk-40WAR for all 121 eligible players in 2015. Using Lucas Duda as an example, the percentage of team wins when he was out of the line-up (62.96%) is subtracted from the percentage of games won when he was in the line-up (54.07%). This differential, -8.89%, is multiplied by 162 games (162 * -.0889 = -14.40) to produce the number of differential wins or losses for the season. This product is then divided by 4.05 because 40 games times 4.05 equals 162 games. The result is -3.56, indicating that when Duda starts his team wins approximately 3.56 fewer games over the course of 40 games than it does with his replacement(s). As a warning, the jk-40WARs of many players are similar to Duda’s, and some stretch baseball statistical credulity.
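The arithmetic of the Duda example can be reproduced with a short Python sketch. The function and constant names are hypothetical; the inputs are the two winning percentages quoted in the text, and the steps follow the description above (differential, times 162, divided by 4.05).

```python
SEASON_GAMES = 162
WINDOW_GAMES = 40  # the "40" in jk-40WAR


def jk40_war(win_pct_in: float, win_pct_out: float) -> float:
    """Compute the jk-40WAR from winning percentages given as 54.07, not 0.5407.

    win_pct_in:  team winning percentage with the player in the line-up
    win_pct_out: team winning percentage with the player out of the line-up
    """
    differential = (win_pct_in - win_pct_out) / 100.0
    season_wins = differential * SEASON_GAMES          # full-season differential wins
    return season_wins / (SEASON_GAMES / WINDOW_GAMES)  # divide by 4.05


if __name__ == "__main__":
    # Lucas Duda, 2015: 54.07% in the line-up, 62.96% out of it.
    print(round(jk40_war(54.07, 62.96), 2))  # -3.56
```

Because multiplying by 162 and dividing by 4.05 cancel to a factor of 40, the statistic is simply the winning-percentage differential applied to 40 games.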

Table 1: jk-40WAR & rWAR for Position Players (2015 Season)

| Player | Win % played minus win % missed | jk-40WAR | rWAR |
|---|---|---|---|
| Duda, Lucas | -8.89% | -3.56 | 3.00 |
| Murphy, Daniel | -8.65% | -3.46 | 1.40 |
| Flores, Wilmer | -9.99% | -3.99 | 0.80 |
| Cuddyer, Michael | -9.23% | -3.69 | -3.69 |
| Tejada, Ruben | -7.42% | 2.97 | -0.10 |
| Ramos, Wilson | 9.01% | 3.60 | 0.80 |
| Zimmerman, Ryan | -1.71% | -0.68 | -0.10 |
| Espinosa, Danny | -1.43% | -0.57 | 1.80 |
| Werth, Jason | -0.21% | -0.09 | -1.60 |
| Robinson, Clint | -9.13% | -3.65 | 0.20 |
| Realmuto, J.T. | 6.35% | 2.54 | 2.20 |
| Bour, Justin | 8.29% | 3.32 | 0.30 |
| Hechavarria, Adeiny | -11.59% | -4.63 | 2.10 |
| Prado, Martin | 5.57% | 2.23 | 3.10 |
| Yelich, Christian | -15.08% | -6.03 | 3.50 |
| Ozuna, Marcell | 10.44% | 4.18 | 0.40 |
| Pierzynski, A.J. | -5.07% | -2.03 | 1.60 |
| Freeman, Freddie | -2.50% | -1.00 | 3.40 |
| Howard, Ryan | -20.37% | -8.15 | -1.40 |
| Hernandez, Cesar | -12.85% | -5.14 | 0.90 |
| Asche, Cody | -27.98% | -11.19 | -1.10 |
| Francoeur, Jeff | 13.35% | -5.34 | -1.10 |
| Molina, Yadier | 18.55% | 7.42 | 1.40 |
| Grichuk, Randal | -4.21% | -1.69 | 3.20 |
| Cervelli, Francisco | -14.18% | -5.67 | 3.10 |
| Mercer, Jordy | 2.51% | 1.00 | 0.30 |
| Kang, Jung Ho | -7.94% | -3.17 | 4.00 |
| Harrison, Josh | -5.81% | -2.32 | 1.80 |
| Montero, Miguel | -10.71% | -4.28 | 1.80 |
| Soler, Jorge | -1.25% | -0.50 | -0.10 |
| Lucroy, Jonathan | 10.04% | 4.02 | 1.00 |
| Gennett, Scooter | -2.52% | -1.01 | 0.60 |
| Davis, Khris | -15.64% | -6.26 | 0.80 |
| Pena, Brayan | -16.67% | -6.67 | 0.40 |
| Suarez, Eugenio | -8.53% | -3.41 | 0.80 |
| Hamilton, Billy | 20.61% | 8.25 | 1.00 |
| Grandal, Yasmani | 12.91% | -5.17 | 1.40 |
| Kendrick, Howie | -10.60% | -4.24 | 1.10 |
| Belt, Brandon | -0.18% | -0.07 | 3.90 |
| Panik, Joe | 8.23% | 3.29 | 3.30 |
| Aoki, Nori | -13.18% | -5.27 | 1.00 |
| Pagan, Angel | 0.16% | 0.06 | -1.90 |
| Blanco, Gregor | -4.88% | -1.95 | 1.10 |
| Ahmed, Nick | 3.57% | 1.43 | 2.50 |
| Lamb, Jake | -6.00% | -2.40 | 1.70 |
| Inciarte, Ender | 2.58% | 1.03 | 5.30 |
| Tomas, Yasmany | -11.06% | -4.42 | -1.30 |
| Hill, Aaron | 1.31% | 0.52 | -0.30 |
| Alonso, Yonder | 15.86% | 6.35 | 1.80 |
| Gyorko, Jedd | 9.42% | 3.77 | 0.50 |
| Amarista, Alexi | -5.93% | -2.37 | -0.50 |
| Spangenberg, Cory | -3.70% | -1.48 | 2.10 |
| Hundley, Nick | -19.29% | -7.71 | 1.80 |
| Paulsen, Ben | 7.01% | 2.80 | 0.80 |
| Martin, Russell | 24.84% | 9.94 | 3.30 |
| Smoak, Justin | 50.00% | 20.00 | 1.30 |
| Goins, Ryan | 9.38% | 3.75 | 2.70 |
| Colabello, Chris | -13.10% | -5.24 | 0.70 |
| McCann, Brian | -20.00% | -8.00 | 2.80 |
| Teixeira, Mark | 9.70% | 3.88 | 3.80 |
| Drew, Stephen | 14.55% | 5.82 | 0.40 |
| Ellsbury, Jacoby | 3.97% | 1.59 | 1.90 |
| Beltran, Carlos | -14.39% | -5.76 | 1.00 |
| Joseph, Caleb | -10.45% | -4.18 | 2.20 |
| Hardy, J.J. | 23.68% | 9.47 | 0.00 |
| Pearce, Steve | -2.52% | -1.01 | -0.40 |
| Jones, Adam | -21.28% | -8.51 | 3.20 |
| Paredes, Jimmy | 6.50% | 2.60 | 0.20 |
| Flaherty, Ryan | 8.78% | 3.51 | -0.40 |
| Rivera, Rene | -6.57% | -2.63 | -2.00 |
| Loney, James | -0.96% | -0.38 | -0.60 |
| Souza, Steven | 7.59% | 3.03 | 1.00 |
| Guyer, Brandon | 10.39% | 4.15 | 1.90 |
| Pedroia, Dustin | -9.54% | -3.81 | 2.00 |
| Sandoval, Pablo | -20.24% | -8.10 | -0.90 |
| Ramirez, Hanley | -12.33% | -4.93 | -1.30 |
| Holt, Brock | 18.60% | 7.44 | 2.60 |
| Infante, Omar | 0.98% | 0.30 | -0.80 |
| Gordon, Alex | -10.71% | -4.28 | 2.80 |
| Rios, Alex | 11.98% | 4.79 | -1.10 |
| Suzuki, Kurt | -0.47% | -0.19 | 0.40 |
| Escobar, Eduardo | -14.8% | -5.93 | 2.00 |
| Rosario, Eddie | -4.96% | 1.98 | 2.20 |
| Hicks, Aaron | -9.50% | -3.80 | 1.30 |
| Gomes, Yan | 14.00% | 5.60 | 0.80 |
| Lindor, Francisco | 3.90% | 1.56 | 4.60 |
| Brantley, Michael | -2.36% | -0.95 | 3.40 |
| Chisenhall, Lonnie | 10.92% | 4.37 | 2.30 |
| Ramirez, Jose | 14.13% | 5.65 | 1.40 |
| Aviles, Mike | -12.91% | -5.17 | -1.30 |
| Flowers, Tyler | -7.36% | -2.94 | 0.80 |
| Sanchez, Carlos | 2.26% | 0.90 | 0.70 |
| LaRoche, Adam | -9.40% | -3.76 | -0.80 |
| McCann, James | 5.70% | 2.28 | 0.90 |
| Cabrera, Miguel | 8.36% | 3.35 | 5.20 |
| Iglesias, Jose | 3.81% | 1.52 | 1.50 |
| Martinez, Victor | 0.60% | 0.24 | -1.60 |
| Davis, Rajai | 9.32% | 3.73 | 1.60 |
| Moreland, Mitch | 9.39% | 3.76 | 2.20 |
| Odor, Rougned | -13.45% | -5.38 | 1.90 |
| DeShields, Delino | 20.48% | 8.19 | 1.10 |
| Martin, Leonys | -9.18% | -3.67 | 1.10 |
| Castro, Jason | 12.86% | 5.15 | 1.30 |
| Carter, Chris | 9.58% | 3.83 | -0.10 |
| Correa, Carlos | -9.24% | -3.69 | 4.10 |
| Valbuena, Luis | -0.30% | -0.12 | 2.10 |
| Tucker, Preston | -15.56% | 6.22 | 0.30 |
| Marisnick, Jake | 14.26% | 5.70 | 2.20 |
| Springer, George | 7.55% | 3.02 | 3.80 |
| Rasmus, Colby | 1.28% | 0.51 | 2.60 |
| Gonzalez, Marwin | 7.38% | 2.95 | 1.80 |
| Iannetta, Chris | -18.29% | -7.32 | 0.70 |
| Giavotella, Johnny | 1.20% | 0.48 | 1.00 |
| Freese, David | 24.53% | 9.81 | 2.30 |
| Cron, C.J. | 2.08% | 0.83 | 0.20 |
| Zunino, Mike | 1.32% | 0.53 | -0.70 |
| Smith, Seth | 23.81% | 9.52 | 1.90 |
| Vogt, Stephen | -14.14% | -5.66 | 3.50 |
| Canha, Mark | -12.10% | -4.84 | 1.10 |
| Sogard, Eric | -4.40% | -1.76 | 0.80 |
| Fuld, Sam | 27.74% | 11.10 | 0.90 |
| Burns, Billy | 1.86% | 0.74 | 2.80 |

Several statistics in Table 1 stand out. First is the share of negative jk-40WARs, an amazing 54%! If this reflects reality, then the majority of substitutes increase their team’s wins over the regulars they replace! For the rWAR only 18% of hypothetical replacements outperform starters.

Second is the glaring difference between the games won or lost according to the jk-40WAR and the rWAR’s prediction. For the rWARs only 13% of starters increased their team’s wins by 3 or more games. None reduced their team’s wins by 3 or more games. In stark contrast, 62% of the jk-40WARs were greater (in the positive or negative direction) than 3 games.

The rWAR suggests most starters have a small effect on the wins or losses of their team. The jk-40WAR suggests they have a dramatic impact. Some players’ jk-40WARs stretch statistical credulity. At the extreme negative end is Cody Asche with a jk-40WAR of -11.19 (rWAR = -1.1). At the extreme positive end is Justin Smoak with a jk-40WAR of 20 (rWAR = 1.3).

The rWAR predicts team wins for the *entire* season. The jk-40WAR is for a quarter of the season. Predicting the whole season involves multiplying one’s jk-40WAR by 4.05, as noted in my example of Lucas Duda. No one believes that if Asche’s replacement played the entire season his team would have won 45 more games, or that Smoak’s replacement would lead to 81 additional losses. As no surprise, the correlation between the rWAR and jk-40WAR is almost zero (*r* = 0.12).
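Both the season-length scaling and the correlation can be sketched in Python. The function names are hypothetical, and the sample pairs are only five rows copied from Table 1 (Duda, Murphy, Flores, Ramos, Realmuto), so the coefficient this prints is for that small subset and is not expected to match the value reported for all players.

```python
from math import sqrt


def to_full_season(jk40_war: float, factor: float = 4.05) -> float:
    """Scale a 40-game jk-40WAR to a 162-game season (40 * 4.05 = 162)."""
    return jk40_war * factor


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Extremes from Table 1, scaled to a full season.
    print(round(to_full_season(-11.19), 1))  # Asche
    print(round(to_full_season(20.00), 1))   # Smoak

    # Five (jk-40WAR, rWAR) pairs from Table 1; a subset, for illustration only.
    jk40 = [-3.56, -3.46, -3.99, 3.60, 2.54]
    rwar = [3.00, 1.40, 0.80, 0.80, 2.20]
    print(pearson_r(jk40, rwar))
```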

**The Sad Conclusion**

There are convincing arguments challenging the utility of the current WAR statistic. Most measures are context dependent and influenced by a player’s teammates’ abilities, the ballparks played in, and the manager’s preferred strategies. Several variables are judgment calls by statisticians. These are not objective measures, as the Soviet Union’s judges in the Olympics demonstrated every four years.

Of greater significance, the current WAR does not use wins as its criterion. Runs are used as a proxy. Even if runs correlated highly with wins, there are no data showing how each variable used to compute the WAR contributes (i.e., is statistically weighted) to run production or opponent run reduction. As it is, the correlation between runs scored minus runs allowed and a team’s winning percentage for the 2015 season is a modest *r* = .44.

Finally, using a hypothetical, statistically created, replacement is not baseball reality.

The jk-40WAR is presented as an alternative. This statistic is reality based, using actual differences in winning percentage when a starter is in the line-up and when he is not. The jk-40WAR is team and replacement specific. When this WAR is compared to the rWAR, there are considerable differences. This is no surprise since the jk-40WAR uses actual substitutes whereas the rWAR uses a hypothetical one.

The jk-40WAR is easy to compute, but its values are often unbelievable. For many players this WAR suggests they alter their team’s winning or losing by 15 or more games per season. In rare instances this statistic suggests a player’s impact is 25 or more games.

The concept of a WAR statistic has merit. The sad conclusion, however, is that currently no one has created a WAR that is reality based, believable, and useful. I recommend elimination of this statistic until one is invented that shares the essential features of the jk-40WAR but produces results realistically reflecting baseball outcomes.