Using, Misusing, and Abusing Statistics — Baseball Edition (Again)

Henry Kim
Jul 21, 2017

Beyond the Box Score has a fascinating story on Ricky Nolasco, the owner of the biggest gap between ERA and FIP in MLB history, as far back as we have the data. It is worth noting, of course, that it was not long ago that the Angels had another pitcher with an odd ERA gap, this one running in the opposite direction: Jered Weaver. While Nolasco’s ERA-FIP gap suggests he has often been a much better pitcher than his ERA alone would indicate, Weaver generally posted a very good ERA but a mediocre FIP.

Now a bit of background: many people who consider themselves sabermetrically inclined (ahem) but are not very learned in the inner workings of how the statistics are calculated all too often engage in trash talk about which stats are “better,” which, in turn, is tied to which players are over- and under-rated. Weaver, for example, was the topic of this conversation a lot, especially since ERA has increasingly come to be considered a less reliable statistic (although not as bad as W-L records). But therein lies the rub: in order to meaningfully claim that one statistic is “better” or “worse” than another, it is necessary to pin down what it is that each of them is measuring.

ERA and FIP might superficially look similar in that they both measure “pitching performance.” But “pitching performance” is an amorphous, multidimensional thing, not something that can be mapped onto a single real number without losing a lot of information in the process. What George Box meant when he famously pronounced that all models are wrong, but some are useful, was exactly this: statistics truncate reality, dropping much of the information it contains in the hope of focusing on the aspects that are more “useful” than others. Deciding which aspects of the available information are “useful” comes from a substantive understanding of that reality: the assumptions, “theories,” and other structural superstructure that go into the formulas behind those “statistics.” The advantage conferred by these assumptions is that they reduce the amount of data, and the complexity of processing it, required to arrive at a conclusion. The cost is that the conclusion may be wrong when the assumptions do not fit the particular circumstances. In general, useful models are “good enough” to match reality “most” of the time. But “most” is not all, and that is why we need to pay close attention to the “errors” and “outliers” where reality diverges from the model, or, as is the case here, where different models supposedly capturing the same concept diverge from each other, since the “reality” is too amorphous to be meaningfully captured in toto.

(NB: This is where “data science” does both better and worse than traditional “statistics.” With more data and more computing power, the assumptions needed to arrive at a useful conclusion are less necessary today than in the days when traditional statistics were being formulated. The rub, however, is that not all situations yield plentiful data; indeed, the more mundane and uninteresting the situation, the more plentiful the data will be. The mistake that atheoretical “data science” is prone to is mistaking the unusual and extraordinary for the mundane, because most of the data comes from the latter. Recall that, in the 2016 presidential election, some savvier analysts (e.g. Nate Silver) were beginning to notice that something odd was going on for which sufficient data was lacking to generate a quantitative prediction, while the more naive were insisting that their mountains of data ensured that they knew what was going to happen. This, of course, had happened before: in 1936, The Literary Digest had tons of data and a methodology that had worked fine in several previous elections, while George Gallup had very little data, thousands of observations rather than millions, but a sense that something was wrong. Guess who had the right sense of what was going to happen?)

So let’s look at ERA and FIP. ERA is simply the number of “earned” runs given up by a pitcher per nine innings. FIP takes into account only the so-called “three true outcomes”: home runs, walks plus hit-by-pitches, and strikeouts. So FIP is missing something big: hits that are not home runs. You can give up hits that lead to runs without the ball ever leaving the yard, which inflates the number of runs you allow but not any of the components that enter into FIP. A pitcher who gives up contact (i.e. fewer strikeouts) but not many hits will be treated more harshly by FIP than by ERA. As the article notes, the conceit behind FIP (and the related BABIP) is that, as long as the ball does not leave the yard, the fielders can do “something” to prevent batted balls from becoming hits. Perhaps, as the article observes, better fielders would help, as would, I imagine, better positioning of those fielders. But not all batted balls are the same: weak grounders and pop flies are not the same as hard-hit grounders and line drives, even if none of them leave the yard. Hard-hit balls can still be turned into outs, but with much lower probability than weakly hit ones, with the relative rates again depending on both the fielders’ skills and the defensive strategy. So the FIP perspective is not necessarily wrong, but it is somewhat misleading and certainly incomplete. Perhaps the better reading is that a larger FIP-ERA gap flags a pitcher who may be improvable, but one who needs further investigation, with more data, which is increasingly possible thanks to the likes of Statcast.
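To make the contrast concrete, here is a minimal sketch in Python of how the two statistics are computed. The FIP constant is assumed here to be about 3.10; in practice it is recalculated each season to put FIP on the same scale as league ERA. The stat lines are made up for illustration, not anyone’s actual season.

```python
def era(earned_runs, innings_pitched):
    """Earned runs allowed per nine innings."""
    return 9 * earned_runs / innings_pitched

def fip(hr, bb, hbp, k, innings_pitched, constant=3.10):
    """Fielding Independent Pitching: only HR, BB+HBP, and K enter.
    Non-HR hits never appear in the formula; the constant (~3.10,
    recalculated per season) rescales FIP onto the ERA scale."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / innings_pitched + constant

# A contact pitcher who allows few hits: a good ERA, but few strikeouts
# drag his FIP down much less than his results would suggest.
print(era(earned_runs=70, innings_pitched=200))              # 3.15
print(fip(hr=25, bb=50, hbp=5, k=110, innings_pitched=200))  # 4.45
```

Notice that the hypothetical pitcher’s 130 non-HR hits (or 30, for that matter) would change the ERA line but leave the FIP line completely untouched, which is exactly the gap the article is about.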

This, I think, is the significant point about reducing reality to “statistics.” A complex reality is reduced to a set of numbers by cutting out the features thought to be “less important.” If different statistics purporting to measure the same general concept agree, there is nothing to see there. If they do not, they are cutting out different features that may or may not be important, and whatever those features are, they require further inquiry, in the course of which we may have to reconsider what we thought was “more” or “less important,” and which will certainly call for yet more data along the way. A simple version of that screening step is sketched below. As Tolstoy might have said were he a statistician, all mundane observations are alike, but every interesting one is interesting in its own way. How we lie with statistics is by padding our data with the mundane observations and insisting that there is no interesting problem worth investigating, because all the interesting cases are “errors.” In such cases, all our “statistics” are “true,” all right, but still incomplete and misleading.
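As a hypothetical sketch of the screening step described above: compute the gap between the two statistics and flag the outliers for closer study, rather than averaging the disagreement away. The pitcher lines and the 0.75-run cutoff are invented for illustration.

```python
# Flag pitchers whose ERA and FIP disagree badly, treating large gaps
# as prompts for further inquiry rather than as noise.
seasons = [
    {"name": "Pitcher A", "era": 3.20, "fip": 3.25},  # stats agree: nothing to see
    {"name": "Pitcher B", "era": 3.10, "fip": 4.40},  # Weaver-like: ERA beats FIP
    {"name": "Pitcher C", "era": 5.00, "fip": 3.90},  # Nolasco-like: FIP beats ERA
]

THRESHOLD = 0.75  # arbitrary cutoff for "worth a closer look"

for s in seasons:
    gap = s["era"] - s["fip"]
    if abs(gap) > THRESHOLD:
        print(f'{s["name"]}: ERA-FIP gap of {gap:+.2f} -- investigate '
              f'(batted-ball quality, defense, sequencing) before trusting either number.')
```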
