A gray-bearded statistician yells, “You kids get off my lawn.”

Two comments on a blog entry by Andrew Gelman illustrate my uncertainty about machine learning (ML) and other non-statistical techniques. One comment quoted an R developer:

To paraphrase provocatively, ‘machine learning is statistics minus any checking of models and assumptions’. — Brian D. Ripley (about the difference between machine learning and statistics) useR! 2004, Vienna (May 2004)

To which Gelman replied:

In that case, maybe we should get rid of checking of models and assumptions more often. Then maybe we’d be able to solve some of the problems that the machine learning people can solve but we can’t!

I’m a conventionally trained statistician who learned to use “data” as the plural of “datum”. I learned from undergrad onward that models we wished to use to say anything about the world beyond our own wee datasets were preferably parametric; that they relied on either random sampling or data at least reasonably representative of the population we wished to draw inference about; that model selection would be very deliberate, including only covariates that were statistically significant or that reduced confounding in others; and that the analysis always went back to the research question, a specific research question formulated before any analysis began. Searching for correlations between every pairwise combination of variables in a dataset (which may shed no light on the research question) is, in my field, commonly referred to as going on a “fishing expedition”, or “Type I Error generation”, since testing that many hypotheses all but guarantees some spurious positives. Throwing all available data into a completely saturated multivariate model, even when paring it back to something more parsimonious, was earthily described by my undergraduate stats professor as, I kid you not, “throwing shit at a wall and seeing what sticks”.
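
To make the fishing-expedition point concrete, here is a minimal simulation sketch, in Python rather than anything from the original post, with numpy and scipy and sample sizes I picked arbitrarily for illustration: generate twenty variables of pure noise, test every pairwise correlation at the usual 0.05 level, and watch “significant” findings appear anyway.

    # Fishing expedition on pure noise: test every pairwise correlation
    # among variables that, by construction, have no real relationships.
    import numpy as np
    from scipy.stats import pearsonr
    from itertools import combinations

    rng = np.random.default_rng(42)
    n_subjects, n_vars = 100, 20                   # arbitrary sizes, for illustration only
    data = rng.normal(size=(n_subjects, n_vars))   # independent noise, no true correlations

    n_pairs = n_vars * (n_vars - 1) // 2           # 190 pairwise tests
    false_positives = 0
    for i, j in combinations(range(n_vars), 2):
        r, p = pearsonr(data[:, i], data[:, j])    # test each pair at alpha = 0.05
        if p < 0.05:
            false_positives += 1
            print(f"var{i} ~ var{j}: r = {r:+.2f}, p = {p:.3f}")

    print(f"{false_positives} 'significant' pairs out of {n_pairs}, from pure noise")

With 190 pairs and a 5% false-positive rate per test, you should expect roughly nine or ten “significant” hits on an average run, none of which reflect any real relationship in the data.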

I abhor dogma, and I recognize the value of heresy, so I want neither to cleave to my prior learning nor to accept the validity of a new approach uncritically.

So, I’m confronted by these new methods and I do not yet have a sense for whether they are valid. Was this stuff cooked up by dude-bros who scraped up a shitload of data, ran code on it, and got spurious results that nonetheless made someone a killing on the stock market? And are others, with the same indifference to sound analytical methods, now treating that anecdote as representative data and seeking to do the same? Are we looking at cargo cult programming here?

OK, probably not, so cool it, stats-boy. These techniques have become widely used, and to paraphrase Gertrude Stein, there is probably some “there” there. Of course, methamphetamine and Uggs are also widely used in this world — the horror, the horror. But where is their use indicated? One instructor urged me to turn away from the dusty old concepts of academic statistical thinking, but I’m of a mind more to explore these new approaches with a critical eye and to use them in an informed manner. A colleague on the local bioinformatics faculty confirmed that genetics studies are prime territory for this, since apart from the ACGT of the sequences themselves there is a paucity of data about the genes.

The first step here is reading up on the matter. And maybe ML et al. aren’t as bereft of statistical soundness as I fear they are.
