Machine Learning, Statistics, and Data Science--A False Distinction
There has been a lot of talk about how “data science” is new and how “machine learning” is different from “statistics.” I am not so sure that this is the case. (For what it is worth, I am more of a statistician by training, although my background and inclination are those of an applied probability theorist.)
The fundamental statistical problem can be encapsulated in two statements: there are signal and noise, and we don’t know what the signal is; so we posit that, if we minimize the noise in the data we have (i.e., the sample), we get a good approximation of the signal.
Consider a simple example: 90% of the sample is 1 and 10% is 0. If we want to minimize the loss in the form of average squared error, we can easily prove that the best predictor is 0.9, the sample mean. But 0.9 is not necessarily the “best” predictor: if the loss we want to minimize is whether the prediction is exactly right or not, 0.9 is a terrible predictor, since it is never exactly right. The better predictor, at least based on the sample we have, is 1, which will be wrong only 10% of the time. In the absence of more information, these are the best predictions we can offer. Even then, what qualifies as the “best” prediction still depends on the loss function--which, presumably, depends on what you want to do with the fruits of your analysis.
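The example above can be checked numerically. What follows is only a minimal sketch in Python, assuming a ten-observation sample (nine 1s and one 0) and a small, arbitrary grid of candidate predictions; the function and variable names are made up for illustration.

```python
# Illustrative sample from the text: 90% ones, 10% zeros.
# (Ten observations is an assumption; any 9:1 split behaves the same way.)
sample = [1] * 9 + [0] * 1


def mean_squared_error(prediction, data):
    """Average squared distance between a constant prediction and the data."""
    return sum((x - prediction) ** 2 for x in data) / len(data)


def misclassification_rate(prediction, data):
    """Share of observations the constant prediction fails to match exactly."""
    return sum(1 for x in data if x != prediction) / len(data)


# A hypothetical grid of candidate constant predictions.
candidates = [0.0, 0.5, 0.9, 1.0]

for c in candidates:
    print(
        f"prediction={c:.1f}  "
        f"squared loss={mean_squared_error(c, sample):.3f}  "
        f"0-1 loss={misclassification_rate(c, sample):.1f}"
    )

# The "best" constant prediction depends on which loss we minimize.
best_squared = min(candidates, key=lambda c: mean_squared_error(c, sample))
best_zero_one = min(candidates, key=lambda c: misclassification_rate(c, sample))
print("best under squared loss:", best_squared)
print("best under 0-1 loss:", best_zero_one)
```

Under squared loss the sample mean wins among these candidates; under the exact-match loss the most common value wins, even though the data are the same.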
Note that there is an unspoken assumption here: the faith that the sample is a reasonable approximation of reality. Usually, we know better than that and try to “cheat” one way or another, e.g., with cross-validation, hierarchical models, weights, etc. But we are ultimately constrained by the sample we have--regardless of what we do. We know the sample. We don’t “really” know the “truth.”
So far, there is no difference whatsoever between statistics and machine learning (or, really, “statistical learning”). The alleged difference crops up in the form of the many formal assumptions of statistics, as many “data science” folks insist. To a large degree, this is fair: many statistical methods were developed in an era when computing power was limited and minimizing complex “loss functions” (defined broadly, to include such things as iteratively optimizing the weights in a neural network) via sophisticated numerical methods was not generally feasible. The assumptions were necessary to keep the math tractable. Since the models were necessarily cruder, “statistical” models don’t really predict as well as “machine learning” models. That is really just a matter of technology, however.
I suspect that working with models that were almost invariably wrong in some fashion gave “statisticians” a healthier appreciation of the noise, at least on average. We know silly assumptions were made. We know our predictions are wrong, but we also know how they came to be wrong and how wrong they are, conditional on the assumptions. But this too is a misleading and unfair statement: a good engineer does not achieve his or her goal via unconditional improvement on everything, but by knowingly trading gains in the areas most relevant to his or her purposes against losses elsewhere. The key is that he or she remains cognizant of what sacrifices have been made and of their implications for the predictions. This brings us back to the loss functions and how they are optimized--and how they shape the noise conditional on the prediction--i.e., if X is your best prediction, how wrong is it, when, and why?
I don’t think I can escape the fact that I am somewhat of a Luddite: for starters, I like old statistical methods. I’ve come to appreciate machine learning techniques, but they still “feel” a bit odd. The more sophisticated loss functions and optimization techniques, even if you know what they are doing at a very high level, do not make it easy to see what errors to expect from them. That requires knowing the actual moving parts and how they behave in contact with actual data, both in training and in real life. The more sophisticated they are, the more there is to make sense of, after all.
Perhaps this is the wrong attitude: car engines are, thanks to technological change, harder to make sense of today than in, say, 1930. Yet more people drive today despite far more widespread technological ignorance (at least relatively speaking), precisely because the same technological changes that made it harder to get a “feel” for what goes on inside the engine made driving itself easier. But this has also changed the market for car repair and maintenance: there are far fewer DIY mechanics, and properly trained technicians who can deal effectively with the new technologies matter much more. This transition was hard: a lot of mechanics in the 1990s could not deal well with either the new or the old. How is the transition going in the realm of data analytics, I wonder?