Learning from where the Models are Wrong.

I couldn’t sleep after writing up my rant yesterday about inferring noise distributions. There is something there that I hadn’t elaborated well enough in the original post, and it needed a bit more thought and organization. Let’s think through the example that I had borrowed for that post from the lecture notes about GBM:

The model is being trained on training data where the “right answer” is indubitably G. In an ideal universe, the model would return G with p = 1 exactly, but it does not. While the model returns a full probability distribution with nonzero probability for every letter of the alphabet, let’s be lazy and assume that P(G|training, G) = .7 and P(Q|training, G) = .3, where “training” means some rounds of training after which gradient descent settled into a local minimum. Likewise, we might have found that P(G|training, Q) = .2 and P(Q|training, Q) = .8, with P(G|training, x) = 0 for every letter x that is neither G nor Q.
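To keep the bookkeeping straight, here is a minimal sketch in Python of what the training run hands us under the lazy two-letter simplification above: a table of P(prediction|true letter). The numbers are the ones assumed above, not output from any real model.

```python
# Conditional probabilities estimated from the training set:
# likelihood[true_letter][predicted_letter] = P(model says predicted | letter is true)
# The numbers are the illustrative ones assumed in the text, not real model output.
likelihood = {
    "G": {"G": 0.7, "Q": 0.3},  # a true G is read as G 70% of the time, as Q 30%
    "Q": {"G": 0.2, "Q": 0.8},  # a true Q is read as G 20% of the time, as Q 80%
    # every other letter: assumed never read as G or Q, so omitted here
}
```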

When the model is used to predict from real data, the fact that we have detected a G from an unknown letter tells us rather little about what that letter actually is. The training set told us that P(see G|letter = G) might be .7 (with the big assumption that the model did not overfit on the training set, a major potential danger), but, by Bayes’ rule, P(letter = G|see G) = P(see G|letter = G)P(letter = G) / [P(see G|letter = G)P(letter = G) + P(see G|letter = Q)P(letter = Q)], with only G and Q in the denominator because, by assumption, no other letter is ever read as a G. A lot of this information is, for now, missing: not knowing the distribution of the application set, for example, we don’t know P(Q) vs. P(G). If the document is in English, Q is relatively rare. If it is in Spanish, that assumption would not hold. Put differently, the fact that the model correctly classifies G as G and does not classify Q as G tells us relatively little about how inaccurate the end result will be. If the document is in Spanish and Q’s are quite common, the proportion of letters identified as G’s that are actually Q’s will be quite high.
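The inversion itself is simple once a prior over letters is supplied, and only the application context can supply it. A sketch, reusing the likelihoods assumed above with made-up letter frequencies standing in for “English” and “Spanish” (the priors are illustrative, not measured):

```python
def posterior_g_given_see_g(p_g: float, p_q: float) -> float:
    """P(letter = G | model says G), under the simplification that only
    true G's and Q's are ever read as G."""
    see_g_given_g = 0.7   # from the training set, as assumed above
    see_g_given_q = 0.2
    numerator = see_g_given_g * p_g
    denominator = see_g_given_g * p_g + see_g_given_q * p_q
    return numerator / denominator

# Illustrative (not measured) letter frequencies for the two contexts.
print(posterior_g_given_see_g(p_g=0.020, p_q=0.001))  # "English-like": Q rare  -> posterior ~0.99
print(posterior_g_given_see_g(p_g=0.010, p_q=0.009))  # "Spanish-like": Q common -> posterior ~0.80
```

The same model, with the same training-set performance, gives very different answers to the question we actually care about, depending entirely on the prior.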

This is hardly unusual: it shows up in every introductory statistics text, when Bayes’ formula is first mentioned. The problem it raises is somewhat analogous to the logic behind the p-value trap (or is perhaps even worse). Correctly used, a p-value is P(observed outcome, or one at least as extreme|null hypothesis). The common mistake with p-values has always been to read it as P(null hypothesis|observed outcome). As with p-values, the problem arises when your model is wrong but still produces an outcome consistent with what you would expect on the assumption that your model is right. Since you are training your model on the premise that you would get the right answer to the right question, and not the right answer to the wrong question, so to speak, the process is open to the same pitfall.

One would think that this can be implemented relatively easily, if the users of ML techniques are cognizant of the problem. The training set already provides P(G|x) for every letter x other than G, for example. For every letter, the probability that some other letter was read as this letter, and that this letter was read as something else, can be calculated and represented as a 26x26 matrix (a confusion matrix, in effect). (If the model is being used in a more complex context, however, this could be a dramatically more complicated proposition. Identifying letters is a very simple chore. How would we be able to catalogue all the mistakes in a driving AI? Conceptually, Deep Learning is NOT that complicated. On the implementation end, the number of moving parts in all the different tools is dizzyingly large and, at least as I see them, confusingly similar yet variant: close enough conceptually, but different enough in the particulars that they are not completely substitutable for one another in practical usage. It is not difficult to see how trying to systematically account for how wrong, where, when, and why different tools and their implementations are, in any reasonable framework, could drive people nuts, without much obvious practical use on the other end. But I am also convinced that this, in some form, is an absolutely essential step in the medium to long run, if data analytics is not to go the way of astrology. And mind you that the “good part” of astrology did become something very important, even if rather less “useful” in the short run, namely in the forms of astronomy and physics.) While it is tempting to try to reduce this further to a single metric of “predictive performance,” that would be a mistake. Mistaking a Q for a G is, potentially, a relatively small mistake in English. It can be a rather major mistake in Spanish, due to the difference in the frequency of Q in the two contexts. The usefulness of a model is context dependent.
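For the simple letter-reading case, the “26x26 matrix” is just a row-normalized confusion matrix estimated on held-out labeled data. A hedged sketch, where y_true and y_pred are hypothetical arrays of true and predicted letters from a validation set:

```python
import string

import numpy as np

LETTERS = list(string.ascii_uppercase)  # 'A'..'Z'
INDEX = {c: i for i, c in enumerate(LETTERS)}

def conditional_confusion(y_true, y_pred):
    """Return a 26x26 matrix M with M[i, j] ~= P(predicted letter j | true letter i),
    estimated from held-out labeled data rather than the training set itself."""
    counts = np.zeros((26, 26))
    for t, p in zip(y_true, y_pred):
        counts[INDEX[t], INDEX[p]] += 1
    row_totals = counts.sum(axis=1, keepdims=True)
    # Avoid division by zero for letters absent from the held-out set.
    return np.divide(counts, row_totals, out=np.zeros_like(counts), where=row_totals > 0)

# Hypothetical usage: y_true and y_pred would come from a labeled validation set.
# M = conditional_confusion(y_true, y_pred)
# M[INDEX["Q"], INDEX["G"]] is then the estimated P(see G | letter = Q).
```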

If ML models are to be used as tools, one might as well think like a good engineer: every engineering product makes compromises, trading off deficiencies in some areas for superior performance in others. A good user of an engineering product should be aware of the compromises made (where the model is wrong) as well as where the model does well, and ask whether the model is suitable for the problem at hand, given the mistakes it is likely to make. A friend in mechanical engineering once told me, as a half-serious jest, that the biggest problem in engineering is that things will go wrong, and the job of engineers is to anticipate them, but things always find ever more creative ways of going wrong. This is a valuable attitude to have.

PS. An observant reader might point out that what I am really saying is that loss functions, rather than algorithms, need to be specified appropriately. That is exactly the point. But the specifics of the loss function follow from what problems you are trying to address, in what context, and what cost you are willing to bear. These are not problems that algorithms can be counted on to solve by themselves, though algorithms can be brought in as part of the solution. I myself became interested in ML as a means of dealing with heteroskedasticity systematically in the linear regression context, but I am more interested in noise than in signal, and I have a complete understanding of the noise from simple models, whereas the noise from allegedly better predictive models is baffling.
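To make the loss-specification point slightly more concrete: one hedged sketch is to score the same model against an application-specific cost matrix rather than raw accuracy. The costs and letter priors below are purely illustrative, not taken from any real application.

```python
import numpy as np

# Rows = true letter, columns = predicted letter; restricted to {G, Q} for brevity.
# p_pred_given_true[i, j] = P(predict j | true i), the numbers assumed in the example above.
p_pred_given_true = np.array([[0.7, 0.3],   # true G
                              [0.2, 0.8]])  # true Q

# Hypothetical application-specific costs: mistaking Q for G is charged five times
# as heavily as mistaking G for Q; correct calls cost nothing.
cost = np.array([[0.0, 1.0],
                 [5.0, 0.0]])

def expected_cost(prior):
    """Expected per-letter cost under a given prior over the true letters."""
    return float(np.sum(prior[:, None] * p_pred_given_true * cost))

print(expected_cost(np.array([0.95, 0.05])))  # "Q rare" context:   ~0.34
print(expected_cost(np.array([0.55, 0.45])))  # "Q common" context: ~0.62, same model, higher cost
```

The point is not that this particular cost matrix is right, but that the same confusion probabilities translate into very different losses depending on the context the model is deployed in.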