Machine learning’s triangle of error
By David Weinberger
AI Outside In is a column by PAIR’s writer-in-residence, David Weinberger, who offers his outsider perspective on key ideas in machine learning. His opinions are his own and do not necessarily reflect those of Google.
Machine learning’s superpower
When we humans argue over what’s fair, sometimes it’s about principles, sometimes about consequences, and sometimes about trade-offs. But machine learning systems can bring us to think about fairness — and many other things — in terms of three interrelated factors: two ways the machine learning (ML) can go wrong, and the most basic way of adjusting the balance between these potential errors. The type of error you’d prefer to live with depends entirely on the sort of fairness — defined mathematically — that you’re aiming your ML system at. But one way or another, you have to decide.
At their heart, many ML systems are classifiers. They ask: Should this photo go into the bucket of beach photos or not? Should this dark spot on a medical scan be classified as a fibrous growth or something else? Should this book go on the “Recommended for You” or “You’re Gonna Hate It” list? ML’s superpower is that it lets computers make these sorts of “decisions” based on what they’ve inferred from looking at thousands or even millions of examples that have already been reliably classified. From these examples they notice patterns that indicate which categories new inputs should be put into.
While this works better than almost anyone would expect — and a tremendous amount of research is devoted to fundamental improvements in classification algorithms — virtually every ML system that classifies inputs mis-classifies some of them. An image classifier might think that the photo of a desert is a photo of a beach. The cellphone you’re dictating into might insist that you said “Wreck a nice beach” instead of “Recognize speech.”
So, researchers and developers typically test and tune their ML systems by having them classify data that’s already been reliably tagged — the same sort of data these systems were trained on. In fact, it’s typical to hold back some of the inputs the system is being trained on so that it can test itself on data it hasn’t yet seen. Since the right classifications are known for the test inputs, the developers can quickly see how well the system has done.
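The hold-back-and-score idea can be sketched in a few lines of Python. Everything here is made up for illustration: the “images” are single brightness numbers, and the “classifier” is a deliberately simplistic one-feature threshold rule, nothing like a real trained model.

```python
import random

# Toy labeled dataset: (brightness, is_beach) pairs -- purely illustrative
# stand-ins for real image features and human-applied labels.
random.seed(0)
data = [(random.uniform(0.6, 1.0), True) for _ in range(80)] + \
       [(random.uniform(0.0, 0.5), False) for _ in range(80)]
random.shuffle(data)

# Hold back 20% of the labeled examples as a test set the system
# never sees while "training".
split = int(len(data) * 0.8)
train, test = data[:split], data[split:]

# A stand-in "classifier": learn a brightness threshold halfway between
# the average beach and non-beach brightness in the training data.
beach_avg = sum(x for x, beach in train if beach) / sum(1 for _, b in train if b)
other_avg = sum(x for x, beach in train if not beach) / sum(1 for _, b in train if not b)
threshold = (beach_avg + other_avg) / 2

def classify(brightness):
    return brightness >= threshold  # True means "Beach"

# Because the right answers are known for the test inputs, scoring is instant.
correct = sum(classify(x) == label for x, label in test)
print(f"accuracy on held-out data: {correct / len(test):.2f}")
```

The point isn’t the accuracy number; it’s that the held-out labels let the developers compute it at all.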
In this sort of basic testing, there are two ways the system can go wrong. An image classifier designed simply to identify photos of beaches might, say, put an image of the Sahara into the “Beach” bucket, or it might put an image of a beach into the “Not a Beach” bucket.
For this post’s purposes, let’s call the first kind “False alarms”: the ML thinks the photo of the Sahara depicts a beach.
And the second “Missed targets”: the ML fails to recognize an actual beach photo.
ML practitioners use other terms for these errors. False alarms are false positives. Missed targets are false negatives. But just about everyone — even many professionals — finds these names confusing. Non-medical folk understandably assume that positive test results are always good news. In the ML world, it’s easy to confuse the positivity of the classification with the positivity of the trait being classified. For example, ML might be used to look at lots of metrics to assess whether a car is likely to need service soon. If a healthy car is put into the “Needs Service” bucket, it would count as a false positive even though we might think of needing service as a negative. And logically, shouldn’t a false negative be a positive? The concepts are crucial, but the terms are anything but intuitive.
So, let’s go with false alarms and missed targets as we talk about errors.
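In code, the two kinds of error are just two different mismatches between a prediction and the truth. A minimal sketch, using the beach example and invented labels:

```python
# Count the two kinds of error for a batch of test examples.
# A prediction of True means the classifier put the item in the "Beach" bucket.
def error_counts(predictions, truths):
    # False alarm (false positive): predicted "Beach", but it isn't one.
    false_alarms = sum(p and not t for p, t in zip(predictions, truths))
    # Missed target (false negative): an actual beach the classifier missed.
    missed_targets = sum(t and not p for p, t in zip(predictions, truths))
    return false_alarms, missed_targets

preds  = [True, True, False, False, True]
truths = [True, False, False, True, True]
print(error_counts(preds, truths))  # → (1, 1)
```

One false alarm (the second photo) and one missed target (the fourth): the same overall error rate, but two very different kinds of mistake.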
Take an example that doesn’t involve machine learning, at least not yet. Let’s say you’re adjusting a body scanner at an airport security checkpoint. Those who fly often (back in the day) can attest to the fact that most of the people for whom the scanner buzzes are in fact not security threats. They get manually screened by an agent — often a pat-down — and are sent on their way. That’s not an accident or a misadjustment. The scanners are set to generate false alarms rather frequently: if there’s any doubt, the machine beeps a human over to double check.
That’s a bit of a bother for the mis-classified passengers, but if the machine were set to create fewer false alarms, it potentially would miss genuine threats. So it errs on the side of false alarms, rather than missed targets.
There are two things to note here. First, reducing the false alarms can increase the number of missed targets, and vice versa. Second, which is the better thing to do depends on the goal of the machine learning system. And that always depends on the context.
For example, false alarms are not too much of a bother when the result is that more passengers get delayed for a few seconds. But if the ML is being used to recommend preventive surgery, false alarms could potentially lead people to put themselves at unnecessary risk. Having a kidney removed for no good reason is far worse than getting an unnecessary pat down. (This is obviously why a human doctor will be involved in your decision.)
The consequences can reach deep. If your ML system is predicting which areas of town ought to be patrolled most closely by the police, then tolerating a high rate of false alarms may mean that local people will feel targeted for stop-and-frisk operations, potentially alienating them from the police force, which can have harmful consequences of its own for a community.
False alarms are possible in every system designed by humans. They can be very expensive, in whatever dimensions you’re calculating costs.
It gets no less complex when considering how many missed targets you’re going to design your ML system to accept. If you tune your airport scanner so that it generates fewer false alarms, some people who are genuine threats may be waved on through, endangering an entire airplane. On the other hand, if your ML is deciding who is worthy of being granted a loan, a false alarm — someone who is granted a loan and then defaults on it — may be more costly to the lender than the missed opportunity of turning down someone who would have repaid the loan.
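The lender’s reasoning above is really an expected-cost calculation, and it can be made explicit. The dollar figures and error counts below are entirely made up; only the shape of the comparison matters.

```python
# Hypothetical per-error costs for a lender (made-up numbers):
COST_FALSE_ALARM = 10_000   # loan granted, borrower defaults
COST_MISSED_TARGET = 1_500  # profit lost by turning away a good borrower

def total_cost(false_alarms, missed_targets):
    return false_alarms * COST_FALSE_ALARM + missed_targets * COST_MISSED_TARGET

# Two tunings of the same system, scored on the same imaginary test set:
lenient = total_cost(false_alarms=12, missed_targets=3)   # approves readily
strict  = total_cost(false_alarms=2,  missed_targets=25)  # approves cautiously
print(lenient, strict)
```

With these particular numbers the cautious tuning is far cheaper; flip the cost ratio, as in the airport-scanner case, and the lenient tuning wins. The trade-off itself never goes away.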
Now, to not miss an opportunity to be confusing when talking about ML, consider an online book store that presents each user with suggestions for the next book to buy. What should the ML be told to prefer: Adding false alarms to the list, or avoiding missed opportunities? False alarms in this case are books the ML thinks the reader will be interested in, but the reader in fact doesn’t care about. Missed opportunities are the books the readers might actually buy but the ML thinks the reader wouldn’t care about. From the store’s point of view, what’s the best adjustment of those two sliders?
That question isn’t easy, and not just because the terms are non-intuitive for most of us. For one thing, should the buckets for books be “User Will Buy It” or, perhaps, “User Will Enjoy It”? Or maybe, “User Will Be Stretched By It”?
Then, for reasons external to ML, not all missed opportunities and false alarms are equal. For example, maybe your loan application ML is doing fine sorting applications into “Approve” and “Disapprove” buckets in terms of the missed opportunities and false alarms your company can tolerate. But suppose many more applications that become missed opportunities are coming from women or racial minorities. The system is performing up to specification, but that specification turns out to have unfair and unacceptable results.
Think hard and out loud
Adjusting the mix of false alarms and missed targets brings us to the third point of the Triangle of Error: the ML confidence level.
One of the easiest ways to adjust the percentages of false alarms and missed targets is to change the threshold of confidence required to make it into a given bin. (Other ways include training the system on better data or adjusting its classification algorithms.) For example, suppose you’ve trained an ML system on hundreds of thousands of images that have been manually labeled as “Smiling” or “Not Smiling”. From this training, the ML has learned that a broad expanse of light patches towards the bottom of the image is highly correlated with smiles, but then there are the Clint Eastwoods, whose smiles are much subtler. When the ML comes across a photo like that, it may classify it as smiling, but not as confidently as the image of the person with the broad, toothy grin.
If you want to lower the percentage of false alarms, you can raise the confidence level required to be put into the “Smiling” bin. Let’s say that on a scale of 0 to 10, the ML gives a particular toothy grin a 9, while Clint gets a 5. If you stipulate that it takes at least a 6 to make it into the “Smiling” bin, Clint won’t make the grade; he’ll become a missed target. Your “Smiling” bucket will become more accurate, but your “Not Smiling” bucket will contain at least one more missed target.
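Sliding that threshold and watching the two error counts move against each other takes only a few lines. The confidence scores and ground-truth labels below are invented for the sake of the sketch:

```python
# Each photo gets a 0-10 confidence score that it shows a smile, plus a
# ground-truth label. All scores and labels are made up for illustration.
photos = [
    (9, True),   # broad, toothy grin
    (5, True),   # a subtle Clint Eastwood smile
    (7, False),  # glare the model mistakes for teeth
    (2, False),
    (8, True),
    (3, False),
    (6, False),  # another confident mistake
    (4, True),   # slight smirk
]

def errors_at(threshold):
    # Scores at or above the threshold go into the "Smiling" bin.
    false_alarms   = sum(s >= threshold and not smiling for s, smiling in photos)
    missed_targets = sum(s < threshold and smiling for s, smiling in photos)
    return false_alarms, missed_targets

for t in (4, 6, 8):
    fa, mt = errors_at(t)
    print(f"threshold {t}: {fa} false alarms, {mt} missed targets")
```

Raising the threshold from 4 to 8 empties out the false alarms but sends the subtler smiles, Clint included, into the missed-target column. Neither setting is “correct”; each is a choice.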
Was that the right choice? That’s not something the machine can answer. It takes humans — design teams, communities, the full range of people affected by the machine learning — to decide what they want from the system, and what the trade-offs should be to best achieve that result.
Deciding on the trade-offs occasions difficult conversations. But perhaps one of the most useful consequences of machine learning at the social level is not only that it requires us humans to think hard and out loud about these issues, but that the requisite conversations implicitly acknowledge that we can never entirely escape error. At best we can decide how to err in ways that meet our goals and that treat everyone as fairly as possible.