When Accuracy Is Academic and Data Deceives

Two common pitfalls in the application of AI analytics to business problems

The rise of artificial intelligence has brought with it a strange paradox: Although AI itself is grounded in data and logic, business users can be tempted to throw rationality out the window when dealing with AI models. This is completely understandable. After all, AI algorithms have been imbued with nearly magical properties, capable of telling a business how to identify risky customers, predict customer behavior, structure the perfect incentive program to reduce customer churn, and just generally provide elegant, unbiased answers to a business’s most vexing questions.

If only we could put our trust in what we wish were pure, scientifically derived AI models. Alas, it is more complicated than that. In this article, I will describe two of the major pitfalls associated with artificial intelligence models, discuss why it can be so difficult to avoid these pitfalls, and offer some ideas on how to move past them to make AI more useful and profitable.

Pitfall #1: Accuracy Isn’t Everything

All models that make predictions are designed to meet specific performance standards, such as those for sensitivity or specificity. The data scientists who create them may also measure the performance of the model on such mathematically derived metrics as area under the curve (AUC), positive or negative predictive value, or lift. One common — and commonly misunderstood — metric is “accuracy.”

Imagine, for example, that you have commissioned an algorithm to help you detect anyone with malicious intent attempting to board an aircraft. You install a black box above the boarding gate, run by an algorithm guaranteed to be more than 99.99% accurate. Every time someone with a statistical chance of having bad intent walks through, a red lightbulb on the box flashes; for everyone else, a green bulb flashes. Now imagine that your black box always flashes green, not because it hasn’t spotted any shady characters, but because both bulbs got wired to the “no bad intent” circuit. Your black box will still be right more than 99.99% of the time, because more than 99.99% of the people who walk through airport security have no malicious intent. In other words, this useless black box still boasts better than 99.99% accuracy. Clearly, accuracy on its own can be a terrible metric.
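For readers who want to see the arithmetic, here is a minimal sketch in Python; the one-in-100,000 base rate and the passenger count are made-up numbers, chosen only to illustrate the class imbalance.

```python
# A toy version of the "always green" black box. The 1-in-100,000 base rate
# is an assumption chosen purely to illustrate the class imbalance.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
n_passengers = 1_000_000
y_true = (rng.random(n_passengers) < 1e-5).astype(int)  # 1 = malicious intent

# Both bulbs wired to "no bad intent": the box never flags anyone.
y_pred = np.zeros(n_passengers, dtype=int)

print(f"accuracy: {accuracy_score(y_true, y_pred):.5f}")  # ~0.99999
print(f"recall:   {recall_score(y_true, y_pred):.5f}")    # 0.0 -- it catches no one
```

Accuracy rewards the box for the overwhelming majority of harmless passengers it waves through; recall exposes that it never catches the people it was built to catch.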

Metrics Should Pass the Common-Sense Test

If you want to know whether a model is any good, the best metric is usually a comparison of what the outcome (often profit) will be if you use the model versus if you don’t. This may sound obvious: Any business would want to make sure that an algorithm, which usually takes considerable resources to create, would result in a positive business outcome. In my experience, however, companies rarely subject a model to this test (more below on why they don’t). This isn’t just an academic point: It can lead to bad business decisions. Here are two examples of how failure to apply this metric can lead to reduced profits.

My team at BCG previously worked with an insurance company that had created a model to identify the customers most at risk of churn. Acting on the model’s predictions, the company offered discounts to all the high-risk customers it was able to contact, a strategy that did save some customers. My team was asked to see if we could improve the prediction algorithm, but before we began, we applied a common-sense economic lens to it. Our recommendation was not to improve the algorithm, because no plausible improvement would have been enough to justify this discounting approach — the cost of discounts given to customers who wouldn’t have left (“false positives”) far exceeded the benefit of retaining those who would have left but for the discount. Instead, we recommended that the discount program be stopped. If you have a model that tells you to reduce your prices to retain or acquire more customers, you may be tempted to use it. What the model may not tell you is that the overall economic effect might be that you gain customers but lose money.
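A back-of-the-envelope version of that economic lens might look like the sketch below. Every number in it is hypothetical; the point is that the discount is paid to every customer contacted, while margin is recovered only from the small fraction who genuinely would have left and were persuaded to stay.

```python
# Back-of-the-envelope economics of the discount program. Every number here
# is hypothetical, chosen only to illustrate the false-positive problem.
contacted           = 10_000   # high-risk customers reached and offered the discount
p_true_churner      = 0.20     # share who genuinely would have left
p_saved_by_discount = 0.30     # share of true churners the discount actually retains
discount_cost       = 150      # cost of the discount per customer, per year
margin_per_customer = 400      # annual margin on a customer who stays

customers_saved     = contacted * p_true_churner * p_saved_by_discount        # 600
benefit             = customers_saved * margin_per_customer                   # 240,000
false_positive_cost = contacted * (1 - p_true_churner) * discount_cost        # 1,200,000
total_cost          = contacted * discount_cost                               # 1,500,000

print(f"benefit {benefit:,.0f} vs cost {total_cost:,.0f} -> net {benefit - total_cost:,.0f}")
# The discounts handed to the 80% who were never going to leave dwarf the
# margin recovered from the few who were saved.
```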

Some of my colleagues worked with a bank that invested in sophisticated “will they buy it” propensity models to predict the chance of customers buying particular products in the coming months. Armed with this new model, the company’s product managers, eager to meet their sales targets, insisted that the company make contact with and offer a new product to every single customer whom the algorithm determined might purchase that product. For each product, they picked the 80% of customers most likely to buy it. In other words, they offered the product to everyone except the one-fifth of customers least likely to buy it.

First, we found that 70% of the actual buyers would have bought the product anyway, regardless of whether it was marketed to them or not. This indicated that most of the resources — and customer goodwill — spent on the marketing program could have been used for something better. A perhaps worse outcome was that 15% of recipients chose to opt out of all future emails from the bank, thereby depriving the bank of a chance to do any future marketing to these customers. This is why it was important to build the “what if we do nothing” model at the outset. If they hadn’t had this high opt-out rate, their original approach would have been acceptable, even if it might have been annoying for some customers.

We replaced the “will they buy it” model with two models — “will they buy if we market to them” and “will they buy if we don’t market to them” — and then targeted only those customers for whom the marketing program would make a difference. As a result of taking this “uplift” approach, the opt-out rate fell by 75% and the take-up rate doubled.
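For the technically minded, this two-model approach can be sketched roughly as follows, using scikit-learn; the column names, model choice, and targeting threshold are my own assumptions, not the bank’s actual implementation.

```python
# A minimal sketch of the two-model ("will they buy if we market to them" vs
# "will they buy if we don't") uplift approach. Column names are assumptions.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def fit_uplift_models(history: pd.DataFrame, features: list[str]):
    """Fit separate purchase models for marketed and non-marketed customers."""
    treated = history[history["was_marketed"] == 1]
    control = history[history["was_marketed"] == 0]
    model_t = GradientBoostingClassifier().fit(treated[features], treated["bought"])
    model_c = GradientBoostingClassifier().fit(control[features], control["bought"])
    return model_t, model_c

def uplift_scores(model_t, model_c, customers: pd.DataFrame, features: list[str]):
    """Estimated lift in purchase probability from marketing to each customer."""
    p_if_marketed = model_t.predict_proba(customers[features])[:, 1]
    p_if_not      = model_c.predict_proba(customers[features])[:, 1]
    return p_if_marketed - p_if_not

# Target only customers for whom marketing is expected to make a difference,
# e.g. an estimated uplift of more than two percentage points:
# targets = customers[uplift_scores(model_t, model_c, customers, features) > 0.02]
```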

In both of these examples, the prospect of being able to use a data-based model to determine a course of action was beguiling: artificial intelligence to the rescue! The problem arises when you build the right model but use it to answer the wrong question. Trying to determine whether a customer will leave is not the same question as whether the company should offer that customer a discount. Nor is trying to determine the chance that a customer will buy the same as whether the company should email that person.

When considering the potential value of an AI model, you must measure something the organisation cares about: Examples include margin, customer satisfaction, or quality-adjusted life years. And you must frame the measurement in terms of the difference between using the model (in the way you plan to use it) and not using it. Other metrics such as accuracy, AUC, sensitivity, specificity, and lift are useful shortcuts for model iteration and comparison across applications, but they should never be used to make a business decision. No statistical metric can tell you whether using your model is profitable.

Pitfall #2: Data-Based Modelling May Seem Like the Perfect Solution, But Data Can Be Deceptive

The mantra associated with AI modelling is that, given the proper data, you can create decision engines that will help you achieve business objectives. Once again, it is easy to be seduced by the apparent elegance of the AI modelling process: Input the proper data, subject it to the perfect, mathematically derived algorithm and, voila, you have your perfect new business decision. But real data hides a wealth of complexity. Consider two more examples:

  • My BCG colleagues worked with a logistics company that built an AI model to see if it should offer lower prices to generate more business. They discovered that, other factors being equal, customers who were offered lower prices appeared less likely to buy than those offered the full price. It wasn’t a matter of customers equating lower price with lower quality — this was a genuine commodity business in which a lower price is always better. It was an illusion, caused by the fact that the salespeople didn’t offer the discount to people who they thought were not seriously looking at other logistics companies’ offers. The salespeople did offer the discount to more fickle customers — those they thought might not buy at all. The critical piece of information — how seriously the customer was shopping around — simply wasn’t captured in the data. To correct for this, the team had to estimate what price the customer would have been expecting, and then model the chance of winning the business as a function of the deviation from that price (a sketch of this correction follows this list).
  • We also worked with a government department responsible for helping jobseekers find work. A service they offered was to provide interpreters for people who could not speak the local language. We compared outcomes for people who had an interpreter with the outcomes of those who did not, correcting for everything else we knew about the jobseekers. The people who got the interpreters appeared to fare much worse. It is very unlikely that this was because the interpreters were in some way harmful, and we certainly did not recommend cutting the program based on these results. Our hypothesis was that people who needed interpreters were worse at speaking the local language than was suggested by the data we had about them. Their need for an interpreter was telling us something about them we could not discern from the data — just as a discount in the previous example told us something we didn’t know about the logistics company’s customers.
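To make the logistics correction in the first bullet concrete, the two-step fix might be sketched as follows; the file, column, and feature names are assumptions for illustration, not the team’s actual implementation.

```python
# Step 1: estimate the price each customer would have expected for the job,
# using quotes where no discount was given. Step 2: model the chance of
# winning the business on the deviation from that expected price.
# All file, column, and feature names are assumptions.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

quotes = pd.read_csv("quotes.csv")
shipment_features = ["weight_kg", "distance_km", "is_urgent"]

full_price = quotes[quotes["discount_offered"] == 0]
expected_price_model = GradientBoostingRegressor().fit(
    full_price[shipment_features], full_price["quoted_price"]
)

quotes["price_deviation"] = (
    quotes["quoted_price"] - expected_price_model.predict(quotes[shipment_features])
)

win_model = GradientBoostingClassifier().fit(
    quotes[shipment_features + ["price_deviation"]], quotes["won_business"]
)
```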

Data From Anomalous Events Can Reveal What Everyday Data Cannot

The fact is that historical data alone is almost never sufficient to guide business decisions. Anyone who claims to have a completely automatic analytics system, into which you just feed data to arrive at the perfect answer, is either naive or deliberately misleading. The “gold standard” answer is to do an experiment — a randomized controlled trial. But there are alternatives. Indeed, much of the discipline of econometrics is devoted to ways of finding out, in the absence of randomized controlled trials, what factors actually contribute to a response.

One way to find causation is to take advantage of anomalous events such as reorganisations, industrial action (strikes), or infrastructure outages to look at data from a different angle. We recently had the opportunity to look at a bank’s customer retention rates. For the past five years, this particular bank had been offering discounts to any customer who threatened to move their mortgage to another bank. When the bank went through a reorganisation, it stopped this activity for 10 days. In doing so, it accidentally performed the “what if I don’t?” experiment: What would happen to the bank’s retention rate if the retention department wasn’t there to offer incentives to customers who threatened to leave? We were able to review customer activity over those 10 days, and found that most of the customers who said they wanted to move their mortgage didn’t — even though they weren’t offered a discount. Through this unplanned experiment, the bank learned that it would do better if it offered a smaller discount or, perhaps, no discount at all to these apparently fickle customers. The odds were that these customers would stay anyway.
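Reading an unplanned experiment like this out of the data can be as simple as comparing outcomes during the pause with outcomes under business as usual. A rough sketch, with an assumed data layout and an assumed 10-day window:

```python
# A rough sketch of reading the unplanned "what if I don't?" experiment out of
# the bank's records. File, column names, and the pause window are assumptions.
import pandas as pd

threats = pd.read_csv("mortgage_retention_calls.csv", parse_dates=["call_date"])

# The reorganisation pause: no retention discounts were offered in this window.
pause = threats["call_date"].between("2019-06-01", "2019-06-10")

retention_by_group = (
    threats.assign(group=pause.map({True: "pause (no discount)", False: "discount offered"}))
           .groupby("group")["kept_mortgage"]
           .mean()
)
print(retention_by_group)
# If retention during the pause is close to the business-as-usual rate,
# the discount is buying very little.
```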

Three Ways to Avoid Pitfalls

It is easy to understand why companies are seduced by the promise of accuracy and fail to consider the simple question, “Should I use the model or shouldn’t I?” For example, to answer that question correctly, a company might need to know how many new customers it will gain once it makes an offer. But how can you possibly know that in advance? Along with that uncertainty comes a concern that if the company fails to offer a discount, it might lose sales. These two problems may seem insurmountable, making “use the model” the only rational choice. There are, however, at least three ways to make the path ahead clearer:

1. Favour economic/outcome measures over all measures of model performance (especially measures of “accuracy”). In the case of discounting, a good way to measure would be to compare the expected margin if you offer customers a discount versus if you don’t.

The problem is that you won’t know how predictive the model will be until you build it, and you’ll need to know that in order to do the economics. But in fact, you can — and should — do the economics before you build the model. Just assume the model will be excellent but not “magical.” Anyone who has had to build a range of models will have a good idea of what the best performance might be (for example, a top-decile lift of about 5 or an AUC in excess of 0.9). If the economics don’t stack up with the best model you might reasonably build, you can save your effort and not build the model.
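As an illustration, the pre-build economics for a churn-discount model might look like the sketch below. The only modelling assumption is an optimistic top-decile lift of 5; every other number is hypothetical and would come from your own P&L.

```python
# Pre-build economics for a hypothetical churn-discount program.
# Assumes a best-case model with a top-decile lift of 5; all other numbers
# are illustrative placeholders.
customers       = 100_000
base_churn_rate = 0.02     # 2% of customers churn in the period
top_decile_lift = 5        # best case: the top 10% are 5x as likely to churn
save_rate       = 0.30     # share of targeted churners the offer retains
offer_cost      = 100      # cost of the offer per targeted customer
margin_retained = 600      # margin kept for each customer saved

targeted           = customers * 0.10                               # 10,000
churners_in_decile = targeted * base_churn_rate * top_decile_lift   # 1,000
benefit            = churners_in_decile * save_rate * margin_retained
cost               = targeted * offer_cost

print(f"best-case net: {benefit - cost:,.0f}")  # negative here: -820,000
```

If even this best-case scenario loses money, no amount of model tuning will rescue the program, and you can stop before the first line of modelling code is written.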

2. If it appears that the economics do make sense, your next focus should be on experiments, not on being right first time:

  • Don’t try to squeeze so much analytics out of your data that you’ll be right first time with the interventions you make. Instead, make sure you’re “less wrong the second time.” Use the data you have to come up with sensible variations of interventions to test. You won’t readily know how effective your variations will be until you test them, and you’ll need a control group to make sure you wouldn’t have had the same result anyway (see the sketch after this list).
  • Look for natural experiments, such as when the mortgage bank temporarily stopped sending out discounts, or when things have gone wrong in the past. These might tell you the answer to the “what if I don’t” question.
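The test-versus-control readout referred to above can be sketched in a few lines; the file and column names are assumptions.

```python
# A minimal sketch of the test-versus-control readout. The file and column
# names ("was_treated", "converted") are assumptions.
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

results = pd.read_csv("campaign_results.csv")  # one row per customer, post-campaign

test    = results[results["was_treated"] == 1]   # randomly chosen to receive the offer
control = results[results["was_treated"] == 0]   # randomly held out, no offer

counts = [test["converted"].sum(), control["converted"].sum()]
nobs   = [len(test), len(control)]
z_stat, p_value = proportions_ztest(counts, nobs)

print(f"test: {counts[0] / nobs[0]:.2%}  control: {counts[1] / nobs[1]:.2%}  p = {p_value:.3f}")
# The control rate answers "what would have happened anyway?"; only the gap
# between the two rates is attributable to the intervention.
```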

3. If you are getting counter-intuitive conclusions, such as when the logistics company found that discounting reduced the chances of making a sale, you’re probably mistaking cause and effect. In this case, you’ll need to:

  • Talk to subject matter experts — the people closest to the situation. They may be able to help you identify the problem.
  • Leave some latitude in your system for people who have access to hidden insight. For example, if only the salespeople really know how fickle the customer is, give them some pricing flexibility. But you’ll need to think about their incentives. Salespeople incentivized by volume, for example, will almost always use their full discounting authority, so you’ll need to make sure you have some constraints in place.

Avoid Shiny Objects

The experience and judgement I’ve gained in my professional life suggest that AI models are sometimes used essentially because they are the latest shiny new thing. In the glare of the new, and in light of the hype that typically surrounds it, it is easy to latch onto some measure of shininess rather than some measure of usefulness. Like a flashy new AI model, accuracy can be a measure of shininess — and end up being a dreadful metric. A model that looks solely to historical data — without an understanding of cause and effect — to guide business decisions may be an equally poor choice.

None of this should be taken to mean that you shouldn’t take full advantage of emerging artificial intelligence tools to help you make important business decisions. Just make sure that, when you do, you allow common sense, your own experience and judgement, and input from subject matter experts to inform your final decision.
