A Product Manager’s Guide to When to Say No to Machine Learning (and When to Say Yes)

Sam Stone
Structured Ramblings
8 min read · Dec 22, 2019

Product managers, myself included, face increasing pressure to integrate machine learning into our products. This pressure comes from many places: engineers and data scientists excited by cutting-edge algorithms, sales and marketing stakeholders looking to solve problems faster and more cheaply, and our own desire to serve users better.

As a field, product is in the early days of using machine learning (ML) to power critical user-facing applications. After working on ML applications across real estate, recruiting, and genomics, I’ve found the following to be useful when considering the impact of ML on user-facing products:

  • Distinguishing ML models from rules-based models
  • Recognizing rules-based models have advantages over ML models
  • Answering three key questions before building an ML model

At the end, I assess a variety of real-world use cases against the three questions, identifying good and bad places to use ML.

Not all good models are ML models

At a high level, models can be divided into “learned” (or ML) and “rules-based”. For learned models, the developer feeds the model data, and the model learns associations between inputs and outputs. These learned associations may not conform to humans’ expectations. For rules-based models, the developer programs explicit rules, and (absent bugs) the model’s output conforms to human expectations.
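To make the distinction concrete, here’s a minimal sketch (in Python, with scikit-learn) of both approaches applied to the same toy task. The task, feature names, dollar figures, and thresholds are all hypothetical illustrations, not examples from any real product.

```python
from sklearn.linear_model import LinearRegression

# Rules-based: every association is an explicit, human-readable rule.
def rules_based_price(sqft: float, bedrooms: int) -> float:
    price = 150 * sqft        # assumed base rate of $150 per square foot
    if bedrooms >= 4:
        price *= 1.10         # assumed 10% premium for larger homes
    return price

# Learned: the developer supplies (features, labels) and the model
# infers the associations itself, encoded as coefficients.
X = [[1000, 2], [1500, 3], [2200, 4], [3000, 5]]   # [sqft, bedrooms], toy data
y = [155_000, 228_000, 360_000, 495_000]           # observed sale prices, toy data
learned_model = LinearRegression().fit(X, y)

print(rules_based_price(1800, 3))          # output follows directly from the rules
print(learned_model.predict([[1800, 3]]))  # output follows from learned coefficients
```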

It’s a misconception that learned models are more sophisticated than rules-based models. Learned models can be mathematically complex (e.g. deep neural networks) or relatively simple (e.g. linear regression). Similarly, rules-based models can be complex (e.g. physics simulations requiring supercomputers to run) or simple (e.g. one or two if statements).

Another key point is that backtesting, a staple of rigorous statistics, applies to rules-based models, not just learned models. Backtesting means using a model to make predictions on past instances with known outcomes and comparing the predictions to the known outcomes. This can be done with rules-based models, and all the same accuracy metrics used for learned models (AUC, F1, R², MSE…) can be computed for rules-based models. Not only can backtesting be done for rules-based models, but it should be done to prevent overfitting.
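As a sketch of what that looks like in practice, here’s a backtest of a hypothetical rules-based classifier using standard scikit-learn metrics. The rule, thresholds, and toy historical records are all made up for illustration.

```python
from sklearn.metrics import roc_auc_score, f1_score

# Past instances with known outcomes (1 = defaulted, 0 = repaid). Toy data.
history = [
    {"score": 720, "income": 85_000, "defaulted": 0},
    {"score": 580, "income": 40_000, "defaulted": 1},
    {"score": 640, "income": 52_000, "defaulted": 0},
    {"score": 600, "income": 30_000, "defaulted": 1},
]

def rules_based_default_risk(applicant: dict) -> int:
    # Explicit rule: flag applicants with both a low credit score and low income.
    return int(applicant["score"] < 620 and applicant["income"] < 45_000)

y_true = [row["defaulted"] for row in history]
y_pred = [rules_based_default_risk(row) for row in history]

# The same accuracy metrics used for learned models apply directly.
print("AUC:", roc_auc_score(y_true, y_pred))
print("F1: ", f1_score(y_true, y_pred))
```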

Rules-based models have advantages over ML

  1. Rules-based models require domain knowledge. To decide what the rules should be, the product team has to know something about the problem space. This is a useful constraint. It promotes deeper understanding of the model’s use case and users, and encourages the product team to seek constant improvement of their understanding. There are few problems where the naive application of a learned model to an industry dataset will meaningfully help users — unfortunately such low-hanging fruit has already been picked in most industries.
  2. Rules-based models generally need less data. Rules-based models don’t have to be trained! This is particularly useful for new applications or startups, where there may be deep domain knowledge but little data.
  3. Rules-based models are generally easier to debug. They’re easier to reason about than learned models, because rules-based models reflect only associations that humans have encoded into them. Moreover, these associations are expressed directly in code. Conversely, in a learned system, associations are expressed not in code, but in coefficients that may be difficult to find, hard to understand, and temporally unstable. When a production model starts misbehaving at 3am and a product team gets woken up to fix it, it’s far preferable to search for a fault that is guaranteed to exist somewhere in human-readable code.

When you should use an ML Model

Despite the advantages of rules-based models, you can’t visit TechCrunch without reading about how learned models are driving incredible advances ranging from self-driving cars to language translation. And if you’ve read about AI research over the past decade, you know the pendulum has swung hard in favor of learned models over rules-based models. So when should a product manager favor a learned model for a user-facing product? When the answer to the following three questions is “yes.”

1. Is a probabilistic, or opaque, rationale acceptable?

This is a question about the model’s use case, not the model itself. It’s a question about values. Ask yourself: if this model were applied to me, would I be OK with any decision it might return?

When Netflix makes a poor, unintuitive recommendation to me, I don’t feel like my values have been violated. The stakes are low, and there’s no “social contract” that says I have a right to understand my movie recommendations. It’s tempting to extrapolate that we should only use probabilistic rationales in low-stakes situations. Unfortunately, it’s not that simple. Consider medicine, where it’s widely accepted that statistical (probabilistic) results are the best way to make life-and-death decisions like approving new drugs. Whether a probabilistic rationale is acceptable is a question of values, not a question of scale or seriousness.

Here’s another medical example that illustrates the counterpoint. Imagine that after seeing a doctor, you filed a claim with your health insurer, and it was denied because the probability your claim didn’t meet the insurer’s coverage rules was deemed too high. I (and the doctor) would be incensed. I would want an understandable, deterministic answer — what rule did it violate? Nonetheless, insurers are using opaque learned models to adjudicate claims, because they’re struggling under the accumulated complexity of their own payment rules.

Similarly, consider the process of granting parole. Given the history of prejudiced parole decisions in the US, these decisions should follow highly interpretable rationales, to prevent the perpetuation of bias. This doesn’t mean a model can’t be used, but society should at least understand the model’s logic. Nonetheless, many parole decisions today are made with black-box learned algorithms (many others have bashed this already, so I won’t go into detail, but here’s a good take-down of a black-box parole algorithm).

People’s values differ, so there will always be large gray areas, but we should start the ML exploration process with the question of “does a probabilistic decision align with our values?”

2. Does the data faithfully represent the real world?

A model requires two types of data: labels (aka ground truth, outcomes, the dependent variable) and features (aka inputs, independent variables).

First, we need labels that represent reality. A lot can get in between reality and data labels. Take datasets related to crime. Since criminals want to avoid punishment, they work hard to prevent their crimes from showing up in a conviction dataset! The “true” labels are only a subset of actual crimes, and unfortunately, since innocent people get convicted, they also include false positives. Building an unbiased model with such training data labels is very difficult.

A common problem that reduces the quality of labels is censoring, which occurs when the data generation process prevents some instances from ever being labeled. Consider a model that predicts job applicants’ performance. Performance reviews are available for applicants who were hired, but, by definition, not for rejected applicants. The detrimental impact of censoring can sometimes be reduced through clever statistics, but if your use case involves censored data, it’s a big red flag.
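Here’s a minimal sketch of how censoring shows up in this kind of dataset; the field names and values are hypothetical. The point is that a model trained on the labeled subset never sees the rejected applicants it will later be asked to score.

```python
applicants = [
    {"id": 1, "hired": True,  "performance": 4.2},
    {"id": 2, "hired": True,  "performance": 3.1},
    {"id": 3, "hired": False, "performance": None},  # censored: outcome never observed
    {"id": 4, "hired": False, "performance": None},  # censored: outcome never observed
]

# A naive training set silently drops every censored instance.
training_data = [a for a in applicants if a["performance"] is not None]
print(f"{len(training_data)} of {len(applicants)} applicants have labels")
```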

Second, we need features that represent our prior beliefs about what inputs matter and why. In other words, we need to have a causal mental model (Footnote 1) and be wary of features that don’t fit that causal model. Such features can improve backtested accuracy while reducing real-world accuracy. Here’s an example where a learned model was used to identify cancerous skin lesions from biopsy images:

When dermatologists are looking at a lesion that they think might be a tumor, they’ll break out a ruler — the type you might have used in grade school — to take an accurate measurement of its size. Dermatologists tend to do this only for lesions that are a cause for concern. So in the set of biopsy images, if an image had a ruler in it, the algorithm was more likely to call a tumor malignant, because the presence of a ruler correlated with an increased likelihood a lesion was cancerous. (source)

In this case, the researchers were able to overcome the mistaken association “ruler indicates cancer” because (a) their model was sufficiently interpretable to highlight the learned associations and (b) their domain knowledge alerted them that this association was illogical.
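As a rough sketch of that kind of interpretability check, here’s a simulated version of the lesion example using a logistic regression: the “ruler_present” feature is a labeling artifact with no causal role, so a large learned coefficient on it is exactly the kind of red flag domain knowledge should catch. The data and feature names are synthetic stand-ins, not the researchers’ actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
malignant = rng.integers(0, 2, n)
lesion_size_mm = 3 + 4 * malignant + rng.normal(0, 1, n)               # causally relevant
ruler_present = (rng.random(n) < 0.1 + 0.8 * malignant).astype(float)  # artifact of the labeling process

X = np.column_stack([lesion_size_mm, ruler_present])
model = LogisticRegression(max_iter=1000).fit(X, malignant)

# Compare the learned associations against the causal mental model.
for name, coef in zip(["lesion_size_mm", "ruler_present"], model.coef_[0]):
    print(f"{name:15s} coefficient: {coef:+.2f}")
# A strongly positive coefficient on ruler_present means the model is leaning on
# a measurement artifact rather than on the lesion itself.
```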

3. Is silent failure tolerable?

Here’s a scary situation we encounter at Opendoor, the company where I work. We build a new home valuation model that advances our accuracy in backtesting and A/B testing. We deploy it at scale, which means we start acquiring homes using the prices our model predicts, taking on significant balance-sheet risk. Then the model fails silently.

By “fails silently”, I mean the model continues to be accurate in backtesting, but its real-world accuracy has decreased. There are so many reasons a model can fail silently that it’s unrealistic to expect good engineering to prevent every such failure. One prominent silent failure mode is a training-serving data mismatch. Sticking with our Opendoor example, this could occur when we update a prompt on a website form, so the nature of the data we collect from home sellers changes at serving time, but there is no corresponding change in our training data.
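One partial guard against this failure mode is to monitor serving-time feature distributions against their training-time counterparts. Here’s a minimal sketch using a two-sample Kolmogorov-Smirnov test from scipy; the feature, values, and alert threshold are illustrative, not Opendoor’s actual monitoring.

```python
from scipy.stats import ks_2samp

def check_feature_drift(training_values, serving_values, p_threshold=0.01):
    """Flag a feature whose serving distribution looks unlike its training distribution."""
    statistic, p_value = ks_2samp(training_values, serving_values)
    if p_value < p_threshold:
        # In production this would page someone rather than print.
        print(f"Possible training-serving skew (KS={statistic:.2f}, p={p_value:.4f})")
    return p_value

# Example: a changed website prompt shifts self-reported square footage upward.
training_sqft = [1200, 1500, 1800, 2100, 2400, 1300, 1700, 1950, 2200, 1600]
serving_sqft  = [2400, 2800, 3100, 2600, 3300, 2900, 2700, 3500, 3000, 2500]
check_feature_drift(training_sqft, serving_sqft)
```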

Given that silent failures cannot be entirely avoided, the key question is: how much damage from a silent failure can be tolerated? The damage from a silent failure is proportional to the failure’s duration, and duration depends on how long it takes to observe degradation in real-world outcomes. If you’re making predictions one day into the future, you have to wait one day. If you’re making predictions one year into the future, you have to wait one year.

Consider content recommendation algorithms, where real-world accuracy is generally evaluated by click-through rate. Since the clicks normally happen within a few seconds of the recommendations being presented (at least for search applications), the feedback loop is extremely short. A silent algorithm failure can be quickly identified.

Conversely, consider an algorithm for scoring loan applications. It may take a lender months or years to see higher-than-expected default rates and identify that this is coming from a silent failure. By this point, the lender may have committed enormous amounts of capital to the bad loans, and is likely legally unable to unwind them.

Real-World Use Cases

To bring this all together, here’s how the use cases I’ve mentioned above (and some others) stack up on these three questions. This table is meant as a rough guide, not an ironclad assessment, and I expect that experts from some of the domains below will disagree with me (if you’re one, please message me and let me know why!).

I’m bullish that machine learning will change society significantly and positively. I’ve made my career decisions based on that belief. But ML is not a panacea. Like all software projects, ML model development carries significant risk of failure. I don’t want ML to suffer the fate of the Concorde, where some poor early design decisions permanently scared off the public. I hope these principles make that a bit less likely to happen.

Footnote 1: The implications of requiring a causal mental model before developing an ML model are significant. A corollary is that you should only build a learned model when human experts have better-than-random accuracy and an explainable rationale.

Implication A: if you’re building a model to be faster, more consistent, and/or cheaper than experts, the low-risk modeling option is to encode the experts’ explainable rationale in a rules-based model! Only explore a learned model if you need to be more accurate than experts.

Implication B: if experts are little or no better than random, a learned model is probably a fool’s errand. If experts aren’t accurate, they may not even be using the right causal inputs. And if you don’t have the right causal inputs in your feature set, even the most sophisticated learned model is doomed to fail. Situations like this include predicting stock prices and geopolitical events — even experts don’t agree on what the necessary causal inputs are.
