Machine Learning for Decision Making

Best of Both Worlds, Part 1

Emily Glassberg Sands
Teconomics
6 min read · Nov 18, 2017


This post was co-authored with Duncan Gilchrist and is Part 1 of our “Best of Both Worlds: An Applied Intro to ML for Causal Inference” series (Part 2 here).

We’re grateful to Evan Magnusson for his strong thought-partnership and excellent insights throughout.

Photo cred: Andy Kelly

Over the last couple of years, we’ve been excited to see — and leverage — a range of new methods that significantly improve our ability to glean causal relationships from data, especially big data. Many of these marry the best of machine learning and econometrics to unlock deeper and more accurate inference. Applied correctly, they help us get the insights we need to make better decisions for our companies and our communities.

Since most of the methods are not yet broadly used in industry, we’re kicking off this “Best of Both Worlds” series — an applied intro to machine learning for causal inference. Each post covers a concrete inference challenge that we’ve seen in tech, along with one or more solutions powered by the intersection of machine learning and causal inference. The goal is to arm the reader with the understanding she needs to start recognizing and applying these methods in her own context. To facilitate application, we also include companion R code in Jupyter notebooks on GitHub.

This series is likely most relevant to those identifying and measuring how various systems work — current and aspiring data scientists who are teasing out causal relationships to inform product and business direction, empirical economists working with high-dimensional data, and computer scientists looking to go beyond prediction to inference. Warning: Not intended for theorists.

Before jumping into specific methods, a bit of context on ML vs. Econometrics, some of the challenges with more traditional causal inference methods, and why we’re excited about ML for better inference.

The Classic Divide: Machine Learning vs. Econometrics

Photo cred: Johannes Ludwig

Traditionally, the machine learning toolkit and the econometric toolkit are used to answer distinct questions: machine learning centers on prediction, while econometrics — causal inference especially — centers on decision-making.

Let’s take an example. In tech, there’s often a focus on figuring out what “magic moment” hooks new users for months and years to come — What simple milestone leads to long-term customer loyalty?

Former Facebook executive Chamath Palihapitiya famously explained that once Facebook discovered that users would retain if they reached 7 friends in 10 days, making that experience a reality for all users became a “single sole focus” for the company. Twitter, Zynga, Dropbox, Slack, and others have mentioned identifying similarly magical moments in their own contexts. How do we find our own?

If we want to predict whether or not a user will stick around for, say, the next year based on her behavior in the first month, we leverage machine learning techniques. In the simplest approach, we train a classification model to predict retention as a function of the (many possible measures of) first-month behavior. Using regularized regression techniques, we should be able to narrow down to a reasonable set of behaviors that powerfully predict retention. This can be helpful for estimating the expected lifetime value of the user, for example to optimize marketing spend.
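To make that concrete, here’s a minimal sketch in R (the language of our companion notebooks) of what that regularized classification step could look like. The data frame users, the outcome retained_1yr, and the feature columns are hypothetical placeholders, not the series’ actual companion code.

```r
# Minimal sketch: lasso-regularized logistic regression for retention,
# using hypothetical data. `users` has a 0/1 outcome `retained_1yr`
# plus many columns of first-month behavioral features.
library(glmnet)

# Build the feature matrix from every column except the outcome
X <- model.matrix(retained_1yr ~ . - 1, data = users)
y <- users$retained_1yr

# Cross-validated lasso (alpha = 1) logistic regression picks the penalty
cv_fit <- cv.glmnet(X, y, family = "binomial", alpha = 1)

# Behaviors with non-zero coefficients at the chosen penalty are the
# strongest predictors of one-year retention
coef(cv_fit, s = "lambda.min")

# Predicted retention probabilities, e.g. as an input to LTV estimates
pred <- predict(cv_fit, newx = X, s = "lambda.min", type = "response")
```

The lasso penalty shrinks the coefficients on weakly predictive behaviors all the way to zero, which is what “narrowing down to a reasonable set of behaviors” looks like in practice.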

Photo cred: Austin Ban

But the magic moment isn’t about predicting retention at all. Instead, it’s about determining the set of first-month behaviors that causally drive retention. In Facebook’s case, if users who acquire more friends at the start are more likely to retain largely because they are just inherently different (e.g., more social, more innately interested in the product, more addicted to technology), then strategic product decisions to invest in early friending based on the observational correlation between early friending and retention would yield underwhelming business results.

Econometric techniques are the main tools in our toolkit for determining causality. So, instead of focusing on prediction, one place we might start is a logistic regression of one-year retention on first-month behavioral features, adding in a host of intuitive controls for major confounders that might influence both, such as the user’s referral source and other proxies for the user’s quality and inherent interest in the product.
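As a rough sketch (with hypothetical column names like early_friends and referral_source standing in for whatever behaviors and confounders you actually observe), that controlled regression might look like:

```r
# Controlled logistic regression: one-year retention on the first-month
# behavior of interest, conditioning on hand-picked confounders.
# All column names here are hypothetical.
fit <- glm(
  retained_1yr ~ early_friends +        # behavior whose effect we care about
    referral_source +                   # control: how the user found us
    signup_platform +                   # control: proxy for user type
    prior_category_interest,            # control: proxy for inherent interest
  data   = users,
  family = binomial(link = "logit")
)

summary(fit)  # coefficient on early_friends, conditional on the chosen controls
```

The catch, as we discuss below, is that which controls make it into that formula is left entirely to the analyst’s judgment.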

By the way, if you’re in tech you may be thinking, ‘Hey, omitted variable bias can be pretty hard to get rid of with controls alone. So to get at causality, can’t we just AB test all the potential first-month experiences and see which ones make users stickier over time?’ We could in theory, yes, but building and testing all the things would be pretty inefficient.

Photo cred: Em Sands, late at night, on memegenerator.com

When used correctly, experiments are a powerful tool, but it simply isn’t feasible to test everything. By first applying econometric techniques to historical observational data, we can identify the experiences most likely to be powerful, and then build and run experiments on just those. Think of it as hypothesis-driven testing on steroids: causal inference on historical data helps us narrow down the hypothesis set worth experimenting on — and can also surface hypotheses we might not have thought of on our own.

Of course, controlled regression is just one approach. There is a range of other, more sophisticated econometric approaches for estimating causal relationships from natural variation, including regression discontinuity design, difference-in-differences, fixed effects modeling, and instrumental variables modeling. (See our previous post, 5 Tricks When AB Testing Is Off The Table, for an overview.)

The Challenge & Opportunity

While each of these econometric methods is powerful for uncovering causal relationships, one of the major challenges with the standard techniques is that model and variable selection are relatively unprincipled.

Photo cred: Dennis Kummer

And “relatively unprincipled” becomes a bigger deal in a world with millions of observations, just as many features, and a whole host of seemingly reasonable counterfactuals. Data scientists can still build their models in an ad hoc way, but that becomes more and more impractical as the data scales and, as we’ll see in the next post, is prone to suboptimal results and decisions. Moreover, we might aspire to build these decision systems to iterate without any human intervention at all — and for that to happen, the machine has to be able to make these decisions in a principled fashion on its own.

The good news is that methods for principled model and feature selection are core to machine learning. And in recent years academic visionaries like Victor Chernozhukov, Susan Athey, Guido Imbens, Alberto Abadie, and others have been carefully developing adaptations that marry the best of what machine learning has to offer with the causal inference applications that matter most.

In future posts, we’ll cover ML methods for instrumental variable selection, heterogeneous treatment effects, synthetic controls, and more. But first, let’s jump right into two simple but powerful ML adaptations that can accelerate your decision-making — including how quickly you can resolve AB tests: Double Selection and Double Machine Learning, both for principled covariate selection (Part 2 of this series!).
