Artificial Counterfactual Estimation (ACE): Machine Learning-Based Causal Inference at Airbnb

The Airbnb Tech Blog · March 16, 2022

By: Zhiying Gu, Qianrong Wu

Summary

What if you wanted to measure the impact of a change to your business, but it was not possible to run a randomized controlled experiment? That’s exactly the problem we faced when measuring the benefit of a new tool used by Airbnb operations to automate part of their workflow. Due to organizational constraints, it was simply not possible to randomly assign the tool to operations agents; even if we could make random assignments, the sample sizes were too small to generate sufficient statistical power. So what did we do? We imagined a parallel universe in which the operations agents who did not use the new tool were identical in all respects to those who did–in other words, a world in which the assignment criteria were as good as random. In this blog post, we explain this new methodology, called ACE (Artificial Counterfactual Estimation), which leverages machine learning (ML) and causal inference to artificially reproduce the “counterfactual” scenario produced by random assignment. We’ll explain how this works in practice, why it is better than other methods such as matching and synthetic control, and how we overcame challenges associated with this method.

The Non-Randomizable Operations Problem

There are two key assumptions undergirding randomized controlled experiments (often referred to as “A/B tests”):

  1. The treatment and control groups are similar. When you have similar groups, outcomes are independent of group attributes such as age, gender, and location, meaning that any difference between the groups can be attributed to a treatment that was received by one group but not the other. In statistical terms, we assume that we have controlled all confounders, thereby reducing the bias of our estimates.
  2. The sample sizes are sufficiently large. Large sample sizes serve to reduce the magnitude of chance differences between the two randomized groups, giving us confidence that the treatment has a true causal impact. In technical lingo, we assume that we have reduced the variance of our estimates enough to give us appropriate statistical power.

Given the need for similar groups and large sample sizes when running A/B tests, any organization with operational teams presents challenges. To start, there are general concerns about unfairness and a disruptive experience when running randomized experiments on operations agents. Second, the operational sites are located in different countries with varying numbers of employees, skill levels, and so on, so we cannot simply assign certain geographies to treatment and others to control without making an apples-to-oranges comparison, which would bias the measurement. Finally, we have millions of customers but not millions of operations agents, so the sample size for this test will always be much smaller than that of other experiments.

ACE to the Rescue

With ACE (Artificial Counterfactual Estimation), we have the next best thing to a randomized experiment. The trick is to achieve bias reduction and variance reduction at the same time using a machine learning-based causal impact estimation technique.

Causal inference is a process of estimating the counterfactual outcome that would have occurred had the treated units not been treated. In our case, we want to know how productive our operations agents would have been had they not used the new workflow automation tool. There are many ways to construct such a counterfactual outcome (for example, matching and synthetic control); ACE constructs it with machine learning prediction.

We construct the counterfactual outcome by ML prediction using both confounding and non-confounding factors as features. In a nutshell, we use a holdout group (i.e., the group not treated) to train an ML model that predicts the untreated outcome in the post-treatment period. We then apply the trained model to the treated group for the same period. The predicted outcome serves as the counterfactual (new control), representing the imagined scenario in which the treatment group had not been treated in the post-treatment period (Y'' in the equation below):

t = Y − Y''

In the equation above, t is the difference between the observed treatment group outcome (Y) and the predicted counterfactual outcome (Y''). It is a naive estimate of the impact because, as we discuss below, the ML prediction is biased. Figure 1 illustrates ACE at a high level, with the following steps:

  1. We train a machine learning model using data from a hold out group, i.e. a group without treatment.
  2. We apply the trained model on the treatment group to obtain the predicted outcome had we not applied treatment on this group.
  3. The difference between the actual and the predicted outcome for the treatment group is the estimated impact.

We will flesh out the detailed challenges of this approach in the next section, before turning to its application.

Figure 1: Estimation Process
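As a minimal sketch of these three steps (the column names, features, and model below are illustrative assumptions, not the exact production setup), the estimation could look like this in Python:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Illustrative column names only; the real features and outcome are not specified here.
FEATURES = ["tenure_months", "tickets_per_week_pre", "avg_handle_time_pre"]
OUTCOME = "tickets_per_week_post"

def ace_naive_estimate(holdout: pd.DataFrame, treated: pd.DataFrame) -> float:
    """Naive ACE estimate t = Y - Y'' (before bias correction)."""
    # Step 1: train an ML model on the holdout (untreated) group.
    model = GradientBoostingRegressor(random_state=0)
    model.fit(holdout[FEATURES], holdout[OUTCOME])

    # Step 2: predict the counterfactual outcome Y'' for the treated group,
    # i.e., what their outcome would have been without the tool.
    y_counterfactual = model.predict(treated[FEATURES])

    # Step 3: the difference between the actual and predicted outcomes is the
    # (naive, still potentially biased) estimated impact.
    return (treated[OUTCOME] - y_counterfactual).mean()
```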

Challenges of ACE, and Solutions

There are two major challenges in developing ACE: bias estimation and construction of confidence intervals.

Challenge 1: Bias estimation

The predicted outcome Y'' from machine learning models is often biased, which causes the estimated causal impact t to be biased as well (see Chernozhukov et al. (2018)). The two main sources of bias are 1) regularization and 2) overfitting.

The figure below shows the ML model prediction error on 100 synthetic A/A tests, for which the estimated impact should always be zero. Clearly, however, the distribution of estimates is not centered around zero. The average prediction error is actually 2%, meaning that the ML prediction Y’’ is, on average, overestimated by 2%.

Figure 2: Prediction Bias

Challenge 2: Construction of Confidence Intervals

Unlike in a traditional t-test for A/B testing, there is no analytical solution for confidence intervals when we are doing ACE. As a result, we have to construct empirical confidence intervals for the estimates. To address these two challenges, we took an empirical approach to removing bias from the prediction and then constructed our confidence intervals based on that same empirical approach.

In ACE, we use A/A tests both for debiasing and for constructing confidence intervals.

Solution to Challenge 1: Debias

One natural idea is that if we can confidently estimate the magnitude of the bias, we can simply adjust the prediction by the estimated bias. The estimation then becomes:

t_corrected = (Y − Y'') − bias

where bias is the average estimate from repeated A/A tests. Practitioners can freely choose any machine learning model f(X) for the prediction of Y''. Figure 2 shows a 2% bias across 100 A/A samples. The question is: can we say the true bias is 2%? If we can verify that the bias is systematically 2% (i.e., consistent across different A/A samples in the same period and repeatable across different time periods), we can set bias = 2%. Figure 3 shows the repeatability of the bias estimation over time: the estimates are always biased upwards, and the average estimated bias is around 2%. Figure 4 shows the prediction error after removing the 2% bias. With bias correction, the distribution of estimated impacts is centered around zero.

Figure 3: The stability of the bias estimate over time
Figure 4: Distribution of impact estimates from A/A tests after bias correction
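A minimal sketch of how such A/A-based bias estimation could work, reusing the illustrative column names and model from the earlier sketch: repeatedly split the untreated population into a synthetic control and a synthetic "treatment" group, estimate the "impact" (whose true value is zero), and take the average as the bias estimate.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

FEATURES = ["tenure_months", "tickets_per_week_pre", "avg_handle_time_pre"]  # illustrative
OUTCOME = "tickets_per_week_post"

def aa_estimates(untreated: pd.DataFrame, n_sims: int = 100, seed: int = 0) -> np.ndarray:
    """Estimated 'impacts' from synthetic A/A tests. The true impact is zero,
    so the mean of these estimates is an estimate of the model's bias."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_sims):
        # Split the untreated group into a synthetic control and a pseudo-treatment group.
        control, pseudo_treated = train_test_split(
            untreated, test_size=0.5, random_state=int(rng.integers(1_000_000)))
        model = GradientBoostingRegressor(random_state=0)
        model.fit(control[FEATURES], control[OUTCOME])
        y_pred = model.predict(pseudo_treated[FEATURES])
        estimates.append((pseudo_treated[OUTCOME] - y_pred).mean())
    return np.array(estimates)

# bias_hat = aa_estimates(holdout).mean()
# t_corrected = t_naive - bias_hat  # bias-corrected impact estimate
```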

Solution to Challenge 2: Construct Empirical Confidence Intervals

We can use data from A/A tests to construct empirical confidence intervals and p-values.

  • Empirical confidence interval: the 95% confidence interval is constructed from the distribution of estimates across 100 bootstrapped A/A samples. Since we know the true difference in an A/A test is zero, if 5% of the estimated impacts from the 100 A/A tests fall outside the range [-0.2, 0.2], then the 95% confidence interval is [-0.2, 0.2].
  • Empirical p-value: we can estimate the Type I error via the A/A estimates from the ML models as follows. Suppose we estimate a 3% impact for the treatment. The p-value is the probability of obtaining an estimate at least as extreme as 3% when the null hypothesis (there is no impact) is true. This probability is estimated with the empirical distribution of the repeated A/A tests. If, say, 1% of the A/A estimates exceed 3%, the two-sided p-value is about 2% (i.e., 1% * 2), and we have at least 98% confidence that the alternative hypothesis (the impact is not zero) is true.
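Continuing the illustrative sketch, the bias-corrected A/A estimates can be turned into a quantile-based empirical confidence interval and p-value roughly as follows (the exact construction used in production may differ):

```python
import numpy as np

def empirical_ci(aa_est: np.ndarray, bias_hat: float, alpha: float = 0.05):
    """Confidence interval from the bias-corrected A/A distribution,
    whose true impact is zero by construction."""
    centered = aa_est - bias_hat
    return tuple(np.quantile(centered, [alpha / 2, 1 - alpha / 2]))

def empirical_p_value(aa_est: np.ndarray, bias_hat: float, observed_impact: float) -> float:
    """Two-sided p-value: the share of bias-corrected A/A estimates at least
    as extreme as the observed impact, under the null of no impact."""
    centered = aa_est - bias_hat
    return float(np.mean(np.abs(centered) >= abs(observed_impact)))
```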

Validation

To validate whether ACE can accurately measure the impact, we also applied ACE to data from a large-scale randomized A/B test and compared the ACE results with the A/B test results. The A/B test result serves as the ground truth for validation, since A/B testing is the gold standard for measurement. The results were nearly identical.

Advantages of ACE

There are several advantages of ACE over other estimation methods:

  • It is flexible in the choice of estimation model. We can freely choose any cutting-edge ML model to achieve the desired level of accuracy, based on the use case and data properties.
  • Its validity and accuracy can be easily assessed during the design phase of the measurement plan by conducting A/A tests.
  • It can be applied to experimental data for variance reduction, and to non-experimental data for both bias correction and variance reduction.
  • For experimental data:
    - It is less prone to bias than regression adjustment.
    - It has more power than stratification when the ML model performs well.
    - Compared with rank tests, it estimates the magnitude of the impact rather than only its existence.

You’ll recall that we applied ACE to estimate the incremental benefit of a tool that helps operations agents automate part of their workflow. We generated p-values for three different measurement methodologies: (1) a classic t-test; (2) a non-parametric rank test; and (3) the ACE non-parametric test based on the empirical confidence interval described in the previous section. The following is a performance comparison of the t-test, rank test, and ML-based method for the same sample size, in particular when the sample size is small and we try to conduct inference with a classic t-test as we do in A/B testing.

Recap

In this blog post, we explained how one can leverage ML for counterfactual prediction, using an estimation problem for the efficacy of an agent tool as our motivating example.

Combining statistical inference and machine learning methods is a powerful approach when it’s not possible to run an A/B test. However, as we have seen, it can be dangerous to apply ML methodologies if intrinsic model bias is not addressed. This post outlined a practical and reliable way to correct for this intrinsic bias, while minimizing Type I error relative to competing methods.

Currently, we are working to turn our code template into an easy-to-use Python package that will be accessible to all data scientists within the company.

If this type of work interests you, check out some of our related positions!

Senior Data Scientist — Payments

Acknowledgments

Thanks to Alex Deng and Lo-hua Yuan for providing feedback on the development of ACE and spending time reviewing the work. We would also like to thank the Airbnb Experiment Review Committee members for their feedback and comments. Last but not least, we really appreciate Joy Zhang and Nathan Triplett for their guidance, as well as the feedback and support from Tina Su, Raj Rajagopal, and Andy Yasutake.

References

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/Debiased Machine Learning for Treatment and Structural Parameters. The Econometrics Journal, 21(1), C1–C68.


****************

All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, logos, and brands does not imply endorsement.
