How Airbnb Measures Future Value to Standardize Tradeoffs

The propensity score matching model powering how we optimize for long-term decision-making

Introducing Future Incremental Value (FIV)

We are interested in the long-term causal effect or “future incremental value” (FIV) of an action or event that occurs on Airbnb. We define “long-term” as 1 year, though our framework can adjust the time period to be as short as 30 days or as long as 2 years.

The Science Behind FIV

To minimize selection bias in estimating the FIV of an action, we need to compare observations from users or listings that are similar in every way except for whether or not they took or experienced an action. The well-documented, quasi-experimental methodology we have chosen for this problem is propensity score matching (PSM). We start by separating users or listings into two groups: observations from those that took the action (“focal”) during a given timeframe and observations from those that did not (“complement”). Using PSM, we construct a “counterfactual” group, a subset of the complement that matches the characteristics of the focal as much as possible, except that these users or listings did not take the action. The assumption is that, once observations are matched on their observed characteristics, “assignment” into focal versus counterfactual is as good as random.

Figure 1. Overview of methodology behind FIV
  1. Generate the Propensity Score: Using a set of pre-treatment or control features describing attributes of the user or listing (e.g., number of past searches), we build a binary, tree-based classifier to predict the probability that the user or listing took the action. The output here is a propensity score for each observation.
  2. Trim for Common Support: We remove from the dataset any observations that have no “matching twin” in terms of propensity score. After splitting the distribution of propensity scores into buckets, we discard observations in buckets where either the focal or the complement has little representation.
  3. Match Similar Observations: To create the counterfactual, we use the propensity score to match each observation in the focal to a counterpart in the complement. Various matching strategies can be used, such as matching in bins or via nearest neighbors.
  4. Results: To get the FIV, we compute the average of the outcome or target feature in the focal minus the average in the counterfactual.
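
Steps 2 through 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production implementation: it assumes the propensity scores from step 1 have already been computed, represents each observation as a hypothetical (propensity_score, outcome) tuple, and uses equal-width score buckets plus nearest-neighbor matching with replacement. The function names, the bucket count, and the min_count threshold are all made up for this example.

```python
from bisect import bisect_left
from statistics import mean


def trim_common_support(focal, complement, n_buckets=10, min_count=1):
    """Step 2: keep only score buckets where both groups are represented.

    Each observation is a (propensity_score, outcome) tuple with a score in [0, 1].
    """
    def bucket(score):
        # Equal-width buckets; a score of exactly 1.0 falls in the last bucket.
        return min(int(score * n_buckets), n_buckets - 1)

    focal_counts = [0] * n_buckets
    comp_counts = [0] * n_buckets
    for score, _ in focal:
        focal_counts[bucket(score)] += 1
    for score, _ in complement:
        comp_counts[bucket(score)] += 1

    keep = {b for b in range(n_buckets)
            if focal_counts[b] >= min_count and comp_counts[b] >= min_count}
    return ([o for o in focal if bucket(o[0]) in keep],
            [o for o in complement if bucket(o[0]) in keep])


def match_nearest(focal, complement):
    """Step 3: pair each focal observation with the complement observation
    whose propensity score is closest (matching with replacement)."""
    comp_sorted = sorted(complement)
    comp_scores = [score for score, _ in comp_sorted]
    matched = []
    for score, _ in focal:
        i = bisect_left(comp_scores, score)
        # Only the neighbors on either side of the insertion point can be nearest.
        candidates = comp_sorted[max(i - 1, 0):i + 1]
        matched.append(min(candidates, key=lambda o: abs(o[0] - score)))
    return matched


def estimate_fiv(focal, complement, n_buckets=10):
    """Step 4: mean focal outcome minus mean counterfactual outcome.

    Raises an error if trimming leaves either group empty.
    """
    focal, complement = trim_common_support(focal, complement, n_buckets)
    counterfactual = match_nearest(focal, complement)
    return mean(y for _, y in focal) - mean(y for _, y in counterfactual)


# Toy data: the focal observation at 0.95 and the complement observation
# at 0.31 have no counterpart in their score bucket and are trimmed.
focal = [(0.15, 10.0), (0.55, 20.0), (0.95, 30.0)]
complement = [(0.12, 4.0), (0.58, 12.0), (0.31, 7.0)]
fiv = estimate_fiv(focal, complement)  # 15.0 - 8.0 = 7.0
```

Matching with replacement keeps the sketch short; matching in bins or without replacement would change only match_nearest.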

Evaluation

In a supervised machine learning problem, as more data becomes available and future outcomes are actualized, the model is either validated or revised. This is not the case for FIV. The steps above give us an estimate of the incremental impact of an action, but the “true” incremental impact is never revealed. In this world, how do we evaluate the success of our model?

Adapting FIV for Airbnb

While PSM is a well-established method for causal inference, we must also address several additional challenges, including the fact that Airbnb operates in a two-sided marketplace. Accordingly, the FIV platform must support computation from both the guest and the listing perspective. Guest FIV estimates the impact of actions based on activity a guest generates on Airbnb after experiencing an action, while listing FIV estimates that impact through the lens of a listing. We are still in the process of developing a “host-level” FIV. One challenge in doing so will be sample size: we have fewer unique hosts than listings.

The Platform Powering FIV

FIV is a data product, and its clients are other teams within Airbnb. We provide an easy-to-use platform to organize, compute, and analyze actions and FIVs at scale. As part of this, we have built components that take input from the client, construct and store the necessary data, productionize the PSM model, compute FIVs, and output the results. The machinery, orchestrated through Airflow and invisible to the client, looks as follows:

Figure 2. Overview of FIV Platform

Client Input

Use cases begin with a conversation with the client team to understand the business context and technical components of their desired estimate. An integral part of producing valid and useful FIV estimates is establishing well-defined focal and complement groups. Additionally, there are cases when the FIV tools are not applicable, such as when there is limited observational data (e.g., a new feature) or small group sizes (e.g., a specific funnel or lever).

Figure 3. Example of an FIV config that would be submitted by a client

Data Pipeline

The config triggers a pipeline to construct the focal and complement, join them with control and target features, and store this in the Data Warehouse. Control features will later serve as inputs into the propensity score model, whereas target features will be the outcomes that FIV is computed over. Target features are what allow us to convert actions from different contexts and parts of Airbnb into a “common currency”. This is one of FIV’s superpowers!

Figure 4. Steps to compute the raw data needed for FIV, after taking in client input

Modeling Pipeline

Because the focal and complement groups can be very large and costly to use in modeling, we downsample and use a subset of our total observations. We take multiple samples from the output of our data pipeline and feed each sampling round into our modeling pipeline. Sampling improves our SLA, ensures each group has the same cardinality, and lets us gauge how much sampling noise our estimates carry. We also remove outliers to limit the noisiness of our estimates.
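
The repeated downsampling described above might look something like the following. This is a rough sketch only: the function names, the sample size, the number of rounds, and the use of the standard deviation across rounds as the noise measure are assumptions for illustration, and `estimator` stands in for whatever modeling step is run on each round.

```python
import random
from statistics import mean, stdev


def balanced_samples(focal, complement, sample_size, n_rounds, seed=0):
    """Yield n_rounds pairs of equal-sized random subsets (without replacement)."""
    rng = random.Random(seed)
    # Cap the size so both groups can supply the same number of observations.
    size = min(sample_size, len(focal), len(complement))
    for _ in range(n_rounds):
        yield rng.sample(focal, size), rng.sample(complement, size)


def summarize_rounds(focal, complement, estimator, sample_size, n_rounds, seed=0):
    """Run the estimator on every sampling round and report the average
    estimate along with its spread across rounds (needs n_rounds >= 2)."""
    estimates = [
        estimator(f, c)
        for f, c in balanced_samples(focal, complement, sample_size, n_rounds, seed)
    ]
    return mean(estimates), stdev(estimates)


def difference_in_means(focal_outcomes, complement_outcomes):
    """Placeholder estimator used here in place of the full modeling pipeline."""
    return mean(focal_outcomes) - mean(complement_outcomes)
```

A large spread across rounds relative to the average estimate suggests the sample size is too small for a stable result.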

Figure 5. Modeling steps needed to compute FIV, after the raw data has been generated

FIVs!

Next, we pull our FIVs into a Superset dashboard for easy access by our clients. FIV point estimates and confidence intervals (estimated by bootstrapping) are based on the last 6 months of available data to smooth over seasonality or month-level fluctuations. We distinguish between the value generated by the action itself (tagged as “Present” below) and the residual downstream value (“Future”) of the action.
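
As an illustration of how a bootstrapped confidence interval for an FIV estimate can be obtained, here is a small percentile-bootstrap sketch. The post does not specify which bootstrap variant is used, so the choice of resampling each group independently, the function name, and the defaults for n_boot and alpha are assumptions made for this example.

```python
import random
from statistics import mean


def bootstrap_ci(focal_outcomes, counterfactual_outcomes,
                 n_boot=1000, alpha=0.05, seed=0):
    """Return (point_estimate, lower, upper) for the difference in means,
    using a percentile bootstrap at confidence level 1 - alpha."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        f = rng.choices(focal_outcomes, k=len(focal_outcomes))
        c = rng.choices(counterfactual_outcomes, k=len(counterfactual_outcomes))
        diffs.append(mean(f) - mean(c))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[min(int((1 - alpha / 2) * n_boot), n_boot - 1)]
    point = mean(focal_outcomes) - mean(counterfactual_outcomes)
    return point, lower, upper
```

If matched pairs are retained from the matching step, resampling whole pairs rather than each group separately is a common alternative.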

Figure 6. Snapshot of the dashboard as seen by clients

FIV as a Product

Airbnb’s two-sided marketplace creates interesting but complicated tradeoffs. To quantify these tradeoffs in a common currency, especially when experimentation is not possible, we have built the FIV framework. This has allowed teams to make standardized, data-informed prioritization decisions that account for both immediate and long-term payoffs.

Acknowledgments

FIV has been an effort spanning multiple teams and years. We’d like to especially thank Diana Chen and Yuhe Xu for contributing to the development of FIV, as well as the teams who have onboarded and placed their trust in FIV.
