Designing Experimentation Guardrails
Introducing the Experiment Guardrails framework we implemented at Airbnb, which helps us prevent negative impact on key metrics while experimenting at scale.
Each week, thousands of online experiments run concurrently on the Airbnb platform to measure the impact of potential product changes, with each experiment monitoring tens of metrics. When making launch decisions, teams often focus on different evaluation criteria: for example, the Trust team prioritizes Fraud Identification, while the Experiences team may prioritize discovery of the Online Experiences product on our homepage. Experiments that positively impact one team’s metrics can also harm another team’s metrics, and it’s not always obvious how to weigh these trade-offs. For example, if house rules are not displayed in Checkout, we might see an increase in bookings but lower ratings. In the worst case, a team might discover that another team recently launched a treatment that does significant harm to one of their key metrics without a proper analysis of the trade-offs, requiring a rollback of the new changes.
Introducing the Experiment Guardrails System
To help our dozens of experiment-running teams ensure their launches don’t harm our most important metrics, we rolled out a company-wide Experiment Guardrails system in 2019. This system helps protect key metrics by identifying potentially negative impacts prior to launch. If a team wishes to launch an experiment that has “triggered” a guardrail (that is, where our guardrails system has found that the treatment negatively impacts a key metric, or is underpowered to ensure it does not have a substantial risk of negatively impacting a key metric), they will go through an escalation process first, where stakeholders can discuss the results transparently.
Selecting Metrics to Protect
Guardrail Metrics are ones that are important to the company as a whole. While a feature does not necessarily need to improve a Guardrail Metric to be considered successful, all launches are meant to avoid having substantial negative impact on the guardrails.
We found that useful Guardrail Metrics tend to fit into one of three categories:
- Key business/financial metrics that represent overall company performance — for example, revenue.
- User experience metrics that capture how it feels to use the product — for example, bounce rate or page load speed.
- Strategic priority metrics that focus on areas of strategic importance to the company — for example, Seats booked for Experiences. These may change over time as the company’s strategy evolves.
While it may be tempting to guardrail every team’s favorite metric, more guardrail metrics are not necessarily better: there is a trade-off between how many metrics are protected, how thoroughly they are protected, and how much friction is added to the product development process. For example, if we choose 50 metrics and alert on any degradation that is significant at the 0.05 level, then we would see at least one false alert about 92% of the time in an A/A test (1 - 0.95^50 ≈ 0.92, treating the metrics as independent).
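The arithmetic behind that figure can be sketched in a few lines of Python. This is a minimal illustration, assuming the metrics are independent and that each one alerts with probability 0.05 when there is no true effect; the function name is hypothetical, and correlated metrics would generally produce a lower rate.

```python
def prob_at_least_one_false_alert(num_metrics: int, alpha: float) -> float:
    """Chance that at least one of `num_metrics` independent metrics
    raises a false alert when each alerts with probability `alpha`."""
    return 1.0 - (1.0 - alpha) ** num_metrics


# How quickly the experiment-level false alert rate grows with metric count.
for n in (5, 10, 50):
    print(n, round(prob_at_least_one_false_alert(n, 0.05), 3))
```

Even a modest number of guardrail metrics pushes the experiment-level false alert rate well above the per-metric level, which is why the metric list is worth keeping short.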
Defining The Three Guardrails
Our system consists of three guardrails that each must be passed individually for an experiment to launch without escalation:
- The Impact Guardrail requires that the global average treatment effect is not more negative than a preset threshold. This guardrail protects against large negative effects regardless of statistical significance.
- The Power Guardrail ensures the experiment has been exposed to enough users so the Impact Guardrail has a reasonable false positive escalation rate and power.
- The Stat Sig Negative Guardrail provides additional protection for metrics where any statistically significant negative impact — even if it’s small in magnitude — would warrant escalation.
We’ll walk through each of these in detail below:
The Impact Guardrail
The Impact Guardrail escalates an experiment if:
percent_change < -t
where percent_change is the relative change of the means and t is the escalation threshold. For example, if t is 0.5%, an experiment that has an impact more negative than -0.5% will be escalated.
Note that throughout this blog post, we assume we are working with metrics where an increase is desired (e.g., revenue). For metrics where a decrease is desired, the relationship with percent_change should be flipped (e.g., customer service tickets should escalate if percent_change > t).
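As a rough sketch, the Impact Guardrail check, including the flipped direction for metrics where a decrease is desired, might look like the following. The function and parameter names are hypothetical and not part of Airbnb’s system; values are expressed as fractions (0.005 for 0.5%).

```python
def impact_guardrail_escalates(
    percent_change: float, t: float, increase_is_good: bool = True
) -> bool:
    """Return True if the observed relative change breaches the threshold t.

    percent_change and t are fractions, e.g. 0.005 means 0.5%.
    """
    if t <= 0:
        raise ValueError("t must be positive")
    if increase_is_good:
        # e.g. revenue: escalate on a large enough drop
        return percent_change < -t
    # e.g. customer service tickets: escalate on a large enough rise
    return percent_change > t


print(impact_guardrail_escalates(-0.007, 0.005))                         # drop of 0.7%
print(impact_guardrail_escalates(-0.003, 0.005))                         # drop of 0.3%
print(impact_guardrail_escalates(0.007, 0.005, increase_is_good=False))  # rise of 0.7%
```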
The Power Guardrail
The Power Guardrail impacts experiment runtime. It requires the standard error for our estimate of the percent change to satisfy:
StdError < 0.8 * t
This is to ensure the Impact Guardrail has reasonably good power and false positive rate (FPR). If an experiment just meets the power guardrail for a single metric, it will have the following profile for that metric:
If an experiment runs longer than is required by the Power Guardrail, the FPR gets smaller and the power to detect an impact of -0.8*t gets larger, since the standard error gets smaller. The more metrics an experiment includes, the larger the experiment-level FPR will be at any given runtime.
The constant 0.8 can be adjusted down if you want a tighter set of Power and FPR requirements, or adjusted up if you can tolerate lower Power and higher FPR. Keep in mind that the lower the constant, the longer experiments must run to pass the Power Guardrail. A good way to evaluate the practicality of your Power Guardrail is to run a backtest: What percent of experiments launched in the past 6 months would have passed the guardrail as-is? For the ones that did not pass, how much longer would they have needed to run? You’ll want to consider how implementing your Power Guardrail might affect experiment run times across your organization.
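To get a feel for how the constant affects escalation rates, here is a minimal sketch. It assumes the percent-change estimate is approximately normal around the true effect with the given standard error, and it reads "false positive" as the Impact Guardrail triggering when the true effect is zero. Both are assumptions of this illustration rather than a description of Airbnb’s exact calculation, and the function name is hypothetical.

```python
from statistics import NormalDist


def escalation_probability(true_effect: float, t: float, std_error: float) -> float:
    """P(observed percent change < -t) under a normal approximation."""
    return NormalDist(mu=true_effect, sigma=std_error).cdf(-t)


t = 0.005  # 0.5% escalation threshold (example value)
for constant in (1.0, 0.8, 0.6):
    se = constant * t  # experiment that just meets StdError < constant * t
    fpr = escalation_probability(0.0, t, se)          # no true effect
    strong = escalation_probability(-2 * t, t, se)    # true effect of -2t
    print(f"constant={constant:.1f}  P(escalate | 0)={fpr:.3f}  P(escalate | -2t)={strong:.3f}")
```

With the constant at 0.8, a true effect of zero escalates about 10.6% of the time under this approximation. Lowering the constant reduces that rate, at the cost of the longer runtimes discussed above.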
The Stat Sig Negative Guardrail
The Stat Sig Negative Guardrail escalates an experiment if it shows a statistically significant negative impact on certain metrics:
percent_change < 0 and p-value < 0.05
Most likely, you won’t want to apply the Stat Sig Negative Guardrail to all metrics, as some may not warrant escalation for a small negative impact. For example, a 0.1% degradation (increase) in page load time is undesirable, but probably not worth escalating. On the other hand, a 0.1% degradation (decrease) in revenue at a company the size of Airbnb could potentially translate to millions of dollars. This guardrail is an extra safeguard for your top metric(s) where even a small negative impact should be surfaced.
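A minimal sketch of this check, assuming a two-sided z-test on the percent change (the post does not specify which test is used, so the p-value calculation here is an assumption, as are the function and parameter names):

```python
from statistics import NormalDist


def stat_sig_negative_escalates(
    percent_change: float, std_error: float, alpha: float = 0.05
) -> bool:
    """Escalate when the estimate is negative and statistically significant."""
    if std_error <= 0:
        raise ValueError("std_error must be positive")
    z = percent_change / std_error
    # Two-sided p-value under a normal approximation (assumed).
    p_value = 2 * NormalDist().cdf(-abs(z))
    return percent_change < 0 and p_value < alpha


print(stat_sig_negative_escalates(-0.001, 0.0004))  # small but significant drop
print(stat_sig_negative_escalates(-0.001, 0.001))   # same drop, too noisy to call
print(stat_sig_negative_escalates(0.002, 0.0005))   # significant, but positive
```

Note that the first case escalates even though a 0.1% drop would sit well inside a typical Impact Guardrail threshold, which is exactly the situation this guardrail is meant to catch.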
Adjusting for Global Coverage
We define the global coverage of an experiment to be the percentage of Airbnb visitors that are assigned to the experiment. If all experiments had the same Power Guardrail threshold, low-coverage experiments would have a harder time passing it. To allow all experiments to pass the Power Guardrail in roughly the same time, we allow t to vary with global coverage (measured as the percentage of the total value of the metric that is covered by subjects assigned to the experiment) in the following manner:
We set an Escalation Parameter T for each metric, which represents the percent change that will always trigger an experiment with 100% global coverage. We then let the escalation threshold t vary depending on coverage:
t = T / sqrt(global coverage)
Because the Impact Guardrail also uses the coverage-adjusted t, all experiments that just pass the Power Guardrail will see the same power / FPR profile as outlined in the Power Guardrail section.
In the table below, you can see how escalation thresholds for percent change and global impact vary by coverage. As coverage decreases, the threshold on percent change increases. This is desirable, as it makes the Power Guardrail pass rate similar across coverage levels. The threshold for global impact, on the other hand, decreases with decreasing coverage. This is also desirable, as we should be tougher on lower-coverage experiments in terms of global impact.
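The behavior described here can be illustrated with a short sketch. It assumes an inverse-square-root adjustment, t = T / sqrt(coverage), which keeps the time to pass the Power Guardrail roughly constant because the standard error shrinks with the square root of sample size. It also assumes that global impact equals percent change times coverage. Both forms, the function names, and the value of T are assumptions made for illustration.

```python
import math


def coverage_adjusted_threshold(T: float, coverage: float) -> float:
    """Escalation threshold t on percent change for a given global coverage."""
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    return T / math.sqrt(coverage)


def global_impact_threshold(T: float, coverage: float) -> float:
    """Threshold expressed as impact on the global metric (assumed t * coverage)."""
    return coverage_adjusted_threshold(T, coverage) * coverage


T = 0.005  # example Escalation Parameter of 0.5%
print(f"{'coverage':>8} {'t':>10} {'global':>10}")
for c in (1.0, 0.5, 0.25, 0.1):
    t = coverage_adjusted_threshold(T, c)
    g = global_impact_threshold(T, c)
    print(f"{c:>8.0%} {t:>10.3%} {g:>10.3%}")
```

Under these assumptions, an experiment covering 25% of the metric gets a percent-change threshold twice as large as T, while its allowed global impact is half of T.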
Putting It All Together
All together, the three Guardrails protect our most important metrics by escalating negative impacts that we’ve deemed meaningful, while ensuring we have appropriate power to detect them. Visually, we can see the guardrails at work across point estimate and standard error for an example metric:
The Power Guardrail is represented as the horizontal line, and an experiment with a larger standard error would need to continue running until StdError < 0.8 * t. An experiment owner can also choose to escalate prior to reaching the required standard error, if the estimated run time is too long.
The Impact Guardrail is represented as the vertical line at -t (here, t = 0.5%), and an experiment with an impact more negative than -0.5% would require escalation prior to launch.
Finally, the Stat Sig Negative Guardrail is represented as the diagonal line, where the percent change is negative and p-value = 0.05. Metrics with this guardrail enabled will escalate experiments with a statistically significant negative effect.
For all remaining experiments, we are comfortable that there is low risk of meaningful negative impact, and they are able to launch without escalation.
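The three checks can be combined into a single function, sketched below. This is an illustration under assumptions rather than Airbnb’s implementation: it uses a two-sided z-test for the Stat Sig Negative Guardrail, takes the coverage-adjusted threshold t as an input, and all names are hypothetical.

```python
from statistics import NormalDist


def triggered_guardrails(
    percent_change: float,
    std_error: float,
    t: float,
    stat_sig_enabled: bool = False,
    power_constant: float = 0.8,
    alpha: float = 0.05,
) -> list:
    """Return the names of the guardrails an experiment triggers for one metric."""
    triggered = []
    if percent_change < -t:
        triggered.append("impact")
    if std_error >= power_constant * t:  # fails StdError < 0.8 * t
        triggered.append("power")
    if stat_sig_enabled:
        p_value = 2 * NormalDist().cdf(-abs(percent_change / std_error))
        if percent_change < 0 and p_value < alpha:
            triggered.append("stat_sig_negative")
    return triggered


print(triggered_guardrails(0.001, 0.002, 0.005))                           # launches freely
print(triggered_guardrails(0.001, 0.006, 0.005))                           # underpowered
print(triggered_guardrails(-0.006, 0.002, 0.005, stat_sig_enabled=True))   # harmful
```

An empty list corresponds to the region of the chart where an experiment can launch without escalation.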
Refining the guardrails — Cases with automatic approval
To refine the set of experiments that require escalation, we have carved out some cases where we automatically approve experiments even if they trigger one of the guardrails above. Here are two major refinements we have made:
- For some metrics where it’s easy for us to reach statistical significance, but where only large changes in the metric are material for our business, we ignore the Stat Sig Negative Guardrail and only apply the Power and Impact Guardrails.
- For experiments that have not yet passed the Power Guardrail, but have positive point estimates, we allow these to pass without escalation if they pass a noninferiority test with the same threshold as our Power Guardrail (which you could view as a relaxation of the Power Guardrail). In particular, we let an experiment pass if the lower bound of the confidence interval satisfies: lower bound > -0.8 * t
This allows treatments with positive point estimates to pass before reaching the power guardrail.
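A rough sketch of such a noninferiority check follows. It assumes a two-sided 95% normal-approximation confidence interval and a margin of -0.8 * t, matching the Power Guardrail constant. The confidence level, the interval construction, and the names are assumptions for illustration.

```python
from statistics import NormalDist


def passes_noninferiority(
    percent_change: float,
    std_error: float,
    t: float,
    power_constant: float = 0.8,
    confidence: float = 0.95,
) -> bool:
    """Auto-approve a positive estimate whose CI lower bound clears the margin."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    lower_bound = percent_change - z * std_error
    return percent_change > 0 and lower_bound > -power_constant * t


# Both experiments fail the Power Guardrail (std error 0.45% >= 0.8 * 0.5%).
print(passes_noninferiority(0.006, 0.0045, 0.005))  # strongly positive estimate
print(passes_noninferiority(0.001, 0.0045, 0.005))  # weakly positive estimate
```

The first experiment can launch early because even the pessimistic end of its interval stays above the margin. The second has to keep running or go through escalation.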
Making It Your Own
This Guardrails system is extremely configurable, and the parameters can be adjusted to fit your needs. We’ve discussed how you can adjust the 0.8*t multiplier and refine the guardrails in some cases.
The most important decision is setting Parameter T for each Guardrail metric, which is as much an art as it is a science. A tighter T will allow you to catch smaller negative impacts, but it will also make the Power Guardrail harder to pass. You should set T as the larger of the two between “What impact is worth escalating for?”, and “What impact is feasible to detect?” This ensures that the guardrails neither require egregious runtime, nor waste escalations on low-magnitude impacts.
Another factor to consider is how many Guardrail metrics you have. The more metrics you cover, the higher the overall escalation rate (and false positive rate) will be. If your organization experiments often, there will be a limit to the number of experiments that can be realistically reviewed. Requiring escalations slows down the speed at which you iterate and launch, so you want to find a balance between protecting your metrics and moving quickly. Once again, the best way to decide is to look at historical data: evaluate different sets of metrics and T values by estimating the resulting overall escalation rates, and decide what configuration is best for your organization’s needs.
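A backtest of this kind can start very simply. The sketch below replays a handful of made-up historical results for one metric against several candidate T values and reports the share that would have triggered the Impact or Power Guardrail. The data is invented for illustration, the function name is hypothetical, and coverage adjustment is ignored (every experiment is treated as having 100% global coverage).

```python
def escalation_rate(history, T: float, power_constant: float = 0.8) -> float:
    """Share of past experiments that would trigger the Impact or Power Guardrail.

    history: iterable of (percent_change, std_error) pairs for one metric.
    """
    results = list(history)
    escalated = sum(
        1 for pc, se in results if pc < -T or se >= power_constant * T
    )
    return escalated / len(results)


# Made-up (percent_change, std_error) pairs standing in for real history.
history = [
    (0.002, 0.001),
    (-0.004, 0.002),
    (-0.001, 0.005),
    (0.000, 0.003),
    (-0.008, 0.002),
]

for T in (0.0025, 0.005, 0.01):
    print(f"T={T:.2%}  escalation rate={escalation_rate(history, T):.0%}")
```

Running the same loop over several candidate metric sets gives a first estimate of the review load each configuration would create.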
Experimentation at scale can be challenging, especially when multiple teams may be focused on different goals. We introduced an Experimentation Guardrails system to help bring visibility into Airbnb’s most important metrics and to protect them from potentially harmful launches. The system flags roughly 25 experiments per month for escalation/review. Of these, 80% eventually launch after discussion between stakeholders and additional analysis, and 20% (5 per month) are stopped before launch. Our configurable system allows us to balance safeguarding our metrics and maintaining a nimble product development process, and we hope this can be a useful reference for thinking through guardrail systems of your own. Happy experimenting!