Using experiments to tackle a large refactor with confidence

Horace Ko
Mar 1, 2017

In early 2015, the Airbnb Engineering team decided to embrace React as its canonical front-end view framework. We’ve since built a significant amount of tooling around React to make it as pleasant an environment to develop in as possible, and we’ve contributed a lot of these tools back to the open source community.

Unfortunately, because our search page was written using a largely unmaintained framework (Twitter’s Flight) that wasn’t used anywhere else on our site, it couldn’t benefit from these investments in tooling and the accumulation of institutional knowledge around using React. The search page code also lacked a comprehensive suite of tests. As such, it was often much more difficult (read: slower) to work in the search page codebase compared with other codebases of similar size.

In early 2016, we decided to start refactoring the search page into React. The simple UX of the search page belies its complex implementation; often, there are numerous external teams running experiments on the page, not to mention the various locale- and market-specific customizations. This complexity, paired with the thin test coverage, meant that any substantial refactor would likely cause some regressions in behavior. Since the search page is at the top of the guest funnel, we needed to be absolutely certain that regressions would be minimized, to reduce any negative impact to our core business. In this blog post, we talk about how we were able to use experiments to confidently launch the refactored search page.

Enter ERF

Experiment Reporting Framework (ERF) is an in-house tool developed by our Data Tools team which simplifies the tasks of experiment setup, data analysis, and results visualization. It lets us perform split-tests and analyze the impact of each treatment according to various metrics, including ones that tie in to our core business. We also have the ability to segment these metrics by dimensions like locale, country, browser, and platform; this proved to be invaluable in narrowing down regressions, as we’ll talk about later. ERF is used in virtually every product and feature launch at Airbnb.

ERF dashboard for an experiment

The idea here was to use ERF to split-test between the original code and the refactored React code. If the refactored code contained a regression, our hypothesis was that it would have a meaningful impact on key metrics, which we would see on the experiment dashboard. In determining the scope of each experiment, we tried to strike a balance between running fewer experiments (since the data collection phase often took some time), and keeping each experiment change-set as small as possible (to make it easier for us to isolate regressions).
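
As a rough illustration of what split-testing two implementations involves, the sketch below assigns each visitor deterministically to either the original or the refactored code path. It is not ERF's API: `assignTreatment`, `chooseRenderer`, the hash choice, and the experiment name used in the comment are all hypothetical names introduced for this example.

```typescript
// Hypothetical sketch; none of these names come from ERF.
type Treatment = "control" | "refactored";

// FNV-1a-style 32-bit string hash, used only so that assignment is
// deterministic without any external dependency.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Assigns a visitor to a treatment. Salting the hash with the experiment
// name (e.g. "listing_card_refactor") keeps assignments independent
// across experiments.
function assignTreatment(
  experiment: string,
  visitorId: string,
  refactoredShare = 0.5,
): Treatment {
  const bucket = fnv1a(`${experiment}:${visitorId}`) / 0x100000000; // in [0, 1)
  return bucket < refactoredShare ? "refactored" : "control";
}

// Picks which implementation to render for the assigned treatment.
function chooseRenderer<T>(treatment: Treatment, original: T, refactored: T): T {
  return treatment === "refactored" ? refactored : original;
}
```

Because the assignment depends only on the experiment name and the visitor identifier, a returning visitor keeps seeing the same implementation, which is what lets metric differences be attributed to the code path rather than to shuffling between treatments.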

Setting up

Like React, Flight encapsulates behaviors into components and establishes a component hierarchy, and Flight's hierarchy largely overlapped with the one we envisioned for React. This made the refactoring easier: our approach was to reimplement a Flight component in React and split-test between the two using ERF.

Until the Flight components were completely replaced with React components, Flight remained responsible for managing all the data flows, so we also had to build an interoperability layer between Flight and React. This took the form of higher-order component (HOC) wrappers around each refactored component, which translated Flight events to React prop and state changes and vice versa.
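
The translation such a wrapper performs can be sketched without either framework. In the example below, `EventBus` is a hypothetical stand-in for the legacy event system (it is not Flight's real API), and `connectToBus`, `BridgeConfig`, and the `render` callback are likewise assumptions made for illustration. In an actual HOC, `render` would correspond to updating the wrapper's state so that the wrapped component re-renders with new props.

```typescript
// Hypothetical stand-in for the legacy event system; not Flight's real API.
type Handler = (payload: unknown) => void;
type Props = Record<string, unknown>;

class EventBus {
  private handlers: Record<string, Handler[]> = {};

  // Subscribes a handler and returns a function that unsubscribes it.
  on(event: string, handler: Handler): () => void {
    const list = this.handlers[event] || (this.handlers[event] = []);
    list.push(handler);
    return () => {
      this.handlers[event] = this.handlers[event].filter((h) => h !== handler);
    };
  }

  trigger(event: string, payload: unknown): void {
    (this.handlers[event] || []).forEach((h) => h(payload));
  }
}

interface BridgeConfig {
  eventsToProps: Record<string, string>; // legacy event name -> prop name
  callbacksToEvents: Record<string, string>; // callback prop -> legacy event name
}

// Wires a props-driven component to the bus and returns a disconnect function.
function connectToBus(
  bus: EventBus,
  config: BridgeConfig,
  render: (props: Props) => void,
): () => void {
  let props: Props = {};

  // Outbound: each callback prop re-emits its argument as a legacy event.
  Object.keys(config.callbacksToEvents).forEach((callbackProp) => {
    const event = config.callbacksToEvents[callbackProp];
    props[callbackProp] = (payload: unknown) => bus.trigger(event, payload);
  });

  // Inbound: each legacy event replaces one prop and triggers a re-render.
  const unsubscribers = Object.keys(config.eventsToProps).map((event) =>
    bus.on(event, (payload) => {
      props = { ...props, [config.eventsToProps[event]]: payload };
      render(props);
    }),
  );

  render(props); // initial render with only the callback props set
  return () => unsubscribers.forEach((off) => off());
}
```

Keeping the mapping in a plain configuration object means the refactored component itself never needs to know that the legacy event system exists, so the bridge can be deleted once the surrounding components have been migrated.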

With that scaffolding and shimming complete, we could move on to refactoring the components themselves and testing them with experiments.

Catching regressions

One component we targeted for refactoring was the listing card, which we use liberally on the search page.

An example listing card

Listing cards are one of the most complex components on the search page; there are many behaviors obscured behind many code paths, so it wasn’t surprising that a refactor would fail to port some behaviors over. Sure enough, when we ran the experiment comparing the original listing card to its refactored counterpart, we saw a dip in views of the listing page from the search page. Segmenting it by platform revealed an interesting pattern — that the dip was isolated to the iPhone and Android platforms:

ERF dashboard showing a regression in listing page (P3) views for mobile platforms
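
To make the segmentation step concrete, here is a minimal sketch of computing the relative change in a conversion-style metric per segment. It is not how ERF works internally: the `SegmentRow` shape and the `relativeChangeBySegment` function are hypothetical, and a real analysis would also report confidence intervals rather than a bare point estimate.

```typescript
// One row per (segment, treatment). All names here are hypothetical.
interface SegmentRow {
  segment: string; // e.g. a platform name
  treatment: "control" | "refactored";
  visitors: number;
  conversions: number; // e.g. visitors who went on to view a listing page
}

// Relative change in conversion rate (refactored vs. control) per segment.
function relativeChangeBySegment(rows: SegmentRow[]): Record<string, number> {
  const rates: Record<string, { control?: number; refactored?: number }> = {};
  for (const row of rows) {
    if (row.visitors === 0) continue; // no exposure, so no rate to compare
    const entry = rates[row.segment] || (rates[row.segment] = {});
    entry[row.treatment] = row.conversions / row.visitors;
  }

  const result: Record<string, number> = {};
  for (const segment of Object.keys(rates)) {
    const { control, refactored } = rates[segment];
    // Skip segments missing a treatment or with a zero baseline.
    if (control === undefined || refactored === undefined || control === 0) continue;
    result[segment] = (refactored - control) / control;
  }
  return result;
}
```

A regression confined to a few segments shows up as a clearly negative value for those segments and values near zero elsewhere, even when the aggregate change is small.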

Our listing cards are responsive components: a listing card on a small (read: mobile) breakpoint behaves slightly differently from one on a large (read: desktop) breakpoint. On small breakpoints, the old listing card opened the listing page in a new tab when clicked or tapped.

Digging into the code after ERF surfaced the issue, we discovered that the refactored React listing card didn’t implement this new tab behavior for small breakpoints; it opened the listing page in the same tab. Re-running the experiment with that fix showed a marked improvement in the metrics:

ERF dashboard showing that the regression was fixed

The regressions specific to the mobile platforms have disappeared; those numbers are now neutral. As additional remediation, we wrote regression tests to make sure we won’t break that tab-opening behavior in the future.
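
A regression test of this kind can be very small. The sketch below is an assumption-laden illustration, not the actual test: `Breakpoint`, `listingLinkTarget`, and `checkSmallBreakpointOpensNewTab` are hypothetical names, and the target returned for non-small breakpoints is a placeholder, since only the small-breakpoint behavior is described here.

```typescript
type Breakpoint = "small" | "medium" | "large";

// Small breakpoints open the listing page in a new tab, as described above.
// The value returned for other breakpoints is a placeholder assumption.
function listingLinkTarget(breakpoint: Breakpoint): "_blank" | "_self" {
  return breakpoint === "small" ? "_blank" : "_self";
}

// Regression check: fails loudly if the new-tab behavior is ever dropped.
function checkSmallBreakpointOpensNewTab(): void {
  if (listingLinkTarget("small") !== "_blank") {
    throw new Error("listing card must open a new tab on small breakpoints");
  }
}
```

Pulling the decision into a pure function makes the behavior explicit in one place, so a future refactor has to change a tested function rather than silently omit an implicit code path.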

We used this validation methodology throughout the rest of our refactoring, and were able to uncover several other regressions which, in aggregate, would have had a substantially negative effect on our core business.


There are several caveats to using experiments to validate a refactoring. One is that the effect of a regression may go undetected because no metric yet captures it. In our case, we were reasonably confident that the coverage from our suite of metrics was sufficient for our purposes; the suite has seen contributions from numerous product teams over the course of several years, instrumenting things as diverse as page load performance and support tickets created.

Another caveat is that this validation strategy is generally more effective on pages that have high throughput — it takes less time to collect enough data to achieve significance. As the search page is at the top of our guest funnel (and thus receives a significant amount of traffic), it worked fine for us.
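
The relationship between traffic and waiting time can be made concrete with the textbook sample-size approximation for comparing two proportions. This is a back-of-the-envelope sketch, not how ERF computes significance; the function names and the fixed 5% significance and 80% power settings are assumptions chosen for illustration.

```typescript
// Approximate visitors needed per arm to detect an absolute change `delta`
// in a conversion rate near `baseline`, using the standard approximation
// n = 2 * (zAlpha + zBeta)^2 * p * (1 - p) / delta^2.
function visitorsPerArm(baseline: number, delta: number): number {
  const zAlpha = 1.96; // two-sided 5% significance
  const zBeta = 0.84; // 80% power
  const variance = baseline * (1 - baseline);
  return Math.ceil((2 * (zAlpha + zBeta) ** 2 * variance) / delta ** 2);
}

// Days needed to expose enough visitors, given total daily traffic
// split evenly across the arms.
function daysToCollect(perArm: number, dailyVisitors: number, arms = 2): number {
  return Math.ceil((perArm * arms) / dailyVisitors);
}
```

Since the required sample grows with the inverse square of the effect size, halving the smallest regression you want to catch roughly quadruples the data needed, which is why a low-traffic page can take impractically long to validate this way.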

There’s also a lag time between launching the experiment and collecting enough data to be able to make an informed decision — we needed to carefully pipeline work so that we weren’t sitting idle while we were waiting for data to come in.


Using experiments to validate refactored code turned out to be an invaluable approach for surfacing issues, which allowed us to fix them and confidently and incrementally launch the refactor without impacting the core business. It’s a validation strategy that we can adopt for future refactorings of critical high-throughput user-facing flows.

Airbnb Engineering & Data Science

Creative engineers and data scientists building a world where you can belong anywhere.

Thanks to Adam Neary, Harrison Shoff, and Phil MacCart
