Testing at Airbnb
By Lou Kosak
When I was considering an offer from Airbnb early last year, one of the things I made sure to ask about was testing. At my last startup I’d worked hard to cultivate a culture of testing; my team took pains to make sure that everything we shipped had a full test suite, and we often built new features using test-driven development. I wanted to make sure that I wasn’t going to be joining a team that didn’t care about testing, or worse, actively opposed it.
When I posed the question to a few Airbnb engineers, the consensus seemed to be, “well, we all really believe in testing, but we’ve got a long way to go”. I got the impression of a team that was slowly trying to bring order to a large, monolithic application, and saw testing as an essential part of that transformation.
Nearly a year later, I’m amazed by the progress we’ve made. We went from a handful of brittle tests to a large, resilient suite. We’ve gutted our continuous integration stack and reduced our build times from 1 hour+ to about 6 minutes, while simultaneously handling many times the number of builds. Most importantly, we’ve gone from an engineering culture in which most new code shipped without a single test, to one in which a proposed change without tests will quickly be called out and corrected. In short, we’ve not only become a team that believes in tests; we’ve become one that actually writes them.
For the payments team, where I’ve spent most of my time, this transformation has been especially significant. Airbnb processes huge volumes of transactions in dozens of different currencies across a wide array of payments processors every day, and a small undetected bug could have a huge impact on our guests and hosts. Deploys used to be harrowing for us; verifying that a change wasn’t introducing regressions was often a matter of manual testing and stats watching. Now, our test suite is the largest in the codebase, and we can generally trust that a green build truly means we haven’t broken anything.
I’d like to share with you how we managed this evolution. First, I’ll talk about how we changed our culture to make testing a first class citizen. Next, I’ll discuss some of the tooling we’ve put in place to make it easier for our team to write and run tests. Finally, I’ll talk about some of the testing challenges we’ve faced while working with specialized code and an increasingly service oriented architecture.
Building a Movement
At a tiny startup, changing a team’s behavior is relatively easy. You and your team sit down together, discuss the new behavior you want to adopt, and then adopt it. When you have dozens of engineers spread across different teams, however, changing how people work requires a bit more strategy. One approach is to rule by edict, but given our culture of engineer autonomy this would have been very poorly received. The other approach is to lead by example and build a movement. For us, this movement began with pull requests.
The PR Spring
In the early days of Airbnb, as at many startups, most commits were pushed directly to master; if you wanted somebody to review your changes, you’d call them over or IM them a link to your (usually already merged) SHA. With deploys going out many times a day, this resulted in a lot of questionable code hitting our production servers, and our uptime suffered for it. As we grew, this process became increasingly problematic.
Eventually, a few people decided to do something about it and started submitting pull requests for their changes. This was never introduced as a mandatory policy; we never disabled pushing to master or shamed people for doing so. But as those few, then a team, then several teams started doing this, two things happened. First, it became clear that this process of peer review led to less bad code hitting production, and therefore fewer outages. Second, it began to seem horribly old-fashioned to push directly to master. This was accelerated by team growth: since each new engineering hire was briefed on the importance of PRs, the percentage of the team using them kept increasing regardless of whether existing engineers made the switch or not. Eventually, even the cowboys of the old guard started to feel sheepish when they snuck changes into master.
Adoption of pull requests brought a number of advantages for our team. It improved our stylistic consistency, gave us a forum to discuss code structure and architectural decisions, and increased the likelihood that typos and logical errors would be caught before they reached our users. By acting as a channel through which all new code must pass, it also gave individuals on the team much greater visibility into what was shipping. This increased visibility, in turn, enabled us to begin a cultural transformation around testing.
The Dawn of Testing
Airbnb has had a spec/ folder in place on our primary Rails app for quite a while, but the state of affairs a year ago was pretty grim. The suite was slow and error-prone, our CI server was barely limping along, and most people had no idea how to run tests locally. New hires were told that testing was important, but when they saw that nobody was paying any attention to tests they quickly forgot about them. When those of us most committed to testing discussed this last spring, it was clear that a multi-pronged strategy was necessary to start turning things around.
First, we needed to take a lesson from the PR rollout and lead by example. This meant including tests with any PRs we submitted, whether it was new feature development, bug fixes, or refactors. The increased visibility provided by pull requests meant that many people would see this happening.
Second, we needed to start educating the team: we spoke in engineering meetings, showed teammates how to write tests, held office hours, and shared links and recommended reading. I began publishing a weekly newsletter of testing news and PR highlights, showing off well-written specs and giving people props for including tests with their changes.
Third, we needed to exploit our growth and make new hires into champions of testing. I revamped our testing bootcamp and emphasized the importance of testing in fighting cruft and keeping us from being trapped under the weight of a monolithic codebase that we couldn’t safely refactor. I explained how much progress we’d already made, and encouraged our new engineers to bear the torch. As adoption of testing increased, the sense of inertia lifted and new hires began to treat test writing as a matter of course. As with PRs, the old guard eventually started to be won over.
Obviously, all of this cultural shift wasn’t worth much if it was too painful to write and run tests. In the next section I’ll look at the changes we made to our tooling during this period to make testing as painless as possible.
A Bar So Low You Can Trip Over It
One of my friends at Airbnb likes to say that the bar to writing tests should be, “so low you can trip over it.” I tend to agree. As I mentioned above, this certainly wasn’t the case a year ago; at that point, the bar was well overhead — possibly hidden in a tree or a passing cloud. To start lowering it, we needed a few critical components to be in place: a way to run individual tests quickly and reliably on a dev machine, a way to run the full suite quickly and painlessly in the cloud, and a way to make sure that every PR had a passing build before it was merged.
The best way to ensure that local testing was possible was to normalize people’s dev environments. For this we chose Vagrant. Combined with Chef, this allows us to do our local dev in sandboxed Linux instances running locally via VirtualBox, in a configuration as similar to production as possible. In addition to making dev environment setup much easier than it used to be, this ensures that each engineer has a consistent environment that is ready to run tests out of the box. The user SSHs into the local Linux server and runs spec commands as they would on their host OS, and generally everything Just Works. Most people on our team combine this with Zeus, which preloads the Rails environment for lightning fast (relatively speaking) test runs. Both Vagrant and Zeus have their share of issues, but in practice we’ve found them to be a huge time saver.
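To give a flavor of this setup, here’s a stripped-down sketch of the kind of Vagrantfile it implies. The box name, memory size, cookbook path, and recipe name are all placeholders, not our actual configuration:

```ruby
# Hypothetical Vagrantfile sketch; box, memory, and recipe names are
# placeholders, not Airbnb's real configuration.
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"   # a Linux base box, run under VirtualBox

  config.vm.provider "virtualbox" do |vb|
    vb.memory = 4096                  # headroom for Rails plus a test database
  end

  # Provision the guest with the same Chef cookbooks used in production,
  # so the local environment stays as close to prod as possible.
  config.vm.provision "chef_solo" do |chef|
    chef.cookbooks_path = "chef/cookbooks"
    chef.add_recipe "airbnb_dev"      # hypothetical recipe name
  end
end
```

With something like this checked into the repo, `vagrant up` gives every engineer the same test-ready sandbox.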
In The Clouds
Due to the way our primary Rails application evolved, we’ve got a lot of tests that have large dependency graphs, interact extensively with the DB, instantiate tons of objects, etc. In a single process, running the entire suite takes several hours — far longer than anybody should have to wait to find out if their changes are causing regressions. There are a ton of excellent strategies for speeding up Rails test suites — aggressive use of stubbing/test doubles, decoupling logic from models, avoiding loading Rails entirely — but given the size of our codebase and the velocity with which we’re moving, most of these weren’t immediately feasible. We needed a build system that would allow us to parallelize our test suite so that the wall-clock time taken to run the suite was manageable.
Our SRE team went through several different continuous integration solutions in the last year before settling on Solano. Each of the previous systems had some issue: instability, memory consumption, poor DB management, poor parallelization, painful web UI, you name it. What Solano gives us is an on-premise solution with excellent native support for fanning out tests to multiple threads, running them in parallel, and then assembling the results. It has a great web UI, CLI support, and impressive performance. Since we started using it, our deploy workflow has grown noticeably faster, and the number of wails and anguished GIFs from frustrated engineers is at an all-time low.
Having a CI server building all commits across all branches was a huge first step, but to make this useful we needed to surface the outcome of these builds. This is where GitHub’s commit status API comes in. Every time our CI server begins a build, it pings GitHub’s commit status endpoint, and every time it completes a build it hits the endpoint again with the outcome. Now every open PR includes a yellow/red/green indicator for the branch in question, with a direct link to the build status page on our CI server. In practice this means more transparency, faster feedback cycles, and a guarantee that every branch merged into master has a passing test suite. This integration has been a huge help in keeping our master branch green, and has thus greatly reduced our deploy times (since engineers aren’t waiting on build failures to be resolved in master).
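The status-reporting side of this integration is simple enough to sketch. The helper below shows the shape of a request to GitHub’s commit status endpoint; the repo name, token handling, and `ci/build` context are illustrative assumptions, not our actual CI code:

```ruby
require "json"
require "net/http"
require "uri"

GITHUB_API = "https://api.github.com"

# Build the JSON body GitHub's commit status endpoint expects.
# state must be one of: "pending", "success", "error", "failure".
def status_payload(state:, target_url:, description:)
  {
    state: state,
    target_url: target_url,           # link back to the build page on the CI server
    description: description,
    context: "ci/build"               # distinguishes this status from other integrations
  }
end

# POST a status for a given commit SHA (hypothetical wiring; not called here).
def post_status(repo, sha, token, payload)
  uri = URI("#{GITHUB_API}/repos/#{repo}/statuses/#{sha}")
  req = Net::HTTP::Post.new(uri)
  req["Authorization"] = "token #{token}"
  req["Accept"] = "application/vnd.github+json"
  req.body = JSON.generate(payload)
  Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request(req) }
end
```

The CI server calls this once with `"pending"` when a build starts and again with the final state when it completes, which is what drives the yellow/red/green indicator on each PR.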
The combination of a standardized local test environment with a fast, reliable CI server and GitHub integration has helped us move that testing bar a lot closer to the ground. Where I guess it will eventually, ideally, be tripping everybody all the time or something.
Testing the Hard Stuff
Before I joined Airbnb, I was becoming something of a testing purist. I wanted my unit tests to run in total isolation, my test setup to be as minimal as possible, and my whole suite to run as close to sub-second as possible. I watched Corey Haines’ talk at GoGaRuCo and my eyes twinkled with wonder.
These days, I’ve become a bit more of a pragmatist. The reality of automated testing with a large legacy codebase is that you’ll need to make some compromises on testing purity. That’s fine. The important thing is to test all the f***ing time. Even when it’s hard. Especially when it’s hard. (Just make sure you have a fast, parallelizing build server).
Here are some examples of hard things we’ve had to test, and how we approached them.
Really Raw SQL
I recently worked on a reconciliation project for our finance team. The goal was to pull in reports from all our payments partners, normalize them, and match them to records in our system. It was a Ruby project, but for performance reasons all the heavy lifting was done with raw SQL transactions.
At first I was worried about the difficulty of testing this, and even briefly considered punting on it. Thankfully, I reconsidered; the project would have been a nightmare without the sprawling test suite I ended up building. I approached each stage of the ETL (extract, transform, load — a common design pattern in data warehousing) as a discrete unit of work, and tested each one by asserting on the initial state and final state of the database tables it was modifying. The tests were pure acceptance; they had no idea how the SQL was accomplishing the task. By treating the system’s implementation as a black box, my test suite allowed the SQL to be arbitrarily refactored (for performance or clarity) with a high degree of confidence, and also served as documentation for the behavior of a very complex system.
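Here’s a toy sketch of that black-box approach. Each ETL stage is exercised only through the before and after state of the data it touches; the code never inspects how the work is done. To keep this self-contained, the “tables” are arrays of hashes and the matching rule is invented for illustration — the real system ran raw SQL against actual tables:

```ruby
# Stage under test: match partner report rows to internal transactions by
# (reference, amount_cents), appending matches to a results "table".
# This matching rule is illustrative, not the real reconciliation logic.
def match_stage(partner_rows, internal_rows, matches)
  internal_index = internal_rows.group_by { |r| [r[:reference], r[:amount_cents]] }
  partner_rows.each do |row|
    candidates = internal_index[[row[:reference], row[:amount_cents]]] || []
    candidates.each do |txn|
      matches << { partner_id: row[:id], transaction_id: txn[:id] }
    end
  end
end

# Acceptance-style test: seed the initial state, run the stage as a black
# box, then assert only on the final state.
partner  = [{ id: 1, reference: "ABC", amount_cents: 5_000 },
            { id: 2, reference: "XYZ", amount_cents: 1_250 }]
internal = [{ id: 10, reference: "ABC", amount_cents: 5_000 }]
matches  = []

match_stage(partner, internal, matches)
```

Because the assertions only describe table states, the stage’s internals (here Ruby, in the real project SQL) can be rewritten freely without touching the tests.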
The Service Industry
As Airbnb has grown, we’ve started moving a lot more critical functionality out into services. In our development environments, we’re operating either full or stripped down versions of most of the services that our core Rails app relies on, but we still want to be able to run our tests in isolation. I’ve yet to see a perfect solution to this problem, but one approach that’s worked well for us is to include stub functionality in the client libraries themselves (which are generally provided by the service). It’s not a perfect solution — I don’t love the idea of test-specific behavior being introduced to the client libs, for one — but it has a couple significant benefits. First, it eliminates the need for our test suite to know very much about the services the app consumes, and second, relatedly, it ensures that potentially breaking interface changes to the service will be surfaced by test failures (since an update to the service will be accompanied by an update to the client, and therefore to its stubbed implementation). Drop me a note in the comments if you hate this idea or have a better solution.
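The pattern looks roughly like the sketch below. The service, class names, and response shape are all hypothetical; the point is that the stub ships inside the client library and mirrors its interface, so an interface change to the service breaks consumers’ test suites loudly:

```ruby
# Hypothetical client as a service team might ship it.
class PricingClient
  def initialize(base_url)
    @base_url = base_url
  end

  def quote(listing_id, nights)
    # In production this would make an HTTP call to the pricing service.
    raise NotImplementedError, "network calls disabled in this sketch"
  end
end

# The stub lives alongside the real client and presents the same interface,
# so updates to the service update the stub in lockstep.
class PricingClient
  class Stub < PricingClient
    def initialize(canned_quotes = {})
      @canned_quotes = canned_quotes
    end

    def quote(listing_id, nights)
      @canned_quotes.fetch([listing_id, nights]) do
        { listing_id: listing_id, nights: nights, total_cents: 0 }
      end
    end
  end
end

# In a consumer's test suite, the app is wired up with the stub:
client = PricingClient::Stub.new([42, 3] => { listing_id: 42, nights: 3, total_cents: 30_000 })
```

The consuming app’s tests only know the client’s interface, never the service’s transport details.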
While there’s still plenty of room for improvement, automated testing at Airbnb has been greatly improved in the last year. We would never have been able to achieve this without a dedicated grassroots effort, and this effort would have failed if it hadn’t been combined with huge improvements in our testing tooling. These initiatives, combined with education and a healthy dose of pragmatism, have gotten us to a point where testing is ubiquitous and relatively painless, and where green builds actually mean something.
Building good habits around testing is hard, especially at an established company. One of the biggest lessons I’ve learned this year is that as long as you have a team that’s open to the idea, it’s always possible to start testing (even if you’ve got a six year old monolithic Rails app to contend with). Once you have a decent test suite in place, you can actually start refactoring your legacy code, and from there anything is possible. The trick is just to get started.