Introducing Pipelines to Airbnb’s Deployment Process
Joining the last line of defense against “bad code”, via developing a method of enforcing our deployment procedures, and lessons learned about how theory and practice differ in real-world software development.
At Airbnb, downtime is not just an inconvenience — there is often substantial real-world impact on our users. As a constant reminder of this incredibly important fact, we created a prominently placed poster of a single tweet from many moons ago. Roughly paraphrased, it reads:
@AirbnbHelp Help! I’m locked outside, and can’t contact my host as Airbnb is down.
Of course, a website could go down due to various reasons. Maybe a flaky test acted up and allowed us to push bad code. Maybe a box unexpectedly failed. Maybe a large-scale vulnerability fix impacted kernel performance. Fortunately, every team at Airbnb plays their own part in ensuring this scenario doesn’t occur, making downtime exceedingly rare. On DeployInfra, we focus on the deploy box in the software development life cycle — basically, once code is written and ready for use, we’re responsible for getting it onto the appropriate servers (which covers a lot of territory!). Almost all of Airbnb’s many apps/services are shipped using our homegrown deployment service, Deployboard, which in many ways makes us the last line of defense against “bad code” (after us, the code is — by definition — live!).
With this great power comes even greater responsibility. While almost all Airbnb engineers interact with Deployboard regularly, these interactions are typically short and more of a means to an end. As a result, we cannot rely on familiarity as a crutch; we need to make sure the deployment process is carefully designed and potential misuse is minimized. In particular, we hold our interface to extremely high standards for ease-of-use and intuitiveness. Anything less, and mistakes run rampant.
An unnecessarily flexible deployment procedure
Of course, we aren’t perfect, and oftentimes what seems intuitive to us — as developers of the system— is less than intuitive for infrequent users. In particular, flexibility is a blessing to the experienced, but oftentimes also provides the rope to hang oneself with. This summer, we noticed this to be the case for a rather important part of Deployboard, which was indirectly causing a fair number of incidents.
First, a bit of necessary background. Airbnb has around a thousand different services, a number that will only continue to grow as we move towards SOA. Each service has its own set of “environments” — clusters of machines fulfilling a certain role — and a deployment procedure that specifies the order one should deploy code changes in. For instance, a typical code change might first be deployed to a “staging” environment, then to a “canary” environment, then finally to production.
So far, all normal. But at Airbnb, we pride ourselves on giving our engineers a great deal of freedom and flexibility, which in this case means any engineer can deploy to any environment with minimal effort (recall we wanted things to be as painless as possible!). However, in this particular context, this freedom raises two issues:
- There is nothing preventing engineers from deploying in the wrong order, e.g. deploying directly to production before deploying to staging or canary. In fact this is not uncommon; before my project, about 10% of code changes went directly to production. Note that most of these are test- or comment-only changes, so this number isn’t quite as frighteningly high as it first appears, but it’s concerning nonetheless.
- If an engineer wants to contribute to a service they don’t regularly work with (which is very common at Airbnb!), it is not easy for them to find out what the service’s deployment procedure is — documentation around this is rarely up-to-date.
Of course, these problems feed into each other. When faced with the above display, an engineer new to the service might decide that their change should end up on production eventually, so the “simplest” solution is to just deploy it there.
As I mentioned earlier, almost all Airbnb engineers regularly interact with Deployboard, and while these interactions are short, nearly all of them involve precisely this display at some point (after all, this display is how code is actually deployed!). As a result, solving this problem became an immediate priority. Thus, while we evaluated several external solutions and will likely move to one in the coming years, in the short-term we elected to implement a homegrown pipelining system within Deployboard itself.
This work needed to be done extremely carefully. As Airbnb has grown over the last half-decade, Deployboard has been right there with it, now handling thousands of deploys daily. Suffice it to say that Deployboard is quite integral to Airbnb’s operations, and so making substantial changes to it — especially in its critical sections — requires a rather steady hand. Thus began a multi-team collaboration to develop a solution proposal, a process that was meticulous, extremely cautious, and very detail-oriented; I felt almost more like a PM than an engineer for the first few weeks of my internship!
Finally, after many mockup iterations and hundreds of document comments, the end goal took shape:
Developing the pipeline
We now know what point A and point B look like, so “all” that remains is to connect the dots! As always, the unwritten zeroth step is to first divide the work into several relatively distinct steps:
- Develop a specification for a configuration file that service owners can use to define their deployment procedures (i.e. pipeline).
Since we treat configuration as code, this makes developing and later updating a pipeline configuration as simple as deploying it!
Since deploys are themselves handled by pipelines, this approach does raise an important subtlety: it is possible that a pipeline is called upon to update its own configuration. In principle, this is not a problem at all, but it does mean that if a pipeline somehow obtains a broken configuration, it cannot be fixed via direct methods. As a result, this step requires highly defensive programming to avoid such a scenario, as well as developing tooling to “rescue” a pipeline should something slip through the cracks.
2. Save pipeline configurations in a database.
This raises a lot of database design questions. What tables do we need? How do they interact? How do we represent dependencies between targets? The former two are relatively simple; we can have a table of pipelines, referencing a table of pipeline stages, etc. The last one, however, requires slightly more effort in a link table. A quick example:
Here, actors and films are mapped in a many-to-many relationship via a separate table, whose rows contain both an actor ID and a film ID. Similarly, we can map targets to targets. Essentially, this is a database-friendly way of representing a sort of adjacency list.
3. Give the frontend a method of accessing this data.
Fortunately this one is actually simple; we can simply expose pipelines as part of a service’s Rails presenter; in other words, we produce a JSON equivalent of the information we want to retain about a pipeline. Since that information of course includes the stages and dependencies involved in the pipeline, this recursively requires presenters for pipelines, stages, links, etc.
4. Separate and order the targets correctly in the UI display.
Ignoring some edge case details surrounding stage equality, this boils down to correctly ordering a DAG, which is of course most easily done via topological sorting (a typical application of DFS). Essentially, this is a way of ordering our stages so that later stages depend only on earlier stages, which gives us a guaranteed order we can process them in without running into dependency issues. A nice side benefit of this algorithm is implicit cycle detection, which means it can be reused as part of the config validation (we should, of course, reject configurations that contain cycles).
5. Disable targets that shouldn’t be deployed to (i.e. it has dependencies that have not yet been deployed to).
This one sounded probably the simplest, since we already have a DB table for deploy, but turned out to be by far the hardest change. As it turns out, we have so many deploys that retrieving the deploys associated with a change in the naive way is prohibitively slow. This problem persisted even after adding the appropriate indices, which was quite surprising.
In the end, the solution was a little more intricate and required exposing a new API endpoint (essentially allowing the information to be accessed on-demand, rather than preloaded, distributing the database access pattern to be acceptably fast), which turned out to be a blessing in disguise as some other projects required this basic functionality as well.
6. Finally, implement the UI.
There isn’t too much to say here — this is fairly standard frontend work involving CSS tinkering and some React components. One interesting note, however, is the aforementioned attention paid to intuitive design — at one point, we spent several days discussing which icon to use: a blue circle or a gold star! While this perhaps sounds unnecessary, it was in fact of paramount importance — this icon was, essentially, to represent “what to do next”, which is perhaps the most important information the UI should convey.
(If you’re curious: in the end we chose the blue circle for consistency with the rest of the UI)
Of course, completing the technical work wasn’t the end of the project — there’s no value in developing a solution if nobody uses it! Thus the last few weeks of my internship were dedicated to promoting the project to various Airbnb teams, requiring a flurry of demos, presentations, and many, many emails. In the end, a number of key services — including Deployboard itself! — ended up adopting deploy pipelines, substantially improving the developer experience.
Overall, my internship with Airbnb has been an incredible learning experience, both technically and otherwise. From organizing lightning talks to giving demos to writing this blog post, there was plenty to do here other than sitting in my own little corner of the world churning out code, which has been fairly unusual for internships in my experience. This is no coincidence — the culture here is uniquely welcoming and encouraging of communication (no surprise, considering the hiring emphasis on core values). I’m definitely looking forward to being back in the future!
Airbnb is constantly seeking outstanding people to join our team! If you are interested in interning / working at Airbnb, please check out our job postings at https://www.airbnb.com/careers/departments/engineering and send an application!