Continuous Delivery at Airbnb

Jens Vanderhaeghe, Manish Maheshwari

Introduction

Over the years, Airbnb’s tech stack has shifted from a monolith to 1,000+ services in our service-oriented architecture (SOA). While this migration solved our problems scaling our application architecture, it also introduced an array of new challenges.

In this blog post we’ll cover the deployment challenges we faced on the road to our current architecture and how we solved them by adopting Continuous Delivery best practices on top of Spinnaker. We’ll do a deep dive into how we completed such a large-scale migration in a short timespan while maintaining developer productivity along the way.

From Deployboard to Spinnaker

Deployboard, Airbnb’s legacy deployment tool, was designed for a monolith with a few centrally managed pipelines. As we started moving to SOA, thousands of code changes across hundreds of service teams were being deployed. Deployboard was not designed for SOA, which is characterized by decentralized deployments. We needed something much more templated so that teams could quickly get a standard, best-practice pipeline, rather than start from scratch for every new service. Rather than continuing to build in-house solutions with siloed knowledge, it made the most sense for us to adopt open source solutions built from the ground up for decentralized SOA pipelines.

Spinnaker is proven at Airbnb’s scale, and beyond, by industry peers like Google and Netflix. We believe continuous delivery isn’t a problem unique to Airbnb, and decided we’d benefit from collaborating with the larger community. We chose Spinnaker as the replacement for Deployboard in part because we could bridge functionality gaps by plugging in custom logic easily, without forking the core code. It was also important to us that Spinnaker supports automated canary analysis (ACA), an extremely effective strategy for reducing the blast radius of buggy deployments.
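
To make the idea behind canary analysis concrete, here is a minimal Python sketch. It is not Spinnaker's actual analysis algorithm: the function name, the choice of error rate as the metric, and the 10% tolerance are all assumptions made for illustration. It compares the canary's mean error rate against the baseline's and fails the canary when it is noticeably worse.

```python
from statistics import mean


def canary_passes(baseline_error_rates, canary_error_rates, tolerance=0.10):
    """Return True when the canary's mean error rate is at most
    `tolerance` (relative) above the baseline's mean error rate.

    All names and the threshold are hypothetical, for illustration only.
    """
    if not baseline_error_rates or not canary_error_rates:
        raise ValueError("both groups need at least one sample")
    baseline = mean(baseline_error_rates)
    canary = mean(canary_error_rates)
    return canary <= baseline * (1 + tolerance)
```

A pipeline would only continue the rollout to the rest of the fleet when a check like this passes, which is what limits the blast radius of a bad build to the canary hosts.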

Migrating to Spinnaker

By deciding to switch rather than evolve, we created a new problem: how do we get a globally distributed team of thousands of engineers working on thousands of services (each with its own deploy pipeline), under business pressure to continuously improve their product and code base, to change one of the most important tools they depend on for day-to-day productivity?

We were particularly worried about the “long-tail migration problem,” where we successfully get 80% of the services migrated in the first year or so, but the remaining ones become stuck indefinitely on the old system. Having to operate in such a hybrid mode is costly, and it also is a reliability and even security risk, because the “legacy” systems (including the legacy deploy system) receive less and less attention over time.

Rather than forcing yet another new tool on our engineers, we came up with a migration strategy based on three pillars: focus on the benefits, automated onboarding, and data.

The 3 pillars of our migration strategy

Focus on Benefits

By focusing on the benefits of Spinnaker, we encouraged engineering teams to adopt Spinnaker voluntarily rather than forcing them.

We started out by manually onboarding a small group of early adopters. We identified a set of services that were prone to causing incidents or had a complicated deployment process. By migrating these services onto Spinnaker and automating their release process using a deployment pipeline with ACA, we were quickly able to demonstrate value. As we onboarded more teams, we closed the feature gaps between Deployboard and Spinnaker. These early services served as case studies, proving to both the rest of engineering and leadership that adopting an automated and standardized deployment process provides huge benefits.

These early adopters saw benefits so significant that they ended up becoming evangelists for continuous delivery and Spinnaker, spreading the word to other teams organically.

Automated Onboarding

As more and more services started adopting Spinnaker, the Continuous Delivery team could no longer keep up with demand. We switched gears and focused on building automated tooling to onboard services to Spinnaker.

At Airbnb, we store configuration as code using a framework called OneTouch. This allows engineers to make changes to the code as well as the infrastructure running their code in a single commit and in the same folder. All infrastructure changes are version controlled.

Example of a codified Spinnaker pipeline

Following the OneTouch philosophy, we created an abstraction layer on top of Spinnaker that enables all continuous delivery configuration to be source controlled and managed by our existing tools and processes.
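
As a rough illustration of what "pipeline as code" can look like, the Python sketch below builds a pipeline definition from a short list of stage names and prints it as JSON that could be checked into a service's repository. This is not OneTouch's or Airbnb's actual format. The function name, service name, and stage names are hypothetical, and the field names only loosely mirror the general shape of a Spinnaker pipeline definition.

```python
import json


def build_pipeline(service, stage_names):
    """Chain the given stages so that each one runs after the previous one.

    Hypothetical helper: the field names are assumptions for illustration.
    """
    stages = []
    for index, name in enumerate(stage_names, start=1):
        stages.append({
            "refId": str(index),
            "name": name,
            # The first stage has no prerequisite; every later stage
            # depends on the stage immediately before it.
            "requisiteStageRefIds": [str(index - 1)] if index > 1 else [],
        })
    return {"application": service, "name": f"{service}-deploy", "stages": stages}


pipeline = build_pipeline(
    "example-service",
    ["deploy-canary", "canary-analysis", "deploy-production"],
)
print(json.dumps(pipeline, indent=2))
```

Because the definition is plain data generated from code, it can be reviewed, version controlled, and templated across services like any other change.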

Today, when new services are created they get Spinnaker integration, including ACA, for free out of the box.

Data

In addition to focusing on the benefits and making it easy to onboard, we wanted to clearly communicate the value-add of adopting Spinnaker in a data-driven way. We automatically instrumented Superset dashboards for each service that adopted Spinnaker.

Example of an instrumented dashboard for a service that has adopted Spinnaker

Service owners get insight into deployment data like deploy frequency and the number of regressions prevented by ACA. Most service owners saw a significant increase in deployment frequency and a marked decrease in production incidents after adopting our new tooling. Armed with the right data, our users can more easily advocate for the benefits of adopting continuous delivery.
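
A metric like deploy frequency is simple to derive once deploy events are recorded. The Python sketch below is only an illustration of that calculation, not the query behind our dashboards; the function name and the input shape (a list of deploy dates) are assumptions.

```python
from collections import Counter
from datetime import date


def deploys_per_week(deploy_dates):
    """Count deploys per ISO (year, week) pair.

    Hypothetical helper; assumes one `date` per completed deploy.
    """
    return Counter(d.isocalendar()[:2] for d in deploy_dates)


# Illustrative input: two deploys in ISO week 1 of 2022, one in week 2.
counts = deploys_per_week([date(2022, 1, 3), date(2022, 1, 5), date(2022, 1, 10)])
```

Comparing these weekly counts before and after a service's migration gives its owners a concrete number to point to.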

Clearing the final hurdle

As expected, we eventually hit an inflection point in adoption. Organic adoption slowed as we reached ~85% of deployments being done on Spinnaker.

Once we hit this point, it was time to switch our strategy again in order to migrate the lagging services. Our plan consisted of the following steps.

  1. Stop the bleeding
    The first thing we did was stop any new service from being deployed with Deployboard. This kept our list of remaining services to migrate static. We gave engineers ample heads-up that this change was coming.
  2. Announce deprecation date + increase friction
    We gradually increased friction when using Deployboard over Spinnaker by adding a banner and warning inside Deployboard. We also instituted an exemption process that would allow us to catch major blockers well before the actual deprecation date without hurting customer experience.
  3. Send out automated PRs for the remaining services.
    To ensure we could also help onboard services whose owners are resource constrained, we once again leveraged tools like our in-house refactor tool, Refactorator, to do the heavy lifting.
  4. Deprecation date and post-deprecation follow-up.
    On the deprecation date, we had code in place that blocked any OneTouch deploy from Deployboard. We left some loopholes in place in case services still needed to use Deployboard for emergency reasons. The exemption list allows them to temporarily get access to Deployboard. Engineers on the CD team can also still deploy with Deployboard, so a simple page to the on-call can quickly help service owners in this case. As of today, the number of those cases remains minimal given the amount of preparation we’ve done.
By adding a banner to Deployboard recommending that engineers adopt Spinnaker, we were able to drive adoption more quickly.
Example of an automated Pull Request that migrates a service from Deployboard to Spinnaker with minimal engineering effort.
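
The deprecation-date gate described in step 4 can be sketched in a few lines of Python. This is a minimal illustration and not the code that ran in Deployboard; the function name, parameters, dates, and service names are all hypothetical.

```python
from datetime import date


def deployboard_deploy_allowed(service, today, deprecation_date,
                               exempt_services, is_cd_engineer=False):
    """Decide whether a Deployboard deploy may proceed.

    Hypothetical sketch of the rules described above:
    - before the deprecation date, every deploy is still allowed;
    - afterwards, only exempted services and CD team engineers get through.
    """
    if today < deprecation_date:
        return True
    return is_cd_engineer or service in exempt_services


# Illustrative values only.
cutoff = date(2022, 1, 1)
exemptions = {"legacy-service"}
```

Keeping the exemption list as explicit data makes the loopholes visible and easy to shrink over time.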

Future Plans and Opportunities

Now that we’ve standardized our deployment process, we’re excited to integrate various existing tools at Airbnb into our continuous delivery pipelines. In 2022 and beyond, we are investing resources into integrating automated load testing, providing a way to safely toggle feature flags, and enabling blue/green deployments to facilitate instant rollbacks. More broadly, we see Spinnaker not only as a tool for code deployments, but also for the automation of various manual processes, allowing engineers to orchestrate any arbitrary workload as a pipeline.

During our migration, we’ve made a ton of modifications, both large and small, to Spinnaker, which is a testament to how flexible the tool is. We will be focused on upgrading to the latest open-source version and are looking forward to contributing some of our changes back to the open-source community.

Conclusion

In our move from a monolithic architecture to SOA, we needed to rethink the way we do deployments at Airbnb.

By creating a Continuous Delivery team focused on delivering great tools to safely and easily deploy code, we were able to migrate from our in-house tool, Deployboard, to Spinnaker. This was a very carefully planned and crafted migration. To migrate the majority of services, we focused on the benefits and took a data-driven, automated approach.

As expected, there was a long tail of services that didn’t organically adopt our new tools. We were able to get to the 100% finish line by shifting our strategy towards adding more friction and eventually deprecating our old tool.

This migration now serves as a blueprint for other infrastructure-related migrations at Airbnb, and we look forward to continuing to iterate on our strategies for bringing better tools to our engineers while maintaining existing productivity and reducing toil.

Acknowledgments

All of our achievements wouldn’t have been possible without the support of the entire Continuous Delivery team: Jerry Chung, Freddy Chen, Alper Kokmen, Brian Wolfe, Dion Hagan, Ryan Zelen, Greg Foster, Jens Vanderhaeghe, Mohamed Mohamed, Jake Silver, Manish Maheshwari and Shylaja Ramachandra. The entire Developer Platform organization rallied behind this effort. We’re also grateful to the countless engineers at Airbnb who have adopted Spinnaker over the years and have provided us with valuable feedback. We’d also like to thank all of the people at our peer companies and volunteers who have spent countless hours working on the open source Spinnaker project.

Interested in working at Airbnb? Check out these open roles:

Senior/Staff Software Engineer, Developer Infrastructure
Senior Frontend Infrastructure Engineer, Web Platform

All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, logos, and brands does not imply endorsement.

Creative engineers and data scientists building a world where you can belong anywhere. http://airbnb.io
