Building an Effective Test Pipeline in a Service Oriented World

Joey Ye
Feb 4, 2020 · 12 min read

Learn about how we built an integration test pipeline for the testing of critical business flows spanning across multiple services in Airbnb.

This staircase connects several floors of our beautiful new San Francisco office, 650 Townsend.


Over the past 2 years, Airbnb engineering has been working on a major initiative to migrate from a gigantic Rails Application to a decoupled service-oriented architecture, or SOA, as we call it internally. As a result of the migration, some critical business flows that used to live in a monorepo are now converted to separate SOA services. Testing these services became a challenge:

  • Engineers need to thoroughly test these services as a whole before any change to them is deployed to production, and this thorough testing takes time.

In this blog post, we’ll talk about how we solve the testing challenge we face in an SOA world by building an effective test pipeline for some of our most complex business flows. We will share the workflow of our prior test pipeline. We will illustrate its challenges and issues and describe the new test pipeline.

The Prior Test Pipeline: A Pure, Continuous Integration Pipeline Without Continuous Delivery

Our prior test pipeline was built to support the test requirements of the SOA migration.

The SOA migration extracted critical business flows, which lived in a monorepo, into separate SOA services. We have different types of services that we needed to migrate:

  • Some are services with a database that expose RESTful APIs;

There are complex interactions between these migrated services using synchronous API calls or asynchronous event processing or job execution. The test of the SOA migration needed to ensure no end user impact during and after the migration. We needed to ensure the following invariants were maintained:

  • All business flows work the same as before for Airbnb guests and hosts. The tremendous underlying change should be unnoticeable to them.

The Deep Integration Test CI

To meet the test requirement of preventing end user impact, we chose to enumerate and write as many heavyweight integration tests as possible. These integration tests ensured that the complex service interactions and all business flows worked as expected. The process of writing these tests worked like this:

  • Enumerate as many user scenarios as possible from the end user’s perspective.

We refer to these types of integration tests as “Deep Integration Tests”: tests that validate a service or application together with all of its soft and hard dependencies.

To meet the test requirement of ensuring no bad change was deployed to production, we chose to run the deep integration tests before any code change was merged to the master branch.

  • At Airbnb, code changes to any backend service that are merged to master will later get deployed to production.

We created a single integration test project and put all the tests there. We built a single integration test CI that would run all the deep tests. The CI was triggered per commit.

The deep integration test CI workflow was as follows:

  • We built the integration test CI platform on top of Buildkite. Buildkite is a platform for running fast, secure, and scalable continuous integration pipelines on your own infrastructure.

Besides the integration test CI, we also had a Unit Test CI. The Unit Test CI was more efficient and reliable than the integration test CI because it had no environment setup time or network cost.

The prior test pipeline played an important role in preventing bad changes from merging to master. Also, because engineers added many deep integration tests and had all of them run in CI at pre-merge time, they were more confident in rolling out traffic in production during the SOA migration.

The Scaling and Maintenance Challenges

The prior test pipeline worked well when the number of supported services and tests was relatively small. However, it did not work well as more services and tests were added.

It negatively impacted developer productivity

  • The CI runtime became long. The more services and tests the CI supported, the longer the runtime became. The runtime soon became a bottleneck for developer productivity.

It was hard to scale and maintain

The CI was designed to test all related services together. All services were deployed and all the tests were run during the CI process. This made the CI hard to maintain and scale out: CI runtime and stability soon became a problem when it tried to support many services and tests.
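
To illustrate why stability degrades as tests accumulate in a single CI, here is a back-of-the-envelope sketch. It assumes every test fails spuriously with the same small probability and that failures are independent; both are simplifying assumptions, and the numbers are hypothetical, not measurements from our CI.

```python
def ci_pass_probability(num_tests: int, flake_rate: float) -> float:
    """Chance that a run with no real bugs is fully green.

    Assumes each test flakes independently with probability
    flake_rate (a simplification for illustration only).
    """
    return (1.0 - flake_rate) ** num_tests


if __name__ == "__main__":
    # Hypothetical flake rate of 1% per test.
    for n in (10, 100, 500):
        print(n, round(ci_pass_probability(n, 0.01), 3))
```

Under these assumptions, the chance of a clean run falls geometrically with the number of tests, which matches the intuition that one CI running every deep test for every service becomes less stable as it grows.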

It negatively impacted test best practices

Industry test best practices around testability tend to think about testing as a Pyramid. A large number of fast, reliable small tests should be at the base of the pyramid. Moving toward the top of the pyramid, tests begin to increase in complexity and runtime, but the number of them decreases.

However, the prior test pipeline tended to encourage engineers to write more heavyweight, deep integration tests that are hard to maintain and slow to run. These deep integration tests should be at the top of the test pyramid, not in the middle.

It lacked CD validations

All the validations were done at pre-merge time. This lack of validations at CD time encouraged a bad engineering practice: engineers tested directly in production after merging their code to master, instead of in pre-production. Another side effect of not having CD validations in a pre-production environment was that there was no effective way to monitor and guarantee the health of that environment.

Our Goals

As Airbnb grows, more SOA services are joining the critical flows. Always trying to test the whole critical flow together for each commit pre-merge was neither scalable nor maintainable. Our test pipeline needed to be more adaptable to the SOA world.

  • It should follow industry test best practices: Test Pyramid. Write tests with different granularities. The more high-level you get, the fewer tests you should have.

The New Test Pipeline: A Full Test Pipeline With both Continuous Integration and Continuous Delivery

A brief overview

Below is a full diagram of the new test pipeline.

There are two phases of the pipeline: Continuous Integration (CI) and Continuous Delivery (CD).

  • During the CI phase, the pipeline runs unit tests and shallow integration tests. Shallow integration tests validate the service in complete isolation from its hard and soft dependencies, which makes them more lightweight.
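
As an illustration of testing a service in isolation from its dependencies, here is a minimal sketch using Python's standard `unittest.mock`. The `ReservationService` class, its `quote` method, and the pricing client with `get_nightly_rate` are hypothetical names invented for this example; they are not our real services or our mock framework.

```python
import unittest
from unittest.mock import Mock


class ReservationService:
    """Hypothetical service under test; pricing_client is a downstream dependency."""

    def __init__(self, pricing_client):
        self._pricing = pricing_client

    def quote(self, listing_id: str, nights: int) -> dict:
        # In a deployed system this call would cross the network.
        nightly = self._pricing.get_nightly_rate(listing_id)
        return {"listing_id": listing_id, "total": nightly * nights}


class ShallowQuoteTest(unittest.TestCase):
    def test_quote_uses_mocked_pricing(self):
        # Replace the dependency with a mock so no other service is deployed.
        pricing = Mock()
        pricing.get_nightly_rate.return_value = 120
        service = ReservationService(pricing)

        result = service.quote("listing-1", nights=3)

        self.assertEqual(result["total"], 360)
        pricing.get_nightly_rate.assert_called_once_with("listing-1")


if __name__ == "__main__":
    unittest.main()
```

Because the dependency is mocked, a test like this needs only the service under test, which is what keeps it fast and less exposed to failures in other services.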

Test Pyramid

The new test pipeline runs different levels of tests at different test stages. The different levels of tests form the test pyramid. Below is a diagram of the test pyramid used in Airbnb.

As we go up from the bottom to the top of the pyramid, the scope of the test becomes larger, which means it is more realistic and end-to-end. At the same time, the test becomes slower, less reliable, and harder to debug because more complex steps are involved and more components are touched.

In the context of implementing the test pyramid, there are two principles:

  • If a higher-level test spots an error and there’s no lower-level test failing, you need to write a lower-level test.

Here is how these principles translate to our testing practices:

  • Use unit tests to test business logic within service.
  • Use deep integration tests to test service behavior with service interactions. But engineers should follow the two rules of the test pyramid mentioned above.
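
To make the first principle concrete, here is a small sketch of pushing a failure down the pyramid. Suppose a deep integration test exposed a wrong refund at a policy boundary; the fix would be accompanied by a unit test that pins the boundary without any service interaction. The `refund_amount` function and its policy are hypothetical and exist only for illustration.

```python
import unittest


def refund_amount(total_paid: int, days_before_checkin: int) -> int:
    """Hypothetical refund rule, amounts in cents.

    Full refund 7 or more days out, half refund 1 to 6 days out,
    nothing on the day of check-in.
    """
    if days_before_checkin >= 7:
        return total_paid
    if days_before_checkin >= 1:
        return total_paid // 2
    return 0


class RefundAmountTest(unittest.TestCase):
    def test_policy_boundaries(self):
        # Lower-level test covering the boundaries a deep test might have hit.
        self.assertEqual(refund_amount(10000, 7), 10000)
        self.assertEqual(refund_amount(10000, 6), 5000)
        self.assertEqual(refund_amount(10000, 1), 5000)
        self.assertEqual(refund_amount(10000, 0), 0)


if __name__ == "__main__":
    unittest.main()
```

Once the unit test covers the boundary, the deep integration test only needs to confirm that the services are wired together correctly, not re-verify every branch of the business logic.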

The Shallow Integration Test CI

The new test pipeline runs shallow integration tests at CI phase. Below is its workflow:

The whole flow works similarly to the deep integration test CI flow in the prior test pipeline. The differences are:

  • It deploys only one service, while the prior one deployed the whole set of services.

Because of these differences, it is a more lightweight CI that runs faster and is less flaky than the previous heavyweight integration test CI.

The CD Pipeline

The new test pipeline has the full CD phase pipeline run by Spinnaker. Below is a brief workflow of it:

  • The default pipeline goes through all the pre-production deployments and validations, including deep integration tests and Automated Canary Analysis (ACA).
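
The gating idea behind a CD pipeline can be sketched in a few lines. This is a toy model in plain Python, not Spinnaker's API or our actual pipeline definition; the stage names and their order are assumptions chosen for illustration.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a check that returns True when the stage succeeds.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that passed.
    """
    passed = []
    for name, check in stages:
        if not check():
            break  # later stages, such as the production deploy, never run
        passed.append(name)
    return passed


if __name__ == "__main__":
    # Hypothetical stage order; the lambdas stand in for real validations.
    stages = [
        ("deploy to pre-production", lambda: True),
        ("deep integration tests", lambda: True),
        ("ACA", lambda: False),  # simulate a failed validation
        ("deploy to production", lambda: True),
    ]
    print(run_pipeline(stages))
```

The point of the sketch is that every pre-production validation acts as a gate: a change reaches production only if all earlier stages succeed.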

Key Takeaways

The new test pipeline has proven to be more effective than the prior one. Below are its key takeaways.

It scales without being limited by the number of services supported.

The prior test pipeline was not scalable because it was a single integration test CI for all the involved services. The runtime of the CI increased as the number of supported services increased. In the new test pipeline, each service defines its own test pipeline and has its own tests, so each service can be tested separately. Adding test support for a new service means setting up a new test pipeline for that service, with no impact on existing test pipelines.

It is more maintainable as tests are formed into a test pyramid

The prior test pipeline became increasingly unstable as more heavyweight deep integration tests were added which were by nature more likely to be flaky and harder to maintain. The new test pipeline forms the tests into a test pyramid. There are more lower-level tests that are easier to maintain and less flaky than higher-level tests.

It provides higher developer productivity

Because the CI now runs faster, lower-level tests instead of the slower, higher-level tests of the prior pipeline, engineers can merge their code faster. When there is a failure, the CI fails fast, so engineers get quick feedback. Engineers can also debug a failed lower-level test more easily, as they only need to check one service.

It provides more comprehensive testing

By having different levels of tests, the new test pipeline actually has better test coverage than the prior one.

  • It encourages engineers to write more small tests than big ones. Small tests are easier to write and maintain than big ones, so engineers can write and maintain more tests, which provides better coverage.

It follows and encourages test best practices

The test pyramid rules encourage engineers to split tests into smaller pieces and push them as far down the test pyramid as they can. Smaller tests are easier to read and easier to write cleanly. The rules also discourage test duplication: a higher-level test that spots an error no lower-level test catches is replaced with a lower-level test.

Looking Forward

We gradually rolled out the new test pipeline within our group. It was also used by some other groups with similar business scenarios. However, there is still tremendous work to do. Different teams in Airbnb may have different test requirements for the test process, tools and environments, due to their unique tech stacks and business scenarios. For example, for some teams it may be more suitable for them to use traffic replay for testing.

Our Continuous Integration, Continuous Delivery, and Developer Productivity teams are working on a generic pre-production testing environment that can meet the testing needs of different teams. They are also working on a generic test running infrastructure that can run all types of tests in the same way, whether they are written in Java, Ruby, JavaScript, or any other language. More details of this generic test infrastructure support will be shared in the future.

Many thanks to Junjie Guan, Byron Grogan, Jens Vanderhaeghe, and Gilbert Huang for providing the Mock framework, the CI/CD platform support, and the environment support for the test pipeline. Many thanks to my manager Alice Liang and my colleagues Michel Weksler, Gary Leung, Jason Jin, Jacob Zhang, Dipak Pawar, Eric Yu, Hosanna Fuller, and Saleh Rastani for their support of the project.

The Airbnb Tech Blog