Building an Effective Test Pipeline in a Service Oriented World

Joey Ye
The Airbnb Tech Blog
Feb 4, 2020


Learn about how we built an integration test pipeline for the testing of critical business flows spanning across multiple services in Airbnb.

This staircase connects several floors of our beautiful new San Francisco office, 650 Townsend.


Over the past 2 years, Airbnb engineering has been working on a major initiative to migrate from a gigantic Rails Application to a decoupled service-oriented architecture, or SOA, as we call it internally. As a result of the migration, some critical business flows that used to live in a monorepo are now converted to separate SOA services. Testing these services became a challenge:

  • Engineers need to thoroughly test these services as a whole before any change to them deploys to production, and this thorough testing takes time.
  • Airbnb cares a lot about quality and functional correctness. We require Airbnb engineers to thoroughly test critical business flows before they reach production, to avoid any potential impact on Airbnb hosts or guests.
  • However, engineers also want high developer productivity. It is one of the main reasons Airbnb is migrating to SOA: different teams can own different services and iterate on them quickly and independently.

In this blog post, we’ll talk about how we solve the testing challenge we face in an SOA world by building an effective test pipeline for some of our most complex business flows. We will share the workflow of our prior test pipeline. We will illustrate its challenges and issues and describe the new test pipeline.

The Prior Test Pipeline: A Pure, Continuous Integration Pipeline Without Continuous Delivery

Our prior test pipeline was built to support the test requirements of the SOA migration.

The SOA migration extracted critical business flows, which lived in a monorepo, into separate SOA services. We have different types of services that we needed to migrate:

  • Some are services that have a database and expose RESTful APIs;
  • Some are also Kafka producers and consumers;
  • Some may trigger jobs to be executed by a job scheduling service, immediately or with a delay.

There are complex interactions between these migrated services, through synchronous API calls, asynchronous event processing, or job execution. Testing of the SOA migration needed to ensure there was no end user impact during or after the migration. We needed to ensure the following invariants were maintained:

  • All business flows work the same as before for Airbnb guests and hosts. The underlying tremendous change shall be unnoticeable to them.
  • There shall be no unexpected bad changes to production that may cause end user impact.

The Deep Integration Test CI

To meet the test requirement of preventing end user impact, we chose to enumerate and write as many heavyweight integration tests as possible. These integration tests ensured that the complex service interactions and all business flows worked as expected. The process of writing these tests worked like this:

  • Enumerate as many user scenarios as possible from the end user’s perspective.
  • Extract backend flows from the user scenario enumerations.
  • Convert backend flows into heavyweight integration tests that test complex service interactions.

We refer to these integration tests as “Deep Integration Tests”: tests that validate a service or application together with all of its soft and hard dependencies.

To meet the test requirement of ensuring that no bad change was deployed to production, we chose to run the deep integration tests before any code change merged to the master branch.

  • At Airbnb, code changes to any backend service that are merged to master will later get deployed to production.
  • There was no formal deployment pipeline to control the deployment order and verification process. There was a chance that a bad master snapshot could get deployed to production before it was deployed and verified in a pre-production environment. To avoid any unexpected bad deployment, the tests needed to be triggered before merging to master.

We created a single integration test project and put all the tests there. We built a single integration test CI that would run all the deep tests. The CI was triggered per commit.

The deep integration test CI workflow was as follows:

  • We built the integration test CI platform on top of Buildkite. Buildkite is a platform for running fast, secure, and scalable continuous integration pipelines on your own infrastructure.
  • The integration test CI needed to test all the related services together, which means any change in the related services would trigger the CI.
  • Each run required environment setup and tear down, which deployed all the related services to the private development environment.

Besides the integration test CI, we also had a Unit Test CI. The Unit Test CI was more efficient and reliable than the integration test CI because it had no environment setup time or network cost.

The prior test pipeline played an important role in preventing bad changes from merging to master. Also, because engineers added many deep integration tests and had all of them run in CI at pre-merge time, they were more confident in rolling out traffic in production during the SOA migration.

The Scaling and Maintenance Challenges

The prior test pipeline worked well when the number of supported services and tests was relatively small. However, it did not work well as more services and tests were added.

It negatively impacted developer productivity

  • The CI runtime became long. The more services and tests supported in the CI, the longer the runtime became. The runtime soon became a bottleneck of developer productivity.
  • The CI was not stable. Most of the tests were deep integration tests that had many dependencies. There were not only steps that involved synchronous API calls, but also dependencies on asynchronous event producers and consumers, as well as job scheduling. As a result of these heavy steps, some of the tests became flaky.
  • The time to write an integration test was long. Engineers needed to set up a local test environment in order to write a new test and run it locally first. As more services were added, the time required to set all of them up grew.
  • The time to debug a failed integration test was long. A deep integration test may touch a lot of services. When one failed, it was very hard for engineers to tell which service logs they should investigate.
  • Engineers were not able to get feedback quickly. For example, they waited for a long time at pre-merge time, and found the CI failed only because a certain service failed to start due to an invalid config change. This could have been caught by a simple service-level CI that only started that service and ran some simple endpoint tests, not necessarily by a complex deep integration test CI.

It’s hard to scale and maintain

The CI was designed to test all related services together. All services were deployed and all the tests were run during the CI process. This made the CI hard to maintain and scale out: CI runtime and stability soon became a problem when it tried to support many services and tests.

It negatively impacted test best practices

Industry test best practices around testability tend to think about testing as a Pyramid. A large number of fast, reliable small tests should be at the base of the pyramid. Moving toward the top of the pyramid, tests begin to increase in complexity and runtime, but the number of them decreases.

However, the prior test pipeline tended to encourage engineers to write more heavyweight, deep integration tests that are hard to maintain and slow to run. These deep integration tests belong at the top of the test pyramid, where there are few of them, rather than in the middle.

It lacked CD validations

All the validations were done at pre-merge time. This lack of validations at CD time encouraged a bad engineering practice: engineers tested directly in production after they merged their code to master, instead of in pre-production. Another side effect of not having CD validations in a pre-production environment was that there was no effective way to monitor and guarantee the health of that environment.

Our Goals

As Airbnb grows, there are more SOA services joining the critical piece. Always trying to test the whole critical flow together for each commit pre-merge was not a scalable and maintainable approach. Our test pipeline needs to be more adaptable to the SOA world.

  • It should follow industry test best practices: Test Pyramid. Write tests with different granularities. The more high-level you get, the fewer tests you should have.
  • It should run different levels of tests at different stages. Lower level tests should be run at earlier stages.
  • Having different levels of tests ensures the whole critical piece is thoroughly tested in different ways before production.
  • Running different levels of tests at different stages ensures fast feedback, which is good for developer productivity.
  • It should be scalable and maintainable. This scalability and maintainability should not be impacted by more services supported or more tests added.
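
To illustrate the goal of running lower-level tests first for fast feedback, here is a minimal Python sketch. The stage names and the `run_stages` helper are hypothetical, and the lambdas stand in for real test suites; this is not how our pipeline tools are actually configured.

```python
from typing import Callable, List, Tuple

# Hypothetical stage: a name plus a callable that returns True when its tests pass.
Stage = Tuple[str, Callable[[], bool]]


def run_stages(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that were actually executed.
    """
    executed = []
    for name, run in stages:
        executed.append(name)
        if not run():
            break  # fail fast: skip the slower, higher-level stages
    return executed


# Lower-level (faster) tests come first; the middle stage simulates a failure.
stages = [
    ("unit", lambda: True),
    ("shallow_integration", lambda: False),
    ("deep_integration", lambda: True),
]
print(run_stages(stages))
```

In this sketch the deep integration stage never runs, because the cheaper shallow integration stage already reported the failure.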

The New Test Pipeline: A Full Test Pipeline With both Continuous Integration and Continuous Delivery

A brief overview

Below is a full diagram of the new test pipeline.

There are two phases of the pipeline: Continuous Integration (CI) and Continuous Delivery (CD).

  • During the CI phase, it runs unit tests and shallow integration tests. Shallow integration tests validate the service in complete isolation from its hard and soft dependencies, which is more lightweight.
  • It still uses BuildKite as the pipeline tool and Private Development environment as the test environment.
  • During the CD phase, it runs deep integration tests. These are the integration tests that used to run at pre-merge CI time in the prior test pipeline, but there are now fewer of these higher-level tests because we follow the test pyramid best practice.
  • It uses Spinnaker as the pipeline tool. Spinnaker is an open source continuous delivery platform that provides real control of the deployment order as well as flexible, customized verification steps after each deployment. With Spinnaker, Automatic Canary Analysis (ACA) is enabled as well; it is the verification step after deploying to Canary.
  • It uses Staging as the test environment. This is the shared pre-production environment of Airbnb.

Test Pyramid

The new test pipeline runs different levels of tests at different test stages. The different levels of tests form the test pyramid. Below is a diagram of the test pyramid used in Airbnb.

As we go up from the bottom to the top of the pyramid, the scope of the test becomes larger, which means it is more real and end to end. At the same time, the test becomes slower, less reliable, and harder to debug because more complex steps are involved and more components are touched.

In the context of implementing the test pyramid, there are two principles:

  • If a higher-level test spots an error and there’s no lower-level test failing, you need to write a lower-level test.
  • Push the tests as far down the test pyramid as you can.

Here is how these principles translate to our testing practices:

  • Use unit tests to test business logic within service.
  • Use Shallow Integration Tests to test service behavior with complete isolation from other dependencies. This type of test is designed not to be impacted by any of the service’s dependencies because the major purpose of shallow integration test is to verify this single service only, thus all its dependencies should be mocked.
  • The mock framework used for mocking the dependent services allows you to define fixture data in yml format that includes a set of request and response pairs for a specific API. It also defines match rules to match your service’s API request with an expected API response from the fixture data. With the framework enabled, a service’s calls to its dependent services can be mocked without sending real network requests, and instead the matched response is returned directly. Below is the basic format of the fixture data:
  • Use deep integration tests to test service behavior with service interactions. But engineers should follow the two rules of the test pyramid mentioned above.
  • Before adding a deep integration test, check whether the scenario it tests is already covered by other deep integration tests, and whether it can be broken into smaller pieces that fit into unit tests or shallow integration tests.
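
To make the request and response matching idea more concrete, here is a minimal Python sketch. It is an illustration only: the fixture fields (`request`, `response`, `method`, `path`), the `match_response` helper, the sample listing data, and the exact-match rule are all assumptions, not the mock framework's actual format or API. The real framework reads its fixture data from yml; plain Python dictionaries are used here to keep the sketch self-contained.

```python
from typing import Any, Dict, List, Optional

# Hypothetical fixture data: request/response pairs for one dependent API.
FIXTURES: List[Dict[str, Any]] = [
    {
        "request": {"method": "GET", "path": "/listings/42"},
        "response": {"status": 200, "body": {"id": 42, "title": "Loft"}},
    },
    {
        "request": {"method": "GET", "path": "/listings/404"},
        "response": {"status": 404, "body": {"error": "not found"}},
    },
]


def match_response(
    request: Dict[str, Any], fixtures: List[Dict[str, Any]]
) -> Optional[Dict[str, Any]]:
    """Return the response of the first fixture whose request fields all
    equal the corresponding fields of the incoming request.

    Assumed match rule: fields present in the fixture must match exactly;
    extra fields on the incoming request are ignored.
    """
    for fixture in fixtures:
        expected = fixture["request"]
        if all(request.get(key) == value for key, value in expected.items()):
            return fixture["response"]
    return None  # no match; a real framework might raise an error instead


# The service under test "calls" its dependency; no network request is sent.
incoming = {"method": "GET", "path": "/listings/42", "headers": {"x-trace": "abc"}}
print(match_response(incoming, FIXTURES))
```

Because the matched response is returned directly, a shallow integration test built this way exercises only the service under test, whatever the state of its dependencies.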

The Shallow Integration Test CI

The new test pipeline runs shallow integration tests at CI phase. Below is its workflow:

The whole flow works similarly to the deep integration test CI flow in the prior test pipeline. The differences are:

  • It only deploys 1 service, while the prior one deploys the whole piece of services.
  • It is triggered when there is code change of the service tested, while the prior one is triggered when there is code change of any part of the whole piece.
  • It runs shallow integration tests, while the prior one runs deep integration tests.

Because of these differences, it is a more lightweight CI that runs faster and is less flaky than the previous heavyweight integration test CI.
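
The difference in trigger scope can be sketched in a few lines of Python. The service names, the directory layout, and both helper functions are hypothetical and exist only to contrast the two trigger rules; they do not describe our actual repository structure or Buildkite configuration.

```python
from typing import Dict, List, Set

# Hypothetical layout: each service lives under its own directory prefix.
SERVICE_DIRS: Dict[str, str] = {
    "payments": "services/payments/",
    "reservations": "services/reservations/",
    "messaging": "services/messaging/",
}


def services_touched(changed_files: List[str]) -> Set[str]:
    """Return the services that own at least one changed file."""
    return {
        name
        for name, prefix in SERVICE_DIRS.items()
        if any(path.startswith(prefix) for path in changed_files)
    }


def prior_ci_targets(changed_files: List[str]) -> Set[str]:
    """Prior rule: a change to any related service deploys and tests all of them."""
    return set(SERVICE_DIRS) if services_touched(changed_files) else set()


def new_ci_targets(changed_files: List[str]) -> Set[str]:
    """New rule: only the changed service's own shallow integration CI runs."""
    return services_touched(changed_files)


change = ["services/payments/app/refund.py"]
print(sorted(prior_ci_targets(change)))
print(sorted(new_ci_targets(change)))
```

Under the prior rule a one-file change to a single service pulls every related service into the run; under the new rule only that service is deployed and tested.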

The CD Pipeline

The new test pipeline has the full CD phase pipeline run by Spinnaker. Below is a brief workflow of it:

  • The default pipeline goes through all the pre-production deployments and validations including deep integration tests and ACA.
  • The emergency pipeline can deploy to production directly, but it is used for emergency fixes only. Other than in an emergency, no bypass of pre-prod deployment or validation is allowed.
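
As a rough sketch of the gating rule behind these two pipelines, the Python below checks whether a production deployment is allowed. The stage names and their order are assumptions inferred from the description above (Staging for deep integration tests, Canary for ACA), and `plan_stages` and `may_deploy_production` are hypothetical helpers, not Spinnaker APIs.

```python
from typing import List, Set

# Assumed stage order; the post names the validations but not every step.
DEFAULT_STAGES: List[str] = [
    "deploy_staging",
    "deep_integration_tests",
    "deploy_canary",
    "automatic_canary_analysis",
    "deploy_production",
]
EMERGENCY_STAGES: List[str] = ["deploy_production"]


def plan_stages(emergency: bool = False) -> List[str]:
    """Return the ordered stages for the default or the emergency pipeline."""
    return list(EMERGENCY_STAGES if emergency else DEFAULT_STAGES)


def may_deploy_production(completed: Set[str], emergency: bool = False) -> bool:
    """Allow production only once every earlier stage in the plan has passed."""
    required = set(plan_stages(emergency)) - {"deploy_production"}
    return required <= completed


print(may_deploy_production({"deploy_staging", "deep_integration_tests"}))
print(may_deploy_production(set(), emergency=True))
```

The first call is rejected because Canary deployment and ACA have not passed; the second is allowed only because the emergency pipeline has no pre-production stages.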

Key Takeaways

The new test pipeline has proven to be more effective than the prior one. Below are its key takeaways.

It scales without being limited by the number of services supported.

The prior test pipeline was not scalable because it was a single integration test CI for all the involved services. The runtime of the CI increased as the number of supported services increased. In the new test pipeline, each service defines its own test pipeline and has its own tests, so each service can be tested separately. Adding test support for a new service means setting up a new test pipeline for that service, with no impact on existing test pipelines.

It is more maintainable as tests are organized into a test pyramid

The prior test pipeline became increasingly unstable as more heavyweight deep integration tests were added, since these are by nature more likely to be flaky and harder to maintain. The new test pipeline organizes the tests into a test pyramid. There are more lower-level tests, which are easier to maintain and less flaky than higher-level tests.

It provides higher developer productivity

The prior test pipeline ran slower, higher-level tests at CI time. In the new pipeline, engineers can merge their code sooner because the CI runs faster, lower-level tests. When there is a failure, the CI fails fast, so engineers get feedback quickly. Engineers can also debug a failed lower-level test more easily, as they only need to check one service.

It provides more comprehensive testing

By having different levels of tests, the new test pipeline actually has better test coverage than the prior one.

  • It encourages engineers to write more small tests than big ones. Small tests are easier to write and maintain than big ones, so engineers can write and maintain more tests, which provides better coverage.
  • It forces engineers to test the functionality at different layers and granularity: internal business logic, service endpoint functionality, inter-service behaviors are all covered.

It follows and encourages the test best practices

The test pyramid rules encourage engineers to split tests into smaller pieces and to push the tests as far down the test pyramid as they can. Smaller tests are easier to read and easier to write cleanly. The rules also help avoid test duplication: when a higher-level test spots an error that no lower-level test catches, the fix is to cover that case with a lower-level test.

Looking Forward

We gradually rolled out the new test pipeline within our group. It was also used by some other groups with similar business scenarios. However, there is still tremendous work to do. Different teams in Airbnb may have different test requirements for the test process, tools and environments, due to their unique tech stacks and business scenarios. For example, for some teams it may be more suitable for them to use traffic replay for testing.

Our Continuous Integration, Continuous Delivery, and Developer Productivity teams are working on a generic pre-production testing environment that can meet the testing purposes of different teams. They are also working on a generic test running infrastructure that can run all types of tests in the same way, whether they are written in Java, Ruby, JavaScript, or any other language. More details of this generic test infrastructure support will be shared in the future.

Many thanks to Junjie Guan, Byron Grogan, Jens Vanderhaeghe, and Gilbert Huang for providing the Mock framework, the CI/CD platform support, and the environment support for the test pipeline. Many thanks to my manager Alice Liang and my colleagues Michel Weksler, Gary Leung, Jason Jin, Jacob Zhang, Dipak Pawar, Eric Yu, Hosanna Fuller, and Saleh Rastani for their support of the project.