Building Services at Airbnb, Part 1:
A Structure for Scaling Service Development
Airbnb is moving its infrastructure at an accelerated pace towards a service-oriented architecture (SOA), but moving from a monolithic Rails service towards SOA while building out new products and features is not without its challenges.
In this post, we share what we have designed and built to scale the development of services: using the same number of engineer-hours to build more backend services that are more robust, more performant, and easier to maintain. This is the first in a series of posts on this topic. Here we present a bird’s-eye view of the approach and overall architecture; subsequent posts will drill down into specific components.
What We Had Previously
At Airbnb, backend services are mostly written in Java using the Dropwizard web service framework, and a number of Airbnb-specific custom Dropwizard filters and modules standardize server-side best practices. In addition, a make-me-a-service tool helps engineers to bootstrap a ready-to-deploy Dropwizard application skeleton, on top of which engineers add RESTful and JSON-over-HTTP service resource endpoints.
However, the above combination leaves much to be desired; major shortcomings service developers bring up include:
- It lacks a clearly defined, strongly typed service interface and data schema. It is often hard to tell what a service does, or how to send requests and receive responses, without examining the source code. Moreover, the lack of service/client contract validation has caused production incidents in the past when contract-breaking changes were made.
- A RESTful service framework does not provide service RPCs; developers have to spend a non-trivial amount of time writing both Ruby and Java clients for their services. Good RPC code is much more than an HTTP client wrapper; it should include robust implementations of standard infrastructure requirements and platform best practices — e.g., passing request contexts, measuring requests and performance metrics, propagating service exceptions and causes, enabling mutual TLS, and having monitoring and alerting for service endpoints. All these require hours, if not days, of additional engineering work.
- JSON request/response data payloads are large and inefficient (e.g. string field names) compared to compact, schematized binary formats for performant inter-service communication, especially for payload sizes at the 95th and 99th percentile.
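The payload-size point in the last item can be illustrated with a minimal sketch. It is not Thrift's actual wire format (Thrift's binary protocols also carry field ids and type tags); it is a stand-in that assumes both sides agree on a fixed field layout ahead of time. The record, its field names, and the helper functions are all hypothetical.

```python
import json
import struct

# Hypothetical record; field names and values are made up for illustration.
LISTING = {
    "listing_id": 12345678,
    "host_id": 87654321,
    "nightly_price_cents": 15900,
    "is_instant_bookable": True,
}

# Stand-in for a schematized binary format: both sides know the field order
# and types in advance, so only the values are sent.
# "<" = little-endian without padding, q = int64, i = int32, ? = bool.
LISTING_LAYOUT = struct.Struct("<qqi?")


def encode_json(record: dict) -> bytes:
    """Self-describing encoding: every field name travels in every message."""
    return json.dumps(record, separators=(",", ":")).encode("utf-8")


def encode_binary(record: dict) -> bytes:
    """Schema-driven encoding: values only, in the agreed order."""
    return LISTING_LAYOUT.pack(
        record["listing_id"],
        record["host_id"],
        record["nightly_price_cents"],
        record["is_instant_bookable"],
    )


def decode_binary(payload: bytes) -> dict:
    """Rebuild the record from the agreed layout."""
    listing_id, host_id, price, bookable = LISTING_LAYOUT.unpack(payload)
    return {
        "listing_id": listing_id,
        "host_id": host_id,
        "nightly_price_cents": price,
        "is_instant_bookable": bookable,
    }


if __name__ == "__main__":
    print("json bytes:  ", len(encode_json(LISTING)))
    print("binary bytes:", len(encode_binary(LISTING)))
```

The binary form here is 21 bytes (8 + 8 + 4 + 1), while the JSON form repeats every field name as a string in each message, so the gap grows with the number of records in a response.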
A Structure for Scaling Service Development
A service’s API includes both the service interface and request/response data schemas. In a REST service world, tools such as Swagger can be used for automating some boilerplate tasks like documenting the REST API and providing basic client code generation. However, compared to a full-fledged service RPC framework such as Apache Thrift and Google gRPC, most of these REST tools lack rich features (e.g. strong type support, request/response validation, efficient payload encoding and transport protocols) that would accelerate development of robust and performant services.
Thrift and gRPC are two popular RPC systems that are widely adopted in industry. However, replacing the HTTP service stack at Airbnb with either one would mean a big infrastructure overhaul that would require a large amount of engineering resources and cause major disruptions to productivity.
After weighing the pros and cons of different technical approaches, we decided to take one that does not disrupt the current Java service development process while opening up the possibility of incrementally improving and evolving the services architecture in the future. Our approach is to keep the Dropwizard service framework, add a customized Thrift service IDL (interface definition language) to the framework, switch the transport protocol from JSON-over-HTTP to Thrift-over-HTTP, and build tools to generate RPC clients in different languages.
We explain in more detail below how this approach works to scale development while minimizing disruption.
Using Thrift as Service IDL
We use Thrift IDL to define a service’s API: its interface and request/response data schema. In the service IDL-based Java service development flow, the developer defines the service API in .thrift files, from which service-side code and RPC clients are generated with Airbnb service platform-standard instrumentation that helps enforce infrastructure best practices. This simplifies service development and allows engineers to focus on writing service business logic rather than on plumbing and monitoring work (e.g., inter-service transports, metrics, alerts, and backward-compatible API management).
Service developers are familiar with the Dropwizard service framework and have built many common libraries for it. Dropwizard services use JSON-over-HTTP (via the Jersey and Jackson libraries), but we want to bring Thrift service IDL to service development. Since it is important that we do not disrupt the current service framework and inter-service communication protocol, we extended Dropwizard with Thrift payload type support so that inter-service communication can transition seamlessly from JSON-over-HTTP to Thrift-over-HTTP.
The use of Thrift as service IDL inherently provides a clear definition and documentation of the service API. Moreover, every field in a data schema has a clearly defined type that is strictly enforced in the generated Java and Ruby data classes. Managing API backward compatibility becomes simpler with built-in schema enforcement and with generated server-side and client-side code that is designed with API changes and backward compatibility in mind.
We also extended the types beyond vanilla Apache Thrift IDL with additions such as Date and DateTime. The rationale for the type extension comes from past experience: if service developers were restricted to too small a set of types, they would end up reshaping the types they want through existing types (e.g., a date as a string), bypassing the automated type-checking and validation and defeating the purpose of having a strongly typed schema.
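To make the "date as a string" problem concrete, here is a minimal sketch. It does not show the generated Java or Ruby data classes; the class names, the `check_in` field, and `parse_reservation` are hypothetical and only contrast a string-typed field with a field that has a real date type.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class StringlyReservation:
    """Date squeezed into a string: any text is accepted."""
    check_in: str


@dataclass(frozen=True)
class TypedReservation:
    """Date carried as a real date type: bad values fail at construction."""
    check_in: date

    def __post_init__(self) -> None:
        # Dataclasses do not enforce annotations at runtime, so check here.
        if not isinstance(self.check_in, date):
            raise TypeError("check_in must be a datetime.date")


def parse_reservation(raw: dict) -> TypedReservation:
    """Parse an incoming payload; date.fromisoformat rejects impossible dates."""
    return TypedReservation(check_in=date.fromisoformat(raw["check_in"]))


if __name__ == "__main__":
    # The string-typed schema silently accepts a date that does not exist.
    print(StringlyReservation(check_in="2018-02-30"))
    # The typed schema rejects it at the service boundary.
    try:
        parse_reservation({"check_in": "2018-02-30"})
    except ValueError as err:
        print("rejected:", err)
```

With a first-class date type in the schema, this check lives in one generated place instead of being reimplemented (or forgotten) by every caller.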
The Service IDL and Extensions Beyond Apache Thrift
Writing a service entails more work than simply implementing server-side business logic. Implementing service RPC clients in different languages, setting up and executing inter-service communication, adding service-side and client-side metrics for monitoring, and adopting standard service platform practices are all tedious, easy to get wrong, and costly in engineering time. Before service IDL, the cost of creating and managing services led engineers and management to shy away from SOA under time pressure.
We want service development with service IDL to be simple and cost-efficient, and to deliver a step-function increase in benefits. For that, vanilla Apache Thrift is not enough. We made extensive customizations to existing Thrift IDL compilers to generate service boilerplate code and RPC clients with many additional features. We believe this is key to accelerating the move of Airbnb’s technical stack toward SOA.
The diagram below shows an example Java service, called Banana, that uses the service IDL, along with a Ruby service and another Java service that use the generated service client to make RPC calls. The components in red are automatically generated/instrumented based on Banana’s service IDL.
Service IDL Dependency and Compiler Integration
Besides writing their own service IDL, a service developer can also declare service dependencies by pointing to other services’ IDL files. Both a service’s IDL files and those of the services it depends on are seamlessly integrated into the existing build process and build tools. Service developers keep working the way they did before, and all the IDL-related code they need is automatically generated and shows up in their IDE without additional steps.
Service Framework Boilerplate Code
We extended the Apache Thrift code generator with Dropwizard resource generation. The generated service resource class and methods come with the necessary Jersey and Jackson annotations to support both JSON and Thrift media types in the HTTP payload, so the service can receive and return both JSON and Thrift request/response data. Thrift binary transport is for performance, and JSON is for compatibility with older services and for easy testing and debugging.
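The dual media-type idea can be sketched in a few lines. This is not the generated Dropwizard/Jersey resource code: the `add` endpoint, the `handle` dispatcher, the `application/x-thrift` media type string, and the fixed-layout binary codec standing in for Thrift serialization are all assumptions made for illustration.

```python
import json
import struct
from typing import Tuple

# Stand-in binary layouts (not Thrift's real encoding): two int32 in, one int32 out.
_REQUEST_LAYOUT = struct.Struct("<ii")
_RESPONSE_LAYOUT = struct.Struct("<i")

# One (decode, encode) pair per supported media type.
CODECS = {
    "application/json": (
        lambda body: json.loads(body),
        lambda result: json.dumps(result).encode("utf-8"),
    ),
    # Media type name is an assumption for this sketch.
    "application/x-thrift": (
        lambda body: dict(zip(("a", "b"), _REQUEST_LAYOUT.unpack(body))),
        lambda result: _RESPONSE_LAYOUT.pack(result["sum"]),
    ),
}


def add(request: dict) -> dict:
    """Business logic: unaware of the wire format or web framework."""
    return {"sum": request["a"] + request["b"]}


def handle(content_type: str, body: bytes) -> Tuple[int, bytes]:
    """Glue code: pick a codec by media type, then call the business logic."""
    codec = CODECS.get(content_type)
    if codec is None:
        return 415, b""  # Unsupported Media Type
    decode, encode = codec
    return 200, encode(add(decode(body)))


if __name__ == "__main__":
    print(handle("application/json", b'{"a": 2, "b": 3}'))
    print(handle("application/x-thrift", _REQUEST_LAYOUT.pack(2, 3)))
```

Because `add` never sees the media type, the same business logic serves a compact binary caller and a human testing with JSON, and the glue in `handle` is the only part that would change under a different web framework.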
Since the glue code between business logic and Dropwizard is generated, service developers can write code that is agnostic of the underlying web server framework. If we upgrade Dropwizard or move away from it, the migration will be easier.
Before service IDL, engineers had to implement clients for other applications that talk to their service. Not only is it time-consuming to write clients in more than one language, but the clients can also quickly become outdated and lose parity (e.g., between Ruby and Java clients). With service IDL, ready-to-use Java and Ruby API clients are completely generated. These generated clients are not the Thrift clients for Ruby or Java that come with vanilla Apache Thrift. They are RPC clients that provide many features (e.g., mTLS, retry, validation, request context) not found in vanilla Apache Thrift.
Service Platform Standard Practices
As engineers build more services, it is important that all services adopt consistent best practices. For example, a service request has contextual information associated with it, and the request context should be propagated along the service RPC call chain. This would be difficult to manage if different services propagated this context in different ways (or not at all), or if the context was defined and obtained in different ways. It would also be a headache for SRE to monitor and debug SOA issues if each service emitted different metrics and had different types of alerts. Emitting standard request and response metrics on both the server side and the client side not only allows for better monitoring inside and outside the service-owning team, but can also be leveraged to create automated service dashboards and alerts. Service clients should also adopt standard RPC timeout, retry, and circuit breaker logic; our generated RPC clients take care of this, so service developers don’t have to implement these RPC resilience features themselves.
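As a rough illustration of what such a standardized client bundles together, the sketch below wraps a transport with context propagation, retries, and uniform metrics. It is not the generated client: `RpcClient`, `TransientError`, the `x-ctx-` header prefix, and the metric names are hypothetical, and timeouts and circuit breaking are left out to keep it short.

```python
import time
from collections import Counter
from typing import Callable, Dict, Optional


class TransientError(Exception):
    """Hypothetical error type for failures that are safe to retry."""


# A transport takes (method, headers, payload) and returns the response bytes.
Transport = Callable[[str, Dict[str, str], bytes], bytes]


class RpcClient:
    """Sketch of a client applying the same practices to every call."""

    def __init__(self, transport: Transport, max_attempts: int = 3,
                 backoff_seconds: float = 0.0,
                 metrics: Optional[Counter] = None) -> None:
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        self.transport = transport
        self.max_attempts = max_attempts
        self.backoff_seconds = backoff_seconds
        self.metrics = metrics if metrics is not None else Counter()

    def call(self, method: str, payload: bytes, context: Dict[str, str]) -> bytes:
        # Propagate the request context the same way on every call.
        headers = {f"x-ctx-{key}": value for key, value in context.items()}
        for attempt in range(1, self.max_attempts + 1):
            self.metrics[f"{method}.attempt"] += 1
            try:
                response = self.transport(method, headers, payload)
            except TransientError:
                self.metrics[f"{method}.failure"] += 1
                if attempt == self.max_attempts:
                    raise  # Out of attempts: surface the error to the caller.
                time.sleep(self.backoff_seconds * attempt)  # Linear backoff.
            else:
                self.metrics[f"{method}.success"] += 1
                return response
        raise AssertionError("unreachable: loop returns or raises")
```

A caller would construct it once per downstream service, for example `RpcClient(transport, max_attempts=3, backoff_seconds=0.1)`, and every call then emits the same metric names and carries the same context headers regardless of which team wrote the service.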
Adding Thrift service IDL to our Java service framework is an important first step towards a reliable, performant, and developer-friendly service platform. It is a centerpiece structure on which many more development-scaling features can be built as Airbnb moves its infrastructure towards SOA.
The new service IDL-driven Java service development process has been widely embraced by product teams since its release. It has dramatically increased development velocity; several new Java services built with service IDL, from the service inception to taking production traffic, took only three weeks — we estimate it to have saved 2–3 weeks of engineering time per Java service over the previous development process. We plan to extend service IDL support to Ruby services as well.
In this post we presented our approach for scaling service development while minimizing disruption, and introduced the service IDL as a centerpiece of this approach. We left out many of the design and implementation details of the various components of the service IDL; they will be covered in more depth in subsequent posts in this series. Stay tuned!
If you enjoyed reading this post and thought this was an interesting challenge, the Production Platform team is always looking for talented engineers to join the team.
Many thanks to Victor Peng, Xing An, Mike Parker, Fenglin Liao, Mousom Dhar Gupta, Weibo He, Rahul Agrawal, Charlie Zhou, Qianqian Zhong, Luca Luo, Fengming Wang, and Jessica Tai for their contributions, support, and feedback.