Airbnb is moving its infrastructure towards SOA. A reliable, performant, and developer-friendly service platform is essential in an architectural evolution. In a previous post, we shared a bird’s-eye view of what we have designed and built to scale the development of services and gave a glimpse of how the Thrift service IDL-centered service framework had helped to increase development velocity.
As engineers build more services, it is critical that the services adopt a consistent set of platform-wide standards and practices. In the second of this series of posts, we share how we build a service platform that automates and enforces standards and best practices.
Why A Standard Service Platform
First, let’s see what developers may choose to do in their Java projects when there are no standard practices to follow:
- POJO design follows different patterns in different projects: some are immutable objects; others are not. Some use builder patterns; others have numerous constructors. Some use Lombok; others write getters and setters manually.
- JSON-object mappers are configured differently depending on the service. Object serialization and deserialization may fail due to inconsistencies.
- An end-client request has various contextual information associated with it, but not all services propagate the request context when calling downstream services.
- Services emit non-standard server-side or client-side metrics, and service owners create vastly different metrics dashboards.
- Services have inconsistent alerts coverage for health monitoring and anomaly detection; alerts are sometimes missing for newly created endpoints.
- Different database client and HTTP client libraries are used, and metrics and logging are missing in the wrapper code.
- Service clients use different RPC timeout, retry, and circuit-breaking logic.
All these boil down to one word: inconsistency. A variety of issues surface as consequences. Developers must budget extra time when building new services for plumbing work instead of focusing on the business logic; mismatched JSON serialization code results in server-side errors that take a long time to track down and fix; SREs find it painful to monitor services and debug issues due to inconsistent or missing metrics and alerts.
As more services are built, it is critical that they all adopt published standards and consistent practices on the service platform. However, merely documenting standards and recommending best practices for engineers to follow does not work. Engineers often forget them or push them off until later (and then forget) when focused on coding business logic to meet product deadlines. The most effective way to enforce adoption of standards and practices is to build them into the service platform on which the engineers develop their services.
What We Built
We use a heavily-customized Thrift IDL to define a service’s API: its interface and request/response data schema. In the service IDL-based Java service development flow, the developer defines the service API in .thrift files, from which service-side code and RPC clients are auto-generated with the Airbnb service platform’s standard instrumentation, which enforces Airbnb’s infrastructure standards and practices.
Request and Response Context
A user request to the Airbnb web application has a variety of contextual information associated with it. At the bare minimum, it contains a unique user id (for a logged-in user) or a visitor id (for a logged-out user). Richer context data include user IP address, user locale, country, currency, browser, and device type. There are many uses for the request context data:
- Feature rollout based on user/visitor id, locale, and country.
- Experiment strategies employing user id, country, locale, or currency.
- Authorization policy check based on user id, IP address, and country.
- Service framework rate-limiting based on user id and API endpoint.
In the past, request context was available in the monolithic Airbnb Rails application but was not propagated to downstream services. That was problematic. For example, not having the user id made it difficult to do guest-side experiments or incremental feature rollouts that required the coordination of several services. In the SOA world, the request context should be propagated along with a request. The request context also enables standard service platform security policy checks and rate-limiting. For service reliability and resilience, the request context is used to propagate a distributed service RPC time budget (for response deadlines and retries).
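To make the time budget idea concrete, here is a minimal sketch of how a budget could shrink as a request travels down the call graph. The class `TimeBudget` and its methods are hypothetical names chosen for illustration, not the platform's actual API; the sketch also assumes each hop tracks the deadline on its own clock and would forward only the remaining duration in the request context.

```java
import java.time.Duration;

/** Hypothetical sketch: a time budget created at the API layer and consumed by downstream calls. */
final class TimeBudget {
    // Absolute deadline on the local clock, in nanoseconds.
    private final long deadlineNanos;

    private TimeBudget(long deadlineNanos) {
        this.deadlineNanos = deadlineNanos;
    }

    /** Starts a budget of the given size at the given local time. */
    static TimeBudget startingNow(Duration total, long nowNanos) {
        return new TimeBudget(nowNanos + total.toNanos());
    }

    /** Time left before the deadline; never negative. This is the value a caller would forward. */
    Duration remaining(long nowNanos) {
        return Duration.ofNanos(Math.max(0L, deadlineNanos - nowNanos));
    }

    /** Timeout for one downstream call: the smaller of the per-call default and the remaining budget. */
    Duration timeoutFor(Duration perCallDefault, long nowNanos) {
        Duration left = remaining(nowNanos);
        return left.compareTo(perCallDefault) < 0 ? left : perCallDefault;
    }

    /** A retry is only attempted if the remaining budget covers a minimum useful attempt. */
    boolean canRetry(Duration minAttempt, long nowNanos) {
        return remaining(nowNanos).compareTo(minAttempt) >= 0;
    }
}
```

In real code the current time would come from `System.nanoTime()`; it is passed in explicitly here so that the behavior is easy to reason about and test.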
In the opposite direction, the service platform should also have contextual information propagated in the response flow of the service RPC call graph. For service reliability and resilience, it is important for a downstream service to signal its upstream service on the current state of the RPC so that the best retry and back pressure decisions can be made. At some point in the call graph, a service may also want to assess the request for end-user trust and safety risks. If a risky request is detected, an error can be propagated upstream, and the API layer can hard-block a suspicious user from performing further action. These response contextual data are complementary to the response itself and are required for a great many use cases.
In our service IDL-centered development, request and response context schemas are defined with Thrift structs, from which both Java and Ruby classes are generated. This guarantees consistent request and response context across Java and Ruby services. Context data are passed through HTTP request/response headers so that they are separate from the actual payload. On the server side, a request context middleware extracts headers and creates a request context object that is accessible to the server business logic code; likewise, a client-side middleware creates request context headers and extracts response context headers. A request context is instantiated for each end-user request in the API layer and is passed along the entire request flow on the service platform.
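The header-based propagation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the platform's middleware: the header names (`x-ctx-user-id` and so on), the three-field `RequestContext` class, and the helper `RequestContextHeaders` are all hypothetical, and the real context classes are generated from Thrift structs with more fields.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical, hand-written stand-in for the Thrift-generated request context class. */
final class RequestContext {
    final String userId;   // null for a logged-out visitor in this sketch
    final String locale;
    final String country;

    RequestContext(String userId, String locale, String country) {
        this.userId = userId;
        this.locale = locale;
        this.country = country;
    }
}

/** Hypothetical helper that maps a context to and from HTTP headers, separate from the payload. */
final class RequestContextHeaders {
    // Assumed header names; the actual names are not given in this post.
    static final String USER_ID = "x-ctx-user-id";
    static final String LOCALE = "x-ctx-locale";
    static final String COUNTRY = "x-ctx-country";

    /** Client side: turn the context into headers to attach to an outgoing RPC. */
    static Map<String, String> toHeaders(RequestContext ctx) {
        Map<String, String> headers = new LinkedHashMap<>();
        putIfPresent(headers, USER_ID, ctx.userId);
        putIfPresent(headers, LOCALE, ctx.locale);
        putIfPresent(headers, COUNTRY, ctx.country);
        return headers;
    }

    /** Server side: rebuild the context from incoming headers, ignoring unrelated ones. */
    static RequestContext fromHeaders(Map<String, String> headers) {
        // HTTP header names are case-insensitive, so normalize before lookup.
        Map<String, String> normalized = new HashMap<>();
        for (Map.Entry<String, String> e : headers.entrySet()) {
            normalized.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        return new RequestContext(
                normalized.get(USER_ID), normalized.get(LOCALE), normalized.get(COUNTRY));
    }

    private static void putIfPresent(Map<String, String> headers, String name, String value) {
        if (value != null) {
            headers.put(name, value);
        }
    }
}
```

A service in the middle of the call graph would call `fromHeaders` on the incoming request and `toHeaders` on each outgoing call, so the context created at the API layer reaches every downstream service unchanged.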
Standard Metrics and Dashboards
At Airbnb, a service owner must complete a production-readiness checklist before the service is launched into production. A critical item in that list is metrics. In the past, service owners emitted different metrics with uneven coverage and created metrics dashboards that varied greatly in completeness, correctness, up-to-date-ness, and interpretation. The old metric naming convention was to prefix all metrics of a service by the service name. For example, the Banana service would have a banana.service.request.count metric. This led to services having different metric names for response latencies and error counts, which further led to each service having its own service dashboard. An additional issue was difficulty in finding the right dashboard and the right graph, especially during an incident when sysops engineers really need it as quickly as possible for fire-fighting.
In the service IDL-driven development, the service resource method boilerplate code and the entire client code are generated from the service’s Thrift IDL. This allows standard service and client metrics to be emitted automatically in the inter-service communication layer, and templated standard dashboards built on those metrics come free and ready-to-use for all services.
All standardized metrics have the same root prefix services_platform, followed by either service or client, and then the metric name. For example, services_platform.service.request.count and services_platform.client.request.count track request counters at the server and client side respectively. It is important to have a uniform metric prefix because it makes it possible to create one standard service dashboard for graphing all standard metrics on Datadog. Standard metrics are decorated with various tags which allow breakdown and drill-down of health and performance metrics of a service’s individual hosts and resource endpoints.
Examples of server-side metrics:
- services_platform.service.request.count, tags: [service, role, host, method, caller]
- services_platform.service.response.count, tags: [service, role, host, method, caller, success]
- services_platform.service.response.[median|p75|p95|p99], tags: [service, role, host, method, caller, success, status_code, status_family]
- services_platform.service.exception.count, tags: [service, role, host, method, exception_class, exception_type]
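As an illustration of the naming scheme, the sketch below assembles a standard metric name and its tags. The helper `StandardMetrics` is a hypothetical name, and the `key:value` tag strings assume the usual Datadog tag format; only the prefix, the service/client segment, and the tag keys come from the lists in this post.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper that builds standard metric names and tags. */
final class StandardMetrics {
    static final String ROOT = "services_platform";

    /** Whether the metric is emitted by the server side or by a generated RPC client. */
    enum Side { SERVICE, CLIENT }

    /** Example: name(Side.SERVICE, "request.count") gives services_platform.service.request.count. */
    static String name(Side side, String metric) {
        return ROOT + "." + side.name().toLowerCase(Locale.ROOT) + "." + metric;
    }

    /** Tags attached to a request counter, in the order listed above. */
    static List<String> requestTags(
            String service, String role, String host, String method, String caller) {
        return Arrays.asList(
                "service:" + service,
                "role:" + role,
                "host:" + host,
                "method:" + method,
                "caller:" + caller);
    }
}
```

Because the service name moves from the metric name into a tag, one templated dashboard can graph the same metric for any service by filtering on the `service` tag.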
Standard server-side dashboards are provided for all server-side metrics with the services_platform prefix. They are templated so that different services can use the same dashboard, and template variables allow drilling down into host and method metrics.
Client-side metrics are emitted in IDL-generated Ruby and Java RPC clients, and they mirror their server-side counterparts:
- services_platform.client.request.count, tags: [service, role, host, method, caller]
- services_platform.client.response.count, tags: [service, role, host, method, caller, success]
- services_platform.client.response.[median|p75|p95|p99], tags: [service, role, host, method, caller, success]
- services_platform.client.exception.count, tags: [service, role, host, method, caller, exception_class, exception_type]
Matching standard client-side dashboards are provided as well.
Service API Alerts
Managing a robust service requires close monitoring of the service’s resource endpoint health metrics, and alerting on anomalies in those metrics is an essential part of that. However, in the past it was entirely up to the service developers to manually write alerts corresponding to each endpoint. We often observed that:
- New service resource endpoints were added without corresponding alerts.
- Coverage was inconsistent: for example, some endpoints alerted only on latency while others alerted only on error rate.
Consequently, these gaps in alert coverage resulted in service owners not being paged when anomalies occurred; they learned that their service was failing only much later, from sysops or upstream client service owners. On the standard service platform, the availability of standard server and client metrics makes it possible to automate the generation of service alerts.
In the service IDL-centered service development, a .thrift file defines the service’s interface: a set of strongly-typed resource methods. It is the source-of-truth for the API of a service, and therefore the ideal place to define standard API alerts. In our modified version of the Apache Thrift compiler, we added service- and method-level resource annotations so that the modified code generator also generates a number of standard alerts for each method.
A service developer can add service API alerts using the following simple procedure:
- Add alert annotations in the service’s IDL. The alert annotations can be at both method level and at the service level.
- Indicate the desired alerting threshold values for the standard set of service alerts via the annotations.
- Run the service IDL alert generator tool to create a full set of standard alert files from the service’s IDL file.
- In the absence of alert annotations, the service IDL alert generator creates alerts using default threshold values.
The standard service alerts include high p95 latency, high p99 latency, high error rate, and low QPS alerts at the method level for each method, and high error rate and low QPS alerts at the service level. This ensures coverage for all endpoints for every service on the service platform.
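The generation step can be sketched as follows, assuming the method names and any annotated thresholds have already been parsed out of the IDL. Everything here is hypothetical: the names `AlertKind`, `Alert`, and `AlertGenerator`, the `"*"` key for service-level overrides, and especially the default threshold values, which are placeholders rather than the platform's real defaults.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** The standard alert kinds named above, with placeholder default thresholds. */
enum AlertKind {
    HIGH_P95_LATENCY_MS(500.0, false),
    HIGH_P99_LATENCY_MS(1000.0, false),
    HIGH_ERROR_RATE(0.01, true),
    LOW_QPS(1.0, true);

    final double defaultThreshold;   // assumed value, for illustration only
    final boolean alsoServiceLevel;  // error rate and QPS alerts also exist per service

    AlertKind(double defaultThreshold, boolean alsoServiceLevel) {
        this.defaultThreshold = defaultThreshold;
        this.alsoServiceLevel = alsoServiceLevel;
    }
}

/** One generated alert: what it watches, which kind it is, and its threshold. */
final class Alert {
    final String target;
    final AlertKind kind;
    final double threshold;

    Alert(String target, AlertKind kind, double threshold) {
        this.target = target;
        this.kind = kind;
        this.threshold = threshold;
    }
}

final class AlertGenerator {
    /** Key under which service-level overrides are looked up in this sketch. */
    static final String SERVICE_LEVEL = "*";

    /** Produces every standard alert; annotated thresholds win, defaults fill the gaps. */
    static List<Alert> generate(
            String service, List<String> methods, Map<String, Map<AlertKind, Double>> overrides) {
        List<Alert> alerts = new ArrayList<>();
        for (String method : methods) {
            Map<AlertKind, Double> forMethod =
                    overrides.getOrDefault(method, Collections.emptyMap());
            for (AlertKind kind : AlertKind.values()) {
                double threshold = forMethod.getOrDefault(kind, kind.defaultThreshold);
                alerts.add(new Alert(service + "." + method, kind, threshold));
            }
        }
        Map<AlertKind, Double> forService =
                overrides.getOrDefault(SERVICE_LEVEL, Collections.emptyMap());
        for (AlertKind kind : AlertKind.values()) {
            if (kind.alsoServiceLevel) {
                double threshold = forService.getOrDefault(kind, kind.defaultThreshold);
                alerts.add(new Alert(service, kind, threshold));
            }
        }
        return alerts;
    }
}
```

Because the loop covers every method in the IDL regardless of whether it carries annotations, a newly added endpoint gets the full set of alerts with default thresholds without any extra work from the developer.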
Because the thresholds are defined in the IDL alongside the methods themselves, it is easy to find the alerts for each resource method. Packaging alert definitions together with method definitions also makes it clear that alerts are a first-class item in service development.
When more services are launched, the operational overhead of monitoring the site and debugging issues increases. If services have inconsistent metrics and varied alerting coverage, the difficulty of maintaining the reliability and availability of Airbnb’s core booking flow increases even more.
Standard health and performance monitoring is not only key to maintaining, debugging, and improving our services; it is also critical in every service deployment to ensure that no regressions happen. Generating alerts on the standard service metrics allows service owners and sysops to be paged as early as possible. The standard service platform encourages and enforces infrastructure standards and best practices across all services without incurring additional development overhead. It makes standard specifications like request context, response context, and mutual TLS easily available to service developers. The service IDL-centered service platform enables engineers to focus on writing service business logic rather than plumbing and monitoring work. A conservative estimate of the time saved by using the service platform with the standardization features described here, compared with the previous service development process, is 2–3 weeks.
In the next post, we will delve into details on how we help service owners analyze and manage performance on the platform. Stay tuned!
If you enjoyed reading the post and are interested in working on services infrastructure, the production platform team is always looking for talented engineers to join the team.
Many thanks to Junjie Guan, Victor Peng, Xing An, Mike Parker, Weibo He, and Mingfei Cai for contributions to the work presented in this post. Many thanks to Charlie Zhou, Lynn Lu, Ke Pan, Qianqian Zhong, Tang Zhang, Fengming Wang, Fenglin Liao, Tiffany Low, Jimmy Ngo, Paul Baumstarck, Brian Wolfe, and Swapnil Ralhan for valuable feedback.