Building Services at Airbnb, Part 3
In the third post of our series on scaling service development, we dive into resilience engineering practices built into the standard service platform that powers the new Services Oriented Architecture at Airbnb.
Airbnb is moving its infrastructure towards a Service Oriented Architecture. A reliable, performant, and developer-friendly polyglot service platform is an underpinning component in Airbnb’s architectural evolution. In Part 1 and Part 2 of our Building Services series, we shared how we used Thrift service IDL-centered service framework to scale the development of services; how a standardized service platform encourages and enforces infrastructure standards; and how to enforce best practices to for all new services without incurring additional development overhead.
Service oriented architecture cultivates ownership and boosts development velocity. However, it imposes a new set of challenges. The system complexity of distributed services is much higher than that of monolithic applications, and many techniques that used to work in monolithic architecture are no longer sufficient. In this post, we share how we built resilience engineering into service platform standards and helped service owners improve the availability of their services.
Resilience is a Requirement, Not a Feature
In distributed services architecture, service resilience is a hard requirement. Each service’s ability to respond and avoid downtime decreases as the inter-services communication complexity increases. As an example, the Airbnb Homes PDP (Product Detail Page) needs to fetch data from 20 downstream services. Assuming we didn’t take measures to improve our resilience, if these 20 dependent services each had 99.9% availability, our PDP page would only have 98.0% uptime, or about 14.5 hours of downtime each month.
Nothing makes the impact of resilience more clear than real-world production incident root cause analysis. In these samples from past production incident postmortems, we can see some common reliability issues.
“We saw multiple waves of increased traffic (external as well as compounded by retries upon request failures). This resulted in a sustained period of reduced availability of our OAuth service from Monorail’s perspective. This affected most of the APIs involved in login flows and resulted in site-wide authentication issues for users thereby impacting core business metrics. Monorail error rates spiked to 25% initially over a period of 5 minutes and then again to 5% over a period of 36 minutes. API errors spiked significantly in the form of increased 500 errors during the outages. This was seen in the form of a dip in impression rates for P2/P3, messages created and other business metrics.”
“As each service box reached its limit, it started to timeout more often and caused the upstream P2 service to retry. Aggressive retries from the P2 service quickly doubled its traffic to the service. As a result, the service attempted to create too many proxy connections, all of which timed out. Latency to all service backends spiked, causing health check failures. Eventually, all P2 service’s boxes are marked down.”
“During the time we saw a 10–20x QPS spike and service request latency increased more than 10-fold at P99. Consequently, upstream clients experienced request failures. The service uses a hashing algorithm that is computationally intensive, and we think the root cause was the spike of requests caused a sharp increase of CPU utilization, which starved resources for other requests and caused health reporting to mark individual service nodes as unhealthy. This localized change can have cascading effects on the rest of the fleet as other node would have to take on more traffic.”
Readers who handled similar systems and incidents must have noticed some common reliability issues: request spikes, system overload, server resource exhaustion, aggressive retry, cascading failures. In the past, common fault tolerant measures such as request timeouts, retries with exponential backoff, and circuit breakers were implemented inconsistently. Our standardized service platform has resilience engineering practices consistently built into all services to reduce the possibility that any single service will be a weak link for the site’s availability.
What We Built
All services on Airbnb’s core booking flow must support high throughput and be fault tolerant to transient failures. Product engineers working on core business services should only need to focus on building new features, not implementing these requirements themselves. The resilience measures that we implemented are well-known patterns and have already prevented downtime in the core booking flow.
Async Server Request Processing
Most of our core booking flow services are Java services. In Part 1, we explained how we extended the Dropwizard web service framework for Java service development. In most Airbnb services, request processing involves multiple RPC calls to downstream services or databases. The version of Dropwizard we were using had synchronous request processing, in which the server threads wait idly while the I/O-intensive work are processed by the worker thread pools. As a result, our services were susceptible to underutilizing resources when waiting on network I/O, unable to absorb bursty traffic, and vulnerable to request overloads or retry storms.
Leveraging the async response feature in the latest Dropwizard framework, we built an end-to-end asynchronous request processing flow that enables writing highly concurrent, non-blocking applications. It connects Dropwizard/Jersey 2 async response, our in-house built async data loader framework, and async service IDL clients for inter-service communication. Asynchronous request processing has a few key benefits:
- Increased throughput: The main I/O threads submit requests to the async executor and returns to accept new inbound requests. Fewer I/O threads can handle more concurrent requests, and better absorb spiky traffic.
- Starvation prevention: A service may have a few endpoints that are I/O or CPU intensive. If a node receives too many requests to these expensive endpoints, it will starve simple and fast requests while trying to handle the expensive ones. With async request process, the main I/O threads are available to accept all requests and dispatch them to request queues of different weights. Expensive and cheap requests can execute concurrently.
Every backend server has a limit to the number of requests that it can handle within its defined service level objectives for latency. This limit is a function of limited resources: CPU, memory, network bandwidth, file descriptors, disk queue, etc. For most services at Airbnb, the typical workload pattern involves data fetching calls to downstream services and databases; applying lightweight logic to compute derived data; and composing a response from the fetched and derived data. In other words, most services are network I/O-bound, not compute-bound, and therefore are more likely to be limited by memory. To understand how resource exhaustion plays a role in causing widespread failure, let’s look at one of the sample production incident postmortems again:
“As each service box reached its limit, it started to timeout more often and caused the upstream P2 service to retry. Aggressive retries from the P2 service quickly doubled its traffic to the service. As a result, the service attempted to create too many proxy connections, all of which timed out. Latency to all service backends spiked, causing health check failures. Eventually, all boxes for the P2 service are marked down.”
Request queuing is one effective technique to allow services absorb bursty request load and prevent services from failing due to resource exhaustion. More specifically, we implemented a controlled delay (CoDel) queue with adaptive LIFO (last-in first-out), as introduced in the article Fail at Scale.
Controlled Delay Queue
In normal operating conditions, the server is able to process requests as they come and empty the queue within N milliseconds, and the CoDel request queue uses N as the normal request timeout value. If the request queue is not being emptied in N milliseconds, it uses a more aggressive request timeout value. The aggressive timeout helps keep a standing queue from building up and therefore protects the server from stumbling under load it can’t keep up with.
In normal operating conditions, the server processes requests in FIFO order. When a queue is building up, however, it switches to LIFO mode. During a queuing period, last-in requests have better chance of meeting their client-side request deadlines than first-in requests, which have been sitting in the queue for longer time.
Adaptive LIFO works nicely with CoDel queue for high throughput, low latency services on Airbnb’s core booking flow because they are effective latency-preventive measures. Moreover, we see that the custom request queue implementation actually improves single service host throughput and error rate.
We benchmarked the controlled delay queue (yellow) against the previous handler, a bounded thread pool (red). In one high throughput Java service, switching to a controlled delay queue improved single-host throughput and reduced error rate.
The request queue does introduce a slight increase in 95th and 99th percentile latencies during overload condition, as shown in the graphs below.
The production incidents postmortems also cite aggressive client retries more than once. In fact, it is the most common cause of cascading failures because the retry storm does not give an overloaded service any room to recover. Our standard inter-service client implements timeout, retry, and circuit breaker patterns. It frees developers from writing their own retry loops, which are usually the source of retry storms. However, a more effective resilience feature is for the client to know when to stop sending requests altogether.
Service Back Pressure
The server-side request queue status is the best signal of whether a service is overloaded. When the request queue in a service host builds up, it switches to LIFO mode and starts fast-failing inbound requests with a back pressure error until the request queue watermark drops to a safe threshold. When propagated to client side, the smart client will refrain from retrying and fast-fail the upstream request. The service back pressure requires server and client coordination, and results in a faster recovery by consciously failing a fraction of requests. More importantly, it prevents cascading failures that we saw in production incidents due to a combination of service overload and retry storm.
In this standard server dashboard graph, we can see that one host experienced a drastic drop in successful QPS, but then recovered within a 1-minute period. The request queue for this instance built up, and the server started sending service back pressure errors. The smart clients honored the service overload signal by not retrying any requests and tripping the client circuit-breakers immediately. This coordination allowed the server to recover on its own and prevent a potential production incident.
API Request Deadline
Another load shedding technique we use is API request deadline, an implementation of RPC deadline propagation, which allows services to gracefully handle request overloads. Airbnb’s public API endpoints all have carefully set request timeouts tied to their latency SLOs. Each inbound API request has a deadline set by the API gateway using the endpoint level timeout. The deadline is then propagated as the request fans out to backend services. The server framework checks the deadline and fast-fails requests that have reached their deadlines. As we continue to build more resilience engineering practices across the inter-service communication stack, we will be more aggressive with setting API request timeout and enforcing deadlines.
Client Quota-based Rate Limit
Service back pressure protects a service from faltering under excessive load, however, it doesn’t distinguish traffic among different upstream callers. For instance, our listings service has over 30 upstream callers. Any misbehaving caller might send an excessive amount of requests and affect other clients.
Airbnb has experienced several core services incidents because of excessive load from rogue clients. We implemented client quota-based rate limiting in the standard service framework, comprised of a counter service and a dynamic configuration system. A service owner can define client quotas in the dynamic configuration system, and the service will use the distributed counter service to track quotas and rate-limit any abusive clients.
The quota-based rate limiting works well with service back pressure in production. During normal conditions, the service’s capacity is distributed fairly among its upstream callers. When a caller misbehaves, rate limiting protects the service from unexpected load, preventing service back pressure errors for other clients.
Dependency Isolation and Graceful Degradation
In a distributed services architecture, a service will often have multiple downstream dependencies. When the number of dependencies increase, the probability of a single problematic dependency bringing down a service becomes higher. Let’s look at one example of this happening at Airbnb:
“The P3 service has a soft dependency on the content moderation service, because only a tiny fraction of requests would call to it. During the incident, the downstream content moderation service became unresponsive and the P3 service async worker threads weren’t freed in a timely manner. Eventually the P3 service exhausted its thread pools, causing elevated errors for all requests.”
Adopting the Bulkhead pattern for our requests could have mitigated this issue. We implemented dependency isolation in the async request executor framework and the inter-service IDL clients on the service platform, similar to the approach of Netflix’s Hystrix. Separate dependencies have separate async worker thread-pools, so that one problematic downstream dependency will saturate only the corresponding worker thread pool. If a problematic dependency happens to be a soft dependency, such as trending information on the product detail page, the service’s availability will not be affected except serving partial responses.
Outlier Server Host Detection
Server host failures are a frequent threat to uptime, as we saw in our postmortems. When you have a large fleet of servers in production, it is only a matter of time before some server hosts start acting erroneously.
“One service instance experienced abnormal status starting at 7:45 pm PDT. QPS for the instance dropped, and the instance’s error rate spiked. The service team’s alert was not triggered until 11:49 pm. At 12:40 am, the on-call engineer discovered the problematic instance and rebooted it to resolve the issue.”
This incident highlighted three tools we needed to improve: monitoring, detecting, and alerting of outlier servers. The combination of dropped QPS and elevated error rate of the faulty server was a clear signal. However, it was not captured and acted upon by our service discovery stack, instead requiring human intervention. On-call engineers had to search across fragmented dashboards to correlate symptoms to the root cause before rotating the faulty server host. As a result, the MTTR (mean time to repair) time for this type of scenario was much worse than it could have been.
Client-side smart load balancing is an automated resilience measure that can prevent these scenarios from happening at all. In client-side smart load balancing, the client maintains a list of server hosts. It tracks success rate and response latency of each host, and avoids routing requests to outlier hosts with low success rate and elevated latency. We added Envoy to our service discovery stack, which supports outlier detection.
In production, when a server host has higher error responses than normal, the outlier detection quickly detects and ejects it, and the host immediately gets less traffic. The automated resilience measure kicks in less than one minute after a server becomes faulty, and service on-call engineers don’t need to be paged at all.
In a service oriented architecture, the inter-service communication complexity increases exponentially as the number of services and the depth of the stack grows. The techniques for maintaining site reliability and availability are different from those of a monolithic architecture. We shared some of the resilience engineering best practices that we built into our standard service platform. They are implemented in the server framework and client libraries cohesively. They are easy to configure and simple to use. We believe that resilience is a requirement instead of an optional feature in a distributed services architecture. This work has effectively prevented a growing number of potential incidents.
Individual services should maintain their SLOs (service level objectives) during all circumstances, whether it’s deploys, traffic surges, transient network failures, or persistent host failures. Standard service resilience engineering can help service owners with that goal. Resilience engineering is instrumental to the entire body of works to help service owners configure their SLO, service availability verification testing, capacity planning, etc. This post covered the low hanging fruit. In our next post, we will share what we built and plan to build for enforcing SLO; failure injection testing; reliability and availability verification testing; and chaos engineering. Stay tuned!
Thanks to Frank Lin, Xing An, Junjie Guan, Mingfei Cai, Mousom Dhar Gupta, Sonia Stan, Jason Jian, Austin Zhu for their contributions to the libraries that are used in our resilience engineering projects. Thank you to Dylan Hurd for editorial review.
If you enjoyed reading the post and feel you will enjoy working on service infrastructure, the production platform team is always looking for talented engineers to join the team.