How Agoda Handles Load Shedding in Private Cloud

Agoda Engineering & Design · Jun 5, 2024

by Johan Tiesinga

Any application that serves customers must consider capacity planning. This involves determining how many instances of a service are required to maintain a satisfactory experience. But what happens when we don’t have enough capacity? This leads us to graceful degradation: how our systems behave when things don’t go as planned.

In this article, we’ll explore load shedding, which involves deciding which traffic to serve when you can’t handle all of it. The reason for having insufficient capacity can vary. We might face unexpected high traffic from a promotion, a malicious attempt to take our service offline, or maybe we’ve rolled out a change that doesn’t scale properly despite our best efforts to catch it in testing.

Understanding Load Shedding Principles

The primary function of our systems is to serve customer traffic. If our systems go down, we can’t serve customers, which in turn prevents them from traveling and exploring the world. Additionally, we aim to optimize which traffic we serve during incidents when we can’t handle all traffic. Not all features are equally important. For example, some promotional banners may not be as critical as the actual search the user is performing. Therefore, we should prioritize customer search and drop less critical functions like banner searches.

Let’s consider the reasons behind these necessary choices. When systems handle more traffic than they can sustain, it often leads to unbounded queue growth on worker threads. This usually results in requests timing out while the server is still processing them, even though the client is no longer waiting. Additionally, client applications often initiate retries, increasing the number of requests we need to handle.

Three Grafana panels showing increasing traffic that reaches the breaking point at 100 requests per second.

For example, in the picture above, the success rate remains stable until it hits approximately 100 requests per second. Beyond this point, as the load increases, the success rate drastically falls to an effective 22 successful requests per second, far below our stable point. This indicates that our system is overloaded and requires protection to limit incoming requests. After enabling our load-shedding tool (indicated by the line change on the mode panel), you can see a sharp recovery, bringing us back to around 100 requests per second and maintaining near-maximum throughput.

Evaluating Capacity

Load shedding is not a new concept at Agoda, but the existing implementation relies heavily on configuration to set bounds on what each pod can handle. This places a significant burden on service owners, who must know the capacity of a single pod. Owners of services using multiple cores have invested considerable effort in measuring their throughput and detecting degradations. However, this data does not always extrapolate accurately to traffic in production.

This is partly because any load testing setup involves many implicit assumptions in the test. The major ones are the simulated traffic and the data used. Fortunately, we can test with real production traffic and data. However, there’s still a significant assumption: the traffic pattern itself. This is usually fine for capacity planning, as user traffic patterns are relatively stable. However, this does not always hold when we run promotions for specific regions that may be more expensive to calculate (for example, regions with more hotel offers than average).

There’s another catch with our Kubernetes clusters: They have varying hardware ages. Different generations of hardware have significantly different throughput characteristics, up to a 30% difference per core. This variability makes the concept of “x replicas in my Kubernetes deployment” an unreliable estimate for total deployment capacity. Smaller replica count services, in particular, may experience noticeable differences in an “unlucky” deployment.

Once system owners had collected enough performance data, we established guidelines to include large buffers around the measured requests/s from load tests to account for the many assumptions in the data. However, this approach introduced problems. The bounds we set on our load-shedding thresholds were often too high. Consequently, load shedding worked well when configured correctly but was ineffective when the configuration was wrong. Unfortunately, this led to several incidents where the load shedding wasn’t aggressive enough to keep our services within their capacity.

Learning From Our Mistakes

This is a good time to revisit our past blog post, “Adaptive Load Shedding,” to understand our journey. With each incident, we gathered valuable data on how our systems behave under peak load and how our “Gatekeeper” (our Load Shedding implementation) reacted and performed. Analyzing this data poses challenges. If the Gatekeeper doesn’t shed enough load, the service will fail to stabilize. But what about when it’s almost right? The service may still appear fine, but could it have handled more traffic? With various configuration options for system owners, the answer often was that the system wasn’t at maximum load yet. We’ve also seen the opposite scenario: Gatekeeper wasn’t allowed to shed enough load. This highlighted the pitfalls of relying on ad hoc solutions and static numbers.

Another flaw observed several times was Garbage Collection (GC) pauses. Since we primarily run JVM-based applications, our load-shedding implementation was just middleware for our favorite HTTP server. When our applications paused for GC, our load shedding halted as well. This had a two-fold effect:

  1. We no longer measured request execution time correctly, meaning we couldn’t determine how long requests were queued before processing. This made it unclear if we failed our latency SLA.
  2. It triggered a burst of failures in the currently processing requests, polluting the statistics that Gatekeeper uses to predict the current peak capacity.

Let’s pause here for a minute. We mentioned, “predict the current peak capacity.” Our middleware integrates tightly with the application’s environment. It monitors CPU usage, GC times, and in-flight requests alongside the success rate. Using this information, it extrapolates the breaking point at 100% utilization. This enables us to pre-emptively shed low-priority traffic (like scrapers) before we hit peak capacity. This is especially powerful when bot traffic spikes, allowing us to start shedding early with nearly zero impact on users.
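
As a rough illustration of that idea (a minimal sketch with assumed thresholds and priority levels, not the actual middleware), the decision boils down to comparing the current load against the extrapolated breaking point and shedding the lowest-priority traffic first:

```rust
// Illustrative only: given a predicted breaking point (extrapolated to 100%
// utilization) and the load being handled right now, decide which priority
// levels to start shedding. The thresholds here are assumptions for this sketch.

struct Prediction {
    predicted_peak_rps: f64, // extrapolated breaking point at 100% utilization
    current_rps: f64,        // throughput currently being handled
}

/// Fraction of the predicted peak already in use (may exceed 1.0).
fn utilization(p: &Prediction) -> f64 {
    if p.predicted_peak_rps <= 0.0 {
        return 0.0;
    }
    p.current_rps / p.predicted_peak_rps
}

/// 0 = accept everything, 1 = shed scraper/bot traffic, 2 = keep only critical traffic.
fn shed_level(p: &Prediction) -> u8 {
    match utilization(p) {
        u if u >= 1.0 => 2, // at or over the predicted peak
        u if u >= 0.9 => 1, // close to the peak: start shedding early, users barely notice
        _ => 0,             // plenty of headroom
    }
}
```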

All of this is very effective when it works. However, during capacity-related incidents, the key questions were always: Do you know your capacity? Can you configure Gatekeeper to load-shed correctly? The usual answer was No. This is true for smaller services, where investing in thorough load testing might not offer the best return on investment.

Auto-scaling these services would be easier. However, this approach doesn’t hold for larger services. Since we are on-prem, some services face a different problem: We might not have enough standby hardware to “double the pods and absorb the blow.” Additionally, depending on the incident, we may not want to pay for those extra cores. You can read more here if you’re curious about why we operate on-prem.

Let’s Redesign!

From various incidents, we’ve learned that Little’s Law is highly applicable to service overloads. Your service has a finite throughput; if more traffic arrives than it can handle, it will build up queues. In other words, the number of requests “in flight” (currently being handled) increases drastically.
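
Little’s Law ties these quantities together: the average number of requests in flight equals the arrival rate times the average time a request spends in the system. A quick worked example with made-up numbers:

```latex
L = \lambda W
\qquad\text{e.g.}\qquad
L = 100\ \tfrac{\text{req}}{\text{s}} \times 0.2\ \text{s} = 20\ \text{requests in flight}
```

If queuing pushes the average time per request from 0.2 s to 2 s at the same arrival rate, the in-flight count jumps tenfold, which is why in-flight requests are such a sharp overload signal.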

Every time our implicit traffic pattern in load testing didn’t hold in production, the in-flight component was able to detect it, while the requests/s prediction conflicted with the configured capacity (which was based on load testing with a different traffic pattern). This discrepancy prompted us to revisit all our configuration options. We decided to rely solely on the in-flight metric as the decider, along with tracking the success rate of the service at those inflight traffic levels. This approach significantly reduces the complexity of onboarding, as it “simply works.”

This marks a major departure from the status quo: a reactive model instead of a predictive one. Previously, the old Gatekeeper used its tight integration with the application to extrapolate the current traffic and resource utilization to a probable breaking point. It then preemptively rejected traffic when getting close to that prediction. This approach looks very clean, as the internal success rate (the success rate of the traffic accepted by Gatekeeper) never needs to drop. However, it relies on a few assumptions: it must be able to reach 100% utilization (or you need to know what 100% utilization of the application is), and in Kubernetes, what happens if you can burst beyond the cores you requested?

In a reactive model, the breaking point is determined by observing the results of API calls. Once we start seeing symptoms like timeouts, we identify the breaking point. This means the application must fail some requests and sustain momentary overload until the Gatekeeper deems it significant and starts rejecting traffic.

Another onboarding challenge is that the current implementation is specific to HTTP server middleware. We use Scala, Kotlin, and C# on our backends, making middleware reuse complex. However, we gained new integration options as we transitioned to Kubernetes. Instead of tightly coupling with the application, it could work through the service mesh. This approach only works with HTTP-based services, but it is an acceptable trade-off for us.

Reactive Capacity Determination

Our old Gatekeeper used three internal calculations to estimate peak capacity. Two of them were relatively straightforward (illustrated by the formulas after this list):

  1. “Requests per second” extrapolated the handled requests per second to 100% CPU utilization.
  2. “GC Time per request” used the Garbage Collection time and handled requests per second to extrapolate to the configured GC-time percentage.
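
As a hedged illustration (the old Gatekeeper’s exact formulas may differ), the two extrapolations amount to roughly:

```latex
\text{capacity}_{\text{CPU}} \approx \frac{\text{handled requests/s}}{\text{CPU utilization}}
\qquad
\text{capacity}_{\text{GC}} \approx \frac{\text{configured GC-time fraction}}{\text{GC time per request}}
```

The first scales current throughput linearly to full CPU; the second asks how many requests per second would exhaust the allowed share of wall-clock time spent in GC.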

These two calculations were excellent for generating a predictive signal, allowing us to preemptively reject low-priority requests to minimize the impact on high-priority requests. However, they relied on coupling with the server runtime to understand the health of the application. In an agnostic solution, this information may not be available.

The last one is Inflight. This rarely “predicted” anything because extrapolating inflight requests to a breaking point is challenging. It’s closely tied to the average latency of a service, which often isn’t linear with increasing loads. This method is reactive: it needs to observe higher loads to determine how the service behaves. The downside is that it will be slower to react, as it can’t predict when the service will likely fail.

The inflight capacity calculation has two main aspects: collecting success/failure data at a given inflight load (e.g., 5000 successes, one failure at five concurrent requests) and calculating when the failures become too frequent, indicating the need to start rejecting traffic. Since inflight will be the only method of capacity calculation, it must react quickly to prevent the service from being overloaded for too long and potentially failing.

We decided to group a small range of concurrency levels into each bin for this calculation. This gives us more data for a single bin without significantly missing the exact capacity point. Unless you have a resource constraint like database connections, capacity tends to fluctuate based on traffic, so this trade-off has few downsides for small bins. We then scan the bins from the lowest to the highest concurrency, computing each bin’s success rate. After accounting for standard error to increase confidence, we select the first one that fails to reach a 99% success rate.

A table showing inflight and success rate as columns. The inflight shows the bins starting with less than 8, and then incrementing by 2 per row. The success rate column shows over 99% success rate slowly decreasing with higher inflight. At the 5th row with 40–41 an arrow is pointing at the 98.90% success rate indicating this bin is selected by the algorithm.
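
A rough sketch of that scan (illustrative, not the real Gatekeeper code; the 99% target and confidence margin follow the description above, the rest is assumed):

```rust
// Concurrency levels are grouped into small bins; we pick the first bin whose
// success rate misses the 99% target even after adding a standard-error margin,
// i.e. a bin we are reasonably confident is actually failing.

struct InflightBin {
    max_inflight: u32, // upper edge of the concurrency range this bin covers
    successes: u64,
    failures: u64,
}

fn confidently_failing(bin: &InflightBin, target: f64, z: f64) -> bool {
    let n = (bin.successes + bin.failures) as f64;
    if n == 0.0 {
        return false; // no data, nothing to conclude yet
    }
    let p = bin.successes as f64 / n;    // observed success rate
    let se = (p * (1.0 - p) / n).sqrt(); // standard error of that rate
    p + z * se < target                  // even the optimistic bound misses the target
}

/// Scan bins from lowest to highest concurrency (the slice is assumed sorted)
/// and return the inflight level at which failures become too frequent.
fn capacity_estimate(bins: &[InflightBin]) -> Option<u32> {
    bins.iter()
        .find(|b| confidently_failing(b, 0.99, 1.96)) // ~95% confidence margin
        .map(|b| b.max_inflight)
}
```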

Now you might wonder: Why choose a failing bin? For compute-bound requests, the cost of a request is constantly changing. By selecting a bin that fails, we allow new “peak” data to enter the calculation, potentially learning that the capacity has improved and enabling more traffic to pass quickly. This approach balances success rate and peak performance. If we only chose “passing” bins, we might never discover that the service can do more (maybe caches got warm).

As previously mentioned, capacity changes constantly with traffic patterns. This means that the data we collect is only valid for a short period. To address this, we’ve implemented a time window for data collection. We only retain data from the last n minutes and use only the first y seconds to determine whether a bin has enough data to decide.

A table showing the internal data of the 8–9 inflight bin which has an overall 98% success rate.

The image above illustrates a smaller-scale representation of one inflight bucket. Each bucket is divided into bins that count the number of requests in the last 10 seconds. These bins also help determine whether an inflight bucket has “significant” enough data. For a “healthy” bucket, we require at least a minute (40 seconds in the sample) of data to exceed a fixed number of requests to confirm its stability.

We only examine failures in the last 20 seconds of data to react quickly to overloads. Focusing on recent data prevents the issue where a bucket with many requests requires an extended period of failures before the overall success rate drops below 99%.

This approach also helps us detect service stalls (like GC pauses) that would otherwise pollute the data. Because the significance check looks at recent data, a stall only causes rejections for a few seconds before being recognized as a momentary fault rather than a breaking point. To further eliminate stalls from the data, we also check individual bins against the overall success rate: if a specific bin deviates too much, it is considered bad and reset to zero.

These behaviors combined allow us to react within 10 seconds of failures, starting by examining recent data, backing off within 20 seconds if it’s “stalled” data, and still using a more extensive dataset to decide when the bucket has recovered fully on sustained failures.
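
Putting those behaviors together, a simplified sketch of one inflight bucket might look like this (time windows and thresholds are illustrative; the real implementation differs in detail):

```rust
// One inflight bucket holds a rolling window of 10-second time bins. Recent
// bins drive fast reactions, the full window decides significance and recovery,
// and bins that deviate wildly from the overall rate are treated as stalls.

#[derive(Default, Clone, Copy)]
struct TimeBin {
    successes: u64,
    failures: u64,
}

struct Bucket {
    bins: Vec<TimeBin>, // bins[0] is the most recent 10-second slice
}

impl Bucket {
    fn rate(bins: &[TimeBin]) -> f64 {
        let (s, f) = bins
            .iter()
            .fold((0u64, 0u64), |(s, f), b| (s + b.successes, f + b.failures));
        if s + f == 0 { 1.0 } else { s as f64 / (s + f) as f64 }
    }

    /// Success rate across the whole retained window (used for recovery decisions).
    fn overall_rate(&self) -> f64 {
        Self::rate(&self.bins)
    }

    /// Success rate over the last ~20 seconds, so even a large, healthy bucket
    /// reacts within seconds when failures start.
    fn recent_rate(&self) -> f64 {
        Self::rate(&self.bins[..self.bins.len().min(2)])
    }

    /// A bucket only counts as significant once enough bins carry a minimum
    /// number of requests (roughly the 40-second rule from the example above).
    fn is_significant(&self, min_requests_per_bin: u64, min_bins: usize) -> bool {
        self.bins
            .iter()
            .filter(|b| b.successes + b.failures >= min_requests_per_bin)
            .count()
            >= min_bins
    }

    /// Reset bins that deviate too far from the overall rate: most likely a
    /// momentary stall (e.g. a GC pause), not a real breaking point.
    fn drop_stalled_bins(&mut self, max_deviation: f64) {
        let overall = self.overall_rate();
        for bin in &mut self.bins {
            let total = bin.successes + bin.failures;
            if total == 0 {
                continue;
            }
            let bin_rate = bin.successes as f64 / total as f64;
            if (overall - bin_rate).abs() > max_deviation {
                *bin = TimeBin::default();
            }
        }
    }
}
```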

Integration With the Mesh

Initially, we explored an off-the-shelf centralized load-shedding mechanism. This system used a global view, relying on a moving average of our latency to determine whether the service was overloaded. However, this posed several challenges for our capacity planning. The way it detected overload was fairly simplistic, allowing for up to x% deviation from the moving average. But when we do a data center failover (to avoid potential issues from “dangerous” operational work), we expect our latency to go up (since we don’t scale out, utilization is higher, and thus we can’t parallelize the requests as much). This creates a balancing act: how aggressive should it be in detecting capacity issues without impacting our “business as usual”?

Additionally, there were other concerns. Different hardware generations have different breaking points, making the “holistic service view” a downside as some pods will fail earlier than others. While this product could be an improvement when you have nothing, it was a step down from our existing solution. Ultimately, we decided to adapt our existing per-server middleware implementation into a server technology-agnostic solution.

We use Istio as our service mesh implementation, which allows us to extend it with Envoy Filters. To enable the integration, we used the “External Processor” filter to call our new Gatekeeper sidecar, which is deployed alongside the service container. This filter provides a pre-request call to reject requests if needed and a post-call to collect statistics about the service’s performance. This gives us a new setup where Gatekeeper sits in the request path through the mesh rather than inside the application.
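
Conceptually, the External Processor filter gives the sidecar two hooks per request. A hypothetical Rust-flavoured sketch of that interface (the names are ours; the actual gRPC wiring to Envoy’s ext_proc API is omitted):

```rust
// Hypothetical interface for the two hooks the External Processor filter
// provides: a pre-request decision and a post-request observation. This is an
// illustration of the shape of the integration, not Envoy's or Agoda's real API.

use std::time::Duration;

enum Verdict {
    Accept,
    Reject { status: u16 }, // returned to the caller immediately, e.g. 429 or 503
}

struct RequestInfo {
    path: String,
    priority: Option<String>, // set by the caller; see the priority discussion below
}

struct ResponseInfo {
    status: u16,
    duration: Duration,
}

trait Gatekeeper {
    /// Pre-request call: decide whether to forward or reject before the request
    /// reaches the service container.
    fn on_request(&mut self, req: &RequestInfo) -> Verdict;

    /// Post-request call: feed the inflight/success-rate statistics once the
    /// response (or timeout) has been observed.
    fn on_response(&mut self, req: &RequestInfo, res: &ResponseInfo);
}
```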

This presents a challenge: our usual JVM languages aren’t ideal for reliably limiting overhead to 1–2ms latency while maintaining a small footprint, necessary for scaling cost-effectively to thousands of pods. For this reason, we chose Rust, a compiled language without GC pauses. This was a very good choice, providing low latency with minimal extra cost to service owners for this additional safeguard on their services.

Along with our changes to our deployment tooling, getting the “shadow mode” of Gatekeeper on your service became a single checkbox. This is a drastic improvement over the old process, which required weeks of load testing and parameter tuning to get the middleware to behave as expected.

Service-Agnostic Load Shedding

This new design brings new challenges. We aim to minimize potential pitfalls for service owners, making sure that everything we implement works universally or involves simple configuration changes, such as switching between shadow and active modes.

To enable this, we rely on the systems’ SLAs: the timeouts we’ve configured in the service mesh. We use these timeouts to determine whether a request is successful or fails due to capacity constraints. If a request can’t meet its latency requirements, it’s likely related to capacity issues.
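
As a small, hedged illustration of that classification (the exact statuses and rules depend on the deployment), a completed call could be counted as a capacity failure when it exhausts the mesh-configured timeout or returns an overload-style status:

```rust
use std::time::Duration;

/// Illustrative classification only: requests that blow the mesh-level timeout
/// (or come back with gateway-timeout/overloaded statuses) count as capacity
/// failures; ordinary application errors do not.
fn is_capacity_failure(status: u16, duration: Duration, route_timeout: Duration) -> bool {
    duration >= route_timeout || matches!(status, 503 | 504)
}
```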

The next challenge is prioritizing requests. The old Gatekeeper was tightly coupled with the service, allowing the service to determine priority based on its understanding of client behavior. This approach essentially involved guessing. In the new model, the caller must now set the priority since it is disconnected from the service. This solution was not always as straightforward as hoped.
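
For illustration, the caller could convey priority in a request header that the sidecar reads in its pre-request hook; the header name and levels below are hypothetical, not Agoda’s actual contract:

```rust
/// Hypothetical priority levels agreed between callers and Gatekeeper;
/// higher values are more important.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Priority {
    Background = 0,  // e.g. scrapers, prefetching, banners
    Interactive = 1, // normal user-facing traffic
    Critical = 2,    // e.g. booking flows
}

/// Parse a caller-supplied header such as "x-gk-priority: critical"
/// (header name assumed for this sketch). Unknown or missing values are
/// treated as least important, so forgetting the header never elevates traffic.
fn parse_priority(header: Option<&str>) -> Priority {
    match header {
        Some("critical") => Priority::Critical,
        Some("interactive") => Priority::Interactive,
        _ => Priority::Background,
    }
}

/// Shed everything strictly below the level the capacity logic currently demands.
fn accept(request: Priority, shed_below: Priority) -> bool {
    request >= shed_below
}
```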

The priority for a search can vary based on different perspectives and requires agreement on what we are optimizing for during overcapacity situations (success rate, bookings, user journey). While the code changes are simple, they need a high degree of alignment between business and tech to prioritize traffic correctly. This sometimes necessitates violating domain boundaries to be effective (e.g., should an API gateway mark critical priority for booking flows? Should it even be aware of that?). Above all, this is an unusual scenario that is not top of mind until it happens.

These discussions led to an interesting capacity management solution in Agoda called “Vulcan”. This system is being integrated with Gatekeeper and historical data to make optimized decisions for us. However, this is a story for another article.

We still have another challenge: our applications experience “stop the world” garbage collection. While GC does not impact the new Gatekeeper’s request timing, it will still observe a burst of failed requests from the GC pause. This impact is worsened because the new model cannot be fine-tuned per application. With the old model, the service owner could configure the minimum threshold so that the build-up during a pause would not reach the point where Gatekeeper started rejecting requests. This is one of the pitfalls we aim to avoid in the new model, and it is where the bucket-level time bins come into play, allowing us to quickly detect stalls and prevent Gatekeeper from reacting too aggressively.

Adoption and the Way Forward

With these improvements over the old Gatekeeper, we’ve already seen incidents where the new Gatekeeper handled overload effectively, maximizing our uptime with just one click to enable it.

Now, we are pushing for broader adoption throughout our services to learn more about where the new model excels and where it needs refinement. We expect the shedding algorithm will be the most likely area for improvement, but we need to collect data from incidents to determine whether it’s an issue or already sufficient.
