How We Solved the Thundering Herd Problem

At Braintree, a PayPal company, it is no secret that we are big users of Ruby on Rails (Rails for short). We are also big users of a component of Rails called ActiveJob, an API abstraction on top of existing background job frameworks such as Sidekiq, Shoryuken, and many more. In this blog post, I’m going to share how introducing jitter into ActiveJob stopped a persistent but difficult-to-diagnose background-process issue. Buckle up!

Context

We have many merchants using our Disputes API, some in real time in response to a webhook, and others on a daily schedule. That means our traffic is highly irregular and difficult to predict, which is why we try to use autoscaling and asynchronous processing where feasible.

The system architecture:

Disputes API Architecture
Diagram by Anthony Ross @ PayPal

The architecture is simplified for illustration purposes. The flow, though, is straightforward: merchants interact via SDKs with the Disputes API, and once a dispute is finalized, we enqueue a job to SQS for the submission step.

That queue is part of an autoscale policy that scales in and out based on the queue size. The job it performs differs slightly based on the processor where we deliver the dispute, but it follows a common flow:

1. Generate evidence — we grab everything relevant to a dispute that Braintree has generated and bundle it as metadata.

2. Compile evidence — merchants are allowed to submit evidence in many formats. We have to standardize it and prepend metadata from Step 1.

3. Submit to Processor Service over HTTP.
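The three steps above can be sketched as a single job. This is an illustrative outline, not our production code; the helper names and stubbed bodies are hypothetical stand-ins:

```ruby
# Illustrative sketch of the three-step submission flow.
# Helper names and return values are hypothetical.
class EvidenceSubmissionJob
  def perform(dispute)
    metadata = generate_evidence(dispute)          # Step 1: bundle Braintree's data
    package  = compile_evidence(dispute, metadata) # Step 2: standardize merchant files
    submit_to_processor(package)                   # Step 3: hand off over HTTP
  end

  private

  def generate_evidence(dispute)
    { dispute_id: dispute[:id], generated_at: Time.now }
  end

  def compile_evidence(dispute, metadata)
    # Standardize merchant-supplied formats, prepending the Step 1 metadata.
    [metadata, *dispute.fetch(:evidence, [])]
  end

  def submit_to_processor(package)
    # In production this is an HTTP call to the processor service; stubbed here.
    { status: :submitted, items: package.size }
  end
end
```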

The processor service is abstracted away because, as a Gateway-based service, we interact with many payment processors. The processor service takes real-time traffic over HTTP and then submits to the processor in batches. Every few hours, a cron task wakes up, searches for recent submission requests, batches them into a big zip file, and submits them via SFTP to one of our processors. It handles errors and successes and has an API for those various states.

The processor service is HTTP based with a simplified autoscale policy.

The Problem

In the job described above, we would sometimes see a spike of failures, some of which automatically recovered and others that didn’t, eventually pushing them into our dead-letter queue (DLQ). ActiveJob raises whenever an unhandled exception occurs, and we have a policy in place that DLQs the message in those instances. To do that with SQS, we simply don’t acknowledge it. We then have a monitor in Datadog that tracks the DLQ size and alerts us when that figure is > 0. While these errors are internal to our jobs service and not merchant facing, they were a problem we wanted to solve.
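The DLQ hand-off itself is plain SQS configuration: a redrive policy moves a message to the dead-letter queue once it has been received without acknowledgment too many times. A sketch of such a policy, with an illustrative ARN and count:

```json
{
  "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:disputes-submission-dlq",
  "maxReceiveCount": "5"
}
```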

To be clear, these failures did not affect customers or their API experience; they were ultimately recoverable, retryable internal process failures. That said, we take dropped processes like this very seriously. We need systems that are reliable, and that means jobs that can handle transient failures with grace. We had set up autoscaling for traffic spikes and robust retry logic honed over years of reliability improvements. Still, we would be alerted that our DLQs weren’t empty, and we’d see errors in them such as Faraday::TimeoutError, Faraday::ConnectionFailed, and various other Net::HTTP errors:

class EvidenceSubmissionJob < ApplicationJob
  retry_on(Faraday::TimeoutError, wait: :exponentially_longer)
  retry_on(Faraday::ConnectionFailed, wait: :exponentially_longer)

  def perform(dispute)
    ...
  end
end

Why couldn’t our jobs reach the processor service at certain points in the day? We had retries for these exact errors, so why didn’t they work? Once we double-checked that jobs were in fact retrying, we realized something else was going on.

When we looked at our traffic patterns, we could see a big spike right before the first errors rolled in. We figured it was a scaling issue; perhaps our scale-out policy was too slow. But that didn’t explain why the retries didn’t work. If we scaled out and retried, the jobs should eventually have succeeded:

Diagram by Anthony Ross @ PayPal

It wasn’t until we pieced together the timeline that we realized the culprit.

Enter the thundering herd problem. In a thundering herd, a great many processes (jobs, in our case) are queued in parallel, hit a common service, and trample it down. Our jobs would then retry on a static interval and trample it again. The cycle kept repeating until we exhausted our retries and the jobs were eventually DLQ’d.
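A toy illustration of why static intervals re-create the spike: every job that failed in the same burst computes the same retry times, so the herd arrives together on each round (the numbers below are arbitrary):

```ruby
FIXED_WAIT = 30 # seconds between retries (arbitrary illustrative value)

# 100 jobs fail together at t = 0 and each schedules three retries.
schedules = Array.new(100) { (1..3).map { |attempt| attempt * FIXED_WAIT } }

# Across all 100 jobs there are only three distinct retry instants,
# so the service is trampled three more times instead of once.
distinct_instants = schedules.flatten.uniq.sort
puts distinct_instants.inspect # => [30, 60, 90]
```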

thundering herd animation that degrades a service
Animation by Anthony Ross @ PayPal

While we had autoscale policies in place for this, our timing could improve. We would hammer the processor service until it eventually crashed. Then our jobs would go back into the queue to retry N times. The processor service would scale out, but some of our retry intervals were so long that it would inevitably scale back in before the jobs retried. 🤦

Scale-in and scale-out policies are a tradeoff of time and money. The faster a service scales in and out, the more cost-effective it is, but the tradeoff is that we can be under-provisioned for a period. This was unfortunate, and we could feel the architectural coupling of this entire flow.

We put the following plans in place:

1. Stop the bleeding

2. Break the coupling

Stop the bleeding

While we were using exponential backoff, it doesn’t by itself stop the thundering herd problem. What we needed was to introduce randomness into the retry interval so that future jobs are staggered. ActiveJob did not have a jitter argument at the time, so I suggested one via a small PR to Rails. We implemented the change locally and immediately saw that our DLQ monitors stopped turning red.

thundering herd solved with jitter introduced
Animation by Anthony Ross @ PayPal

Jitter is explained well by Marc Brooker from AWS. The gist is that if you have 100 people run to a doorway, the doorway might come crashing down. If instead everyone ran at different speeds and arrived at random intervals, the doorway is still usable and the queue pressure is significantly lessened. At least, that’s how I explained it to my kids (except I told them they’re still not allowed to run in the house).
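Brooker’s post compares several strategies; “full jitter” picks the entire wait uniformly at random from the exponential window, rather than adding a little noise to a deterministic value. A sketch, with illustrative base and cap:

```ruby
# "Full jitter": wait = random(0, min(cap, base * 2**attempt)).
def full_jitter_wait(attempt, base: 1.0, cap: 300.0)
  rand * [cap, base * 2**attempt].min
end

# Deterministic exponential backoff, for comparison: every job that
# failed together computes exactly this same wait.
def plain_wait(attempt, base: 1.0, cap: 300.0)
  [cap, base * 2**attempt].min
end
```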

Break the coupling

These two systems were too tightly coupled. The processor service did not need to be an independent vertical service, as we initially thought. Hindsight is always 20/20, but the simpler the architecture, the better our team can handle the operational burden of our services. We moved the bulk of the business logic into the Disputes API background jobs, and now we can scale those out efficiently.

final architecture showing no processor service
Diagram by Anthony Ross @ PayPal

The code for the animations can be found here.
