Expecting the unexpected: Managing 3x traffic surges at Whatnot

Whatnot Engineering
Jul 17, 2024

Welcome back to our three-part series on how Whatnot supercharged its tech for the 2024 Super Bowl’s 0% commission pregame event! In this second installment, we’re peeling back the layers on how we handled a tidal wave of traffic without breaking a sweat. Our goal? Keep things smooth and seamless for our buyers and sellers, even when the unexpected hits. Let’s dive right in and explore the clever strategies and robust protections we put in place to maintain top-notch user experience.

Missed part 1? Catch up here: Whatnot’s Data-Driven Approach to Scalability & Reliability for Big On-Platform Events.

Karol Gil | Engineering

Handling massive events on Whatnot always brings surprises. We brace for more livestreams, users, bids, and purchases, but reality often throws in extra twists.

During our zero-commission Super Bowl weekend promo, we knew traffic would at least double. To stay ahead of potential hiccups, we introduced load shedding, a form of traffic management, in our Main Backend service. This proactive measure detects and mitigates unforeseen issues, ensuring users have a stellar experience.

In this post we’ll discuss why local rate limiting wasn’t enough for the Main Backend, how we compared different load-shedding strategies, how we built and stress-tested a distributed concurrency limiter in three weeks, and how it held up during the Super Bowl and beyond.

Let’s dive in!

Architectural Foundations

Our system is composed of two main services: the Main Backend (Python, serving REST and GraphQL APIs) and the Live Service (Elixir, handling real-time traffic).

While the Live Service had existing overload protections (limits on socket connections and channel joins), the Main Backend didn’t. Given the different traffic patterns and implementation languages, we couldn’t reuse mechanisms from the Live Service. We needed a new strategy for unexpected traffic spikes hitting the Main Backend’s endpoints.

Beyond local limits: The case for distributed traffic management

Local, in-memory traffic management would be ideal, but it isn’t feasible for our Main Backend because its deployment consists of thousands of tiny Kubernetes pods. Locally, each pod may see misleadingly low request rates while the system-wide rate skyrockets.

We needed non-durable storage with extremely low latency, able to handle thousands of requests per second. Enter Redis, our trusted ally, which we already used for distributed caching, rate limiting, and leases.

Choosing our battle: Rate, cost, and concurrency limiting

Our primary mission was to manage excessive load hitting our Python service. We considered several approaches:

Sliding window rate limiter

  • Pro: Very simple to implement and reason about.
    We allow only N requests every M seconds (a rough sketch follows this list).
  • Con: Doesn’t handle bursty traffic well.
    Imagine a seller running a giveaway of an expensive card that gets shared across social networks and brings in a spike of users into the platform.
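
For concreteness, here’s a minimal sketch of one common way to build such a limiter on top of a Redis sorted set; the key names and client setup are illustrative assumptions, not our production code.

```python
import time
import uuid

import redis

r = redis.Redis()  # assumes a reachable Redis instance


def allow_request(spec: str, limit: int, window_s: int) -> bool:
    """Allow at most `limit` requests per `window_s` seconds for a request spec."""
    now = time.time()
    key = f"rate:{spec}"
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_s)  # drop entries outside the window
    pipe.zadd(key, {str(uuid.uuid4()): now})       # record this request
    pipe.zcard(key)                                # count requests still in the window
    pipe.expire(key, window_s)                     # let idle keys expire on their own
    _, _, count, _ = pipe.execute()
    return count <= limit
```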

Leaky bucket rate limiter

  • Pro: Handles bursts of traffic smoothly.
    We allow N requests every M seconds on average, while tolerating short, finite spikes.
  • Con: Doesn’t consider the cost of requests.
    Once, a seller ran a livestream with an abnormally high number of products in their shop, resulting in clients flooding the Main Backend with expensive, long-running requests whenever a new user joined. We’d like to allow bursts of cheap requests, but not necessarily of expensive ones.

Cost limiter

  • Pro: Rejects more requests if their cost is high.
    We allow requests to consume only N seconds of execution time in total within every M-second window.
  • Con: Works poorly for very slow requests.
    The cost calculation can realistically only be done at the end of a request, as tracking the execution time of all in-flight requests would put a strain on the underlying storage. If thousands of slow requests enter the system at once and don’t return for a few seconds, the cost limiter keeps letting new requests flow through.

We settled on a distributed concurrency limiter, illustrated by the diagram below, that increases a concurrency score as requests enter and decreases it as they leave, throttling requests exceeding the concurrency limit.

This deceptively simple idea solves our problems in an elegant and resource-efficient way:

  • Pro: Simple to reason about.
    The concurrency score tracks how many requests are currently in flight.
  • Pro: Handles bursts of traffic well.
    We’re able to handle bursts of traffic as long as requests are fast.
  • Pro: Throttles expensive requests more aggressively.
    Slow requests that don’t leave the system fast enough are throttled.
  • Con: Complex implementation.
    The added complexity is a better trade-off than a simpler solution that doesn’t meet product requirements.

The Race Against Time

In line with our principles of setting crazy goals and moving uncomfortably fast, we set out to implement the concurrency limiter three weeks before the Super Bowl.

Countdown to kickoff: Pre-Super Bowl implementation insights

We decided to use a request specification as the unit to limit. In our case it can be one of two things:

  • A single HTTP request sent to a REST endpoint, for example, POST /api/v2/login.
  • A set of GraphQL queries sent in a single HTTP request — all such requests have an identifier in our system that we can use, for example, POST /graphql?id=livestream.

The most straightforward way to implement a concurrency limiter would be to keep a Redis counter for each request specification: increment it when a request enters the system, decrement it when the request leaves. Unfortunately, this isn’t viable in a real-world distributed system, because servers can be terminated before they get to send decrement commands to Redis, which would result in ever-increasing counters.
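
In code, that naive version would look roughly like the sketch below (names are hypothetical). It also makes the failure mode obvious: if the pod dies before the `finally` block runs, the counter never comes back down.

```python
import redis

r = redis.Redis()

CONCURRENCY_LIMITS = {"POST /graphql?id=livestream": 100}  # hypothetical limits


class TooManyRequests(Exception):
    """Raised when a request specification exceeds its concurrency limit."""


def limited(spec: str, handler, *args):
    key = f"concurrency:{spec}"
    if r.incr(key) > CONCURRENCY_LIMITS.get(spec, float("inf")):
        r.decr(key)            # undo our own increment before rejecting
        raise TooManyRequests(spec)
    try:
        return handler(*args)  # run the actual endpoint handler
    finally:
        r.decr(key)            # never runs if the pod is killed mid-request
```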

We chose an approximate concurrency calculation with a moving time window. Each minute has its own concurrency counter, and the concurrency score combines the current and previous windows, with the previous window weighted down proportionally to the time elapsed since it ended.

A picture says more than a thousand words, so let’s use a visual example of a Feed request, for which we set the concurrency limit to 3:

It’s worth noting that the concurrency of the previous window can still be altered after its completion, as seen in the example above when response #2 is sent while another window is already in progress.
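
Putting the window arithmetic into code, a rough sketch of the approach could look like this; the key names and helpers are illustrative, not our actual middleware.

```python
import time
from typing import Optional

import redis

r = redis.Redis()
WINDOW_S = 60  # one-minute windows


def window_key(spec: str, ts: float) -> str:
    return f"concurrency:{spec}:{int(ts // WINDOW_S)}"


def enter(spec: str, limit: float) -> Optional[str]:
    """Record an in-flight request; return the key to decrement on exit, or None if throttled."""
    now = time.time()
    current_key = window_key(spec, now)
    previous_key = window_key(spec, now - WINDOW_S)

    pipe = r.pipeline()
    pipe.incr(current_key)
    pipe.expire(current_key, 2 * WINDOW_S)  # counters vanish after two window lengths
    pipe.get(previous_key)
    current, _, previous = pipe.execute()

    # Weight the previous window less as more of the current window elapses.
    elapsed = now % WINDOW_S
    score = current + int(previous or 0) * (WINDOW_S - elapsed) / WINDOW_S

    if score > limit:
        r.decr(current_key)  # roll back our increment before rejecting
        return None
    return current_key


def leave(key: str) -> None:
    r.decr(key)  # decrements the window the request entered in, even if it has already closed
```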

This approach allowed us to:

  • Remain memory efficient: We’re using two counters per request specification at any given time.
  • Avoid ever-increasing concurrency: In case of fatal failures before or during counter decrements, the stale score disappears as soon as its window stops being taken into account, i.e. after at most two window lengths.
  • Keep latency overhead low: Incrementing the counters and calculating the score under a 10k QPS load takes 1 ms at the median and 7 ms at p99.

We’re able to set a limit and window size per request specification. To retain the ability to tweak limits during incidents and large events, the configuration is stored in a remote SaaS-backed configuration service, with in-memory overrides allowed to ensure the platform stays fully operational even when the SaaS vendor is down.
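
In practice the lookup order is simple: an in-memory override wins, then the remote value, then a default. A rough sketch, where the remote client and key names are hypothetical:

```python
# Set by operators during incidents; takes precedence over the SaaS-backed config service.
LOCAL_OVERRIDES = {"POST /graphql?id=livestream": 150}
DEFAULT_LIMIT = 50


def concurrency_limit(spec: str, remote_config) -> int:
    if spec in LOCAL_OVERRIDES:
        return LOCAL_OVERRIDES[spec]
    try:
        return int(remote_config.get(f"concurrency_limit:{spec}", DEFAULT_LIMIT))
    except Exception:
        return DEFAULT_LIMIT  # fall back if the config vendor is unreachable
```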

You might be wondering how the initial limits were determined, as it’s pretty hard to guess proper limits for hundreds of different requests. We rolled out the concurrency limiter in a dry-run mode where it collected concurrency scores but did not throttle any requests. We then took the per-request concurrency scores from a busy weekend’s peak hours and set their 99th percentile as the initial limit for each request.
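
Deriving those starting points is then a small post-processing step over the exported dry-run data. Roughly, with made-up numbers:

```python
import math

# Hypothetical dry-run export: peak-hour concurrency scores per request specification.
dry_run_scores = {
    "POST /graphql?id=feed": [1, 2, 2, 3, 5, 4, 2, 6],
    "POST /api/v2/login": [1, 1, 2, 1, 3, 2, 1, 2],
}


def p99(samples):
    """99th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    return ordered[math.ceil(0.99 * len(ordered)) - 1]


initial_limits = {spec: p99(scores) for spec, scores in dry_run_scores.items()}
```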

Fortifying the frontlines: Stress testing

Our Elixir-based resiliency framework simulated tens of thousands of user scenarios to stress test the limiter. This revealed key insights and led to crucial adjustments:

  • Client-side connection pooling: Our initial test ended swiftly as thousands of requests flowing through the middleware exhausted Redis’ server-side connection pool, causing it to stop accepting requests.
    We introduced connection pooling on the client side and idle connection timeouts on the server side to avoid that.
  • Fail-open strategy: The subsequent test ran smoothly, but we observed a significant dip in our core product SLIs. Most of the errors contributing to it came from two concurrency limiter failure modes: an exhausted client-side connection pool and timeouts when sending requests to Redis.
    We decided to fail open on any issue in communicating with Redis or calculating the score. This is especially important because Redis is a single-node deployment, and in case of a failover we must ensure that clients’ communication with the Main Backend is allowed rather than completely blocked.
  • Minimal timeouts: In the next test we observed a large increase in tail latencies, on the order of seconds. It turned out that under very high load Redis latency was spiking and negatively impacting user experience.
    We reduced the client-side Redis timeouts to 250 ms. This slightly increased the concurrency limiter’s error rate, but combined with the fail-open strategy it handles more requests without hurting user-visible response times. A rough sketch combining these adjustments follows this list.
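
Put together, the client-side settings and the fail-open wrapper look roughly like this; the pool size and helper names are assumptions, while the 250 ms timeout is the value mentioned above.

```python
import logging

import redis

log = logging.getLogger("concurrency_limiter")

# Client-side connection pool plus aggressive timeouts.
pool = redis.ConnectionPool(
    host="limiter-redis",
    port=6379,
    max_connections=50,           # hypothetical pool size
    socket_timeout=0.25,          # 250 ms budget for any single Redis command
    socket_connect_timeout=0.25,
)
r = redis.Redis(connection_pool=pool)


def score_or_none(spec: str):
    """Fail open: any limiter-side error means the request is allowed through."""
    try:
        return compute_score(spec)  # the window arithmetic sketched earlier (hypothetical helper)
    except redis.RedisError as exc:
        log.warning("Concurrency limiter failed open for %s: %s", spec, exc)
        return None                 # callers treat None as "do not throttle"
```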

With these tweaks in place, we were able to run a very large stress test and eventually overload both the concurrency limiter and the Main Backend along with its database, but it took amounts of traffic that we don’t expect to see in the foreseeable future.

Fairness in the fray: IP-based concurrency limiting

Stress testing also highlighted the potential for a few users to monopolize the system. When we joined test livestreams while stress tests were running, to see what the user experience looked like, we saw our own requests throttled. This led us to question the fairness of letting simulated users take over the system’s resources and hinder everyone else. Simulated users could just as well be bots trying to scrape our website, and we don’t want that to impact real users’ experience.

We decided to introduce additional concurrency limits per user, using IP addresses as user identifiers, a method that works for both authenticated and unauthenticated users. Since the concurrency limiter’s implementation is rather simple, the only thing we had to do was increment two different counters (one for the request specification and one for the IP address) before the request starts, checking both against their limits. As we keep only one counter per IP in each time window, the solution remains memory efficient even for a large number of concurrent users.
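
Concretely, admission now requires both checks to pass. Reusing the `enter`/`leave` helpers from the earlier sketch (still illustrative, not production code):

```python
def enter_with_ip(spec: str, ip: str, spec_limit: int, ip_limit: int):
    """Admit a request only if both the request-spec and per-IP scores are under their limits."""
    spec_key = enter(spec, spec_limit)    # per-request-specification counter, as before
    if spec_key is None:
        return None                       # throttled on the endpoint-wide limit
    ip_key = enter(f"ip:{ip}", ip_limit)  # one extra counter per IP per window
    if ip_key is None:
        leave(spec_key)                   # roll back the first increment
        return None                       # throttled on the per-IP limit
    return spec_key, ip_key               # decrement both when the response is sent
```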

Let’s take a look at an example with the Feed request’s concurrency limited to 3 while the per-IP concurrency limit is set to 2:

This approach proved to be highly effective. We ran the same test as before, but this time test users could send only a few requests at a time, while we enjoyed watching test livestreams without interruption. Naturally, for stress testing purposes, we now disable IP-based limits for simulated users.

Trial by fire

Super Bowl surge: The first real test

We anticipated a significant increase in traffic during the Super Bowl weekend. The actual surge surpassed our expectations: at peak we nearly tripled our regular load! Despite this, the concurrency limiter remained largely inactive. Throttling was mostly limited to individual requests, plus a few minutes when we saw a slight increase in requests related to shipping estimates, which the concurrency limiter throttled as expected:

The event went very smoothly: other preemptive measures, such as adding PostgreSQL read replicas and scaling up our services, effectively handled the increased load. We were still certain that our efforts weren’t in vain and that the concurrency limiter would prove its value at some point; we just didn’t know when.

Unexpected gameplay: Should we mention ‘The Game Show’?

The answer came a few weeks later. Our first Whatnot Game Show on March 15th showcased our newest feature, Trivia. During the livestream there was a short period, right after a user won a large giveaway, when other users started cheering them on by writing @username in chat. Each mention prompted every connected client to resolve the mentioned username via a GraphQL request, creating a cascading effect where a single chat message resulted in thousands of requests. Below you’ll find the total number of calls received by the Main Backend, compared to the traffic from the week before:

This single query almost tripled our total request volume! The resulting SQL query could have overwhelmed our PostgreSQL instance and inflated system response times. It would have, if not for the concurrency limiter:

As you can see, the concurrency limiter throttled requests extremely well. When the latency of this particular request started increasing, we throttled more aggressively until the peak passed and traffic returned to normal.

We have since implemented additional safety mechanisms on both the client and server sides to prevent similar situations. Still, this proved that the concurrency limiter is very effective at detecting and dealing with unexpected load patterns that could have taken the platform down.

Next steps in traffic management

We’re considering future improvements, including:

  • Device ID-based Limits: To avoid disadvantages for users sharing IPs.
  • Request prioritization: To be able to discard stale and duplicate requests to reduce load.
  • Decrementing after a response is sent: Further minimizing latency of load shedding.

Reflecting on our journey

Through careful design, existing tools, and ruthless prioritization, we built an effective concurrency limiter. It’s already uncovering client-side issues and enhancing our platform’s resilience. Traffic management is an ongoing challenge, and we’re ready for more innovative solutions ahead.

Excited by what you’ve read? Want to join us in building reliable systems? We’re hiring!

See you soon? 😉
