Preparing the Marketplace for Game Time

Whatnot Engineering
7 min read · Sep 12, 2024

Welcome back to the final post in our three-part series on how Whatnot supercharged its tech for the 2024 Super Bowl’s 0% commission pregame event! Here, we’re diving deep into our strategy for load testing and bottleneck remediation, crucial steps in ensuring our platform could handle the increased traffic during this major event.

Missed part 1 or part 2? Catch up here: Whatnot’s Data-Driven Approach to Scalability & Reliability for Big On-Platform Events and Expecting the Unexpected: Managing 3x traffic surges at Whatnot.

Rafael Gonzaque | Engineering

At Whatnot, delivering a seamless experience every day is at the core of what we do. This commitment becomes even more crucial during high-traffic events like Black Friday and the Super Bowl, where we see a surge in activity. We’re focused on making sure our marketplace empowers sellers to maximize these key moments while ensuring a smooth, standout experience for buyers — especially those new to Whatnot.

To meet these challenges, we leverage a systematic, data-driven approach to continuously enhance the reliability and performance of our platform. In this post, we share the four strategies we use to prepare for large-scale events and ensure the marketplace can meet the demand:

  1. Load testing to identify bottlenecks;
  2. Bottleneck remediation to increase our throughput and protect downstream components;
  3. Failover tactics to ensure critical paths remain available;
  4. Operational preparedness to build confidence by enhancing our observability and anticipating failure scenarios.

Load Testing

As our platform evolves — with new code, IO patterns, and updated component configurations — load testing gives us more than just the chance to identify bottlenecks; it’s also an opportunity to redefine the upper limits of a feature. The process starts by forming a hypothesis and defining two system states based on real-world variables, like the rate of users on the platform, the number of viewers joining a live show, and the total number of active live shows.

The first is the stable state, where the system operates under normal conditions with metrics and SLO thresholds performing as expected. The second is the target state, where we introduce specific variables to stress-test the system. By applying these variables in a controlled way, we can either prove or disprove our hypothesis. For the Super Bowl, our target was to double the peak traffic we’d seen in the last 12 months, pushing the limits of our platform to prepare for the event.
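
As a rough illustration, here’s a minimal sketch of how those two states might be parameterized. The field names and helper below are hypothetical rather than our production tooling; the only grounded figure is the 2x multiplier over the prior 12 months’ peak.

```python
from dataclasses import dataclass

# Illustrative only: these names are not our actual tooling. The one grounded
# number is the 2x multiplier over the prior 12 months' peak, which defined
# our Super Bowl target state.
@dataclass
class LoadProfile:
    users_per_second: float   # rate of users arriving on the platform
    viewers_per_show: int     # viewers joining a live show
    active_live_shows: int    # total number of active live shows

def target_state(stable_peak: LoadProfile, multiplier: float = 2.0) -> LoadProfile:
    """Scale the observed peak to define the target state for a load test."""
    return LoadProfile(
        users_per_second=stable_peak.users_per_second * multiplier,
        viewers_per_show=int(stable_peak.viewers_per_show * multiplier),
        active_live_shows=int(stable_peak.active_live_shows * multiplier),
    )
```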

Our Foundations team oversees platform and infrastructure engineering, while also supporting product engineering teams in a role similar to what some organizations might call “site reliability engineering” (SRE). They’ve developed a robust load testing infrastructure that empowers teams like mine — the Marketplace team — to simulate massive amounts of traffic based on real user actions, such as logging in, joining a live show, opening a shop, or placing a bid. This allows us to closely monitor how our backend systems perform under significant load.
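
To make this concrete, here’s a minimal sketch of what such a traffic scenario could look like. We’re using a Locust-style script purely as an illustration; the endpoints, task weights, and payloads are placeholders, not our actual API.

```python
from locust import HttpUser, task, between

class LiveShowViewer(HttpUser):
    """Simulated user driving the real-world actions described above.
    Endpoints and payloads are placeholders, not Whatnot's actual API."""
    wait_time = between(1, 5)  # think time between actions

    def on_start(self):
        # Log in once per simulated user.
        self.client.post("/api/login", json={"username": "loadtest", "password": "..."})

    @task(5)
    def join_live_show(self):
        self.client.get("/api/live_shows/123/join")

    @task(3)
    def open_shop(self):
        # Exercises the Shop query that later surfaced as the bottleneck.
        self.client.get("/api/live_shows/123/shop?page=1")

    @task(1)
    def place_bid(self):
        self.client.post("/api/live_shows/123/bids",
                         json={"listing_id": 456, "amount_cents": 1500})
```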

Once we provision a load test environment that mirrors our production system, we’re ready to begin. We always start by assessing the stable state before escalating to the target state. Throughout this process, operational dashboards are key in analyzing detailed findings.

In our tests, we discovered that throughput began to falter when we applied just 30% of our target load. The issue surfaced as errors in our Shop query, where requests took so long to process that they timed out. This query is responsible for fetching product listings during a live show, so identifying and resolving this bottleneck was critical to ensuring smooth performance.

Bottleneck Remediation

After identifying the bottleneck that prevented us from achieving our target throughput, we dove into solving the issue. Our first step was to review the operational metrics for the Shop query, which showed that most of the request time wasn’t being spent on the actual database query. In fact, the primary query’s execution time was just a fraction of the overall latency, indicating that something else was causing the slowdown.

We then reviewed the code to pinpoint the source of the delay. This led us to a call to Elasticsearch that was performing aggregations on every request. These aggregations power the filters users see to refine products in the Shop, but running them on every request was significantly impacting performance.
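
For readers unfamiliar with the pattern, this is roughly the shape of a search that fetches a page of listings and computes the facet counts in one round trip. The index and field names here are illustrative, not our actual schema.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# One request returns a page of listings *and* the aggregations that power
# the Filters UI. Recomputing those aggregations on every Shop request is
# what made the query expensive under load.
resp = es.search(
    index="listings",                        # illustrative index name
    query={"term": {"livestream_id": 123}},  # illustrative field name
    aggs={
        "categories": {"terms": {"field": "category", "size": 20}},
        "price_ranges": {
            "range": {
                "field": "price_cents",
                "ranges": [{"to": 1000}, {"from": 1000, "to": 5000}, {"from": 5000}],
            }
        },
    },
    size=20,
)
listings = resp["hits"]["hits"]
filter_options = resp["aggregations"]
```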

We realized that refreshing the Filters UI in the Livestream shop didn’t need to happen as frequently, so we initially planned to limit it to the page 1 request. However, implementing this proved challenging because our client was designed to render the Filters UI on every request. To make this change, we would have needed to release new client changes for both Android and iOS, then push users to upgrade quickly before the event — adding too much delay and friction.

So, we took a different approach. Since the bottleneck was caused by repeated I/O operations against Elasticsearch, the solution was to reduce the number of calls to those queries. We decided to cache the results on the server with a reasonable TTL (Time To Live). This allowed us to serve cached filter options while still maintaining the API contract with our clients. The TTL was carefully set: high enough to significantly improve latency, but low enough that any staleness in the filters was acceptable.

As we implemented the solution, we realized we could optimize further by caching the filter options for the entire Live Show each time we updated the cache. After rolling out these changes, we saw a 24% reduction in p95 Shop query latency for one query type and a 54% reduction for others, greatly improving overall performance.
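
Putting the two ideas together, here’s a minimal sketch of the caching layer, keyed by Live Show. The cache backend, helper names, and the 60-second TTL are illustrative rather than our production values.

```python
import time
from typing import Any, Callable

class TTLCache:
    """Tiny in-process TTL cache for illustration; a shared store such as
    Redis would follow the same get-or-compute pattern."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._entries: dict[Any, tuple[float, Any]] = {}

    def get_or_compute(self, key: Any, compute: Callable[[], Any]) -> Any:
        now = time.monotonic()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]                 # fresh enough: skip Elasticsearch entirely
        value = compute()                 # miss or stale: run the aggregations once
        self._entries[key] = (now, value)
        return value

def fetch_filter_aggregations(live_show_id: int) -> dict:
    """Placeholder for the Elasticsearch aggregation call shown earlier."""
    raise NotImplementedError

# Filter options are cached per Live Show, so every page of the Shop (and
# every viewer of that show) reuses one aggregation result until it expires.
filter_cache = TTLCache(ttl_seconds=60)  # illustrative TTL, not the production value

def shop_filter_options(live_show_id: int) -> dict:
    return filter_cache.get_or_compute(
        live_show_id,
        lambda: fetch_filter_aggregations(live_show_id),
    )
```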

During a previous large event (Black Friday), we saw a significant spike in listing creation, which almost overwhelmed our job queue. Thankfully, a quick-thinking on-call engineer was able to increase the number of workers just in time to avoid a backlog. To prevent this from happening again, we implemented a global rate limit on the ingress, tailored to the capacity of our job queue cluster. This safeguard ensures that the system can handle sudden surges in task processing without getting overwhelmed, allowing it to catch up at its own pace.
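
As a hedged sketch of that safeguard, a global token-bucket limiter sized to what the job-queue cluster can drain might look like the following. The rate, burst size, and names are illustrative rather than our actual configuration, and in practice the limit is enforced at the ingress layer.

```python
import threading
import time

class TokenBucket:
    """Global limiter for listing creation: admit work no faster than the
    job-queue cluster can drain it, instead of letting a backlog build up."""

    def __init__(self, rate_per_second: float, burst: int):
        self.rate = rate_per_second
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill tokens based on elapsed time, capped at the burst size.
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False

# Illustrative numbers: sized to the capacity of the job-queue cluster.
listing_creation_limiter = TokenBucket(rate_per_second=200, burst=500)

def enqueue_listing_creation(queue, job) -> bool:
    """`queue` stands in for whichever job-queue client is in use."""
    if not listing_creation_limiter.try_acquire():
        return False  # shed the request now; the queue catches up at its own pace
    queue.enqueue(job)
    return True
```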

Failover Tactics

In addition to load testing, we also incorporate chaos testing to prepare for unexpected failures. One scenario we tested was what would happen if the Live Service, an Elixir service that powers chat and broadcasts messages to other users watching the same Live Show, became unavailable. While these features aren’t in our highest priority tier, this service also acts as the source of truth for Live Show data. Since we use GraphQL for client communication, any disruption to the Live Service had a broader impact: it made the Shop unavailable, because the Shop query relied on a field resolver attached to the Live Show object fetched through the Live Service. Another challenge to tackle.

The Live Show object is eventually replicated into our Main Backend Python service, which also supports the GraphQL interface. This structure led us to explore using the replicated data model during fallback scenarios. To make this work, we developed a “shim” class, which utilized the replicated data from the monolith to provide reasonable default values to the GraphQL type’s field resolvers. This way, even if the Live Service became unavailable, we could still serve essential data and maintain a functional Shop experience.

Once we integrated the fallback shim to handle 5xx-level errors from the source of truth, it became a bit of a whack-a-mole situation. We had to carefully stub out fields that were expected to be non-null in our top 10 clients across both Android and iOS. After addressing those issues, we successfully created a degraded state that allowed users to continue joining Live Shows, opening the Shop, and engaging in commerce — all while minimizing disruption even during service outages!
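
Here’s a hedged sketch of the shape of that fallback, with hypothetical names and fields: the resolver tries the Live Service first and, on a 5xx-level failure, builds the Live Show from the replicated row, stubbing non-null fields with safe defaults.

```python
class UpstreamServerError(Exception):
    """Raised by the (hypothetical) Live Service client on 5xx responses."""

class LiveShowShim:
    """Fallback Live Show built from the row replicated into the Python monolith.
    Fields that only the Live Service knows are stubbed with safe defaults so
    non-null GraphQL fields keep resolving. Names here are illustrative."""

    def __init__(self, replicated_row: dict):
        self._row = replicated_row

    @property
    def id(self) -> int:
        return self._row["id"]

    @property
    def title(self) -> str:
        return self._row.get("title", "")

    @property
    def viewer_count(self) -> int:
        return 0      # real-time value lives in the Live Service; 0 is an acceptable stub

    @property
    def chat_enabled(self) -> bool:
        return False  # chat is unavailable anyway while the Live Service is down

def resolve_live_show(live_show_id: int, live_service, replica_db):
    """Resolver-level fallback: prefer the source of truth, degrade to the shim."""
    try:
        return live_service.get_live_show(live_show_id)  # Elixir Live Service (source of truth)
    except UpstreamServerError:
        row = replica_db.fetch_live_show(live_show_id)   # data replicated into the monolith
        return LiveShowShim(row)
```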

Operational Preparedness

With the introduction of new behaviors and controls, our final step was to thoroughly document everything we had implemented and learned in our runbook. For load shedding, we added a new section that categorized each major product area. Each group detailed (a) the specific configuration changes required, (b) the type of load reduction (i.e., which component would be impacted), and (c) a concise description of how these changes would affect the end user experience.

This documentation process also provided an opportunity to ensure that our dashboards offered the right insights into system behavior. We used this time to audit existing alerts and create new ones where necessary, making sure we had complete visibility into how the platform would perform during high-demand events.

Summary

The preparations we undertake ensure smooth operations during large-scale events and help us anticipate potential failures before they impact our end users. To summarize our approach: we use load testing to identify areas for improvement, address bottlenecks by either resolving the issues or mitigating their effects, and continuously enhance our operational documentation. This process underscores our commitment to ongoing improvement as we scale our systems, empowering Sellers to grow their businesses and create delightful experiences for Buyers.

If this sounds interesting to you, we’re hiring! Join us in building a world-class marketplace that is open, safe, and uniquely Whatnot.
