Whatnot’s Data-Driven Approach to Scalability & Reliability for Big On-Platform Events

Whatnot Engineering
9 min read · Jun 18, 2024

This is the first post in our three-part series on how Whatnot built the tech infrastructure for a 0% commission pregame event during the 2024 Super Bowl on 2/11/2024, a sporting event that many of our users watch. This initial post sets the scene for how we approached engineering going into the event, focusing on the challenges of handling real-time demands and massive scale in livestream shopping. In the subsequent posts, we’ll break down different aspects of our tech journey and explain the tools and methods we used. Join us as we pull back the curtain on how we made sure everything ran smoothly for our buyers and sellers.

Andrey Kravtsov | Engineering

As our user base and platform complexity grow, the impact of early design decisions on reliability becomes more apparent. Implementation choices that initially seemed viable can lead to systemic issues, affecting user experience and creating challenges for ongoing feature development. Addressing these limitations is crucial to ensuring a seamless and reliable experience for our users.

At Whatnot, this was especially obvious during large-scale events like Black Friday and the Super Bowl, which bring a substantial surge of activity. At the same time, the number of users and sellers has continued to grow, creating stronger demand for a frictionless and reliable experience no matter how busy our platform gets.

To tackle these challenges head-on, we adopted a systematic, data-driven approach to improving reliability. In this post, we share five concrete solutions that help us manage these challenges not only during peak hours but also during large-scale events:

  • Managing Deprecated APIs and Libraries
  • Taming Slow Database SQL Queries
  • Migrating from Sync to Async IO
  • Optimizing GraphQL Resolver Fanout Performance
  • Optimizing Resource Consumption

Managing Deprecated APIs and Libraries

One of our focused areas of investment is the introduction of standardized APIs for our teams to use that come with automatic observability, circuit breaking, and rate limiting, among other key features. For example, this includes thin layers over third-party IO libraries, e.g. requests, that enable us to more easily observe degraded performance and mitigate an outage.
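To make this concrete, here is a minimal sketch of what such a thin wrapper over requests might look like. The names, default timeout, and logging-based reporting are illustrative rather than our actual implementation; circuit breaking and rate limiting would hook into this same layer.

```python
import logging
import time

import requests

logger = logging.getLogger("http_client")

# Illustrative default enforced by the wrapper rather than by each call site.
DEFAULT_TIMEOUT_S = 2.0


def instrumented_get(url: str, *, timeout: float = DEFAULT_TIMEOUT_S, **kwargs) -> requests.Response:
    """Thin wrapper around requests.get that adds a default timeout plus timing
    and error observability for every outbound call."""
    start = time.monotonic()
    try:
        return requests.get(url, timeout=timeout, **kwargs)
    except requests.RequestException:
        logger.exception("HTTP GET failed: %s", url)
        raise
    finally:
        # In production this would be emitted as a metric tagged by destination.
        logger.info("HTTP GET %s took %.3fs", url, time.monotonic() - start)
```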

While we did our best to educate everyone about the change, it was still too easy for engineers to accidentally continue using deprecated solutions. This was especially likely when the migration to the new implementation was not yet complete and colleagues were replicating an existing solution.

As our primary platform is written in Python, we relied on ruff’s ‘banned-api’ lint rule for the most obvious use cases, such as using the requests library directly instead of our internal wrappers. The linter made it easy to flag specific packages and functions whose use we wanted to discourage. When someone attempts to use such an API, the lint failure blocks the pull request. For example, if a developer imported the requests library to make an uninstrumented HTTP call to another service, the linter would warn that this is not recommended.
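For illustration, here is a call site the rule would flag, assuming requests has been added to the banned list (TID251 is ruff’s rule code for banned-api; the suppression comment at the end is the kind of marker the review automation described below would pick up on):

```python
# With `requests` on the banned list, ruff's banned-api rule (TID251) flags
# this import and fails the lint check on the pull request.
import requests

resp = requests.get("https://example.com", timeout=2)  # uninstrumented call
print(resp.status_code)

# Suppressing the rule is possible, but the suppression itself is visible:
#   import requests  # noqa: TID251
```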

This lint failure prevents the individual from landing the change in the project repository. If the person suppresses the lint failure, a GitHub Action script automatically blocks the PR from merging and adds additional reviewers to discuss the validity of the use case.

By using the ‘banned-api’ rule and grandfathering existing violations behind lint-ignore comments, we were able to effectively stop the proliferation of undesirable practices while providing teams with a manageable path to fixing existing issues. As teams addressed the linter-ignored call sites, we saw a steady decrease in the number of disabled lint checks, indicating a positive trend toward better coding practices and improved reliability with fewer uninstrumented APIs.

Taming Slow SQL Queries: Iterative Optimization with Instrumentation and Watermarks

Whatnot relies heavily on Aurora RDS, and we started to discover many call sites across our platform where queries performed poorly but were not subject to reasonable timeouts. With a large timeout value against the database, even one inefficient query can lead to a severe outage: as requests come in under high load and get stuck running slow queries, they quickly exhaust the shared pool of database connections, locking up the system. Setting a low default timeout (e.g. 500ms) would make such a scenario much less likely, but it would also risk breaking the user experience in places that already had slow database queries. Our approach was to build observability to discover where queries were running slow and to warn everyone before we dropped the 🔨 on the timeout value across the board.

To set the query timeout we used Postgres statement_timeout, which ensured the timeout was enforced on the RDS side while still giving us control to set custom timeouts where needed. To generate system-wide slow-query warnings we used SQLAlchemy event callbacks: our instrumentation used the before_cursor_execute and after_cursor_execute events to report both the query and the calling stack. This made it simpler for engineers to observe where slow queries were taking place as well as their frequency. We then iterated, lowering both the warning and the timeout thresholds in tandem.
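Here is a minimal sketch of that kind of instrumentation. The connection URL and thresholds are placeholders, and this illustrates the pattern rather than our production code:

```python
import logging
import time
import traceback

from sqlalchemy import create_engine, event
from sqlalchemy.engine import Engine

logger = logging.getLogger("slow_queries")

SLOW_QUERY_WARNING_SECONDS = 1.0  # illustrative threshold

# statement_timeout is enforced server-side by Postgres (value in milliseconds).
engine = create_engine(
    "postgresql+psycopg2://user:password@aurora-host/app",  # placeholder URL
    connect_args={"options": "-c statement_timeout=2000"},
)


@event.listens_for(Engine, "before_cursor_execute")
def _start_timer(conn, cursor, statement, parameters, context, executemany):
    conn.info.setdefault("query_start_time", []).append(time.monotonic())


@event.listens_for(Engine, "after_cursor_execute")
def _report_slow_query(conn, cursor, statement, parameters, context, executemany):
    elapsed = time.monotonic() - conn.info["query_start_time"].pop()
    if elapsed >= SLOW_QUERY_WARNING_SECONDS:
        # Include the Python call stack so engineers can find the offending call site.
        logger.warning(
            "Slow query (%.2fs): %s\n%s",
            elapsed,
            statement,
            "".join(traceback.format_stack(limit=15)),
        )
```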

For example, we started with a timeout of 30 seconds and a warning at 15 seconds. We provided developers with a dashboard of the slow queries (see image below), gave them some time to address the findings, and then lowered the pair of values again, each time setting the timeout to the previous warning threshold and moving the warning further down: (30, 15), (15, 10), (10, 5), (5, 2), and so on. At the time of this screenshot, we were at a warning of 1 second and a timeout of 2 seconds.

After multiple iterations, we arrived at a timeout value that maintained a good user experience while protecting the system from the introduction of inefficient queries. In the process, we revealed queries that were not tuned with appropriate indices as well as queries better suited to other systems at our disposal; for example, some queries performed analytics-type aggregations over large sets of rows where freshness was not critical and could instead be calculated offline in our data warehouse.

Migrating to Asyncio: Tackling the Challenge of Blocking Code

To support higher request concurrency, we began migrating our Python backend to asyncio. As we gradually migrated the codebase, we encountered a common challenge: developers inadvertently introduced synchronous code that blocked the asyncio event loop. When this happens, the application can suddenly become unresponsive, impacting requests in the same process that are completely unrelated to the one where synchronous code blocks the loop. Detecting such issues syntactically is not straightforward, so we decided to adopt several runtime techniques to address this problem.

Our first approach was to check for the presence of an asyncio loop at the call sites of known blocking IO implementations. This allowed us to identify and flag instances where synchronous code was potentially blocking the loop. However, this method had some limitations, as it could not catch all cases of blocking code.
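A minimal sketch of that first check, under the assumption that known blocking helpers are wrapped in instrumented functions (the function names here are illustrative): asyncio.get_running_loop() raises RuntimeError when no loop is running in the current thread, so its success signals that the caller is on the event loop.

```python
import asyncio
import logging

logger = logging.getLogger("blocking_io")


def warn_if_called_from_event_loop(operation: str) -> None:
    """Call at the top of known blocking IO helpers to flag loop-blocking use."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return  # No running loop in this thread; blocking IO is fine here.
    logger.warning("Blocking operation %r invoked from the asyncio event loop", operation)


def read_file_blocking(path: str) -> bytes:
    # Illustrative wrapper around a blocking call, instrumented with the check above.
    warn_if_called_from_event_loop(f"read_file_blocking({path!r})")
    with open(path, "rb") as f:
        return f.read()
```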

To create a more comprehensive detection mechanism, we implemented a home-grown monitoring solution that runs on a separate thread. It keeps dispatching no-op tasks onto the loop and uses that separate thread to verify that they are still executing. When these tasks get stuck, the loop is likely blocked. At that moment, the monitor takes a stack dump of the loop thread (subject to sampling to avoid constant overhead), allowing it to reveal common instances of non-async code blocking the loop.
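Here is a sketch of a watchdog in that spirit. The probe interval, blocked threshold, and the absence of sampling are simplifications of what a production monitor would need:

```python
import asyncio
import concurrent.futures
import logging
import sys
import threading
import time
import traceback

logger = logging.getLogger("loop_watchdog")

PROBE_INTERVAL_S = 1.0     # how often to probe the loop (illustrative)
BLOCKED_THRESHOLD_S = 0.5  # how long a no-op may take before we call the loop blocked


def start_loop_watchdog(loop: asyncio.AbstractEventLoop, loop_thread_id: int) -> None:
    """Probe the event loop with no-op tasks from a separate thread and dump the
    loop thread's stack when a probe fails to complete in time."""

    def _watch() -> None:
        while True:
            # Schedule a no-op coroutine onto the loop; if synchronous code is
            # blocking the loop, this future will not complete in time.
            probe = asyncio.run_coroutine_threadsafe(asyncio.sleep(0), loop)
            try:
                probe.result(timeout=BLOCKED_THRESHOLD_S)
            except concurrent.futures.TimeoutError:
                frame = sys._current_frames().get(loop_thread_id)
                if frame is not None:
                    logger.warning(
                        "Event loop appears blocked:\n%s",
                        "".join(traceback.format_stack(frame)),
                    )
            time.sleep(PROBE_INTERVAL_S)

    threading.Thread(target=_watch, name="loop-watchdog", daemon=True).start()


# Usage, from the thread that runs the loop:
#   start_loop_watchdog(asyncio.get_running_loop(), threading.get_ident())
```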

By automating this process and building a dashboard with a set of metrics, we made it easier for engineers to spot and address blocking code quickly, often during or soon after the deployments that introduced it, aided by related alarms.

Solving IO Fanouts: Detecting and Optimizing GraphQL Resolver Performance

In our GraphQL-based request handling, we encountered another common challenge: IO fan-outs, also known as the N+1 problem. IO fan-outs happen when back-end logic repeatedly fetches data from external systems, e.g. RDS, instead of efficiently batching those requests. Sometimes this can lead to many thousands of calls where a single batched one would do. As we rely heavily on GraphQL, processing collections in responses often involves repeated fetches of data that could have been batched, leading to inefficient IO operations. The problem manifests when a seemingly innocuous GraphQL request for a list of entities suddenly makes thousands of SQL or other API calls, repeatedly looking up the same type of value per entity instead of making a single batched call.

To address this issue, we adopted the Python library aiodataloader, which intercepts and batches such data lookups. However, even with this solution at our disposal, we found it was not always easy for developers to recognize when they should use it. This indicated that we needed a more proactive approach to detect these inefficiencies.
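For reference, the dataloader pattern looks roughly like this; fetch_users_by_ids is a stand-in for a real batched SQL or service call:

```python
import asyncio

from aiodataloader import DataLoader


async def fetch_users_by_ids(ids):
    # Placeholder for a single batched SQL or service call.
    return {user_id: {"id": user_id} for user_id in ids}


class UserLoader(DataLoader):
    """Coalesces individual .load(user_id) calls into one batched fetch."""

    async def batch_load_fn(self, keys):
        users_by_id = await fetch_users_by_ids(keys)
        # DataLoader expects results in the same order as the requested keys.
        return [users_by_id.get(key) for key in keys]


async def main():
    loader = UserLoader()
    # These three lookups are dispatched as a single batch_load_fn call.
    users = await asyncio.gather(loader.load(1), loader.load(2), loader.load(3))
    print(users)


asyncio.run(main())
```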

We developed a custom GraphQL extension that detects repeated data fetches, monitoring the fetches performed within a given GraphQL resolver through a global contextvar. By observing resolver behavior in real time, this tool helps the team recognize where they would likely benefit from batching data fetches with a dataloader.
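The sketch below shows the general idea in a library-agnostic form; the hook names and threshold are hypothetical, and a real extension would tie the start and finish calls to the GraphQL operation lifecycle and emit metrics rather than log lines:

```python
import contextvars
import logging
from collections import Counter

logger = logging.getLogger("fanout_detector")

# Per-request counter of (resolver, fetch kind) pairs, carried in a contextvar.
_fetch_counts: contextvars.ContextVar[Counter] = contextvars.ContextVar("fetch_counts")

FANOUT_WARNING_THRESHOLD = 50  # illustrative


def start_operation() -> None:
    """Called when a GraphQL operation begins."""
    _fetch_counts.set(Counter())


def record_fetch(resolver_name: str, fetch_kind: str) -> None:
    """Called from instrumented data-access helpers on every fetch."""
    counts = _fetch_counts.get(None)
    if counts is not None:
        counts[(resolver_name, fetch_kind)] += 1


def finish_operation(operation_name: str) -> None:
    """Called when the operation completes; flags likely N+1 patterns."""
    for (resolver_name, fetch_kind), count in _fetch_counts.get(Counter()).items():
        if count >= FANOUT_WARNING_THRESHOLD:
            logger.warning(
                "Possible N+1 in %s: resolver %s made %d %s fetches; consider a dataloader",
                operation_name, resolver_name, count, fetch_kind,
            )
```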

By combining the use of dataloaders with real-time monitoring and optimization tools we continue to iterate towards more efficient GraphQL performance.

Optimizing Resource Consumption for Scalability

To ensure scalability and reliability, especially during peak events like the Super Bowl, we focused on identifying and addressing areas of excessive resource consumption. We instrumented our request flows to measure various consumption parameters, such as the number of downstream calls, rows returned from RDS, SQL calls made, and other relevant metrics for specific GraphQL queries and REST endpoints.
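One way to wire up this kind of per-request accounting, analogous to the resolver fan-out tracking above, is a contextvar that accumulates counters for the duration of a request. This is a hedged sketch with hypothetical hook names, not our actual implementation:

```python
import contextvars
import dataclasses
import logging

logger = logging.getLogger("resource_usage")


@dataclasses.dataclass
class RequestUsage:
    """Per-request resource counters (illustrative fields)."""
    sql_calls: int = 0
    rows_returned: int = 0
    downstream_calls: int = 0


_usage: contextvars.ContextVar[RequestUsage] = contextvars.ContextVar("request_usage")


def begin_request() -> None:
    _usage.set(RequestUsage())


def record_sql(rows: int) -> None:
    usage = _usage.get(None)
    if usage is not None:
        usage.sql_calls += 1
        usage.rows_returned += rows


def record_downstream_call() -> None:
    usage = _usage.get(None)
    if usage is not None:
        usage.downstream_calls += 1


def end_request(endpoint: str) -> None:
    usage = _usage.get(None)
    if usage is None:
        return
    # In production these would be emitted as metrics tagged by endpoint or
    # GraphQL query name and aggregated into the dashboards described below.
    logger.info(
        "endpoint=%s sql_calls=%d rows=%d downstream_calls=%d",
        endpoint, usage.sql_calls, usage.rows_returned, usage.downstream_calls,
    )
```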

By aggregating these measurements into dashboards, we gained a clear view of resource utilization across different parts of our system. The dashboards enabled our teams to quickly identify queries or endpoints that were consuming a disproportionate amount of resources, allowing them to proactively investigate and optimize those areas.

This data-driven approach to resource optimization has been crucial in maintaining our platform’s scalability and performance. By continuously monitoring resource utilization and acting on the insights provided by the dashboards, we can anticipate potential bottlenecks and take preventive measures to ensure a consistent and high-quality experience for our users, even during peak events.

Ongoing journey

Ensuring reliability is an ongoing journey. The scalability challenges we’ve faced have led us to develop a proactive, data-driven approach to identifying and addressing potential issues before they impact our users.

Through the solutions we’ve implemented, we’ve seen meaningful improvements in our platform’s stability and performance. Our promotional event before the Super Bowl brought in many users and produced one of the largest load peaks our systems have seen, and it passed without incident. On average, over 2 million hours of live streams are watched every week — that’s the equivalent of 800,000 NBA games watched!

We remain committed to refining our approaches and finding new ways to provide the best possible experience for our growing community. By fostering a culture of continuous improvement and collaboration, we believe we can stay ahead of the challenges that come with growth and deliver on our promise of a reliable platform for our users.

If you enjoy solving complex problems and are interested in building community-focused products, consider joining our team!

Read Part 2 of the Whatnot eng infrastructure Super Bowl series here
