Scaling our Data Stack with Kafka and Real-Time Stream Processing

Whatnot Engineering
Sep 22, 2022

Zack Klein | Software Engineer, Machine Learning & Data Platforms

In Q2 of 2022, we set out to build a new foundational piece of our data stack: a real-time streaming & processing platform.

Whatnot has grown a lot over the past year, and so has the complexity of the problems we need to solve. In response, we turned to an event bus (a pipeline that receives events) and built a modern data stack that can support a variety of data science and analytics applications. We’ve written about our platform and some of the use cases it supports (here, here, and here); check those out if you’d like more context or a sense of some of the concrete applications for this stack.

This post will give an overview of this system, the reasons we built it, and some of the strategic decisions we made around testing, event schematization, event serialization, and stream processing.

The problems

After about a year of doing data engineering and machine learning in production, we hit a big pain point: coupling to transactional systems can make analysis much harder than it needs to be.

For example — in early May of 2022, we shipped a feature called Reactions. With this feature, users can tap a heart button inside a livestream to trigger an animation overlaying the video feed.

Reactions in a livestream

For most features up until this use case, our Extract, Transform, Load (ETL) strategy has been to process batches of change-data-capture (CDC) logs from our app’s transactional systems into our data warehouse. Once ingested, these logs are transformed by dbt into user-friendly tables that get consumed by several downstream applications (humans/BI tools, ML models, etc.). Historically, this approach has worked nicely for us.

But — there’s a big assumption baked into this strategy: that the data we’d like to analyze exists in the CDC logs in the first place!

Reactions broke this assumption.

When we implemented Reactions, we chose not to store every single reaction in the transactional system. This is because they are extremely write-heavy — they fire when any user in any livestream taps the react button — and not storing every single reaction helped reduce the volume of writes (and therefore the overall load) on the storage layer.

This decision was the right one for the feature — but it created a separate problem: it turned valuable information for analysis/modeling into “dark data” — data that we know exists, but that we don’t capture anywhere in our transactional systems.

This use case highlighted the need for a different paradigm: we needed to decouple a feature’s storage implementation from how we emit data about its usage for analysis and modeling.

Solution

Whatnot’s event bus

So, keeping in mind that the main requirement for this system was to decouple how features store their data from how they report data, we chose to introduce an “event bus” into our stack — with Apache Kafka as the backbone.

Three of Whatnot’s backend services would need to send events to the event bus:

  • the main backend service — used for critical slower-moving data like orders and shipments
  • the live service — used for operational faster-moving data like chats and auctions
  • the ranking service — used to drive the home feed

When looking into the different ways we could run Apache Kafka, we evaluated solutions based on the following criteria:

  • Low effort to manage the infrastructure
  • Ability to use Terraform to define our environment in code — this is critical for us to be able to reproduce new environments easily
  • Ability to easily use Kafka Connect to create integrations with other existing systems
  • Native integration with DataDog
  • Robust support plans to make sure we get the help we need when we need it

Ultimately, we chose to run Kafka using Confluent Cloud because it met these requirements.

Since each of the backend services that would be our first event producers operates under its own unique set of constraints, we had to make some tough decisions about how to implement them. I’ll highlight a few of the key ones: how we implemented tests, schemas, serialization, and stream processing.

Testing event producers

From the beginning, we knew the reliability of these event producers would be critical. What better way to help ensure the producers behave as expected than by writing unit tests that run every time any change to the code is checked in?!

We decided that wherever we implemented event producers, we’d run a local Kafka cluster in CI/CD using Docker. This lets the application send events in tests in a way that closely mirrors production, giving us the ability to write black-box tests.

An example black-box test
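
As a rough sketch of what such a test can look like: the example below assumes a pytest suite, the confluent-kafka Python client, a local broker at localhost:9092 started in CI via Docker, and a hypothetical application helper emit_reaction publishing to a hypothetical user_actions topic; none of these names are our exact code.

```python
# Illustrative sketch only; not Whatnot's actual test code.
import json
import uuid

from confluent_kafka import Consumer

from myapp.events import emit_reaction  # hypothetical application helper


def test_reaction_event_is_published():
    # Exercise the application code path that should emit an event.
    emit_reaction(user_id="u_123", livestream_id="ls_456")

    # Read the event back from the local Kafka broker started in CI via Docker.
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": f"test-{uuid.uuid4()}",  # fresh group so the test sees all messages
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["user_actions"])  # hypothetical topic name
    msg = consumer.poll(10.0)
    consumer.close()

    # Black-box assertions: only the event's observable shape and contents,
    # never the producer's internal implementation details.
    assert msg is not None and msg.error() is None
    event = json.loads(msg.value())
    assert event["action"] == "reaction"
    assert event["user_id"] == "u_123"
    assert event["livestream_id"] == "ls_456"
```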

These types of tests have proven to be tremendously valuable, for reasons including:

  • We verify the event’s structure by confirming it is (de)serializable and contains relevant fields.
  • We verify the correctness of the event’s contents (by checking that it contains the specific pieces of data we expect).
  • We can change the implementation details of the event production however we like without needing to change the tests.

(Check out this article for more about how we like to do testing).

Schemas

Kafka practitioners face a tricky decision when they write event producers: what schema(s) to use for different event types. There’s a spectrum here; at one extreme, a single schema for all event types, and at the other, a bespoke schema for each individual event type.

We landed somewhere in the middle: we looked at our use cases, came up with a few event schemas that covered the majority of event types, and implemented a handful of them, including:

  • User action events fire after users take particular actions like chatting or tapping the react button.
  • Livestream product events fire when products in a livestream go through a relevant state change, like when an auction for a product starts or ends.

So far, this choice has paid off nicely because adding new events is safe and easy. When we want to add new events that match our existing schemas, we don’t need to add new Kafka topics, data pipelines, data warehouse models, etc. We just check in a change to the schema, write the code, write a test, and we’re good to go!
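
For illustration, here is roughly what that looks like in practice. The field names below are hypothetical, but the idea is that a brand-new action (like tapping the react button) reuses the existing user action schema instead of getting its own topic and pipeline:

```python
# Hypothetical payloads; both conform to the same user action schema, so a new
# action type needs no new Kafka topic, data pipeline, or warehouse model.
chat_event = {
    "action": "chat",
    "user_id": "u_123",
    "livestream_id": "ls_456",
    "occurred_at": "2022-09-01T17:03:11Z",
    "metadata": {"message_length": 42},
}

reaction_event = {
    "action": "reaction",  # new action type, same schema
    "user_id": "u_789",
    "livestream_id": "ls_456",
    "occurred_at": "2022-09-01T17:03:12Z",
    "metadata": {},
}
```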

Message serialization

Kafka practitioners must also choose a serialization format for messages they send over the wire. Many choose a binary serialization format like Protocol Buffers or Apache Avro because they typically have better performance and backward compatibility than human-readable formats like JSON.

We decided to use JSON as our message serialization format for simplicity. We determined we could get our first event producers shipped faster if we didn’t use Protocol Buffers or Avro.

However, we know that at some point we will want to revisit this, so we implemented our event producers in such a way that the serialization details are abstracted away from callers. The following code example isn’t an exact representation, but it gives the gist of how we separate message serialization details from the code that emits events. This has allowed us to move quickly while leaving us a future path to change serialization formats.
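
Here is a minimal sketch of that pattern. It assumes a Python producer built on the confluent-kafka client; the names (UserActionEvent, EventBus, the user_actions topic) are illustrative, not our actual code:

```python
# Illustrative sketch only: callers build typed events and hand them to the bus;
# only the bus knows that events are currently serialized as JSON.
import json
from dataclasses import dataclass, asdict

from confluent_kafka import Producer


@dataclass
class UserActionEvent:  # hypothetical event type
    user_id: str
    livestream_id: str
    action: str  # e.g. "reaction" or "chat"


class EventBus:
    def __init__(self, bootstrap_servers: str, topic: str):
        self._producer = Producer({"bootstrap.servers": bootstrap_servers})
        self._topic = topic

    def emit(self, event: UserActionEvent) -> None:
        # Serialization lives here and only here. Swapping JSON for Avro or
        # Protocol Buffers later means changing this method, not the callers.
        payload = json.dumps(asdict(event)).encode("utf-8")
        self._producer.produce(self._topic, value=payload, key=event.user_id)
        self._producer.flush()


# Caller code never touches serialization details:
bus = EventBus("localhost:9092", "user_actions")
bus.emit(UserActionEvent(user_id="u_123", livestream_id="ls_456", action="reaction"))
```

Because callers only ever hand typed events to the bus, changing the serialization format later means changing one method rather than every call site.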

Stream processing

Once events fire from the producers and flow through the system, Kafka practitioners need to implement stream processing to turn individual events into something more useful. For this, we used KSQL and Rockset.

For lighter operations (e.g. simple joins and roll-ups) and for storing/serving the results of the stream processing so they can be used back in the application, we use Rockset. Xin’s post on real-time signals in our home feed is a great read for more details on how we use Rockset. For more intensive stream processing (e.g. rolling window aggregations), we use KSQL, which allows us to emit metrics back into Rockset for serving.
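
We do this kind of aggregation in KSQL rather than in application code, but for readers unfamiliar with rolling windows, here is a conceptual sketch in Python of the sort of metric involved (a per-livestream count of reactions over the last minute). This is not how we implement it, and the topic and field names are hypothetical:

```python
# Conceptual sketch only: in production this runs as a KSQL rolling-window
# aggregation, not as a Python consumer.
import json
import time
from collections import defaultdict, deque

from confluent_kafka import Consumer

WINDOW_SECONDS = 60  # hypothetical window size

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumption: local broker
    "group.id": "reaction-rollup-sketch",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["user_actions"])  # hypothetical topic name

windows = defaultdict(deque)  # livestream_id -> timestamps of recent reactions

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    if event.get("action") != "reaction":  # hypothetical field names
        continue
    now = time.time()
    window = windows[event["livestream_id"]]
    window.append(now)
    # Evict timestamps that have fallen out of the rolling window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    # In our stack, this metric would be written back to Rockset for serving.
    print(event["livestream_id"], len(window))
```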

The combination of these two systems allows us to calculate real-time metrics based on our events and put them back into the app in near real-time!

Outcome

This project has been a massive unlock for us. With Reactions, for example, we can now comfortably ingest tens of millions of Reaction events per day, in a schema designed specifically to be friendly for analysis and modeling, while remaining completely decoupled from the feature’s storage implementation.

In addition to achieving the initial goal of decoupling our events from the way features store their data, the event bus has also allowed us to introduce real-time features.

Thanks for reading this post! Do you like working with Kafka? Or putting data into production? If so, WE ARE HIRING — please don’t be shy and reach out.
