Data Contracts in the Modern Data Stack

Zack Klein | Engineer, Machine Learning and Data Platforms

This post is the written version of our talk at Data Council (Austin 2023). You can find the video for that here.

As Whatnot continues to grow at a remarkable pace, the amount of data processed by our data stack has skyrocketed. In previous articles, we shared how we overcame some of the scaling challenges we encountered along the way.

This post focuses on how we tackled the surge in analytics events flowing through our data stack. These events, implemented by engineers to track user or system activity in the app, have multiplied since the inception of our data stack, growing more than 100x in just over two years, an average increase of roughly 30% per month.

Lots of events!

We’re thrilled with the increased usage of our platform — but we have experienced some pain points. This post will go into detail about how and why we introduced data contracts into our events system to solve some of these challenges, the benefits we’ve seen from doing so, and some of the things we learned the hard way.

The problems

Problem 1: Too much noise

As our application started emitting more events, we noticed our event data was confusing and inconsistent, and there was a ton of it. We had over 300 (!!) different types of events, despite being barely over three years old as a company. Events weren’t named consistently, fields within these events weren’t always the same data type, and different platforms implemented the same events with subtly different fields. All of this added up to an unpleasant experience for data consumers, starting with simply having far too many tables to sift through before analysis could even begin:

Too many tables

Problem 2: Too much inconsistency

At Whatnot, we love building things exceptionally fast. As we grew, we noticed we’d have the same conversations when adding logging to a new feature. For example, “I want to log event A from feature A, which is kind of like event B from feature B, but it’s also a little like event C from feature C”. The lack of structured thinking around events forced producers to make decisions about logging without input from data consumers. This is a lose-lose situation: it slows down feature teams and leads to data consumers not getting the data they need.

Problem 3: Too hard to fix things

Software that produces events is just like any software: sometimes it doesn’t run when it’s supposed to, sometimes it’s subtly incorrect, etc. We noticed it was difficult to fix issues because there weren’t clear maintainers for each event. This meant the data consumer became responsible for tracking down the right person (or people) to fix issues — which led to fixes taking much longer than they needed to.

The solution: high level

We made two big decisions at the beginning of this journey to solve the problems above at a high level:

  1. We will have one “data highway” for analytics events. Logging an event anywhere is the same as logging it everywhere: we support a single, company-wide platform that every team uses.
  2. We will have a standard schema for analytics events. We chose to describe our analytics events according to the framework: Actor Action Object (e.g. user placed order or system sent push notification). While we don’t expect this to cover absolutely every use case, it covers the vast majority, which dramatically accelerates the conversation around modeling events so we don’t need to start from scratch every time.

These two decisions unified our logging framework into something consistent for all parties. And keeping these two high-level decisions in mind, we implemented a system to help us accomplish these goals at scale.

The solution: details

Here is an architecture diagram of the system we built based on the decisions we made above:

Our data contracts implementation

There are four key components in this architecture. Let’s break them down one by one.

First, we implemented consistent producer interfaces in each producer platform (the green box on the left side of the diagram). Concretely, this means that in each platform that emits events, we implemented a library that producers must use. These libraries conform to a common protocol, so engineers who work across multiple codebases have a consistent experience when they emit analytics events.
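To make this concrete, here is a minimal sketch of what such an interface could look like in Python; the names AnalyticsLogger and log_event are illustrative, not our exact API.

```python
# A minimal sketch of a common producer interface. AnalyticsLogger and
# log_event are hypothetical names used for illustration only.
from typing import Protocol

from google.protobuf.message import Message


class AnalyticsLogger(Protocol):
    """The protocol every platform-specific producer library conforms to.
    Equivalent interfaces exist in the other producer platforms, so emitting
    an analytics event feels the same in every codebase."""

    def log_event(self, event: Message) -> None:
        """Attach the standard header and hand the event off to the pipeline."""
        ...
```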

An example of our unified schema format

Second, we built a shared schema (the blue box in the middle of the diagram). This shared schema is defined as a Protobuf message in a separate git repository. All events follow the same schema: there is a generic “header” with important metadata that gets automatically collected by the library (a unique event ID, a timestamp, etc.), and then the event itself. This repository also houses the code generated from the schema; the libraries in each platform require producers to use this generated code.
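For illustration, the overall shape of that schema could look something like the sketch below; the message names, fields, and package are hypothetical rather than our actual definitions.

```protobuf
// Illustrative sketch of the shared schema's shape (names are hypothetical).
syntax = "proto3";

package analytics.v1;

import "google/protobuf/timestamp.proto";

// Metadata collected automatically by the producer libraries.
message EventHeader {
  string event_id = 1;                      // unique ID for each event
  google.protobuf.Timestamp event_time = 2; // when the event fired
  string platform = 3;                      // e.g. ios, android, web, backend
}

// Every analytics event is the generic header plus the event payload itself.
message AnalyticsEvent {
  EventHeader header = 1;
  // The concrete event (e.g. user_followed_user) goes here, for example as a
  // oneof over the registered event messages.
}
```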

Third, we reused our existing data pipelines (the yellow box in the middle of the diagram) to transport data from producers to consumers. There are two key features of this part of the stack. First, we chose to abstract the pipeline details (which tools we use, the batching intervals at which we send messages, the retry logic when message production fails, etc.) away from data producers. This abstraction gives us the flexibility to change implementation details of the pipeline later without massive refactoring.

The second key feature is that the pipelines are near-real-time. Our data pipeline infrastructure receives events and makes them available for subscription in low single-digit seconds.
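To make the first of these features concrete, here is a minimal sketch of the kind of abstraction we mean; Transport, BufferedTransport, and publish_batch are hypothetical names, and the actual pipeline technology is deliberately left out.

```python
# Illustrative sketch of hiding the pipeline behind a small interface.
# Transport, BufferedTransport, and publish_batch are hypothetical names.
from typing import Protocol


class Transport(Protocol):
    """All producers know about the pipeline: serialized events go in."""

    def send(self, payload: bytes) -> None:
        ...


class BufferedTransport:
    """One possible implementation detail: batch events and flush periodically.
    The underlying client (queue, stream, HTTP collector, ...) can change
    without touching any producer code."""

    def __init__(self, client, batch_size: int = 100) -> None:
        self._client = client
        self._buffer: list[bytes] = []
        self._batch_size = batch_size

    def send(self, payload: bytes) -> None:
        self._buffer.append(payload)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self._client.publish_batch(self._buffer)  # hypothetical client call
            self._buffer.clear()
```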

We went from hundreds of tables to two!

Fourth and finally, we built exposures (the red box on the right side of the diagram). These are simplified, lightly transformed tables and the only interfaces consumers can use: the public API of the analytics events pipeline. As mentioned above, the pipelines upstream of these tables are near-real-time, which lets us update the exposures extremely quickly. For example, with Snowflake we take advantage of Snowpipe and micro-batching to make data available to query in these user-friendly, high-quality tables within five minutes of event production.

An example

While the concepts behind this architecture are abstract, they are simple to implement and operate in practice. To demonstrate, let’s walk through what it would take for an event producer to implement an event that fires when a user follows another user. There are three steps to this process:

Step 1: Declare a schema

User followed user event

Everything starts with the schema. The first thing an event producer does is define the structure of the data they would like to emit. Once they’ve done this, they are also required to “register” their event by attaching metadata that includes a description and the team that owns it (we use Protobuf custom options for this). That looks like this:

User followed user event with metadata
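As a rough sketch (the package, option names, and field numbers below are hypothetical, not our actual definitions), a registered event could look something like this:

```protobuf
// Illustrative sketch of an event schema with ownership metadata attached via
// Protobuf custom options. Names and field numbers are hypothetical.
syntax = "proto3";

package analytics.events.v1;

import "google/protobuf/descriptor.proto";

// Hypothetical custom options used to "register" an event.
extend google.protobuf.MessageOptions {
  string description = 50001;
  string owning_team = 50002;
}

// Actor (user) - Action (followed) - Object (user).
message UserFollowedUser {
  option (description) = "Fired when a user follows another user.";
  option (owning_team) = "community";

  string follower_user_id = 1;  // the actor
  string followed_user_id = 2;  // the object
}
```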

A brief side note: we are extremely excited to have this metadata in the schema. We plan to use it for some cool things, like automatically generated dbt documentation and QA checks. Stay tuned for more on that.

Step 2: Implement the producer

Once the schema has been declared, the producer’s job is to generate code from the schema and use it, via the common library, in their codebase. In Python, this looks like this:

User followed user event, implemented
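Here is a minimal sketch of what that producer function could look like, assuming a hypothetical generated module and the field names from the schema sketch above:

```python
# Minimal sketch of a producer-side helper built on the generated code.
# The module path and names are hypothetical.
from analytics.events.v1 import events_pb2  # code generated from the schema


def user_followed_user(follower_user_id: str, followed_user_id: str) -> None:
    """Build the event from generated code and pass it to the common library."""
    event = events_pb2.UserFollowedUser(
        follower_user_id=follower_user_id,
        followed_user_id=followed_user_id,
    )
    # ... hand `event` to the common library, which attaches the header and
    # sends it to the pipelines (intentionally omitted here).
```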

There’s one step missing from this function, where we actually send the data to the pipelines, but hopefully you get the idea. The next step of the process is simple: just call this function wherever you’d like to emit your analytics event!

Calling the user_followed_user function
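For example, reusing the hypothetical user_followed_user helper from the sketch above:

```python
# Hypothetical call site: emit the event wherever the follow actually happens.
def follow_user(follower_user_id: str, followed_user_id: str) -> None:
    # ... application logic that creates the follow relationship ...
    user_followed_user(follower_user_id, followed_user_id)
```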

Step 3: Query!

Assuming the code implemented in the previous step runs without issues (we’ll talk about how we do testing in the next section), the producer is done. They can ship their code with high confidence, and once it’s in production they can query for their events:

Querying for user followed user events
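For illustration, assuming a hypothetical exposure table and column layout in Snowflake, such a query might look like:

```sql
-- Hypothetical exposure table and column names, for illustration only.
select
    event_id,
    event_time,
    payload:follower_user_id::string as follower_user_id,
    payload:followed_user_id::string as followed_user_id
from analytics.events
where event_name = 'user_followed_user'
  and event_time >= dateadd('day', -7, current_timestamp());
```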

Monitoring and quality

The architecture and example above define a contract, a Protobuf message, that dictates what the data should look like when an event fires in production, along with the tools and infrastructure that enforce this contract at various stages of an event’s lifecycle.

However, while tools like code generation, common libraries, and unit tests go a long way toward enforcing this contract, they won’t catch everything. Therefore, in addition to the tooling that runs in production, we also need rigorous monitoring so we can catch quality issues as soon as possible.

As part of this system, there are three places we catch issues:

Before anything gets implemented

In the shared schema GitHub repository, we use Buf along with custom checks we’ve written ourselves to ensure that only certain types of changes are allowed: changes must be backward compatible, enums must follow our naming conventions, and certain pieces of metadata must be attached using Protobuf options.
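For a flavor of what one of those custom checks could look like (the naming rule, file path, and build step below are illustrative, not our exact checks), here is a small script that inspects a compiled FileDescriptorSet:

```python
# Sketch of a custom schema check run in CI alongside Buf, assuming the repo
# builds a FileDescriptorSet first (e.g. `buf build -o image.bin`). The rule
# shown here (top-level enum values must be upper case) is illustrative only.
import sys

from google.protobuf import descriptor_pb2


def check_enum_value_naming(image_path: str) -> list[str]:
    errors = []
    fds = descriptor_pb2.FileDescriptorSet()
    with open(image_path, "rb") as f:
        fds.ParseFromString(f.read())
    for file_proto in fds.file:
        for enum in file_proto.enum_type:  # top-level enums only, for brevity
            for value in enum.value:
                if value.name != value.name.upper():
                    errors.append(
                        f"{file_proto.name}: enum value {enum.name}.{value.name} "
                        "should be UPPER_SNAKE_CASE"
                    )
    return errors


if __name__ == "__main__":
    problems = check_enum_value_naming("image.bin")
    print("\n".join(problems))
    sys.exit(1 if problems else 0)
```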

During development

As part of the common libraries, we implemented testing harnesses in each of the event producers so engineers can test event-producing code with our normal testing tooling. This lets them check both syntactic correctness (the right event fires at the right time) and semantic correctness (the event fires with the right data).
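Here is a sketch of what such a test could look like in Python, assuming a hypothetical capture_events helper provided by the testing harness and the follow_user function from the earlier example:

```python
# Illustrative producer-side unit test; capture_events is a hypothetical
# helper exposed by the common library's testing harness.
from analytics.testing import capture_events  # hypothetical test helper


def test_follow_emits_user_followed_user_event():
    with capture_events() as events:
        follow_user("user_123", "user_456")

    # Syntactic correctness: exactly one event of the right type fired.
    assert len(events) == 1
    event = events[0]
    assert event.DESCRIPTOR.name == "UserFollowedUser"

    # Semantic correctness: the event carries the right data.
    assert event.follower_user_id == "user_123"
    assert event.followed_user_id == "user_456"
```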

In the wild

Finally, we use Monte Carlo to monitor the data after it arrives in the data warehouse. This type of monitoring helps us validate freshness (that data arrived on time), dimension drift (when we receive new events), anomaly detection (when we receive a sudden large increase or decrease in event counts), and more.

Wrapping up

Results

In just a few months, we have already seen some incredible results. More than half of our engineering organization (~80 out of 100+ engineers) have already contributed changes to our schema.

Most importantly, we have seen a huge amount of collaboration between event producers and event consumers. It’s now standard for producers and consumers to talk through what data they’d like to produce from a feature, how this data will be used to analyze the feature’s impact, and then use the shared schema to solidify this agreement in a contract.

What comes next

We like to think our journey has only just started. The contracts we have implemented above give us a great foundation — but there is still much to be done, including:

  1. Tracking more semantic qualities (who maintains the data, who consumes the data, etc.) using these data contracts.
  2. Implementing similar contracts for some of our other important data sources.
  3. Introducing automatic logging, built on top of data contracts, for certain core, reusable parts of our application.

We are now emitting more data than ever, but it’s fully schematized, and, when it breaks, we know exactly how to fix it. As we work through these next steps, we will have more details and lessons to share!

If this post interests you, we’re looking for product managers and software engineers to join us! Find open roles at https://www.whatnot.com/careers.
