Building a Modern Data Stack at Whatnot

Zack Klein
Whatnot Engineering
Mar 18, 2022

The ecosystem of tools and platforms in the data science and analytics world has changed A LOT in the past decade or so, which makes assembling a new stack from scratch a whole lot of fun but also a little intimidating. So, we wanted to share the tools we’ve chosen for our data stack at Whatnot, the patterns we’ve implemented with these tools, and some of our learnings along the way.

We began this journey in June of 2021, after a few trials exploring different potential architectures for our data stack.

The data stack: first principles

Before picking any of the tools in our stack, we set some clear goals to help guide us — since we knew the core set of technologies we chose would determine a lot of our capabilities in the long run. Two things were clear from the beginning:

  1. The data stack needs to support business-critical reporting. Data is essential to our company culture. One example: each team (technical and non-technical) publishes weekly updates. Each of these updates includes a breakdown of the team’s key metrics and an analysis of how the team’s actions have moved these metrics. Good data is the lifeblood of this process — without reliable metrics, we don’t have a way to understand if we’re succeeding or not. The data stack powers the vast majority of this reporting.
  2. The data stack needs to support data science in production. We knew from the beginning that we would need to do “data science in production” to build an excellent product for our users at scale. Real-time livestream discovery and trust & safety were (and remain) two critical use cases for data science and machine learning in our application. The data stack needs to be the foundation that these core services leverage.

There’s a nice narrative between these two goals: first, we start by building out a solid foundation for internal reporting. Second, we use that foundation to build analytic capabilities into our app itself. This “data science journey” has been nicely summarized in the Data Science Hierarchy of Needs, shown below:

Data science hierarchy of needs

The goal of building out our data stack was to establish the bottom two sections (and some of the third section) of this pyramid. So… how did we do it?

The data stack: the tools

Architecture diagram: the data stack and its tools

The architecture diagram above shows a high-level, holistic view of our data stack and the ecosystem of partners that have helped us move extremely quickly. To briefly summarize the components of this architecture:

  • Data comes from a few core services (e.g. Segment, and AWS services such as S3, DMS, and DynamoDB Streams)
  • We batch/stream that data into blob storage (S3)
  • We load that raw data into our data warehouse (Snowflake) using a scheduler (Airflow); see the sketch after this list
  • We transform the raw data into curated, consumer-friendly views (DBT)
  • Data consumers (other production services, data science/ML services, analysis tools, etc.) use these views
  • We monitor end-to-end (DataDog and Slack)
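
To make the load step a bit more concrete, here is a minimal sketch of what copying a day of raw JSON events from S3 into a Snowflake table could look like. The stage, table, and connection details are made up for illustration (this is not our actual configuration), and in our stack this kind of logic runs as a scheduled Airflow task.

```python
# Hypothetical sketch of the raw load step: copy a day of JSON events from an
# S3 stage into a Snowflake table with a single VARIANT column.
# Stage, table, and credential names are illustrative, not our actual objects.
import os

import snowflake.connector  # pip install snowflake-connector-python


def load_raw_events(batch_date: str) -> None:
    """Copy one day's worth of raw events from S3 into RAW.SEGMENT.RAW_EVENTS."""
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        warehouse="LOAD_WH",
        database="RAW",
        schema="SEGMENT",
    )
    try:
        # RAW_EVENTS is assumed to be a table with a single VARIANT column,
        # so each JSON document lands as one row of semi-structured data.
        conn.cursor().execute(
            f"""
            COPY INTO RAW_EVENTS
            FROM @SEGMENT_S3_STAGE/events/{batch_date}/
            FILE_FORMAT = (TYPE = 'JSON')
            """
        )
    finally:
        conn.close()
```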

I won’t discuss all the nuts and bolts here, but there are a few interesting areas to highlight.

First, the data warehouse: we have partnered closely with Snowflake as our data warehouse provider. We chose Snowflake because of its:

  • Excellent support for semi-structured and unstructured data (which we have a lot of!)
  • SQL-first approach and simplicity/ease-of-use
  • Horizontal and vertical scalability
  • Native compatibility with other tools we know and use (Sigma, DBT)
  • Robust, flexible, and extendable security features

Second, for our data transformation layer, we rely on two tools: Apache Airflow and DBT. For any data replication work (where we transform data before it is loaded into the data warehouse), we write the transform-and-load logic in Python and schedule it with Airflow. We also use Airflow to orchestrate training our machine-learning models.
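
As a rough sketch of that orchestration pattern (the DAG id, schedule, and task bodies here are illustrative assumptions, not our production pipelines), a replication task and a model-training task can be chained in Airflow like this:

```python
# Hypothetical Airflow DAG sketch: nightly replication followed by model training.
# The DAG id, schedule, and task bodies are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def replicate_source_data(**context):
    """Pull data from a source system, transform it in Python, and load it."""
    ...


def train_discovery_model(**context):
    """Train a model (e.g., for livestream discovery) on the freshly loaded data."""
    ...


with DAG(
    dag_id="nightly_replication_and_training",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    replicate = PythonOperator(
        task_id="replicate_source_data", python_callable=replicate_source_data
    )
    train = PythonOperator(
        task_id="train_discovery_model", python_callable=train_discovery_model
    )

    # Training only runs once replication has succeeded.
    replicate >> train
```

Keeping replication and training in the same DAG means the scheduler, rather than ad-hoc coordination, answers the question of whether the data a model trains on is actually fresh.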

For our ELT work (where we transform data after it is loaded into the warehouse), we heavily leverage DBT. Our team is particularly excited about continuing to build this part of the stack: DBT is the secret sauce we use to take raw data from disparate systems and formats and unify it so people or other systems can easily consume it.

What makes this so exciting is that we have opened this layer of the stack up for our awesome team of data scientists, analysts, and analytics engineers to self-service. As a diverse data team, we all feel comfortable writing SQL, so any member of the team can jump into DBT and create a view of the data that is:

  • Highly tailored to their needs
  • Version-controlled in git
  • Testable using a large suite of pre-configured DBT tests, or custom tests we write ourselves (see the sketch after this list)
  • Compatible with the DBT lineage graph, so we can track and query complex dependencies
  • Automatically documented via DBT’s generated docs
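
We haven’t gone into exactly how these DBT runs, tests, and docs get triggered, but as one hedged example (the project path and schedule below are made up), the DBT CLI can be invoked from the same Airflow scheduler that handles the rest of the stack:

```python
# Hypothetical sketch: triggering DBT runs, tests, and docs from Airflow.
# The project/profiles paths and the schedule are made-up assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_DIR = "/opt/dbt/analytics"  # illustrative project location

with DAG(
    dag_id="dbt_build",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"dbt run --project-dir {DBT_DIR} --profiles-dir {DBT_DIR}",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"dbt test --project-dir {DBT_DIR} --profiles-dir {DBT_DIR}",
    )
    dbt_docs = BashOperator(
        task_id="dbt_docs_generate",
        bash_command=f"dbt docs generate --project-dir {DBT_DIR} --profiles-dir {DBT_DIR}",
    )

    # Only test and document the models once they have been (re)built.
    dbt_run >> dbt_test >> dbt_docs
```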

While we’ve already enjoyed a lot of success from DBT, we’re still early in our journey with the tool — and we’re excited to leverage it more in the future and further benefit from how quickly we can iterate with it. In the meantime, check out the following snippet of our lineage graph:

Part of our DBT lineage graph

Key decisions that made us faster

A few decisions that seemed small in the early days of the stack have made our lives much easier as we’ve scaled. Among them:

  1. No dimensional data model for now. Early on, we knew that our data model and raw data would frequently change because the day-to-day of our business moved so fast. Because of this, we knew the data stack would need to thrive on this kind of frequent change. So, we decided not to make a traditional dimensional data model for our warehouse until we had fewer changes day-to-day.
  2. Semi-structured data + ELT > structured data + ETL. Snowflake has excellent support for semi-structured data. Because of this, and because we chose to operate without an “official” dimensional model (see above), we load mostly semi-structured data (usually JSON) into the warehouse and parse it into flattened views based on the currently accepted schema. This makes it easy, cheap, and relatively low-risk to adapt downstream views as the upstream data changes over time. It isn’t always possible, but it is our default behavior and has allowed us to move very quickly (see the sketch after this list).
  3. Views > tables. Coordinating scheduled jobs across different systems is a challenging problem in most data stacks. “When does this data get loaded?” is a question asked frequently in data team Slack channels. We haven’t totally solved this problem, but we minimize it by handling data transformations with views wherever possible, which makes new raw data queryable through downstream views as soon as it lands. As above, this sometimes isn’t practical for large or costly datasets, but it removes the job-coordination problem for many use cases.
  4. Immutable logs & CDC > replication & snapshots. As mentioned above, we knew early on that our data was going to change a lot as the company grew. So, we decided to keep full histories for the majority of our datasets. This decision has saved us time, made it easier to adapt to new data sources and schemas, and unlocked insights that are only available with a full history.
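
To make decisions 2 and 3 concrete, here is a hedged sketch of a flattened view over a raw table of JSON events. The table, view, and field names are invented for illustration; in practice a view like this would typically live as a DBT model, but the example uses the Snowflake connector directly to keep it self-contained in Python:

```python
# Hypothetical sketch: expose raw JSON events as a flattened, consumer-friendly view.
# Table, view, and JSON field names are invented; only the pattern matters.
import os

import snowflake.connector

FLATTEN_ORDERS_VIEW = """
CREATE OR REPLACE VIEW ANALYTICS.PUBLIC.ORDERS_V AS
SELECT
    event:order_id::STRING      AS order_id,
    event:buyer_id::STRING      AS buyer_id,
    event:amount_cents::NUMBER  AS amount_cents,
    event:created_at::TIMESTAMP AS created_at
FROM RAW.SEGMENT.RAW_EVENTS
WHERE event:type::STRING = 'order_created'
"""

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="TRANSFORM_WH",
)
try:
    # Because this is a view over the raw table, new events are queryable as soon
    # as they land, and an upstream JSON change only requires editing the view.
    conn.cursor().execute(FLATTEN_ORDERS_VIEW)
finally:
    conn.close()
```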

Where we’re headed next

We’re really proud of the progress we’ve made so far, especially considering that we’ve been able to build this stack (and more!) in under a year. That being said, we have audacious goals for the future, including:

  • Building out a state-of-the-art, real-time data storage and processing stack to enable our app’s real-time ML and analytics use cases.
  • Revamping our data warehouse to scale with new datasets, use cases, and a whole lot more volume — ultimately enabling anyone at our company to use massive amounts of data in their day-to-day.
  • Embedding machine learning more deeply into some of the core components of our product (like live video discovery and fraud detection).

One last thing…

Thanks for reading — and before you go:

We are HIRING!

If you found this post interesting and you’re looking to join a data team that likes to put data science into production, head to our careers page and reach out!

Zachary Klein is a Software Engineer at Whatnot

