Same Data, Sturdier Frame: Layering in Dimensional Data Modeling at Whatnot

Whatnot Engineering
Mar 13, 2023

Alice Leach, Lalita Yang, Stephen Bailey | Data Engineering

A central design goal of Whatnot’s data platform is to allow anyone in the company to build data products on their own — fast. In a blog post last year, we shared the decisions that have helped us move fast, including “No centralized dimensional data model!”

Well, the company doubled, then tripled, then quadrupled in size. Build times got slower. Ownership got messier. Usage became ubiquitous, and for a wider variety of purposes. We looked at solutions to address these growing pains and, lo and behold, a centralized modeling layer following Kimball principles was the most promising.

In this blog post, we’ll talk through the arguments for centralized data modeling and the key decisions we made when implementing our solution.

Motivation

For the past year, we have provided data users the freedom to build their own modular data marts in Snowflake, as long as those assets are created within the platform’s Snowflake dbt project.

That approach has been a huge success. We see a quarter of the company querying tables directly in Snowflake, and many more consuming data in visualization tools. By keeping business logic in a centralized project, we have been able to better react to quality issues and update logic as the application changes.

But as we’ve grown, we’ve also seen issues:

  • A need for multi-purpose, business-aware models to drive metrics and app experiences
  • Show-stopping performance challenges, especially for user behavior data
  • Unclear ownership of shared models

We were also seeing growth in the generation and usage of event data that provided alternate ways to calculate key metrics. What we needed was an intermediate layer where we could publish public, long-term stable models that could serve as a source of truth for critical metric calculations and support a wide variety of use cases — including user-facing ones — while remaining performant. A Kimball data mart serves this purpose nicely.

This “core” set of models acts like a paved road for users: if they follow the path, their journey should be safer, shorter, and faster. But we won’t stop adventurous users from going off the path if they want to.

Kimball Design Decisions

To start, we identified the top business processes driving data usage. Naturally, this led us to marketplace fundamentals — orders, users, listings — but also to Whatnot’s unique take on social commerce: livestreams, followers, reactions, messaging, and the like.

With the initial use case in hand, we had to make a series of key design decisions around just how formally we wanted to model the business. Below are a few of the decision points.

Data Model Types

We introduced guidance for data developers on the different types of models we expect to have in our core schema. These are the building blocks for our downstream reporting.

  • Facts are (generally) transaction-like records that should not change after the transaction is completed. They tend to be high velocity (there are a lot of them) and have measurements associated with them, such as price or duration.
  • Dimensions describe the objects in a fact table. For example, many orders might come from the same livestream, so the livestream would be considered a “dimension” of the order.
  • Bridge tables map two entities together when they have a one-to-many or many-to-many relationship. For example, we have a category dimension and a livestream watches fact, and a livestream can have many categories, so a bridge table connects the two (see the sketch after this list).
  • Map tables, unlike bridge tables, can be thought of as upstream of facts and dimensions. These are intermediary tables that map an ID to a categorization.
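
To make the bridge-table pattern concrete, here is a minimal dbt-style sketch of how a livestream-to-category bridge might look. The model and column names (stg_livestream_categories, dim_livestream, dim_category) are illustrative rather than our actual schema, and the join ignores SCD validity windows for brevity.

```sql
-- brg_livestream_category.sql: illustrative bridge table, not our actual schema.
-- Resolves the many-to-many relationship by pairing each livestream's
-- surrogate key with the surrogate key of every category attached to it.
select
    dim_livestream.livestream_key,
    dim_category.category_key
from {{ ref('stg_livestream_categories') }} as livestream_categories
inner join {{ ref('dim_livestream') }} as dim_livestream
    on livestream_categories.livestream_id = dim_livestream.livestream_id
inner join {{ ref('dim_category') }} as dim_category
    on livestream_categories.category_id = dim_category.category_id
```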

Slowly Changing Dimensions

Until this project, data users did not have an easy way to look at the attributes of an entity at a certain point in time. This is incredibly important to us because so much of our offering is time-sensitive: show availability, expiring offers, nail-biting auctions, and timed giveaways. Slowly changing dimensions (SCDs) are dimensions whose changes are tracked using valid_from and valid_to columns. For example, a user’s address can change over time, and each time a change occurs, a new record is created (Type 2 SCD).
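
As a rough illustration of the Type 2 pattern, the sketch below derives valid_from and valid_to for a user-address dimension from a change log. The source model stg_user_address_changes is made up for this example, and dbt snapshots are another common way to produce the same shape.

```sql
-- dim_user_address: illustrative Type 2 dimension built from a hypothetical change log.
-- Each address change becomes a new row; valid_to is taken from the next
-- version's valid_from and stays null for the current record.
select
    user_id,
    address,
    changed_at as valid_from,
    lead(changed_at) over (
        partition by user_id
        order by changed_at
    ) as valid_to
from {{ ref('stg_user_address_changes') }}
```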

Natural Keys vs Surrogate Keys

Until this project, data users relied on a model’s natural keys to join, which is intuitive and simple, but sometimes confusing — string or int? — and, with the introduction of SCD, could lead to exploding joins.

We decided all data models in our core schema would have a varchar surrogate key. We generate this for each table using dbt_utils.generate_surrogate_key(['entity_id', 'valid_from']). By always using a hashed surrogate key, we avoid potential integer-to-varchar join issues. The downside to surrogate keys is that they can be confusing for developers, who are used to joining on “id” columns rather than “keys”.
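
Continuing the hypothetical user-address dimension from the previous section, a minimal sketch of that key generation might look like the following; the upstream model int_user_address_versions is invented for the example.

```sql
-- Illustrative surrogate key for a Type 2 dimension row.
-- generate_surrogate_key hashes its inputs into a single varchar,
-- so each version of a user gets its own stable key.
select
    {{ dbt_utils.generate_surrogate_key(['user_id', 'valid_from']) }} as user_key,
    user_id,
    address,
    valid_from,
    valid_to
from {{ ref('int_user_address_versions') }}
```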

Null Dimensions

With modern data warehouses, referential integrity violations are not as problematic, but we wanted to follow Kimball’s best practices and avoid null foreign keys in our facts. Into each of our dimensions, we union a row for the null natural key, with default_null_key as its surrogate key. Some have recommended using -1 as the default null surrogate key, but we decided to be explicit with our naming.
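
As a sketch of that convention, a dimension model might union in the default row like this; the model and column names are hypothetical.

```sql
-- Illustrative category dimension with an explicit default row, so facts
-- with no category can point at 'default_null_key' instead of null.
select
    {{ dbt_utils.generate_surrogate_key(['category_id']) }} as category_key,
    category_id,
    category_name
from {{ ref('stg_categories') }}

union all

select
    'default_null_key' as category_key,
    null as category_id,
    'Unknown' as category_name
```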

Long-term stability

We want developers to build production applications on top of this data — to treat these core models as they would any reliable versioned service. One responsibility, then, is to make sure the schema is very stable and changes only with significant warning and time. New columns are added with care and only when we are certain the underlying data-generating processes are equally robust.

Performance and Developer Speed

Alongside model design is model performance. Previously, our approach was to run models on a daily batch cadence, with little optimization. With our growth, improving performance, reducing waste, and building tooling that lets developers avoid thinking about performance have become much more front-of-mind.

Incremental processing

The major performance bottleneck was the inconsistent application of incremental data loading. Coming up with a uniform solution to incremental processing across a whole dbt project turns out to be a tricky problem that is as much art as science, especially when supporting applications with different freshness SLAs.
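
The core mechanism here is dbt's incremental materialization. A minimal sketch of the pattern, with hypothetical model and column names, looks like this:

```sql
-- Illustrative incremental fact model; names are hypothetical.
{{
    config(
        materialized='incremental',
        unique_key='order_key',
        incremental_strategy='merge'
    )
}}

select
    {{ dbt_utils.generate_surrogate_key(['order_id']) }} as order_key,
    order_id,
    livestream_id,
    order_total,
    created_at
from {{ ref('stg_orders') }}
{% if is_incremental() %}
-- On incremental runs, only process rows newer than what is already loaded.
where created_at > (select max(created_at) from {{ this }})
{% endif %}
```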

In addition to incrementalizing existing models, we made a major change by moving the upstream “staging” models for our event logs into their own dbt project, so that they could be orchestrated and managed independently of our self-serve marts. This allows for deeper integration with our ingestion processes. Dagster’s dbt integration and modular code deployments have been very useful in supporting the transition to a multi-dbt project world and making the change a gradual one from an end-user perspective.

Continuous integration/deployment

With fifty contributors to our main dbt project and close to ten merged pull requests a day, introducing a robust CI/CD process quickly became a high priority. To validate developer code, we use a CI run in a staging environment triggered by a GitHub Action. This is based on dbt’s Slim CI concept and orchestrated via the Autoscaling CI job from the dbtc package. With many pull requests in progress at the same time, it was challenging to ensure that changes were not erroneously picked up by dbt’s state:modified selector between CI runs. To address this, we use a GitHub Action to merge the main branch into each development branch before the CI run is started. In addition, after each change is merged back into the main branch, a CD job runs dbt compile in our production environment; the CI run defers to this job.

While the CI run can catch many issues in development, some bugs still slip through. To maintain the quality of models in our repo, we introduced dbt-checkpoint, formerly known as pre-commit-dbt. It offers an array of useful checks; at a minimum, we suggest including check-script-has-no-table-name, which identifies hard-coded references in the modified models. While we have found dbt-checkpoint’s tests invaluable, we have run into a couple of bugs associated with parsing the models themselves. For this reason, we recommend pairing dbt-checkpoint with a linter like SQLFluff to minimize noisy errors caused by SQL formatting.

Removing Old Models

In the world of self-serve data, last week’s critical report is this week’s old news. While not the most glamorous part of data engineering, cleaning up old, archived, or deprecated data can prevent performance issues and information overload. dbt does not automatically drop objects from your schema when they are removed from the dbt project.

To remedy this, we implemented a weekly job to drop deprecated models based on dbt’s delete_outdated_tables() macro. This compares objects in the dbt managed Snowflake schemas with objects in the dbt project, notifies users via Slack that they will be dropped, and then, if no action is taken by the user, drops the tables from Snowflake.
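
The sketch below shows the core of that comparison-and-drop logic as a dbt run-operation macro. It is a simplified stand-in for the macro we actually run, with a hypothetical name and without the Slack notification step. A scheduled job might invoke it first in dry-run mode, giving owners a chance to object, and then with dry_run=false to actually drop the tables.

```sql
{% macro drop_orphaned_tables(schema_name, dry_run=true) %}
    {# Collect the lowercase names of every model dbt currently manages. #}
    {% set managed_models = [] %}
    {% for node in graph.nodes.values() if node.resource_type == 'model' %}
        {% do managed_models.append(node.name | lower) %}
    {% endfor %}

    {# List the tables that currently exist in the target schema. #}
    {% set existing = run_query(
        "select table_name from information_schema.tables where table_schema = '" ~ (schema_name | upper) ~ "'"
    ) %}

    {# Anything in the schema that dbt no longer knows about is a candidate for dropping. #}
    {% for row in existing.rows %}
        {% if (row[0] | lower) not in managed_models %}
            {% if dry_run %}
                {{ log('Would drop ' ~ schema_name ~ '.' ~ row[0], info=true) }}
            {% else %}
                {% do run_query('drop table if exists ' ~ schema_name ~ '.' ~ row[0]) %}
            {% endif %}
        {% endif %}
    {% endfor %}
{% endmacro %}
```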

We have now dropped hundreds of dead tables via this job and, by extension, identified many out of date references in existing downstream analyses. If you are developing fast, we encourage you to clean up fast too.

Outcomes

With the implementation of the changes described above, we saw build times for the core models drop dramatically. In tandem, the stability of our daily project build has increased. Data can be an incredible asset for the business: if it’s clean, performant, and organized, then new developers can integrate with business processes, new product features, and each other. But every new addition creates more weight, and it’s the data platform’s job to bear it.

As we scoped and planned, we kept adoption top of mind. Kimball data modeling is more complex than most users are accustomed to: thinking through facts and dimensions, the difference between warehouse keys and entity IDs, and separating aggregations from the building blocks. But for developers of data products, rather than consumers, it makes building sophisticated products much faster and more sustainable.

As important as the time savings is the clearing of space for future improvements. By establishing an interface that’s centrally supported, the data platform team can:

  • support a tight SLA on the delivery and quality of data assets
  • expose point-in-time data attributes (rather than summaries)
  • allow end users to define purpose-built data aggregations

And, of course, the data platform team can also reduce their management surface area, allowing our team to continue to move as fast as our developers.

If you’re interested in working on a data team doing real-time analytics, deploying machine learning services at scale, or innovating with data platform design, check out our careers page!
