Dandy Star Schema

We’ve taken the traditional star schema and modified it to be more modular. The result is smaller blocks that are easier to build with, debug, and move around; clearer interfaces; and fewer ambiguous staging tables.

Published in

Dandy Engineering, Product & Data Blog

4 min readSep 17, 2021

The traditional star schema consists of dimension tables (dim), fact tables (fact), and aggregation tables (agg). The rows in fact tables are typically records of a temporal occurrence (ex: fact_order) and the dimension tables capture lots of information about long-lived entities (ex: dim_customer). An aggregation table can select from fact or dim, but dim and fact can only select from staging tables. A dimension table like dim_customer can have many, many columns that are essentially aggregations of some sort (ex: total orders to date, total orders in first 3 months, etc) and that pull from staging tables with less defined organization.

At Dandy we’ve decided that the star schema can be made more modular. We can reduce the amount of information living in ambiguous staging tables, and we can break up dimension tables into pieces with clearly defined purposes and sources. Aggregations pertaining to a fact or dimension can safely build off of other fact/dim tables (which are clearly defined) rather than from staging tables that often mean different things to different people. More modularity, with clearly defined interfaces, also makes for less mistakes and very methodical debugging.

Additionally, for those who use Looker, more modularity in underlying tables allows for more modularity with Looker views. Smaller views can be easily moved around or added to different explores, meaning faster implementation times and less fields in a given explore.

At Dandy, we have the following table types:

Dimension (Dim)

(ex: dim_customer)

Our dimension tables are much lighter than the traditional dimension tables and contain few aggregations, if any at all. As is traditional, the rows in dimension tables represent entities that are long-lived. In the example of dim_customer, likely fields include: customer_id, linking id 1, linking id 2, name, address, joined_date.

The name of the table usually involves a singular noun rather than a plural one.

Fact

(ex: fact_order)

Fact tables typically record temporal occurrences. In the example of fact_order, likely fields include: order_id, customer_id, shipping_address_id, time_created, status, assorted product information fields).

The name of the table usually involves a singular noun rather than a plural one.

Aggregation (Agg)

(ex: agg_cohort_retention)

An aggregation tables takes information from fact and dim tables and aggregates it up somehow. There aren’t too many rules for them.

Aggregation that Extends a Dimension (AggDimExt)

(ex: aggdimext_customer_orders_summary)

(ex: aggdimext_customer_zendesk_conversation_summary)

An aggregation that extends a dimension takes in fact and dimension tables and aggregates them so that the final table can be joined one-to-one with a dimension table. For the example of aggdimext_customer_orders_summary, we can take the fact_order table and aggregate by customer_id. Fields that we aggregate to might include number of orders, first order date, and last order date.

In the naming of these tables, the name of the dimension is singular while the remainder of the name usually contains a plural of some form

Aggregation that Extends a Fact (AggFactExt)

(ex: aggfactext_invoice_invoice_line_items_summary)

The same ideas apply here as with aggdimext, but the aggregation query groups by a fact id instead of a dimension id. In this example, we are grouping the line items for an invoice up to the level of an invoice, but an invoice is itself a fact.

Date Spine Cross Joined with Another Table (DateSpineCross)

(ex: dataspinecross_cohort_customer_purchases)

The case for cross joining a date spine to another table comes up more often than one might expect. We give this type of table a special designation in our warehouse so that its structure is immediately clear to all who inspect the table.

Staging (Stg)

We avoid staging tables where not logically necessary — there is no need for an extra layer of complexity. Unlike in certain paradigms, we do not have queries for fact/dim/etc tables whose only logic is to cherrypick fields from an underlying staging table. We are also very comfortable having the queries for any fact/dim/agg/aggdimext/etc tables involve numerous CTEs (usually this setup is easier to copy into a db idea for debugging). Staging tables should only exist where there is a logical need for them.

Utility Tables (Util)

One of our use cases for utility tables is unusual mappings that we have between entities where we had to define the mapping logic on the data team side instead of having that mapping pre-defined in production data.