Choosing the Right Data Warehouse Modelling Approach

An Analytics Engineer's Guide To Data Modelling Options

Andy Sawyer
11 min read · Jun 4, 2024

Imagine you work as an Analytics Engineer at a busy restaurant. Every day, customers make reservations, place orders, and complete payments. All of this data flows into your restaurant’s transactional database, capturing the details of each interaction.

But a transactional database, while great for running day-to-day operations, isn’t ideal for analysing data to uncover trends and insights that could help grow your business. That’s where the data warehouse comes in. A data warehouse is a separate database optimised for storing large volumes of historical data as well as fast querying and analysis.

DALL-E Generated Data Modelling Discussion

The challenge is structuring the data in your warehouse to enable efficient analysis while remaining flexible enough to handle changing business needs. There are several common (and less common) approaches to modelling data in a data warehouse. In this article, we’ll look at seven key modelling techniques, weigh their pros and cons, and help you choose the right approach for your data warehouse.

The Example Transactional Database

Before diving into data warehouse modelling, let’s briefly look at what a typical transactional database for a restaurant might contain:

  • Customer table with details like name, email, phone number
  • Booking table with reservation date, time, party size, etc.
  • Product table with descriptions, linked to a product price table and a product group table
  • Order table linking customers to their selected menu items
  • Order Detail Items table with the quantity of each menu item per order
  • Payments table with totals, payment method, etc.
Base ERD for Sample Restaurant

The transactional database uses a normalised structure optimised for processing transactions, but this structure makes analysis and reporting more difficult. A data warehouse modelling approach denormalises and restructures the data to optimise it for analysis in the gold layer, while still facilitating fast writes into the silver layer.
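
To make this concrete, here's a minimal sketch of a subset of that transactional schema, built as an in-memory SQLite database from Python. The table and column names are illustrative assumptions rather than the exact schema shown in the ERD.

import sqlite3

# Illustrative subset of the restaurant's transactional (OLTP) schema.
# Table and column names are assumptions for demonstration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT,
    email       TEXT,
    phone       TEXT
);
CREATE TABLE booking (
    booking_id   INTEGER PRIMARY KEY,
    customer_id  INTEGER REFERENCES customer(customer_id),
    booking_date TEXT,
    booking_time TEXT,
    party_size   INTEGER
);
CREATE TABLE product (
    product_id       INTEGER PRIMARY KEY,
    description      TEXT,
    product_group_id INTEGER
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    order_ts    TEXT
);
CREATE TABLE order_detail (
    order_id   INTEGER REFERENCES orders(order_id),
    product_id INTEGER REFERENCES product(product_id),
    quantity   INTEGER
);
CREATE TABLE payment (
    payment_id     INTEGER PRIMARY KEY,
    order_id       INTEGER REFERENCES orders(order_id),
    total_amount   REAL,
    payment_method TEXT
);
""")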

Data Warehouse Modelling Approaches

Third Normal Form (3NF)

Third Normal Form (3NF) is a classical relational database modelling approach that minimises data redundancy. In 3NF, each non-key column depends only on the table’s primary key.

Applying 3NF to our restaurant data warehouse, we would separate the tables further so that there are no hierarchies within individual table structures; instead, each hierarchy is represented as a chain of linked tables. Transactional tables like Orders and Payments would reference the primary keys of the metadata tables, which in turn would reference higher-level metadata tables.

Essentially, the change we'd make is to recognise that the customer table contains address details that form a natural hierarchy, and to split these out into component tables, as sketched below. You generally wouldn't use this approach for a presentation/gold layer, but it might be appropriate for a silver layer. This is especially true if you are ingesting multiple sources that are natively structured at different levels of grain.

3NF ERD
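
As a rough sketch of that change, the address attributes could be pulled out of the customer table into their own chain of linked tables. Again, this is illustrative only; the table and column names are assumptions, and SQLite is used purely for demonstration.

import sqlite3

# 3NF sketch: the address hierarchy is split out of the customer table
# into linked component tables. Names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE state (
    state_id   INTEGER PRIMARY KEY,
    state_name TEXT
);
CREATE TABLE city (
    city_id   INTEGER PRIMARY KEY,
    city_name TEXT,
    state_id  INTEGER REFERENCES state(state_id)
);
CREATE TABLE address (
    address_id INTEGER PRIMARY KEY,
    street     TEXT,
    postcode   TEXT,
    city_id    INTEGER REFERENCES city(city_id)
);
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT,
    email       TEXT,
    phone       TEXT,
    address_id  INTEGER REFERENCES address(address_id)
);
""")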

Pros:

  • Reduces data redundancy and saves storage space
  • Enforces data integrity through primary key relationships
  • Provides flexibility to accommodate future changes

Cons:

  • May require many table joins for analysis, impacting query performance
  • Can be complex to understand and navigate for business users

You can find out more about 3NF here.

Kimball Star Schema

The Kimball Star Schema is a pivotal concept in data warehousing, introduced by Ralph Kimball, a prominent figure in the field of data warehousing and business intelligence. It is designed to streamline data storage and enhance query performance, making complex data structures more manageable and accessible for business analytics.

Key Components

  1. Fact Table: The fact table is the centrepiece of the star schema, containing quantitative data such as sales figures, transaction counts, and other measurable metrics. Each record in the fact table corresponds to a specific event or transaction and includes foreign keys that link to dimension tables. This table is typically large due to the volume of transactional data it holds. Commonly, it includes numerical measures like sales amount, quantity sold, and cost, along with keys referencing dimension tables.
  2. Dimension Tables: Dimension tables are peripheral tables that provide context to the data in the fact table. They contain descriptive attributes related to the dimensions of the business, such as time periods, geographical locations, products, customers, and employees. Dimension tables are generally smaller and more static compared to the fact table. They store data in a denormalized form to simplify queries, making them easier to understand and navigate.

Structure and Design

The star schema gets its name from its layout, which resembles a star (unless you use Mermaid to build your diagrams as code per the below example 🤦‍♂️). The fact table is at the centre, surrounded by dimension tables, each connected to the fact table through foreign keys. This design is intuitive, allowing users to perform straightforward joins between the fact and dimension tables.

In our restaurant example, we would have a central Orders fact table containing foreign keys to dimension tables like Customers, Date, Product, etc. The dimension tables contain redundant, denormalized data to avoid the need for further joins. This is a fairly standard approach for a gold or presentation layer in your warehouse.

Star Schema ERD for Orders Data
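
Here's a minimal sketch of what that star might look like: one fact table surrounded by denormalised dimensions, plus the kind of one-join-per-dimension query it enables. Table and column names are assumptions, and SQLite stands in for your warehouse engine.

import sqlite3

# Star schema sketch: one central fact table surrounded by denormalised
# dimensions. Table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    name         TEXT,
    city         TEXT,   -- denormalised: no separate address tables
    state        TEXT
);
CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,
    description   TEXT,
    product_group TEXT    -- denormalised: group name held directly
);
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,   -- e.g. 20240604
    full_date TEXT,
    month     TEXT,
    year      INTEGER
);
CREATE TABLE fact_order_line (
    order_id     INTEGER,
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    quantity     INTEGER,
    sales_amount REAL
);
""")

# A typical analytical query: one join per dimension, then aggregate.
query = """
SELECT d.month, p.product_group, SUM(f.sales_amount) AS revenue
FROM fact_order_line f
JOIN dim_date    d ON f.date_key    = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.month, p.product_group;
"""
print(conn.execute(query).fetchall())   # empty until data is loaded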

Pros:

  • Optimised for fast querying by minimising joins
  • Intuitive for business users familiar with the business process
  • Aggregations are simple to calculate

Cons:

  • Denormalization increases data redundancy
  • Handling slowly changing dimensions can be tricky
  • Less flexible for accommodating future changes

You can find out more about Star Schema design here.

Snowflake Schema

The Snowflake Schema is a variation of the star schema in data warehousing, designed to handle complex and highly normalized data structures. Named for its intricate, snowflake-like shape, this schema organizes data into multiple related tables, which enhances data integrity and reduces redundancy. However, this comes at the cost of increased complexity in query writing and potential performance impacts. It could almost be seen as a blend between Star Schema and 3NF.

Key Components

  1. Fact Table: The fact table remains the central table in the snowflake schema, containing quantitative data such as sales, revenue, or other business metrics. It holds foreign keys to the dimension tables, which provide context for the facts. Each record in the fact table represents a measurable event and includes numerical measures and foreign keys linking to the relevant dimensions.
  2. Dimension Tables: Unlike the star schema, the snowflake schema further normalizes dimension tables. Each dimension table can be broken down into multiple related tables, creating a more normalized structure. For example, our customer dimension has had the address split out into postcode, city and state tables, each linked by foreign keys. This normalization reduces data redundancy but increases the number of joins required for queries.

Structure and Design

The snowflake schema resembles a snowflake due to its multi-level structure. Dimension tables are normalized into multiple related tables, spreading out into a complex, web-like arrangement. This design aims to save storage space and improve data integrity by eliminating redundancy.

Snowflake Schema for Orders Data
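
A minimal sketch of the snowflaked customer dimension described above, with postcode, city and state broken out into their own tables. Names are assumptions; note how many extra joins it now takes to reach the state name.

import sqlite3

# Snowflake schema sketch: the customer dimension is normalised further
# into postcode, city and state tables. Names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_state (
    state_key  INTEGER PRIMARY KEY,
    state_name TEXT
);
CREATE TABLE dim_city (
    city_key  INTEGER PRIMARY KEY,
    city_name TEXT,
    state_key INTEGER REFERENCES dim_state(state_key)
);
CREATE TABLE dim_postcode (
    postcode_key INTEGER PRIMARY KEY,
    postcode     TEXT,
    city_key     INTEGER REFERENCES dim_city(city_key)
);
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    name         TEXT,
    postcode_key INTEGER REFERENCES dim_postcode(postcode_key)
);
CREATE TABLE fact_order_line (
    order_id     INTEGER,
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    quantity     INTEGER,
    sales_amount REAL
);
""")

# Reaching the state name now takes three extra joins compared to the star.
query = """
SELECT s.state_name, SUM(f.sales_amount) AS revenue
FROM fact_order_line f
JOIN dim_customer c  ON f.customer_key = c.customer_key
JOIN dim_postcode pc ON c.postcode_key = pc.postcode_key
JOIN dim_city     ci ON pc.city_key    = ci.city_key
JOIN dim_state    s  ON ci.state_key   = s.state_key
GROUP BY s.state_name;
"""
print(conn.execute(query).fetchall())   # empty until data is loaded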

Pros:

  • Reduces data redundancy compared to star schema
  • Enforces data integrity through normalisation
  • Makes dimension tables easier to maintain

Cons:

  • May require more joins compared to star schema, impacting query performance
  • Harder for business users to understand and navigate

You can find more about Snowflake Schema here.

One Big Table (OBT)

The One Big Table (OBT) design, also known as a flat table or wide table, is a data warehousing approach where all data is consolidated into a single, denormalised table. This approach contrasts sharply with more normalised schemas like the star and snowflake, offering a simpler structure at the expense of increased redundancy and potential performance issues with very large datasets.

Key Components

  1. Single Table: The core characteristic of the OBT design is that all relevant data is stored in one comprehensive table. This table includes a wide array of columns that capture all the attributes and measures necessary for analysis. The table might contain thousands of columns, with each row representing a unique record encompassing multiple dimensions and metrics.
  2. Attributes and Metrics: In the OBT design, attributes (typically found in dimension tables in other schemas) and metrics (typically found in fact tables) are combined into a single table. For example, customer details, product information, and sales figures would all be stored in the same table, with each record providing a full snapshot of a transaction or event.

Structure and Design

The structure of the OBT design is straightforward, involving a single table where each row includes all the necessary data points. This flat structure eliminates the need for joins, making it simple for users to query and retrieve data without understanding complex table relationships.

For our restaurant, we would have one large table for each of the three key events: bookings, orders and payments. I’ve seen some fairly heated debates on whether OBT or Star Schema is the better approach. The answer is ‘it depends’. If you’re pulling your data into Power BI, it will expect a Star Schema style dataset. If, however, you’re pulling the data into Tableau, you might prefer an OBT approach. Note that if you do opt for OBT, you should keep any reusable logic in supporting tables that can be referenced over and over.

OBT Example
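
Here's a minimal sketch of an OBT-style orders table, with dimension attributes and measures sitting side by side in one wide table. The columns shown are assumptions, and a real implementation would carry far more of them.

import sqlite3

# One Big Table sketch: dimension attributes and measures live side by side
# in a single wide, denormalised table. Columns are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE obt_orders (
    order_id       INTEGER,
    order_ts       TEXT,
    -- customer attributes, repeated on every order line
    customer_name  TEXT,
    customer_email TEXT,
    customer_city  TEXT,
    -- product attributes, repeated on every order line
    product_name   TEXT,
    product_group  TEXT,
    -- measures
    quantity       INTEGER,
    sales_amount   REAL,
    payment_method TEXT
);
""")

# No joins required: analysis is a straight scan of the wide table.
query = "SELECT product_group, SUM(sales_amount) FROM obt_orders GROUP BY product_group;"
print(conn.execute(query).fetchall())   # empty until data is loaded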

Pros:

  • Extremely fast query performance with no joins required
  • Simple to understand with all data in one place
  • Good for Data Science consumption

Cons:

  • High data redundancy, requiring significant storage space
  • Hard to make changes without altering the entire table structure
  • Queries with many columns can be cumbersome to write and maintain

You can find more about OBT here.

Data Vault 2.0

Data Vault 2.0 is a modern approach to data warehousing that aims to provide a scalable, agile, and auditable data model. It was developed by Dan Linstedt and builds on the original Data Vault methodology, enhancing it to better support today’s complex data environments. Data Vault 2.0 addresses the need for handling big data, unstructured data, and a variety of data sources while maintaining data integrity and historical accuracy. As with 3NF, this would sit in your silver layer. It’s not something you would point a BI tool at.

Key Components

  1. Hubs: Hubs store unique business keys with a unique surrogate key and metadata like load date and source information. Each hub represents a core business concept, such as a customer, product, or order. Hubs are highly stable and rarely change, providing a consistent reference point for the data warehouse.
  2. Links: Links capture the relationships between business keys stored in the hubs. Each link table contains foreign keys to the relevant hubs, along with metadata. Links represent transactions, associations, or hierarchies between entities. They are used to model many-to-many relationships and changes in relationships over time.
  3. Satellites: Satellites store descriptive attributes and context for the business keys in hubs or relationships in links. They include metadata for tracking changes over time, such as the load date and source. Satellites can evolve without impacting the core business keys and relationships, allowing for flexible adaptation to new business requirements.

Structure and Design

Data Vault 2.0 is designed with a modular architecture that separates data storage into these three types of tables, promoting scalability and flexibility. Hubs, links, and satellites are designed to be loaded incrementally and in parallel, allowing for efficient handling of large volumes of data and changes over time.

Data Vault ERD Example
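
The sketch below shows a simplified hub, link and satellite for the customer and order concepts. Hash keys are reduced to plain text columns and the names are assumptions, so treat it as an illustration of the pattern rather than a full DV2.0 implementation.

import sqlite3

# Data Vault 2.0 sketch: hubs hold business keys, links hold relationships,
# satellites hold descriptive history. Names and types are illustrative
# assumptions (real implementations typically use hashed keys).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,   -- surrogate (hash) key
    customer_id   TEXT,               -- business key from the source system
    load_date     TEXT,
    record_source TEXT
);
CREATE TABLE hub_order (
    order_hk      TEXT PRIMARY KEY,
    order_id      TEXT,
    load_date     TEXT,
    record_source TEXT
);
CREATE TABLE link_customer_order (
    customer_order_hk TEXT PRIMARY KEY,
    customer_hk       TEXT REFERENCES hub_customer(customer_hk),
    order_hk          TEXT REFERENCES hub_order(order_hk),
    load_date         TEXT,
    record_source     TEXT
);
CREATE TABLE sat_customer_details (
    customer_hk   TEXT REFERENCES hub_customer(customer_hk),
    load_date     TEXT,               -- a new row per change preserves history
    name          TEXT,
    email         TEXT,
    phone         TEXT,
    record_source TEXT,
    PRIMARY KEY (customer_hk, load_date)
);
""")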

Pros:

  • Designed to handle changes in business requirements over time
  • Separates structural changes from informational changes
  • Provides traceability and preserves history

Cons:

  • Can be complex to design and implement
  • Requires many tables and joins to assemble data for analysis
  • Queries can be difficult to write and understand

You can find more about DV2.0 here.

Activity Schema

The Activity Schema is a data warehousing approach designed by Ahmed Elsamadisi to capture detailed records of business activities or events in a structured and efficient manner. This schema focuses on documenting the actions or transactions performed within a business, making it particularly useful for event-driven data and detailed transaction logging. It has been used in systems that need to track a high volume of granular events, such as e-commerce websites, financial systems, or IoT applications.

Key Components

  1. Activity Table: The central table in the activity schema is the activity table, which records each business activity or event. Each row represents a single event or transaction, capturing details about what happened, when it happened, and other relevant context. The table’s attributes are defined as part of the standard, so it is straightforward to implement.
  2. Dimension Tables: Attached to the activity table is an optional dimension table that provides additional context for the events recorded in the activity table. The dimension might include details about customers, products, locations, time, and other relevant entities, depending on the activity stream it relates to.

In our restaurant example, we might have a customer activity table with associated customer dimension. The activity table would track the activities of the customer, such as their bookings, orders and payments. The details of these are held within the feature_json column, with an optional column to store the revenue impact where relevant.

Activity Schema Example
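
Here's a minimal sketch of a customer activity stream with its optional dimension. The columns loosely follow the published activity schema standard, but the exact names and the SQLite/JSON handling are assumptions for illustration.

import sqlite3
import json

# Activity schema sketch: one append-only activity stream per entity, with
# event details packed into feature_json. Column names loosely follow the
# published standard, but are assumptions here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_stream (
    activity_id    TEXT PRIMARY KEY,
    ts             TEXT,    -- when the activity happened
    customer       TEXT,    -- entity identifier
    activity       TEXT,    -- e.g. 'made_booking', 'placed_order', 'made_payment'
    feature_json   TEXT,    -- activity-specific details
    revenue_impact REAL     -- optional, populated where relevant
);
CREATE TABLE dim_customer (
    customer TEXT PRIMARY KEY,
    name     TEXT,
    email    TEXT
);
""")

# A new activity type needs no schema change, just a new value in `activity`
# and a different shape inside feature_json.
conn.execute(
    "INSERT INTO customer_stream VALUES (?, ?, ?, ?, ?, ?)",
    ("a1", "2024-06-04T19:30:00", "cust_42", "placed_order",
     json.dumps({"order_id": 7, "items": 3}), 86.50),
)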

Pros:

  • Simple and intuitive design with a very small number of tables
  • Additional activities by an entity can be captured without schema change
  • New tables are only required when new entities need to be tracked

Cons:

  • Relatively new approach without widespread adoption yet
  • May not be suitable for all business domains and analysis needs

More information on this approach can be found here.

Entity-Centric Modelling

Entity-Centric Modelling is a flexible approach proposed by Maxime Beauchemin that focuses on modelling around entities like Customers and Products. Each entity gets its own table, with JSON used to track an assortment of metrics, including aggregates. This approach does not require additional dimension tables, as the entity tables are at the lowest grain of the entity and can hold their attributes directly in the table.

In a restaurant context, we’d have a Customers table with columns for the customer attributes, plus a JSON column to hold time-bound metrics such as visit.7d, visit.14d and sale.30d.

Entity-Centric Modelling
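
A minimal sketch of an entity-centric Customers table, with the time-bound metrics packed into a JSON column. The metric names follow the example above, but everything else (SQLite, client-side JSON parsing) is an assumption for illustration.

import sqlite3
import json

# Entity-centric sketch: one row per customer, with time-bound metric
# aggregates packed into a JSON column. Metric names follow the example
# in the article; everything else is an illustrative assumption.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE customers (
    customer_id  INTEGER PRIMARY KEY,
    name         TEXT,
    email        TEXT,
    metrics_json TEXT   -- pre-computed aggregates keyed by metric name
)
""")

conn.execute(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    (42, "Jane Diner", "jane@example.com",
     json.dumps({"visit.7d": 2, "visit.14d": 3, "sale.30d": 240.50})),
)

# Consumers unpack the semi-structured column; most warehouses would use a
# native JSON accessor rather than parsing client-side as done here.
for name, metrics in conn.execute("SELECT name, metrics_json FROM customers"):
    print(name, json.loads(metrics)["sale.30d"])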

Pros:

  • Flexible and adaptable to changing business requirements
  • Simple to understand with a small number of tables
  • Captures history effectively within the metrics column

Cons:

  • Queries can be complex, often requiring unnesting of semi-structured data
  • Challenges when entities have overlapping attributes or behaviour types
  • More difficult to enforce integrity constraints compared to star schema

You can find out more here.

Choosing the Right Modelling Approach

With these seven common data warehouse modelling approaches in mind, how do you choose the right one for your data warehouse? Consider these factors:

  • Analytical requirements: What kinds of questions will you need to answer? Choose a model optimised for those query patterns.
  • Data volume and scalability: Consider how much data you have now and expect in the future. Some approaches scale better than others.
  • Ease of use: Think about who will be querying the data warehouse. Some models are more intuitive for non-technical users.
  • Flexibility: Your business is likely to evolve. Choose a model that can adapt to changing requirements.
  • Performance: Consider the trade-offs between query speed and data redundancy. Denormalized models are often faster but require more storage.

In the end, the “right” data warehouse modelling approach depends on your unique business needs and priorities. By understanding the strengths and weaknesses of each modelling technique, you can design a data warehouse that delivers fast, flexible, and insightful analytics to drive your company’s growth.


Andy Sawyer

Bringing software engineering best practices and a product driven mindset to the world of data. Find me at https://www.linkedin.com/in/andrewdsawyer/