BigQuery + Dimensional Modeling: The Perfect Match — Part 1 of 2

Learn how BigQuery with Dimensional Modeling can help you better store, analyse, and understand your large datasets.

Lee Doolan
Appsbroker CTS Google Cloud Tech Blog
11 min read · Aug 10, 2023


Introduction

Google Cloud’s BigQuery is a powerful cloud data platform that can be used to store and analyse large amounts of data. Dimensional modeling is a data modeling technique used to organise data, making it more accessible and useful.

In this two-part blog post, we will:

Part 1: Discuss what dimensional modeling is, the benefits and value it can add to an organisation, and why it’s worth the development effort.

Part 2: Illustrate how using BigQuery’s features with a dimensional model can maximise the value from your data platform investment.

So, in keeping with one of CTS’s values, let’s Crack on with Part 1 . . .

What is Dimensional Modeling?

Dimensional modeling is a data modeling technique first developed by Ralph Kimball in the 1980s, and made popular in the 1990s with the publication of his book, The Data Warehouse Toolkit (1996).

Ralph’s modeling technique involves structuring data into the shape of a star, aptly named the star schema, which is simple to visualise and easily understood by non-technical business users.

This process involves making tables that represent known business details and activities.

  • Measurements such as ‘how much’ or ‘how many’ are placed in central fact tables.
  • Information like ‘who, what, when, and where’ is put into related dimension tables.
(Image source: https://durofy.com/data-warehousing-tutorial-5-star-schema-and-snow-flake)

Why is Dimensional Modeling useful?

Dimensional modeling is a mature technique proven to be effective. Its presence can be an indicator of data maturity in an organisation, often leading to:

  • Robust information delivery & reporting.
  • Improved data governance.

Information Delivery & Reporting

Dimensional modeling enhances business performance insights, aids in making predictions, and plays a role in decision-making.

Features that help deliver these data analytics include . . .

  • Simplicity: A model that business users can readily understand, helping them explore data and make decisions.
  • Reduced Analytical Effort: Analysts no longer need to know source system structures, write complex queries, or repeat business logic.
  • Performance: The star schema is a well-optimised model built for analytical reporting, meaning queries can be executed quickly and efficiently.
  • Consistency: We now have a middle layer between source systems and final reports. Analysts are largely protected from system changes with a model containing consistent hierarchies and business logic. We can also more easily swap out or use multiple reporting tools.
  • Flexibility: The star schema is a flexible data model that can be used to store and analyse a wide variety of data and help an organisation respond quickly to change.

Data Governance

Adopting dimensional modeling applies a degree of data governance, even without a recognised corporate program.

As well as providing a consistent and standardised way to store and manage data, it improves data quality, security, compliance, and retention . . .

  • Data Quality: As well as shining a light on ‘bad’ data through improved data integration and availability, dimensional models can be used to define data quality metrics that measure the accuracy and completeness of data. This information can be used to identify and correct data errors.
  • Data Security: By clearly defining data ownership and access rights, dimensional modeling helps prevent unauthorised access to sensitive data, protecting an organisation from data breaches and security incidents.
  • Data Compliance / Lineage: With information delivery centralised from a known model, we can more easily determine the exact lineage of where and how data has been created, modified, and used. This information helps ensure compliance with data privacy regulations, such as GDPR.
  • Data Retention: As data is archived or pruned in source systems, or systems decommissioned completely, we now have a structured model to store this data indefinitely. Fact tables will often contain data sourced from old and disparate systems, for example.

Dimensional Modeling Detail

Whilst I don’t want to go into too much detail, I do want to add some further context which will help in Part 2, when we discuss BigQuery features and support.

Note: Although Slowly Changing Dimensions and Surrogate Keys are essential parts of star schema design, they are not particularly relevant to this blog post. However, I strongly recommend researching dimensional modeling in more depth.

There’s plenty of great resources out there, including Ralph Kimball’s original book and website.

Initial Approach

Kimball recommends a 4 step design process to follow before finalising which modeling techniques you are going to use and physically building your dimensional model.

This ensures you end up with a model tailored to your specific business requirements that your users can easily understand.

The 4 step process is as below. It will ultimately determine which fact and dimension tables you end up with, and how they relate to each other.

  1. Select the business process.
  2. Declare the grain.
  3. Identify the dimensions.
  4. Identify the facts.

Fact Tables

The fact table is the central table in the star schema. It contains the measures of interest, which are typically numeric values that can be aggregated. For example, the fact table in a sales data warehouse might contain measures such as sales amount, quantity sold, and profit.

  • The key element of a great fact table is selecting the right granularity.

Don’t be tempted to aggregate measures into a smaller set of rows just to deliver information for a summarised dashboard or report.

  • Where you can, keep the grain as low as possible, aligned with the business transaction that created them. This gives you the flexibility to drill up, down and across, answering questions and delivering metrics your analytics team haven’t even thought of yet.

An obvious example of this concept is creating a fact table representing a `Sales Order Line` rather than one representing a `Sales Order`.

This then gives you the ability to report sales by product too! By choosing the lowest grain, no data or flexibility is lost, keeping your data analysts and scientists happy.
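
By way of illustration, here is a minimal sketch of what such a line-grain fact table could look like in BigQuery DDL. The dataset, table, and column names are hypothetical, not taken from any real system:

-- minimal sketch, hypothetical names: a fact table at Sales Order Line grain,
-- one row per order line, with keys to the surrounding dimensions and
-- additive measures that can be summed to any level
create table if not exists warehouse.FactSalesOrderLine (
  Sales_Date        date   not null, -- joins to DimDate
  Store_Key         int64  not null, -- joins to DimStore
  Product_Key       int64  not null, -- joins to DimProduct
  Customer_Key      int64  not null, -- joins to DimCustomer
  Order_Number      string not null, -- degenerate dimension from the source system
  Order_Line_Number int64  not null,
  Quantity_Sold     int64,
  Sales_Amount      numeric,
  Profit_Amount     numeric
);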

Dimension Tables

The dimension tables provide context for the measures in the fact table. They contain descriptive data that helps to explain the measures. For example, dimension tables in a sales data warehouse might contain information about products, customers, and time periods.

  • Dimension tables are the secret sauce of a dimensional model. Ideally they should align to the entities that everybody knows, cares about and uses within your business.
  • Including hierarchies in your dimensions that represent how your business or product offering is structured is really important. Being able to drill-up and drill-down geographical locations of where sales are transacted is a common requirement for example.

Don’t be tempted to ‘snowflake’ your dimensions to avoid data redundancy/duplication, i.e. normalising. Instead, duplicate your data across rows, i.e. denormalise.

Snowflaking adds complexity to your SQL queries, as well as further join requirements and potential performance issues.

Keep it simple, stick to the star.

  • A key element is trying to ensure your fact tables are at a low enough grain to make use of as many common dimension tables as possible.

Reusing dimension tables, which contain all your agreed and known business logic, attributes, and hierarchies, keeps analysis consistent and trusted.

These reused tables are called conformed dimensions, and there is a great video here detailing this concept.

If you can simply ‘bolt-on’ a new fact table to existing dimension tables you get all the previous development work for free!
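
As a sketch of that idea (again with hypothetical names), a new stock-level fact table could simply reuse the same date, store, and product dimensions as the sales fact:

-- minimal sketch, hypothetical names: a new fact table 'bolted on' to existing
-- conformed dimensions, so stock and sales can be analysed side by side
-- using the same hierarchies and business logic
create table if not exists warehouse.FactStockLevel (
  Snapshot_Date    date  not null, -- joins to DimDate
  Store_Key        int64 not null, -- joins to DimStore
  Product_Key      int64 not null, -- joins to DimProduct
  Quantity_On_Hand int64           -- semi-additive: sum across stores/products, not across dates
);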

Why not also include how you want data to appear in your reports as attributes in a dimension table too?

For example, including a clean Financial Year label attribute displaying ‘FY 2022/23’ as opposed to just a number saying ‘2022’.

Yes, you could do this simple formatting in your reporting tool, but why not reduce your end report development effort and encourage consistency?
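
For example, a date dimension could carry both the hierarchy levels and the presentation-ready labels discussed above. This is a minimal sketch with hypothetical column names:

-- minimal sketch, hypothetical names: a date dimension with hierarchy levels
-- and report-ready label attributes
create table if not exists warehouse.DimDate (
  Date                 date   not null,
  Calendar_Year        int64  not null, -- e.g. 2022
  Calendar_Quarter     string not null, -- e.g. 'Q3 2022'
  Calendar_Month_Name  string not null, -- e.g. 'August'
  Financial_Year       int64  not null, -- e.g. 2022
  Financial_Year_Label string not null  -- e.g. 'FY 2022/23'
);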

Table Relationships

The relationships between the dimension tables and fact table are one-to-many, meaning each row in a dimension table can be linked to multiple rows in the fact tables.

  • Every fact table row should always be linked to a row in a corresponding dimension table. This means all SQL queries can be written as inner joins.
  • If a dimension entity is unknown when a fact is loaded, it should be mapped to a dummy row or perhaps a ‘stubbed’ record. This ensures data is never missed by an erroneous join caused by a missing value. Some techniques to handle these scenarios can be seen here.

An example of a simple SQL query written against a star schema is as below:

-- How much sales revenue did we receive for books,
-- in our UK stores, per City?
select s.Store_City, sum(f.Sales_Amount) as Total_Sales_Amount
from FactSales as f
inner join DimDate as d on d.Date = f.Sales_Date
inner join DimStore as s on f.Store_Key = s.Store_Key
inner join DimProduct as p on f.Product_Key = p.Product_Key
where d.Calendar_Year = 2022
and s.Store_Country = 'UK'
and p.Product_Type = 'Books'
group by s.Store_City

This pattern of always using inner joins, and knowing joins will always return all data, is critical.

It simplifies SQL queries for your analysts and self-service reporting tools, as well as being more likely to be optimised by your data platform.

The absence of a ‘real’ value to join to a dimension table when a fact is stored is information in itself and should be retained.

Mapping onto a ‘dummy’ dimension row will give additional context and visibility in end reports.

For example, we would still want to count sales where the geographical location of the transaction was unknown. Often business users seeing a report item labelled ‘Unknown’ can prompt data quality initiatives and action.
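
One common technique, sketched below with hypothetical table and column names, is to seed each dimension with an explicit ‘Unknown’ row and map missing keys onto it during the fact load, so the inner joins above still return every sale:

-- minimal sketch, hypothetical names
-- seed the dimension with an explicit 'Unknown' row
insert into warehouse.DimStore (Store_Key, Store_Name, Store_City, Store_Country)
values (-1, 'Unknown', 'Unknown', 'Unknown');

-- when loading facts, map any missing store reference onto that dummy row
insert into warehouse.FactSales (Sales_Date, Store_Key, Product_Key, Sales_Amount)
select
  src.sale_date,
  coalesce(s.Store_Key, -1) as Store_Key, -- unknown stores fall back to -1
  src.product_key,
  src.sale_amount
from staging.sales as src
left join warehouse.DimStore as s on s.Store_Code = src.store_code;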

Dimensional Modeling Criticism

Whilst I have listed many benefits of implementing a dimensional model above, it does have its critics, especially in a big data world!

You don’t have to trawl the internet very hard to find blog posts and articles detailing why an author believes Kimball methodology is now outdated and no longer needed.

Is dimensional modeling dead? Not quite yet!

Of course I don’t believe it’s quite dead yet and in the interests of balance, some examples of criticism are listed below, with my thoughts . . .

  1. Kimball methodology was conceptualised at a time when storage and compute were expensive and couldn’t be separated. However, that is no longer the case.
    Whilst this statement is partially true, and we can now store more data than we can possibly envisage ever being useful, surely this opens up an opportunity to do both performantly and cheaply?
    Why not implement a data lake, augmented with a dimensionally modelled data warehouse, to answer 80% of your organisation’s everyday and core information needs?
  2. Compute is so cheap now we can make use of a ‘schema on read’ approach to data querying and no longer need to model data.
    Although I agree with the concept of ingesting raw data with light transformation, and yes it is possible to query it as-is, in most cases the ‘schema on read’ approach simply pushes responsibility for data and its quality onto downstream processes.
    Data inconsistency, poor performance, and reliability and quality issues all pile up over time, with duplicated and increasingly complex effort. If you don’t want to materialise data, then at least consider putting views over the top that mimic a star schema, but be aware of the analytic costs of this approach.
  3. A star schema is not flexible enough to store or model unstructured and/or rapidly changing data sets.
    I agree traditional dimensional modeling as proposed in Kimball’s original text didn’t and couldn’t envisage the way data sources and formats change frequently today, so a more modern, hybrid approach to Kimball methodology can be used to mitigate this.
    Using the modern ELT paradigm, the possibility to change, rebuild and reload data into data models has never been as fast or cheap. Making use of federated views and functionality to query, flatten and make data in unstructured formats more readily available is key, and luckily enough tools like BigQuery & Dataform can help with that.
  4. Storage is now cheap, joins are bad for performance. Why not create One Big Table (OBT) instead?
    I fully agree that delivering data with fewer joins most definitely improves query performance, and offers far simpler queries. But . . .
    An OBT misses out on all the data quality, governance and consistency benefits that dimensional modeling offers. How do we track ‘missing’ data and null values etc?
    Yes, joins have a cost, but unless you are joining multi-million row dimension tables to multi-billion row fact tables, in most cases it’s not really a problem. For certain use cases an OBT may be the answer, but why not use both? A dimensional model can be used to feed/reload an OBT on demand, for instance, as sketched after this list.
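
As a rough sketch of that last point, using the hypothetical star schema tables from earlier, an OBT can be rebuilt from the conformed star on demand, keeping all governance and business logic in the dimensional model:

-- minimal sketch, hypothetical names: rebuild a One Big Table from the star schema on demand
create or replace table warehouse.ObtSales as
select
  f.Sales_Date,
  d.Financial_Year_Label,
  s.Store_Country,
  s.Store_City,
  p.Product_Type,
  f.Quantity_Sold,
  f.Sales_Amount
from warehouse.FactSales as f
inner join warehouse.DimDate as d on d.Date = f.Sales_Date
inner join warehouse.DimStore as s on s.Store_Key = f.Store_Key
inner join warehouse.DimProduct as p on p.Product_Key = f.Product_Key;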

Criticism Summary

Whilst I believe many of the authors’ arguments are productive, and valid based on their own experiences, actual workload and organisation requirements should ultimately determine whether dimensional modeling (or any other modeling) is applied (or not).

It should never be an all or nothing consideration.

With modern data platforms, we have the flexibility and speed to implement multiple solutions, or bits of each, etc.

A modern approach promoted by Databricks, for instance, is a three-layered data lake approach, where data is transformed through different states and models, with the final Gold layer often implementing a dimensional model.

You can read more about this ‘Medallion Architecture’ approach here and here.

(Image source: https://www.linkedin.com/pulse/solving-rising-demand-self-serve-business-analytics-medallion-bahl/)

Closing Thoughts — Part 1 of 2

Thanks very much for reading Part 1 of this blog post and I invite you to read Part 2 here.

I’ve tried to give a good overview of what Dimensional Modeling is, where you can learn more, and where it may potentially sit in your organisation.

I hope it’s a good primer for Part 2, where we will see how Google Cloud’s BigQuery can support, enhance, and expedite information delivery alongside a dimensional model.

Dimensional modeling really is an excellent way to share an understanding of your organisation.

About CTS

CTS is the largest dedicated Google Cloud practice in Europe and one of the world’s leading Google Cloud experts, winning 2020 Google Partner of the Year Awards for both Workspace and GCP.

We offer a unique full stack Google Cloud solution for businesses, encompassing cloud migration and infrastructure modernisation. Our data practice focuses on analysis and visualisation, providing industry specific solutions for Retail, Financial Services, Media and Entertainment.

We’re building talented teams ready to change the world using Google technologies. So if you’re passionate, curious and keen to get stuck in — take a look at our Careers Page and join us for the ride!


Lee Doolan
Appsbroker CTS Google Cloud Tech Blog

Cloud Data Warehouse Architect & Data Engineer | UK Based | https://www.linkedin.com/in/leedoolan77 | Thoughts are my own and not of my employer