BigQuery + Dimensional Modeling: The Perfect Match — Part 2 of 2

Learn how BigQuery with Dimensional Modeling can help you better store, analyse, and understand your large datasets.

Lee Doolan

Published in

Appsbroker CTS Google Cloud Tech Blog

7 min readAug 17, 2023

BigQuery + Dimensional Modeling: The Perfect Match

Continued from Part 1

In this second part of a two-part blog post, we will follow up on Part 1’s Dimensional Modeling overview and see how alongside Google Cloud’s BigQuery we can support, enhance and expedite information delivery.

A reminder of this two-part blog series objectives are:

Part 1: Discuss what dimensional modeling is, the benefits and value it can add to an organisation, and why it’s worth the development effort.

Part 2: Illustrate how using BigQuery’s features with a dimensional model can maximise the value from your data platform investment.

So following on with one of CTS’s values, let’s Crack on with Part 2 . . .

How BigQuery supports Dimensional Modeling

Now we have an overall idea of what dimensional modeling is and how it helps us from a business, governance and information delivery point of view, let us now introduce BigQuery.

I will demonstrate how we can use BigQuery to add further value, cost savings and flexibility.

Power, Performance & Storage

Dimension Modeling Context:

I described how we should aim to use the lowest Fact table grain for ultimate flexibility, and also not to be frightened to duplicate data in our dimension tables.

I also described how the star schema is a well-optimised model built for analytical reporting.

BigQuery Support:

This of course means we are storing and having to process ‘more’ logical data.

However this isn’t a problem for BigQuery where it includes features like these to help . . .

Columnar storage: This allows BigQuery to only read the columns that are needed for a query which can significantly improve performance.
Automatic compression: BigQuery automatically compresses data as it is loaded, significantly reducing storage costs and improving performance. The storage format in BigQuery is called Capacitor and with Columnar storage can achieve enormous data compression ratios depending on types of data stored.
Distributed computing: Using distributed computing architecture, queries are executed across a large multi-tenant cluster called Dremel. This allows BigQuery to scale out queries into slots (i.e. a virtual CPU) and handle large volumes of data.
Caching: Queried data is stored in temporary cached results table meaning it does not need to be re-queried each time it is accessed. Note the limitations of this feature here.
Pre-computed aggregates: Aggregates such as sums and averages can be pre-calculated and stored in memory and recalled from the cache, using BI Engine or Materialised Views.
Partitioning: Data can be partitioned into smaller chunks, meaning only required partitions are loaded into memory.
Query Optimisation: If the tables involved in the query are part of a star schema and have unenforced primary/foreign key constraints in place, BigQuery will use these hints to create a better query plan to improve query performance.
Nested and Repeated Columns: When using BigQuery struct and array structures, you have the opportunity to reduce the number of rows and physical data size in your models. This all helps with performance.

Be cautious when using nested and repeated columns, as this can reintroduce a layer of complexity that some end users and reporting tools may find difficult to understand.
Although it may seem a great way technically to save space and improve performance, say adding nested Order Line data to an Order fact table, is the space saving really worth it? Would a separated Order Line fact table be simpler to query and understand, and manage from an ETL point of view.
An example of the complexity of querying such structures can be found in one of my previous blogs here, and this blog doesn’t even cover how you would make updates to such structures!

Costs

Dimension Modeling Context:

I described how we will be storing a lot of data, often historical, over many years.

I described how we should choose to implement different layers of data quality and structures.

BigQuery Support:

Storing more data could potentially equate to processing and querying more data, and in a cloud based data platform, that could lead to increasing cost.

Luckily BigQuery can help mitigate against higher cost with these features . . .

Separation of Storage & Compute: This is an important concept where storage and compute are each scaled and subsequently costed independently.

In practise this means:
a) You can save money on storage costs, especially if you have a lot of data that is only accessed occasionally. This is improved by partitioning your data.
b) You can save money on compute costs, especially if you only need to run very occasional queries.

Storage Cost Tiers: Data Storage costs reduce as data within a table or table partition ages. If no table or partition is modified for 90 consecutive days it will transition from active to long-term storage and its storage costs drop by approximately 50%.
Storage Billing Models: Google has recently allowed customers to opt for billing based on compressed storage sizing, meaning the amount they pay is determined by the physical storage space taken up by their data. Note: By default and unless changed customers will pay for their data storage at its ‘logical’ size, i.e. for the `uncompressed` space.

Ash Broadley, a colleague here at CTS, has written an excellent demonstrative blog here, on how opting for compressed storage billing can significantly reduce your data storage costs for larger tables, especially.

Flexible Compute Pricing: BigQuery offers several pricing models based on On-Demand (priced by TB) or Capacity (priced on slots/hour and Edition). The right model can be selected based on your organisation’s specific workloads and budget.

In an earlier blog post here, Ash Broadley also covered the various compute pricing options on offer.

Security & PII

Dimension Modeling Context:

When detailing how utilising dimensional modeling will apply data governance, I described how it can help prevent unauthorised access to sensitive data when we have a clear definition of data ownership and access rights.

BigQuery Support:

BigQuery can help secure data and keep GDPR compliant through various functions such as:

Access to specific Tables / Views: This can be done directly for a user or group using IAM, or using Authorised Views for specific columns in a table and preventing access to underlying data.

These can be particularly useful when giving access to individual Fact or Dimension tables in a dimensional model.
Preventing some users from accessing an employee Dimension table may be required in some cases, for example.

Column Level Access: We can provide fine-grained access to sensitive columns using policy tags. This could include preventing access to the column completely, or masking sensitive data.

This is an excellent way to allow users access to tables without subsequently being able to view or share sensitive data.
Also by allowing access to masked sensitive columns, users can still perform distinct counts for summary reporting or negate the need to alter SQL queries.

Row Level Access: Policies can be added to a target table to filter rows that will be made available to a query by a particular user or group.

Row level access is a great way to restrict access to rows in a dimension or fact table based on a business function or geographical location, for instance.

Encryption: Encryption is applied by BigQuery on your data both at rest and in-transit, using keys managed by Google or by the customer.

Other BigQuery Functionality

Dimension Modeling Context:

I described how dimensional modeling tables are joined together and a star schema is formed.
I described how critics of dimensional modeling believe it is not flexible enough to store or model unstructured data sets.

BigQuery Support:

BigQuery has added functionality that further helps building a dimensional model such as:

Primary / Foreign Keys: Although currently unenforced (be careful!), these can be used by the BigQuery optimiser as a hint to optimise joins in SQL queries.

Hint: Place these on your star schema fact and dimension key columns.

JSON Column: Include semi-structured and unknown schema data in your models using the JSON column data type.

Now we can include a varying amount and variety of data within our fact and dimension tables.
Want to store products that have varying amounts and types of attributes? Use core attribute fields for known columns, and a JSON column to capture the variations.

External Tables: Query data sat in physical files in Cloud Storage locations using external tables.

An excellent way to deliver fast, or real-time data within your data model.
Give access to a governed external table via a view, allow users to query data that’s just dropped in, in real-time, and join onto batch produced conformed dimension tables.
Materialise physically in BigQuery later? Or not? Your choice. Data Lakehouse anyone?!

Closing Thoughts

Thanks very much for reading this two-part blog post.

BigQuery is a modern high performing data platform enabling you to deliver information to end users with little or no data modeling at all. But I do hope I’ve demonstrated how the use of dimensional modeling can add maturity and value to your data, maximising your well chosen investment in BigQuery.

Dimensional modeling requires significant initial analysis and effort, and it may not be appropriate for all data initiatives. However, in a broader data analytics strategy where high data quality and assurance are expected, it should certainly be given serious consideration.

While a dimensional model has its critics, I hope I’ve demonstrated that you can use it as much or as little as you want alongside BigQuery. This approach allows for some flexibility beyond Kimball’s original text!

It really is an excellent way to share an understanding of your organisation. Happy modeling!

Thanks

Finally, big thanks to my CTS colleagues Sourygna Luangsay, Jenn Calland, John Colman & Ash Broadley for their help in this publication.

About CTS

CTS is the largest dedicated Google Cloud practice in Europe and one of the world’s leading Google Cloud experts, winning 2020 Google Partner of the Year Awards for both Workspace and GCP.

We offer a unique full stack Google Cloud solution for businesses, encompassing cloud migration and infrastructure modernisation. Our data practice focuses on analysis and visualisation, providing industry specific solutions for Retail, Financial Services, Media and Entertainment.

We’re building talented teams ready to change the world using Google technologies. So if you’re passionate, curious and keen to get stuck in — take a look at our Careers Page and join us for the ride!

BigQuery + Dimensional Modeling: The Perfect Match — Part 2 of 2

Learn how BigQuery with Dimensional Modeling can help you better store, analyse, and understand your large datasets.

Continued from Part 1

How BigQuery supports Dimensional Modeling

Power, Performance & Storage

Dimension Modeling Context:

BigQuery Support:

Costs

Dimension Modeling Context:

BigQuery Support:

Security & PII

Dimension Modeling Context:

BigQuery Support:

Other BigQuery Functionality

Dimension Modeling Context:

BigQuery Support:

Closing Thoughts

Thanks

About CTS

Written by Lee Doolan