Part 2 — Modernising a Data Platform & BigQuery Concepts

Published in

Data Knowledge Hub

7 min readJul 11, 2020

In Part 1, we had a brief introduction to the concept of modernisation and key concepts of a traditional DWH.

In this article, we will talk about design concepts of a data warehouse.

The storage and compute form the core of a DWH design and for a better performance and optimised usage of resources, the DWH schema design plays a major role.

In case you are wondering, why are we even reading about the design and processing concepts, my answer to that is — if we are to talk about what makes a DWH modern, we need to understand what it is comprised of.

Quick recap of Modernisation definition — Adopting to the recent/modern practices, without deviating from the core concepts or processes.

There are quite a few renowned methodologies that describes how to design a DWH.

This is like teaching my son, how to write alphabets. There are quite a few techniques to teach kids, how to write. In case you are actually curious, please read this link (https://www.momjunction.com/articles/fun-ways-to-teach-your-toddler-to-write-better_0082821/). All the techniques are effective, but we choose one based on the kid’s interpretational abilities.

Similarly, all design techniques are effective, but we choose something that is more aligned with our requirements. Just to name a few, below are the commonly adopted DWH design methodologies:

Meyer and Canon methodology

Anahory and Murray methodology

Kimball methodology — A bottom-up approach for design, development and maintenance of DWH

Most commonly adopted methodology is Kimball and it’s also referred as KRRT98.

Let’s get into the shoes of a DWH designer. It’s fun!

Let us assume, we are building a library, where we intend to collect books that are scattered across different rooms and bring the ones that are in a good shape and form into one shelf that aims to provide easier and one stop access to the readers. Before determining the size of the shelf, we need to size and list of the books that are needed to be brought into the shelf.

So, how do we solve this puzzle? We can start with understanding about the end users of the shelf?

We know that the shelf is being designed for ease of access to the incoming readers. Let’s start with determining the types of readers who visit the library and particularly could visit that centralised shelf?

1. Readers who pick the books on rent

2. Readers who purchase the books

3. Readers who read in the library and pay per hour etc.,

Now that we know, who our visitors are, we can think about the categorisation of the books that we have! Like..

1. History

2. Adventure

3. Science

4. Fiction etc.,

Checklist :
User categories — Done
Book categories — Done

Finally, we can think of an order in which the books can be placed in the shelf and to determine that, we will need a few key metrics that can help us in making that decision. Few I can think of are mentioned below:

1. Country of origin

2. Time period in which the books were authored

3. Price range etc.,

With User categories, Book categories and the metrics determined, I think that our shelf structure is ready. This is what we can relate to a logical design or in DWH terminology, Data warehouse bus architecture.

Since our current article is written to serve as a quick read, I intend to discuss just whatever is essential for the series and walk you all through few imp definitions.

Note: We are not going to talk about ELT processes and tools in this series as the intent is to talk about modernisation and not the core DWH. Please let me know in case you would like to see specific info on ETL/ELT techniques etc., I can discuss those in a different article. However, as a part of modernisation series, I would be writing about different data migration architectures which will focus on some advanced ETL techniques pertaining to a modern data platform.

What is De-Normalisation — Its broad, but I will try and keep it simple.

Anyone who understands relational databases, also understand the JOINS and the complexity they add while extracting data from multiple tables. They impact the performance and are definitely not a friendly feature to have in the context of analytical processing. To enable “fast analytical processing” although its redundant, platform designers implement the concept of de-normalisation. It defies the rules of 1NF and 2NF.

Facts — Measurement of a metric.

Ex: Average monthly revenue on rentals on Adventure category of the books in my shelf.

Dimensions — descriptive characteristics of facts Ex: History, Adventure

Applying the book-shelf analogy to an e-commerce business, just as we questioned ourselves to understand what is required, a data platform designer talks to business analysts/Domain experts to understand what those questions are in a business context as they are the frequent consumers of the data. Easily, they will have at least 10–15 questions. Multiple interviews with Business analysts across different areas is just enough for us to get started.

Remember, data platform design maturity is an iterative process and does not happen in one go.

Below are the steps that a traditional DWH designer would follow. They are mostly applicable while modernising an existing Data platform as well, so we are still on track and haven’t deviated from the “soul of this article”.

Determine the requirements:

Why do I need my DWH to be built for? | What value does it bring in? | What is going to be my technical debt?

Collect the questions/requirements from different stakeholders of the data

Categorise the questions into different business processes.

Determine the grain of the fact

The granular most level I want the data to be available in à This varies for each business process and hence it is important to consider all of them before finalising the design.

Determine the dimension and the attributes of a dimension.

Determine attributes for the fact table.

Few examples of subject areas/Business Processes in an e-commerce business are given below. This is the outcome of my Step 3 mentioned above

· Sales

· Customer Service/Call centre data

· Inventory

· Advertising

· Finance & Payroll etc.,

Data warehouse bus architecture from a white paper released by Drexel University

For ex: Sales & Marketing teams are one of the primary stakeholders of the data. The data across different dimensions are used by them for customer segmentation (which determines who my best customer is) and to create customer persona (Target audience categories)

They use customer transactional patterns, demographic information, website navigation patterns and many other aspects to derive the marketing strategy.

A quick recap before we move further:

Step 1, we define the business processes/categories based on the interviews or questions — Every business process defines a row in the Data warehouse matrix as shown in the above figure

Step 2, We determine the grain which tells us how granular or atomic level our data should be. This is essential even before we get to design the facts and dimensions. Grain varies for every business process/category.

For ex: My Analytical processing query is, “How many uses visit my website?” — The data warehouse should have an entry of every user visit in order to answer the question.

Step 3, we identify the performance evaluation metrics from each business process or category. Each metric that a process generates defines a fact. These fact tables determine the performance of the business process.

Step 4, We determine the dimensions — Although the dimensions are determined upfront based on the analytical queries, a designer normally gets back to those queries to determine the attributes for each of those dimensions. In short, these dimensions tell us, ‘Who, what, when, where, why and how? About the facts.

Since Kimball’s model is all about dimensional modelling, refer https://www.kimballgroup.com/2009/05/the-10-essential-rules-of-dimensional-modeling/ for more details on essential rules of dimensional modelling.

Below image gives a quick representation of a sample e-commerce DWH schema. In the context of modernisation, most of these might have already been addressed as a part of designing the existing enterprise DWH of your organisations. If you are a student who is new to DWH and have come until this point, I would recommend you go through the link I shared above for details of DWH design. It is V.A.S.T!

In case you are still wondering, whether we are within the context of the article, please see the below service level comparison of traditional DWH vs a Modern DWH (BigQuery). We have had a quick overview of all you see in the left side column of the below table and I will leave it to you to find out what each of it mean in the context of a modernised data warehouse.

In Part 3, we will discuss about what are the characteristics of a modernised data platform.

Part 2 — Modernising a Data Platform & BigQuery Concepts

Written by Nikhil (Srikrishna) Challa