Data Warehouse Modelling Concept

William Ong
Published in Analytics Vidhya · 7 min read · Aug 14, 2021

TOC | Part I-II 👉

Data Engineering Concepts

Data engineering is the process of providing data in usable formats to the data analysts and data scientists who run queries to perform analytics and apply algorithms against the information for predictive analytics, machine learning, and data mining applications.

Background

Why do we need to load the data into a database? Why not store our data in files instead of a database and analyse it from the files?

Example of flat files (Excel)

To replicate the business process of an ecommerce site, we need a database that acts as a place to store data for the day-to-day transactional process, while a warehouse acts as a place to store a collection of historical data for analysing our ecommerce business.

Not only that, most of the time data is not in the best shape for analysis. Without a data engineering process, we would have to spend more time preparing data for analysis to solve complex business problems. Thus, this part (Part I) covers the process of making the data ready & available for analysis (Part II) and modelling (Part III).

In this part, we are going to discuss how to load datasets in CSV format into a database server (PostgreSQL) and design a data warehouse from them.

In short, we will discuss:

  1. Identify: find the relationships between the datasets
  2. Design: design the OLTP & OLAP schemas to load & analyze the data
  3. Implementation: load the data from CSV into the database and process it into the warehouse using an ETL process (next blog post!!)

ps: This documentation provides the process & results of the pipeline that I created & used for Part I. 💪

Table of Contents

  • Prerequisites
  • Architecture
  • Database
  • Data Warehouse
  • ETL Pipeline

Prerequisites

First of all, all data used for this project was given by our mentor (primary data). The dataset consists of 7 CSV files.

Dataset details

For the project environment, please use:

Architecture

What is the pipeline structure of this project?

Simplified pipeline

The picture above shows the outline of what we are going to do to design & model the data warehouse that we will use in Part II & Part III.

Database

Let’s talk about the database. The data structure used in the database is called the relational model. It represents data as tables and relations among those tables.

Every row in a table is called a record, and every column is called a field.

Illustration of database properties

The data type of a column defines what value the column can hold: integer, character, date and time, binary, and so on. Each column in a database table is required to have a name and a data type. As a basic review, here are the data types used in our project.

Datatypes used in project (SQL)
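As a quick illustration, here is a minimal sketch of how these types appear in a PostgreSQL table definition; the table and column names are hypothetical, not from the actual dataset:

```sql
-- Hypothetical table illustrating common SQL data types
-- (names are for illustration only, not from the project dataset)
CREATE TABLE example_types (
    id          INTEGER,           -- whole numbers
    name        VARCHAR(100),      -- variable-length character strings
    created_at  TIMESTAMP,         -- date and time values
    price       NUMERIC(10, 2),    -- exact decimals (e.g. money)
    is_active   BOOLEAN            -- true/false flags
);
```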

After loading the .csv dataset into the database, we need to identify the relationships between the data. The relationships between the tables can be seen below.

Relationship between entity

After we map the relationships, we can create the tables & load the data into the database. Here is the schema that we will use as our database schema.

Transactional Database Schema
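To make the idea concrete, here is a minimal sketch of two related OLTP tables with a primary key/foreign key relationship; the table and column names are assumptions for illustration and may differ from the actual project schema:

```sql
-- Sketch of two related OLTP tables (names are assumptions)
CREATE TABLE orders (
    order_id     VARCHAR(32) PRIMARY KEY,   -- unique identifier per order
    customer_id  VARCHAR(32) NOT NULL,
    order_date   TIMESTAMP   NOT NULL
);

CREATE TABLE order_items (
    order_id    VARCHAR(32) REFERENCES orders (order_id),  -- foreign key
    product_id  VARCHAR(32) NOT NULL,
    price       NUMERIC(10, 2),
    PRIMARY KEY (order_id, product_id)
);
```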

Data Warehouse

The data warehouse is a specific infrastructure element that provides down-the-line users, including data analysts and data scientists, access to data that has been shaped to conform to business rules and is stored in an easy-to-query format.

The data must conform to specific business rules that validate quality. Then it is stored in a denormalized structure — that means storing together pieces of information that will likely be queried together. This serves to increase performance by decreasing the complexity of queries required to get data out of the warehouse (i.e., by reducing the number of data joins).

In order to answer the business questions that we have (they will be explained in Part II), we need to create a data warehouse schema that can provide the information we will use in Part II & Part III. We will use a star schema for our warehouse. The granularity of our warehouse is one product per order item, which means that each row in Fact Order Item holds the details of one product from an order.

Before we start, I would like to give an overview of the concepts that we will use to model the data warehouse:

Star schema is a mature modeling approach for data warehouses that requires modelers to classify their model tables as either dimension or fact.

Dimension tables describe business entities — the things you model. Entities can include products, people, places, and concepts including time itself. The most consistent table you’ll find in a star schema is a date dimension table. A dimension table contains a key column (or columns) that acts as a unique identifier, and descriptive columns.

Fact tables store observations or events, and can be sales orders, stock balances, exchange rates, temperatures, etc. A fact table contains dimension key columns that relate to dimension tables, and numeric measure columns. The dimension key columns determine the dimensionality of a fact table, while the dimension key values determine the granularity of a fact table.

Star Schema Illustration
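In SQL terms, the pattern looks roughly like this: a minimal sketch, with assumed table and column names, where the fact table holds dimension key columns plus numeric measures:

```sql
-- Sketch of a star schema: a dimension and a fact table (names assumed)
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,   -- surrogate key
    product_id   VARCHAR(32),           -- natural key from the source
    category     VARCHAR(100)           -- descriptive attribute
);

CREATE TABLE fact_order_item (
    product_key    INTEGER REFERENCES dim_product (product_key),
    date_key       INTEGER,             -- would reference a date dimension
    price          NUMERIC(10, 2),      -- measure
    shipping_cost  NUMERIC(10, 2)       -- measure
);
```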

A slowly changing dimension (SCD) is one that appropriately manages change of dimension members over time. It applies when business entity values change over time, and in an ad hoc manner. A good example of a slowly changing dimension is a customer dimension, specifically its contact detail columns like email address and phone number. In contrast, some dimensions are considered to be rapidly changing when a dimension attribute changes often, like a stock’s market price. The common design approach in these instances is to store rapidly changing attribute values in a fact table measure.

Star schema design theory refers to two common SCD types: Type 1 and Type 2. A dimension-type table could be Type 1 or Type 2, or support both types simultaneously for different columns. Here, however, we only use Type 2 SCD (or no SCD at all, as for the time dimension).

Type 2 SCD

A Type 2 SCD supports versioning of dimension members. If the source system doesn’t store versions, then it’s usually the data warehouse load process that detects changes, and appropriately manages the change in a dimension table. In this case, the dimension table must use a surrogate key to provide a unique reference to a version of the dimension member. It also includes columns that define the date range validity of the version (for example, StartDate and EndDate) and possibly a flag column (for example, IsCurrent) to easily filter by current dimension members.
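A minimal sketch of how a load process could manage a Type 2 change in SQL, assuming a dim_customer dimension and a staging_customer staging table (hypothetical names); the surrogate key is assumed to be auto-generated, so it is omitted from the INSERT:

```sql
-- Step 1: expire the current version when a tracked attribute changed
UPDATE dim_customer d
SET    end_date   = CURRENT_DATE,
       is_current = FALSE
FROM   staging_customer s
WHERE  d.customer_id = s.customer_id
  AND  d.is_current
  AND  d.email <> s.email;

-- Step 2: insert a new open-ended version for customers
-- that no longer have a current row (new or just expired)
INSERT INTO dim_customer (customer_id, email, start_date, end_date, is_current)
SELECT s.customer_id, s.email, CURRENT_DATE, NULL, TRUE
FROM   staging_customer s
LEFT JOIN dim_customer d
       ON d.customer_id = s.customer_id AND d.is_current
WHERE  d.customer_id IS NULL;
```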

So, to create the data warehouse with a star schema, we need dimension and fact tables.

Dimension Table Details

In star schema design, a measure is a fact table column that stores values to be summarized. Here, our fact table stores 5 different measures for each product in an order (see the query sketch after this list):

  • Price
  • Shipping cost
  • Payment value
  • User lifetime spending
  • User lifetime order
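
As a sketch of how such measures are summarized in practice (table and column names are assumptions carried over from the star schema sketch above):

```sql
-- Summing fact measures grouped by a dimension attribute
SELECT d.category,
       SUM(f.price)         AS total_price,
       SUM(f.shipping_cost) AS total_shipping
FROM   fact_order_item f
JOIN   dim_product d ON d.product_key = f.product_key
GROUP  BY d.category;
```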

To connect data between fact table and dimension table, we need a surrogate key. A surrogate key is a unique identifier that you add to a table to support star schema modeling. By definition, it’s not defined or stored in the source data. Commonly, surrogate keys are added to relational data warehouse dimension tables to provide a unique identifier for each dimension table row.

Surrogate Key for each Dimension Table
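In PostgreSQL, a surrogate key is commonly implemented as an auto-generated identity column. A minimal sketch, with assumed names:

```sql
-- Dimension table with an auto-generated surrogate key (names assumed)
CREATE TABLE dim_customer (
    customer_key  INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate key
    customer_id   VARCHAR(32),    -- natural key from the source system
    email         VARCHAR(255),
    start_date    DATE,           -- Type 2 SCD validity range
    end_date      DATE,
    is_current    BOOLEAN         -- flag for the current version
);
```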

After we have prepared all of the dimension and fact tables, here is the end result (schema) of our data warehouse modelling.

Data Warehouse Schema (Simplified)

ETL Pipeline

To load data from our OLTP (database) into our OLAP (data warehouse), we need an ETL pipeline. To do that, we need to create an ETL process for each dimension and fact table.

ETL Pipeline Sketch

The image above lays out the processing structure of our ETL pipeline. We use PostgreSQL for data storage, and our transformation logic to move data from A (OLTP) to B (OLAP) is carried out using SQL and Python (no 3rd-party tools).
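For a flavor of what one such step can look like, here is a minimal sketch of loading a dimension with plain SQL; the schema and column names (oltp, warehouse, product_category_name) are assumptions for illustration:

```sql
-- Insert products that are not yet in the dimension table
INSERT INTO warehouse.dim_product (product_id, category)
SELECT p.product_id, p.product_category_name
FROM   oltp.products p
WHERE  NOT EXISTS (
    SELECT 1
    FROM   warehouse.dim_product d
    WHERE  d.product_id = p.product_id
);
```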

For the data sources, I mainly use 2 types:

That’s the end of the concepts that we use when implementing the data engineering part. For the implementation itself, please kindly check Part I-II: Data Warehouse Modelling Implementation. Stay safe and stay grinding 🔥!!
