Data Warehouse Modelling Implementation

William Ong
5 min read · Aug 14, 2021


👈 Part I-I | TOC | Part II-I 👉

ETL, simplified

In this post, I will explain the implementation of my pipeline / architecture design for modelling the data warehouse. You can refer to my previous post for more details about the concept and design behind this data warehouse implementation. Also, please keep in mind that this blog post mainly explains how the data is extracted from the OLTP into the OLAP (the ETL process, and how to do it with SQL only), so trivial things such as creating the database / data warehouse will not be covered.

Table of Contents

  • Setup
  • Import Data
  • Data Cleansing
  • Dimension Table Insertion
  • Fact Table Insertion
  • Quality Check

Setup / Installing prerequisites

Import Data

To import data from .csv files into PostgreSQL, we simply use PostgreSQL's COPY command. The code below shows how we import the order dataset into the database.

Example of copy command usage
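The embedded snippet is not rendered here, so here is a rough sketch of what such a COPY statement could look like. The table name, column list, and file path are placeholders for illustration, not necessarily the project's actual schema.

```sql
-- Sketch: bulk-load a CSV file into an OLTP table with PostgreSQL's COPY.
-- Table name, columns, and file path are assumptions for illustration.
COPY orders (order_id, customer_id, order_status, order_purchase_timestamp)
FROM '/path/to/orders.csv'
WITH (FORMAT csv, HEADER true, DELIMITER ',');
-- Note: COPY reads from the server's filesystem; psql's \copy is the
-- client-side equivalent if the file lives on your local machine.
```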

For the complete implementation of how to import all the data (7 datasets) into our OLTP, please refer to this link.

Data Cleansing

Before we insert data into a dimension table or fact table, we need to check whether the data is clean enough to be used in the next part. From my analysis, the user data in our OLTP is somewhat misleading: there are duplicate rows (not only on the primary key, but across the whole row). This is understandable, because the user table and the order table have the same number of rows (99,441). The duplication suggests that the user table in the database is used to keep a track record of user info for each order.

However, what we need is just a single record per user, so data cleaning is needed to remove the unnecessary rows. To do this, I perform ranking in SQL followed by a distinct, so we only keep one instance of data per user.

Cleaning user data scripts
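The actual cleaning script lives in the linked repository; below is a simplified sketch of the ranking-plus-distinct idea. Table and column names (users, customer_id, customer_city, customer_state) are assumptions for illustration.

```sql
-- Sketch: keep exactly one row per user.
-- Table/column names are assumed, not the project's real schema.
WITH ranked AS (
    SELECT u.*,
           ROW_NUMBER() OVER (
               PARTITION BY u.customer_id
               ORDER BY u.customer_id
           ) AS rn
    FROM (SELECT DISTINCT * FROM users) AS u   -- DISTINCT drops full-row duplicates
)
SELECT customer_id, customer_city, customer_state
FROM ranked
WHERE rn = 1;   -- the ranking keeps a single instance per user
```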

Dimension Table Insertion

The ETL pipeline for inserting data from the database into a dimension table in the data warehouse can be summarized with the figure below.

Flowchart for inserting new/updated data into dimension table

From the figure above, we can see that we first try to insert all the new data into the dimension table. Then we find the newly changed data (updated data) and insert it as new rows in the dimension table. In the end, we update the flag on the second-latest version so that only the newest row is recognized as the latest version under SCD Type II.

Inserting new data pipeline into dimension table (Product)
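As a rough sketch of that first step (new-row insertion), the query below inserts products that are not yet present in the dimension. The dim_product / products names, columns, and is_current_version flag are assumed for illustration.

```sql
-- Sketch: insert products that do not yet exist in the dimension table.
-- Table and column names are assumptions for illustration.
INSERT INTO dim_product (product_id, product_category, product_weight_g, is_current_version)
SELECT p.product_id,
       p.product_category,
       p.product_weight_g,
       TRUE                        -- mark the new row as the current version (SCD II)
FROM products p
LEFT JOIN dim_product d
       ON d.product_id = p.product_id
WHERE d.product_id IS NULL;        -- only products not already in the dimension
```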

The process for updating existing data in the dimension table is called an upsert (update + insert). While there are many implementations available, the one below is the implementation I use for the current project.

Upsert updated data pipeline into dimension table (Product)
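A condensed sketch of that upsert step is shown below, split into "insert the new version" and "retire the old version". All names, as well as the serial surrogate key product_key, are assumptions for illustration rather than the project's exact script.

```sql
-- Sketch: SCD II "upsert" for changed products (names and keys are assumed).
-- 1) Insert a new row for every product whose attributes have changed.
INSERT INTO dim_product (product_id, product_category, product_weight_g, is_current_version)
SELECT p.product_id, p.product_category, p.product_weight_g, TRUE
FROM products p
JOIN dim_product d
  ON d.product_id = p.product_id
 AND d.is_current_version = TRUE
WHERE (p.product_category, p.product_weight_g)
      IS DISTINCT FROM (d.product_category, d.product_weight_g);

-- 2) Retire the previous version: only the newest row keeps the current flag.
UPDATE dim_product d
SET is_current_version = FALSE
WHERE d.is_current_version = TRUE
  AND EXISTS (
        SELECT 1
        FROM dim_product newer
        WHERE newer.product_id = d.product_id
          AND newer.is_current_version = TRUE
          AND newer.product_key > d.product_key   -- assumed serial surrogate key
      );
```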

Fact Table Insertion

To insert data from the main table in the database into the fact table, we need to make sure that all dimension tables have been updated and that we have a defined interval. As an analogy, let's say we run the ETL pipeline every day. The interval for the pipeline would then be from yesterday until today, and that is the data that would be processed and inserted into the fact table as new historical records. For simplicity in this project, however, we insert all transactions (all orders) into the fact table.

The pipeline for inserting data into fact table
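As a rough illustration of that load, the query below resolves surrogate keys from the current dimension rows and inserts them together with the measures. Every table and column name here (fact_order, dim_product, dim_customer, dim_date, order_items, and so on) is an assumption for the sketch.

```sql
-- Sketch: load the fact table by resolving surrogate keys from the dimensions.
-- All table/column names are assumptions for illustration.
INSERT INTO fact_order (order_id, product_key, customer_key, date_key, price, freight_value)
SELECT o.order_id,
       dp.product_key,
       dc.customer_key,
       dd.date_key,
       oi.price,
       oi.freight_value
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
JOIN dim_product dp ON dp.product_id  = oi.product_id  AND dp.is_current_version = TRUE
JOIN dim_customer dc ON dc.customer_id = o.customer_id AND dc.is_current_version = TRUE
JOIN dim_date dd     ON dd.date_actual = o.order_purchase_timestamp::date;
-- For an incremental run, a WHERE clause on the purchase timestamp would limit
-- the load to yesterday..today; here the whole history is loaded, as described above.
```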

Quality Check

As we know, ETL workflows involve all kinds of complex calculations and transformations on the data, based on client needs. In order to maintain the quality of the data coming out of the ETL pipeline, we need to thoroughly check the results.

To do this, I mainly perform quality checks around these four aspects:

  • Data Completeness
  • Data Cleaning
  • Data Transformation Check
  • Data Integrity

Data Completeness

Probably the most obvious test is to check whether all records have been loaded into the data warehouse. For this ETL pipeline, it can be as easy as comparing the record counts of the source and the target table (a sketch follows the list below).

1. Compare record counts between source and target to check for any rejected records.
2. Check for truncated data, to make sure data has not been truncated in the columns of the target table.
3. Check that unique values are loaded into the target; no duplicate records should exist.
4. Check boundary values (e.g. only data later than 12-31-2018 has to be loaded into the target).
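A minimal sketch of the record-count comparison, assuming the orders source table and the fact_order target (both names are placeholders):

```sql
-- Sketch: compare row counts between the OLTP source and the warehouse target.
-- Table names are assumptions for illustration.
SELECT
    (SELECT COUNT(*) FROM orders)     AS source_count,
    (SELECT COUNT(*) FROM fact_order) AS target_count,
    (SELECT COUNT(*) FROM orders) - (SELECT COUNT(*) FROM fact_order) AS difference;
```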

Data Cleaning

Unnecessary columns and duplicate rows should be removed during the pre-staging process of an ETL pipeline. In this case, we clean the user dataset (explained in the previous steps). To check this process, I apply a simple rule: check the uniqueness of every user in the dimension table.
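A sketch of that uniqueness check, assuming a dim_customer table with a customer_id column and an is_current_version flag (all assumed names):

```sql
-- Sketch: verify that each user appears only once among current dimension rows.
-- Table and column names are assumptions for illustration.
SELECT customer_id, COUNT(*) AS occurrences
FROM dim_customer
WHERE is_current_version = TRUE
GROUP BY customer_id
HAVING COUNT(*) > 1;     -- an empty result means the cleaning step worked
```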

Data Transformation Check

During the transformation process in an ETL, column values are changed and newly calculated columns are created. These transformations need to be validated, and they should all be documented in some way, as this can easily be forgotten. It's obvious that good testing goes hand in hand with good documentation.

1. Date check: dates should follow the date format, and it should be the same across all records (standard format: yyyy-mm-dd, etc.).
2. Null check: some columns should contain NULL based on business requirements. Example: the delivery date column should be NULL unless and until the order status is "Delivered".
3. Flag check: some columns hold flag values, such as the dimension tables' is_current_version flag for SCD II. Make sure that before and after the ETL run the flags are still correct (the null and flag checks are sketched below).
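Two sketches of the checks above, with assumed table and column names (orders, order_delivered_date, dim_product, is_current_version):

```sql
-- Null check sketch: delivery date must be NULL unless the order is delivered.
-- Table/column names are assumptions for illustration.
SELECT COUNT(*) AS bad_delivery_dates
FROM orders
WHERE order_status <> 'delivered'
  AND order_delivered_date IS NOT NULL;

-- Flag check sketch: every product must have exactly one current version after the run.
SELECT product_id
FROM dim_product
GROUP BY product_id
HAVING COUNT(*) FILTER (WHERE is_current_version) <> 1;
```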

Data Integrity

Many tables in the DWH, especially fact tables, contain a lot of foreign keys to related tables, and referential integrity in a data warehouse should not be enforced by foreign key constraints. To ensure the quality of the warehouse, this aspect therefore needs to be checked explicitly.

  • Referential integrity checks: make sure the relationships between parent and child tables are appropriate as per the logical models (a sketch follows this list).
  • Dimension table checks: all dimension tables have unique, numeric primary and foreign keys.
  • DDL check: a check highly recommended by data architects, to make sure the table definitions of all tables follow the physical data models defined for the system.
  • ETL load audit checks: are all audit columns available in every table?
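A sketch of the referential-integrity check without relying on FK constraints; fact_order, dim_product, and product_key are assumed names:

```sql
-- Sketch: every product_key in the fact table must exist in dim_product.
-- Names are assumptions for illustration.
SELECT f.product_key
FROM fact_order f
LEFT JOIN dim_product d ON d.product_key = f.product_key
WHERE d.product_key IS NULL;    -- any rows returned here are orphaned facts
```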

Please keep in mind that I perform the quality checks manually, without any framework (i.e., by running SQL queries by hand). For bigger projects, I would suggest using a more sophisticated ETL quality-checking framework such as Informatica PowerCenter, or, for open-source options, you can refer to DbFit (http://dbfit.github.io/dbfit) and AnyDbTest (https://anydbtest.codeplex.com/).

As we reach the end of Part I, I would like to share the references that helped me get through this part.

Thank You 🙏!!
