Build Data Warehouse using Talend (Study Case: Ecommerce Dataset, Schema: Star Schema)

Katarina Nimas Kusumawati
Aug 21, 2021


A data warehouse is a system for storing and analyzing an organization's historical data.

A star schema is a data warehouse design in which one fact table is linked to many dimension tables, so the shape resembles a star.

The data warehouse has the following characteristics:

  • Subject-Oriented: a data warehouse is created to meet the needs of a subject such as marketing, sales, or finance. It is used for decision support systems; tailoring the warehouse to those needs supports decision making.
  • Integrated: a data warehouse is built by integrating various data sources, which helps to perform data analysis effectively.
  • Time-Variant: every record in the data warehouse is tied to a point in time, so it carries information about the day, week, month, and so on. Once data has been entered into the data warehouse, it cannot be changed.
  • Non-volatile: data already in the warehouse is not deleted when new data is entered. Data is read-only. This helps in analyzing historical data and understanding what happened and when.

The dataset used is an e-commerce dataset with the following OLTP design:

OLTP built with PostgreSQL

There are several relationships in the OLTP above. There are one-to-many relationships, such as seller to order_item_dataset, where one seller can sell many items. There is a many-to-many relationship between feedback_dataset and order_dataset: one feedback applies to many orders, and one order can have multiple feedbacks. The (*) sign marks the "many" side of a relationship.

Things that need to be done in transforming data

Only for DimFeedback (Dimension Feedback)

Remove Invalid Value (Many-to-many to One-to-many)

In the dataset that I got, there is a value that doesn't fit. In OLTP, feedback_dataset is linked to order_dataset by a composite key, a combination of order_id and feedback_id. To link DimFeedback to FactOrder later, DimFeedback must have a single surrogate key, not a composite key. Looking at the data, one feedback_id can be used for many orders, and one order can give feedback many times (many-to-many). The valid interpretation here is that one order can only provide one feedback, so we keep the feedback with the most recent feedback_answer_date. The removal uses the ranking approach from SCD (Slowly Changing Dimension) and is performed in SQL.

create table staging.feedback_dataset_unique as
SELECT *
FROM (SELECT DISTINCT *,
             RANK() OVER (PARTITION BY order_id
                          ORDER BY feedback_answer_date DESC) AS ranking
      FROM feedback_dataset) t
WHERE t.ranking = 1
ORDER BY feedback_answer_date;

Choose only those that have rank 1 (the most recent feedback).

Before Removed
After Removed
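The same keep-only-rank-1 idea can be sketched outside SQL as well. Below is a minimal Python illustration of the dedup logic, using hypothetical sample rows (the order_id/feedback_id values are made up for the example):

```python
# Illustration only: keep the most recent feedback per order,
# mirroring the RANK() ... WHERE ranking = 1 query above.
from datetime import date

feedback = [
    {"order_id": "o1", "feedback_id": "f1", "feedback_answer_date": date(2017, 1, 5)},
    {"order_id": "o1", "feedback_id": "f2", "feedback_answer_date": date(2017, 3, 9)},
    {"order_id": "o2", "feedback_id": "f3", "feedback_answer_date": date(2017, 2, 1)},
]

latest = {}
for row in feedback:
    best = latest.get(row["order_id"])
    # keep only the row with the most recent feedback_answer_date (rank 1)
    if best is None or row["feedback_answer_date"] > best["feedback_answer_date"]:
        latest[row["order_id"]] = row

deduped = list(latest.values())
```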

• Drop order_id

order_id needs to be dropped because later we will rely only on feedback_id as the primary key.

Before Drop the order_id
After Drop the order_id

• Remove Duplicate Value

Duplicate rows need to be removed because order_id was dropped in the previous step. This can be done using tUniqRow.

After Remove Duplicate Value

Only for DimCustomer (Dimension Customer)

Remove Invalid Value (Tables cannot be joined because the primary key is not unique)

In user_dataset, one user_name can have different customer information. This most likely happens when a customer changes his/her personal information. The problem is that this dataset does not record when someone first or last changed their personal information, so it is difficult to find the most up-to-date data; in addition, when this dataset was obtained, the tables were not connected to each other. Here it is assumed that one user can place many orders and that user_name in user_dataset is unique, so it can be joined with order_dataset. Finally, the same ranking step is performed, sorted by customer_zip_code, customer_state, and customer_city in descending order, and only the rows with rank 1 are kept.

create table user_dataset_fix as
SELECT *
FROM (SELECT DISTINCT *,
             RANK() OVER (PARTITION BY user_name
                          ORDER BY customer_zip_code DESC,
                                   customer_state DESC,
                                   customer_city DESC) AS ranking
      FROM user_dataset_ori_csv) t
WHERE t.ranking = 1
ORDER BY customer_zip_code;

Only for DimDate (Dimension Date)

Generate Date

The first step is to add a context to the created job. Here I named it myStartDate. Then set the data type to Date and select the starting date of the date dimension. I use January 1, 2016, because the earliest date in my data is in 2016. When setting the date, avoid typing it manually as much as possible because errors can occur. In the Default value column there is a date field; on its right there is a button (not a checkbox) that opens a calendar, where the earliest date can be selected.

Generate Date using Talend
Select Date

In tRowGenerator, add one column to the schema and name it date. Set the data type to Date and make sure the function is

TalendDate.addDate(Date,int,Date)

In the Function parameters there are several parameters. First on the date parameter use

context.myStartDate

This means we use the value of myStartDate as the starting point to generate dates. Then fill the nb parameter with

Numeric.sequence("s1", 1, 1) - 1

Because we still want to include January 1, 2016 in our date dimension table. Without the -1 at the end, the first date that appears is January 2, 2016. The last one is the dateType parameter, whose value is "dd". This means the date is iterated by day, so the dates will be 2016-01-01, 2016-01-02, 2016-01-03, and so on. If, for example, it were iterated by month or "MM", it would be 2016-01-01, 2016-02-01, 2016-03-01, and so on. After that click OK, then right-click and connect date to tMap.
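As a rough sketch of what this tRowGenerator setup produces, here is the same logic in Python (the start date and row count are assumptions for the example; Numeric.sequence in Talend yields 1, 2, 3, ...):

```python
# Sketch: start from myStartDate and add (sequence - 1) days,
# so the first generated row is the start date itself.
from datetime import date, timedelta

my_start_date = date(2016, 1, 1)  # corresponds to context.myStartDate

def generate_dates(n_rows):
    # Subtracting 1 from the sequence makes the offsets 0, 1, 2, ...
    # so January 1, 2016 is included in the output.
    return [my_start_date + timedelta(days=seq - 1) for seq in range(1, n_rows + 1)]

dates = generate_dates(3)
```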

Select Functions

Make Date Column

  • The first column is date_id. date_id is the primary key of this table because each date must be unique. The data type is Date. We need it so that the table can be joined with our fact table, which contains a date element.
  • Next is the DayNumberOfWeek column, i.e. which week of the month the date falls in. For example, January 1, 2016 is in the 1st week of January and January 31, 2016 is in the 5th week of January. Likewise, February 1, 2016 is in the 1st week of February. The code for that column is
Integer.parseInt(TalendDate.formatDate("F", row1.date))
  • The third column is DayNumberOfMonth, or the day number within the month. For example, for January 1, 2016 the day number of month is 1, and for February 26, 2016 it is 26. The code for that column is
Integer.parseInt(TalendDate.formatDate("d", row1.date))
  • The fourth column is DayNumberOfYear, or the day number within the year. For example, for January 1, 2017 the day number of year is 1, and for December 31, 2017 it is 365. The code for that column is
Integer.parseInt(TalendDate.formatDate("D", row1.date))
  • The fifth column is WeekNumberOfYear, or the week number within the year. For example, for January 1, 2017 the week number of year is 1, and for December 30, 2017 it is 52. The code for that column is
Integer.parseInt(TalendDate.formatDate("w", row1.date))
  • The sixth column is MonthYear, which contains the month name with a string data type, such as January, February, March, and so on. The code for that column is
TalendDate.formatDate("MMMM", row1.date)
  • The seventh column is MonthNumberOfYear, which contains the month of the year with an integer data type: January is 1 and December is 12. The code for that column is
Integer.parseInt(TalendDate.formatDate("M", row1.date))
  • The eighth column is CalendarQuarter, which contains the quarter of the year. The first quarter is January-March, so in those months the CalendarQuarter column has the value 1. Likewise for the following quarters: April-June is the second quarter, so the value is 2, and so on. The division can produce a fraction, but because the column is an integer the result is truncated: for February, (2 + 3 - 1) / 3 = 1.33, which becomes 1. The code for this column is
(Integer.parseInt(TalendDate.formatDate("M", row1.date)) + 3 - 1) / 3
  • The ninth column is CalendarYear, which means the year. So for January 1, 2016 the CalendarYear is 2016. The code for this column is
Integer.parseInt(TalendDate.formatDate("yyyy", row1.date))
  • The tenth column is CalendarSemester, which contains the semester of the year. One year has 2 semesters: semester 1 is January-June and semester 2 is July-December. The code for this column is
(Integer.parseInt(TalendDate.formatDate("M", row1.date)) + 6 - 1) / 6
  • The last column is DayName, or the name of the day. This column has a string data type and contains the day names from Monday to Sunday. The code for this column is
TalendDate.formatDate("EEEE", row1.date)
Date Table
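To see how these derived columns behave, here is a minimal Python sketch of a few of them. The strftime patterns are Python's equivalents, not Talend's, and the quarter/semester formulas use integer division just like the Talend expressions above:

```python
# Sketch of some derived date columns; // truncates the fraction,
# matching Talend's integer division in the quarter/semester formulas.
from datetime import date

def date_columns(d):
    month = d.month
    return {
        "DayNumberOfMonth": d.day,
        "DayNumberOfYear": d.timetuple().tm_yday,
        "MonthNumberOfYear": month,
        "CalendarQuarter": (month + 3 - 1) // 3,   # e.g. May: 7 // 3 = 2
        "CalendarYear": d.year,
        "CalendarSemester": (month + 6 - 1) // 6,  # Jan-Jun -> 1, Jul-Dec -> 2
        "DayName": d.strftime("%A"),
    }

cols = date_columns(date(2016, 5, 15))
```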

Surrogate Key for DimDate

For fast processing, a surrogate key usually has an integer data type. Because our primary key still has a Date data type, we need the help of a surrogate key. A best practice for a date surrogate key is to use the date itself as an integer. For example, for the date 2016-01-01 the surrogate key is 20160101. The code for the date dimension surrogate key is

Integer.parseInt(TalendDate.formatDate("yyyyMMdd", row1.date))
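The same idea expressed in Python for illustration: format the date as yyyyMMdd, then parse it as an integer.

```python
# Sketch: build the date surrogate key from the date itself.
from datetime import date

def date_surrogate_key(d):
    return int(d.strftime("%Y%m%d"))  # 2016-01-01 -> 20160101

sk = date_surrogate_key(date(2016, 1, 1))
```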

Only for DimTime (Dimension Time)

Generate Time

Set up the tRowGenerator. Click on tRowGenerator, then add a column to the schema. Select the Date data type: in Talend there is no separate time data type, but the Date data type can hold a time as well. Here the function is

TalendDate.addDate(Date,int,Date)

Then move on to the Function parameters field. There, for the date parameter, fill in the value

context.myStartTime 

This means the time starts according to what is in the context. Then in the nb parameter, we fill the value with

Numeric.sequence("s1", 1, 1) - 1

This means that our time values repeat until they stop at the value we want. What does the -1 at the end mean? It means 00:00:00 will be included in our table; without -1, it would start from 00:00:01. Next, fill in the dateType parameter with "ss". This means the iteration is based on seconds.

Generate Time

• Make Time Column

  • Make the time_id column using row1.time with the Date data type and the format "HH:mm:ss"
  • Make the hour column. The code is
Integer.parseInt(TalendDate.formatDate("H", row1.time))

This code could be written directly in the expression in gen_time, but because I want to separate time into MORNING, AFTERNOON, and EVENING, I add a table in the middle whose (default) name is Var. In the Var table I wrote the expression above and named the variable hour. Then I drag Var.hour to the hour column in gen_time.

  • Minute:
Integer.parseInt(TalendDate.formatDate("m", row1.time))
  • Second:
Integer.parseInt(TalendDate.formatDate("s", row1.time))
  • Meridiem (AM/PM):
TalendDate.formatDate("a", row1.time)
  • In the last column I need to aggregate the time into morning, afternoon, and evening, so I create the code
Var.hour < 12 ? "MORNING" : Var.hour < 18 ? "AFTERNOON" : "EVENING"
Time Table
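The bucketing rule in that last expression can be sketched in Python like this (hours before 12 are MORNING, before 18 AFTERNOON, otherwise EVENING):

```python
# Sketch of the time-of-day bucketing used in the last DimTime column.
def day_part(hour):
    return "MORNING" if hour < 12 else "AFTERNOON" if hour < 18 else "EVENING"

parts = [day_part(h) for h in (0, 11, 12, 17, 18, 23)]
```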

Only for payment_dataset

  • Join Table with Different Aggregation

payment_dataset is not included as a dimension but will be joined with the fact. The problem is that its granularity is not the same as the fact's: the fact granularity is per item, while payment_dataset is per order. Consider the following example.

Example from payment_dataset

And it will be expected to be like the figure below

payment_dataset after aggregated

Because the credit card is used once, count_credit_card is 1, and its total of 100,000 goes into total_credit_card. Likewise with the debit card.

Example code in tMap. In count credit card:

row8.payment_type.equals("credit_card") ? 1 : 0

In total credit_card:

row8.payment_type.equals("credit_card")?row8.payment_value:0

After this, payment_dataset is unique by order_id. The summing can be done using tAggregateRow. Here is a clue for aggregating the data.

payment_dataset Aggregation
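The tMap + tAggregateRow pivot can be sketched as follows: per payment type, emit a count flag and a value, then sum them per order_id. This is an illustration only, with hypothetical sample rows:

```python
# Sketch of the pivot: conditional count/value per payment type,
# then summed per order_id (what tAggregateRow does in the job).
payments = [
    {"order_id": "o1", "payment_type": "credit_card", "payment_value": 100000},
    {"order_id": "o1", "payment_type": "debit_card",  "payment_value": 50000},
    {"order_id": "o2", "payment_type": "credit_card", "payment_value": 75000},
]

aggregated = {}
for p in payments:
    row = aggregated.setdefault(p["order_id"], {
        "count_credit_card": 0, "total_credit_card": 0,
        "count_debit_card": 0, "total_debit_card": 0,
    })
    # tMap step: payment_type.equals("credit_card") ? 1 : 0, etc.
    if p["payment_type"] == "credit_card":
        row["count_credit_card"] += 1
        row["total_credit_card"] += p["payment_value"]
    elif p["payment_type"] == "debit_card":
        row["count_debit_card"] += 1
        row["total_debit_card"] += p["payment_value"]
```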

Only for Fact Order

Outer Join order_dataset and order_item_dataset

Outer join order_dataset and order_item_dataset, since I want the granularity of the fact to be per item.

Insert feedback_id in FactOrder

FactOrder does not have feedback_id yet, so insert feedback_id into FactOrder so that DimFeedback can be joined with it.

Replace Natural Key with Surrogate Key

To speed up processing, replace the natural keys in the fact table with surrogate keys. What about the natural keys? They are still there, but stored in each dimension.

For All Dimensions

Add Row in Dimension for NULL Value in FactOrder

The purpose of creating an additional row in each dimension is that if a row in FactOrder has a NULL value in a foreign key column, it can still get a value when joined to the dimension table. For string columns the value can be 'UNKNOWN', while numeric columns can be 0. Before joining the fact with a dimension, replace NULL values in the fact's foreign key columns with 0 so they can be connected.

Additional Row
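This "unknown member" pattern can be sketched in Python. The dimension and fact rows below are hypothetical; the point is that NULL foreign keys become 0 and land on the added row:

```python
# Sketch: an extra dimension row with surrogate key 0 catches fact rows
# whose foreign key is NULL (replaced by 0 before the join).
dim_product = {
    0: {"product_id": "UNKNOWN", "category": "UNKNOWN"},  # added row
    1: {"product_id": "p1", "category": "toys"},
}

fact_rows = [{"product_sk": 1}, {"product_sk": None}]

joined = []
for f in fact_rows:
    sk = f["product_sk"] if f["product_sk"] is not None else 0  # NULL -> 0
    joined.append(dim_product[sk])
```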

• Sort Value for Table that Given Additional Row

This step is only needed if you added an additional row to a dimension for fact rows with a NULL foreign key. Sorting is needed so that the added row sits at the top and is assigned surrogate key 0. Sorting is done by id.

• Add Surrogate Key

The surrogate key acts as the primary key in the data warehouse. Its purpose is to speed up processing, which is why it is an integer; surrogate keys are usually shorter, which speeds up processing. In Talend, the surrogate key is created by incrementing a value. The code in Talend is as follows.

If it starts with 0

Numeric.sequence("s1", 1, 1) - 1

If it starts with 1

Numeric.sequence("s1", 1, 1)
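For illustration, Numeric.sequence behaves like a counter that hands out 1, 2, 3, ...; the Python sketch below mimics the starts-with-0 variant:

```python
# Sketch of Numeric.sequence("s1", 1, 1) - 1: a running counter
# shifted so the first surrogate key is 0 (for the added "unknown" row).
import itertools

seq = itertools.count(1)  # yields 1, 2, 3, ... like Numeric.sequence
keys_from_zero = [next(seq) - 1 for _ in range(3)]
```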

For All Tables

Uppercase / Lowercase

Uppercase / lowercase is used to make data with a string data type uniform. Run one of these expressions in tMap.

table_name.column_name.toUpperCase()
table_name.column_name.toLowerCase()

Replace Invalid Value

In this dataset, the values in the product_category column in products_dataset are separated using "_". An example is the product category bed_bath_table. Replacing "_" with " " turns the value into bed bath table. This is optional, but doing it means there is no need to clean again when doing data analysis. This can be done using tReplace.

• Fill NULL with 0

This is also optional, depending on needs. An example of a column that fills with 0 for a NULL value is product_weight_g.

table_name.product_weight_g == null ? 0 : table_name.product_weight_g

If the value is null, the output is 0; otherwise it outputs product_weight_g.

• Make Volume Column

This is optional, depending on whether it is required for further analysis. Volume is obtained by multiplying product_length_cm * product_width_cm * product_height_cm.

ERD Datawarehouse:

ERD Datawarehouse

After all of these processes, join all the dimensions to the fact. The relationship between a dimension and the fact is one-to-many. Take DimProducts and FactOrder as an example: one product can be sold in many orders.

Thank you for reading!

Glossary:

Slowly Changing Dimension: a technique for storing changed values in a dimension. SCD is applied in order to be able to view the history of the data.
