Data Warehouse lookup patterns, Junk Dimensions — shall I use them?

Pedro Fidalgo · Published in techburst · 6 min read · Jan 9, 2018

In a previous article I mentioned that most ETL overhead is related to the data warehouse fact table record enrichment process. What that process mostly does is add attributes to the data so that specific analyses can be executed against it. Only after being able to pull information out of data does one become aware of its real power, and I must mention that, despite having more than eighteen years of professional experience, the following ebook helped me understand the benefits of a properly built Data Warehouse.

I will dedicate this tutorial to another approach that minimizes the lookup iterations the ETL process needs to execute when parsing raw data: Junk Dimensions.

The term Junk is used because in the end you will have a single dimension that merges attributes that may have no relation to each other whatsoever, and that is exactly the idea. So let's consider a base dimensional model to work with, shown below.

This could be a model applied to a sales area where the fact record stores each of the three dimension tables' surrogate keys, depending on each client's relevant attributes. The schema shows the number of rows each dimension has, since that is an important detail for the logic I will present.

The values each of the above dimensions can take are presented in the three lists below. Client type identifies whether the client is Private, a Company, or Public Sector. Client satisfaction could measure the replies clients gave to some specific survey, and New Client could work as a flag that takes the values "New Client" or "Old Client". The logic behind new versus old could relate to whether the client already appeared in the previous year's data or is new to the current year's.

Dimensions’ underlying data could be the following

Now let's imagine our dimensional model has more than twenty or even thirty dimensions around the fact table. It is easy to see that, if each dimension corresponds to a fact foreign key, twenty or thirty lookup operations will be executed per inserted fact row. Any approach that reduces the number of lookups will reduce the running time of the entire ETL chain.

A possible approach is dimension combination. If we look at the number of rows each of the previous dimensions has, it is possible to build a single dimension that enumerates all possible combinations among the three tables. This means we end up with one dimension holding 5 × 4 × 5 = 100 rows. The table is built from the original lookup dimensions, and the required SQL is as simple as the query presented below: it just runs a cartesian join between the three tables.

SELECT ROW_NUMBER() OVER (ORDER BY A.Client_Satisfaction_Dsc,
                                   C.New_Client_Year_Dsc,
                                   B.Client_Type_Dsc) AS Client_Cross_Dimension_Sk,
       A.Client_Satisfaction_Dsc,
       C.New_Client_Year_Dsc,
       B.Client_Type_Dsc,
       A.Client_Satisfaction_Dsc + '#' +
       C.New_Client_Year_Dsc + '#' +
       B.Client_Type_Dsc AS Client_Cross_Dimension_Lkp_Key
INTO Dim_Client_Cross_Dimension
FROM Dim_Client_Satisfaction AS A
CROSS JOIN Dim_Client_Type AS B
CROSS JOIN Dim_New_Client_Year AS C
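The query above uses SQL Server syntax (SELECT ... INTO, `+` concatenation), so it will not run directly on every engine. As a self-contained illustration of the same build, the following Python sketch replays the cross join in an in-memory SQLite database. The sample dimension values are made up for the demo (the article's real dimensions have 5, 4 and 5 rows), so this version yields 3 × 3 × 2 = 18 combinations instead of 100.

```python
import sqlite3

# Hypothetical sample values, only for illustration.
satisfaction = ["Very Satisfied", "Satisfied", "Unsatisfied"]
client_type = ["Private", "Company", "Public Sector"]
new_client = ["New Client", "Old Client"]

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Dim_Client_Satisfaction (Client_Satisfaction_Dsc TEXT)")
cur.execute("CREATE TABLE Dim_Client_Type (Client_Type_Dsc TEXT)")
cur.execute("CREATE TABLE Dim_New_Client_Year (New_Client_Year_Dsc TEXT)")
cur.executemany("INSERT INTO Dim_Client_Satisfaction VALUES (?)",
                [(v,) for v in satisfaction])
cur.executemany("INSERT INTO Dim_Client_Type VALUES (?)",
                [(v,) for v in client_type])
cur.executemany("INSERT INTO Dim_New_Client_Year VALUES (?)",
                [(v,) for v in new_client])

# Build the junk dimension: one row per combination, plus the concatenated
# lookup key. SQLite concatenates with || instead of SQL Server's +.
cur.execute("""
    CREATE TABLE Dim_Client_Cross_Dimension AS
    SELECT ROW_NUMBER() OVER (ORDER BY A.Client_Satisfaction_Dsc,
                                       C.New_Client_Year_Dsc,
                                       B.Client_Type_Dsc) AS Client_Cross_Dimension_Sk,
           A.Client_Satisfaction_Dsc,
           C.New_Client_Year_Dsc,
           B.Client_Type_Dsc,
           A.Client_Satisfaction_Dsc || '#' ||
           C.New_Client_Year_Dsc || '#' ||
           B.Client_Type_Dsc AS Client_Cross_Dimension_Lkp_Key
    FROM Dim_Client_Satisfaction AS A
    CROSS JOIN Dim_Client_Type AS B
    CROSS JOIN Dim_New_Client_Year AS C
""")
rows = cur.execute("SELECT COUNT(*) FROM Dim_Client_Cross_Dimension").fetchone()[0]
print(rows)  # 18
```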

As you can see, the query runs a cross join between the three tables. I added an extra column at the end that concatenates the three descriptions into a single column. The reason for that is indexing: it will be faster for the database to perform a lookup against a single, properly indexed column with a unique index than to apply lookup conditions to the separate description columns. So, to make sure our lookup process runs as fast as possible, an index should also be created on this new table, with Client_Cross_Dimension_Lkp_Key in first place and Client_Cross_Dimension_Sk in second place.

CREATE UNIQUE CLUSTERED INDEX IXC_Dim_Client_Cross_Dimension ON Dim_Client_Cross_Dimension
(
Client_Cross_Dimension_Lkp_Key ASC,
Client_Cross_Dimension_Sk ASC
)

The ETL lookup process should first concatenate the fields and then apply the lookup operation against the concatenated column. The initial concatenation should be faster than an individual search per description field. Also, since data is usually first stored in staging area tables, this concatenation can be done during that load step as well and stored as a new auxiliary raw-data column. The final result will be a table with 100 rows that looks like the next one (some rows were omitted)
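The lookup step described above can be sketched in a few lines. This is only an illustration: the key/surrogate pairs below are invented, and a real ETL tool would typically cache the junk dimension the same way, as a map from lookup key to surrogate key.

```python
# Hypothetical in-memory cache of the junk dimension:
# concatenated lookup key -> surrogate key.
junk_lookup = {
    "Satisfied#New Client#Company": 1,
    "Satisfied#New Client#Private": 2,
    "Unsatisfied#Old Client#Public Sector": 3,
}

def enrich(fact_row):
    # Concatenate the raw descriptions once, then do a single keyed lookup
    # instead of three separate dimension lookups.
    key = "#".join([fact_row["satisfaction"],
                    fact_row["new_client"],
                    fact_row["client_type"]])
    fact_row["client_cross_dimension_sk"] = junk_lookup[key]
    return fact_row

row = enrich({"satisfaction": "Satisfied",
              "new_client": "New Client",
              "client_type": "Company",
              "amount": 120.0})
print(row["client_cross_dimension_sk"])  # 1
```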

This table's readability is low to the human eye, but in the end that does not really matter, since it will usually be consumed by exploration tools: users just need to drag and drop the required dimension column and apply the desired filters.

As far as I am aware, that is the main disadvantage of these tables. It is also possible to end up with very large Junk Dimensions if the base dimensions already hold some records. Remember, you are producing a cartesian join between several sets, so at each combination step you multiply the combinations already generated by the number of rows of the dimension being processed. Dimensions with a hundred rows or more, if combined together, will generate big tables, so pay attention to that. Even though highly optimized lookup indexes may make the lookup process faster, the developer is invited to test both approaches before deciding which one to use.
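The growth described above is just a running product of the dimension sizes, which makes it easy to estimate before committing to the approach (the larger sizes below are illustrative):

```python
import math

# Each combination step multiplies the running total by the next
# dimension's row count.
dim_sizes = [5, 4, 5]
print(math.prod(dim_sizes))      # 100 rows, as in the article's example

# Combining larger dimensions explodes quickly:
print(math.prod([100, 50, 20]))  # 100000 rows
```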

The main advantages of this approach are:

  • Reducing the number of foreign keys in the fact table. This example used three dimensions, but imagine you have ten foreign keys in the fact table, all pointing to the same True/False analysis dimension (a dimension that stores the description of a True/False flag). If a single Junk Dimension is built (holding 2¹⁰ = 1024 rows), you achieve the same outcome with a single foreign key. Also, due to the low cardinality, Oracle bitmap indexes (for instance) would work very well over that dimension's True/False columns, allowing very fast filtering operations;
  • Reducing the number of join operations executed between the fact table and the dimensions. Continuing the previous idea, with a single Junk Dimension you achieve the same result with one join, whereas with ten foreign keys you need to perform ten join operations. I am assuming conditions will be applied to all fields present in the created Junk Dimension;
  • Making the lookup process run faster. But always analyze the balance between an acceptable Junk Dimension size and the several-dimensions lookup approach. Tests are required so that proper, sustained decisions are made. The advantages will depend on the database engine being used and its capabilities (mostly the index types available and the underlying hardware);
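The ten-flags scenario from the first advantage can be enumerated directly; the flag names are hypothetical, but the count is exact:

```python
from itertools import product

# Ten True/False flags collapse into one junk dimension with 2**10 rows.
flags = ["flag_%02d" % i for i in range(10)]
combos = list(product([True, False], repeat=len(flags)))
print(len(combos))  # 1024
```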

Data Warehouse developers, as usual, must not treat these guidelines as dogma. The best approach will always be to build a proof of concept. If no data is available, generate random data and test the approaches, since a solution that seems good when dealing with a few thousand records may not hold up in the long term, when the database reaches several million records. Sustained creativity is the way to go.
