Evolution of Data Warehouse — Part I

Ravikiran durbha
Data And Beyond
Published in
5 min readMay 29, 2021

We saw that BI makes a difference, if implemented well — https://medium.com/business-intelligence-and-analytics/does-business-intelligence-bi-really-make-a-difference-7e70753d7c8b.

The implementation, of course, starts at the foundation — Data Warehouse.

This is where data is persisted before business can extract intelligence and there are a few patterns used today for this purpose. To help understand the intuition behind these patterns, it is instructive to look at the history and the evolution of Data Warehouse.

E.F. Codd, an English Computer Scientist, while working at IBM invented Relational Model to persist data, back in 1970. This was in fact the very first step towards data democratization. Before this, it required a fairly sophisticated knowledge of how computers worked to retrieve data. Now, it just takes a few days of training in “SQL — Structured Query Language”.

Figure 1: The key idea of Normalization in a Relational Model to remove redundancy

This simple model was immensely successful and every system used it to persist data. The process of sifting data into set of “Parents” and “Children” is technically known as Normalization. The key idea here is to reduce data redundancy. As seen in the Figure 1 above, in the normalized structure “Parent1” and “Parent2” are not repeated like they are in the flat structure. Of course, this also allows Set operations, like Intersection and Union to be applied — which is the mathematical basis of SQL.

Reducing redundancy eases data maintenance, as you have to update at only one place if it changes, albeit, efficacy of data retrieval is compromised. You will generally need to stich the children and parents back together again to retrieve the data and analyze it. This trade-off gives rise to most patterns as they optimize for one or the other.

The Relational Model (also known as Third Normal form or 3NF — The “three” refers to the degree of normalization, which we won’t go into, to keep it simple) was very successful in storing data for transactional systems, like airline reservation systems, point of sale etc. Hence, It was also used for warehousing , but, soon they ran into challenges.

There were two main problems:

  1. Data retrieval was not efficient, especially for larger data sets.
  2. It was important to track history in BI.

As mentioned before, data normalization makes retrieval expensive in terms of computing at scale(you have to stich the parents and children back every time you retrieve data). But, what about tracking history? Well, before we answer that, we need to digress and address something we glossed over in the previous section.

You might have noticed in the normalized structure (in Figure 1) we have arrows pointing from parent to child and these are normally implemented as “keys”. A key uniquely identifies an entity in a set. So, we create an unique key for every parent and this key is stored with the corresponding children in the children set, effectively creating the arrow.

Now, back to our history tracking problem. Instead of creating a key to uniquely identify a parent, we can create a key to uniquely identify a version of a given parent as shown below in Figure 2.

Figure 2: Normalized with versioning

This works in principle, but creates a “key cascading” problem. Imagine version 4 of parent1 shows up, we need to grey out version 3, create a new key and propagate that key to all its children. It may not seem like much, but imagine changes to millions of parents with multiple children for each parent. Not to mention tracking history for children as well, with each child having multiple children. If somehow, we didn’t have to propagate the keys, we would be set.

One way to do that is go back to the flat structure, at the expense of redundancy and hence increase in size , as shown in Figure 3. You will get a new pair of parent and child every time a parent or child changes. This is referred to as De-normalization, one of the many principles that leads to dimensional models (DM or star-schema), a methodology that emerged in mid-1990’s and was popularized by Ralph Kimball — A PhD graduate from Stanford University, who was also one of the pioneers of graphical user interfaces for programming.

Figure 3: Versioning in Flat Structure

In this type of model, these flat structures are called Dimensions and contain all the parent-child hierarchy of a given entity. Some examples of these dimensions would be — Product, Customer, Geography, Date etc. These dimensions(or entities) come together to form a business process transaction (A customer buys a particular product at a given geographic location on a specific date) and the fact of the transaction is recorded at the intersection of these dimensions. This may include measures like number of items purchased, total dollar amount etc. The structure at the intersection is called a “Fact table” (Clearly, names were not top priority), as shown in Figure 4. One can track history of each dimension and since it is a flat structure, we have mitigated the “key cascading” problem and resolved data retrieval to an extent.

Figure 4: Showing dimensional model as a collection of flat structures (dimensions) and Facts

This pattern became immensely successful, through the early 2000’s, in data warehousing and most organizations were using it. There is a lot of material written about this pattern by Ralph Kimball himself and others that cover the many different ideas within Dimensional Modeling. Just to give you a taste , there are at least seven different patterns just for tracking history in dimensions.

Eventually, though, challenges emerged and out of these challenges evolved a new pattern — Data Vault (DV). What are these challenges and how does DV solve them? Well, we will look at these in Part II , but it would be remiss not to mention that all three patterns — 3NF, DM,DV have their niche in the data world today and there is not one pattern that tackles all problems.

You can read Part II here — https://medium.com/business-intelligence-and-analytics/evolution-of-data-warehouse-part-ii-52a04b7189f3

--

--