Evolution of Data Warehouse — Part II

Ravikiran Durbha
Published in Data And Beyond · 5 min read · Jun 24, 2021

As promised in the previous post — https://medium.com/business-intelligence-and-analytics/evolution-of-data-warehouse-part-i-ee1438293853 — let us examine some of the challenges that emerged from Dimensional Modeling (DM).

But before we do that, let us remind ourselves what a star schema is. In a Star Schema (or DM), the dimensions (like Product, Customer, Geography, Date, etc.) come together to form a business process transaction. The facts of this transaction are recorded at the intersection of these dimensions in a structure called the “Fact Table”.
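To make this concrete, here is a minimal sketch of a star schema using plain Python dicts. The table names and sample values are hypothetical, chosen only to mirror the dimensions discussed below: each dimension table maps a surrogate key to descriptive attributes, and each fact row records measures at the intersection of those keys.

```python
# Dimension tables: surrogate key -> descriptive attributes (hypothetical data).
dim_customer = {1: {"name": "Alice"}}
dim_product = {10: {"name": "Tennis Racket"}}
dim_date = {20210115: {"month": "Jan 2021"}}
dim_geo = {100: {"state": "NC"}}

# Fact table: one row per business-process transaction, holding only
# foreign keys into the dimensions plus the measured facts.
fact_sales = [
    {"customer_key": 1, "product_key": 10, "date_key": 20210115,
     "geo_key": 100, "items_purchased": 2, "amount": 59.98},
]

# "2 items purchased" is just a number until joined back to its dimensions.
row = fact_sales[0]
print(dim_customer[row["customer_key"]]["name"],  # who
      dim_product[row["product_key"]]["name"],    # what
      dim_date[row["date_key"]]["month"],         # when
      dim_geo[row["geo_key"]]["state"],           # where
      row["items_purchased"])                     # the fact
```

The fact row itself carries no context; the joins to the four dimensions are what turn it into a meaningful business statement.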

The dimensions give context to the facts. In Figure 1, you can see that the “Number of Items Purchased”, for instance, is just a number until you specify “who” (Customer), “what” (Product), “when” (Date), and “where” (Geographic Location) to give it meaning.

Figure 1: Dimensions give context to facts

Now, let us assume we built this star schema for a department store, and they want to enhance it with additional context, say the department where the product was sold (perhaps Sports, Home Goods, Office Supplies, etc.). It is easy enough to add this dimension, but what do we do with all the facts we have collected so far? Clearly, we do not have this contextual information for those facts. This may not seem like much of a problem. After all, why can’t we add the dimension and only populate it for facts going forward? An example is in order to illustrate the problem better.

Figure 3: A simple Fact Table with 2 dimensions.

Imagine a very simple fact table with two dimensions, as shown in Figure 3 (shown as a matrix rather than a table, as this drives the point home better). You can see that every cell holds a single number and hence has the same granularity. This enables aggregating the numbers row-wise or column-wise, and the result maintains a legitimate business context. For example, if we sum across the row labeled “NC”, we get the total worth of goods sold in the months Jan–Apr 2021 in the state of North Carolina. Similarly, adding up the Jan 2021 column, we get the total worth of goods sold in Jan 2021 across all the states we do business in.
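A quick sketch of that matrix (the dollar figures are made up for illustration) shows why uniform granularity matters: with one number per cell, both row-wise and column-wise sums are valid business answers.

```python
# Hypothetical state-by-month fact matrix in the spirit of Figure 3:
# one number per (state, month) cell, i.e. a uniform grain.
facts = {
    ("NC", "Jan 2021"): 120, ("NC", "Feb 2021"): 90,
    ("NC", "Mar 2021"): 110, ("NC", "Apr 2021"): 130,
    ("NY", "Jan 2021"): 200, ("NY", "Feb 2021"): 180,
    ("NY", "Mar 2021"): 210, ("NY", "Apr 2021"): 190,
}

# Row-wise: total worth of goods sold in NC across Jan-Apr 2021.
nc_total = sum(v for (state, _), v in facts.items() if state == "NC")

# Column-wise: total worth of goods sold in Jan 2021 across all states.
jan_total = sum(v for (_, month), v in facts.items() if month == "Jan 2021")

print(nc_total, jan_total)  # both aggregates have a legitimate business meaning
```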

Now, imagine that in May 2021 we decide to add another dimension, “Department”, as shown in Figure 4. The May 2021 column is now at a different granularity, so aggregating across months becomes more cumbersome, and the onus is on the analyst to compute totals correctly. Any pre-existing SQL code will need to be modified to account for this change, because it does not know how to handle multiple numbers in a cell instead of one.

Figure 4: Adding a new dimension to the fact table will change the granularity
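Continuing the sketch above (all figures and department names are hypothetical), the mixed grain breaks code that assumed one number per cell, and the analyst must now normalize the grain before aggregating:

```python
# After May 2021, a cell holds per-department numbers (hypothetical values),
# while earlier months still hold a single number: two different grains.
facts = {
    ("NC", "Apr 2021"): 130,
    ("NC", "May 2021"): {"Sports": 40, "Home Goods": 55, "Office Supplies": 30},
}

try:
    # Pre-existing logic that assumes one number per cell now fails,
    # because it cannot add an int to a dict of department totals.
    total = facts[("NC", "Apr 2021")] + facts[("NC", "May 2021")]
except TypeError:
    # The onus shifts to the analyst: collapse the finer grain first.
    total = facts[("NC", "Apr 2021")] + sum(facts[("NC", "May 2021")].values())

print(total)
```

Every consumer of the fact table has to be taught about the new grain, which is exactly the maintenance burden described above.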

As we add more dimensions in the future, the problem is only exacerbated: the cells of the matrix above are broken down into progressively smaller pieces. The whole idea of a fact table is to enable easy aggregation of facts across all dimensions; if some of the facts carry more context than others, these operations are impeded. Once a fact table is populated, it becomes brittle and not amenable to adding new dimensions.

Figure 6: Fact tables are brittle and not amenable to adding new dimensions.

In the early 2000s, the “Agile Methodology” was picking up steam in application development. In short, it entailed more iteration, with developers soliciting feedback from users at a regular cadence and delivering incremental value. The need to identify all the dimensions of a star schema before using it makes it unsuitable for Agile development. There may be many source systems feeding into a data warehouse, and as we tackle different source systems, we may discover new dimensions. Even if we know all the dimensions up front, it may take time to model all the hierarchies within a dimension correctly, and we cannot show incremental value to the business during this process.

Around the same time, Dan Linstedt published Data Vault modeling. It turns out he solved the brittleness problem in an ingenious way: he recognized that the Fact Table is brittle because it is a child of many different parents. So he turned the star schema inside out, making the central table the parent, as shown in Figure 7. The idea is that a parent can have a new child at any time, but it is harder to give a child a new parent without changing its constituents. Of course, there is a lot more to Data Vault, but this was the genesis.

Figure 7: Inside out star-schema to solve brittleness problem

He called the central table a Hub and the entities around it Satellites. A Hub does not store facts like a Fact Table does; rather, it is a set of identities of the various instances of an entity. All the details of these instances are collected in the Satellites. As we learn more about the entity, we can add more Satellites in an agile way. Satellites can be de-normalized just like dimensions, enabling history tracking without key propagation (as seen in Part I). Different Hubs can be connected using Links, making the model extensible, like Lego blocks. There is a lot of material online if you are interested in learning more about Data Vault — https://www.data-vault.co.uk/what-is-data-vault/.
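The Hub-and-Satellite idea can be sketched the same way as the earlier examples. This is only an illustration under assumed table and key names, not Data Vault's physical design: the Hub holds identities alone, and each Satellite hangs details off those identities, so new context arrives as a new Satellite rather than a rebuild of anything existing.

```python
# Hub: identities only, no facts (hypothetical business keys).
hub_product = {"P-001", "P-002"}

# A Satellite references the hub key and carries descriptive details;
# each satellite can be loaded and historized independently.
sat_product_details = {
    "P-001": {"name": "Tennis Racket", "loaded": "2021-01-01"},
}

# Months later we learn new context (the department). Instead of
# restructuring the existing tables, we simply add another satellite;
# the Hub and the older satellite are untouched.
sat_product_department = {
    "P-001": {"department": "Sports", "loaded": "2021-06-01"},
}

# Every satellite row still resolves to an identity in the Hub.
# (Links, not shown, would connect this Hub to other Hubs the same way.)
print(all(key in hub_product for key in sat_product_department))
```

Compare this with the fact-table example earlier: there, adding “Department” changed the grain of existing data; here, it is just one more child table on an unchanged parent.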

As we can see, different challenges gave rise to different patterns, but no one pattern made the others obsolete. In fact, all three patterns we have seen (3NF, DM, DV) can work effectively together to enable BI.
