Data Vault: A much-needed breather for the Data Warehouse

Prag Tyagi
towardsdataanalytics
11 min read · May 18, 2021

Data Vault: sounds like a safe with a lot of jewels in it, doesn't it? Well, terms can be jazzy in the data engineering field. So what exactly is this Data Vault?

Data Vault is a methodology for large-scale Data Warehouse implementations. It is a way to accelerate the flow of data in an enterprise-level implementation where you are dealing with a huge number of source systems. Before going into the details of this methodology, let's first look at the issues existing Data Warehouse implementations are facing.

Enterprise Data Warehouse approach

Enterprise Data Warehouse

A typical enterprise Data Warehouse: we have a bunch of source systems, then a staging layer which also acts as the landing layer for all of your data. After this data is transformed, we have a Data Warehouse layer from which data is sent to multiple consumption channels, either pre-aggregated OLAP cubes or tabular objects.

Now, what issues do you see in the above architecture?

  1. Time to Market: The Enterprise Data Warehouse must first integrate data from each of the source systems into a central data repository before it’s available for reporting, which adds time and effort to the project.
  2. Complexity and Skill: A data warehouse may need to integrate data from a hundred sources, and designing an enterprise-wide data model to support a complex business environment is a significant challenge that requires highly skilled data modelling experts.
  3. Lack of Flexibility: A third normal form model tends to model the existing data relationships, which can produce a relatively inflexible solution that needs significant rework as additional sources are added. Worse still, over-zealous data modelling experts often attempt to overcome this by delivering over-complex generic models that are almost impossible to understand.

If you don't want to go down this route, there is an alternative approach: Dimensional Data Modelling.

Dimensional Design approach

Dimensional Design approach

In this approach, once the data arrives in the landing area, business logic is applied on top of it, and the data then gets segregated into dimension and fact tables. With this approach the time to market, or in simple terms the time to delivery, is reduced by a lot, but not all fairy tales get to live long. There are certain issues with this approach as well.

1. Increasing code complexity: The ETL code (Extract, Transform, and Load) becomes so complicated that it is no longer manageable. Performing data cleansing, de-duplication and data conformance in a single code base becomes very hard indeed.

2. Lack of Raw Data: As the landing area is purely transient (deleted and reloaded each time), we have no historical record of raw data. This makes it difficult for analysts to discover valuable new data relationships, and the increasing importance of Data Science, which (above all) needs raw data, is simply ignored.

3. Managing History: As there is no history of raw data, it becomes difficult to back-populate additional data feeds.

4. Data lineage: As both the technical and business logic is implemented in ever-increasing sedimentary layers of source code, it is almost impossible to track the lineage of a data item from the report back to the source system.

So we have seen the issues with the two data modelling approaches mentioned above. These days people do a hybrid mix-and-match between the two approaches to arrive at an optimum design; the question is how long that design will sustain in your landscape, and that is something you are best placed to judge.

Now let's see how Data Vault solves these issues.

Data Vault approach

Data Vault Architecture

On first impression this architecture looks very similar to the Enterprise Data Warehouse. This is because Data Vault also employs a landing layer to house incoming source data. But once the data leaves the landing layer and starts flowing towards the Vault area, that's where the difference becomes visible. The major differences are as follows:

  • Data Loading: As the data is loaded from the Landing Area into the Raw Data Vault, the process is purely one of restructuring the format (rather than content) of the data. The source data is neither cleaned nor modified, and could be entirely reconstructed without issue.
  • Separation of Responsibility: The Raw Vault holds the unmodified raw data, and the only processing is entirely technical, to physically restructure the data. The business rules deliver additional tables and rows to extend the Raw Vault with a Business Vault. This means the business rules are both derived from and stored separately from the raw data. This separation of responsibility makes it easier to manage business rule changes over time and reduces overall system complexity.
  • Business Rules: The results of business rules, including de-duplication, conformed results, and even calculations are stored centrally in the Business Vault. This helps avoid duplicate computation and potential inconsistencies when results are calculated for two or more data marts.
  • Data Marts: Unlike the dimensional data model, in which calculated results are stored in Fact and Dimension tables in the Data Marts, with the Data Vault approach the Data Marts are often ephemeral and may be implemented as views directly over the Business and Raw Vault (sketched below). This means they are easier to modify over time and avoid the risk of inconsistent results. If views don't provide the necessary level of performance, then the option exists to store results in a table.
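
To make that last point concrete, here is a minimal sketch of a data mart implemented as a view over the vault, using Python and SQLite purely for illustration. The table and column names (hub_customer, sat_customer, bv_customer_ltv, lifetime_value) are assumptions for a hypothetical eCommerce example, not a prescribed layout.

```python
import sqlite3

# Hypothetical Raw Vault and Business Vault tables (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer    (customer_hk TEXT PRIMARY KEY, customer_id TEXT, load_date TEXT, record_source TEXT);
CREATE TABLE sat_customer    (customer_hk TEXT, load_date TEXT, name TEXT, email TEXT);
CREATE TABLE bv_customer_ltv (customer_hk TEXT, lifetime_value REAL);  -- Business Vault derivation

-- The 'data mart' is just a view over the vault tables; changing it later
-- means redefining the view, not reloading any data.
CREATE VIEW dm_customer AS
SELECT h.customer_id, s.name, s.email, b.lifetime_value
FROM hub_customer h
JOIN sat_customer    s ON s.customer_hk = h.customer_hk
JOIN bv_customer_ltv b ON b.customer_hk = h.customer_hk;
""")
```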

So how do things get sorted out in Data Vault? For that we need to go into its internal mechanics.

Data Vault is a detail-oriented, history-tracking and uniquely linked set of normalized tables that support one or more functional areas of the business. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise.

Now let's see the building blocks of Data Vault. In Data Vault all the data gets grouped into three types of tables. These three types are:

i) Hubs

ii) Links

iii) Satellites

Now let's understand the relationship between these three important pillars with reference to an eCommerce data set.

Hubs: A hub is based on an identifiable business element. An identifiable business element is an attribute that is used in the source systems to locate data, otherwise known as the business primary key. The business primary key has a very low propensity to change and is usually not editable in the source systems.

Hub
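
For illustration, a Hub for a hypothetical Customer entity in this eCommerce example might carry nothing more than the business key plus vault metadata. This is a minimal sketch; the field names follow common Data Vault conventions but are assumptions, not a standard.

```python
from dataclasses import dataclass
from datetime import datetime

# Illustrative Hub row for an assumed eCommerce "Customer" entity.
@dataclass
class HubCustomer:
    customer_hk: str      # hash key derived from the business key (Data Vault 2.0 style)
    customer_id: str      # the business primary key as it appears in the source system
    load_date: datetime   # when this key was first seen in the vault
    record_source: str    # which source system supplied it
```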

Link: A link is an association of two or more business keys. It is based on identifiable business entity relationships. It can contain Hub keys and other Link keys. A link's business key is a composite unique index.
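
A Link relating the assumed Customer and Order hubs could then look like the following sketch, again with assumed names.

```python
from dataclasses import dataclass
from datetime import datetime

# Illustrative Link row relating the assumed Customer and Order hubs.
@dataclass
class LinkCustomerOrder:
    customer_order_hk: str  # hash key over the combined business keys
    customer_hk: str        # reference to the Customer hub
    order_hk: str           # reference to the Order hub
    load_date: datetime
    record_source: str
```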

Satellite: A satellite is based on non-identifying business elements. Predominantly it is descriptive data. Satellite data changes, sometimes rapidly, sometimes slowly. Satellites are separated by type of information and rate of change.

Satellite

A satellite has several salient features:

i) A satellite is dependent on a Hub or Link key as its parent

ii) A satellite is never dependent on more than one parent table

iii) A satellite is not a parent table to any other table

iv) Sequence number, Business Key, Load Date, Load End Date, Descriptive Data, and Record Source are mandatory (see the sketch below)
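
Putting the mandatory columns from the list above together, a Satellite hanging off the hypothetical Customer hub might look like this minimal sketch; the descriptive attributes (name, email, shipping_address) are assumptions for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Illustrative Satellite row hanging off the assumed Customer hub.
@dataclass
class SatCustomerDetails:
    sequence_no: int                   # sequence number
    customer_hk: str                   # key of the single parent Hub (carries the business key link)
    load_date: datetime                # start of validity for this version of the row
    load_end_date: Optional[datetime]  # closed out when a newer version arrives
    record_source: str
    # descriptive, non-identifying data (assumed attributes for the example)
    name: str
    email: str
    shipping_address: str
```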

The Data Vault Advantages

Data Vault addresses the difficulties inherent in both the 3rd Normal Form Enterprise Data Warehouse and the Dimensional Design approach by combining the best aspects of both in a single hybrid approach. The advantages include:

1. Incremental delivery: While it is sensible to build any Data Warehouse within the context of an overall enterprise model, Data Vault supports entirely incremental delivery. Just like the Dimensional Design approach, you can start small and incrementally add additional sources over time.

2. Flexibility: Unlike the Enterprise modelling approach, which can be inflexible, Data Vault requires no rework when adding additional sources. As Data Vault stores the Raw and Business derived data separately, it supports changes to business rules with ease.

3. Reduced Complexity: As Data Vault is built out in a two-step approach, it separates the technical data restructuring from the application of business rules, which helps isolate these potentially complex stages. Likewise, data cleaning is considered a business rule and can be managed independently of the initial data load effort.

4. Raw Data Included: Recording the raw data in Data Vault means it’s possible to back-populate the presentation area with historical attributes that were not initially made available. If the Data Marts are implemented as views, this can be as simple as adding an additional column to an existing view.

5. Elegantly supports change over time: Similar to the slowly changing dimension in the Kimball approach, Data Vault elegantly supports changes over time. Unlike the pure Dimensional Design, however, Data Vault separates the Raw and Business derived data and supports changes resulting from both the source system and the business rules.

6. Lineage and Audit: As Data Vault includes metadata identifying the source systems, it makes it easier to support data lineage. Unlike the Dimensional Design approach in which data is cleaned before loading, Data Vault changes are always incremental, and results are never lost, which provides an automatic audit trail.

7. High-Performance Parallel Loads: With the introduction of Hash Keys in Data Vault 2.0, data load dependencies are eliminated, which means near real-time data loading is possible in addition to parallel loads of terabytes to petabytes of data (a sketch of hash-key derivation follows this list).

8. Possible to Automate: While both Entity Relationship Modelling and Dimensional Design require time and experience to build skills, Data Vault tends to be easier to automate, and there are several tools available to help deliver the solution.
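
To illustrate point 7, here is a minimal sketch of how a hash key might be derived from business keys. The normalization rules (trimming, upper-casing, the '||' delimiter) and the choice of MD5 are assumptions; teams standardize these differently.

```python
import hashlib

def hash_key(*business_keys: str) -> str:
    """Derive a deterministic hash key from one or more business keys.
    MD5 and the '||' delimiter are common choices, but both are assumptions here."""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Because the key can be computed independently on each feed, hubs, links and
# satellites can be loaded in parallel without waiting for lookups or sequences.
customer_hk = hash_key("C-1001")
customer_order_hk = hash_key("C-1001", "O-2024-42")
print(customer_hk, customer_order_hk)
```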

The Drawbacks of Data Vault

Data Vault is not the perfect one size fits all solution for every data warehouse, and it does have a few drawbacks that must be considered. These include:

  1. Lots of Joins: A poorly designed Data Vault will produce a massive number of source-system-derived tables, but even a well-designed solution multiplies the count of source tables by a factor of 2 or 3. The number of tables, and therefore joins, can appear unwieldy and lead to complex join conditions. This can, however, be addressed with the correct use of bridge tables in the Business Vault (see the sketch after this list), and as with any solution, it's a trade-off between apparent complexity and flexibility.
  2. Ad hoc reporting: For ad hoc reporting, users have to navigate the data via views over the Vault rather than querying a simple star schema directly.
  3. Two data warehouses, twice the cost? Maintaining both a Raw Vault and a Business Vault (plus the marts on top) can feel like running two warehouses, with the storage and processing effort that implies.
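
On the first point, here is a minimal sketch of a bridge table in the Business Vault, reusing the hypothetical eCommerce tables and SQLite purely for illustration. The bridge resolves the hub-link-hub chain once, so marts and ad hoc queries can join to a single table instead of three.

```python
import sqlite3

# Hypothetical vault tables (names are assumptions, as in the earlier sketches).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer        (customer_hk TEXT PRIMARY KEY, customer_id TEXT);
CREATE TABLE hub_order           (order_hk TEXT PRIMARY KEY, order_id TEXT);
CREATE TABLE link_customer_order (customer_order_hk TEXT PRIMARY KEY,
                                  customer_hk TEXT, order_hk TEXT);

-- Bridge: one row per customer/order pair, carrying the business keys
-- so downstream queries avoid walking the hub-link-hub chain themselves.
CREATE TABLE br_customer_order AS
SELECT l.customer_order_hk, h1.customer_id, h2.order_id
FROM link_customer_order l
JOIN hub_customer h1 ON h1.customer_hk = l.customer_hk
JOIN hub_order    h2 ON h2.order_hk    = l.order_hk;
""")
```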

Where to use Data Vault?

Data Vault requires some rigor in delivering good design and sticking to the Data Vault 2.0 principles. Like the Enterprise Data Warehouse, it is designed to integrate data from several data sources and may, therefore, be overkill in some situations.

In summary, if you have a small to medium-sized analytics requirement, with a small (under 10) team of architects, designers, and engineers delivering a solution with data sourced from a few systems, then Data Vault may be inappropriate for your needs.

If, however, you have a large project with 30 or more source systems leading to an enormous data integration challenge and are prepared to take on the skills and rigor of a new methodology, then Data Vault can potentially add massive value to the project.

Data Vault 2.0 is the current prevailing version. So what is the difference between version 2.0 and its predecessor? Let's take a look.

Difference between Data Vault 1.0 and Data Vault 2.0

1. Data Vault 2.0 is a complete system of Business Intelligence. It talks about everything from concept to delivery. While Data Vault 1.0 had a major focus on modelling and many of the modelling concepts are similar, Data Vault 2.0 goes a step further and talks about data from source to business user facing constructs with guidelines for implementation, agile, virtualization and more.

2. Data Vault 2.0 can adapt to changes better than pretty much ANY other architecture. It can do it even better than Data Vault 1.0 because of the change in design to adapt to NoSQL and MPP platforms, if needed. Data Vault 2.0 has successfully been implemented on MPP RDBMS platforms like Teradata as well.

3. Data Vault 2.0 is both “big data” and “NoSQL” ready. In fact, there are implementations where data is sourced in real-time from NoSQL databases with phenomenal success stories. One of these was presented at the WWDVC 2014 where an organization saved lots of money by using this architecture.

A near real-time case study on absorbing data from MongoDB was also presented at WWDVC 2015.

4. Data Vault 2.0 takes advantage of MPP-style platforms and is designed with MPP in mind. While Data Vault 1.0 also did this to an extent, Data Vault 2.0 takes it to a completely different level with a zero-dependency style of architecture. Of course, there are a few caveats.

5. Data Vault 2.0 lets you easily tie structured and multi-structured data where you can join data across environments easily. This particular aspect lets you build your Data Warehouse on multiple platforms while using the most suited storage platform to the particular data set. It lets you use a truly distributed Data Warehouse.

6. Data Vault 2.0 has a greater focus on agility with principles of Disciplined Agile Delivery (DAD) embedded in the architecture. Again, being agile was certainly possible with Data Vault 1.0, but it wasn’t a part of the methodology. Data Vault 2.0 is not just “agile ready”, it’s completely agile.

7. Data Vault 2.0 has a very strong focus on both automation and virtualization as much as possible. There are already a couple of automation tools in the market that have the inventor's approval, and he is open to working with more vendors.

Data Vault 2.0 is real-time ready, cloud ready, NoSQL ready and big-data friendly. Data Vault 2.0 is really getting ready to rock the world with its success stories and case studies.

That's it for now. Keep following my blogs; more information is on the way.


Prag Tyagi
towardsdataanalytics

Senior leader and technologist with 14+ years of experience in Data Analytics. Passionate about sharing new concepts and learnings in the Data Analytics domain.