An Introduction To Data Vault 2.0

Dakai Zhou
4 min readMar 29, 2023

--

If you are from data engineering background, you may have heard about the term Data Vault. Data Vault is a methodology of designing, implementing and managing a data warehouse. Here, I would like to share my understanding of Data Vault 2.0(DV2), as well as its structure, advantages and disadvantages.

Data Vault 2.0 Architecture

DV2 is a 3-layers based architecture: the staging area, which collects raw data from source systems; the enterprise data warehouse layer, developed as DV2 model; information delivery layer, with information marts as star schemas and other schemas.

DV2 structure

Stagearea

The Stagearea loads batch data from data source without any transformation, except it adds meta data, like LoadDate, SourceSystem, Sequence, etc. It does not change table structures, i.e. primary keys, column names, data type. But all foreign keys, default values and indexes are dropped. All columns are nullable, because we want Stagearea to load all data from source system, even it is invalid data. With all the conditions above, raw data can be loaded to Stagearea as fast as possible.

In addition, Stagearea holds no historical data, only the data has not been inserted into data warehouse. This helps to save system resources and improve data warehouse performance without losing any information.

Data Warehouse Layer

The data warehouse layer is modelled under DV2, it holds the whole history data of the source system. Ideally, this layer tracks every data changes of the source systems. Data from different sources can be merged into one table by the business key. The data is usually not modified; they are in the same granularity as they are provided in the source systems.

Information Delivery Layer

Information delivery layer is where the end users have the access. It delivers the data in the most comfortable way as the end user preferred. This layer is ready for reports or any other data use by the business.

Data Vault 2.0 Format

In DV2, each source table is split into at least two tables, hub table and satellite table. If the source data object connects to another data object, a link table will be created.

Hub table contains columns like the business key of the data object, load date and data source. The business key is a unique identifier of a data object that makes sense in the business context. A surrogate key is indeed a unique data identifier, but it is just a random number with no business meanings. Business key can be a customer code, flight code or order code. In some cases, e.g. customers are divided by countries, customer code can be repeated in different countries. Then a combination of customer code and country code is the business key of a customer.

Link table serves like foreign keys. It connects data objects. It contains also the meta data like load date, data source. Besides, it has the business keys of the connected data objects. In this way, all data object details are separated into a single satellite table.

As mentioned above, satellite table contains data details and meta data. All columns in satellite table should be nullable, because satellite table keeps all historical columns of the full data history. This nullable makes sure that satellite table keeps all data history and would not cause errors when a column is deleted from the source system.

Advantages

Auditable— One apparent advantage of DV2 is, DV2 tracks the full data history. DV2 has the ability to go back to a certain time point and gives you the data of that time point. This also helps to investigate data errors. The user can go back to any time point to check or undo any data changes.

Easy Integration — With Hub and Link tables, new data sources can be easily integrated into DWH. Thus, data analysts can easily analyze data from different data sources.

Additive — DV2 allows data engineers to develop and deploy DWH step by step. Because of the DV2 structure, one source table per deployment is possible. DHW can grow along with the business almost without lag time.

Agile Work — Aproject can be broken into small steps, i.e. one source table per step. Developers can easily prioritize data elements and deploy the most important ones to DWH. Thus, data users can get the needed data with the shortest waiting time. With the small deployment, the time spent on debugging can also be shortened.

Disadvantages

Table Storage — As one source table is splitted to two or three tables, a DWH has two to three times more tables. Because DV2 records every single data changes, the table size can be really big. Therefore, it requires also more storage space.

Query Performance — Due to the Hub-Link-Satallite DV2 structure, the queries can have more joins, which leads to low query performance. If the joined table contains frequently updated data, the table size can be extremely big, which further negatively impacts the query performance.

Summery

Therefore, situations in which you can gain benefits from DV2 are when you have multi-source systems and constantly changing relations. Or data tracking and data audit are important to you. If you don’t need historical data, DV2 may not a good option to you.

--

--

Dakai Zhou

A Python developer and data engineer based in Germany.