Overhauling our Data Warehouse with Data Vault

Weston Sankey
Red Ventures Data Science & Engineering
Sep 29, 2017 · 6 min read

How we used data vault to create a backroom processing layer in our data warehouse architecture.

At Red Ventures, we model our analytic data in a Kimball-style dimensional model that is queried by analysts and used by SSAS cubes. Modeling business processes as fact and dimension tables offers significant query performance advantages over a highly normalized schema, as well as a simpler model for analytic queries.

We have a legacy data warehouse architecture where data from disparate source systems (primarily MySQL databases) is loaded via Pentaho ETL jobs directly into the dimensional model in a SQL Server database. This worked well when we had simpler reporting requirements and fewer source systems generating less complex data, but it has not scaled to meet the demands of an increasingly complex environment. We required an intermediary layer between our source system data and dimensional model that provided the following:

  • Extensibility and modularity
  • Data lineage
  • Integration of disparate source systems

The two prevailing data warehouse methodologies, Kimball Group and Corporate Information Factory (Inmon), both advocate for a staging layer where ETL processing occurs, as well as a presentation layer used for ad-hoc analytical queries, reporting, and BI tools. The Corporate Information Factory architecture includes an intermediary normalized database between the staging and presentation layers. This intermediary layer, referred to as the Enterprise Data Warehouse (EDW), holds atomic, time-variant, historic data that does not have business rules applied; business rules are applied during the transformation to the dimensional model. Given our requirements, we opted to create this intermediary normalized database using the data vault methodology.

Data Vault Modeling

The data vault methodology is not a new idea; Dan Linstedt, the creator of the method, started developing it in the 1990s. As defined by Linstedt in an article published in the Data Administration Newsletter in 2002:

The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of enterprise data warehouses.

Data vault modeling is surprisingly simple given the level of flexibility it affords. Before diving into the structure of the data vault model, it is important to understand the concept of the business key. Every business process generates some sort of identifier that uniquely identifies a transaction in that process. For example, an ordering system generates an order ID that tracks a particular order, a CRM system generates a customer ID for each customer it tracks, and so on. These business keys form the foundation upon which the data vault methodology is built.

The data vault model contains three main entities: hubs, links, and satellites. There is a single hub table for each business key, and the hub includes some metadata about those keys as well. In some cases, a business entity cannot be identified by a single key, so composite keys are allowed. Link tables connect two or more hubs and define the relationships and interactions between entities. Satellite tables contain non-identifying, descriptive attributes about hubs and links.

The diagram above demonstrates the three types of entities and how they relate. In this case, the business process being modeled is an order processing system. In the next few sections, we’ll describe each of these entities in detail.

Hubs

First up is the hub. In the order processing model, the order hub contains the unique list of order IDs generated by the order processing system. Note that the table includes both OrderSK and OrderID. What’s the difference between these two fields? OrderSK is a hash of the business key and serves as the primary key of the hub and the foreign key for any link table referencing the hub. The main reason for the hash key is performance: joining two tables on a single column is faster than joining on multiple columns. As mentioned before, the business key can span multiple columns. Finally, the RecordSource column tracks the source system from which the key was generated.
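To make this concrete, here is a minimal sketch of what the order hub might look like in SQL Server. The table and column names follow the examples in this post, but the MD5 hash and the specific data types are assumptions, not necessarily what we use in production.

```sql
-- Sketch of an order hub (names and types are illustrative).
CREATE TABLE dbo.OrderHub (
    OrderSK      CHAR(32)     NOT NULL PRIMARY KEY, -- MD5 hash of the business key
    OrderID      INT          NOT NULL UNIQUE,      -- business key from the source system
    LoadDate     DATETIME2    NOT NULL,             -- when the key first arrived
    RecordSource VARCHAR(100) NOT NULL              -- originating source system
);

-- The surrogate hash key is derived from the business key, for example:
-- SELECT CONVERT(CHAR(32), HASHBYTES('MD5', CAST(42 AS VARCHAR(20))), 2);
```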

Satellites

Satellites are used to store non-identifying, descriptive attributes about a business process, entity, or relationship. In the example above, only the ProductHub has a satellite, but it is perfectly acceptable for a link table to have a satellite as well. Satellites are append-only, so changes to the descriptive attributes of an entity are tracked using the ValidStartDate and ValidEndDate attributes. This is analogous to a Type II slowly changing dimension in the dimensional model, and makes modeling dimensions quite simple.
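As a sketch, a product satellite keyed to a product hub might look like the following. The two-date validity pattern mirrors the description above; the attribute columns and types are assumptions for illustration.

```sql
-- Hub for the product business key (mirrors the order hub above).
CREATE TABLE dbo.ProductHub (
    ProductSK    CHAR(32)     NOT NULL PRIMARY KEY,
    ProductID    INT          NOT NULL UNIQUE,
    LoadDate     DATETIME2    NOT NULL,
    RecordSource VARCHAR(100) NOT NULL
);

-- Append-only satellite: a new row is inserted whenever an attribute
-- changes, and the previous row's ValidEndDate is closed out.
CREATE TABLE dbo.ProductSatellite (
    ProductSK      CHAR(32)      NOT NULL,
    ValidStartDate DATETIME2     NOT NULL,
    ValidEndDate   DATETIME2     NULL,      -- NULL marks the current row
    ProductName    VARCHAR(200)  NULL,
    ListPrice      DECIMAL(10,2) NULL,
    RecordSource   VARCHAR(100)  NOT NULL,
    CONSTRAINT PK_ProductSatellite PRIMARY KEY (ProductSK, ValidStartDate),
    CONSTRAINT FK_ProductSatellite_ProductHub
        FOREIGN KEY (ProductSK) REFERENCES dbo.ProductHub (ProductSK)
);
```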

One of the most powerful features of the data vault model is the ability to split satellites by source system. How might this be useful? Suppose that your organization uses a proprietary CRM application, and data from that application is loaded into the data warehouse and integrated with other data sources. One day, the organization acquires another company that uses a different CRM application. As data warehouse engineers, our job is to integrate these two systems. Ultimately, we are creating data models for the CRM business process, not a specific CRM application, so we want our model to be as source-system agnostic as possible. The customer entity in these applications shares some attributes, but each one has attributes that the other doesn’t. Instead of using a single satellite table with nullable columns for the attributes that don’t exist in the other system, we can create two satellites, one for each source system. Each satellite is still related to the same hub table, but this approach gives us flexibility in dealing with disparate source systems that track the same business processes and entities.

S1 is used for the original CRM, and S2 is used for the new CRM. Both share Attribute A, but each also has attributes the other lacks.
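A sketch of what the two split satellites might look like, assuming a customer hub (not shown here) with a CustomerSK hash key; the attribute names are placeholders taken from the figure:

```sql
-- Satellite for the original CRM (S1). CustomerSK references a
-- customer hub, not shown in this sketch.
CREATE TABLE dbo.CustomerSatelliteCRM1 (
    CustomerSK     CHAR(32)     NOT NULL,
    ValidStartDate DATETIME2    NOT NULL,
    ValidEndDate   DATETIME2    NULL,
    AttributeA     VARCHAR(100) NULL,  -- shared with the other CRM
    AttributeB     VARCHAR(100) NULL,  -- exists only in the original CRM
    RecordSource   VARCHAR(100) NOT NULL,
    CONSTRAINT PK_CustomerSatelliteCRM1 PRIMARY KEY (CustomerSK, ValidStartDate)
);

-- Satellite for the acquired company's CRM (S2), hanging off the same hub.
CREATE TABLE dbo.CustomerSatelliteCRM2 (
    CustomerSK     CHAR(32)     NOT NULL,
    ValidStartDate DATETIME2    NOT NULL,
    ValidEndDate   DATETIME2    NULL,
    AttributeA     VARCHAR(100) NULL,  -- shared with the other CRM
    AttributeC     VARCHAR(100) NULL,  -- exists only in the acquired CRM
    RecordSource   VARCHAR(100) NOT NULL,
    CONSTRAINT PK_CustomerSatelliteCRM2 PRIMARY KEY (CustomerSK, ValidStartDate)
);
```

Neither table needs nullable columns for attributes it never receives, and onboarding a third CRM later means adding a third satellite rather than altering an existing one.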

Links

Link tables tie everything together and represent relationships between hubs. A link table includes a hashed primary key, the hash key of each hub connected by the link, the load date, and the record source. Link tables can connect as many hubs as necessary and effectively model transactions that can be used to populate fact tables.
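Continuing the order-processing sketch, a link between the order and product hubs defined above might look like this. Deriving the link's hash key from the combined business keys is a common data vault convention, stated here as an assumption:

```sql
-- Link connecting the order and product hubs sketched earlier.
CREATE TABLE dbo.OrderProductLink (
    OrderProductSK CHAR(32)     NOT NULL PRIMARY KEY, -- hash of (OrderID, ProductID)
    OrderSK        CHAR(32)     NOT NULL,
    ProductSK      CHAR(32)     NOT NULL,
    LoadDate       DATETIME2    NOT NULL,
    RecordSource   VARCHAR(100) NOT NULL,
    CONSTRAINT FK_OrderProductLink_OrderHub
        FOREIGN KEY (OrderSK) REFERENCES dbo.OrderHub (OrderSK),
    CONSTRAINT FK_OrderProductLink_ProductHub
        FOREIGN KEY (ProductSK) REFERENCES dbo.ProductHub (ProductSK)
);
```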

Our New Architecture

So where does data vault fit in our data warehouse architecture at Red Ventures? As mentioned earlier, the dimensional modeling methodology is still the ideal solution for handling analytic queries. Our problems were not with the dimensional model, but with an inflexible (or nonexistent) intermediary integration layer. Data vault fills this gap.

Our data warehouse architecture now has three tiers, each serving a distinct purpose in the analytic pipeline. Data from disparate source systems lands in a staging layer via an extract-and-load process (no transformations), where SQL Server stored procedures orchestrated by Spark (a post on this is coming soon) transform the data and land it in the data vault database. This process integrates disparate source systems, applies the necessary data lineage metadata, and cleans the data. The data vault layer contains historic data and could be used to re-populate the dimensional model if required. Another set of stored procedures (using the same orchestration engine) loads the data vault data into the denormalized dimensional model containing fact and dimension tables.
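Our actual stored procedures aren't reproduced in this post, but a minimal sketch of a staging-to-vault hub load might look like the following; staging.Orders and the column names are hypothetical:

```sql
-- Insert any business keys from staging that the hub hasn't seen yet.
-- (staging.Orders is a hypothetical staging table.)
INSERT INTO dbo.OrderHub (OrderSK, OrderID, LoadDate, RecordSource)
SELECT DISTINCT
    CONVERT(CHAR(32), HASHBYTES('MD5', CAST(s.OrderID AS VARCHAR(20))), 2),
    s.OrderID,
    SYSUTCDATETIME(),
    'orders_mysql'
FROM staging.Orders AS s
WHERE NOT EXISTS (
    SELECT 1 FROM dbo.OrderHub AS h WHERE h.OrderID = s.OrderID
);
```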

While this architecture adds ETL overhead, we have found it to be a worthwhile trade-off. Business users are exposed only to the analytics tier, and the complexity of the backroom processing and ETL layer is abstracted away completely. The data vault layer can ultimately be thought of as Lego blocks that can be combined into any dimensional projection the business requires.
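To illustrate the Lego-block idea, here is a hypothetical projection of the product hub and satellite sketched earlier into a Type II product dimension; the real transformation logic lives in our stored procedures and is not shown here:

```sql
-- Project the hub and its append-only satellite into a Type II
-- slowly changing product dimension.
SELECT
    h.ProductSK      AS ProductKey,
    h.ProductID,
    s.ProductName,
    s.ListPrice,
    s.ValidStartDate AS RowEffectiveDate,
    s.ValidEndDate   AS RowExpirationDate
FROM dbo.ProductHub AS h
JOIN dbo.ProductSatellite AS s
    ON s.ProductSK = h.ProductSK;
```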

Recap

Data vault provides the scalability, modularity, source-system integration, and data lineage that our data warehouse and analytic requirements demand. Using data vault as a source for our dimensional model has given us immense flexibility and the ability to iterate much more quickly. If you’re interested in learning more about data vault, check out Building a Scalable Data Warehouse with Data Vault 2.0; it is an incredible resource on all things data vault and was instrumental in rearchitecting our system.

Interested in working on challenging data problems? Check out opportunities at RedVentures.com.
