The 10 Capabilities of Data Vault 2.0 You Should Be Using
A Data Engineer’s Guide to Unlocking Creative Solutions with the Capabilities of Data Vault 2.0
Not all clients are created equal, and with this logic in mind, neither are the methods for data acquisition and ingestion.
One of the many challenges of working in data engineering is continuing to come up with creative solutions to solve your clients’ data problems. For example, if you have a client that cannot afford to implement CDC replication with a paid subscription or license, what would you propose instead? What if you needed to find a source that pushes data via Kafka, API, or file-based extract?
Over the course of my experience working with clients in the data engineering space, the number of source systems that are becoming candidates for data lake and data warehouse use cases has grown increasingly more diverse. This phenomenon, coupled with quality data from various data providers in the data marketplace, brings some unique challenges to the data engineering life cycle. While the ability to adapt to these ever-present challenges in data engineering is a necessity, it’s required (and equally important) to have data processors that can maintain, and retain, recently active records and history.
Since these data sources are not static, they can evolve during the development life cycle. Types of common evolutions to the data sources include adding or removing columns, renaming or updating tables, schema drift, etc. In this case, if one is considering the data acquisition and ingestion components of a data lake, it’s important to ask the question, “What would happen if a schema drift were to occur? How would this event affect my data and processes?”
Enter Data Vault 2.0
By and large, there has been a consistent (and good!) influx of information about the Data Vault model and methodology, which has helped various data teams handle various scenarios with ease.
For example, take Data Vault with its Hub-Link-Satellite approach. In my opinion, this combination offers agility by taking a Property Graph-like approach with nodes (Hubs), edges (Links), and properties (Satellites). As more and more source systems are added, or as the schema evolves, your underlying model remains viable and doesn’t get invalidated.
How is this possible?
Well, the short answer is that Data Vault methodology facilitates forward progression — just because a schema drift has happened, it doesn’t mean the existing data ingestion (ET) processes need to change.
Data Vault 2.0 has a number of other lesser-known or overlooked specifications (check out the full list here!), so I won’t be focusing on just the Hub-Link-Satellite example from above. Instead, I’m going to walk you through the ten capabilities that piqued my interest when I began to learn about Data Vault. If you’re looking to improve your data ingestion process, a great way to get started today is by familiarizing yourself with these capabilities.
1. Always Insert
With this capability, as sources are sending in any inserts, updates, and deletes, we maintain them a separate record. Think of it as kind of like what Kafka does in the way that a table is similar to a log.
By avoiding the update operation of “Load End Date,” it is much easier for it to be adopted by modern cloud object datastores like S3, Azure Blob, etc. Insert operations, in general, are much faster, due to the fact that loading times are consistent (for the most part).
2. Tooling and Data Warehouse Vendor
In a way, this also means that you may be able to get away with not adopting a CDC replication tool for some of your data sources. For example, the source database tables are small/medium-sized, and in this case, a full extract is acceptable.
The insert-only approach is also very friendly to modern cloud-based data warehouses and data lakes. Regardless of whether you choose a vendor like Snowflake, Databricks, BigQuery, etc., the underlying object storage does not have to necessarily update. For example, using the existing micro-partitions for Snowflake, a Parquet/Delta table for Databricks, etc.
This means you’ll experience faster loads and ingestion. Additionally, it’s important to note that DataVault has been adopted in non-SQL databases like MongoDB as well.
3. Active Records
To get the active record, we can use analytical functions like lag and lead. You can easily adapt this with Query Assistance tables like “Pits” and “Bridges.”
This might look like an overhead during reading, but my argument is that it eases the ingestion implementation logic in a simpler fashion with the ability to handle out-of-sequence or late-arriving records.
If querying the satellite is not meeting your SLA, I recommend that you adopt the “PIT” pattern. The “PIT” is technically a snapshot that can be materialized as often you want and preserved as long as you want. In fact, you can have the “PIT” be the same table, you would just add a column named “PIT-Snapshot-Date,” and use a partition/cluster on this new column.
4. SCD Type 2/3 Tables
One of the benefits I’ve observed with this type of table is that the needs of the SCD Type 2/3 can be added to the information data mart. You can even add these tables in the data mart later on in the delivery lifecycle, and you would still be able to have the historical records from day 1.
5. Out-of-Sequence Data
One common pain point in data ingestion, specifically with IIoT datasets, is receiving events that are out of sequence. It is typically a pain to identify older records, then take the time to update multiple records, reposition, and reorder them in a way that accommodates the late arrival record.
Another constant pain point is identifying and avoiding the need to load duplicate records from a stream. With the “Always Insert” based method, coupled with “HashDiff” columns, the aforementioned pain of repositioning does not need to be implemented which keeps the data ingestion process much simpler.
The “HashDiff” column, with some caveats, prevents loading duplicate records and maintains the ease and simplicity of ingestion. Additionally, if the active record implementation logic is using the analytical windows function, there is no need for it to change.
6. CDC and Replication Data Streams
When using the “Always Insert” approach, regardless of the data acquisition patterns such as CDC, file-based, streams, or API pull, the data ingestion pattern, as well as the underlying implementation, always remains the same.
I feel that this is a huge plus for data engineers. Without this functionality, it can be difficult to successfully implement data ingestion based on the type of sources and their replication pattern. This issue, coupled with source systems, drifts, and channels (Kafka, Rest, etc.), means that you have to accommodate various scenarios and error processing in the ingestion components. Additionally, you would need to write source-specific ingestion components.
Due to the complications of keeping track of each source system behavior and how their data ingestion pattern implementation changes, the ingestion process can be prone to miscommunication, misunderstanding, and potential error mines.
7. Loading Hubs, Links, and Satellites
With this specification:
- Hub and Satellite Sequence ID (deprecated)
- Hub and Link Last Seen Dates (deprecated)
- Adoption of hash keys
It is much much faster to load into the various entities, and a lot of parallelisms are achieved during ingestion in Raw Vault, which eventually leads to faster data availability later in the pipeline.
8. Hub as an Index
The hub is defined as “a unique list of business keys,” and it is technically an index. The records get inserted into the hub only for new entities such as new customers, addresses, or IoT devices. For example, with input data, if an existing customer needs an address change or it contains records that were previously in the hub, then it is not inserted again, nor is the existing record updated.
Technically, this means that you are only inserting delta records that are new entities, such as customers. If the changed record set does not have new entities, then no write operations are observed, which results in an improvement in ingestion speed.
9. Satellites Based on Data Classification
Due to the ever-increasing need for data governance, privacy and regulations, it is a must to keep track of and maintain PII data under secure datasets.
Satellites, which contain context data, like name, date of birth, sex, role of the entity (ex: customer), need not be preserved together. You could implement satellites that are split on data classifications. Therefore, you could store all PII-related context in one table and the non-PII data in another table. Using the Hash-keys, the full recordset can still be built even though the columns are residing in different satellites.
Another observation here is that if the PII columns are not changed, then, no new records are necessarily inserted in the satellite holding the PII data.
10. Links are Immutable
The link/relation is immutable, which means that once there is a relation between two hubs, they are always there and will remain unchanged over time. The Link-Satellite holds the information on “End-Date” (if applicable), and again, if there is no change to the relation, there is no write operation. The link table thus essentially is another index-like store.
I hope that you found these 10 Data Vault 2.0 capabilities as interesting as I did since there are plenty of effective and positive effects they offer to the data ingestion process. There are additional specifications that might be of interest like “Same-as” Link, “Exploration” Link, and many more.
If you want to read further on more characteristics of DV2, I would recommend this article: Do you know these 7 Characteristics Of Data Vault 2.0? Keep an eye out for my corresponding story as I reflect on my journey and experience with learning about DV2, along with some helpful tips for starting the adoption process.
Getting Started with Data Vault 2.0
My top recommendations for beginning your journey with DV2
3NF and Data Vault 2.0 — A simple comparison
Data Vault, from its first edition ‘1.0’ as a data modeling specification has evolved into a more elaborate version…
Digging into Data Governance and Data Modeling
Testing and validation are where data governance begins to play a massively important role. If you’ve documented where…
Hashmap on Tap | Hashmap Podcast
A rotating cast of Hashmap hosts and special guests explore technologies from diverse perspectives while enjoying a…
Hashmap Megabytes | Bite-Size Video Series
Hashmap Megabytes is a video series in which mega cloud ideas are explained in bite-size portions. Our data and cloud…
Ready to Accelerate Your Digital Transformation?
At Hashmap, an NTT DATA Company, we work with our clients to build better, together. We are partnering with companies across a diverse range of industries to solve the toughest data challenges — we can help you shorten time to value!
We offer a range of enablement workshops and assessment services, data modernization and migration services, and consulting service packages for building new data products as part of our service offerings. We would be glad to work through your specific requirements. Connect with us here.