Using a Data Vault to Support Data Science

Veronika Durgin
Published in IndigoAg.Digital
Mar 3, 2021

What’s all this mess?

Data is inherently messy. It wasn't created that way on purpose, but for many different reasons that is how it arrives in a staging area or a Data Lake. It resembles a pile of Legos that used to be beautiful Lego sets but completely fell apart … and someone recycled the instructions.

Making sense of data in this chaotic state is complicated and time-consuming. The long, tedious, and painful process of finding the necessary data elements, trying to join them together, and performing data profiling and cleansing is frustrating. Trust in data is usually lost along the way.

At Indigo Ag, we load data into Snowflake from over 100 different data sources. Some of these sources contain similar data while others are unique. Prior to adopting the Data Vault 2.0 methodology, every team was responsible for understanding the intricacies of each data source, cleansing it, and joining it to additional data in order to shape it to what they needed. There was a lot of complexity, redundancy, and inconsistency. Data Scientists and Data Analysts spent too much time on data prepping and cleansing instead of doing what they do best — discovering how data can be used to answer questions and solve problems.

Data Vault 2.0 to the rescue

Data Vault 2.0 can help solve or at least alleviate many pain points mentioned above. If you are new to Data Vault and its concepts, Data Vault Alliance is a great place to start. For Snowflake customers, Kent Graziano wrote a series of blog posts about optimizing Data Vault on Snowflake.

Here are some reasons why I think the Data Vault 2.0 methodology not only supports but also empowers Data Science:

Document the Business Semantics

The modeling aspect of the Data Vault methodology is based on a top-down approach that focuses on understanding and modeling business concepts independently of the source systems. Business Keys (Hubs), relationships between those business keys (Links), and descriptive attributes (Satellites) help build a model that is easy for others to understand and explore without needing any deep knowledge of where the data came from.

Naming conventions are very handy in providing end users some context about the contents of the Data Vault tables. Here is an example of a naming convention we use at Indigo. The pattern generally consists of 3 parts.

1. Prefix:
  • SAT_ - Satellites for Hubs
  • SATL_ - Satellites for Links
  • HUB_ - Hubs
  • LINK_ - Links
  • SML_ - Same-as Link
  • HRL_ - Hierarchical Link

2. Middle: describes a Business Concept or a relationship.

3. Suffix: only applicable to Satellites and describes the source system or some other special purpose.

For example: HUB_CROP, SAT_CROP_SAP, LINK_CROP_FIELD, SATL_CROP_FIELD_SAP
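
To make the naming convention and the underlying model concrete, here is a minimal sketch of what tables behind names like these might look like in Snowflake. The columns and data types are illustrative assumptions, not our actual DDL.

-- Illustrative sketch only: columns and types are assumptions, not actual DDL
CREATE TABLE IF NOT EXISTS HUB_CROP (
  HUB_CROP_KEY BINARY(20) NOT NULL, -- hash of the business key
  CROP_BK VARCHAR NOT NULL, -- the business key itself
  LOAD_DT TIMESTAMP_TZ NOT NULL,
  RECORD_SOURCE VARIANT NOT NULL
);

CREATE TABLE IF NOT EXISTS LINK_CROP_FIELD (
  LINK_CROP_FIELD_KEY BINARY(20) NOT NULL, -- hash of the related hub keys
  HUB_CROP_KEY BINARY(20) NOT NULL,
  HUB_FIELD_KEY BINARY(20) NOT NULL,
  LOAD_DT TIMESTAMP_TZ NOT NULL,
  RECORD_SOURCE VARIANT NOT NULL
);

CREATE TABLE IF NOT EXISTS SAT_CROP_SAP_RAW (
  HUB_CROP_KEY BINARY(20) NOT NULL, -- parent hub key
  EFFECTIVE_FROM_DT TIMESTAMP_TZ NOT NULL,
  HASH_DIFF BINARY(20) NOT NULL, -- change detection
  CROP_NAME VARCHAR, -- descriptive attributes from SAP
  CROP_TYPE VARCHAR,
  LOAD_DT TIMESTAMP_TZ NOT NULL,
  RECORD_SOURCE VARIANT NOT NULL
);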

We generally use views to project effective dates to the effectivity satellites (see code example below). An effectivity satellite table in the Raw Data Vault is suffixed with _RAW.

CREATE OR REPLACE VIEW SAT_CROP_SAP
AS
with READ_FROM_RAW
as (
  select ...
    ,lead(EFFECTIVE_FROM_DT) over (
      partition by HUB_CROP_KEY order by EFFECTIVE_FROM_DT asc
    ) as EFFECTIVE_TO_DT_RAW
  from SAT_CROP_SAP_RAW
)
select ...
  ,EFFECTIVE_FROM_DT
  ,ifnull(EFFECTIVE_TO_DT_RAW, '2099-12-31'::timestamp_tz) as EFFECTIVE_TO_DT
  ,(EFFECTIVE_TO_DT_RAW is null) as IS_CURRENT
from READ_FROM_RAW;
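
As a usage sketch against the view above, consumers can ask for either the current version of each record or the version that was effective at a specific point in time (the date below is only an example):

-- current version of each crop record from SAP
select *
from SAT_CROP_SAP
where IS_CURRENT;

-- point-in-time version: what was effective on 2020-06-01
select *
from SAT_CROP_SAP
where '2020-06-01'::timestamp_tz >= EFFECTIVE_FROM_DT
  and '2020-06-01'::timestamp_tz < EFFECTIVE_TO_DT;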

Using JSON in the VARIANT data type in Snowflake for the record source column of a hub allows us to add additional metadata about source systems rather than stuffing all that data in a standard VARCHAR that would be harder to parse later. Here is an example:

parse_json('{
  "Source System": "Name of the Source system",
  "Source Table": "Name of the source table",
  "Source Columns": {
    "Business Key column name": ["Source System Column Name"]
  }
}')::variant as RECORD_SOURCE
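
Because the record source is a VARIANT, it can later be queried with Snowflake's semi-structured path syntax instead of string parsing. A small illustrative query (the hub name and the aggregation are just an example):

-- count hub records by source system using the VARIANT record source
select RECORD_SOURCE:"Source System"::string as SOURCE_SYSTEM
  ,count(*) as RECORD_COUNT
from HUB_CROP
group by SOURCE_SYSTEM;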

Really Document the Data

Naming conventions help at a glance, but they do not replace real documentation. Part of the Data Vault implementation is learning the business as well as the data sources. Many data quirks are discovered in the process of performing basic profiling and writing transformations. We make sure to document all of our discoveries in a centralized place (e.g., Confluence). Our documentation also includes everything we know about a data source, logical diagrams, the Data Vault ERD in Lucid and/or sqlDBM, and YAML-driven data dictionaries with column descriptions that are added to Confluence as part of deployments. This goes a long way toward onboarding someone new to a data set more quickly.

Preserve the Integrity of the Source Data

One of the critical patterns of the Data Vault methodology is that all data is loaded into the Raw Data Vault as is, all the time, using inserts only. Data scientists do not need to worry that anything is lost in upstream transformations. Historical changes are preserved and can be analyzed with the help of effectivity satellites that enable time travel. All attributes, from all sources, are loaded to the Data Vault on the first pass, so there is no need to wait while requests for additional data are handled by data engineering. This also applies to semi-structured data, which can easily be stored in Satellites in Snowflake using the VARIANT data type (no need to parse it and then pick and choose what to store).
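
For the semi-structured case, a sketch of what that can look like is below. The raw satellite lands the whole payload in a VARIANT column, and consumers extract attributes only when they need them; the table, column, and path names are assumptions for illustration.

-- sketch: land the full JSON payload as is; nothing is parsed away on ingest
CREATE TABLE IF NOT EXISTS SAT_CROP_VENDOR_API_RAW (
  HUB_CROP_KEY BINARY(20) NOT NULL,
  EFFECTIVE_FROM_DT TIMESTAMP_TZ NOT NULL,
  PAYLOAD VARIANT NOT NULL, -- full source record, untouched
  LOAD_DT TIMESTAMP_TZ NOT NULL,
  RECORD_SOURCE VARIANT NOT NULL
);

-- consumers pick out only what they need, when they need it
select HUB_CROP_KEY
  ,PAYLOAD:crop:variety::string as CROP_VARIETY
from SAT_CROP_VENDOR_API_RAW;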

Link ‘em

In cases where source system keys have to be used as business keys and data is loaded from multiple sources, a "Same-as Link" is a way to deduplicate data and associate records from different systems with each other. This is a very useful pattern of the Business Data Vault that gives us the flexibility to perform matches among data from different sources. We had a specific use case where matches were made using different algorithms, ranging from fuzzy string matching to manually created reference tables. To allow data science to pick the matches that satisfy their criteria, we created a satellite for the Same-as Link containing attributes that describe each match, so it can be used for filtering.
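
Here is a minimal sketch of that pattern; the table names and match attributes are illustrative assumptions rather than our actual design. The Same-as Link ties two hub keys together, and its satellite carries enough information about each match for data science to filter on.

-- illustrative Same-as Link and its descriptive satellite
CREATE TABLE IF NOT EXISTS SML_CROP (
  SML_CROP_KEY BINARY(20) NOT NULL,
  HUB_CROP_KEY BINARY(20) NOT NULL, -- master record
  HUB_CROP_KEY_DUP BINARY(20) NOT NULL, -- duplicate from another source
  LOAD_DT TIMESTAMP_TZ NOT NULL,
  RECORD_SOURCE VARIANT NOT NULL
);

CREATE TABLE IF NOT EXISTS SATL_CROP_MATCH (
  SML_CROP_KEY BINARY(20) NOT NULL,
  MATCH_METHOD VARCHAR, -- e.g. fuzzy string match, manual reference table
  MATCH_CONFIDENCE FLOAT, -- score produced by the matching algorithm
  EFFECTIVE_FROM_DT TIMESTAMP_TZ NOT NULL,
  LOAD_DT TIMESTAMP_TZ NOT NULL
);

-- pick only high-confidence matches, regardless of which algorithm produced them
select sml.HUB_CROP_KEY, sml.HUB_CROP_KEY_DUP
from SML_CROP sml
join SATL_CROP_MATCH m on m.SML_CROP_KEY = sml.SML_CROP_KEY
where m.MATCH_CONFIDENCE >= 0.9;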

Cherry on Top

Life is not simple and neither are data systems that try to model it. When similar data from different sources comes with different grains, a Data Vault implementation following the design standards easily handles it with the use of multiple links and satellites. To make things easier for data consumers, we join all of these tables into a common Business Vault view where business rules are applied to help align attributes across sources.

These views are historical and have an additional attribute that identifies the source system for each record. We use these views to identify and understand data quality issues across multiple sources, find patterns, and create algorithms for data consolidation.
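
A simplified sketch of such a view is below; the satellite names, columns, and business rule are placeholders for illustration, and a real Business Vault view would typically join more links and satellites.

-- sketch: align crop attributes from two sources into one historical view,
-- tagging each record with the system it came from
CREATE OR REPLACE VIEW BV_CROP
AS
select HUB_CROP_KEY
  ,CROP_NAME
  ,EFFECTIVE_FROM_DT
  ,EFFECTIVE_TO_DT
  ,'SAP' as SOURCE_SYSTEM
from SAT_CROP_SAP
union all
select HUB_CROP_KEY
  ,upper(CROP_NAME) as CROP_NAME -- example business rule: normalize casing
  ,EFFECTIVE_FROM_DT
  ,EFFECTIVE_TO_DT
  ,'VENDOR_API' as SOURCE_SYSTEM
from SAT_CROP_VENDOR_API;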

People Relationships Matter Too

We all know that the best way to resolve data quality issues is to fix them at the source. But we very rarely have access to source systems to do so. At Indigo, we have a dedicated team working on the Data Vault, which allows us to build relationships not only with internal data owners but also with external data vendors. Because of these relationships, we've been successful at communicating data issues to the appropriate owners, who are able to fix them at the source. The end result is that many data quality issues are accounted for and resolved before bad data gets an opportunity to ruin great insights.

Best Friends

Anyone who does data analytics or data science looks at data with a lens and purpose slightly different from those of an engineer building a Data Vault. Collaboration between these functions is incredibly valuable, both while the Data Vault is being modeled and built out, and afterward, when data is profiled more deeply and more potential issues are discovered. Close collaboration allows both parties to share subject expertise, identify gaps that require additional data sources, identify and document undiscovered data quality issues, and create even better documentation.

Calling all Master Builders

Data in the Data Vault is organized by business concept (i.e., hubs with business keys), historical and auditable (i.e., effectivity satellites), documented, and minimally cleansed where appropriate (i.e., the Business Vault).

With the Data Vault on Snowflake, all of the data prep work is done and the data is stored in one central platform. Now it is ready to be used to solve complex problems that will positively impact the business.

Calling all DATA Master Builders!


Acknowledgement

I want to extend a huge thank you to Kent Graziano for all his encouragement, support and limitless patience while editing my writing.
