The Data Vault Guru: A Pragmatic Guide on Building a Data Vault
Why Write a Book?
There exists plenty of literature on the web about data vault modelling and often the message and methodology are conflicting, designed based on old school principles of data delivery or particular understandings of Agile delivery.
As a data professional with experience in ETL/ELT and data modelling, I felt something was needed to bring both viewpoints of data delivery together, one originating from data automation and the other from data modelling. They are not the same but if a data vault is properly delivered it must be cognizant of both.
Data vaults continue to evolve and communities of data vault modelling practitioners exist around the world. Forums and meetups focus on how data vault modelling has been applied to specific situations and tools. However, the problem with these resources is that attendees often don’t have access to the tools or may never encounter the unique situation. Data vault models are delivered with a partial view of the standards. Although data vault implementations have evolved, the standards remain consistent.
Although the book is a written account of what I consider to be an almost complete implementation of data vault modelling, it has to be said that reading about data vault and getting trained in it does not necessarily mean you know how to deliver a data vault. The book takes a platform-agnostic view and describes the intended delivery and modelling patterns of a data vault.
A data vault is not designed to model your data, it is designed to model and integrate all of your data! The beauty of a data vault is that once you have established the standards and understanding, you can build onto the data without having to change any of the existing data. You simply add to it by adding data sources — vertically to existing structures and horizontally as new structures.
Often, once you commit to a data vault, some considerations are not thought of until you get there, like what the categories of business rules are, how to build a business vault, or how to get data out of the data vault!
What Is in it for the Reader?
· This book discusses the levels of architectures present in any data-centric enterprise; how to view business processes and how they are modelled into raw and business vault.
· The book also discusses data latency, how that pertains to persistent vs transient staging, and whether or not creating views for a business vault is a good idea.
Relevant types of business and non-business keys and their relationship to data vault, themes like surrogate hash or sequence, smart, natural, driver and zero keys.
How to think about time in the data vault context through an insurance example by comparing business dates, data packaging dates and load dates as well as the three forms of time: discrete, evolving and recurring
· We get into the classic Raw Vault structures and also discuss naming standards, the DV-tags (metadata) expected in each data vault structure, recommended indexing of each. As for the three model structures themselves…
o Hubs — we present what they are, the standard SQL to populate a hub table, sample structure, defining what the standard business key treatments are and passive integration. How to apply variable business key treatments and business key collision avoidance strategies. How to load composite key business entities into a simple key entity hub table
o Links — we discuss what the unit of work is, how to build link tables either as regular links, same-as links, hierarchy links or a combination of those with an example on why you would or would not include a dependent-child key in a link table. All with the standard SQL to populate a link table.
o Satellites — the grain of change/snapshot-data we get from source systems. From single record grain to multi-grain with dependent-child keys, how to include multiple updates to a business entity in a single batch, how to implement a multi-active satellite and satellites with advanced data types like arrays and structs. Importantly how to split satellites and why, which content will be loaded to link-satellites and which to hub satellites and why you would split those. Not to mention how to handle personally identifiable content (PII). All satellite grains and types are presented with sample SQL on how to load them.
· Change data capture structures are also discussed, how you would load CDC and near-realtime content. As well as how to model reference data into data vault.
· The content derived in the analytics platform is stored in your Business Vault because it was sourced from Raw Vault but populated using the same automation templates as Raw Vault. The book shows you how to build a BV-link and BV-satellite and how to ensure data and business rule lineage while tying that content to a data governance tool
· We cap off the data vault modelling section with a decision tree on how to decide what artefacts to build and at what grain.
· The modelling section is completed with examples, SQL, structure and considerations for
o Record Tracking Satellites
o Status Tracking Satellites and
o Effectivity Satellites
Comparing where and why you would use one peripheral satellite over another.
· From the automation perspective we discuss
o the three loading patterns for hubs, links and satellites
o loading patterns for PITs and Bridges
o how to load to shared hubs
o orchestrating the load from staging content to raw vault,
o orchestrating business vault,
And we include code for a test automation framework!
· The book dedicates a chapter to data-driven timeline correction using a variation of the record tracking satellite that is decoupled from RTS but is capable of delivering timeline correction to all data vault artefacts, EFS and NH excepted. The chapter includes the SQL to populate and correct, scenarios and caveats to using the pattern. A sample of which is available here: https://bit.ly/3y4mUdV
· Getting the data out, we discuss the SQL needed, patterns of data vault queries, automating views over the standard data vault tables and automating views over a hub/link and all its satellites together for all data vault satellite table grains (multi-active/dependent-child/intra-day/regular)
· Query assistance structures and patterns are described and discussed for point-in-time (PIT) and bridge tables and where ghost records come into play.
· We follow that up by discussing data-driven dimensional modelling that is backed by data vault as the data warehouse
Towards the backend of the book we get into data warehouse migration towards a data vault with a decision tree on how and what to migrate plus how to deal with content overlap. This is followed by a discussion on integrating the various levels of data models into a single data vault model (enterprise, industry and application models).
· We also present variances of data vault, a metric vault, JIRA vault and a schema vault designed to support data vault automation. It powered this implementation, bit.ly/2YUw1xT
· As far as tools goes the book includes a checklist on what every data vault artefact must include, how to automate loads to them. A model scorecard for data vault inspired by Data Model Scorecard by Steve Hoberman and a guidance on how to Mob Model and how to setup a model register for collaboration and governance. And lastly a data vault automation tool scorecard.
The examples and samples in the book have a master data management flavour on how to use MDM as a source.
Who would benefit?
· Data Architects considering data vault,
· Data Vault Modellers for a reference guide,
· DevOps engineers tasked with building a data vault,
· Solution Architects designing where data vault fits,
· Data Analysts on how to query and use data vault,
· Data Migration specialists on what to consider migrating to data vault.
Where can you find the book?
If you want to know every detail on how to build a data vault, this book is your answer.
FR: https://amzn.to/37yfnKl
ES: https://amzn.to/3jl5tOr
#datavault #datawarehouse #analytics #datamodelling #thedatamustflow
Where to get more Data Vault?
A Data Vault Alliance exists that brings together Data Vault professionals from around the globe with decades worth of experience from delivering the classic batch-oriented data warehouse to real-time delivery. Discussions are both business and technical oriented and often the topics cover even the not-yet popular technical platforms as Data Vault is being used to deliver the modelling advantages you would expect!
#doitrightthefirsttime
You could also keep up to date with my blogs that are usually inspired by customer conversations or problems I try to solve for a users on DVA; the book goes into the SQL as well as a lot more detailed examples, you could practically lift the code from the book, parametrise it and deploy it.
- Data Vault 2.0 on Snowflake…To hash or not to hash… that is the question — an idea on how to deploy data vault in Snowflake given that Snowflake does not have indexes, bit.ly/3rH7gS5
- Data Vault Mysteries… Effectivity Satellite and Driver Key — the two concepts are intertwined, the article explores where they fit, the book contains the SQL on how to build it, and even query it, bit.ly/3lQosU9
- Data Vault Elevator Pitch, bit.ly/3qSmjIz
- Data Vault Mysteries… Zero Keys & Ghost Records — a Data Vault 1.0 practitioner are not familiar with this concept or the difference between the two, the article dives into their place and the book provides SQL on how to implement them, bit.ly/3ccwMcq
- Seven Deadly Sins of Fake Vault — an article designed to stir controversy between DV1.0 and DV2.0 practitioners. On the one hand you have modelling designed by a community not cognizant with the original intentional of the methodology, on the other you have a methodology designed to scale, the book is based on a methodology that scales bit.ly/3lGvevA
- Data Vault Mysteries… Business Vault, a common question when delivering a data vault is where the BV fits in, how to build it. Yes it doesn’t just exist, it must be designed and built, just like the source applications that populate raw vault. The book goes deeper into actual implementations of a BV, bit.ly/3rfV7V3
- A Rose by any other name… Wait.. is it still the same Rose? The importance of passive integration is highlighted here, and how to apply it is described in the book https://bit.ly/3xlFK0s
- Apache Spark GraphX and the Seven Bridges of Königsberg, the right tool for the right job is highlighted in this post. The article dives into how a business requirement for recursion that scales was delivered on Big Data, to be honest it would have been much simpler on a platform that provided the functionality we needed! Here we explore this journey and how the outcome was delivered in Business Vault, bit.ly/3ezZ6Wh
- Data Vault has a new Hero, this is the pattern getting a lot of attention in the data vault community now, XTS. It is (at the time of writing) something being considered in dbtvault, a data driven approach to timeline-correction! It is not without guardrails / caveats! The article demos the use-case, the book dives into a lot more with advanced scenarios and the SQL to build your own XTS, https://bit.ly/3y4mUdV
- Bring out your Dead… Data, https://bit.ly/3o8To3m
- What does dbt give you?, bit.ly/3ok9PsF
- Passive integration explained…, bit.ly/3pTWCXP
- How I can get away without paying the Pied Piper… in Data Vault 2.0, bit.ly/3iBfAzh
- Advantage Data Vault 2.0, a straight up comparison to DV1.0 and why it would not scale, bit.ly/2II5fVt
- Building Data Vault modelling capability through the Mob, this is a quick and effective methodology to deploying a Data Vault 2.0 at scale, the book dives into more and more tools at your disposal to define a governance council that focuses on rapid deployment of Data Vault 2.0 models, bit.ly/2NiceTN
- Learning Data Vault is Like Learning How to Make Beer! Why not? DV2.0 is not that scary, the book compares learning data vault 2.0 to photography instead! Like beer, photography only has three key elements to learn that drives everything you need to know about photography. Master those and every variation of photography is based on those three key elements! https://bit.ly/2ZYGpJP
- Data Vault or: how I learnt to stop worrying and love Data Governance, a short take on the DataOps movement and how Data Vault 2.0 was already geared to deliver this, bit.ly/35xhVGL
- Business Key Treatments, marry this content to the idea behind passive integration and you can see why this approach is invaluable. Row-level business key treatment assignment is possible in a data-driven approach that ensures the raw vault continues to integrate data across source systems, bit.ly/3pgNJaK
Data Vault 2.0 is not complicated; and that is why it works. By keeping every component decoupled, and integrate like microservices and cloud architecture, Data Vault 2.0 can infinitely scale. There’s even a checklist (chapter 12) in the book on what a data vault 2.0 automation tool must deliver in order be DV2.0 compliant!
Like every revolutionary methodology, the book is needed to discard the noise in the industry and focus on the components that will scale, do it right the first time. I do hint at the focus on business architecture in the book, yes, how else would you define the enterprise ontology and taxonomy if you don’t think about the business that is paying for your data platform….