Implementing a Data Vault in BigQuery

Daniel Zagales
Published in The Startup · May 19, 2020

In my last post I covered how we implemented a fitness program at Pandera, the need for a custom solution to handle a little friendly competition, and an architecture to support it.

Designing a Fitness Leaderboard in GCP

In this post I want to focus on what a Data Vault model is and what it looks like in BigQuery. I am covering this first instead of the data pipeline as it really shapes the way I handle the messages and events that come in.

First, I want to start off by saying that Data Vault is overkill for this scenario. It really shines when you have high data velocity, auditing, and traceability requirements. We come across those scenarios a lot, especially in financial services, healthcare, and other regulated industries, which is why I wanted to get a better understanding of some of the gotchas during implementation.

The main concepts in Data Vault are Hub, Satellite, and Link tables. Hubs primarily store business keys and a hash of the business keys. Links store a relationship between hubs by capturing the associated business keys, their hashes, and a new hash of the combined business keys from both hubs. And Satellites store descriptive attributes about hubs and links. This is a basic explanation of Data Vault, but what this setup allows us to do is load data without much in the way of dependencies, so tables can be loaded in parallel without having to worry about much other than getting the data loaded. For the purposes of the Strava data we had two entities, the athlete and the activity. This is a fairly simple relationship, but it will demonstrate some of the things I learned implementing a Data Vault.
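To make those three table types concrete, here is a minimal DDL sketch of what they could look like in BigQuery for the athlete and activity entities. Treat it as an illustration rather than my exact schema: the '_seq' and '_load_date' columns match the ones discussed later in this post, while names like hub_load_date and hash_diff are shorthand assumptions on my part.

CREATE TABLE `strava-int.strava_datavault.athlete_hub` (
  athlete_hub_seq STRING NOT NULL,   -- MD5 hash of the business key
  athlete_id INT64 NOT NULL,         -- business key from Strava
  hub_load_date TIMESTAMP NOT NULL
);

CREATE TABLE `strava-int.strava_datavault.athlete_sat` (
  athlete_hub_seq STRING NOT NULL,   -- points back to the hub
  sat_load_date TIMESTAMP NOT NULL,
  hash_diff STRING,                  -- hash of the attributes, for change detection
  firstname STRING,
  lastname STRING
);

CREATE TABLE `strava-int.strava_datavault.athlete_activity_link` (
  athlete_activity_seq STRING NOT NULL,  -- hash of both business keys
  athlete_hub_seq STRING NOT NULL,
  activity_hub_seq STRING NOT NULL,
  link_load_date TIMESTAMP NOT NULL
);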

Lesson 1 — The number of tables involved. For those two entities alone I needed five tables:

  • activity_hub
  • activity_sat
  • athlete_hub
  • athlete_sat
  • athlete_activity_link

In most scenarios this could be as simple as one or three tables. Since the scope of this implementation is very small, the additional tables are not very noticeable, but on a large-scale implementation they could be a hassle. This is typically why automation plays a key part in Data Vault implementations, and because most of the patterns in these base tables repeat (hashing of business keys, hashing of attributes, inclusion of meta columns), this can be programmed with pretty low effort or handled by a COTS product. But for the time being I will be doing this manually.

First though I need to model those tables.

Data Vault Model

I had to rework this model a few times while developing it, mostly because I worked on this project between 4:45 am and 6:30 am. Which brings me to: Lesson 2 — ironing out the data model pays dividends in the long run. To do this you really need to analyze the data ahead of time and make sure you have a good understanding of the relationships, uniqueness, and attributes. Once this is done, though, you are in a good place, because one of the principles of Data Vault is to get the whole dataset into the vault. This means I do not have to reengineer the pipeline later just because I now need an extra field. Get it all: the cost of adding a field or ten during this phase is very low, but having to go back and add one later becomes a pain.

With the model set up, you’ll notice that each table has a few common columns, primarily the ‘_seq’ and ‘_load_date’ columns. These assist in a few different ways:

  1. Determining the latest record
  2. Identifying differences between records
  3. Focusing only on inserting records rather than updating existing ones

The ‘_seq’ columns are my hashes of the business key(s) that identify a unique instance of an entity. I used an MD5 hash because it is readily available in both Python and BigQuery. (A quick sketch of generating these hashes in BigQuery follows the list below.)

  • athlete_hub_seq — hash of athlete_id. I struggled with this one, mostly because the business key really should be the email address associated with Strava; it provides more contextual value than a random integer. However, with the scope I have for the API it is not available. Ideally, if I did have the hash of the email, I could relate this data to other company entities, which is powerful.
  • activity_hub_seq — hash of activity_id. Unlike the athlete_id, I am fine with this being based on a random integer. So many elements of an activity can be edited in the application that the only constant unique identifier is this ID.
  • athlete_activity_seq — hash of athlete_id and activity_id. This hash identifies the unique occurrence of an athlete performing an activity.
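As a rough illustration, this is how those hashes could be produced directly in BigQuery SQL. My pipeline actually computes them in Python with hashlib before loading, and the strava_staging.activities source table here is purely hypothetical.

SELECT
  TO_HEX(MD5(CAST(athlete_id AS STRING))) AS athlete_hub_seq,
  TO_HEX(MD5(CAST(activity_id AS STRING))) AS activity_hub_seq,
  -- concatenate both business keys with a delimiter before hashing
  TO_HEX(MD5(CONCAT(CAST(athlete_id AS STRING), '|',
                    CAST(activity_id AS STRING)))) AS athlete_activity_seq
FROM `strava-int.strava_staging.activities`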

Because of the above Data Vault features, my pipeline into BigQuery can be very simple: I can focus only on inserting records, and I have a clear method of distinguishing duplicates and determining the latest version. This is where Data Vault’s loading speed comes from. There is no need to look up whether a record already exists, or to search for an existing surrogate key; just insert and go.
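As a minimal sketch of that insert-and-go pattern, a satellite load is nothing more than an append. The staging table and the activity attributes (distance, moving_time) below are assumptions for illustration, not my exact load code.

INSERT INTO `strava-int.strava_datavault.activity_sat`
  (activity_hub_seq, sat_load_date, distance, moving_time, delete_ind)
SELECT
  TO_HEX(MD5(CAST(activity_id AS STRING))),
  CURRENT_TIMESTAMP(),  -- the load date stamps this version of the record
  distance,
  moving_time,
  FALSE
FROM `strava-int.strava_staging.activities`
-- no MERGE and no existence checks; duplicates and history are sorted out
-- downstream using the '_seq' and '_load_date' columns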

This does leave me with a bit of a gap, which brings me to:
Lesson 3 — Data Vaults are not suitable as a reporting layer. Which means… more tables, or in my case, views. I need to build a data mart on top of the Data Vault to serve my visualization tool.

Information Vault

You can see that I have two datasets: the first is my data vault, and the second is my information vault, which serves as the reporting layer. This is a very important aspect as it pertains to BigQuery and data security. At the time of writing, you are unable to secure individual tables or data elements; security is applied at the dataset level. This is not a huge issue, because you can simply create datasets to serve various security requirements.
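As a quick sketch, the two datasets can be created up front and access then granted per dataset through IAM. The strava_infovault name is my assumption, since I have not shown the reporting dataset’s actual name.

-- raw Data Vault, written to by the pipeline
CREATE SCHEMA IF NOT EXISTS `strava-int.strava_datavault`;

-- Information Vault / reporting layer exposed to the visualization tool
CREATE SCHEMA IF NOT EXISTS `strava-int.strava_infovault`;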

So my view layer consists of a single row-level view, fact_activity, where I have applied some logic:

WITH act_sat_latest AS (
  -- latest satellite load date for each activity
  SELECT activity_hub_seq,
    MAX(sat_load_date) AS latest_load_date
  FROM `strava-int.strava_datavault.activity_sat`
  GROUP BY activity_hub_seq)

SELECT DISTINCT ath_s.athlete_hub_seq
  , ath_s.firstname
  , ath_s.lastname
  , act_s.*
FROM act_sat_latest asl
-- keep only the latest version of each activity satellite record
JOIN `strava-int.strava_datavault.activity_sat` act_s
  ON (asl.activity_hub_seq = act_s.activity_hub_seq
  AND asl.latest_load_date = act_s.sat_load_date)
-- tie the activity back to the athlete that performed it
JOIN `strava-int.strava_datavault.athlete_activity_link` aal
  ON act_s.activity_hub_seq = aal.activity_hub_seq
JOIN `strava-int.strava_datavault.athlete_sat` ath_s
  ON aal.athlete_hub_seq = ath_s.athlete_hub_seq
-- exclude activities that have been deleted in the application
WHERE act_s.delete_ind = FALSE

This will serve as a denormalized springboard for my other aggregate views, each of which serves a specific purpose: calculating activities and time spent for each week or at an overall level. Additionally, there is a stats table where I collect some of the ancillary metrics like miles travelled, calories burned, etc.
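As an example of one of those aggregates, a weekly view over fact_activity might look like the sketch below. The strava_infovault dataset name and the start_date and moving_time columns are assumptions about what gets carried through from the activity satellite.

CREATE OR REPLACE VIEW `strava-int.strava_infovault.weekly_activity` AS
SELECT
  athlete_hub_seq,
  firstname,
  lastname,
  DATE_TRUNC(DATE(start_date), WEEK) AS week_start,
  COUNT(*) AS activity_count,
  ROUND(SUM(moving_time) / 3600, 1) AS hours_active  -- assumes moving_time is in seconds
FROM `strava-int.strava_infovault.fact_activity`
GROUP BY athlete_hub_seq, firstname, lastname, week_start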

A couple of quick notes on the BigQuery implementation: I am not doing any partitioning or clustering in this instance. The data is small enough that I would not get any real performance or cost savings from implementing them.
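For reference, if the data volume ever warranted it, a partitioned and clustered copy of a satellite could be created with something like the statement below; partitioning on the load date and clustering on the hash key would line up with how the tables are queried. This is a sketch, not something I have put in place.

CREATE TABLE `strava-int.strava_datavault.activity_sat_part`
PARTITION BY DATE(sat_load_date)
CLUSTER BY activity_hub_seq
AS SELECT * FROM `strava-int.strava_datavault.activity_sat`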

With all the structures put into place I can start building out the actual pipeline, which I will cover in my next post.

As a whole, I really like the Data Vault method. Just like other methods, there are clear guidelines that can make it successful, and it is important to really understand the overhead involved in that success. Given the resources a platform like Google Cloud provides, getting massively parallel processing out of our data pipelines is becoming simpler with a method like Data Vault.

If you have implemented a Data Vault on Google Cloud, let me know about your experience in the comments!
