A lightweight Data Vault automation framework with Snowflake

Conrad Fritzlar
ProSiebenSat.1 Tech Blog
7 min read · Mar 23, 2020

Are you one of the data pioneers wondering how to implement Data Vault in a real-time environment with as little manual work as possible?

Well, luckily you are not alone.

There are plenty of technologies and tools to choose from.

In this blog article, you’ll learn how the Joyn team is exploring the implementation of Data Vault in real time. We share our insights on the decision-making process and the transition from a stable but outdated BI landscape to cutting-edge technologies. With Apache Kafka, Tableau and Snowflake at the forefront, our journey takes us through the adoption of the Data Vault 2.0 methodology for a more flexible and standardised approach.

Learn why we chose Data Vault at Joyn and how it solves the real-time challenges in our evolving backend. Learn more about our automation dilemma, existing solutions and why we ultimately considered our own solution. And the best thing about it: we open-sourced our code on GitHub to make this a collaborative effort.

Slow processes lead to workarounds

Before talking about the fancy stuff, let’s take a look at how our BI landscape was shaped a couple of months ago. It involved Pentaho Data Integration as an ETL solution, an on-premises Inmon-style Postgres data warehouse, some MariaDB instances as the main data source, and a few other slightly dusty components. Still, it was running quite stably without major issues.

But here’s the thing: it took the developers approximately a year to build it initially and another couple of months to get it to an acceptable state. Recurring processes were only standardized to a certain degree and modeled by hand every time, so they depended on the individual developer’s style and took a very long time. On top of that, performance was an issue in many situations. The same applied to new data that needed to be integrated as the business evolved. And business analysts usually don’t want to wait until new data has made it to the data marts. What they do instead is build their own workarounds on top of raw data to provide insights to the business as soon as possible.

All elements working together

When the decision was made to create Joyn as a completely new video streaming platform, this was also the chance for the BI department to switch to a new technology stack. Apache Kafka became the new main data source since the microservices in the backend push their data to it. Tableau was the reporting tool of choice, and Snowflake, due to its unique features and high performance, became the data warehousing solution. We also decided to use Data Vault 2.0 for our core DWH layer. The picture below gives you a high-level overview of all components.

Joyn Data Warehouse Overview with Data Vault 2.0, Kafka and Tableau

A quick justification of why we decided to go for the Data Vault approach:

  • our backend is still a work in progress while we build the data model, so we have to expect frequent changes in the data structures and adjust to them on the spot
  • de-coupled, parallel loading is crucial in a real-time environment to ensure that load processes can keep up and data stays fresh
  • we simply don’t want to repeat the mistakes mentioned above, so we automate as much of the load process as possible and apply common standards

Automation of the load process

So far so good. Now let’s talk about the automation of the load process itself. First of all, we needed to make the classic make-or-buy decision.

The common opinion in our team is to use an existing proprietary or open-source solution and integrate it into our ecosystem. Why would you build something yourself when someone else already offers a stable solution for it? Unless you are a software company, of course.

There are a couple of vendors out there that support you with Data Vault automation. A selection can be found in the table below. Note that automation within those solutions is not necessarily restricted to development but can also be applied to documentation, visualization, deployment, testing, etc.

The concepts these tools use also vary quite a bit. On the one hand, there are SQL code generators that help you build the necessary database structures and load processes. On the other hand, there are extensive software solutions that support the whole lifecycle of your data from the source to the data marts. Furthermore, there are differences in how well they integrate with existing ETL / ELT products such as Talend or Informatica.

Due to our architecture, the automation tool had to meet the following basic requirements:

  • ability to load data from a Kafka source in (near) real-time
  • ability to import metadata from Kafka schema registry / detect schema changes
  • support for Snowflake as a source and target with full SQL support
  • support for Data Vault 2.0 methodology with hub, link, and sat load templates and hash key generation capabilities
  • support for profiling and processing JSON data structures
  • data lineage functionality and dependency visualization
  • support for Snowpipe or comparable file-loading functionality
  • preferably running in the cloud, no additional on-premise infrastructure needed

Snowflake was the answer

As it turned out, none of the tools we looked at had full support for all our requirements at the beginning of 2019. Furthermore, some of them added complexity to our existing stack because additional infrastructure like servers, databases, or scheduling solutions was needed. Of course, vendors are usually supportive and offer to develop missing features together with you, but this is still a time-consuming and potentially expensive process. Also keep in mind that we needed just a fairly simple load process, with a single main data source and no big migrations or legacy systems involved.

That’s why we took a closer look at what it would take to build an automation solution ourselves. In this process, it was very helpful to talk to Roelant Vos, who runs an open-source project dealing with Data Vault automation. He gave us valuable insights on how to handle metadata for raw vault loading, templating, and the virtualization of DWH layers. Unfortunately, we weren’t able to use his tool stack out of the box.

As we were already using AWS, it seemed obvious to also take a look at the tools provided in this huge ecosystem. We narrowed it down to AWS Lambda functions in combination with AWS Glue. However, when we were about to dig deeper into the actual setup, we stumbled upon a different solution: Snowflake had just released new features for their database, including Table Streams and Tasks. In combination with JavaScript stored procedures executed on the database, this looked like a lightweight solution, and we went for it.
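
To make these building blocks concrete, here is a minimal sketch of how a stream, a task, and a stored procedure can be wired together in Snowflake. The table, warehouse, and procedure names are placeholders rather than our production objects, and the schedule is purely illustrative.

```sql
-- Track new rows arriving in a (hypothetical) persistent stage table.
CREATE OR REPLACE STREAM stage.user_events_stream
  ON TABLE stage.user_events;

-- Run the load procedure periodically, but only if the stream actually has new data.
CREATE OR REPLACE TASK raw_vault.load_user_events_task
  WAREHOUSE = dwh_loading_wh
  SCHEDULE  = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('STAGE.USER_EVENTS_STREAM')
AS
  CALL raw_vault.load_user_events();  -- JavaScript stored procedure generated from metadata

-- Tasks are created in a suspended state and have to be resumed once.
ALTER TASK raw_vault.load_user_events_task RESUME;
```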

Data Vault Automation software and approaches

To get a first impression of how the load process we developed is orchestrated, let’s take a look at the workflows involved.

Let’s start with the source system and the connected interface. The Snowflake Connector for Kafka pushes data from Kafka topics to Snowflake. It runs on an Aiven Kafka Connect cluster and supports both JSON and Avro formats. This means every topic that exists in Kafka (and is of interest to the BI department) gets its own table in the persistent stage schema of the Snowflake DWH. Those tables are continuously updated and contain one column with metadata and another column with the JSON payload itself.
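
By default, the connector lands each message in two VARIANT columns: RECORD_METADATA (topic, offset, timestamp, key, etc.) and RECORD_CONTENT (the payload). A quick look into such a stage table might therefore resemble the query below; the table name and the payload field are illustrative.

```sql
-- Peek into a (hypothetical) stage table populated by the Snowflake Connector for Kafka.
SELECT
    record_metadata:topic::STRING       AS kafka_topic,
    record_metadata:offset::NUMBER      AS kafka_offset,
    record_metadata:CreateTime::NUMBER  AS kafka_timestamp,    -- Kafka message timestamp
    record_content:userId::STRING       AS user_id,            -- illustrative payload field
    record_content                      AS payload
FROM stage.user_events
ORDER BY kafka_offset DESC
LIMIT 10;
```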

The next task is to get this data into our Raw Data Vault with as little manual effort as possible. The bullet points below list the ingredients we used to achieve exactly that; a simplified load example follows after the list.

  • Snowflake JavaScript stored procedures with templates for Hub, Link, and Satellite creation and loading
  • A metadata schema with several tables to specify the source to target mappings, business keys, hashing, relevant attributes, etc.
  • Snowflake Streams to easily identify changed data in stage tables
  • Snowflake Tasks to schedule the statements defined in the stored procedures
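
To give you an idea of what such a template produces, here is a heavily simplified version of a hub load. The stored procedure assembles a statement like this from the metadata tables; the object names and the hashing convention shown here are illustrative, not our exact templates.

```sql
-- Simplified hub load: add business keys from the stream that the hub does not know yet.
MERGE INTO raw_vault.hub_user AS hub
USING (
    SELECT DISTINCT
        MD5(UPPER(TRIM(record_content:userId::STRING)))  AS hk_user,        -- hash key
        record_content:userId::STRING                    AS user_id,        -- business key
        CURRENT_TIMESTAMP()                              AS load_dts,
        'KAFKA.USER_EVENTS'                              AS record_source
    FROM stage.user_events_stream                        -- the stream only returns new rows
    WHERE record_content:userId IS NOT NULL
) AS src
ON hub.hk_user = src.hk_user
WHEN NOT MATCHED THEN
    INSERT (hk_user, user_id, load_dts, record_source)
    VALUES (src.hk_user, src.user_id, src.load_dts, src.record_source);
```

Consuming a stream in a DML statement advances its offset, so the next task run only sees records that arrived in the meantime; if several Data Vault objects are loaded from the same stream, those loads have to share one transaction (or use separate streams) so they all see the same change set.
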

The following diagram gives you an overview of how those components act together:

Data Vault Framework Components at Joyn using Snowflake

It is important to understand that we only use methods and features that Snowflake provides out of the box, thus leveraging its computing power. No additional software is needed to process the data into the Raw Vault. Manual work is usually limited to adding or changing entries in the corresponding metadata tables to adjust the load process.
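
To illustrate what “adding metadata” means, registering a new object can be as simple as inserting one mapping row. The metadata table and its columns below are hypothetical and only meant to convey the idea; our actual schema differs.

```sql
-- Hypothetical example: register a new satellite so the framework picks it up on the next run.
INSERT INTO metadata.dv_object_mapping
    (source_table, target_object, target_type, business_key_columns, payload_columns)
VALUES
    ('STAGE.USER_EVENTS', 'RAW_VAULT.SAT_USER_PROFILE', 'SATELLITE',
     'userId', 'firstName,lastName,country');
```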

Join us on this ongoing journey of discovery into the world of Joyn and the technical challenges and solutions our team is working on every day.

+++ This article was originally published by the author on the former Joyn Tech Blog and was moved to the P7S1 Tech Blog in March 2020. From now on you will find all content regarding Joyn on the P7S1 Tech Blog. +++


Conrad Fritzlar
ProSiebenSat.1 Tech Blog

Senior Software Engineer in the data team at Joyn. Interests: Data Warehousing (especially with Snowflake), Data Vault 2.0