Data Vault 2.0 + Step Functions + Athena + S3 = Data Warehouse/Data Lake?

Daniel Mannino · Published in The Startup · Jan 23, 2021

Introduction

This article describes a proof-of-concept architecture for a data warehouse / Data Lake built in AWS using:

  • S3, to store the data
  • Athena, to ingest and transform the data
  • Step Functions, to orchestrate the data ingestion and transformation
  • Data Vault 2.0, as the methodology to design and implement the data warehouse architecture

The article assumes a good working knowledge of AWS (S3, Athena, the CLI, and CloudFormation).

The article starts with some disclaimers, continues with the motivation for this design, describes what we are going to build and how to build it, and closes with conclusions and next steps.

This article is part of a series; you can find the second part here.

Disclaimers

All views expressed in this article are my own and do not represent the opinions of any entity whatsoever with which I’ve been, am now, or will be affiliated.

The design proposed in this article is nothing more than a proof-of-concept. More tests and work have to be done before declaring this design production-ready. I will further test and optimize the design in the coming weeks and describe my findings in future articles.

Motivation

I’ve spent my entire career, more than 20 years, building analytical systems. When I started, there were few options for building such systems: methodologies like the one proposed by Ralph Kimball (https://en.wikipedia.org/wiki/Ralph_Kimball) and technologies like Oracle, Ab Initio, Informatica PowerCenter, or Teradata, to name a few.

Two innovations brought data warehousing to the next level.

The first is the Data Vault 2.0 methodology (https://datavaultalliance.com/), which gave us an incredibly powerful way of designing and implementing data warehouses.

The second is the advent of the cloud, which brought us new tools to build incredibly scalable analytical systems. Three cloud characteristics enable this:

  • the separation of compute and storage
  • the virtually unlimited storage capacity
  • an elastic compute capacity that can rapidly scale up/down

It is interesting to note that only a few cloud analytics platforms are actually able to fully leverage these three characteristics. My favourite among them is Snowflake (https://www.snowflake.com/).

I’m a very curious person, and I’m always looking for new technologies to improve the way we build analytical systems. One week ago I had a look at AWS Step Functions (https://aws.amazon.com/step-functions/) and asked myself: can I use it to orchestrate the load of a data warehouse? The answer is yes, and I wrote this article to share my experience with you.

What we are going to build

We are going to load contact information from a CRM system into the raw layer of a data warehouse / Data Lake designed using the Data Vault 2.0 methodology.

The contact information is stored in a csv file with the following columns:

  • id, the id of the contact
  • account_id, the account id of the contact
  • first name
  • last name
  • address
Sample content of the contact.csv file

We are going to load the contact information into one hub and one satellite: the contact_hub and contact_sat tables.

The contact_hub table has the following attributes:

  • id, the contact id
  • load_ts, the timestamp of the record load
  • src, the name of the source system (‘CRM’, in this case)

The contact_sat table has the following attributes:

  • id, the contact id
  • hash, the MD5 hash of the contact attributes (used to implement the SCD2 logic)
  • load_ts, the timestamp of the record load
  • first_name
  • last_name
  • address

This design is OK for a POC. In a real implementation I would move the address to a separate satellite, since it has a different rate of change than the first and last name.

Both tables are stored in S3 as Parquet files, and their metadata is stored in the AWS Glue catalog.
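To make the target structures concrete, here is a sketch of what the Athena DDL for the two tables could look like. The real definitions are created by the CloudFormation template dwh-stack.yaml, so the column types and S3 locations below are only illustrative and assume the “medium-dwh” bucket name used later in the article.

-- Illustrative DDL only: the actual tables are created by dwh-stack.yaml
CREATE EXTERNAL TABLE IF NOT EXISTS dwh_raw.contact_hub (
  id      bigint,
  load_ts string,
  src     string
)
STORED AS PARQUET
LOCATION 's3://medium-dwh/raw/contact_hub/';

CREATE EXTERNAL TABLE IF NOT EXISTS dwh_raw.contact_sat (
  id         bigint,
  hash       string,
  load_ts    string,
  first_name string,
  last_name  string,
  address    string
)
STORED AS PARQUET
LOCATION 's3://medium-dwh/raw/contact_sat/';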

Orchestrating the data load is the duty of a Step Functions state machine with two parallel states, as shown in the picture below

Step function that loads the data

The step function calls Athena to execute two queries that actually load the data into the target tables.
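The exact SQL is part of the step-function.json definition in the repo. As a hedged sketch of the Data Vault loading pattern, not the author’s verbatim queries, and assuming the stack defines a typed staging table named dwh_stg.contact over the stg/contact/ prefix (the name is illustrative), the two loads could look roughly like this:

-- Hub load: insert only business keys that are not yet in the hub
INSERT INTO dwh_raw.contact_hub (id, load_ts, src)
SELECT stg.id,
       cast(current_timestamp AS varchar),
       'CRM'
FROM dwh_stg.contact stg
WHERE stg.id NOT IN (SELECT id FROM dwh_raw.contact_hub);

-- Satellite load (SCD2): insert a new version only when the hash of the
-- incoming attributes differs from the latest version stored in the satellite
INSERT INTO dwh_raw.contact_sat (id, hash, load_ts, first_name, last_name, address)
SELECT stg.id,
       lower(to_hex(md5(to_utf8(concat(stg.first_name, '|', stg.last_name, '|', stg.address))))),
       cast(current_timestamp AS varchar),
       stg.first_name,
       stg.last_name,
       stg.address
FROM dwh_stg.contact stg
LEFT JOIN (
    SELECT id, max_by(hash, load_ts) AS latest_hash
    FROM dwh_raw.contact_sat
    GROUP BY id
) cur ON cur.id = stg.id
WHERE cur.id IS NULL
   OR cur.latest_hash <> lower(to_hex(md5(to_utf8(concat(stg.first_name, '|', stg.last_name, '|', stg.address)))));

Because the hub and the satellite loads do not depend on each other, the two statements can safely run in the two parallel branches of the state machine.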

We are going to execute an initial data load and then a second one to test the correctness of the data load logic.

How to build it

The GitHub repo https://github.com/TheDataStrategist/s3-athena-stepfunction-dwh contains all the code and data, with the exception of the step function, which we will build using the AWS web interface.

Let’s start by creating a directory and cloning the repository into it:

git clone https://github.com/TheDataStrategist/s3-athena-stepfunction-dwh

The file dwh-stack.yaml contains the CloudFormation template for:

  • the S3 bucket that will contain all our data
  • the Glue databases and tables
  • the Athena workgroup

The template contains the parameter S3Location, which determines the name of the S3 bucket. Since S3 bucket names must be unique, choose an S3 bucket name and create the CloudFormation stack using the following command:

aws cloudformation create-stack --template-body file://dwh-stack.yaml --stack-name dwh --parameters ParameterKey=S3Location,ParameterValue="<S3 bucket name>"

I’ve chosen the name “medium-dwh” for the S3 bucket, and therefore the command is

aws cloudformation create-stack --template-body file://dwh-stack.yaml --stack-name dwh --parameters ParameterKey=S3Location,ParameterValue="medium-dwh"

Once the stack is created, the Glue data catalog should contain the following databases and tables

The Glue data catalog

Let’s upload the initial contact data using the command

aws s3 cp contact.csv s3://medium-dwh/stg/contact/

The stack also contains the Athena workgroup “dwh”. Let’s open the AWS web interface, select the Athena service, and switch the workgroup to “dwh”
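Before loading anything, a quick sanity check from the Athena query editor confirms that the raw tables are in place (SHOW TABLES is standard Athena syntax, and dwh_raw is the database used in the queries below):

-- List the tables created by the stack in the raw database
SHOW TABLES IN dwh_raw;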

Then let’s insert the ghost record into the contact_sat table by executing the following query. (In Data Vault 2.0, the ghost record is a dummy satellite row that ensures downstream joins and point-in-time structures always find a match.)

insert into dwh_raw.contact_sat (id, hash, first_name, last_name, address, load_ts) values (-99999999, '0d1b08c34858921bc7c662b228acb7ba', '?', '?', '?', cast(current_timestamp as varchar))

Creation of the ghost record in the contact_sat table
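A simple query, with no assumptions beyond the table used above, verifies that the ghost record landed in the satellite:

-- The ghost record should be the only row in the satellite at this point
select * from dwh_raw.contact_sat where id = -99999999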

Now we can proceed and create the step function that will load the data. Open the AWS web interface, choose the “Step Functions” service, and choose “State machines” from the menu on the left

Creation of the step function

Choose “Create state machine” in the next screen

State machine creation

We are going to create a Standard state machine, authoring it with code snippets

In the definition section, paste the code contained in the file step-function.json from the GitHub repo and click “Next”

Step function definition

In the next screen, choose “dwh-loader-raw-layer” as the name, leave all other parameters at their default values, and press “Create state machine”

Step function details

It is now time to execute the step function by pressing “Start execution”

Step function execution

Choose a meaningful name for the execution, like “first-load”, and press “Start execution”

The first load should finish successfully as in the screenshot below

Let’s check that the data have been loaded correctly by opening Athena and executing a couple of queries

select count(*) from dwh_raw.contact_sat

select * from dwh_raw.contact_sat order by 1

The contact_sat has been loaded correctly. Let’s do the same test for the contact_hub

select count(*) from dwh_raw.contact_hub;

select * from dwh_raw.contact_hub order by id;

It is now time to execute a second data load using a different file, contact-delta.csv. Let’s remove contact.csv from the S3 bucket and upload contact-delta.csv

aws s3 rm s3://medium-dwh/stg/contact/contact.csv

aws s3 cp contact-delta.csv s3://medium-dwh/stg/contact/

The file contact-delta.csv contains 6 records

contact-delta.csv

Loading contact-delta.csv will:

  • create one new version for each of the records with id 0 and 1, since the new records are different from the existing ones
  • leave the record with id 2 unchanged, since the records are identical in both files
  • create three new records with ids 9999997, 9999998, and 9999999

Let’s load the file by executing the step function again

delta-load

Execute the following query to check that the records with id 0, 1, and 2 have been loaded correctly

select * from dwh_raw.contact_sat where id in (0,1,2) order by 1

The records have been loaded correctly.
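As an additional check that follows directly from the expectations listed above, counting the versions per id should show two versions each for ids 0 and 1, and a single version for id 2:

-- Count the satellite versions per contact id after the delta load
select id, count(*) as versions
from dwh_raw.contact_sat
where id in (0, 1, 2)
group by id
order by id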

Execute the following query to check that the records with id 9999997, 9999998, and 9999999 have been loaded correctly

select * from dwh_raw.contact_sat where id in (9999997,9999998,9999999) order by 1

We now need to check that the contact_hub table has been loaded correctly, using the following query

select count(*) from dwh_raw.contact_hub

This is correct, since only the three new records with ids 9999997, 9999998, and 9999999 have been added to the hub.
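As a final optional check, and assuming the load query stamps every row of a run with a single current_timestamp value (as in the sketch earlier in the article), grouping the hub by load timestamp shows how many keys each of the two executions inserted:

-- Group hub rows by load timestamp to see how many keys each run inserted
select load_ts, count(*) as records
from dwh_raw.contact_hub
group by load_ts
order by load_ts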

Conclusions and next steps

This article demonstrated that it is possible to create a data warehouse in AWS using:

  • S3, to store the data
  • Athena, to ingest and transform the data
  • Step Functions, to orchestrate the data ingestion and transformation
  • Data Vault 2.0, as the methodology to design and implement the data warehouse

The performance of this architecture is analyzed in this article.

In future posts I plan to investigate the cost of running a data warehouse on this architecture. I also plan to compare this architecture with a similar one that uses Amazon Managed Workflows for Apache Airflow instead of Step Functions.

Thank you for reading.
