Data Vault 2.0 + Step Functions + Athena + S3 = Data Warehouse/Data Lake?
Introduction
This article describes a proof-of-concept architecture for a data warehouse / data lake built in AWS using:
- S3, to store the data
- Athena, to ingest and transform the data
- Step Functions, to orchestrate the data ingestion and transformation
- Data Vault 2.0, as a methodology to design and implement the data warehouse
The article assumes that you have a good knowledge of AWS (S3, Athena, CLI, and Cloudformation).
The article starts with some disclaimers, continues by describing the motivation for this design, what we are going to build, and how to build it. Conclusions and next steps close the article.
The article is part of a series. You can find the second part here.
Disclaimers
All views expressed in this article are my own and do not represent the opinions of any entity whatsoever with which I have been, am now, or will be affiliated.
The design proposed in this article is nothing more than a proof of concept. More testing and work are needed before declaring this design production-ready. I will further test and optimize the design in the coming weeks and describe my findings in future articles.
Motivation
I’ve spent my entire career, more than 20 years, building analytical systems. When I started, there were few options for building such systems. I’m thinking of methodologies like the one proposed by Ralph Kimball (https://en.wikipedia.org/wiki/Ralph_Kimball) and technologies like Oracle, Ab Initio, Informatica PowerCenter, or Teradata, to name a few.
Two innovations brought data warehousing to the next level.
The first is the Data Vault 2.0 methodology (https://datavaultalliance.com/), which gave us an incredibly powerful way of designing and implementing data warehouses.
The second is the advent of the cloud, which brought us new tools to build incredibly scalable analytical systems. Three cloud characteristics are enabling this:
- the separation of compute and storage
- the virtually unlimited storage capacity
- an elastic compute capacity that can rapidly scale up/down
It is interesting to note that few cloud analytics platforms are actually able to fully leverage these three characteristics. My favourite among them is Snowflake (https://www.snowflake.com/).
I’m a very curious person and I’m always looking for new technologies to improve the way we build analytical systems. One week ago, I had a look at AWS Step Functions (https://aws.amazon.com/step-functions/) and asked myself: can I use it to orchestrate the loading of a data warehouse? The answer is yes, and I wrote this article to share my experience with you.
What we are going to build
We are going to load contact information from a CRM system into the raw layer of a data warehouse / data lake designed using the Data Vault 2.0 methodology.
The contact information is stored in a CSV file with the following columns:
- id, the id of the contact
- account_id, the account id of the contact
- first name
- last name
- address
We are going to load the contact information into one hub and one satellite: the contact_hub and contact_sat tables.
The contact_hub table has the following attributes:
- id, the contact id
- load_ts, the timestamp of the record load
- src, the name of the source system (‘CRM’, in this case)
The contact_sat table has the following attributes:
- id, the contact id
- hash, the MD5 of the contact attributes (used to implement the slowly changing dimension type 2, SCD2, logic)
- load_ts, the timestamp of the record load
- first_name
- last_name
- address
This design is OK for a POC. In a real implementation I would move the address to a separate satellite, since it has a different rate of change than the first and last name.
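The hash column is what drives the SCD2 comparison: if a reloaded record produces the same hash as the stored version, no new version is needed. In the POC the hash is computed inside the Athena load queries; purely as an illustration (the function name and delimiter below are my own, not the repo's), the same row hash could be computed like this:

```python
import hashlib

def contact_hash(first_name: str, last_name: str, address: str) -> str:
    """MD5 over the satellite's descriptive attributes.

    A delimiter between attributes avoids collisions such as
    ("ab", "c") hashing the same as ("a", "bc").
    """
    payload = "|".join([first_name, last_name, address])
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

# Reloading an unchanged record yields the same hash,
# so no new SCD2 version is inserted.
print(contact_hash("John", "Doe", "1 Main St"))
```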
Both tables are stored in S3 as parquet files and their metadata are stored in the AWS Glue catalog.
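The actual table definitions live in dwh-stack.yaml as Glue resources. Just to illustrate the resulting layout (the S3 location and column types here are my assumptions, not the template's content), the equivalent hand-written Athena DDL for contact_sat would look roughly like:

```sql
-- Hypothetical DDL: the real tables are created by the
-- CloudFormation/Glue resources in dwh-stack.yaml.
CREATE EXTERNAL TABLE dwh_raw.contact_sat (
  id         bigint,
  hash       string,
  load_ts    string,
  first_name string,
  last_name  string,
  address    string
)
STORED AS PARQUET
LOCATION 's3://medium-dwh/raw/contact_sat/';
```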
Orchestrating the data loading is the duty of a Step Functions state machine with two parallel states, as shown in the picture below.
The state machine calls Athena to execute the two queries that actually load the data into the target tables.
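The real definition is the step-function.json file in the repo; as a rough sketch only (the query strings and workgroup here are placeholders, not the repo's actual content), a two-branch parallel state machine using the synchronous Athena integration could look like:

```json
{
  "Comment": "Illustrative sketch: load hub and satellite in parallel",
  "StartAt": "LoadRawLayer",
  "States": {
    "LoadRawLayer": {
      "Type": "Parallel",
      "End": true,
      "Branches": [
        {
          "StartAt": "LoadContactHub",
          "States": {
            "LoadContactHub": {
              "Type": "Task",
              "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
              "Parameters": {
                "QueryString": "insert into dwh_raw.contact_hub ...",
                "WorkGroup": "dwh"
              },
              "End": true
            }
          }
        },
        {
          "StartAt": "LoadContactSat",
          "States": {
            "LoadContactSat": {
              "Type": "Task",
              "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
              "Parameters": {
                "QueryString": "insert into dwh_raw.contact_sat ...",
                "WorkGroup": "dwh"
              },
              "End": true
            }
          }
        }
      ]
    }
  }
}
```

The `.sync` suffix makes each task wait for its Athena query to finish, so the state machine only succeeds when both loads have completed.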
We are going to execute an initial data load and then a second one to test the correctness of the data load logic.
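The correctness test hinges on the SCD2 delta logic: a staging row is inserted only when no satellite row with the same id and hash already exists. The real implementation is an Athena SQL query in the repo; sketched in Python (names and shapes here are illustrative), the idea is:

```python
def satellite_delta(staging_rows, current_rows):
    """Return the staging rows that need a new satellite version.

    staging_rows / current_rows: iterables of (id, hash) tuples.
    A row is inserted when its (id, hash) pair is not already
    present: the record is brand new or an attribute changed.
    For simplicity this compares against all stored versions;
    the real query would compare against the latest version per id.
    """
    existing = set(current_rows)
    return [row for row in staging_rows if row not in existing]

staging = [(0, "h0-new"), (1, "h1-new"), (2, "h2"), (9999997, "h7")]
current = [(0, "h0"), (1, "h1"), (2, "h2")]
# ids 0 and 1 get a new version, 2 is unchanged, 9999997 is brand new
print(satellite_delta(staging, current))
```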
How to build it
The GitHub repo https://github.com/TheDataStrategist/s3-athena-stepfunction-dwh contains all the code and data, with the exception of the state machine, which we will build using the AWS console.
Let’s start by creating a directory and cloning the repository into it:
git clone https://github.com/TheDataStrategist/s3-athena-stepfunction-dwh
The file dwh-stack.yaml contains the CloudFormation template for:
- the S3 bucket that will contain all our data
- the Glue databases and tables
- the Athena workgroup
The template contains the parameter S3Location, which determines the name of the S3 bucket. Since S3 bucket names must be unique, choose an S3 bucket name and create the CloudFormation stack using the following command:
aws cloudformation create-stack --template-body file://dwh-stack.yaml --stack-name dwh --parameters ParameterKey=S3Location,ParameterValue="<S3 bucket name>"
I’ve chosen the name “medium-dwh” for the S3 bucket, and therefore the command is
aws cloudformation create-stack --template-body file://dwh-stack.yaml --stack-name dwh --parameters ParameterKey=S3Location,ParameterValue="medium-dwh"
Once the stack is created, the Glue data catalog should contain the following databases and tables.
Let’s upload the initial contact data using the command:
aws s3 cp contact.csv s3://medium-dwh/stg/contact/
The stack also created the Athena workgroup “dwh”. Let’s connect to the AWS web console, select the Athena service, and switch the workgroup to “dwh”.
Then let’s insert the ghost record (a placeholder row used in Data Vault to represent missing or unknown references) into the contact_sat table by executing the following query:
insert into dwh_raw.contact_sat (id, hash, first_name, last_name, address, load_ts) values (-99999999, '0d1b08c34858921bc7c662b228acb7ba', '?', '?', '?', cast(current_timestamp as varchar))
Now we can proceed and create the state machine that will load the data. Open the AWS web console, choose the “Step Functions” service, and choose “State machines” from the left menu.
Choose “Create state machine” in the next screen
We are going to create a standard state machine by authoring with code snippets
In the definition section, paste the code contained in the file step-function.json from the GitHub repo and click “Next”.
In the next screen choose “dwh-loader-raw-layer” as name, leave all other parameters to their default values, and press “Create state machine”
It is now time to execute the state machine by pressing “Start execution”.
Choose a meaningful name for the execution, like “first-load”, and press “Start execution”.
The first load should finish successfully as in the screenshot below
Let’s check that the data have been loaded correctly by opening Athena and executing a couple of queries:
select count(*) from dwh_raw.contact_sat
select * from dwh_raw.contact_sat order by 1
The contact_sat has been loaded correctly. Let’s do the same test for the contact_hub:
select count(*) from dwh_raw.contact_hub;
select * from dwh_raw.contact_hub order by id;
It is now time to execute a second data load using a different file, contact-delta.csv. Let’s remove contact.csv from the S3 bucket and add contact-delta.csv:
aws s3 rm s3://medium-dwh/stg/contact/contact.csv
aws s3 cp contact-delta.csv s3://medium-dwh/stg/contact/
The file contact-delta.csv contains 6 records.
Loading contact-delta.csv will:
- create one new version for each of the records with ids 0 and 1, since the new records differ from the existing ones
- leave the record with id 2 unchanged, since the records are identical in both files
- create three new records with ids 9999997, 9999998, and 9999999
Let’s load the file by executing the step function again
Execute the following query to check that the records with ids 0, 1, and 2 have been loaded correctly:
select * from dwh_raw.contact_sat where id in (0,1,2) order by 1
The records have been loaded correctly.
Execute the following query to check that the records with ids 9999997, 9999998, and 9999999 have been loaded correctly:
select * from dwh_raw.contact_sat where id in (9999997,9999998,9999999) order by 1
We now need to check that the contact_hub table has been loaded correctly. We use the following query:
select count(*) from dwh_raw.contact_hub
This is correct, since only the three new records with ids 9999997, 9999998, and 9999999 have been loaded.
Conclusions and next steps
This article demonstrated that it is possible to create a data warehouse in AWS using:
- S3, to store the data
- Athena, to ingest and transform the data
- Step Functions, to orchestrate the data ingestion and transformation
- Data Vault 2.0, as a methodology to design and implement the data warehouse
The performance of this architecture is analyzed in the second article of the series.
In future posts I plan to investigate the cost of running a data warehouse on this architecture. I also plan to compare this architecture with a similar one that uses Amazon Managed Workflows for Apache Airflow instead of Step Functions.
Thank you for reading.