Event-driven ELT with AWS Lambda

Alexander Hallsten
Published in CDP Insights
Oct 9, 2020


Have you ever wondered whether there is another way to build ETL than with traditional tools? One that truly utilizes the potential of the cloud?

When we started building our Customer Data Platform (CDP), named Data Talks Pro, the only constraint we put on ourselves was that it would have to be cloud-based. Partly because we didn’t own any infrastructure and partly because we wanted to try out the cloud and see what could be built there. So, we built the first version on a traditional ETL tool.

Not long after we started working with the previously mentioned tool, we figured there must be some other way to do it where we can also utilize the different services AWS offers and not just use it as an IaaS. This made us look into AWS Lambda as the main tool for our integration engine.

At Data Talks, we have always prided ourselves on looking at new technology and not getting stuck in the old. So, one of the biggest reasons for choosing AWS Lambda as an ETL tool was the chance to test out something completely different from what any of us had worked with before.

AWS Lambda is one of the core services AWS provides. It is a function-as-a-service offering that lets you run snippets of code without needing your own infrastructure, or even any virtual machines. A function can be triggered by other AWS services, by a schedule, or directly from another application.
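To make that concrete, here is a minimal sketch of what a Lambda function looks like when triggered by an S3 upload. The handler name and return shape are illustrative; the event structure for S3 triggers is the one AWS delivers.

```python
import json

def handler(event, context):
    # A Lambda entry point receives the triggering event as a dict.
    # For an S3 trigger, the bucket and object key live under event["Records"].
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]
    # Real processing would happen here; we just echo what arrived.
    return {"statusCode": 200, "body": json.dumps({"bucket": bucket, "key": key})}
```

The same handler signature works whether the trigger is S3, a schedule, or a direct invocation; only the shape of `event` changes.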

What AWS Lambda allowed us to do was to build a generic integration engine that could handle files from multiple sources with different formats. We do not have to develop new flows just because we get a new customer. We just need to add some data to the metadata database.

This allows us to onboard new customers faster than if we would have continued using our previous tool. If we do have to develop new tables or schemas that need to be loaded, we only have to create the stored procedures for loading the tables and specify them in the metadata database.
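The post does not show the metadata schema, so the following is a hypothetical sketch of how metadata-driven onboarding can work: a new customer is a new entry in the metadata, and a lookup function picks the right ingestion path, with no new flow code deployed.

```python
import fnmatch

# Hypothetical metadata entry: one record per customer, added at onboarding.
METADATA = {
    "acme": {
        "file_patterns": {"orders_*.csv": "csv_loader", "events_*.json": "json_loader"},
        "procedures": ["staging.load_orders", "dwh.merge_orders"],
    },
}

def loader_for(customer, filename, metadata=METADATA):
    """Resolve which ingestion Lambda handles a file by matching its name
    against the customer's registered file patterns."""
    for pattern, loader in metadata[customer]["file_patterns"].items():
        if fnmatch.fnmatch(filename, pattern):
            return loader
    raise ValueError(f"no loader registered for {filename}")
```

Onboarding "acme" is just the `METADATA` row above; `loader_for("acme", "orders_2020.csv")` then routes the file to the CSV loader.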

ELT instead of ETL

We were also interested in switching from ETL to ELT. The traditional way of ingesting data into a data warehouse is through a process called ETL: Extract, Transform, Load. The process extracts data from the source system and transforms it before loading it into the data warehouse. ELT (Extract, Load, Transform) turns that around and loads the data before the transformations happen.

This means that the data is stored in the data warehouse at an earlier stage. The main advantage of ELT over ETL is the speed of loading data into the data warehouse, as well as the speed with which new data can be ingested.

What we have done is take ELT and divide it into a microservice architecture, which allows us to be more agile in our development process and reuse more of the code we have already developed.

So how does it work?

The integration tool, which we have named CELT, consists of several AWS services and several different AWS Lambdas.

When a file arrives on S3, it will trigger a Lambda that will move the file to another folder in the bucket. Then it will assign a load id to the file by looking up the customer in the metadata and checking what the next id should be. After the load id is assigned and the file is moved to the correct folder, the next Lambda to be triggered will depend on the file format so that the correct ingestion into our data warehouse is done.
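The first two steps above, assigning the next load id from the metadata and computing the file's new location, can be sketched as pure functions. The names and folder layout are assumptions, not the actual implementation; the move itself would be an `s3.copy_object` followed by an `s3.delete_object` in the real Lambda.

```python
def next_load_id(metadata, customer):
    """Look up the customer's last load id in the metadata and return the next one.
    A customer with no previous loads starts at 1."""
    last = metadata.get(customer, {}).get("last_load_id", 0)
    return last + 1

def processed_key(key, load_id):
    """Build the destination key for the moved file, tagging it with its load id.
    (Hypothetical folder layout; the real bucket structure is not shown in the post.)"""
    filename = key.rsplit("/", 1)[-1]
    return f"processed/{load_id}/{filename}"
```

For example, if "acme" last loaded id 41, an incoming `incoming/acme/data.csv` would be assigned id 42 and moved to `processed/42/data.csv`.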

When the file has been ingested into a staging schema in our data lakehouse, the next Lambda starts. It looks at a table in the metadata to see which transformations should be done, controlled by the name of the file and which customer the file belongs to. This part of the ELT process "loops" until there are no more steps in the metadata table.
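That metadata-driven loop can be sketched as follows. The step schema and the idea of calling stored procedures by name are assumptions based on the description above; `execute` stands in for whatever client runs SQL against the warehouse.

```python
def run_transformations(steps, execute):
    """Run the ordered transformation steps registered in the metadata table
    for this customer/file, calling each stored procedure in turn,
    until no steps remain."""
    completed = []
    for step in sorted(steps, key=lambda s: s["order"]):
        # e.g. execute would issue CALL <procedure>() against the warehouse
        execute(step["procedure"])
        completed.append(step["procedure"])
    return completed
```

Adding a new transformation for a customer is then a new row in the steps table, not a code change.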

The Lambdas are triggered by sending messages to each other over AWS SNS (Simple Notification Service), a publish/subscribe messaging service that pushes messages to the configured subscribers. The advantage of a push architecture over a pull architecture is that the receiving services can lie dormant until the earlier step triggers them.
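A hand-off between steps can be as small as one `publish` call. This is a sketch assuming a boto3-style SNS client; the payload fields and topic name are hypothetical, but `publish(TopicArn=..., Message=...)` is the standard SNS client call.

```python
import json

def publish_next_step(sns_client, topic_arn, payload):
    """Hand the file's context (customer, key, load id, ...) to the next
    Lambda by publishing it to an SNS topic; SNS then pushes the message
    to every Lambda subscribed to that topic."""
    return sns_client.publish(TopicArn=topic_arn, Message=json.dumps(payload))
```

Each Lambda in the chain ends by publishing to the next topic, so the flow advances without any step polling for work.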

What are the advantages of this compared to a more traditional ETL tool?

As mentioned above, this allows us to have a multi-tenant solution where we can onboard new customers in a short amount of time, since we only need to add data in the metadata database. Perhaps we need to create some new transformations, but we don’t need to redeploy or rebuild the ingestion flow.

One of the advantages of a fully serverless architecture is the ease with which we can scale the ingestion flow. We don't have to deal with spinning up additional EC2 instances when we need more capacity.


Senior consultant at Rebtech, interested in how data can be used to make better decisions.