Building Serverless ETL Pipelines with AWS Glue

Phong Trần
INNOMIZE
Jun 5, 2019

This post was originally published on our blog.

This article addresses the requirement of extracting data from an AWS RDS database and publishing it to S3. The detailed requirements are:

  • Support for data stored in Amazon Aurora and all other Amazon RDS engines, Amazon Redshift, and Amazon S3 (in formats such as XML and CSV).
  • Large data volumes.
  • Data schemas that can differ from source to source.

In addition:

  • The solution should be easy to move between environments, from Development → Test → User Acceptance Testing (UAT) → Staging → Production.
  • Hardware should scale automatically with data size.

After researching some solutions, I chose AWS Glue, the ETL service provided by AWS. It has three main components: the Data Catalog, Crawlers, and ETL Jobs. A Crawler extracts information (schema and statistics) from your data, the Data Catalog provides centralized metadata management, and ETL Jobs let you process data stored in AWS data stores with either Glue-generated scripts or your own custom scripts with additional libraries and JARs.

To get started, we need to design the architecture. You can follow my architecture for the ETL pipeline, shown below:

The architecture for the ETL pipeline

First, a schedule runs daily at 1 AM to start the AWS Glue Crawler, which generates the schema for our semi-structured data.

In this example, I use AWS RDS with flat data. If your data is mostly multilevel nested (such as XML), you should use Glue PySpark Transforms or Databricks Spark-XML to flatten it.

Once the data is cataloged, a CloudWatch event triggers a Lambda function to start the Glue job. A Python or Scala script then transforms the data as needed and publishes the results to S3.
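The article does not show the wiring for this event. Looking ahead to the CloudFormation setup in the sections below, a rough sketch might look like the following; the event pattern and the StartJobLambdaArn parameter (a Lambda that calls the Glue StartJobRun API) are my assumptions, not the author's template:

```yaml
# Sketch only: fire when the Glue crawler finishes successfully and
# invoke a Lambda function (passed in by ARN) that starts the Glue job.
Parameters:
  StartJobLambdaArn:
    Type: String
    Description: ARN of the Lambda function that calls glue:StartJobRun

Resources:
  CrawlerSucceededRule:
    Type: AWS::Events::Rule
    Properties:
      Description: React to the crawler finishing its run
      EventPattern:
        source:
          - aws.glue
        detail-type:
          - Glue Crawler State Change
        detail:
          state:
            - Succeeded
      Targets:
        - Arn: !Ref StartJobLambdaArn
          Id: StartGlueJobLambda

  # Let CloudWatch Events invoke the Lambda function
  AllowEventsToInvokeLambda:
    Type: AWS::Lambda::Permission
    Properties:
      Action: lambda:InvokeFunction
      FunctionName: !Ref StartJobLambdaArn
      Principal: events.amazonaws.com
      SourceArn: !GetAtt CrawlerSucceededRule.Arn
```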

Another CloudWatch event is set up to notify the support team about the status of the job.

We are looking for an automated way to deploy quickly to other environments, so I will guide you through defining a CloudFormation template that creates all of the necessary resources. I assume the data stores are in a private subnet of a VPC.

1. Setting Up Your Environment to Access Data Stores.

AWS Glue should sit in a private subnet to run your extract, transform, and load (ETL) jobs, but it also needs to access Amazon S3 from within the VPC to upload the report file, so a VPC endpoint is required.
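The original template is not reproduced here; a minimal sketch of such a template, with the VPC, route table, and security group IDs as placeholder parameters, might look like this:

```yaml
Parameters:
  VpcId:
    Type: AWS::EC2::VPC::Id
  PrivateRouteTableId:
    Type: String
  GlueSecurityGroupId:
    Type: AWS::EC2::SecurityGroup::Id

Resources:
  # Gateway endpoint so Glue can reach S3 from the private subnet
  S3Endpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      ServiceName: !Sub com.amazonaws.${AWS::Region}.s3
      VpcId: !Ref VpcId
      RouteTableIds:
        - !Ref PrivateRouteTableId

  # Self-referencing rule: allow all traffic between members of the
  # security group, as required for Glue connections
  SelfReferencingIngress:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      GroupId: !Ref GlueSecurityGroupId
      IpProtocol: "-1"
      SourceSecurityGroupId: !Ref GlueSecurityGroupId
```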

The above template creates an S3 endpoint resource and updates the security group with a self-referencing rule that allows all ports.

When the above stack is ready, check the Resources tab to find the details.

2. Populating the AWS Glue Resources.

Referring to the architecture above, we need to create several resources: an AWS Glue connection, database (catalog), crawler, job, trigger, and the role used to run the Glue job.
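The original template is not shown here; the sketch below outlines the kind of resources it declares. Names, the JDBC path, and the script location are placeholders, and the database credentials would normally come from parameters or a secrets store:

```yaml
Parameters:
  JdbcConnectionUrl:
    Type: String            # e.g. a JDBC URL for the RDS instance
  DbUsername:
    Type: String
  DbPassword:
    Type: String
    NoEcho: true
  PrivateSubnetId:
    Type: AWS::EC2::Subnet::Id
  GlueSecurityGroupId:
    Type: AWS::EC2::SecurityGroup::Id

Resources:
  # Role assumed by the crawler and the job
  GlueRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: glue.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole

  GlueDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: etl-catalog

  # JDBC connection to the RDS database in the private subnet
  GlueConnection:
    Type: AWS::Glue::Connection
    Properties:
      CatalogId: !Ref AWS::AccountId
      ConnectionInput:
        Name: rds-connection
        ConnectionType: JDBC
        ConnectionProperties:
          JDBC_CONNECTION_URL: !Ref JdbcConnectionUrl
          USERNAME: !Ref DbUsername
          PASSWORD: !Ref DbPassword
        PhysicalConnectionRequirements:
          SubnetId: !Ref PrivateSubnetId
          SecurityGroupIdList:
            - !Ref GlueSecurityGroupId

  GlueCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: rds-crawler
      Role: !GetAtt GlueRole.Arn
      DatabaseName: !Ref GlueDatabase
      Targets:
        JdbcTargets:
          - ConnectionName: !Ref GlueConnection
            Path: mydatabase/%      # placeholder schema path

  GlueJob:
    Type: AWS::Glue::Job
    Properties:
      Name: rds-to-s3-job
      Role: !GetAtt GlueRole.Arn
      Command:
        Name: glueetl
        ScriptLocation: s3://my-etl-bucket/scripts/etl_job.py   # placeholder
      Connections:
        Connections:
          - !Ref GlueConnection

  # Scheduled trigger: start the crawler daily at 1 AM (UTC)
  CrawlerTrigger:
    Type: AWS::Glue::Trigger
    Properties:
      Type: SCHEDULED
      Schedule: "cron(0 1 * * ? *)"
      StartOnCreation: true
      Actions:
        - CrawlerName: !Ref GlueCrawler
```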

When the stack is ready, check the Resources tab to confirm that all of the required resources have been created.

Note: You must upload the Glue job script to the S3 path that you specify in the CloudFormation template.

3. Monitoring AWS Glue Using Amazon CloudWatch Metrics.

As AWS notes: "You can profile and monitor AWS Glue operations using the AWS Glue job profiler. It collects and processes raw data from AWS Glue jobs into readable, near real-time metrics stored in Amazon CloudWatch. These statistics are retained and aggregated in CloudWatch so that you can access historical information for a better perspective on how your application is performing."

With these CloudWatch metrics and events, we can create a CloudWatch event rule that triggers an SNS topic, sending the team an email with the job state and a message about the result of the job.
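The original template is not included here; a minimal sketch, assuming an email subscription and the standard Glue job state change event, might be:

```yaml
Parameters:
  NotificationEmail:
    Type: String

Resources:
  JobStateTopic:
    Type: AWS::SNS::Topic
    Properties:
      Subscription:
        - Protocol: email
          Endpoint: !Ref NotificationEmail

  # Allow CloudWatch Events to publish to the topic
  JobStateTopicPolicy:
    Type: AWS::SNS::TopicPolicy
    Properties:
      Topics:
        - !Ref JobStateTopic
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: events.amazonaws.com
            Action: sns:Publish
            Resource: !Ref JobStateTopic

  # Rule that fires on Glue job state changes and notifies the team
  JobStateRule:
    Type: AWS::Events::Rule
    Properties:
      EventPattern:
        source:
          - aws.glue
        detail-type:
          - Glue Job State Change
        detail:
          state:
            - SUCCEEDED
            - FAILED
            - TIMEOUT
            - STOPPED
      Targets:
        - Arn: !Ref JobStateTopic
          Id: GlueJobStateToSns
```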

When the stack is ready, check the Resources tab to verify that the resources have been created.

Now that you have set up an entire data pipeline in an automated way with the appropriate notifications and alerts, it’s time to test your pipeline. Good luck.
