Automating ETL jobs on AWS using Glue, Lambda, EventBridge and Athena

kimera joseph
5 min read · Jan 11, 2022


Introduction

This article will cover one of the ways to move on-premise files to an Amazon Web Services data lake (S3) and automatically trigger other services that operate on the data, making it instantly available for analysis using SQL, Spark, Python, and other languages.

Details

This can be done using an AWS template, as detailed at this link.

But I believe it's better to learn how to inter-link each service yourself, just in case you are required to create an alert or modify an existing pipeline.

We are utilizing the following AWS services:

  1. Simple Storage Service (S3)
  2. Glue Crawler
  3. Lambda
  4. Glue ETL job
  5. EventBridge
  6. Simple Notification Service (SNS)

Image showing our data lake automation workflow

Below are the Steps

  1. Convert the files from CSV (or any other format) to Parquet. Parquet files are often more than 70% smaller than the equivalent CSV files
# import all necessary libraries
import pandas as pd
import numpy as np
import boto3
import pyarrow as pa
import pyarrow.parquet as pq
from s3fs import S3FileSystem
import snappy

# read the csv file into a pandas df and convert it to parquet
df = pd.read_csv("/path/to/file/file.csv")
df.to_parquet("/path/to/file/file.parquet", engine='auto', compression='snappy')

2. Create an S3 client using the boto3 library

# boto3 can also pick up credentials from environment variables or an IAM role
s3_client = boto3.client('s3', region_name='your-aws-region',
                         aws_access_key_id='your-aws-key',
                         aws_secret_access_key='your-aws-secret-key')

3. Load parquet file into your S3 bucket.

Check here for official documentation on how to create S3 buckets https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html

output_file = "s3://your-s3-bucket-name/file.parquet"
s3 = S3FileSystem()

# read the parquet file into a pandas df
paq_df = pd.read_parquet("/path-to-parquet-file/file.parquet")

# write it to the S3 bucket as a dataset, optionally partitioned by date
table = pa.Table.from_pandas(paq_df)
pq.write_to_dataset(table=table,
                    root_path=output_file,
                    partition_cols=['date'],  # optional
                    filesystem=s3)

AWS Configurations

  1. Create a Lambda function with an S3 trigger

YOU CAN GET THE CODE FOR THIS PROJECT HERE:

https://github.com/kimerajoseph/automate_etl_jobs

This notification can be created either on a lambda function or on the S3 bucket. We shall cover the Lambda option

  • Log into your AWS account
  • On the services menu (upper left corner), type Lambda
  • Go to functions
  • Create function
  • You can use a boilerplate or start from scratch. Select the start from scratch option
  • Input a function name
  • For Runtime, select a Python version (I used Python 3.7)
  • Click on the “change execution role” drop-down
  • Choose the “Use an existing role” option. You need to have an IAM role that has access to Glue, SNS, and EventBridge. Take a look at the official documentation here: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create.html
  • Click on “create function”
  • On the function page, click on “add trigger”
  • On the trigger configuration drop-down, select S3
  • Select Bucket name
  • Event type — Select “all object create events” option
  • You can add a prefix — for example, a bucket folder name
  • Then select a suffix — this can be a file extension such as .csv. The Lambda will then be triggered only when a csv file is uploaded
  • Acknowledge the AWS warnings about saving the file to the same bucket (NEVER SAVE THE SOURCE AND PROCESSED FILES TO THE SAME BUCKET AS THIS WOULD TRIGGER THE LAMBDA CONTINUOUSLY AND LEAD TO EXCESS CHARGES)
  • Go to Configuration > General configuration > Edit and change the timeout from 3 seconds to about a minute (you can choose any value up to 15 minutes, which is Lambda’s upper limit)
  • The created trigger should now be visible on the left side of the main window (check the image above, where the “add trigger” option is)
  • Always check your Lambda function and confirm it has a trigger

Access the code for all the Lambda functions here:

https://github.com/kimerajoseph/automate_etl_jobs
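
The full handler code is in that repo, but as a rough sketch (assuming a hypothetical crawler name, my-glue-crawler), the S3-triggered Lambda that starts the Glue crawler can be as short as this:

import boto3

glue = boto3.client('glue')

def lambda_handler(event, context):
    # fires on the S3 "object created" event and kicks off the crawler
    glue.start_crawler(Name='my-glue-crawler')  # replace with your crawler name
    return {'statusCode': 200, 'body': 'Crawler started'}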

2. Create the Glue crawler
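
The crawler is created from the Glue console: give it a name, point it at the S3 bucket holding the Parquet data, attach an IAM role with Glue and S3 access, and choose (or create) a Glue catalog database for the tables it discovers. If you prefer to script it, a rough boto3 sketch (all names are placeholders) might look like this:

import boto3

glue = boto3.client('glue')

# all names below are placeholders -- substitute your own
glue.create_crawler(
    Name='my-glue-crawler',
    Role='your-glue-service-role',      # IAM role with Glue and S3 access
    DatabaseName='your_glue_database',  # catalog database for discovered tables
    Targets={'S3Targets': [{'Path': 's3://your-s3-bucket-name/'}]},
)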

3. Create a Glue ETL job

  • On the Glue page, left side menu, click on the “AWS Glue Studio” option
  • Again, you can take a look at the official documentation here: https://docs.aws.amazon.com/glue/latest/ug/creating-jobs-chapter.html
  • Click on jobs (left side drop-down menu) >> select the visual with source and target option
  • Go to job details and enter a job name
  • Select the IAM role (again, create a role and assign it the necessary permissions: Lambda, EventBridge, SNS, and S3 full access authorization)
  • Set the job timeout to 5 minutes
  • Create a simple ETL job: read data from an S3 bucket (the parquet file we uploaded earlier), transform it (YOU CAN SKIP THIS STAGE), and save it into a different S3 bucket. A rough sketch of such a script follows below
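
Glue Studio generates the actual job script for you; conceptually it does something along these lines (a simplified sketch with placeholder bucket names, not the exact generated code):

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# read the parquet data uploaded earlier (placeholder source bucket)
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="parquet",
    connection_options={"paths": ["s3://your-source-bucket/"]},
)

# optional transforms would go here

# write to a *different* bucket so the S3-triggered Lambda is not re-fired
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    format="parquet",
    connection_options={"path": "s3://your-processed-bucket/"},
)
job.commit()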

4. Create EventBridge rules

We are going to create two rules that trigger two different Lambda functions (the Lambda that starts the Glue crawler is already triggered directly by the S3 upload):

  • The first rule will trigger the Lambda that starts the Glue ETL job once the crawler has finished crawling the data
  • The second rule will trigger a second Lambda that writes to an SNS topic. The SNS topic will then send out email notifications

Type “EventBridge” in the services menu (top left corner)

On the EventBridge page, click on “create rule”

  • Enter name and description
  • On “define pattern”, select “Event pattern”
  • Select “custom pattern”
  • Select the options shown in the picture below (the custom event pattern itself is sketched after this list)
Rule options for triggering the Lambda after the crawler is finished
  • scroll down to the “select targets” section
  • Under target, select “Lambda function”
  • Function, Select your lambda function that will start the Glue ETL job
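
For reference, the custom pattern for this first rule (the one shown in the image above) typically looks something like the following; the crawler name is a placeholder:

{
  "detail-type": [
    "Glue Crawler State Change"
  ],
  "detail": {
    "crawlerName": [
      "<your-crawler-name>"
    ],
    "state": [
      "Succeeded"
    ]
  }
}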

REPEAT THE SAME PROCEDURE FOR THE SECOND LAMBDA TRIGGER.

  • For “event pattern”, select custom
  • Paste the code below in the “event pattern” text box
"detail-type": [
"Glue Job Run Status"
],
"detail": {
"jobName": [
"<your-glue-etl-name>"
],
"state": [
"SUCCEEDED"
]
}
}

NOTE: You can change the “state” to the state you are monitoring (FAILED, TIMEOUT, etc.)

  • Select the Lambda function to be triggered when the ETL job status changes

5. Create SNS topic and subscription

  • On the SNS page, go to “topics”
  • Select “create topic”
  • Select “standard”, enter topic name and description
  • Click “create topic”
  • On the topics page, select “create subscription”
  • Select the topic ARN, under the protocol, select “email”
  • Enter endpoint (your email)
  • Click on “create subscription”
  • You will receive an email from AWS requesting you to confirm your subscription to SNS. Confirm it
  • On the subscription page, it should show “confirmed” under the status
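
The second EventBridge-triggered Lambda only needs to publish to this topic. A minimal sketch (the topic ARN below is a placeholder) could be:

import boto3

sns = boto3.client('sns')

def lambda_handler(event, context):
    # fires when the EventBridge rule matches the Glue job state-change event
    sns.publish(
        TopicArn='arn:aws:sns:your-region:123456789012:your-topic-name',  # placeholder ARN
        Subject='Glue ETL job finished',
        Message=f"Glue job state change event: {event}",
    )
    return {'statusCode': 200}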

NOW TO BRING EVERYTHING TOGETHER

  1. Load data into an S3 bucket by executing the upload_parquet_file_to_S3.ipynb file in the GitHub repo
  2. Check the crawler status from the Glue page. It should show the status “running”. Refresh the page and check again
  3. After the crawler stops, check the Glue ETL “job details”.
  4. Finally, you should get an email on the email you registered under the SNS topic
  5. Check the logs under CloudWatch to confirm
  6. Clean up all the resources
