Automating ETL jobs on AWS using Glue, Lambda, EventBridge and Athena

kimera joseph
5 min read · Jan 11, 2022


Introduction

This article will cover one of the ways to move on-premise files to an Amazon Web Services data lake (S3) and automatically trigger other services that operate on the data, making it instantly available for analysis using SQL, Spark, Python, and other languages.

Details

This can be done using an AWS template, as detailed at this link.

But I believe it's better to learn how to inter-link each service yourself, just in case you are required to create an alert or modify an existing pipeline.

We are utilizing the following AWS services:

  1. Simple Storage Service (S3)
  2. Glue Crawler
  3. Lambda
  4. Glue ETL job
  5. EventBridge
  6. Simple Notification Service (SNS)

Image showing our data lake automation workflow

Below are the Steps

  1. Convert the files from CSV (or any other format) to Parquet. Parquet files are often more than 70% smaller than the equivalent CSV files
# import all necessary libraries
import pandas as pd
import numpy as np
import boto3
import pyarrow as pa
import pyarrow.parquet as pq
from s3fs import S3FileSystem
import snappy

# read the csv file into a pandas df and convert it to parquet
df = pd.read_csv("/path/to/file/file.csv")
df.to_parquet("/path/to/file/file.parquet", engine='auto', compression='snappy')

2. Create an S3 client using the boto3 library

# boto3 can also pick up credentials from environment variables or an IAM role
s3_client = boto3.client('s3', region_name='your-aws-region',
                         aws_access_key_id='your-aws-key',
                         aws_secret_access_key='your-aws-secret-key')

3. Load parquet file into your S3 bucket.

Check here for official documentation on how to create S3 buckets https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html

output_file = "s3://your-s3-bucket-name/file.parquet"
s3 = S3FileSystem()

# read the parquet file into a pandas df
paq_df = pd.read_parquet("/path-to-parquet-file/file.parquet")

# write it to the S3 bucket as a dataset, optionally partitioned by date
table = pa.Table.from_pandas(paq_df)
pq.write_to_dataset(table=table,
                    root_path=output_file,
                    partition_cols=['date'],  # optional
                    filesystem=s3)

AWS Configurations

  1. Create a Lambda function with an S3 trigger

YOU CAN GET THE CODE FOR THIS PROJECT HERE:

https://github.com/kimerajoseph/automate_etl_jobs

This notification can be created either on a lambda function or on the S3 bucket. We shall cover the Lambda option

  • Log into your AWS account
  • On the services menu (upper left corner), type Lambda
  • Go to functions
  • Create function
  • You can use a boilerplate or start from scratch. Select the start from scratch option
  • Input a function name
  • For Runtime, select a Python version (I used Python 3.7)
  • Click on the “change execution role” drop-down
  • Choose the “Use an existing role” option. You need to have an IAM role that has access to Glue, SNS, and EventBridge. Take a look at the official documentation here: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create.html
  • Click on “create function”
  • On the function page, click on “add trigger”
  • On the trigger configuration drop-down, select S3
  • Select Bucket name
  • Event type — Select “all object create events” option
  • You can add a prefix — for example, a bucket folder name
  • Then select a suffix — this can be a file extension such as .csv. The Lambda will then be triggered only when a csv file is uploaded
  • Acknowledge the AWS warnings about saving the file to the same bucket (NEVER SAVE THE SOURCE AND PROCESSED FILES TO THE SAME BUCKET AS THIS WOULD TRIGGER THE LAMBDA CONTINUOUSLY AND LEAD TO EXCESS CHARGES)
  • Go to Configuration > General configuration > Edit and change the timeout from 3 seconds to about a minute (you can choose any value up to 15 minutes, which is Lambda’s upper limit)
  • The created trigger should now be visible on the left side of the main window (check the image above, where the “add trigger” option is)
  • Always check your Lambda function and confirm it has a trigger

Access the code for all the Lambda functions here:

https://github.com/kimerajoseph/automate_etl_jobs
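
The full handler code is in that repo, but as a rough sketch (assuming a hypothetical crawler name, my-glue-crawler), the S3-triggered Lambda that starts the Glue crawler can be as short as this:

import boto3

glue = boto3.client('glue')

def lambda_handler(event, context):
    # fires on the S3 "object created" event and kicks off the crawler
    glue.start_crawler(Name='my-glue-crawler')  # replace with your crawler name
    return {'statusCode': 200, 'body': 'Crawler started'}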

2. Create the Glue crawler
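
The crawler is created from the Glue console: give it a name, point it at the S3 bucket holding the Parquet data, attach an IAM role with Glue and S3 access, and choose (or create) a Glue catalog database for the tables it discovers. If you prefer to script it, a rough boto3 sketch (all names are placeholders) might look like this:

import boto3

glue = boto3.client('glue')

# all names below are placeholders -- substitute your own
glue.create_crawler(
    Name='my-glue-crawler',
    Role='your-glue-service-role',      # IAM role with Glue and S3 access
    DatabaseName='your_glue_database',  # catalog database for discovered tables
    Targets={'S3Targets': [{'Path': 's3://your-s3-bucket-name/'}]},
)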

3. Create a Glue ETL job

  • On the Glue page, left side menu, click on the “AWS Glue Studio” option
  • Again, you can take a look at the official documentation here: https://docs.aws.amazon.com/glue/latest/ug/creating-jobs-chapter.html
  • Click on jobs (left side drop-down menu) >> select the visual with source and target option
  • Go to job details and enter a job name
  • Select the IAM role (again, create a role and assign it the necessary permissions: Lambda, EventBridge, SNS, and S3 full access authorization)
  • Set the job timeout to 5 minutes
  • Create a simple ETL job: read data from an S3 bucket (the parquet file we uploaded earlier), transform it (YOU CAN SKIP THIS STAGE), and save it into a different S3 bucket. A rough sketch of such a script follows below
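
Glue Studio generates the actual job script for you; conceptually it does something along these lines (a simplified sketch with placeholder bucket names, not the exact generated code):

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# read the parquet data uploaded earlier (placeholder source bucket)
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="parquet",
    connection_options={"paths": ["s3://your-source-bucket/"]},
)

# optional transforms would go here

# write to a *different* bucket so the S3-triggered Lambda is not re-fired
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    format="parquet",
    connection_options={"path": "s3://your-processed-bucket/"},
)
job.commit()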

4. Create EventBridge rules

We are going to create two rules that trigger two different Lambda functions (the Lambda that starts the Glue crawler is already triggered directly by the S3 upload):

  • The first rule will trigger the Lambda that starts the Glue ETL job once the crawler has finished crawling the data
  • The second rule will trigger a second Lambda that writes to an SNS topic. The SNS topic will then send out email notifications

Type “EventBridge” in the services menu (top left corner)

On the EventBridge page, click on “create rule”

  • Enter name and description
  • On “define pattern”, select “Event pattern”
  • Select “custom pattern”
  • Select the options shown in the picture below (the custom event pattern itself is sketched after this list)
Rule options for triggering the Lambda after the crawler is finished
  • scroll down to the “select targets” section
  • Under target, select “Lambda function”
  • Function, Select your lambda function that will start the Glue ETL job
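
For reference, the custom pattern for this first rule (the one shown in the image above) typically looks something like the following; the crawler name is a placeholder:

{
  "detail-type": [
    "Glue Crawler State Change"
  ],
  "detail": {
    "crawlerName": [
      "<your-crawler-name>"
    ],
    "state": [
      "Succeeded"
    ]
  }
}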

REPEAT THE SAME PROCEDURE FOR THE SECOND LAMBDA TRIGGER.

  • For “event pattern”, select custom
  • Paste the code below in the “event pattern” text box
"detail-type": [
"Glue Job Run Status"
],
"detail": {
"jobName": [
"<your-glue-etl-name>"
],
"state": [
"SUCCEEDED"
]
}
}

NOTE: You can change the “state” to the state you are monitoring (FAILED, TIMEOUT, etc.)

  • Select the Lambda function to be triggered when the ETL job status changes

5. Create SNS topic and subscription

  • On the SNS page, go to “topics”
  • Select “create topic”
  • Select “standard”, enter topic name and description
  • Click “create topic”
  • On the topics page, select “create subscription”
  • Select the topic ARN, under the protocol, select “email”
  • Enter endpoint (your email)
  • Click on “create subscription”
  • You will receive an email from AWS requesting you to confirm your subscription to SNS. Confirm it
  • On the subscription page, it should show “confirmed” under the status
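
The second EventBridge-triggered Lambda only needs to publish to this topic. A minimal sketch (the topic ARN below is a placeholder) could be:

import boto3

sns = boto3.client('sns')

def lambda_handler(event, context):
    # fires when the EventBridge rule matches the Glue job state-change event
    sns.publish(
        TopicArn='arn:aws:sns:your-region:123456789012:your-topic-name',  # placeholder ARN
        Subject='Glue ETL job finished',
        Message=f"Glue job state change event: {event}",
    )
    return {'statusCode': 200}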

NOW TO BRING EVERYTHING TOGETHER

  1. Load data into an S3 bucket by executing the upload_parquet_file_to_S3.ipynb file in the GitHub repo
  2. Check the crawler status from the Glue page. It should show the status “running”. Refresh the page and check again
  3. After the crawler stops, check the Glue ETL “job details”.
  4. Finally, you should get an email on the email you registered under the SNS topic
  5. Check the logs under CloudWatch to confirm
  6. Clean up all the resources
