Serverless ETL Processing in AWS

Ankit K Singh
Oct 15, 2020 · 4 min read

Recently, I was working on a challenge posted by A Cloud Guru where we needed to process COVID-19 data from two sources and store it in a database for further reporting. Here are the approach I took and the difficulties I faced during the development and testing phases.

Approach:

Data Analysis: Being a data processing project, I first wanted to explore the data, and a Jupyter notebook is the best tool for this purpose. I loaded the data into a Jupyter notebook and, with the help of NumPy and Pandas, inspected the columns in the files and their data types. After cleaning up the data, I visualized it using Matplotlib.

https://github.com/erankitcs/EventDrivenPythonAWS/blob/main/datanalysis/CloudGuruChallenge_Jypyter_Analysis.ipynb
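For a flavor of that exploration, here is a minimal pandas sketch. The URL and the column names ("date", "cases") are illustrative placeholders, not the challenge's actual feeds:

```python
import pandas as pd

# Illustrative source; the challenge pulls two public COVID-19 CSV feeds.
df = pd.read_csv("https://example.com/covid19-daily.csv")

print(df.dtypes)  # inspect each column and its data type
print(df.head())  # eyeball the first few rows

# Typical cleanup: parse dates and drop incomplete rows.
df["date"] = pd.to_datetime(df["date"])
df = df.dropna()

# Quick Matplotlib visualization through the pandas plotting API.
df.plot(x="date", y="cases")
```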

Move to Lambda: After completing the analysis, I created a basic Lambda function to read the source files and perform the same data processing as the Jupyter notebook.
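Structurally, that first Lambda looked something like the sketch below, with the notebook's cleanup steps moved into a handler (again, the URL and column names are illustrative):

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Same cleanup as the notebook: parse dates, drop incomplete rows.
    df["date"] = pd.to_datetime(df["date"])
    return df.dropna()

def handler(event, context):
    # Placeholder URL; the real function reads the challenge's source files.
    df = clean(pd.read_csv("https://example.com/covid19-daily.csv"))
    return {"rows": len(df)}
```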

Cloud Design: Once the data processing skeleton was ready, I started focusing on how to use AWS services in a cloud architecture that follows both design and security best practices. I settled on the design considerations below.

1. Keep the file download from the public URL separate from the actual data processing. So I created a Lambda function that is triggered daily by a CloudWatch time event and loads the file into our landing zone (an S3 bucket). This Lambda then publishes an SNS message to a topic subscribed to by our ETL Lambda function (a sketch of this function follows the list).

2. Because our ETL Lambda deals with the data, it should sit somewhere isolated. I had two options: RDS Proxy, which is easy to implement, or placing the Lambda itself inside a VPC. Placing the Lambda in a VPC gives better control in terms of security, so I moved my second Lambda function into a VPC (the default one, in my case).

3. I selected an RDS database running PostgreSQL simply because I wanted to use AWS QuickSight for reports. Database access from the Lambda function is secured with the help of AWS Secrets Manager.

4. Since the ETL Lambda now sits inside a VPC, VPC endpoints are required to access the S3 bucket, SNS, and Secrets Manager.

5. Logging is maintained in CloudWatch.

6. IaC: I had the option of both CloudFormation and Terraform, but I picked Terraform because it is cloud-agnostic and widely used; its declarative style of coding also gives better control.
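To make point 1 concrete, here is a minimal sketch of the downloader Lambda. The environment variable names, the S3 key, and the return shape are my assumptions for illustration; the real values are wired in through Terraform:

```python
import os
import urllib.request
import boto3

LANDING_BUCKET = os.environ["LANDING_BUCKET"]  # landing-zone S3 bucket
TOPIC_ARN = os.environ["ETL_TOPIC_ARN"]        # topic the ETL Lambda subscribes to
SOURCE_URL = os.environ["SOURCE_URL"]          # public COVID-19 data URL

s3 = boto3.client("s3")
sns = boto3.client("sns")

def handler(event, context):
    # Triggered daily by the CloudWatch time event: download the public
    # file and drop it into the landing zone...
    body = urllib.request.urlopen(SOURCE_URL).read()
    key = "landing/covid19.csv"
    s3.put_object(Bucket=LANDING_BUCKET, Key=key, Body=body)
    # ...then notify the ETL Lambda through its SNS topic.
    sns.publish(TopicArn=TOPIC_ARN, Message=key)
    return {"status": "uploaded", "key": key}
```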

Architecture

Challenges

  1. Pandas inside Lambda: The first challenge I faced was how to use Pandas and NumPy in the Lambda function. Initially, I installed the packages locally and placed them into the Lambda folder, but Lambda was not picking them up. After investing a couple of hours, I understood that, due to a compatibility issue, you need to download the Linux-compatible wheel from PyPI and unzip it into the Lambda function folder: “pandas-1.1.3-cp36-cp36m-manylinux1_x86_64.whl” (https://pypi.org/project/pandas/#files).
  2. Packaging the Lambda function in Terraform: Another challenge waiting for me was how to package the Lambda function code in Terraform. There are multiple ways to do it. Initially, I used the newer Terraform function “filebase64sha256” for the hash input, along with the “archive_file” data source and a “null_resource” with a trigger to detect changes, but deployments kept failing due to hash mismatches. So I moved away from the “filebase64sha256” function and used the older style “data.archive_file.lambda_package.output_base64sha256” for the source_code_hash input on the Lambda resource.
  3. Loading a Pandas DataFrame into PostgreSQL: While working with psycopg2, I had difficulty loading the DataFrame. I eventually found the “psycopg2.extras.execute_batch” function, which loads multiple inserts together (see the first sketch after this list).
  4. VPC endpoints for SNS and Secrets Manager: While running our ETL Lambda function inside the VPC, I was NOT able to publish notifications to SNS; the calls simply timed out. After much analysis, I figured out that I had missed one little piece of code in my Terraform: “private_dns_enabled = true”. This change allows the Lambda to reach the services via private DNS from within the VPC.
  5. SNS email subscription in Terraform: Unfortunately, Terraform does not support email subscriptions to SNS. So I had two options in Terraform to achieve this indirectly: first, a CloudFormation stack resource with the email subscription written in a YAML template; second, a shell script run by a local provisioner that calls the AWS CLI to create the email subscription (the second sketch after this list shows the equivalent API call).
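To make challenge 3 concrete, here is a minimal sketch that combines the Secrets Manager lookup from the design section with execute_batch. The secret name, table, and columns are hypothetical placeholders:

```python
import json
import boto3
import pandas as pd
import psycopg2
from psycopg2.extras import execute_batch

def load_dataframe(df: pd.DataFrame):
    # Fetch DB credentials from Secrets Manager (secret name is a placeholder).
    secret = boto3.client("secretsmanager").get_secret_value(
        SecretId="covid-db-credentials"
    )
    creds = json.loads(secret["SecretString"])

    conn = psycopg2.connect(
        host=creds["host"], dbname=creds["dbname"],
        user=creds["username"], password=creds["password"],
    )
    # The connection context manager commits on success, rolls back on error.
    with conn, conn.cursor() as cur:
        # execute_batch groups many INSERTs per round trip, far faster than
        # one cur.execute() per DataFrame row.
        execute_batch(
            cur,
            "INSERT INTO covid_stats (report_date, cases, deaths) VALUES (%s, %s, %s)",
            list(df.itertuples(index=False, name=None)),
            page_size=500,
        )
    conn.close()
```

And for challenge 5, this is the call the shell script makes through the AWS CLI, shown here via boto3 purely for illustration (the topic ARN and address are placeholders):

```python
import boto3

# Terraform cannot create this subscription directly, so a local provisioner
# drives the same API call through the AWS CLI. The recipient still has to
# confirm the subscription by email.
boto3.client("sns").subscribe(
    TopicArn="arn:aws:sns:us-east-1:123456789012:etl-status-topic",
    Protocol="email",
    Endpoint="me@example.com",
)
```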

Below is the GitHub repo for this project; feel free to go through it and reach out to me if you have any feedback or queries.

Dashboard in AWS QuickSight

Challenge Link: If you want to understand a bit more about the challenge, follow the URL below.

https://acloudguru.com/blog/news/introducing-the-cloudguruchallenge


Written by Ankit K Singh

A data science enthusiast, passionate about cloud technology. AWS, Azure, GCP, and MongoDB certified developer/architect. Interested in Google Cloud engineering.
