kimera joseph
2 min read · Mar 7, 2022

AUTOMATING DATA PIPELINES WITH AWS S3, EMR, AND LAMBDA

Amazon Web Services (AWS) offers a wide variety of services, and you can often use different services to achieve the same goal. Take a scenario where you wait for a file to land in an S3 bucket, then process the data and insert it into a different S3 bucket or a relational database (Amazon RDS). One way to do this is with Airflow, combining the Airflow file sensor (to check whether the file exists) and the Spark operator (to submit a Spark job). Alternatively, you can use Lambda and EMR to achieve the same goal, as shown in the image below.

Image Showing our Data Flow

In our case, we set a file trigger on an S3 bucket. When a file is uploaded to the bucket, a Lambda function is triggered. The function creates an EMR cluster and executes a Python file (or any other code) stored in an S3 bucket.
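When the trigger fires, Lambda receives an S3 event describing the uploaded object. A minimal sketch of how the handler can read the bucket and file name — the handler body here is illustrative, not the exact code from the repo:

```python
from urllib.parse import unquote_plus


def lambda_handler(event, context):
    # An S3 trigger delivers one or more records; each record carries
    # the bucket name and the (URL-encoded) object key that fired it.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = unquote_plus(record["s3"]["object"]["key"])
    print(f"New file landed: s3://{bucket}/{key}")
    return {"bucket": bucket, "key": key}
```

These two values are what get forwarded to the EMR job later on.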

STEPS FOLLOWED

  1. Upload your Python or Spark code to an S3 bucket (backend_code_aws.py from the GitHub repo)
  2. Create our Lambda function and paste in the code from the aws_lambda_emr_job.py file in the GitHub repo
  3. Add an S3 trigger to our Lambda function. Check out this article to see how
  4. Create another S3 bucket to store the processed data/file (it must be different from the bucket holding the raw file; otherwise the trigger will fire on the processed output, leading to continuous processing and abnormal costs)
  5. Upload a file into our S3 bucket
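The trigger in step 3 can also be attached programmatically rather than through the console. A hedged sketch using boto3 — the helper name, Lambda ARN, and suffix filter are my own illustrative choices, not from the repo. Filtering on a suffix (or prefix) is one way to help ensure the trigger never fires on files the pipeline itself writes back:

```python
def build_notification_config(lambda_arn, suffix=".csv"):
    """Build the S3 bucket-notification config that invokes a Lambda
    for every newly created object whose key ends in `suffix`."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": suffix}]}
                },
            }
        ]
    }


def attach_trigger(bucket, lambda_arn):
    # boto3 ships with the Lambda runtime; imported lazily here so the
    # config builder above stays testable without AWS credentials.
    import boto3

    boto3.client("s3").put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_notification_config(lambda_arn),
    )
```

Note that the Lambda function must also grant S3 permission to invoke it (an add-permission step) before the notification can be saved.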

Source code can be found in this GitHub repo

Take a look at my earlier article covering the creation of triggers on Lambda functions

NOTE: The code in the GitHub repo contains comments that should make it really easy to understand.

The Lambda function does the following

  • Reads the file and bucket name from the trigger event
  • Reads the code to run on the created EMR cluster
  • Creates the cluster, defining the number of instances, the machine types, the service role, and the applications to install on the cluster (e.g. Spark, Hive)
  • Passes args (file name & bucket name) to the code to be executed on the cluster
  • Installs any additional software the cluster needs to execute the job

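The cluster-creation part of the function can be sketched with boto3's `run_job_flow` call. The bucket path, instance types, release label, and role names below are illustrative assumptions, not the repo's exact values:

```python
def build_spark_step(code_path, bucket, key):
    """EMR step that runs the S3-hosted script via spark-submit,
    forwarding the triggering bucket and file name as arguments."""
    return {
        "Name": "process-uploaded-file",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # EMR's built-in command runner
            "Args": ["spark-submit", code_path, bucket, key],
        },
    }


def launch_cluster(bucket, key):
    # boto3 is available in the Lambda runtime; imported lazily so
    # build_spark_step stays testable without AWS credentials.
    import boto3

    return boto3.client("emr").run_job_flow(
        Name="lambda-triggered-cluster",
        ReleaseLabel="emr-6.5.0",  # assumed release label
        Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
        Instances={
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # Shut the cluster down automatically once the step finishes
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        JobFlowRole="EMR_EC2_DefaultRole",  # assumed default roles
        ServiceRole="EMR_DefaultRole",
        Steps=[build_spark_step(
            "s3://my-code-bucket/backend_code_aws.py", bucket, key)],
    )
```

Setting `KeepJobFlowAliveWhenNoSteps` to `False` makes the cluster transient: it terminates itself after the step completes, which keeps costs down.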
If everything is set up correctly, you should see the cluster on the AWS EMR console home page

NOTE: Ensure the clusters are terminated before leaving the AWS EMR console, to avoid unnecessary costs
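Cleanup can be automated with boto3 as well. A hedged sketch — the state filter and helper names are mine, not from the repo:

```python
def ids_to_terminate(list_clusters_response):
    """Extract the ids of clusters that are still alive from an
    emr.list_clusters() response dict."""
    return [
        c["Id"]
        for c in list_clusters_response["Clusters"]
        if c["Status"]["State"] in ("STARTING", "RUNNING", "WAITING")
    ]


def terminate_all_active():
    # Lazy import keeps ids_to_terminate testable without AWS access.
    import boto3

    emr = boto3.client("emr")
    resp = emr.list_clusters(ClusterStates=["STARTING", "RUNNING", "WAITING"])
    ids = ids_to_terminate(resp)
    if ids:
        emr.terminate_job_flows(JobFlowIds=ids)
```

Running something like this (or relying on a transient cluster configuration) is a safety net against forgotten clusters quietly accruing charges.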