AUTOMATING DATA PIPELINES WITH AWS S3, EMR, AND LAMBDA
Amazon Web Services (AWS) offers a wide variety of services, and very often you can use different services to achieve the same goal. Take a scenario where you wait for a file to land in an S3 bucket, then process the data and insert it into a different S3 bucket or a relational database (RDS). One approach is Airflow, combining its file sensor (to check whether a file exists) with a Spark operator (to submit a Spark job). Another is to use Lambda and EMR, as shown in the image below.
In our case, we set a file trigger on an S3 bucket. When a file is uploaded into the bucket, a Lambda function is triggered. The function creates an EMR cluster and executes a Python file (or any other code) stored in an S3 bucket.
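The entry point of such a function can be sketched as follows, assuming the standard S3 "ObjectCreated" event shape that Lambda receives. The helper name `launch_emr_job` is a hypothetical placeholder for the cluster-creation logic:

```python
# Minimal sketch of the Lambda entry point for an S3 file trigger.
import urllib.parse


def extract_s3_object(event):
    """Return (bucket, key) for the object that fired the trigger."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    # Object keys arrive URL-encoded (spaces become '+', etc.), so decode them.
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
    return bucket, key


def lambda_handler(event, context):
    bucket, key = extract_s3_object(event)
    print(f"New file s3://{bucket}/{key} detected")
    # launch_emr_job(bucket, key)  # hypothetical helper: creates the EMR cluster
    return {"bucket": bucket, "key": key}
```

Decoding the key matters in practice: S3 URL-encodes object keys in the event payload, so a file named `my file.csv` would otherwise be looked up as `my+file.csv` and fail.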
STEPS FOLLOWED
- Upload your Python or Spark code to an S3 bucket (backend_code_aws.py from the GitHub repo)
- Create a Lambda function and paste in the code from the aws_lambda_emr_job.py file in the GitHub repo
- Add an S3 trigger to the Lambda function. Check out this article on how
- Create another S3 bucket to store the processed data/file. It must be different from the bucket holding the raw file; otherwise the output will re-fire the trigger, leading to continuous processing and runaway costs
- Upload a file into the source S3 bucket
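The warning about using a separate output bucket is worth enforcing in code: if the processed file lands back in the trigger bucket, each output re-fires the Lambda and the pipeline loops forever. A small guard, sketched here with hypothetical bucket names, can fail fast before any cluster is created:

```python
def validate_buckets(source_bucket, dest_bucket):
    """Refuse to run when input and output buckets match.

    Writing the processed file back into the trigger bucket would
    re-fire the Lambda on every output, looping indefinitely and
    accruing EMR costs.
    """
    if source_bucket == dest_bucket:
        raise ValueError("Source and destination buckets must differ")
    return True
```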
Source code can be found at this GitHub repo.
Take a look at my earlier article covering how to create triggers on Lambda functions.
NOTE: The code in the GitHub repo contains comments that should make it really easy to understand.
The Lambda function does the following:
- Reads the file and bucket name from the trigger event
- Reads the code to run on the created EMR cluster
- Creates the cluster: defines the number and type of instances, the service role, and the applications to install on the cluster (e.g. Spark, Hive), among other settings
- Passes arguments (file name and bucket name) to the code to be executed on the cluster
- Installs any additional software on the cluster needed to execute the job
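The steps above map onto a single boto3 `run_job_flow` request. The sketch below builds that request; the cluster name, EMR release label, instance types and counts, bootstrap script path, and bucket names are illustrative assumptions, while the role names are the EMR defaults:

```python
def build_job_flow(code_bucket, data_bucket, file_key):
    """Return a run_job_flow request that processes s3://data_bucket/file_key."""
    return {
        "Name": "lambda-triggered-job",
        "ReleaseLabel": "emr-6.15.0",  # assumed EMR release; pick your own
        # Applications to install on the cluster, e.g. Spark and Hive.
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            # Number and type of machines (assumed sizes).
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
                 "InstanceCount": 2},
            ],
            # Shut the cluster down once the step finishes, so it never idles.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # EC2 instance profile
        "ServiceRole": "EMR_DefaultRole",      # EMR service role
        # Bootstrap action: install any additional software the job needs
        # (bootstrap.sh is a hypothetical script in the code bucket).
        "BootstrapActions": [{
            "Name": "install-deps",
            "ScriptBootstrapAction": {"Path": f"s3://{code_bucket}/bootstrap.sh"},
        }],
        # The step submits the uploaded script, passing bucket and file name
        # as arguments.
        "Steps": [{
            "Name": "process-file",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit",
                         f"s3://{code_bucket}/backend_code_aws.py",
                         data_bucket, file_key],
            },
        }],
    }


def launch_cluster(code_bucket, data_bucket, file_key):
    import boto3  # imported here so build_job_flow stays usable offline
    emr = boto3.client("emr")
    return emr.run_job_flow(**build_job_flow(code_bucket, data_bucket, file_key))
```

Setting `KeepJobFlowAliveWhenNoSteps` to `False` is the key cost control: the cluster terminates itself as soon as the step completes, instead of waiting for someone to remember to shut it down.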
If it all goes well, you should see the cluster on the AWS EMR home page.
NOTE: Ensure the clusters are terminated before leaving the AWS EMR page, to avoid unnecessary costs.
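A cleanup check like this can be run as a safety net, using boto3's `list_clusters` and `terminate_job_flows` calls. Clusters in the WAITING state are idle but still billing; the helper below collects and terminates them (cluster IDs in the test are made up):

```python
def idle_cluster_ids(clusters):
    """Pick out clusters sitting in WAITING state (idle but still billing)."""
    return [c["Id"] for c in clusters if c["Status"]["State"] == "WAITING"]


def terminate_idle_clusters():
    import boto3  # imported here so idle_cluster_ids stays testable offline
    emr = boto3.client("emr")
    # Only WAITING clusters are fetched; running jobs are left alone.
    clusters = emr.list_clusters(ClusterStates=["WAITING"])["Clusters"]
    ids = idle_cluster_ids(clusters)
    if ids:
        emr.terminate_job_flows(JobFlowIds=ids)
    return ids
```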