Next Gen Big Data with AWS — Part 1

Vikram E
2 min read · Jan 9, 2017

Create EMR Cluster based on S3 Event

Then and Now

For years, organizations ran their big data workloads on in-house infrastructure, with clusters up and running even during off-peak hours. By adopting the AWS Cloud, organizations benefit from faster time to market, lower costs, and new revenue opportunities.

Use Case

Use AWS Lambda to spin up an EMR cluster in response to an S3 event (such as a put object) and tear the cluster down once the computation is done, for optimized resource utilization. In this demo we will upload a CSV file to an S3 bucket, which sends an event notification to AWS Lambda. Inside the Lambda function, we will create the EMR cluster using the boto3 API.

AWS Lambda Function


Configure an AWS Lambda function with the Python runtime to run this code. boto3.client('emr') returns a low-level client representing EMR, and its run_job_flow method creates and starts a new job flow. We configure the EMR instances with InstanceType, InstanceRole, etc., and set EMR_EC2_DefaultRole and EMR_DefaultRole in the run_job_flow call. Save and test the Lambda function to make sure the configuration is good.
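A minimal sketch of such a Lambda function is below. The cluster name, release label, and instance types are illustrative assumptions; substitute values that match your account.

```python
# Hypothetical cluster settings for illustration; adjust the name, release
# label, applications, and instance types to match your environment.
JOB_FLOW_ARGS = {
    "Name": "s3-triggered-emr-cluster",
    "ReleaseLabel": "emr-5.2.0",
    "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Master",
                "InstanceRole": "MASTER",
                "InstanceType": "m3.xlarge",
                "InstanceCount": 1,
            },
            {
                "Name": "Core",
                "InstanceRole": "CORE",
                "InstanceType": "m3.xlarge",
                "InstanceCount": 2,
            },
        ],
        # Keep the cluster alive for now; part 2 sets this to False so the
        # cluster tears itself down after its steps finish.
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}


def lambda_handler(event, context):
    """Triggered by the S3 event; spins up an EMR cluster."""
    import boto3  # available in the Lambda Python runtime

    emr = boto3.client("emr")
    response = emr.run_job_flow(**JOB_FLOW_ARGS)
    print("Started cluster:", response["JobFlowId"])
    return response["JobFlowId"]
```

Keeping the cluster configuration in a module-level dict makes it easy to inspect and tweak without touching the handler logic.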

S3 Event

Create a bucket and set up an event notification to trigger the Lambda function. ObjectCreated (put, post, copy) events will trigger the function when an object with the .csv suffix is created.

S3 ObjectCreated event trigger
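The same notification can be attached programmatically. A sketch using boto3's put_bucket_notification_configuration is below; the bucket name, configuration Id, and function ARN are placeholders, not values from this demo.

```python
# Hypothetical notification configuration: fire the Lambda function on any
# ObjectCreated event (put, post, copy) for keys ending in ".csv".
NOTIFICATION_CONFIG = {
    "LambdaFunctionConfigurations": [
        {
            "Id": "csv-upload-trigger",  # placeholder Id
            "LambdaFunctionArn": (
                "arn:aws:lambda:us-east-1:123456789012:"
                "function:create-emr-cluster"  # placeholder ARN
            ),
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {
                "Key": {"FilterRules": [{"Name": "suffix", "Value": ".csv"}]}
            },
        }
    ]
}


def attach_notification(bucket_name):
    """Attach the ObjectCreated notification to the bucket."""
    import boto3  # available in the Lambda/EC2 runtime

    s3 = boto3.client("s3")
    s3.put_bucket_notification_configuration(
        Bucket=bucket_name,
        NotificationConfiguration=NOTIFICATION_CONFIG,
    )
```

Note that S3 must also be granted permission to invoke the Lambda function, which the console wizard handles for you.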

Action

Once an S3 object is created in the bucket, an EMR cluster is launched. Log on to the AWS EMR console to check its progress. You can specify the instance types and the applications to install in the run_job_flow section of the Lambda function.

EMR Cluster Starting

Next

Set up an SNS notification to trigger another Lambda function that adds a step to the EMR cluster. The step will run a Hive/Spark job with the uploaded S3 file as input and write the output back to an S3 bucket. Finally, the cluster tears itself down when there are no more steps to run, by setting KeepJobFlowAliveWhenNoSteps to False. Stay tuned for part 2.
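As a preview, the step-adding Lambda could look roughly like this sketch using EMR's add_job_flow_steps. The step name, job script path, and paths are hypothetical; part 2 will cover the real wiring.

```python
def build_spark_step(input_path, output_path):
    """Build an EMR step that runs a Spark job on the uploaded file.

    The S3 path to the job script is a placeholder, not a real artifact.
    """
    return {
        "Name": "process-uploaded-csv",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "s3://my-bucket/jobs/process_csv.py",  # hypothetical script
                input_path,
                output_path,
            ],
        },
    }


def add_step(cluster_id, input_path, output_path):
    """Submit the step to a running cluster and return its step id."""
    import boto3  # available in the Lambda runtime

    emr = boto3.client("emr")
    resp = emr.add_job_flow_steps(
        JobFlowId=cluster_id,
        Steps=[build_spark_step(input_path, output_path)],
    )
    return resp["StepIds"][0]
```

With KeepJobFlowAliveWhenNoSteps set to False, the cluster terminates on its own once this step completes, so no explicit teardown call is needed.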
