AWS EMR (Elastic MapReduce): a Tiny Demonstration Using the AWS CLI

Amazon EMR is a PaaS (Platform as a Service) that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. By using these frameworks you can process data for analytics and business purposes. EMR can also transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.

In this demo we will use input data (text documents containing English words) stored in an S3 bucket, process it using a sample Python application to count the words and finally group the results. The output will also be stored in the same S3 bucket.

What is so special about this word count program?

Though the process looks simple, the advantage of EMR lies in its applicability to big data and its ability to process that data in parallel. Input data is split into smaller chunks and sent to the workers (in this case, the Python word count application), which process them independently (map). The partial results are then aggregated (reduce) to quickly produce a meaningful output.
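
To make the split, map and reduce idea concrete, here is a minimal local sketch in Python. It is not how EMR is implemented; it only mimics the flow on one machine. The function names, the chunk size and the use of a thread pool as stand-in "workers" are all assumptions made for illustration.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Assumed definition of a "word" for this sketch: a letter followed by letters or digits.
WORD = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def split_into_chunks(lines, chunk_size):
    """Split the input lines into smaller chunks, one per worker task."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_chunk(chunk):
    """Map step: count the words found in a single chunk."""
    counts = Counter()
    for line in chunk:
        counts.update(word.lower() for word in WORD.findall(line))
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(lines, chunk_size=2, workers=4):
    """Run the map step on all chunks in parallel, then reduce the results."""
    chunks = split_into_chunks(lines, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)


if __name__ == "__main__":
    sample = ["the quick brown fox", "jumps over the lazy dog", "The end"]
    print(word_count(sample).most_common(1))  # prints [('the', 3)]
```

On EMR the chunks are files in S3 and the workers are cluster nodes, but the shape of the computation is the same.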

Join me in this tiny command line demo.

Create Input Bucket

aws s3 --region ap-south-1 mb s3://emr-demo-sree

Please note that to keep it simple we will use the same bucket for the results as well.

Copy the Input Data and Processing Program

aws s3 sync s3://elasticmapreduce/samples/wordcount wordcount
tree wordcount
├── input
│ ├── 0001
│ ├── 0002
│ ├── 0003
│ ├── 0004
│ ├── 0005
│ ├── 0006
│ ├── 0007
│ ├── 0008
│ ├── 0009
│ ├── 0010
│ ├── 0011
│ └── 0012

The steps below read from s3://emr-demo-sree/input, so the downloaded files also need to be copied into the demo bucket, for example:

aws s3 sync wordcount s3://emr-demo-sree/

Create AWS EMR Roles

aws emr create-default-roles

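This built-in facility for creating the roles EMR needs is a great aid. It creates the EMR_DefaultRole service role and the EMR_EC2_DefaultRole instance profile used when creating the cluster.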

Create EMR Cluster

aws --region ap-south-1 emr create-cluster \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large \
--name "Test Cluster" \
--log-uri s3://emr-demo-sree/logs/ \
--enable-debugging \
--tags Name=emr \
--ec2-attributes '{"KeyName":"my-demo-key","InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-nnnn","EmrManagedSlaveSecurityGroup":"sg-nnnn","EmrManagedMasterSecurityGroup":"sg-nnnn"}' \
--release-label emr-5.13.0 \
--service-role EMR_DefaultRole

The command prints the ID of the new cluster (ClusterId). The commands that follow assume this ID is stored in the shell variable cid.

Add Processing Steps

aws --region ap-south-1 emr add-steps --cluster-id $cid \
--steps Type=STREAMING,Name='Word Count',ActionOnFailure=CONTINUE,Args=--files,s3://emr-demo-sree/,-mapper,,-reducer,aggregate,-input,s3://emr-demo-sree/input,-output,s3://emr-demo-sree/output

"StepIds": [

Check the status of the step (for example with aws emr describe-step) and repeat the check till you see the “COMPLETED” state.
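
If you would rather script the waiting than eyeball it, one option is to parse the JSON that aws emr describe-step prints and read the state from Step.Status.State. The sketch below only covers the parsing and the "keep waiting or stop" decision; the helper names are made up for this post, and the sample JSON is a hand-written, trimmed-down stand-in for real CLI output.

```python
import json

# Step states after which polling can stop; only COMPLETED means success.
TERMINAL_STATES = {"COMPLETED", "CANCELLED", "FAILED", "INTERRUPTED"}


def step_state(describe_step_json):
    """Extract the step state from the JSON text printed by describe-step."""
    return json.loads(describe_step_json)["Step"]["Status"]["State"]


def should_keep_polling(describe_step_json):
    """Return True while the step has not reached a terminal state."""
    return step_state(describe_step_json) not in TERMINAL_STATES


if __name__ == "__main__":
    # Hypothetical, shortened response used only to exercise the helpers.
    sample = '{"Step": {"Name": "Word Count", "Status": {"State": "RUNNING"}}}'
    print(step_state(sample), should_keep_polling(sample))
```

A wrapper loop would call the CLI, pass its output to should_keep_polling, sleep for a few seconds and try again, and finally check that the last state really is COMPLETED rather than FAILED.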

Fetch The Results and Verify

aws s3 sync s3://emr-demo-sree/output wordcount/output
download: s3://emr-demo-sree/output/_SUCCESS to wordcount/output/_SUCCESS
download: s3://emr-demo-sree/output/part-00000 to wordcount/output/part-00000
download: s3://emr-demo-sree/output/part-00002 to wordcount/output/part-00002
download: s3://emr-demo-sree/output/part-00001 to wordcount/output/part-00001

The output folder from the S3 bucket has the aggregated word counts!
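
The counts are spread over several part files, so a quick way to verify them is to merge the files locally. Here is a small sketch that does this, assuming each line of a part file has the form word, a tab, then the count (the usual layout of streaming output with the aggregate reducer). The function name and the wordcount/output path are just the ones used in this demo.

```python
from collections import Counter
from pathlib import Path


def load_counts(output_dir):
    """Merge all part-* files in output_dir into a single word -> count mapping."""
    totals = Counter()
    for part in sorted(Path(output_dir).glob("part-*")):
        with part.open(encoding="utf-8") as handle:
            for line in handle:
                word, tab, count = line.rstrip("\n").partition("\t")
                if not tab:
                    continue  # skip blank or malformed lines without a tab
                totals[word] += int(count)
    return totals


if __name__ == "__main__":
    counts = load_counts("wordcount/output")
    # Show the ten most frequent words across all part files.
    for word, count in counts.most_common(10):
        print(word, count)
```

The empty _SUCCESS marker file is ignored because it does not match the part-* pattern.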


Terminate the Cluster

aws --region ap-south-1 emr terminate-clusters --cluster-ids $cid

The Python program used in the above demo, provided by AWS, is as follows

import sys
import re
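
To give an idea of what a mapper for the aggregate reducer has to do, here is a minimal Python 3 sketch. It is an illustration written for this post, not the AWS sample file: the word pattern and the function names are assumptions. The one fixed convention is the LongValueSum: prefix, which tells Hadoop's aggregate reducer to sum the values emitted for each key.

```python
import io
import re

# Assumed definition of a word: a letter followed by letters or digits.
WORD = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def map_line(line):
    """Turn one input line into 'LongValueSum:<word><TAB>1' records."""
    return [f"LongValueSum:{word.lower()}\t1" for word in WORD.findall(line)]


def run_mapper(stream_in, stream_out):
    """Read lines from stream_in and write one record per word to stream_out."""
    for line in stream_in:
        for record in map_line(line):
            stream_out.write(record + "\n")


if __name__ == "__main__":
    # Demo on in-memory text. As a real streaming mapper, the script would
    # call run_mapper(sys.stdin, sys.stdout) instead.
    demo_out = io.StringIO()
    run_mapper(io.StringIO("Hello EMR, hello world\n"), demo_out)
    print(demo_out.getvalue(), end="")
```

Each worker runs the mapper on its own chunk of the input, and the aggregate reducer adds up the 1s per word to produce the final counts.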

Though this example comes from AWS itself, I hope this tiny demo helps you clearly understand what EMR is through one specific use case. Please follow me for more such tiny demos! Thank you for your time.
