Configure and install AWS EMR

Aymen Lamara · WeScale · Feb 14, 2020

In this short article I would like to share my experience setting up EMR. The goal is to explain the key points of this service and help you launch it quickly. The source code is available [here].

EMR, or Elastic MapReduce, is an AWS managed service that acts like a toolbox. It lets you easily launch a cloud-native big data platform using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Presto, etc. For more details, you can find the AWS documentation [here].

Many thanks:

I would like to begin by thanking Christian Nuss, the author of this [project]: it was my starting point and it helped me a lot to implement mine. So many thanks!

Ready!!!

Let me introduce the toolbox I used for this project:

  • AWS account
  • Terraform 0.11.14
  • Ansible 2.8.3
  • AWS EMR 5.23.0

Simple, right … ?!

To start, I made a diagram that shows the whole infrastructure I am going to build; of course, this infrastructure is customizable to your liking.

The different key points in this infrastructure are:

Storage:

I use a first S3 bucket, called bootstrap, which holds the bootstrap scripts for the cluster; this is the only way to pass init scripts to [EMR].

A second bucket, logs, serves as the log store; it contains all the bootstrap logs from before and after EMR starts.

I also use EBS volumes attached to each node instance. You can choose between global storage, such as an S3 bucket with EMRFS, so your data can easily be reused by other resources, and local storage, such as EBS volumes with HDFS, so the data is only available inside the cluster.
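To give an idea, the s3 module boils down to two buckets along these lines (the bucket names and the project variable are placeholders of mine, not the exact ones from the repository):

resource "aws_s3_bucket" "bootstrap" {
  # Holds the init scripts that EMR runs at bootstrap time
  bucket = "${var.project}-emr-bootstrap"
  acl    = "private"
}

resource "aws_s3_bucket" "logs" {
  # Collects the bootstrap and cluster logs
  bucket = "${var.project}-emr-logs"
  acl    = "private"
}

The cluster then points its log_uri at the logs bucket and its bootstrap actions at scripts uploaded to the bootstrap bucket.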

Encryption:

EMR supports two types of encryption:

At rest:

This covers all the storage backends, such as S3 and EBS volumes. For S3 I use server-side encryption; it is free and does the job. For EBS I use a KMS key, which is the only way to encrypt data for this service.

In transit:

This covers data in transit between the cluster nodes. For this I use a self-signed certificate.
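Both settings end up in the EMR security configuration, rendered from security_configuration.json.tpl in the emr module. Here is a minimal sketch of what it can look like, assuming SSE-S3 for S3, a KMS key for the local disks and a PEM bundle stored in the bootstrap bucket (the variable names and the certificate path are illustrative, not the exact ones from the repository):

resource "aws_emr_security_configuration" "security_configuration" {
  name = "emr-default"

  configuration = <<EOF
{
  "EncryptionConfiguration": {
    "EnableAtRestEncryption": true,
    "AtRestEncryptionConfiguration": {
      "S3EncryptionConfiguration": { "EncryptionMode": "SSE-S3" },
      "LocalDiskEncryptionConfiguration": {
        "EncryptionKeyProviderType": "AwsKms",
        "AwsKmsKey": "${var.kms_key_arn}"
      }
    },
    "EnableInTransitEncryption": true,
    "InTransitEncryptionConfiguration": {
      "TLSCertificateConfiguration": {
        "CertificateProviderType": "PEM",
        "S3Object": "s3://${var.bootstrap_bucket}/certs/emr-certs.zip"
      }
    }
  }
}
EOF
}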

Data processing:

The cluster I used contains one master and two core nodes; I will explain this choice in detail later. To learn more about the role of each component, check out this [link].
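As a sketch, the cluster resource in the emr module describes this topology roughly as follows (the instance types, EBS sizes and variable names are my assumptions, not the exact values from the repository):

resource "aws_emr_cluster" "cluster" {
  name          = "${var.cluster_name}"
  release_label = "emr-5.23.0"
  applications  = ["Hadoop", "Spark", "Hive"]

  ec2_attributes {
    subnet_id        = "${var.private_subnet_id}"
    instance_profile = "${var.emr_instance_profile_arn}"
  }

  master_instance_group {
    instance_type = "m5.xlarge"
  }

  core_instance_group {
    instance_type  = "m5.xlarge"
    instance_count = 2

    ebs_config {
      size                 = 64
      type                 = "gp2"
      volumes_per_instance = 1
    }
  }

  log_uri                = "s3://${var.logs_bucket}/"
  service_role           = "${var.emr_service_role_arn}"
  security_configuration = "${aws_emr_security_configuration.security_configuration.name}"
  termination_protection = false
}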

Autoscaling:

Elasticity is one of my favorite features of the cloud. For cost reasons I want my cluster to scale out whenever there is a heavy workload, and to scale back in afterwards. YARNMemoryAvailablePercentage is the metric most commonly used for this case; it indicates the percentage of memory still available across the whole cluster. In my configuration I used two rules:

  • When the available YARN memory is below 15% for one 5-minute window, the cluster adds two nodes at a time.
  • When the available YARN memory is above 75% for one 5-minute window, the cluster removes one node after each positive check.

You can check the configuration in detail in the emr layer; a sketch of the rendered policy follows the tree below:

│   ├── emr
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   ├── templates
│   │   │   ├── autoscaling_policy.json.tpl
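Rendered, autoscaling_policy.json.tpl produces a policy roughly like the following; the min/max capacities, rule names and cooldowns are assumptions of mine, only the two thresholds come from the rules above:

{
  "Constraints": { "MinCapacity": 2, "MaxCapacity": 10 },
  "Rules": [
    {
      "Name": "ScaleOutOnLowYarnMemory",
      "Action": {
        "SimpleScalingPolicyConfiguration": {
          "AdjustmentType": "CHANGE_IN_CAPACITY",
          "ScalingAdjustment": 2,
          "CoolDown": 300
        }
      },
      "Trigger": {
        "CloudWatchAlarmDefinition": {
          "ComparisonOperator": "LESS_THAN",
          "EvaluationPeriods": 1,
          "MetricName": "YARNMemoryAvailablePercentage",
          "Namespace": "AWS/ElasticMapReduce",
          "Period": 300,
          "Statistic": "AVERAGE",
          "Threshold": 15,
          "Unit": "PERCENT"
        }
      }
    },
    {
      "Name": "ScaleInOnHighYarnMemory",
      "Action": {
        "SimpleScalingPolicyConfiguration": {
          "AdjustmentType": "CHANGE_IN_CAPACITY",
          "ScalingAdjustment": -1,
          "CoolDown": 300
        }
      },
      "Trigger": {
        "CloudWatchAlarmDefinition": {
          "ComparisonOperator": "GREATER_THAN",
          "EvaluationPeriods": 1,
          "MetricName": "YARNMemoryAvailablePercentage",
          "Namespace": "AWS/ElasticMapReduce",
          "Period": 300,
          "Statistic": "AVERAGE",
          "Threshold": 75,
          "Unit": "PERCENT"
        }
      }
    }
  ]
}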

Access Management with IAM:

The cluster needs different IAM roles for different jobs:

  • To access the S3 buckets, for bootstrap scripts and logs.
  • To launch additional EC2 instances when autoscaling is triggered.
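As an illustration of the service role side, the iam.tf of the sec module can be sketched like this, using the AWS managed policy (the role name is a placeholder of mine; the EC2 instance profile and the autoscaling role are built the same way):

data "aws_iam_policy_document" "emr_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["elasticmapreduce.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "emr_service" {
  name               = "emr-service-role"
  assume_role_policy = "${data.aws_iam_policy_document.emr_assume.json}"
}

resource "aws_iam_role_policy_attachment" "emr_service" {
  role       = "${aws_iam_role.emr_service.name}"
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole"
}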

Network:

  • I want my cluster to run in a private subnet, so I created a VPC with private and public subnets, NAT gateways, an internet gateway, route tables, security groups, etc. Classic stuff; all the details are shown below.
  • Bastion host: this EC2 instance is used for administration. It serves as a jump host to SSH into the different EMR instances. You can also use other solutions such as the SSM agent, a VPN, etc.
  • Security groups: the EMR cluster instances accept traffic only from the bastion and the ALB, while the bastion and the ALB are internet-facing with a restricted IP address range (see the sketch after this list).
  • DNS and ALB: the EMR cluster can run different services that expose web interfaces. I used an ALB with different listeners, combined with different Route53 records, in order to route traffic to the right service.
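For example, the ingress side of the EMR security group can be expressed with rules like these (the ports and the variables holding the security group IDs are illustrative):

resource "aws_security_group_rule" "emr_ssh_from_bastion" {
  # SSH access only from the bastion host
  type                     = "ingress"
  from_port                = 22
  to_port                  = 22
  protocol                 = "tcp"
  security_group_id        = "${var.emr_sg_id}"
  source_security_group_id = "${var.bastion_sg_id}"
}

resource "aws_security_group_rule" "emr_ui_from_alb" {
  # Example: the YARN ResourceManager web UI, reachable only through the ALB
  type                     = "ingress"
  from_port                = 8088
  to_port                  = 8088
  protocol                 = "tcp"
  security_group_id        = "${var.emr_sg_id}"
  source_security_group_id = "${var.alb_sg_id}"
}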

Showtime:

In this part we will walk through the different Terraform layers I used to build this project.

The code base is organized as follows:

- network-layer
- bastion-layer
- emr-layer
- dns-layer

1- Network layer:

This layer is the base of all the others; it hosts all the network components.

I used a classic VPC layout: three public subnets, three private subnets, one internet gateway, one NAT gateway (only one is needed in my case), route tables, and security groups: one for the bastion host, one for the ALB, one for the EMR instances, etc.

The code for this part is quite simple: one main file calls a single network module that creates all the network components.

├── backend.tf
├── errored.tfstate
├── gateway.tf
├── inputs.tf
├── outputs.tf
├── provider.tf
├── routes.tf
├── subnets.tf
├── vars.tf
└── vpc.tf
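To give an idea, vpc.tf and subnets.tf boil down to resources like these (the CIDR and availability-zone variables are placeholders of mine):

resource "aws_vpc" "main" {
  cidr_block           = "${var.vpc_cidr}"
  enable_dns_support   = true
  enable_dns_hostnames = true
}

resource "aws_subnet" "private" {
  # One private subnet per availability zone
  count             = 3
  vpc_id            = "${aws_vpc.main.id}"
  cidr_block        = "${element(var.private_subnet_cidrs, count.index)}"
  availability_zone = "${element(var.availability_zones, count.index)}"
}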

2- Bastion layer:

This layer creates a bastion host with an elastic IP, which lets me SSH into the different machines of the platform for administrative purposes. The main.tf file calls the different modules:

  • EC2: to create the EC2 instance
  • SGS: to create the security groups attached to the bastion.
  • DNS: to create a DNS entry for the bastion, which is more human readable. It also gives you a stable entry even if the elastic IP changes when this Terraform layer is destroyed and applied again.
├── backend.tf
├── generated
│   └── ssh
│       ├── bastion-default
│       └── bastion-default.pub
├── inputs.tf
├── main.tf
├── modules
│   ├── dns
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── vars.tf
│   ├── ec2
│   │   ├── eip.tf
│   │   ├── main.tf
│   │   ├── output.tf
│   │   ├── ssh.tf
│   │   └── vars.tf
│   └── sgs
│       ├── main.tf
│       ├── outputs.tf
│       └── vars.tf
├── outputs.tf
├── provider.tf
└── vars.tf
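Put together, the ec2 and dns modules amount to something like this (the AMI, instance type, key name and zone variables are assumptions of mine):

resource "aws_instance" "bastion" {
  ami                    = "${var.bastion_ami_id}"
  instance_type          = "t3.micro"
  subnet_id              = "${var.public_subnet_id}"
  key_name               = "${var.ssh_key_name}"
  vpc_security_group_ids = ["${var.bastion_sg_id}"]
}

resource "aws_eip" "bastion" {
  vpc      = true
  instance = "${aws_instance.bastion.id}"
}

resource "aws_route53_record" "bastion" {
  # Keeps a stable name even if the elastic IP is recreated
  zone_id = "${var.zone_id}"
  name    = "bastion.${var.domain_name}"
  type    = "A"
  ttl     = 300
  records = ["${aws_eip.bastion.public_ip}"]
}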

3- EMR layer:

This layer is used to create all the EMR resources; the main.tf file calls the different components in different modules:

  • Bootstrap: for the bootstrap scripts
  • Security: for IAM policies and roles
  • EMR: to create the nodes of the cluster.

Note that this is the same hierarchy as in the project mentioned above, adapted to my diagram.

├── backend.tf
├── config.tf
├── generated
│   └── ssh
│       ├── default
│       └── default.pub
├── inputs.tf
├── main.tf
├── modules
│   ├── bootstrap
│   │   ├── files
│   │   │   ├── log4j_spark.properties
│   │   │   ├── logrotate
│   │   │   ├── logrotate.sh
│   │   │   └── syslog.conf
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   ├── templates
│   │   │   └── configure-system.sh.tpl
│   │   └── variables.tf
│   ├── emr
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   ├── templates
│   │   │   ├── autoscaling_policy.json.tpl
│   │   │   ├── configuration.json.tpl
│   │   │   └── security_configuration.json.tpl
│   │   └── variables.tf
│   ├── s3
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── variables.tf
│   ├── sec
│   │   ├── certs.tf
│   │   ├── iam.tf
│   │   ├── outputs.tf
│   │   ├── ssh.tf
│   │   └── variables.tf
│   └── sgs
│       ├── main.tf
│       ├── outputs.tf
│       └── variables.tf
├── outputs.tf
├── provider.tf
└── variables.tf
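The wiring in main.tf then looks roughly like this, with each module feeding the next (the module inputs and outputs shown here are naming assumptions of mine, not the exact ones from the repository):

module "s3" {
  source = "./modules/s3"
}

module "sec" {
  source           = "./modules/sec"
  bootstrap_bucket = "${module.s3.bootstrap_bucket_id}"
}

module "bootstrap" {
  source           = "./modules/bootstrap"
  bootstrap_bucket = "${module.s3.bootstrap_bucket_id}"
}

module "emr" {
  source               = "./modules/emr"
  subnet_id            = "${var.private_subnet_id}"
  bootstrap_bucket     = "${module.s3.bootstrap_bucket_id}"
  logs_bucket          = "${module.s3.logs_bucket_id}"
  emr_service_role_arn = "${module.sec.emr_service_role_arn}"
  instance_profile_arn = "${module.sec.emr_instance_profile_arn}"
}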

4- DNS layer:

This layer creates an ALB that routes traffic to the different services hosted on the EMR master, based on listeners and path filtering. It also creates the different DNS entries matching each service in the cluster. The main.tf file calls the other modules:

  • ALB: to create the load balancer and listeners
  • DNS: to create a private hosted zone with DNS entries for each service.
├── backend.tf
├── config.tf
├── inputs.tf
├── main.tf
├── modules
│   ├── alb
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   ├── target.tf
│   │   └── variables.tf
│   ├── dns
│   │   ├── dns_instances.tf
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── variables.tf
│   └── sgs
│       ├── main.tf
│       ├── outputs.tf
│       └── variables.tf
├── outputs.tf
├── provider.tf
└── variables.tf
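For example, routing a hostname to the Spark UI on the EMR master through the ALB can be sketched like this (the hostnames, ports, zone and the variables carrying the ALB attributes are illustrative, not the exact ones from the repository):

resource "aws_lb_listener_rule" "spark_ui" {
  listener_arn = "${var.alb_listener_arn}"
  priority     = 10

  action {
    type             = "forward"
    target_group_arn = "${var.spark_ui_target_group_arn}"
  }

  condition {
    field  = "host-header"
    values = ["spark.${var.domain_name}"]
  }
}

resource "aws_route53_record" "spark_ui" {
  # Alias record in the private hosted zone pointing at the ALB
  zone_id = "${var.private_zone_id}"
  name    = "spark.${var.domain_name}"
  type    = "A"

  alias {
    name                   = "${var.alb_dns_name}"
    zone_id                = "${var.alb_zone_id}"
    evaluate_target_health = false
  }
}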

Conclusion:

Here is your cluster, now ready to be used. I would like to conclude with the feedback from this experience:

Limits that I have encountered:

Well, this was my first experience with big data products on AWS, and I noticed some negative points during the implementation; they are listed here:

  • I could not use the instance fleets option for autoscaling; this option was not yet implemented in Terraform 0.11. This feature is very important: it lets you use spot instances with autoscaling, and you can declare up to five instance types so the cluster can choose between them while scaling if one type is not available.
  • I could not use "Task" instances. On "Core" nodes, Spark uses HDFS to store data within the cluster, whereas task nodes do not need local storage. This speeds up scaling out and in, since task nodes do not have to wait for the HDFS processes to finish.
  • In the default EMR configuration, "Core" instances are the default, and there is no way to change or configure this parameter before the cluster starts. You have to change it after it starts, and honestly I do not know the consequences; I did not go that far.
  • I could not change the default "Tez" engine and use "Spark" instead with "Hive"; I think this was a version [compatibility] problem between "Hive" and "Spark". The issue is that you do not have full control over choosing the right version of each product: EMR comes as a box with different products at fixed [versions], and unfortunately you cannot make these versions dynamic.
  • The cluster runs in only one AZ, as you can see on the previous diagram, which means that in case of a disaster you lose your cluster. So you may have to think about how you use it, for example starting the cluster only when you need to process a large amount of data and shutting it down afterwards. This is why the API exposes the termination_protection = true/false parameter.
  • terraform destroy always fails with this error:
module.emr_jobs.aws_emr_security_configuration.security_configuration (destroy): 1 error occurred: aws_emr_security_configuration.security_configuration: InvalidRequestException: Security configuration 'emr-default' cannot be deleted because it is in use by active clusters.

Terraform tries to delete the security configuration while the cluster deletion is still in progress; this is a Terraform issue. The workaround for this error is to export TF_WARN_OUTPUT_ERRORS=1 and run terraform destroy again.

Another option is to use a local-exec provisioner with a sleep, which waits until the EMR cluster is completely deleted before deleting the security configuration.
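A possible sketch of that workaround, assuming the resource names used in this project: a null_resource sits between the cluster and the security configuration in the dependency graph, and its destroy-time provisioner simply sleeps; the cluster resource then declares depends_on = ["null_resource.wait_emr_termination"] so the pause runs between the two deletions.

resource "null_resource" "wait_emr_termination" {
  # Destroyed after the cluster and before the security configuration,
  # so the sleep gives EMR time to terminate completely.
  depends_on = ["aws_emr_security_configuration.security_configuration"]

  provisioner "local-exec" {
    when    = "destroy"
    command = "sleep 300"
  }
}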

To improve:

Here are the different improvement points I noticed while implementing this project:

In the different layers, I have a folder that contains all the modules, and in each module there is some code repetition. These parts could be factored into sub-modules to look better; for me it was not that important, I aimed for simplicity.

The different points I mentioned before about instance fleets, task instances, etc. would be very valuable to have in Terraform, especially for cost optimization.

In the end:

Even if the AWS EMR service still has some compatibility issues, it lets you go fast and have a cluster up within 10~15 minutes, with products like Spark, Hadoop, Hive, etc. already installed, and this is really awesome.

I highly recommend this [link] to get a deeper understanding of how EMR works.

I am happy to share this experience with you, and I hope that it will help you to go faster in your AWS EMR projects.

Special thanks to my friends from The WeTribu who have patiently reviewed this article.
