Set up a cluster of EC2 instances with AWS CloudFormation

A stack, by Stephen Walker (via Unsplash)

This article is part of a series of posts where I describe how to build a 3-node Hadoop cluster on AWS. Still, most of its content is useful if you want to learn more about EC2, Security Groups, and especially CloudFormation, without being concerned with Hadoop at all.

We’ll take advantage of AWS CloudFormation to build our infrastructure as a classic programming project. This allows us to keep the state of our infrastructure version-controlled under git, and to simply define our machines, IPs, Security Groups, etc. (all of these components together being called a stack) as a YAML configuration file. If your infrastructure is defined as code, it becomes much easier to maintain and collaborate on!

CloudFormation has a pretty steep learning curve, not because of its inherent complexity, but because of the breadth of possibilities it offers. The point of this article is not to get into CloudFormation's details, but if you want to dig deeper into the topic, here is a nice introduction.

Oh, and before I forget: we set up a GitHub repo related to this article.

Components of our CloudFormation Stack

3 EC2 instances

These will be the three pieces of our little cluster: one Master and two Slaves. Hadoop being a big piece of software, I recommend not using an instance with less than 4GB of memory (consider 8GB if you can afford it).

3 Elastic IP addresses and their EC2 association

We don’t necessarily want our cluster to run permanently. However, by default, EC2 instances get a different public IP at every reboot, which can be annoying if you plan to bind the IPs to one of your domains (e.g., to easily access your cluster’s web UI). Elastic IPs let you keep the same public IP for the life of a given EC2 instance, at a small cost overhead.
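The full template in the repo covers this, but as a minimal sketch, here is how one Elastic IP and its association with the Master node could be declared (the resource names MasterEIP and MasterEIPAssociation are illustrative; MasterNode is the EC2 instance we define later in the template):

MasterEIP:
  Type: AWS::EC2::EIP
MasterEIPAssociation:
  Type: AWS::EC2::EIPAssociation
  Properties:
    InstanceId: !Ref MasterNode
    AllocationId: !GetAtt MasterEIP.AllocationId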

3 Security Groups

- hadoop-master: port configuration specific to Hadoop’s Master node
- hadoop-slave: port configuration specific to Hadoop’s Slave nodes
- ssh: a simple rule to authorize SSH-ing into your instances

Security Groups allow you to control network access to your machines. You can think of each one as an OS-independent set of firewall rules, the advantage being that you only have to define the rules of a Security Group once, and can then easily associate it with any machine within your EC2 pool.
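For instance, a minimal sketch of the ssh group could look like the following, assuming the VPNPublicIp parameter we introduce later in this article, so that only requests coming from our VPN can open SSH connections:

SSHSecurityGroup:
  Type: AWS::EC2::SecurityGroup
  Properties:
    GroupName: ssh
    GroupDescription: Authorize SSH access to the instances.
    SecurityGroupIngress:
      - CidrIp: !Ref VPNPublicIp
        IpProtocol: tcp
        Description: SSH access from our VPN
        FromPort: 22
        ToPort: 22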

Our stack’s template

Now that we have a better idea of the components of our infrastructure, we can start defining it within a configuration file, called a template: this template will later be used to provision all the services it defines (the stack). To build my template I will use YAML, but JSON is fine too: use whichever you prefer.

To start, you will need to know how to describe any AWS resource you want to set up, so that it can be understood by the CloudFormation runtime engine. The official documentation is, as usual, your best friend; use it intensively!

As you can see, a well-formatted template usually starts with the following keys:

AWSTemplateFormatVersion: 2010-09-09
Description: A 3-node Hadoop cluster and its associated security groups

Then, we are going to use two other keys: Resources (to describe the actual resources of the stack; it’s actually the only required key) and Parameters (so we can give values for some resource attributes at runtime). More info here.

Here is an example of how we describe one of our EC2 machines:

Resources:
  MasterNode:
    Type: AWS::EC2::Instance
    Properties:
      KeyName: !Ref KeyName
      InstanceType: !Ref MasterInstanceType
      ImageId: !Ref BaseImage
      SecurityGroups:
        - !Ref MasterSecurityGroup
        - !Ref SSHSecurityGroup
      Tags:
        - Key: Name
          Value: Hadoop-Master

An important thing to notice is the Type key: this is where we describe, in the CloudFormation language, what type of AWS resource we are referring to. This, of course, has to be properly formatted, and you can find a full reference of all possible types here. Since we are describing an EC2 instance, the type to use is AWS::EC2::Instance.

Also, as you can see, we use several references with the keyword !Ref. Some of them refer to other resources (e.g., the two Security Groups), while others refer to Parameters. Those parameters can be given at runtime when instantiating a new stack, or you can give them default values. Here is a snippet of our Parameters key, defining two parameters: the key name (referring to an SSH private key) with which we can access our EC2 instance, and the type of instance we want to spin up.

Parameters:
  KeyName:
    Description: EC2 key-pair to SSH on instance
    Type: AWS::EC2::KeyPair::KeyName
  MasterInstanceType:
    Description: EC2 instance type for Master node
    Type: String
    Default: t3.medium

You can give those parameters any names you want; the only condition is that you refer to them in the template by their chosen names.

A word about SecurityGroups

I like to configure Security Groups directly in the template, as it ties a particular stack to pre-defined network rules. To create the rules, you first need to know which ports are going to be used by your particular applications.

We won’t go into much detail here, as it is the topic of another article, but here are the default HDFS ports in Hadoop 3 for the Master node (or NameNode): 9870 (web UI), 9868 (Secondary NameNode), 9000 (metadata service, which can also be on port 8020).

Hence, here is a snippet of our SecurityGroup for the Master node instance.

MasterSecurityGroup:
  Type: AWS::EC2::SecurityGroup
  Properties:
    GroupName: hadoop-master
    GroupDescription: Configure ports to allow basic Hadoop usage.
    SecurityGroupIngress:
      - CidrIp: !Ref VPNPublicIp
        IpProtocol: tcp
        Description: Web UI for HDFS Name node
        FromPort: 9870
        ToPort: 9870
      - CidrIp: !Ref VPCCidr
        IpProtocol: tcp
        Description: HDFS metadata operations
        FromPort: 9000
        ToPort: 9000
      ... (to be continued)

You will note that we only allow requests from known IP addresses. The web interface should be accessible from a browser outside of our EC2 network, so we use a VPN on our laptops so that our requests are always identified as legitimate. For the metadata service, we only authorize machines inside our VPC (i.e., our cluster). To give us some flexibility (and also to avoid committing those values to git), we refer to them as Parameters that will be given at runtime: VPNPublicIp and VPCCidr.
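For completeness, here is a sketch of how those two entries could be declared under the Parameters key; the descriptions (and the /32 notation for a single IP) are assumptions for illustration, not taken from the repo:

VPNPublicIp:
  Description: Public IP of our VPN, in CIDR notation (e.g., x.x.x.x/32)
  Type: String
VPCCidr:
  Description: CIDR block of the VPC hosting the cluster
  Type: String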

Deploying the stack

We now have our template (full version can be found here).

The rest is pretty straightforward, provided that you already have an AWS account and the AWS CLI installed with the appropriate credentials. To set the appropriate authorizations among the Security Groups, you also need an existing VPC within your EC2 pool. As previously mentioned, we will need the CIDR (IP address block) of this VPC to authorize connections between our cluster nodes.
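If you are not sure which CIDR your VPC uses, you can look it up with the AWS CLI:

aws ec2 describe-vpcs --query "Vpcs[].[VpcId,CidrBlock]" --output table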

In a terminal, enter the following command:

aws cloudformation create-stack \
  --template-body file://path/to/template.yaml \
  --stack-name hadoop-cluster \
  --parameters \
    ParameterKey=KeyName,ParameterValue=my_SSH_key_name \
    ParameterKey=VPCCidr,ParameterValue=my_VPC_CIDR \
    ParameterKey=VPNPublicIp,ParameterValue=my_VPN_public_IP

And that’s all! After a few minutes’ wait, you will have your 3 instances created, with fixed IP addresses and the appropriate Security Groups.
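If you prefer to follow the creation from the terminal rather than the console, two standard CLI calls can help (the stack name matches the one used above):

# Block until the stack reaches CREATE_COMPLETE (or fail if creation failed)
aws cloudformation wait stack-create-complete --stack-name hadoop-cluster

# Inspect the stack, its parameters and its status
aws cloudformation describe-stacks --stack-name hadoop-cluster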

Note: if you prefer using the AWS console instead of the CLI, you can simply copy/paste the template into the wizard of the CloudFormation creation tool.

This concludes our little tutorial on setting up a 3-instance EC2 pool, using AWS CloudFormation.

Now we are ready to start configuring our Hadoop cluster! The following posts are coming shortly…

Any feedback on what you just read? Or other topics you would like to learn about? Leave us a comment or drop us a message: contact [at] nibble [dot] ai.
