Integration of Hadoop with AWS | Launching Multi-Node Clusters

Satyam Parashar
7 min read · Oct 7, 2020


Hadoop, our favourite elephant, is an open-source framework that allows you to store and analyse big data across clusters of computers.

What is a Distributed File System?

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.

NameNode and DataNodes

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files.

Data Replication

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.

The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.
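For example, the replication factor of an existing file can be changed from the command line; a minimal sketch, where /example.txt is just a placeholder path:

hadoop fs -setrep -w 2 /example.txt    # -w waits until the file actually has 2 replicas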

THINGS ONE SHOULD KNOW BEFORE PROCEEDING WITH THE PRACTICAL

Cache memory is an extremely fast memory type that acts as a buffer between RAM and the CPU. It holds frequently requested data and instructions so that they are immediately available to the CPU when needed. Cache memory is used to reduce the average time to access data from the Main memory.

PuTTY is a free and open-source terminal emulator, serial console and network file-transfer application.

• It supports many variations on the secure remote terminal and gives the user control over the SSH encryption key and protocol version.

• To convert a .pem file to a .ppk file we use the PuTTYgen tool.

• To transfer files over a network the SCP protocol is used; the WinSCP software uses this protocol.
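As a command-line equivalent of the WinSCP transfer, SCP can also be used directly; a minimal sketch, where the key file, package names and IP are placeholders:

scp -i mykey.pem hadoop-1.2.1-1.x86_64.rpm jdk-8u171-linux-x64.rpm ec2-user@<instance-public-ip>:~/    # copy the installers to the instance over SSH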

• File used to clean the cache: /proc/sys/vm/drop_caches

• To clean the cache: echo 3 > /proc/sys/vm/drop_caches

• On the NameNode itself, a neutral bind address (0.0.0.0) is used; since the public IP of one system cannot connect to the private IP of another, the DataNodes need the master's public IP to connect.

• In the DataNode configuration file, the public IP of the NameNode is therefore written to establish the connection.
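For reference, a minimal way to flush the cache on a node (run as root; writing 1 drops only the page cache, 2 drops dentries and inodes, 3 drops both):

sync                                 # write dirty pages to disk first
echo 3 > /proc/sys/vm/drop_caches    # then drop the caches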

HANDS-ON HADOOP:

✴ Set up a Hadoop cluster with one NameNode (Master), 4 DataNodes (Slaves) and one Client Node.

✴ Upload a file through the client to the NameNode.

✴ Check which DataNode the Master chooses to store the file.

✴ Once uploaded, try to read the file through the client using the cat command. While the Master is accessing the file from the DataNode where it stored it, delete or crash that DataNode and see how, with the help of the replicated storage, the Master still retrieves the file and presents it to the client.

REQUIREMENTS:

  • We used a virtual machine to launch one ClientNode and an AWS instance to launch the second ClientNode.
  • AWS EC2 service to launch 4 instances.
  • In total: one Master, 2 Client and 2 Slave nodes.
  • The instances run Red Hat Enterprise Linux 8.
  • Hadoop 1.2.1 and jdk-8u171 software on top of all these nodes.
  • PuTTY software for connecting to the AWS instances.

Let’s begin our interesting Hadoop tutorial.

STEP 1: LAUNCHING REDHAT INSTANCES ON AWS

We then used WinSCP to transfer the Hadoop and JDK files to be installed on our instances.

Transfer of Hadoop files.
Installation of Hadoop on an AWS instance.

We used the PuTTY software to connect to the instances.
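The installation itself boils down to two RPM installs on every node; a minimal sketch, assuming these are the file names of the packages transferred with WinSCP:

rpm -ivh jdk-8u171-linux-x64.rpm      # install the JDK first
rpm -ivh hadoop-1.2.1-1.x86_64.rpm    # then Hadoop 1.2.1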

STEP 2: CONFIGURING THE NAMENODE AND DATANODES

After installing Hadoop on all nodes and the VM, we now configure the hdfs-site.xml and core-site.xml files on all nodes except the client instances, which are configured in Step 3.

hdfs-site.xml configuration for the NameNode:

<configuration>
<property>
<name>dfs.name.dir</name>
<value>/nn</value>
</property>
</configuration>

core-site.xml configuration for the NameNode (the neutral address 0.0.0.0 is used only on the master node):

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://0.0.0.0:9001</value>
</property>
</configuration>

Run this command only on the master to format the NameNode storage:


hadoop namenode -format

FOR DATANODES: we will use the master's public IP.

hdfs-site.xml configuration for all the DataNodes:

<configuration>
<property>
<name>dfs.data.dir</name>
<value>/dn</value>
</property>
</configuration>

core-site.xml configuration for all the DataNodes and the Client (here use the public IPv4 AWS address of the master node in place of <Master-Public-IP>):

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://<Master-Public-IP>:9001</value>
</property>
</configuration>

STEP 3: SETTING UP CLIENT NODE

On the client node we configure the core-site.xml file.

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://IPOFMASTER:9001</value>
</property>
</configuration>

Our master IP: 13.232.30.72
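A quick sanity check from the client, assuming the master is reachable on port 9001, is to list the HDFS root:

hadoop fs -ls /    # should answer from the NameNode without errors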

STEP 4: LET’S RUN!!

Now we will start the Hadoop daemons.

RUNNING NAMENODE:
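Roughly, this is the standard Hadoop 1.x daemon script on the master:

hadoop-daemon.sh start namenode    # start the NameNode daemon on the master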

RUNNING DATANODE:

IN ALL DATANODES WE RUN THIS COMMAND
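Roughly, the same daemon script starts the DataNode service on each slave:

hadoop-daemon.sh start datanode    # run on every DataNode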

TO CHECK IF THE DATANODE PROCESS IS RUNNING WE RUN THE COMMAND

jps

FROM MASTER POINT OF VIEW:

WE CONNECTED 2 DATANODES.
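From the master, the same information can be pulled on the CLI with the dfsadmin report, which lists the live and dead DataNodes:

hadoop dfsadmin -report    # shows configured capacity plus live/dead DataNodes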

CLIENT NODES

A client in Hadoop is a user machine through which we can work with the NameNode's storage, which is the aggregation of the DataNodes. Through the client we can perform many activities, such as uploading, downloading and deleting files, and many more.

Any computer system is a client to a Hadoop cluster if it uses the cluster to store the data.

One has to install the Hadoop software on a client and configure the IP address and port number of the NameNode on it.

To configure a client in a Hadoop cluster, update the core-site.xml file with the fs.default.name property, the hdfs:// protocol, and the master node's IP with the port number.

Running ClientNodes.

STEP 5: UPLOADING A FILE

Upload a file through the client to the NameNode.
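A minimal sketch of the upload and read-back from a client, assuming a hypothetical local file notes.txt:

hadoop fs -put notes.txt /notes.txt    # upload the file into HDFS through the NameNode
hadoop fs -cat /notes.txt              # read it back with cat, as planned in the hands-on goals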

We have 2 clients.

CLIENT 1
CLIENT 2
FILE FROM CLIENT 2

We can check the Hadoop cluster report status through the web UI and also through the CLI. In the web UI we use port number 50070, since the Hadoop NameNode web interface runs on port 50070.
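For example, the NameNode status page can be opened in a browser or fetched with curl; a sketch, assuming the classic Hadoop 1.x UI path and with the master's public IP as a placeholder:

curl http://<master-public-ip>:50070/dfshealth.jsp    # NameNode web UI front page in Hadoop 1.x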

FILE FROM CLIENT 1

STEP 6: STOPPING ONE DATANODE TO CHECK HOW HADOOP COPES WITH DATANODE FAILURE

We then stopped one of our DataNodes and monitored the real-time packet transfers using tcpdump.

For this we have to install tcpdump:

yum install tcpdump -y
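A sketch of the capture we used; the interface name is an assumption, and 50010 is the default DataNode data-transfer port in Hadoop 1.x:

tcpdump -i eth0 -n tcp port 50010    # watch HDFS block transfers in real time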

Stop one instance and the other will keep serving the master node.

In this picture, you can see that only one slave node is connected.

We got one dead DataNode.

We had already installed tcpdump, so let's monitor.

We can see a packet being sent from our stopped node to another DataNode with public IP 112.79.138.14.

TASK COMPLETED!!

Let's conclude.

  • We learned how to set up Hadoop clusters and created DataNodes on AWS using Red Hat instances.
  • Described what a file system is.
  • Explained the reasons to have distributed file systems and how they help big data analysis.
  • Connected DataNodes, clients and the master, all from different locations, with the help of my teammates.
  • A client in a Hadoop cluster is someone who uses and can access the storage. We need to connect the client to the NameNode, and for this we need to configure the client.
  • Whenever we create EC2 instances, the public IP is assigned on demand, so when we stop and start an instance it gets a new IP. We can lose all the DataNodes because of this.
  • We checked how many DataNodes are currently live, how many are dead, and the total number of DataNodes.
  • Successfully uploaded files from 2 clients.
  • Successfully determined the IP of the DataNodes that transfer the data in blocks to the client while fetching file content in real time.

At last, I would like to thank all my teammates for actively participating in and completing this task.

Thanks for reading!
