Play with HDFS & configure YARN

Shehan Fernando
3 min read · Jun 8, 2020


Photo by Les Triconautes on Unsplash

See the previous article on configuring the Hadoop cluster.

Here are a few Hadoop commands that help manage files on HDFS.

DFS admin report
hdfs dfsadmin -report
HDFS filesystem checking utility
hdfs fsck /
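fsck can also report individual files, blocks, and block locations with its standard flags:
hdfs fsck / -files -blocks -locations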
Copy files from the local file system (datasets from https://grouplens.org/datasets/movielens/)
hadoop fs -copyFromLocal /home/shehan/ml-100k/ /user/shehan/
hadoop fs -put /home/shehan/ml-25m/ /user/shehan/
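The inverse direction works with -get (or -copyToLocal) to pull files from HDFS back to the local file system:
hadoop fs -get /user/shehan/ml-25m/movies.csv /home/shehan/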
Hadoop ls
hadoop fs -ls /user/shehan/ml-25m
Remove all files
hadoop fs -rm /user/shehan/ml-25m/*
cat
hadoop fs -cat /user/shehan/ml-25m/movies.csv
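Since dumping a whole dataset to the terminal is rarely useful, it helps to pipe cat through head to peek at the first few lines:
hadoop fs -cat /user/shehan/ml-25m/movies.csv | head -5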
mkdir
hadoop fs -mkdir /user/shehan/ml-25m/tmp
Rename (move)
hadoop fs -mv /user/shehan/ml-25m/tmp /user/shehan/ml-25m/tmp2
Recursive delete
hadoop fs -rm -r /user/shehan/ml-25m/tmp2
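If the HDFS trash is enabled (fs.trash.interval > 0), removed files are first moved to the trash; add -skipTrash to delete them immediately:
hadoop fs -rm -r -skipTrash /user/shehan/ml-25m/tmp2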

You can see the files via WebUI as well.
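
On Hadoop 3.x the NameNode web UI listens on port 9870 by default (50070 on Hadoop 2.x); the file browser is under Utilities → Browse the file system, e.g.:
http://master.hadoop.smf:9870/explorer.html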

Configure YARN

The NameNode (the HDFS daemon) and the ResourceManager (the YARN daemon) are both Java processes. They can reside on the same machine or on different machines, depending on how the cluster is configured.

Open mapred-site.xml and add the following properties inside the <configuration> element:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>

<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>

<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>

“mapreduce.framework.name” can be one of “local”, “classic”, or “yarn”.

  • “classic” stands for the old MRv1 runtime.
  • “yarn” stands for MRv2.
  • In “local” mode, the mapper and reducer processes are executed in the same JVM.
Open yarn-site.xml and add the following properties, again inside <configuration>:
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>master.hadoop.smf</value>
</property>

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>

<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
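
Optionally, you can also cap how much memory each NodeManager offers to YARN containers; a minimal sketch, assuming workers with about 4 GB of RAM (the value below is an assumption, not part of the original setup):

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>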

Copy these updated files to all the nodes.

E.g.:
scp hadoop/etc/hadoop/mapred-site.xml shehan@hadoop-node1:/home/shehan/hadoop/etc/hadoop/
scp hadoop/etc/hadoop/yarn-site.xml shehan@hadoop-node1:/home/shehan/hadoop/etc/hadoop/
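
If you have several workers, a small shell loop avoids repeating the scp commands (the node hostnames below are assumptions; substitute your own):

for node in hadoop-node1 hadoop-node2 hadoop-node3; do
  scp hadoop/etc/hadoop/mapred-site.xml hadoop/etc/hadoop/yarn-site.xml shehan@$node:/home/shehan/hadoop/etc/hadoop/
done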

Start All

Start the Hadoop DFS daemons
start-dfs.sh
Start the YARN daemons
start-yarn.sh
On the master node:
[shehan@master ~]$ jps
3699 ResourceManager
3241 NameNode
3466 SecondaryNameNode
4010 Jps
On each data node:
[shehan@node1 ~]$ jps
2443 NodeManager
2334 DataNode
2558 Jps

Now we can see that the Java processes have been spawned: ResourceManager on the master node and NodeManager on each data node.
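
When you need to shut the cluster down later, the matching stop scripts work the same way, in reverse order:
stop-yarn.sh
stop-dfs.sh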

Available nodes

yarn node -list
2020-06-07 15:05:27,661 INFO client.RMProxy: Connecting to ResourceManager at master.hadoop.smf/192.168.1.200:8032
Total Nodes:3
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
node02.hadoop.smf:42655 RUNNING node02.hadoop.smf:8042 0
node03.hadoop.smf:33534 RUNNING node03.hadoop.smf:8042 0
node01.hadoop.smf:44680 RUNNING node01.hadoop.smf:8042 0
The same information is available in the ResourceManager web UI at http://master.hadoop.smf:8088/.

To make sure everything is working correctly, try running the sample from the Hadoop MapReduce tutorial:

https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
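
For example, the WordCount job from the examples jar bundled with Hadoop can be run against a file we already copied to HDFS (adjust the jar name to match your Hadoop version; the output directory is arbitrary and must not exist yet):

yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /user/shehan/ml-25m/movies.csv /user/shehan/wordcount-out
hadoop fs -cat /user/shehan/wordcount-out/part-r-00000 | head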

Well, I’m not interested in running MapReduce or YARN jobs on this cluster; running Spark on the YARN cluster is my final goal.

Next:

References

https://hadoop.apache.org/docs/r1.2.1/commands_manual.html
