Play with HDFS & configure YARN

Shehan Fernando
3 min read · Jun 8, 2020


Photo by Les Triconautes on Unsplash

See the previous article on configuring the Hadoop cluster.

Here are a few Hadoop commands that help manage files on HDFS.

DFS admin report
hdfs dfsadmin -report
HDFS filesystem checking utility
hdfs fsck /
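fsck can also report individual files, blocks, and block locations with its standard flags:
hdfs fsck / -files -blocks -locations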
Copy files from the local file system (datasets from https://grouplens.org/datasets/movielens/)
hadoop fs -copyFromLocal /home/shehan/ml-100k/ /user/shehan/
hadoop fs -put /home/shehan/ml-25m/ /user/shehan/
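The inverse direction works with -get (or -copyToLocal) to pull files from HDFS back to the local file system:
hadoop fs -get /user/shehan/ml-25m/movies.csv /home/shehan/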
Hadoop ls
hadoop fs -ls /user/shehan/ml-25m
Remove all files
hadoop fs -rm /user/shehan/ml-25m/*
cat
hadoop fs -cat /user/shehan/ml-25m/movies.csv
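Since dumping a whole dataset to the terminal is rarely useful, it helps to pipe cat through head to peek at the first few lines:
hadoop fs -cat /user/shehan/ml-25m/movies.csv | head -5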
mkdir
hadoop fs -mkdir /user/shehan/ml-25m/tmp
Rename (move)
hadoop fs -mv /user/shehan/ml-25m/tmp /user/shehan/ml-25m/tmp2
Recursive delete
hadoop fs -rm -r /user/shehan/ml-25m/tmp2
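If the HDFS trash is enabled (fs.trash.interval > 0), removed files are first moved to the trash; add -skipTrash to delete them immediately:
hadoop fs -rm -r -skipTrash /user/shehan/ml-25m/tmp2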

You can see the files via WebUI as well.
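
On Hadoop 3.x the NameNode web UI listens on port 9870 by default (50070 on Hadoop 2.x); the file browser is under Utilities → Browse the file system, e.g.:
http://master.hadoop.smf:9870/explorer.html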

Configure YARN

The NameNode (the HDFS daemon) and the ResourceManager (the YARN daemon) are both Java processes. They can reside on the same machine or on different machines, depending on how the cluster is configured.

Open mapred-site.xml and add the following properties inside the <configuration> element:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>

<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>

<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>

“mapreduce.framework.name” can be one of “local”, “classic”, or “yarn”.

  • “classic” stands for the old MRv1 runtime.
  • “yarn” stands for MRv2.
  • In “local” mode, the mapper and reducer processes are executed in the same JVM.
Open yarn-site.xml and add the following properties, again inside <configuration>:
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>master.hadoop.smf</value>
</property>

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>

<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
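
Optionally, you can also cap how much memory each NodeManager offers to YARN containers; a minimal sketch, assuming workers with about 4 GB of RAM (the value below is an assumption, not part of the original setup):

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>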

Copy these updated files to all the nodes.

E.g.:
scp hadoop/etc/hadoop/mapred-site.xml shehan@hadoop-node1:/home/shehan/hadoop/etc/hadoop/
scp hadoop/etc/hadoop/yarn-site.xml shehan@hadoop-node1:/home/shehan/hadoop/etc/hadoop/
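
If you have several workers, a small shell loop avoids repeating the scp commands (the node hostnames below are assumptions; substitute your own):

for node in hadoop-node1 hadoop-node2 hadoop-node3; do
  scp hadoop/etc/hadoop/mapred-site.xml hadoop/etc/hadoop/yarn-site.xml shehan@$node:/home/shehan/hadoop/etc/hadoop/
done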

Start All

Start the Hadoop DFS daemons
start-dfs.sh
Start the YARN daemons
start-yarn.sh
On the master node:
[shehan@master ~]$ jps
3699 ResourceManager
3241 NameNode
3466 SecondaryNameNode
4010 Jps
On each data node:
[shehan@node1 ~]$ jps
2443 NodeManager
2334 DataNode
2558 Jps

Now we can see that the Java processes have been spawned: ResourceManager on the master node and NodeManager on each data node.
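
When you need to shut the cluster down later, the matching stop scripts work the same way, in reverse order:
stop-yarn.sh
stop-dfs.sh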

Available nodes

yarn node -list
2020-06-07 15:05:27,661 INFO client.RMProxy: Connecting to ResourceManager at master.hadoop.smf/192.168.1.200:8032
Total Nodes:3
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
node02.hadoop.smf:42655 RUNNING node02.hadoop.smf:8042 0
node03.hadoop.smf:33534 RUNNING node03.hadoop.smf:8042 0
node01.hadoop.smf:44680 RUNNING node01.hadoop.smf:8042 0
The same information is available in the ResourceManager web UI at http://master.hadoop.smf:8088/.

To make sure everything is working correctly, try running the sample from the Hadoop MapReduce tutorial:

https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
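
For example, the WordCount job from the examples jar bundled with Hadoop can be run against a file we already copied to HDFS (adjust the jar name to match your Hadoop version; the output directory is arbitrary and must not exist yet):

yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /user/shehan/ml-25m/movies.csv /user/shehan/wordcount-out
hadoop fs -cat /user/shehan/wordcount-out/part-r-00000 | head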

Well, I’m not interested in running MapReduce or YARN jobs on this cluster; running Spark on the YARN cluster is my final goal.

Next:

References

https://hadoop.apache.org/docs/r1.2.1/commands_manual.html
