Play with HDFS & configure YARN
Check the previous article to see how the Hadoop cluster was configured.
A few Hadoop commands that help manage files on HDFS:
DFS admin report
hdfs dfsadmin -report

HDFS filesystem checking utility
hdfs fsck /

Copy files from the local file system (datasets from https://grouplens.org/datasets/movielens/)
hadoop fs -copyFromLocal /home/shehan/ml-100k/ /user/shehan/
hadoop fs -put /home/shehan/ml-25m/ /user/shehan/

Hadoop ls
hadoop fs -ls /user/shehan/ml-25m

Remove all files
hadoop fs -rm /user/shehan/ml-25m/*

cat
hadoop fs -cat /user/shehan/ml-25m/movies.csv

mkdir
hadoop fs -mkdir /user/shehan/ml-25m/tmp

Rename (move)
hadoop fs -mv /user/shehan/ml-25m/tmp /user/shehan/ml-25m/tmp2

Recursive delete
hadoop fs -rm -r /user/shehan/ml-25m/tmp2
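A couple more commands that come in handy with the same dataset, for example checking sizes and copying a file back out of HDFS:

Disk usage (human-readable)
hadoop fs -du -h /user/shehan/ml-25m

Copy a file from HDFS back to the local file system
hadoop fs -get /user/shehan/ml-25m/movies.csv /tmp/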
You can see the files via WebUI as well.
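Assuming the NameNode runs on master.hadoop.smf (as the jps output later shows) and Hadoop 3.x default ports, the web UIs should be reachable at something like:

http://master.hadoop.smf:9870   (NameNode web UI; 50070 on Hadoop 2.x)
http://master.hadoop.smf:8088   (YARN ResourceManager web UI, once YARN is started below)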
Configure YARN
The NameNode (the daemon for HDFS) and the ResourceManager (the daemon for YARN) are both Java processes. They can reside on the same machine or on different machines, depending on how the cluster is configured.
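A quick sanity check for where HDFS thinks its NameNode lives; these getconf calls only read the config files and don't need the daemons to be running:

hdfs getconf -namenodes              # host(s) configured as NameNode
hdfs getconf -confKey fs.defaultFS   # the HDFS URI clients will use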
Open mapred-site.xml & add properties
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
“mapreduce.framework.name” can be one of “local”, “classic” or “yarn”.
- “classic” stands for old MRv1.
- “yarn” stands for MRv2.
- With “local”, the mapper and reducer processes are executed in the same JVM.
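For completeness, all of these properties sit inside the single <configuration> root element of mapred-site.xml; a minimal sketch of the full file looks like this (the $HADOOP_HOME expansion assumes the variable is exported for the user that starts the daemons):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <!-- plus the three *.env properties listed above -->
</configuration>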
Open yarn-site.xml & add properties
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master.hadoop.smf</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
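Every node has to be able to resolve the hostname in yarn.resourcemanager.hostname. If DNS is not set up, a sketch of the /etc/hosts entries on each machine could look like this (the master IP matches the ResourceManager address shown later; the worker IPs are placeholders):

192.168.1.200  master.hadoop.smf
192.168.1.201  node01.hadoop.smf   # placeholder
192.168.1.202  node02.hadoop.smf   # placeholder
192.168.1.203  node03.hadoop.smf   # placeholder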
Copy these updated files across all the nodes.
Ex:-
scp hadoop/etc/hadoop/mapred-site.xml shehan@hadoop-node1:/home/shehan/hadoop/etc/hadoop/
scp hadoop/etc/hadoop/yarn-site.xml shehan@hadoop-node1:/home/shehan/hadoop/etc/hadoop/
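With several workers, a small shell loop saves repetition; only hadoop-node1 appears above, so the other hostnames here are assumptions to adjust for your cluster:

for node in hadoop-node1 hadoop-node2 hadoop-node3; do
  scp hadoop/etc/hadoop/mapred-site.xml hadoop/etc/hadoop/yarn-site.xml \
      shehan@"$node":/home/shehan/hadoop/etc/hadoop/
done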
Start All
Start the Hadoop DFS daemons
start-dfs.sh

Start the YARN daemons
start-yarn.sh

In the master node
[shehan@master ~]$ jps
3699 ResourceManager
3241 NameNode
3466 SecondaryNameNode
4010 Jps

In all the data nodes
[shehan@node1 ~]$ jps
2443 NodeManager
2334 DataNode
2558 Jps
Now we can see that Java processes such as “ResourceManager” on the name node and “NodeManager” on all the data nodes have been spawned.
Available nodes
yarn node -list
2020-06-07 15:05:27,661 INFO client.RMProxy: Connecting to ResourceManager at master.hadoop.smf/192.168.1.200:8032
Total Nodes:3
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
node02.hadoop.smf:42655 RUNNING node02.hadoop.smf:8042 0
node03.hadoop.smf:33534 RUNNING node03.hadoop.smf:8042 0
node01.hadoop.smf:44680 RUNNING node01.hadoop.smf:8042 0
To make sure everything is working correctly, try running the sample job from the Hadoop site.
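For example, the examples jar bundled with Hadoop makes a quick smoke test; the version in the jar name depends on your release (3.x.y below is a placeholder):

yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.x.y.jar pi 2 10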
Well, I’m not interested in running any MapReduce or YARN jobs on this cluster; running Spark on the YARN cluster is my final goal.
Next :-