Hadoop: Setting up a Single Node Cluster in Windows

Install and configure the pseudo-distributed mode of Hadoop 3.1 in Windows 10 by setting up a single node cluster.

Akhitha Babu
Analytics Vidhya
Oct 11, 2020


Running Hadoop in a virtual machine requires allocating a large amount of RAM for the VM to function smoothly; otherwise it hangs constantly.

This article explains how to install and configure a single-node, pseudo-distributed Hadoop 3.1 cluster on Windows 10 without a virtual machine.

Prerequisite:

Java should be installed on the system before installing Hadoop.

Install Java version 1.8 on your system. If it is already installed, skip this part and move on.

If Java is not installed on your system, then go to this link.

Accept the license and download the file according to your operating system.

Note: Instead of saving it as C:\Program Files\Java\jdk1.8.0_261, save the Java folder directly under the local disk as C:\Java\jdk1.8.0_261; the space in "Program Files" can cause errors later.

After installing Java, check your Java version with the following command on the command prompt (cmd):
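
java -version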

Download Hadoop

Download Hadoop version 3.1 from this link.

Extract it to a folder.

Note: Both the Java folder and the Hadoop folder should be placed on the same drive (here, the C:\ drive) to avoid errors later.

Setup System Environment Variables

To edit the system environment variables, open the Control Panel and go to Environment Variables under System Properties.

We need to create two new user variables:

1. Variable name: HADOOP_HOME

Variable value: the path of the folder where you extracted Hadoop, i.e., the folder that contains bin (here, C:\hadoop-3.1.0\hadoop-3.1.0), not the bin folder itself.

2. Variable name: JAVA_HOME

Variable value: the path of the JDK folder, i.e., the folder that contains bin (here, C:\Java\jdk1.8.0_261), not the bin folder itself.

To add the Hadoop bin directory and the Java bin directory to the system path, edit Path under the system variables.

Click on New and add the bin directory paths of Hadoop and Java (here, C:\hadoop-3.1.0\hadoop-3.1.0\bin and C:\Java\jdk1.8.0_261\bin).

Note: Both bin directories should be on the same drive (here, the C:\ drive) to avoid errors later.
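
If you prefer the command line, the same two user variables can be set with the setx command. This is a sketch assuming the paths used in this article; adjust them to match your system, and open a new cmd window afterwards for the changes to take effect:

setx HADOOP_HOME "C:\hadoop-3.1.0\hadoop-3.1.0"
setx JAVA_HOME "C:\Java\jdk1.8.0_261"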

Configurations

Now we need to edit some configuration files located in the etc\hadoop folder of the directory where we installed Hadoop (here, C:\hadoop-3.1.0\hadoop-3.1.0\etc\hadoop\).

1. Edit the core-site.xml file in that folder. Copy this XML property into the configuration tags of the file and save it.

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

2. Edit mapred-site.xml, copy this property into the configuration, and save it.

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

3. Create a folder named data in the Hadoop directory.

HDFS has a master-slave architecture in which the master node is called the NameNode and the slave nodes are called DataNodes. The NameNode and its DataNodes form a cluster: the NameNode acts as an instructor to the DataNodes, while the DataNodes store the actual data.

A master-slave architecture helps stabilize a system. The master is the true data keeper, while a slave is a replica of the master; replication is the process of synchronizing data from the master to the slaves.

Create two new empty folders named datanode and namenode inside this newly created data directory. (Here, C:\hadoop-3.1.0\hadoop-3.1.0\data\namenode and C:\hadoop-3.1.0\hadoop-3.1.0\data\datanode are the paths of the namenode and datanode folders respectively.)
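
From cmd, the same two folders can be created directly (a sketch assuming the install path used in this article):

mkdir C:\hadoop-3.1.0\hadoop-3.1.0\data\namenode
mkdir C:\hadoop-3.1.0\hadoop-3.1.0\data\datanode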

4. Edit the file hdfs-site.xml, add the below property within the configuration, and save it.

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>C:\hadoop-3.1.0\hadoop-3.1.0\data\namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>C:\hadoop-3.1.0\hadoop-3.1.0\data\datanode</value>
  </property>
</configuration>

Note: The paths inside the value tags must be the paths of the namenode and datanode folders you created in the step above.

(Here, C:\hadoop-3.1.0\hadoop-3.1.0\data\namenode and C:\hadoop-3.1.0\hadoop-3.1.0\data\datanode respectively.)

5. Edit the file yarn-site.xml, add the below property within the configuration, and save it.

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

6. Edit hadoop-env.cmd.

Replace %JAVA_HOME% in it with the path of the Java folder where JDK 1.8 is installed (here, C:\Java\jdk1.8.0_261), then save the file.
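
After the edit, the JAVA_HOME line in hadoop-env.cmd should look something like this (assuming the JDK path used earlier):

set JAVA_HOME=C:\Java\jdk1.8.0_261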

Hadoop needs some Windows-specific files which do not come with the default download of Hadoop.

To include those files, replace the bin folder in the Hadoop directory with the bin folder provided at this GitHub link:

https://github.com/s911415/apache-hadoop-3.1.0-winutils

Download it as a zip file, extract it, and copy the bin folder inside it. If you want to keep the old bin folder, rename it to something like bin_old.

Then paste the copied bin folder into the Hadoop directory.

Note: The new bin folder has 15 files in it.
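
You can check the replacement from cmd; among the new files you should find winutils.exe (assuming the install path used in this article):

dir C:\hadoop-3.1.0\hadoop-3.1.0\bin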

Check whether Hadoop has been successfully installed by running this command on cmd:

hadoop version

If it doesn’t throw any error and shows the Hadoop version, congratulations: Hadoop is successfully installed on the system and you are halfway there. If your output is different, you have probably missed something; go back and recheck, because you can’t move forward until this works.

Format the NameNode

Once Hadoop is installed, format the NameNode. Formatting is done only once, when the cluster is first set up; formatting an existing NameNode again would delete all the data inside HDFS. Run this command:

hdfs namenode -format

One last thing

Copy the file hadoop-yarn-server-timelineservice-3.1.0 from the timelineservice folder under share\hadoop\yarn in the directory where we installed Hadoop, and paste it into the share\hadoop\yarn folder itself.

i.e., from \hadoop-3.1.0\share\hadoop\yarn\timelineservice to the \hadoop-3.1.0\share\hadoop\yarn folder.

(Here, from C:\hadoop-3.1.0\hadoop-3.1.0\share\hadoop\yarn\timelineservice to the C:\hadoop-3.1.0\hadoop-3.1.0\share\hadoop\yarn folder.)
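
From cmd, this is a single copy command (a sketch assuming the install path used in this article and that the file carries the usual .jar extension):

copy "C:\hadoop-3.1.0\hadoop-3.1.0\share\hadoop\yarn\timelineservice\hadoop-yarn-server-timelineservice-3.1.0.jar" "C:\hadoop-3.1.0\hadoop-3.1.0\share\hadoop\yarn\"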


Start the Apache Hadoop Distribution

Now change the directory in cmd to the sbin folder of the Hadoop directory with the following command.

Note: Make sure you write the path as per your system (here, C:\hadoop-3.1.0\hadoop-3.1.0\sbin).

cd C:\hadoop-3.1.0\hadoop-3.1.0\sbin

Start the NameNode and DataNode with this command:

start-dfs.cmd

Two more cmd windows will open, one for the NameNode and one for the DataNode.

Now start YARN with this command:

start-yarn.cmd

Two more windows will open, one for the YARN resource manager and one for the YARN node manager.

Now everything is working fine. 😇

Note: Make sure all four Apache Hadoop windows (hadoop namenode, hadoop datanode, yarn nodemanager, yarn resourcemanager) pop up and keep running. If one of them is not running, you will see an error or a shutdown message in it; in that case, you need to debug the error.
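
You can also confirm that all four daemons are up with the JDK's jps tool; run it in a fresh cmd window and the list should include NameNode, DataNode, ResourceManager, and NodeManager:

jps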

Verification

To access information about the resource manager's current, successful, and failed jobs, go to this link in the browser: http://localhost:8088/cluster

To check the details about HDFS (the NameNode and DataNode), go to this link in the browser: http://localhost:9870/

Note: If you are using a Hadoop version prior to 3.0.0-Alpha 1, use port 50070 instead: http://localhost:50070/
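
As a quick smoke test, you can also create a directory in HDFS and list it back from cmd (here, /demo is an arbitrary example name):

hdfs dfs -mkdir /demo
hdfs dfs -ls /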

Conclusion

The term Hadoop is often used to refer both to the base modules and sub-modules and to the ecosystem, i.e., the collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm. You can install these packages as well on your Windows system and perform data processing operations using cmd.

Hadoop MapReduce can be used to perform data processing activity. However, it has limitations, which is why frameworks like Spark and Pig emerged and gained popularity. Roughly 200 lines of MapReduce code can often be expressed in fewer than 10 lines of Pig code.
