Hadoop: Setting up a Single Node Cluster in Windows
Install and configure Hadoop 3.1 in pseudo-distributed mode on Windows 10 by setting up a single-node cluster.
Running Hadoop inside a virtual machine requires allocating a large amount of RAM for it to work smoothly; otherwise it hangs constantly.
This article explains how to install and configure a single-node, pseudo-distributed Hadoop 3.1 cluster on Windows 10 without a virtual machine.
Prerequisite:
Java should be installed on the system before installing Hadoop. Install Java version 1.8 in your system. If it is already installed, skip this part and move further.
If Java is not installed on your system, go to this link, accept the license, and download the file for your operating system.
Note: Instead of saving it as C:\Program Files\Java\jdk1.8.0_261, save the Java folder directly under the local disk as C:\Java\jdk1.8.0_261 to avoid further errors.
After installing Java, check your Java version with this command on the command prompt (cmd):
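java -version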
Download Hadoop
Download Hadoop version 3.1 from this link and extract it to a folder.
Note: Both the Java folder and the Hadoop folder should be placed on the same drive (here, the C:\ drive) to avoid further errors.
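On recent builds of Windows 10 the archive can also be extracted from cmd with the built-in tar tool. This is only an optional sketch; it assumes the downloaded archive is named hadoop-3.1.0.tar.gz and has been moved into C:\hadoop-3.1.0:
cd C:\hadoop-3.1.0
tar -xf hadoop-3.1.0.tar.gz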
Setup System Environment Variables
To edit the system environment variables, open the Control Panel, go to System Properties, and open Environment Variables.
We need to create two new user variables:
1. Variable name: HADOOP_HOME
Variable value: the path of the folder where you extracted Hadoop (the directory that contains the bin folder).
2. Variable name: JAVA_HOME
Variable value: the path of the JDK installation directory (the directory that contains the Java bin folder).
To add the Hadoop bin directory and the Java bin directory to the system Path variable, edit Path in the system variables.
Click on New and add the bin directory paths of Hadoop and Java to it.
Note: Both bin directories should be placed on the same drive (here, the C:\ drive) to avoid further errors.
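If you prefer the command line, the same user variables can be set from cmd with setx. This is only a sketch using the example paths from this article; adjust them to your own install locations, and note that the new values take effect only in a freshly opened cmd window:
setx HADOOP_HOME "C:\hadoop-3.1.0\hadoop-3.1.0"
setx JAVA_HOME "C:\Java\jdk1.8.0_261"
The Path entries for the two bin directories are easiest to add through the Environment Variables dialog as described above.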
Configurations
Now we need to edit some configuration files located in the etc\hadoop folder of the directory where we installed Hadoop (here, C:\hadoop-3.1.0\hadoop-3.1.0\etc\hadoop\).
1. Edit the core-site.xml file in that directory. Copy this property into the configuration block of the file and save it.
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
2. Edit mapred-site.xml, copy this property into its configuration block, and save it.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
3. Create a folder named data in the Hadoop directory.
HDFS has a master-slave architecture in which the master node is called the NameNode and the slave nodes are called DataNodes. The NameNode and its DataNodes form a cluster. The NameNode acts as an instructor to the DataNodes, while the DataNodes store the actual data.
This master-slave split helps keep the system stable: the NameNode keeps the file system metadata (which blocks make up which files and where they are stored), while the DataNodes hold the data blocks themselves. Replication is the process of keeping multiple synchronized copies of each block on different DataNodes.
Create two new empty folders named datanode and namenode inside this newly created data directory. (Here, C:\hadoop-3.1.0\hadoop-3.1.0\data\namenode and C:\hadoop-3.1.0\hadoop-3.1.0\data\datanode are the paths of the namenode and datanode folders respectively.)
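These folders can also be created from cmd (again using the example paths from this article):
mkdir C:\hadoop-3.1.0\hadoop-3.1.0\data\namenode
mkdir C:\hadoop-3.1.0\hadoop-3.1.0\data\datanode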
4. Edit the file hdfs-site.xml, add the property below inside its configuration block, and save it.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-3.1.0\hadoop-3.1.0\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-3.1.0\hadoop-3.1.0\data\datanode</value>
</property>
</configuration>
Note: The paths in the values of dfs.namenode.name.dir and dfs.datanode.data.dir should be the paths of the namenode and datanode folders you created in the step above. (Here, C:\hadoop-3.1.0\hadoop-3.1.0\data\namenode and C:\hadoop-3.1.0\hadoop-3.1.0\data\datanode respectively.)
5. Edit the file yarn-site.xml, add the property below inside its configuration block, and save it.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
6. Edit hadoop-env.cmd.
Replace %JAVA_HOME% with the path of the Java folder where JDK 1.8 is installed (here, C:\Java\jdk1.8.0_261), then save the file.
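For reference, after this change the relevant line in hadoop-env.cmd would look roughly like the following (using the example path from this article; substitute the path of your own JDK):
set JAVA_HOME=C:\Java\jdk1.8.0_261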
Hadoop needs some Windows-specific files which do not come with the default download of Hadoop.
To include those files, replace the bin folder in the Hadoop directory with the bin folder provided in this GitHub repository:
https://github.com/s911415/apache-hadoop-3.1.0-winutils
Download it as a zip file, extract it, and copy the bin folder from it. If you want to keep the old bin folder, rename it to something like bin_old. Then paste the copied bin folder into the Hadoop directory.
Note: The new bin folder has 15 files in it.
Check whether Hadoop is successfully installed by running this command on cmd:
hadoop version
If it does not throw an error and shows the Hadoop version, congratulations: Hadoop is successfully installed on the system and you are halfway there. If something looks different, you have probably missed a step; go back and recheck before moving on.
Format the NameNode
Once Hadoop is installed, format the NameNode. This is a one-time step done before using HDFS for the first time; formatting erases the NameNode's metadata, so do not run it later on a cluster that already holds data in HDFS. Run this command:
hdfs namenode -format
One last thing
Copy the jar hadoop-yarn-server-timelineservice-3.1.0 from the timelineservice folder into its parent yarn folder, both of which live under the share\hadoop directory of the folder where we installed Hadoop.
i.e., from \hadoop-3.1.0\share\hadoop\yarn\timelineservice to the \hadoop-3.1.0\share\hadoop\yarn folder.
(Here, from C:\hadoop-3.1.0\hadoop-3.1.0\share\hadoop\yarn\timelineservice to C:\hadoop-3.1.0\hadoop-3.1.0\share\hadoop\yarn.)
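From cmd, this copy can be done with a command along these lines (a sketch using the example paths from this article; check the exact jar file name in your timelineservice folder):
copy C:\hadoop-3.1.0\hadoop-3.1.0\share\hadoop\yarn\timelineservice\hadoop-yarn-server-timelineservice-3.1.0.jar C:\hadoop-3.1.0\hadoop-3.1.0\share\hadoop\yarn\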
Start the Apache Hadoop Distribution
Now change the directory in cmd to the sbin folder of the Hadoop directory with this command.
Note: Make sure you write the path as per your system (here, C:\hadoop-3.1.0\hadoop-3.1.0\sbin).
cd C:\hadoop-3.1.0\hadoop-3.1.0\sbin
Start the namenode and datanode with this command:
start-dfs.cmd
Two more cmd windows will open, one for the NameNode and one for the DataNode.
Now start YARN with this command:
start-yarn.cmd
Two more windows will open, one for the YARN resource manager and one for the YARN node manager.
Now everything is up and running. 😇
Note: Make sure all four Apache Hadoop daemon windows (hadoop namenode, hadoop datanode, yarn nodemanager, yarn resourcemanager) pop up and keep running. If any of them is not running, you will see an error or a shutdown message in its window; in that case, you need to debug the error.
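You can also confirm from cmd that the daemons are running with the JDK's jps tool, which lists the running Java processes (you should see entries such as NameNode, DataNode, ResourceManager, and NodeManager):
jps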
Verification
To access information about the resource manager's current, successful, and failed jobs, go to this link in a browser: http://localhost:8088/cluster
To check the details about HDFS (the namenode and datanode), go to this link in a browser: http://localhost:9870/
Note: If you are using a Hadoop version prior to 3.0.0-alpha1, the HDFS web UI is at http://localhost:50070/ instead.
Conclusion
The term Hadoop is often used to refer both to the base modules and sub-modules and to the ecosystem, that is, the collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm. You can install this software on your Windows system as well and perform data processing operations from cmd.
Hadoop MapReduce can be used to perform data processing, but its limitations are the reason frameworks like Spark and Pig emerged and gained popularity. Two hundred lines of MapReduce code can often be expressed in fewer than 10 lines of Pig code.