Installing Hadoop 3.2.1 in Windows 10 + basic word count example

Hector Rodriguez Dominguez
MCD-UNISON
7 min read · Feb 11, 2022

--

  • This article is part of a series of entries exploring different data analysis tools and procedures, aiming to illustrate different technologies for data science.

In this Medium entry, we’ll talk about Hadoop: how to install it on Windows, some notes on the installation, and how to run a basic text analysis.

What is Hadoop?

According to AWS page:

Apache Hadoop is an open source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data. Instead of using one large computer to store and process the data, Hadoop allows clustering multiple computers to analyze massive datasets in parallel more quickly.

How it works

Hadoop consists of four main modules:

  • Hadoop Distributed File System (HDFS): Provides a distributed file system with better data throughput and support for larger datasets than traditional file systems.
  • Yet Another Resource Negotiator (YARN): Manages and monitors cluster nodes and resource usage.
  • MapReduce: A framework that helps programs perform parallel computation on data.
  • Hadoop Common: Java libraries that can be used by all modules.

In Hadoop, applications place data into the cluster by connecting to the NameNode. The NameNode tracks the file directory structure and the placement of “chunks” for each file, which are replicated across DataNodes.

To query the data, Hadoop creates a MapReduce job, made up of map and reduce tasks that run against the data spread across the DataNodes. Map tasks run on each node against the input files supplied, and reducers run to aggregate and organize the final output.
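
To make the map and reduce tasks concrete, below is a minimal Java sketch of a word count job, in the spirit of the WordCount example that ships with Hadoop (class and variable names here are illustrative, not Hadoop’s exact source). This is essentially what the example jar we run at the end of this article does under the hood:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: runs once per input split; emits (word, 1) for every word it sees.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce task: receives all the counts emitted for one word and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input_dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output_dir (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}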

Installing Hadoop

Now that we know the basics of Hadoop, we will proceed to install it on a Windows 10 environment.

Starting notes:

  • The language of the OS on which Hadoop will be installed should be English.

Prerequisites

First, we should have the Java 8 JDK installed, since Hadoop runs on the Java Virtual Machine and we will need to point the JAVA_HOME variable to a JDK folder later on.
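
To check whether Java is already installed, and which version, run the following on a command prompt:

java -version

If the command is not recognized, install the JDK before continuing.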

1. Downloading Hadoop binaries

We should download Hadoop binaries from the official site: Download here


Once the file is downloaded, notice it has a “.tar.gz” extension. We should extract it using WinRAR or 7-Zip; with 7-Zip you extract twice, first the “.gz” and then the inner “hadoop-3.2.1.tar”.

After extracting, the folder “hadoop-3.2.1” will appear; copy it and paste it on your “C:” drive, so the route to the folder will be → “C:\hadoop-3.2.1”.


Since we already extracted the Hadoop folder on C:, we should now download the Windows libraries for Hadoop from this GitHub repository: https://github.com/cdarlint/winutils/tree/master/hadoop-3.2.1/bin and copy them into the “C:\hadoop-3.2.1\bin” directory.

2. Configuring environment variables

After installing Hadoop, we should configure the environment variables to define Hadoop and Java paths.

To edit environment variables, go to Control Panel > System and Security > System (or right-click > properties on My Computer icon) and click on the “Advanced system settings” link.

When the “Advanced system settings” dialog appears, go to the “Advanced” tab and click on the “Environment variables” button located on the bottom of the dialog.

In the “Environment Variables” dialog, press the “New” button to add a new variable.

Now we need to configure two variables:

  • JAVA_HOME: This variable should point to the JDK installation folder

Note: The default installation folder for the Java JDK is “C:\Program Files\Java\jre1.8.0_321”. I suggest using the 8.3 short name “Progra~1” instead of “Program Files” to avoid problems with the blank space in the path.

  • HADOOP_HOME: This should point to the Hadoop folder we extracted to C: (“C:\hadoop-3.2.1”)
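
For example, with the default JDK folder from the note above and the Hadoop folder we extracted, the two variables would be (adjust the Java folder to your installed version):

JAVA_HOME = C:\Progra~1\Java\jre1.8.0_321
HADOOP_HOME = C:\hadoop-3.2.1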

Now, in the “Environment Variables” window, we need to select the “Path” variable and click the “Edit” button.


Once inside the “Path” variable configuration window, click the “New” button and add the following entries.
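
In a typical setup, the entries to add are the executable folders under both variables (this list assumes the paths configured above):

%JAVA_HOME%\bin
%HADOOP_HOME%\bin
%HADOOP_HOME%\sbin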

3. Configuring Hadoop cluster

Once we have configured the environment variables, we should create two folders: one for the Name Node (where the master node’s metadata will be stored) and one for the Data Node (where the actual data blocks will be stored). So create the following folders on the specified routes:

  • C:\hadoop-3.2.1\data\dfs\datanode
  • C:\hadoop-3.2.1\data\dfs\namenode

Next, we must configure 4 internal files:

  • 1. C:\hadoop-3.2.1\etc\hadoop\hdfs-site.xml
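
A typical single-node hdfs-site.xml configuration looks like the following, pointing at the two folders we just created:

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:///C:/hadoop-3.2.1/data/dfs/namenode</value>
   </property>
   <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:///C:/hadoop-3.2.1/data/dfs/datanode</value>
   </property>
</configuration>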

Note that the replication factor is set to 1 since we are creating a single node cluster.

  • 2. C:\hadoop-3.2.1\etc\hadoop\core-site.xml
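
A typical core-site.xml for a local single-node cluster sets the default file system address:

<configuration>
   <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>
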
  • 3. C:\hadoop-3.2.1\etc\hadoop\mapred-site.xml
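
A minimal mapred-site.xml tells MapReduce to run on top of YARN:

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>
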
  • 4. C:\hadoop-3.2.1\etc\hadoop\yarn-site.xml
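
A typical yarn-site.xml enables the shuffle service that MapReduce needs:

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
   <property>
      <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
   </property>
</configuration>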

4. Formatting the Name node

After we finish configuring the above files, we execute the following command in order to format the name node:

Note: We must open a command prompt as an administrator

hdfs namenode -format

Due to a bug in the Hadoop 3.2.1 release, you will receive an error:

ERROR namenode.NameNode: Failed to start namenode.

To fix this error, we’ll have to download the patched hadoop-hdfs-3.2.1.jar from this link and copy it to the folder “C:\hadoop-3.2.1\share\hadoop\hdfs”, replacing the existing jar of the same name (renaming the original first is a good backup).

Once we have replaced the file mentioned above, we run the command again:

Note: Remember to run the command prompt as administrator

hdfs namenode -format

If the command prompt asks for confirmation, just accept by typing “Y”

5. Starting Hadoop

Finally, we have configured Hadoop. The only thing missing is to start the services; for that, we’ll need to navigate to “C:\hadoop-3.2.1\sbin” and run the following command:

.\start-dfs.cmd

After running this command, two command prompts will pop up: one for the name node and the other for the data node. Next, we will start the YARN service with the following command:

.\start-yarn.cmd

After running this command, two command prompts will pop up, one for the resource manager and the other for the node manager.

To verify everything is ok, we’ll run the following command:

jps

If everything is ok, we should get a similar message to the following:

19840 DataNode
14856 NodeManager
17288 Jps
15644 ResourceManager
5308 NameNode

Now we can access three web pages:

  • Name Node overview: http://localhost:9870
  • Data Node overview: http://localhost:9864
  • YARN Resource Manager: http://localhost:8088

NOTE: If, after you end the services and start them again, the command prompt shows a problem with the \temp folder, open a command prompt as administrator, navigate to “C:\”, and run the following command (the problem is due to permissions on the Hadoop folders):

cacls hadoop-3.2.1 /t /p everyone:f

Basic word count example using Hadoop’s MapReduce

In this section, we’ll analyze the most common words in the novel “Don Quijote de la Mancha” by Miguel de Cervantes Saavedra.

For this purpose, we’ll need to download the novel as a plain text file, which you can find on this link: https://gist.githubusercontent.com/jsdario/6d6c69398cb0c73111e49f1218960f79/raw/8d4fc4548d437e2a7203a5aeeace5477f598827d/el_quijote.txt

Once downloaded, move the file to “C:”.

Open a command prompt as administrator and run the following command to create an input folder on the Hadoop file system, to which we will be moving the novel for our analysis. (We don’t create the output folder by hand: the MapReduce job creates it itself and will fail if it already exists.)

hadoop fs -mkdir /input_dir

We can verify the folder we just created on the following link: http://localhost:9870/explorer.html#/

Now let’s move the file “el_quijote.txt” from C: to /input_dir:

hadoop fs -put C:/el_quijote.txt /input_dir

We can verify the file has moved successfully on the following link:

http://localhost:9870/explorer.html#/input_dir

or with the following command line:

hadoop fs -ls /input_dir

For the next step, we’ll need the “hadoop-mapreduce-examples-3.2.1.jar”, found in “C:\hadoop-3.2.1\share\hadoop\mapreduce”.

We’ll execute the following command in order to run the wordcount function included in the MapReduce examples jar:

hadoop jar C:\hadoop-3.2.1\share\hadoop\mapreduce\hadoop-mapreduce-examples-3.2.1.jar wordcount /input_dir/el_quijote.txt /output_dir

Once the job is executed, a summary of the job counters will appear on the command prompt, indicating the job completed successfully.

Now you can find the results of the wordcount in the /output_dir the job created.

Open up the output_dir folder inside the Hadoop interface on the following link: http://localhost:9870/explorer.html#/output_dir and click the file named “part-r-00000”. A small window will show up; click the “Download” option, and there you will be able to see the number of appearances of each word in the novel.
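
Alternatively, you can print the results directly on the command prompt:

hadoop fs -cat /output_dir/part-r-00000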

Now we are done!
