Installation of Hadoop on Ubuntu

Raswanth CB
6 min read · Nov 3, 2023

Apache Hadoop is a Java-based, open-source, freely available software platform for storing and analyzing big datasets across clusters of machines.

What is Hadoop?

Apache Hadoop is an open source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data. Instead of using one large computer to store and process the data, Hadoop allows clustering multiple computers to analyze massive datasets in parallel more quickly.

What you need to do

  1. Install Java
  2. Download Hadoop
  3. Set environment
  4. Edit Hadoop XML
  5. start-dfs.sh
  6. start-yarn.sh

Hadoop: https://hadoop.apache.org/

Ubuntu: https://ubuntu.com/

If the installation succeeds, you will be able to open:

  • localhost:9870 → NameNode web UI (HDFS overview and cluster status)
  • localhost:8088 → YARN ResourceManager web UI (running applications)

Step 1: Install Java Development Kit

sudo apt update && sudo apt install openjdk-8-jdk

Step 2: Verify the Java version :

java -version
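
If OpenJDK 8 is installed correctly, the output should look roughly like this (the exact build number will differ on your machine):

openjdk version "1.8.0_292"
OpenJDK Runtime Environment (build 1.8.0_292-8u292-b10)
OpenJDK 64-Bit Server VM (build 25.292-b10, mixed mode)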

Step 3: Install SSH :

Before installing SSH or any new package, it is good practice to update the package lists so you get the latest versions. Then install the OpenSSH server and client:

sudo apt update
sudo apt install openssh-server openssh-client

Step 4: Create the Hadoop user :

Hadoop does not maintain its own user accounts the way Linux does; in production, user identities are usually managed by external systems such as LDAP or Kerberos. For this single-node setup, create a dedicated Linux user to run the Hadoop daemons:

sudo adduser hadoop

Step 5: Switch user :

su - hadoop

Step 6: Configure SSH :

ssh-keygen -t rsa

Step 7: Set permissions

Append the generated public key to the authorized_keys file and set the correct permissions:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 640 ~/.ssh/authorized_keys
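
If SSH still asks for a password later on, the permissions on the .ssh directory itself may be too open; tightening them is a common optional precaution (not part of the original steps):

chmod 700 ~/.ssh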

Step 8: SSH to localhost

This step verifies that the hadoop user can open an SSH session to the local machine without being prompted for a password, which the Hadoop start and stop scripts rely on.

ssh localhost
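
To confirm that the login really is passwordless, an optional non-interactive check is:

ssh -o BatchMode=yes localhost 'echo passwordless SSH is working'

If this prints the message without prompting, the Hadoop start scripts will be able to log in on their own.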

Step 9: Switch user

Again, switch to the hadoop user in the Ubuntu terminal:

su - hadoop

Step 10: Install Hadoop

Download Hadoop 3.3.6:

wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz

To extract the archive, we can use these commands:

tar -xvzf hadoop-3.3.6.tar.gz
mv hadoop-3.3.6 hadoop

The mv command renames the extracted directory from hadoop-3.3.6 to hadoop.

Next, we need to configure the Hadoop and Java environment variables. Open the .bashrc file:

nano ~/.bashrc

Add the following lines at the end of the file:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

To save the file, press CTRL+O; to exit the editor, press CTRL+X.

Load the above configuration in the current environment.

source ~/.bashrc
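
To confirm that the variables are loaded and the Hadoop binaries are on the PATH, a quick optional check is:

echo $HADOOP_HOME
hadoop version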

You also need to configure JAVA_HOME in the hadoop-env.sh file.

nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Search for the line that begins with “export JAVA_HOME”, uncomment it, and set it as follows:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
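
If you are not sure where the JDK lives on your machine, two optional helper commands (not part of the original guide) are:

ls /usr/lib/jvm/
dirname $(dirname $(readlink -f $(which javac)))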

Step 11: Configuring Hadoop :

First, you must create the NameNode and DataNode directories inside the hadoop user’s home directory. Run the following commands to create both directories and open core-site.xml:

cd hadoop/
mkdir -p ~/hadoopdata/hdfs/{namenode,datanode}
nano $HADOOP_HOME/etc/hadoop/core-site.xml

Once the file opens, add the following configuration:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Save and close the file as before (CTRL+O, then CTRL+X). Next, edit the hdfs-site.xml file:

nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>
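
Once these files are saved, you can optionally confirm that Hadoop picks up the new values with hdfs getconf:

hdfs getconf -confKey fs.defaultFS
hdfs getconf -confKey dfs.replication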

Then, edit the mapred-site.xml file so that MapReduce jobs can find the Hadoop installation:

nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

<configuration>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
</configuration>

Then, edit the yarn-site.xml file:

nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

Save the file and close it

Step 12: Start Hadoop cluster:

A Hadoop cluster is a collection of computers, known as nodes, that are networked together to perform parallel computations on big data sets (https://www.databricks.com/glossary/hadoop-cluster).

Format the NameNode (only needed on first setup), start all the daemons, and list the running Java processes with jps:

hdfs namenode -format
start-all.sh
jps
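
If everything started, jps should list all five Hadoop daemons, roughly like this (the process IDs will differ):

11201 NameNode
11345 DataNode
11530 SecondaryNameNode
11782 ResourceManager
11920 NodeManager
12254 Jps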

ERROR: If the above process fails with an error like the following:

hadoop@lab-H410M-H:~/hadoop$ start-all.sh
WARNING: Attempting to start all Apache Hadoop daemons as hadoop in 10 seconds.
WARNING: This is not a recommended production deployment configuration.
WARNING: Use CTRL-C to abort.
Starting namenodes on [localhost]
pdsh@lab-H410M-H: localhost: rcmd: socket: Permission denied
Starting datanodes
pdsh@lab-H410M-H: localhost: rcmd: socket: Permission denied
Starting secondary namenodes [lab-H410M-H]
pdsh@lab-H410M-H: lab-H410M-H: rcmd: socket: Permission denied
Starting resourcemanager
Starting nodemanagers
pdsh@lab-H410M-H: localhost: rcmd: socket: Permission denied

The “rcmd: socket: Permission denied” lines mean pdsh is trying to use rsh instead of ssh. Check the pdsh configuration and switch it to ssh:

pdsh -q -w localhost
export PDSH_RCMD_TYPE=ssh
start-dfs.sh

After exporting PDSH_RCMD_TYPE, run the start scripts again (start-dfs.sh and start-yarn.sh, or start-all.sh) to bring the daemons up.
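
To avoid having to export the variable in every new shell, it can also be appended to the hadoop user’s .bashrc (optional):

echo 'export PDSH_RCMD_TYPE=ssh' >> ~/.bashrc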

Step 13: Access Hadoop Namenode and Resource Manager :

NameNode:

The NameNode runs on the master system. Its primary purpose is to manage the metadata for the files stored in HDFS (the Hadoop Distributed File System). Data is stored as blocks across the cluster, and the metadata records which DataNode holds each block of each file.

Run the following commands in a new terminal (as a user with sudo rights):

sudo apt install net-tools
ifconfig

ifconfig reports the machine’s IP address. Use that address (or localhost when browsing from the same machine) to open the Hadoop web interfaces: the NameNode UI on port 9870 and the ResourceManager UI on port 8088.
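
If you would rather not install net-tools, the tools that ship with Ubuntu give the same information:

hostname -I
ip addr show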

Step 14: Verify the Hadoop Cluster :

At this point, the Hadoop cluster is installed and configured. Next, we will create some directories in the HDFS filesystem and copy a few files into it to test that Hadoop is working:

hdfs dfs -mkdir /test1
hdfs dfs -mkdir /logs
hdfs dfs -ls /
hdfs dfs -put /var/log/* /logs/

If these commands run without errors and the directories appear in the listing, HDFS is working and accessible.
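
To exercise YARN and MapReduce as well, you can run one of the example jobs bundled with Hadoop; the jar path below assumes the 3.3.6 tarball downloaded earlier:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar pi 2 10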

Step 15: To stop Hadoop services :

stop-all.sh

This command stops all running Hadoop daemons (HDFS and YARN).

Conclusion:

In this tutorial, you have installed Hadoop on a single node in pseudo-distributed mode and verified it by creating directories and copying files into HDFS. To learn how to write your own MapReduce programs, visit Apache Hadoop’s MapReduce tutorial, which walks through the code behind an example job. When you’re ready to set up a multi-node cluster, see the Apache Hadoop Cluster Setup guide.
