Apache Hadoop 3.3.6 Installation on Ubuntu 22.04

Abhik Dey
Nov 7, 2023 · 7 min read


In the ever-expanding world of big data, managing and processing vast amounts of information efficiently has become paramount for businesses and organizations. At the forefront of this data revolution stands Hadoop, a powerful open-source framework designed to tackle the challenges of distributed data storage and processing. If you’re eager to harness the potential of Hadoop for your data projects but find the installation process daunting, fear not! This comprehensive guide will walk you through the essential steps to install Hadoop on your Ubuntu system, demystifying the process and setting you on the path to unlocking the limitless possibilities of big data analytics.

Whether you are a seasoned data professional looking to bolster your skills or someone just beginning your data journey, this guide aims to make Hadoop more accessible than ever before. By the end of this journey, you’ll be well-equipped to explore the world of distributed data processing, leverage the capabilities of Hadoop, and harness its full potential for your data-driven endeavors. So, let’s dive in and embark on an exciting journey into the realm of big data analytics.

Step 1 : Install Java Development Kit

The default Ubuntu repositories contain both Java 8 and Java 11. I am using Java 8 because Hive only works on this version. Use the following command to install it:

sudo apt update && sudo apt install openjdk-8-jdk

Step 2 : Verify the Java version :

Once you have successfully installed it, check the current Java version:

java -version
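
You can also confirm the JDK installation path, which you will need later as JAVA_HOME. On Ubuntu the alternatives system lists the installed JVMs; for this package it typically reports a path under /usr/lib/jvm/java-8-openjdk-amd64:

update-alternatives --list java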

Step 3 : Install SSH :

SSH (Secure Shell) installation is vital for Hadoop as it enables secure communication between nodes in the Hadoop cluster. This ensures data integrity, confidentiality, and allows for efficient distributed processing of data across the cluster.

sudo apt install ssh
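
You can optionally confirm that the SSH service is installed and running before continuing:

sudo systemctl status ssh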

Step 4 : Create the hadoop user :

All the Hadoop components will run as the user that you create for Apache Hadoop, and the user will also be used for logging in to Hadoop’s web interface.

Run the following command to create the user and set a password:

sudo adduser hadoop
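
Optionally, if this user should also be able to run administrative commands later (for example, installing packages), add it to the sudo group:

sudo usermod -aG sudo hadoop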

Step 5 : Switch user :

Switch to the newly created hadoop user:

su - hadoop

Step 6 : Configure SSH :

Now configure password-less SSH access for the newly created hadoop user. When prompted for the key file location and passphrase, just press Enter to accept the defaults (the empty passphrase is what makes the login password-less). Generate an SSH keypair first:

ssh-keygen -t rsa
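
If you prefer to skip the prompts entirely, the same keypair can be generated non-interactively with an empty passphrase and the default key location:

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa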

Step 7 : Set permissions :

Copy the generated public key to the authorized key file and set the proper permissions:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys   
chmod 640 ~/.ssh/authorized_keys

Step 8 : SSH to the localhost

ssh localhost

You will be asked to confirm the host key and add it to the known hosts file. Type yes and hit Enter to authenticate the localhost.

Step 9 : Switch user

If your session is no longer running as hadoop, switch back to it:

su - hadoop

Step 10 : Install hadoop

  • Download hadoop 3.3.6
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
  • Once the download finishes, extract the archive to a folder.
tar -xvzf hadoop-3.3.6.tar.gz
  • Rename the extracted folder to remove version information. This step is optional, but if you skip it, adjust the paths in the remaining configuration accordingly.
mv hadoop-3.3.6 hadoop
  • Next, you will need to configure Hadoop and Java environment variables on your system. Open the ~/.bashrc file in your favorite text editor. I am using nano here: paste with Ctrl+Shift+V, then press Ctrl+X, Y, and Enter to save the file and exit:
nano ~/.bashrc
  • Append the below lines to the file.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
  • Load the above configuration in the current environment.
source ~/.bashrc
  • You also need to configure JAVA_HOME in hadoop-env.sh file. Edit the Hadoop environment variable file in the text editor:
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Search for the commented “export JAVA_HOME” line, uncomment it, and point it at your JDK:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
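
Before moving on, it is worth checking that the environment is wired up correctly. Assuming the paths above, the following should print the Hadoop home directory and the Hadoop 3.3.6 release banner:

echo $HADOOP_HOME
hadoop version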

Step 11 : Configuring Hadoop :

  • First, you will need to create the namenode and datanode directories inside the Hadoop user home directory. Run the following command to create both directories:
cd hadoop/
mkdir -p ~/hadoopdata/hdfs/{namenode,datanode}
  • Next, edit the core-site.xml file:
nano $HADOOP_HOME/etc/hadoop/core-site.xml

Set the default filesystem URI. For a single-node setup, the localhost value below works as-is; if your cluster uses a hostname, substitute it here:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Save and close the file.

  • Then, edit the hdfs-site.xml file:
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
  • Change the NameNode and DataNode directory paths as shown below:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>
  • Then, edit the mapred-site.xml file:
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
  • Make the following changes. Note that HADOOP_MAPRED_HOME must point at the Hadoop installation directory itself (here /home/hadoop/hadoop), not at a binary inside it, and mapreduce.framework.name tells MapReduce jobs to run on YARN:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop</value>
  </property>
</configuration>
  • Then, edit the yarn-site.xml file:
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
  • Make the following changes:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

Save and close the file.

Step 12 : Start Hadoop cluster:

  • Before starting the Hadoop cluster, you need to format the NameNode as the hadoop user.
  • Run the following command to format the Hadoop Namenode:
hdfs namenode -format
  • Once the NameNode directory has been formatted with the HDFS filesystem, you will see the message “Storage directory /home/hadoop/hadoopdata/hdfs/namenode has been successfully formatted”.
  • Then start the Hadoop cluster with the following command. (On Hadoop 3.x this script prints a deprecation notice recommending start-dfs.sh and start-yarn.sh; either approach works.)
start-all.sh
  • You can now check the status of all Hadoop services using the jps command:
jps
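
The exact process IDs will differ, but on a healthy single-node cluster the listing should include the HDFS and YARN daemons, along these lines (IDs here are illustrative):

12345 NameNode
12456 DataNode
12567 SecondaryNameNode
12678 ResourceManager
12789 NodeManager
12890 Jps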

Step 13 : Access Hadoop NameNode and ResourceManager :

  • First, we need to find the machine’s IP address. On Ubuntu, the ifconfig command comes from the net-tools package; if you are installing net-tools for the first time, switch back to your default (sudo-capable) user:
sudo apt install net-tools
  • Then run the ifconfig command to find the IP address:
ifconfig

Here, my IP address is 192.168.1.6.
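
If you would rather not install net-tools, modern Ubuntu can report the same information with built-in commands:

hostname -I
ip -4 addr show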

  • To access the NameNode web UI, open your web browser and visit http://your-server-ip:9870. In this example:

http://192.168.1.6:9870

  • To access the ResourceManager web UI, open your web browser and visit http://your-server-ip:8088. In this example:

http://192.168.1.6:8088

Step 14 : Verify the Hadoop Cluster :

At this point, the Hadoop cluster is installed and configured. Next, we will create some directories in the HDFS filesystem to test Hadoop.

  • Let’s create some directories in the HDFS filesystem using the following commands:
hdfs dfs -mkdir /test1
hdfs dfs -mkdir /logs
  • Next, run the following command to list the above directories:
hdfs dfs -ls /

You should see both directories listed; the output will look similar to the following (owner, timestamps, and spacing will differ on your system):
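
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2023-11-07 12:00 /logs
drwxr-xr-x   - hadoop supergroup          0 2023-11-07 12:00 /test1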

  • Also, put some files into the Hadoop filesystem. As an example, copy log files from the host machine into HDFS:
hdfs dfs -put /var/log/* /logs/
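
To double-check the upload from the command line as well, list a few of the copied entries:

hdfs dfs -ls /logs | head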

You can also verify the above files and directory in the Hadoop web interface.

Go to the web interface and click Utilities => Browse the file system. You should see the directories you created earlier.

Step 15 : Stop Hadoop services :

To stop all Hadoop services, run the following command as the hadoop user:

stop-all.sh
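
Running jps again afterwards should list only the Jps process itself, confirming that all daemons have stopped:

jps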

This tutorial walked you step by step through installing and configuring Hadoop on an Ubuntu 22.04 Linux system.

Conclusion:

In summary, you've now equipped yourself with the knowledge and skills to install Hadoop on your Ubuntu system, marking the first step towards unlocking the immense potential of big data analytics. By conquering the installation process, you've paved the way for exploring the vast world of distributed data processing and analysis.

With Hadoop at your disposal, you have a formidable tool to tackle the challenges of managing and processing massive datasets. Whether you're a seasoned data professional seeking to expand your expertise or a newcomer to the world of data, this guide has made Hadoop more accessible than ever before.

So, as you embark on your journey into the realm of big data analytics, remember that this is just the beginning. There's a wealth of possibilities waiting for you to discover and harness. Happy exploring, and may your data-driven endeavors lead to valuable insights and groundbreaking discoveries!

