Installing Hadoop on Ubuntu
Apache Hadoop is a Java-based, open-source, freely available software platform for storing and analyzing big datasets on clusters of commodity machines.
What is Hadoop?
Apache Hadoop is an open source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data. Instead of using one large computer to store and process the data, Hadoop allows clustering multiple computers to analyze massive datasets in parallel more quickly.
What you need to do
- Install Java
- Download Hadoop
- Set environment variables
- Edit the Hadoop XML configuration files
- start-dfs.sh
- start-yarn.sh
Hadoop: https://hadoop.apache.org/
Ubuntu: https://ubuntu.com/
If everything succeeds, you will be able to open:
- localhost:9870 → the HDFS NameNode web UI
- localhost:8088 → the YARN ResourceManager web UI
Step 1: Install Java Development Kit
sudo apt update && sudo apt install openjdk-8-jdk
Step 2: Verify the Java version :
java -version
Step 3: Install SSH :
Before installing SSH or any new application in the system, it’s a good practice to update the package lists to ensure you have the latest versions of packages. Then install the OpenSSH packages (pdsh is optional, but the Hadoop start scripts will use it if present). Run the following commands in the terminal:
sudo apt update
sudo apt install ssh pdsh
Step 4: Create the Hadoop user :
It is good practice to run Hadoop under a dedicated Linux user, so that the Hadoop daemons and files are isolated from the rest of the system:
sudo adduser hadoop
Step 5: Switch user :
su - hadoop
Step 6: Configure SSH :
ssh-keygen -t rsa
Step 7: Set permissions
Place the generated public key into the authorized key file and ensure the correct permissions are set.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 640 ~/.ssh/authorized_keys
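As a sanity check, the modes set above can be inspected with stat. A minimal sketch, run in a throwaway directory so it works anywhere (assumes GNU coreutils):

```shell
# Demonstrate the expected SSH file permissions in a temp directory.
demo=$(mktemp -d)
install -m 600 /dev/null "$demo/id_rsa"            # private key should be 600
install -m 640 /dev/null "$demo/authorized_keys"   # as set by chmod 640 above
key_mode=$(stat -c '%a' "$demo/id_rsa")
auth_mode=$(stat -c '%a' "$demo/authorized_keys")
echo "id_rsa=$key_mode authorized_keys=$auth_mode"
rm -rf "$demo"
```

On the real node, run `stat -c '%a' ~/.ssh/authorized_keys` to check the actual file.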
Step 8: SSH to the local-host
Verify that passwordless SSH to localhost works, since Hadoop’s start scripts use SSH to launch the daemons:
ssh localhost
Step 9: Switch user
Again, switch to the hadoop user in the Ubuntu terminal:
su - hadoop
Step 10: Install Hadoop :
Download Hadoop 3.3.6:
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
To extract the downloaded archive, run:
tar -xvzf hadoop-3.3.6.tar.gz
mv hadoop-3.3.6 hadoop
The mv command above renames the extracted directory from hadoop-3.3.6 to hadoop.
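The whole extract-and-rename flow can be rehearsed on a tiny stand-in tarball to confirm the tar and mv commands behave as described (the archive built here is a fabricated stand-in, not the real Hadoop release):

```shell
# Rehearse the extract/rename steps on a small fake archive.
work=$(mktemp -d)
cd "$work"
mkdir -p hadoop-3.3.6/bin
echo demo > hadoop-3.3.6/bin/hadoop
tar -czf hadoop-3.3.6.tar.gz hadoop-3.3.6   # stand-in for the real download
rm -r hadoop-3.3.6
tar -xvzf hadoop-3.3.6.tar.gz > /dev/null   # same flags as in the guide
mv hadoop-3.3.6 hadoop                      # rename the extracted directory
contents=$(ls hadoop/bin)
echo "$contents"
cd / && rm -rf "$work"
```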
Next, in the process, we need to configure Hadoop and Java Environment Variables
nano ~/.bashrc
Insert the following lines at the end of the file:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
To save the file, press CTRL+O; to exit the editor, press CTRL+X.
Load the above configuration in the current environment.
source ~/.bashrc
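On the node itself, running `hadoop version` is the quickest confirmation that everything is wired up. As a portable sketch, the following checks that the exports above compose as intended (the path is the one used in this guide):

```shell
# Verify that HADOOP_HOME is set and its bin directory landed on PATH.
export HADOOP_HOME=/home/hadoop/hadoop
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
case ":$PATH:" in
  *":$HADOOP_HOME/bin:"*) path_ok=yes ;;
  *)                      path_ok=no  ;;
esac
echo "HADOOP_HOME=$HADOOP_HOME path_ok=$path_ok"
```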
You also need to set JAVA_HOME in the hadoop-env.sh file.
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Search for the line beginning with “export JAVA_HOME” and configure it:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Step 11: Configuring Hadoop :
First, you must create the name and data node directories inside the Hadoop user home directory. Run the following command to create both directories:
cd hadoop/
mkdir -p ~/hadoopdata/hdfs/{namenode,datanode}
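The `{namenode,datanode}` part is Bash brace expansion, so the single mkdir call creates both directories at once. A small demonstration under a temporary directory (assumes Bash):

```shell
# One mkdir -p call creates both HDFS directories via brace expansion.
base=$(mktemp -d)
mkdir -p "$base"/hdfs/{namenode,datanode}
created=$(echo $(ls "$base"/hdfs))   # word-split into one line
echo "$created"
rm -rf "$base"
```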
nano $HADOOP_HOME/etc/hadoop/core-site.xml
After the file opens, add the following configuration to it:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Save and close the file in the same way (CTRL+O, then CTRL+X).
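Once the daemons are running, you could confirm this value with `hdfs getconf -confKey fs.defaultFS`. The sketch below extracts it straight from the XML without starting Hadoop; it parses a sample copy written to a temp file, so on the node you would point it at $HADOOP_HOME/etc/hadoop/core-site.xml instead:

```shell
# Extract a property value from a Hadoop site file with grep/sed.
conf=$(mktemp)
cat > "$conf" <<'EOF'
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
EOF
# crude extraction; fine for these flat Hadoop site files
fsdefault=$(grep -A1 '<name>fs.defaultFS</name>' "$conf" \
  | sed -n 's:.*<value>\(.*\)</value>.*:\1:p')
echo "$fsdefault"
rm -f "$conf"
```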
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>
Then, edit the mapred-site.xml file:
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
</configuration>
Then, edit the yarn-site.xml file:
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Save the file and close it
Step 12: Start Hadoop cluster:
A Hadoop cluster is a collection of computers, known as nodes, that are networked to perform these kinds of parallel computations on big data sets.
Format the HDFS NameNode (this is needed only before the very first start), start all the daemons, and then list the running Java processes:
hdfs namenode -format
start-all.sh
jps
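On a healthy single-node setup, jps should list NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager (plus Jps itself). A small check script, shown here against a sample listing with made-up PIDs; on the node, set listing="$(jps)" instead:

```shell
# Check that all five Hadoop daemons appear in a jps listing.
listing='12001 NameNode
12102 DataNode
12203 SecondaryNameNode
12304 ResourceManager
12405 NodeManager
12506 Jps'
missing=0
for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
  printf '%s\n' "$listing" | grep -qw "$d" || { echo "$d missing"; missing=1; }
done
[ "$missing" -eq 0 ] && echo "all daemons present"
```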
Troubleshooting: if the above process fails with errors like
hadoop@lab-H410M-H:~/hadoop$ start-all.sh
WARNING: Attempting to start all Apache Hadoop daemons as hadoop in 10 seconds.
WARNING: This is not a recommended production deployment configuration.
WARNING: Use CTRL-C to abort.
Starting namenodes on [localhost]
pdsh@lab-H410M-H: localhost: rcmd: socket: Permission denied
Starting datanodes
pdsh@lab-H410M-H: localhost: rcmd: socket: Permission denied
Starting secondary namenodes [lab-H410M-H]
pdsh@lab-H410M-H: lab-H410M-H: rcmd: socket: Permission denied
Starting resourcemanager
Starting nodemanagers
pdsh@lab-H410M-H: localhost: rcmd: socket: Permission denied
The solution: inspect pdsh’s settings, tell pdsh to use SSH instead of rsh, and then run the start scripts again:
pdsh -q -w localhost
export PDSH_RCMD_TYPE=ssh
start-dfs.sh
After exporting PDSH_RCMD_TYPE, the NameNode and the other daemons should start without the permission errors.
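The export only lasts for the current shell session. A sketch of making it persistent by appending the line to a shell rc file idempotently; it is demonstrated on a temp file here, and on the node the target would be ~/.bashrc:

```shell
# Append the pdsh fix to an rc file only if it is not already there.
rc=$(mktemp)
line='export PDSH_RCMD_TYPE=ssh'
grep -qxF "$line" "$rc" || echo "$line" >> "$rc"
grep -qxF "$line" "$rc" || echo "$line" >> "$rc"   # second pass is a no-op
count=$(grep -cxF "$line" "$rc")
echo "line present $count time(s)"
rm -f "$rc"
```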
Step 13: Access Hadoop Namenode and Resource Manager :
NameNode:
The NameNode runs on the master system. Its primary purpose is to manage the metadata for the files stored in HDFS (Hadoop Distributed File System). Data is stored as blocks across the Hadoop cluster, and the metadata records which DataNode holds each block of each file.
Run the following commands in a new terminal:
sudo apt install net-tools
ifconfig
This gives you the machine’s IP address, which can be used in place of localhost to reach the Hadoop web UIs (ports 9870 and 8088) from other machines on the network.
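For scripts, the address can be extracted from the ifconfig output with awk. A sketch against a sample interface block (the address shown is made up):

```shell
# Pull the first IPv4 address out of ifconfig-style output.
sample='eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.42  netmask 255.255.255.0  broadcast 192.168.1.255'
ip=$(printf '%s\n' "$sample" | awk '/inet / {print $2; exit}')
echo "$ip"
```

On the node, replace the sample with the real output: `ifconfig | awk '/inet / {print $2; exit}'`.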
Step 14: Verify the Hadoop Cluster :
At this point, the Hadoop cluster is installed and configured. Next, we will create some directories in the HDFS filesystem to test it. Run the following commands:
hdfs dfs -mkdir /test1
hdfs dfs -mkdir /logs
hdfs dfs -ls /
hdfs dfs -put /var/log/* /logs/
The commands above create directories in HDFS, list the filesystem root, and upload the local log files into /logs.
Step 15: To stop Hadoop services :
stop-all.sh
This command stops all the Hadoop daemons at once.
Conclusion:
In this tutorial, you’ve installed Hadoop in pseudo-distributed mode on a single machine and verified it by creating directories and uploading files in HDFS. To learn how to write your own MapReduce programs, you might want to visit Apache Hadoop’s MapReduce tutorial, which walks through the code behind the examples. When you’re ready to set up a multi-machine cluster, see the Apache Foundation’s Hadoop Cluster Setup guide.