Installing Hadoop on Ubuntu 20.04

tanut aran · Published in CODEMONDAY · Jan 31, 2022

Below I wrap up the installation process. This setup is good for experimentation, NOT for production at all.


What you need to do

  1. Install Java
  2. Download Hadoop
  3. Set environment
  4. Edit Hadoop XML
  5. start-dfs.sh
  6. start-yarn.sh

If successful, you will see:

  • localhost:8088 → the YARN ResourceManager screen (the one with the Hadoop logo)
  • localhost:9870 → the HDFS NameNode screen showing the cluster status

Install Java

Update the package index and search for the available JDK packages.

If you are not familiar with Java, don't worry about the terminology; we only need the JDK.

sudo apt update
sudo apt-cache search openjdk

The latest LTS here is 11, so I will install 11:

sudo apt install openjdk-11-jdk
java -version
javac -version

Download Hadoop

Visit the link below. On the command line, you will need wget <link> to download it, then extract it to your home directory.

Choose the newest version; here it is 3.3.1. Then choose the tar.gz.
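
For example, a minimal download-and-extract sketch, assuming version 3.3.1 from the Apache archive (the actual mirror URL may differ; use whichever one the downloads page gives you):

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
tar -xzf hadoop-3.3.1.tar.gz -C ~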

Set Up the Environment

In .bashrc, set the variables for Hadoop and also extend PATH so the Hadoop commands can be called conveniently:

export HADOOP_HOME=/home/hadoop/hadoop-3.3.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
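
Reload the file so the variables take effect in the current shell:

source ~/.bashrc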

At this point you should be able to call the following binaries from anywhere:

  • hadoop
  • hdfs
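
A quick sanity check from any directory; both should print the Hadoop version:

hadoop version
hdfs version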

Edit Hadoop XML and Start

I think the official Hadoop setup page already covers this well. Link below.

A quick overview will make the reading easier; a sketch of these steps follows the list:

  • Hadoop will SSH to localhost, so you will need to set up an SSH key
  • You need pseudo-distributed mode
  • Copy and paste the XML from the Hadoop guide
  • Start DFS
  • Start YARN
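
A minimal sketch of the SSH setup, matching the official single-node guide:

# generate a passwordless key and authorize it for localhost
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost  # should log in without a password prompt

The XML for pseudo-distributed mode is short. From the official guide, etc/hadoop/core-site.xml and etc/hadoop/hdfs-site.xml become:

<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Then format the filesystem once and start the daemons:

hdfs namenode -format
start-dfs.sh
start-yarn.sh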

Common error: JAVA_HOME not found

JAVA_HOME needs to be set in

$HADOOP_HOME/etc/hadoop/hadoop-env.sh

NOT in .bashrc!
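
That is, open the file and set it explicitly; the path below matches the OpenJDK 11 package installed earlier:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64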

Common error 2: Cannot start YARN

Error when running start-yarn.sh:

resourcemanager is running as process 48888. Stop it first and ensure /tmp/hadoop-hadoop-resourcemanager.pid file is empty before retry

The ResourceManager is still running despite stop-dfs.sh (it stops only the HDFS daemons), so you need to stop ALL of them:

stop-all.sh
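
To confirm everything is down (or up, after a restart), jps lists the running Java processes; a healthy pseudo-distributed cluster shows NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager:

jps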

Note

Just leave the processes running like that. It seems we do not need to run them with systemctl or service as we usually do.

Check the status page

Finally, you should see results like below:

localhost:8088
localhost:9870
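
If you are on a headless server and want a quick check without a browser, the YARN ResourceManager also answers on its standard REST API (a simple probe, assuming the default port):

curl http://localhost:8088/ws/v1/cluster/info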

A tip if you deploy it on a remote server:

Use SSH to forward the ports down to localhost, then open them in your browser.

ssh -L 9870:localhost:9870 -nNT ubuntu@<your-server-ip>
ssh -L 8088:localhost:8088 -nNT ubuntu@<your-server-ip>

Hope this helps!
