Set up a Hadoop pseudo-distributed cluster on Ubuntu 16.04
This article walks through setting up Hadoop on a personal Ubuntu machine from scratch. Follow the steps and you will be able to start the Hadoop processes along with YARN; we will also demonstrate how to upload a file to HDFS and execute an MR job using that uploaded file.
Check that Java is installed
Ensure Java is installed on the system. If it is not, install it before continuing (a minimal example is shown after the version check below).
nitin@nitin-Satellite-C850:~$ java -version
openjdk version "1.8.0_191"
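If Java is not present, one simple option on Ubuntu 16.04 is OpenJDK 8 from the default repositories (shown here as an example; any Java 8 installation will do):
nitin@nitin-Satellite-C850:~$ sudo apt-get install openjdk-8-jdk
Re-run "java -version" afterwards to confirm the installation.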
Set up ssh on the system
1. Install openssh-server.
nitin@nitin-Satellite-C850:~$ sudo apt-get install openssh-server
2. Verify that public key authentication is enabled in the config file.
Open the file with "sudo vi /etc/ssh/sshd_config" and ensure "PubkeyAuthentication" is set to "yes".
3. Restart ssh service
nitin@nitin-Satellite-C850:~$ sudo service ssh restart
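Optionally, confirm the SSH daemon is up after the restart:
nitin@nitin-Satellite-C850:~$ sudo service ssh status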
Create a new user "hduser" for Hadoop
1. Create a new group.
nitin@nitin-Satellite-C850:~$ sudo addgroup hadoop
2. Create a new user in the group just created.
nitin@nitin-Satellite-C850:~$ sudo adduser --ingroup hadoop hduser
Set up passwordless authentication for the newly created "hduser"
1. Switch to hduser.
nitin@nitin-Satellite-C850:~$ su - hduser
2. Create an RSA key pair.
hduser@nitin-Satellite-C850:~$ ssh-keygen -t rsa -P ""
3. Copy the newly created public key to "authorized_keys" so that hduser can ssh to localhost, then restrict its permissions.
hduser@nitin-Satellite-C850:~$ cat $HOME/.ssh/id_rsa.pub > $HOME/.ssh/authorized_keys
hduser@nitin-Satellite-C850:~$ chmod 0600 $HOME/.ssh/authorized_keys
4. Test passwordless authentication; you should be able to connect to localhost without entering a password.
hduser@nitin-Satellite-C850:~$ ssh localhost
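If ssh still prompts for a password, a common cause is overly permissive permissions on the ".ssh" directory; tightening them (a typical fix, not always required) usually resolves it:
hduser@nitin-Satellite-C850:~$ chmod 700 $HOME/.ssh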
Download and install Hadoop
1. Download Hadoop to the home directory.
nitin@nitin-Satellite-C850:~$ wget http://mirrors.estointernet.in/apache/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
2. Untar the Hadoop tar file.
nitin@nitin-Satellite-C850:~$ tar -xzvf hadoop-2.7.7.tar.gz
3. Move the extracted folder to /usr/local (requires sudo).
nitin@nitin-Satellite-C850:~$ sudo mv hadoop-2.7.7 /usr/local/hadoop
4. Change ownership to the Hadoop user "hduser".
nitin@nitin-Satellite-C850:/usr/local$ sudo chown -R hduser:hadoop /usr/local/hadoop
Please note that from now on all activities are performed as user "hduser".
5. su to "hduser" and add the following lines to the ".bashrc" file using the "vi" or "nano" editor:
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME
6. Reload the ".bashrc" file.
hduser@nitin-Satellite-C850:~$ source ~/.bashrc
7. Update "/usr/local/hadoop/etc/hadoop/hadoop-env.sh" for "JAVA_HOME".
Comment out the existing "JAVA_HOME" line and insert the line below; it resolves the relevant JAVA_HOME dynamically. A quick sanity check follows this step.
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
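As a quick sanity check (optional), confirm the environment resolves before continuing:
hduser@nitin-Satellite-C850:~$ echo $HADOOP_HOME
hduser@nitin-Satellite-C850:~$ hadoop version
The first command should print "/usr/local/hadoop" and the second should report version 2.7.7; if "hadoop" is not found, re-check the PATH export in ".bashrc".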
Test Hadoop setup
We will verify the Hadoop setup in local (standalone) mode first, before moving to the actual pseudo-distributed mode where YARN will also be running.
1. Create an input directory in the home directory.
hduser@nitin-Satellite-C850:~$ mkdir ~/input
2. Copy test data from the Hadoop configuration directory to the "input" directory.
hduser@nitin-Satellite-C850:~$ cp /usr/local/hadoop/etc/hadoop/*.xml ~/input
3. Execute the example Hadoop job.
hduser@nitin-Satellite-C850:~$ /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar grep ~/input ~/output 'principal[.]*'
The Hadoop job runs locally and its result is stored in the "output" directory under the home folder from which the job was executed.
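To inspect the result, print the files in the output directory; the part file contains the matches for the 'principal' pattern with their counts:
hduser@nitin-Satellite-C850:~$ cat ~/output/*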
Configure Hadoop for pseudo-distributed operation
We will now configure Hadoop as a single-node cluster where we can execute MapReduce jobs and perform HDFS operations.
1. Create the working directories where data for Hadoop operations will be stored.
hduser@nitin-Satellite-C850:~$ mkdir -p app/hadoop/
hduser@nitin-Satellite-C850:~$ mkdir app/hadoop/namenode
hduser@nitin-Satellite-C850:~$ mkdir app/hadoop/datanode
hduser@nitin-Satellite-C850:~$ mkdir app/hadoop/tmp
2. Configure "/usr/local/hadoop/etc/hadoop/hdfs-site.xml". The properties below go inside the <configuration> element.
Set the replication factor to 1 (the default is 3):
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
Configure the path for NameNode metadata:
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hduser/app/hadoop/namenode</value>
</property>
Configure the path for DataNode block data (note the property is dfs.datanode.data.dir):
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hduser/app/hadoop/datanode</value>
</property>
3. Configure "/usr/local/hadoop/etc/hadoop/core-site.xml". These properties also go inside the <configuration> element.
Configure the default file system URI (NameNode host and port) for the Hadoop instance; fs.default.name is the legacy name of fs.defaultFS and still works in Hadoop 2.x:
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
Configure the temporary directory used internally by the Hadoop system:
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hduser/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
4. Configure "/usr/local/hadoop/etc/hadoop/mapred-site.xml".
Create mapred-site.xml from the template shipped with Hadoop:
cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
Configure YARN as the MapReduce framework:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
5. Configure "/usr/local/hadoop/etc/hadoop/yarn-site.xml".
Configure the YARN shuffle auxiliary service required by MapReduce:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
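Optionally, verify that the edited XML files are still well formed (this assumes xmllint from the "libxml2-utils" package is installed; any XML validator works):
hduser@nitin-Satellite-C850:~$ xmllint --noout /usr/local/hadoop/etc/hadoop/*-site.xml
No output means all files parsed cleanly.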
Format the Hadoop file system
hduser@nitin-Satellite-C850:~$ hdfs namenode -format
This formats the "namenode" directory; a successful run exits with status 0 and contains a line like the following in the output:
Storage directory /home/hduser/app/hadoop/namenode has been successfully formatted.
Start the Hadoop processes
1. Start "NameNode", "DataNode" and "SecondaryNameNode" (verify by opening http://localhost:50070/).
hduser@nitin-Satellite-C850:~$ start-dfs.sh
2. Start the YARN "ResourceManager" and "NodeManager" (verify by opening http://localhost:8088/).
hduser@nitin-Satellite-C850:~$ start-yarn.sh
3. Start the "JobHistoryServer" (verify via http://localhost:19888/).
hduser@nitin-Satellite-C850:~$ mr-jobhistory-daemon.sh start historyserver
Verify that the Hadoop processes are running
hduser@nitin-Satellite-C850:~$ jps
30161 JobHistoryServer
28884 ResourceManager
30373 Jps
28681 SecondaryNameNode
28283 NameNode
28429 DataNode
29135 NodeManager
Now we have all Hadoop processes running. If any are missing from the list, see the log check below.
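If a daemon did not start, the quickest diagnostic is its log under the Hadoop logs directory (the file name below follows the default hadoop-<user>-<daemon>-<hostname>.log pattern; substitute the daemon you are debugging):
hduser@nitin-Satellite-C850:~$ ls /usr/local/hadoop/logs/
hduser@nitin-Satellite-C850:~$ tail -n 50 /usr/local/hadoop/logs/hadoop-hduser-namenode-*.log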
Set up an HDFS home directory for our "hduser"
hduser@nitin-Satellite-C850:~$ hdfs dfs -mkdir /user
hduser@nitin-Satellite-C850:~$ hdfs dfs -mkdir /user/hduser
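Optionally, confirm the directory exists:
hduser@nitin-Satellite-C850:~$ hdfs dfs -ls /user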
Test our newly set up Hadoop cluster. We will push our test data to HDFS and execute the same MR job on the cluster.
1. Copy the input directory to HDFS (verify the data on http://localhost:50070/).
hduser@nitin-Satellite-C850:~$ hdfs dfs -copyFromLocal input
2. Execute the MR job against HDFS (verify the running job on the ResourceManager at http://localhost:8088/).
hduser@nitin-Satellite-C850:~$ /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar grep input output 'principal[.]*'
3. Check output on HDFS
hduser@nitin-Satellite-C850:~$ hdfs dfs -ls output
18/12/26 21:08:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 hduser supergroup          0 2018-12-26 20:22 output/_SUCCESS
-rw-r--r--   1 hduser supergroup         25 2018-12-26 20:22 output/part-r-00000
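To view the actual matches, cat the result file directly from HDFS:
hduser@nitin-Satellite-C850:~$ hdfs dfs -cat output/part-r-00000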
Hence we have created a single-node pseudo-distributed Hadoop cluster which is ready for HDFS commands and for integration with Sqoop and other tools.