Set up a Hadoop pseudo-distributed cluster on Ubuntu 16.04
This article walks through setting up Hadoop on a personal Ubuntu machine from scratch. Follow the steps and you will be able to start the Hadoop processes along with YARN; we will also demonstrate how to upload a file to HDFS and execute an MR job using that uploaded file.
Check that Java is installed
Ensure Java is installed on the system. If it is not, install it before continuing (a minimal example is shown after the version check below).
nitin@nitin-Satellite-C850:~$ java -version
openjdk version "1.8.0_191"
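If Java is not present, one simple option on Ubuntu 16.04 is OpenJDK 8 from the default repositories (shown here as an example; any Java 8 installation will do):
nitin@nitin-Satellite-C850:~$ sudo apt-get install openjdk-8-jdk
Re-run "java -version" afterwards to confirm the installation.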
Set up ssh on the system
1. Install openssh-server.
nitin@nitin-Satellite-C850:~$ sudo apt-get install openssh-server
2. Verify that public key authentication is enabled in the config file.
Open the file with "sudo vi /etc/ssh/sshd_config" and ensure "PubkeyAuthentication" is set to "yes".
3. Restart ssh service
nitin@nitin-Satellite-C850:~$ sudo service ssh restart
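Optionally, confirm the SSH daemon is up after the restart:
nitin@nitin-Satellite-C850:~$ sudo service ssh status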
Create a new user "hduser" for Hadoop
1. Create a new group.
nitin@nitin-Satellite-C850:~$ sudo addgroup hadoop
2. Create a new user in the group just created.
nitin@nitin-Satellite-C850:~$ sudo adduser --ingroup hadoop hduser
Set up passwordless authentication for the newly created "hduser"
1. Switch to hduser.
nitin@nitin-Satellite-C850:~$ su - hduser
2. Create an RSA key pair.
hduser@nitin-Satellite-C850:~$ ssh-keygen -t rsa -P ""
3. Copy the newly created public key to "authorized_keys" so that hduser can ssh to localhost, then restrict its permissions.
hduser@nitin-Satellite-C850:~$ cat $HOME/.ssh/id_rsa.pub > $HOME/.ssh/authorized_keys
hduser@nitin-Satellite-C850:~$ chmod 0600 $HOME/.ssh/authorized_keys
4. Test passwordless authentication; you should be able to connect to localhost without entering a password.
hduser@nitin-Satellite-C850:~$ ssh localhost
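If ssh still prompts for a password, a common cause is overly permissive permissions on the ".ssh" directory; tightening them (a typical fix, not always required) usually resolves it:
hduser@nitin-Satellite-C850:~$ chmod 700 $HOME/.ssh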
Download and install Hadoop
1. Download Hadoop to the home directory.
nitin@nitin-Satellite-C850:~$ wget http://mirrors.estointernet.in/apache/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
2. Untar the Hadoop tar file.
nitin@nitin-Satellite-C850:~$ tar -xzvf hadoop-2.7.7.tar.gz
3. Move the extracted folder to /usr/local (requires sudo).
nitin@nitin-Satellite-C850:~$ sudo mv hadoop-2.7.7 /usr/local/hadoop
4. Change ownership to the Hadoop user "hduser".
nitin@nitin-Satellite-C850:/usr/local$ sudo chown -R hduser:hadoop /usr/local/hadoop
Please note that from now on all activities are performed as user "hduser".
5. su to "hduser" and add the following lines to the ".bashrc" file using the "vi" or "nano" editor:
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME
6. Reload the ".bashrc" file.
hduser@nitin-Satellite-C850:~$ source ~/.bashrc
7. Update "/usr/local/hadoop/etc/hadoop/hadoop-env.sh" for "JAVA_HOME".
Comment out the existing "JAVA_HOME" line and insert the line below; it resolves the relevant JAVA_HOME dynamically. A quick sanity check follows this step.
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
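As a quick sanity check (optional), confirm the environment resolves before continuing:
hduser@nitin-Satellite-C850:~$ echo $HADOOP_HOME
hduser@nitin-Satellite-C850:~$ hadoop version
The first command should print "/usr/local/hadoop" and the second should report version 2.7.7; if "hadoop" is not found, re-check the PATH export in ".bashrc".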
Test Hadoop setup
We will verify the Hadoop setup in local (standalone) mode first, before moving to the actual pseudo-distributed mode where YARN will also be running.
1. Create an input directory in the home directory.
hduser@nitin-Satellite-C850:~$ mkdir ~/input
2. Copy test data from the Hadoop configuration directory to the "input" directory.
hduser@nitin-Satellite-C850:~$ cp /usr/local/hadoop/etc/hadoop/*.xml ~/input
3. Execute the example Hadoop job.
hduser@nitin-Satellite-C850:~$ /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar grep ~/input ~/output 'principal[.]*'
The Hadoop job runs locally and its result is stored in the "output" directory under the home folder from which the job was executed.
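To inspect the result, print the files in the output directory; the part file contains the matches for the 'principal' pattern with their counts:
hduser@nitin-Satellite-C850:~$ cat ~/output/*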
Configure Hadoop for pseudo-distributed operation
We will now configure Hadoop as a single-node cluster where we can execute MapReduce jobs and perform HDFS operations.
1. Create the working directories where data for Hadoop operations will be stored.
hduser@nitin-Satellite-C850:~$ mkdir -p app/hadoop/
hduser@nitin-Satellite-C850:~$ mkdir app/hadoop/namenode
hduser@nitin-Satellite-C850:~$ mkdir app/hadoop/datanode
hduser@nitin-Satellite-C850:~$ mkdir app/hadoop/tmp
2. Configure "/usr/local/hadoop/etc/hadoop/hdfs-site.xml". The properties below go inside the <configuration> element.
Set the replication factor to 1 (the default is 3):
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
Configure the path for NameNode metadata:
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hduser/app/hadoop/namenode</value>
</property>
Configure the path for DataNode block data (note the property is dfs.datanode.data.dir):
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hduser/app/hadoop/datanode</value>
</property>
3. Configure "/usr/local/hadoop/etc/hadoop/core-site.xml". These properties also go inside the <configuration> element.
Configure the default file system URI (NameNode host and port) for the Hadoop instance; fs.default.name is the legacy name of fs.defaultFS and still works in Hadoop 2.x:
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
Configure the temporary directory used internally by the Hadoop system:
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hduser/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
4. Configure "/usr/local/hadoop/etc/hadoop/mapred-site.xml".
Create mapred-site.xml from the template shipped with Hadoop:
cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
Configure YARN as the MapReduce framework:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
5. Configure "/usr/local/hadoop/etc/hadoop/yarn-site.xml".
Configure the YARN shuffle auxiliary service required by MapReduce:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
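Optionally, verify that the edited XML files are still well formed (this assumes xmllint from the "libxml2-utils" package is installed; any XML validator works):
hduser@nitin-Satellite-C850:~$ xmllint --noout /usr/local/hadoop/etc/hadoop/*-site.xml
No output means all files parsed cleanly.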
Format the Hadoop file system
hduser@nitin-Satellite-C850:~$ hdfs namenode -format
This formats the "namenode" directory; a successful run exits with status 0 and contains a line like the following in the output:
Storage directory /home/hduser/app/hadoop/namenode has been successfully formatted.
Start the Hadoop processes
1. Start "NameNode", "DataNode" and "SecondaryNameNode" (verify by opening http://localhost:50070/).
hduser@nitin-Satellite-C850:~$ start-dfs.sh
2. Start the YARN "ResourceManager" and "NodeManager" (verify by opening http://localhost:8088/).
hduser@nitin-Satellite-C850:~$ start-yarn.sh
3. Start the "JobHistoryServer" (verify via http://localhost:19888/).
hduser@nitin-Satellite-C850:~$ mr-jobhistory-daemon.sh start historyserver
Verify that the Hadoop processes are running
hduser@nitin-Satellite-C850:~$ jps
30161 JobHistoryServer
28884 ResourceManager
30373 Jps
28681 SecondaryNameNode
28283 NameNode
28429 DataNode
29135 NodeManager
Now we have all Hadoop processes running. If any are missing from the list, see the log check below.
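If a daemon did not start, the quickest diagnostic is its log under the Hadoop logs directory (the file name below follows the default hadoop-<user>-<daemon>-<hostname>.log pattern; substitute the daemon you are debugging):
hduser@nitin-Satellite-C850:~$ ls /usr/local/hadoop/logs/
hduser@nitin-Satellite-C850:~$ tail -n 50 /usr/local/hadoop/logs/hadoop-hduser-namenode-*.log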
Set up an HDFS home directory for our "hduser"
hduser@nitin-Satellite-C850:~$ hdfs dfs -mkdir /user
hduser@nitin-Satellite-C850:~$ hdfs dfs -mkdir /user/hduser
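Optionally, confirm the directory exists:
hduser@nitin-Satellite-C850:~$ hdfs dfs -ls /user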
Test our newly set up Hadoop cluster. We will push our test data to HDFS and execute the same MR job on the cluster.
1. Copy the input directory to HDFS (verify the data on http://localhost:50070/).
hduser@nitin-Satellite-C850:~$ hdfs dfs -copyFromLocal input
2. Execute the MR job against HDFS (verify the running job on the ResourceManager at http://localhost:8088/).
hduser@nitin-Satellite-C850:~$ /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar grep input output 'principal[.]*'
3. Check output on HDFS
hduser@nitin-Satellite-C850:~$ hdfs dfs -ls output
18/12/26 21:08:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 hduser supergroup          0 2018-12-26 20:22 output/_SUCCESS
-rw-r--r--   1 hduser supergroup         25 2018-12-26 20:22 output/part-r-00000
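To view the actual matches, cat the result file directly from HDFS:
hduser@nitin-Satellite-C850:~$ hdfs dfs -cat output/part-r-00000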
Hence we have created a single-node pseudo-distributed Hadoop cluster which is ready for HDFS commands and for integration with Sqoop and other tools.