Big Data — Hadoop

Henry Han
Mar 13, 2018

Introducing Apache Hadoop

What is Apache Hadoop?

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Download Hadoop

Go to the Apache Hadoop releases page at https://hadoop.apache.org/releases.html, and select the latest version and its binary file.

Hadoop installation and environment setup

Environment

I used Ubuntu 16.04 (64-bit) as the system environment.

Create a Hadoop user

If you’re not using “hadoop” as the user, you need to create a user named “hadoop”.

Open a new terminal, then type in the following command.

add a new user “hadoop”
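On Ubuntu, one way to do this is with useradd, using bash as the login shell:

    $ sudo useradd -m hadoop -s /bin/bash   # -m creates the home directory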

Type in the password of the currently logged-in user when prompted.

sudo command

sudo is a program for Unix-like computer operating systems that allows users to run programs with the security privileges of another user, by default the superuser.

Use the following command to set up the password for “hadoop” user.

set up password for “hadoop”
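For example:

    $ sudo passwd hadoop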

Give the “hadoop” user admin privileges; this avoids unpredictable permission problems for beginners.

give “hadoop” admin privilege
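On Ubuntu, adding the user to the sudo group does this:

    $ sudo adduser hadoop sudo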

Update apt

Log in as “hadoop” first

Open up a new terminal, type in the following command to update apt.

update apt
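That is:

    $ sudo apt-get update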

We’re using apt to install applications. Some applications cannot be installed without first updating apt’s package index.

You can also install “vim”, which is handy for editing the configuration files later.

install vim
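For example:

    $ sudo apt-get install vim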

Type “y” when asked if you want to continue.

Install and configure SSH to login without password

Secure Shell (SSH) is a cryptographic network protocol for operating network services securely over an unsecured network.

Ubuntu installs the SSH client by default; we also need to install the SSH server.

install SSH server
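On Ubuntu the server package is openssh-server:

    $ sudo apt-get install openssh-server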

Type “y” when asked if you want to continue.

After installation, type in the following command and see if you can log in to the local machine.
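For example:

    $ ssh localhost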

Type “yes” when asked whether to continue connecting, then enter the password for “hadoop”.

We need to type in the password every time we log in. How do we configure SSH so that we can log in without typing the password?

exit the connection
go to the ~/.ssh/ folder
generate a public/private RSA key pair and append the public key to authorized_keys
log in without a password
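A typical command sequence for these steps (ssh-keygen’s default file names are assumed) is:

    $ exit                                    # exit the connection to localhost
    $ cd ~/.ssh/                              # created by the earlier ssh login
    $ ssh-keygen -t rsa                       # press Enter at every prompt
    $ cat ./id_rsa.pub >> ./authorized_keys   # authorize the new key
    $ ssh localhost                           # should now log in without a password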

Configure Java environment

Issue the following command to install Java.

install Java
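One common choice, and one that matches the JAVA_HOME path used below, is Ubuntu’s default JRE/JDK packages:

    $ sudo apt-get install default-jre default-jdk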

Type “y” when asked if you want to continue.

To set the value of the JAVA_HOME variable, type in the following command to edit the .bashrc file.

edit bashrc file
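For example, with vim:

    $ vim ~/.bashrc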

Add “export JAVA_HOME=/usr/lib/jvm/default-java” as the first line.

add a line
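The first line of ~/.bashrc should now read:

    export JAVA_HOME=/usr/lib/jvm/default-java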

Run the following command to make the change take effect immediately.

update bashrc file
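That is:

    $ source ~/.bashrc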

To verify that Java has been successfully installed, type in the following commands.

verify $JAVA_HOME
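For example:

    $ echo $JAVA_HOME                 # should print /usr/lib/jvm/default-java
    $ java -version
    $ $JAVA_HOME/bin/java -version    # should print the same version as java -version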

Install Hadoop 3

Make sure the Hadoop installation file is in the Downloads folder.

Extract the installation file to the /usr/local folder.

Go to the /usr/local/ folder.

Change the name of the folder to hadoop.

Change the ownership of the folder to the “hadoop” user.

Make sure you are in the /usr/local/hadoop folder, then type in the following command to verify that Hadoop has been successfully installed. The whole sequence is sketched below.
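A typical sequence for these steps, assuming the downloaded archive is named hadoop-3.0.0.tar.gz (to match the examples jar used later), is:

    $ sudo tar -zxf ~/Downloads/hadoop-3.0.0.tar.gz -C /usr/local   # extract to /usr/local
    $ cd /usr/local/
    $ sudo mv ./hadoop-3.0.0/ ./hadoop                              # rename the folder
    $ sudo chown -R hadoop ./hadoop                                 # give ownership to the hadoop user
    $ cd /usr/local/hadoop
    $ ./bin/hadoop version                                          # verify the installation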

If Hadoop has been successfully installed, the Hadoop version information will be displayed on the screen.

Hadoop Standalone mode (non-distributed mode)

Create a folder called input inside the /usr/local/hadoop folder.

Copy all XML files from the ./etc/hadoop folder into the input folder.

Run the built-in program — hadoop-mapreduce-examples-3.0.0.jar

To see the output, print the files in the output folder. The whole sequence is sketched below.
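Putting the steps together, using the grep example from the Apache single-node setup guide (it extracts every string matching the regular expression dfs[a-z.]+ from the input files):

    $ cd /usr/local/hadoop
    $ mkdir ./input
    $ cp ./etc/hadoop/*.xml ./input                # use the config files as input
    $ ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0.jar grep ./input ./output 'dfs[a-z.]+'
    $ cat ./output/*                               # see the output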

Hadoop pseudo-distributed mode

The configuration files of Hadoop are located in /usr/local/hadoop/etc/hadoop/. You need to edit two configuration files, core-site.xml and hdfs-site.xml, to switch to pseudo-distributed mode.

Run the following command to edit the core-site.xml file.
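For example:

    $ vim ./etc/hadoop/core-site.xml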

Then change the configuration to the following settings.
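A typical minimal pseudo-distributed core-site.xml is shown below; the hadoop.tmp.dir path matches the /usr/local/hadoop layout used here and is a convention, not a requirement:

    <configuration>
        <property>
            <name>hadoop.tmp.dir</name>
            <value>file:/usr/local/hadoop/tmp</value>
        </property>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
        </property>
    </configuration>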

To edit the hdfs-site.xml file:
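    $ vim ./etc/hadoop/hdfs-site.xml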

Then change the configuration to the following settings.
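A matching hdfs-site.xml, with a replication factor of 1 since there is only one node; the name and data directories again follow the tmp-folder convention above:

    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>file:/usr/local/hadoop/tmp/dfs/name</value>
        </property>
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>file:/usr/local/hadoop/tmp/dfs/data</value>
        </property>
    </configuration>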

Format the NameNode, then start HDFS.
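From the /usr/local/hadoop folder:

    $ ./bin/hdfs namenode -format   # format the NameNode (only needed once)
    $ ./sbin/start-dfs.sh           # start the NameNode, DataNode and SecondaryNameNode daemons
    $ jps                           # should list NameNode, DataNode and SecondaryNameNode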

Open a web browser, and go to http://localhost:9870 to view information about the NameNode and DataNode, as well as the files in HDFS.

Follow these steps to run a Hadoop pseudo-distributed mode instance; the full command sequence is sketched after the steps.

Create a user folder for “hadoop” on HDFS.

Create a folder called input.

Copy all XML files from the ./etc/hadoop folder into the input folder on HDFS.

To view the file list in the input folder.

Run the built-in program — hadoop-mapreduce-examples-3.0.0.jar

To see the output.
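The full sequence, again using the grep example (paths such as input and output without a leading / are relative to /user/hadoop on HDFS):

    $ ./bin/hdfs dfs -mkdir -p /user/hadoop          # create the user folder on HDFS
    $ ./bin/hdfs dfs -mkdir input                    # create the input folder
    $ ./bin/hdfs dfs -put ./etc/hadoop/*.xml input   # copy the xml files to HDFS
    $ ./bin/hdfs dfs -ls input                       # view the file list
    $ ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0.jar grep input output 'dfs[a-z.]+'
    $ ./bin/hdfs dfs -cat output/*                   # see the output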

You can also retrieve the output to the local file system.

Remove the local output folder first.

Copy the output from HDFS to the local file system.

To see the output locally.

To remove the output folder on HDFS.
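For example:

    $ rm -r ./output                        # remove the local output folder first
    $ ./bin/hdfs dfs -get output ./output   # copy the output from HDFS to local
    $ cat ./output/*                        # see the output locally
    $ ./bin/hdfs dfs -rm -r output          # remove the output folder on HDFS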

To stop Hadoop

To start Hadoop
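Both scripts live in the /usr/local/hadoop/sbin folder:

    $ ./sbin/stop-dfs.sh    # stop Hadoop
    $ ./sbin/start-dfs.sh   # start Hadoop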

Introduction to MapReduce

What is MapReduce?

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

Algorithm

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.

The MapReduce paradigm is based on sending the computation to where the data resides.

A MapReduce program executes in three stages: the map stage, the shuffle stage, and the reduce stage.

Map stage: the map or mapper’s job is to process the input data. The input data is in the form of a file or directory and is stored in HDFS. The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.

Reduce stage: this stage is the combination of the shuffle stage and the reduce stage. The reducer’s job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
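As a rough shell analogy (an illustration of the three stages, not how Hadoop actually runs), a word count over an arbitrary local text file, here called input.txt, can be written as a pipeline: tr plays the map stage by emitting one word per line, sort plays the shuffle stage by grouping identical words, and uniq -c plays the reduce stage by counting each group:

    $ tr -s ' ' '\n' < input.txt | sort | uniq -c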

Terminology

  1. NameNode — Node that manages the Hadoop Distributed File System (HDFS).
  2. DataNode — Node that stores the actual HDFS data blocks and serves read/write requests from clients.
  3. JobTracker — Schedules jobs and tracks the assigned jobs to the TaskTracker.
  4. TaskTracker — Tracks the tasks and reports status to the JobTracker.
  5. Job — An execution of a Mapper and Reducer across a dataset.
  6. Task — An execution of a Mapper or a Reducer on a slice of data.
  7. Mapper — Maps the input key/value pairs to a set of intermediate key/value pairs.

Please refer to https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html for more details and examples about Hadoop MapReduce.
