Introducing Apache Hadoop
What is Apache Hadoop?
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
Download Hadoop
Go to the following website, then select the latest version and download the binary file.
Hadoop installation and environment setup
Environment
I used Ubuntu 16.04 (64-bit) as the system environment.
Create a Hadoop user
If you’re not using “hadoop” as the user, you need to create a user named “hadoop”.
Open a new terminal, then type in the following command.
Type in the password for the currently logged-in user when prompted.
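A sketch of the command (assuming the `useradd` approach; `-m` creates a home directory and `-s` sets bash as the login shell):

```shell
# Create a new user named "hadoop" with a home directory and bash as its shell
sudo useradd -m hadoop -s /bin/bash
```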
sudo command
sudo is a program for Unix-like operating systems that allows users to run programs with the security privileges of another user, by default the superuser.
Use the following command to set up the password for “hadoop” user.
Give the "hadoop" user admin privileges; this helps beginners avoid unpredictable permission problems later.
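Both steps can be sketched as follows (adding the user to the `sudo` group is the usual way to grant admin rights on Ubuntu):

```shell
# Set a password for the "hadoop" user (you will be prompted to enter it twice)
sudo passwd hadoop
# Add "hadoop" to the sudo group so it can run administrative commands
sudo adduser hadoop sudo
```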
Update apt
Log in as “hadoop” first
Open up a new terminal, type in the following command to update apt.
We're using apt to install applications. Some applications cannot be installed without updating apt first.
You can also install “vim”
Type “y” when being asked if you want to continue.
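The update and the optional vim installation look like this (a sketch of the standard apt commands):

```shell
sudo apt-get update        # refresh the package index
sudo apt-get install vim   # optional: install the vim editor
```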
Install and configure SSH to login without password
Secure Shell (SSH) is a cryptographic network protocol for operating network services securely over an unsecured network.
Ubuntu installs the SSH client by default; we also need to install the SSH server.
Type “y” when being asked if you want to continue.
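On Ubuntu the server is provided by the `openssh-server` package:

```shell
# Install the SSH server so the machine can accept ssh logins
sudo apt-get install openssh-server
```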
After installation, type in the following command to see if you can log in to the local machine.
Type "yes" to confirm the connection, then enter the password for "hadoop".
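The login test is simply:

```shell
# Log in to this machine over SSH; you will be asked for the "hadoop" password
ssh localhost
```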
We need to type in the password every time we log in. How do we configure SSH so we can log in without a password?
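One common way is key-based authentication. A sketch (assumes you first log out of the ssh session from the previous step; pressing Enter at every `ssh-keygen` prompt accepts the defaults):

```shell
exit                                    # leave the ssh session started above
cd ~/.ssh/                              # created by the earlier ssh attempt
ssh-keygen -t rsa                       # press Enter at every prompt
cat ./id_rsa.pub >> ./authorized_keys   # authorize the key for localhost login
```

After this, `ssh localhost` should log in without asking for a password.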
Configure Java environment
Issue the following command to install Java.
Type “y” when being asked if you want to continue.
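A sketch, assuming the OpenJDK packages from the Ubuntu repositories (the `default-jdk` package provides the `/usr/lib/jvm/default-java` link used below):

```shell
# Install the default JRE and JDK from the Ubuntu repositories
sudo apt-get install default-jre default-jdk
```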
Set the value of the JAVA_HOME variable. Type in the following command to edit the bashrc file.
Add "export JAVA_HOME=/usr/lib/jvm/default-java" as the first line.
Run the following command to update the file immediately.
To verify if Java has been successfully installed, type in the following command.
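The steps above can be sketched as:

```shell
vim ~/.bashrc
# add this line at the top of the file, then save and quit:
#   export JAVA_HOME=/usr/lib/jvm/default-java

source ~/.bashrc               # reload the file so the change takes effect now
echo $JAVA_HOME                # should print /usr/lib/jvm/default-java
java -version
$JAVA_HOME/bin/java -version   # should match the output of java -version
```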
Install Hadoop 3
Make sure the Hadoop installation file is in the Downloads folder.
Extract installation file to /usr/local folder.
Go to /usr/local/ folder
Change the name of the folder to hadoop
Change the owner of the folder to the "hadoop" user.
Make sure you are in the /usr/local/hadoop folder, type in the following command to verify if Hadoop has been successfully installed.
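The whole installation can be sketched as follows (the archive name `hadoop-3.0.0.tar.gz` is an assumption; adjust it to the version you actually downloaded):

```shell
sudo tar -zxf ~/Downloads/hadoop-3.0.0.tar.gz -C /usr/local  # extract to /usr/local
cd /usr/local/
sudo mv ./hadoop-3.0.0/ ./hadoop     # rename the folder to "hadoop"
sudo chown -R hadoop ./hadoop        # give the "hadoop" user ownership
cd /usr/local/hadoop
./bin/hadoop version                 # verify the installation
```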
If Hadoop has been successfully installed, the following message should be displayed on the screen.
Hadoop Standalone mode (non-distributed mode)
Create a folder called input inside /usr/local/hadoop folder.
Copy all xml files from the etc/hadoop folder (inside /usr/local/hadoop) into the input folder.
Run the built-in program — hadoop-mapreduce-examples-3.0.0.jar
To see the output,
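The whole standalone run can be sketched as follows (the grep example and its regex are one common choice; the jar file name must match your Hadoop version):

```shell
cd /usr/local/hadoop
mkdir ./input
cp ./etc/hadoop/*.xml ./input    # use the config files as sample input data
# run the bundled grep example: find lines matching the regex 'dfs[a-z.]+'
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0.jar grep ./input ./output 'dfs[a-z.]+'
cat ./output/*                   # view the result
```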
Hadoop pseudo-distributed mode
The configuration files of Hadoop are located in /usr/local/hadoop/etc/hadoop/. You need to edit two configuration files, core-site.xml and hdfs-site.xml, to switch to pseudo-distributed mode.
Run the following command to edit core-site.xml file
And change the configuration into the following settings.
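A commonly used single-node configuration looks like this (the hadoop.tmp.dir path assumes the /usr/local/hadoop layout from earlier; adjust if yours differs):

```xml
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>Base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
```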
To edit hdfs-site.xml file
And change the configuration into the following settings.
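For a pseudo-distributed setup the replication factor is set to 1, since there is only one DataNode (the directory paths again assume the layout above):

```xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    </property>
</configuration>
```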
Format NameNode
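Formatting the NameNode and starting the daemons can be sketched as (the daemons must be running before the web UI is reachable):

```shell
cd /usr/local/hadoop
./bin/hdfs namenode -format   # format the NameNode (only needed once)
./sbin/start-dfs.sh           # start the NameNode and DataNode daemons
jps                           # should list NameNode, DataNode, SecondaryNameNode
```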
Open a web browser and go to http://localhost:9870 to view information about the NameNode and DataNode, as well as the files in HDFS.
Follow the steps to run a Hadoop pseudo-distributed mode instance.
Create a folder on HDFS.
Create a folder called input.
Copy all xml files from the etc/hadoop folder into the input folder on HDFS.
To view the file list in the input folder.
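These steps can be sketched as (run from /usr/local/hadoop; paths without a leading slash are relative to the HDFS home directory /user/hadoop):

```shell
./bin/hdfs dfs -mkdir -p /user/hadoop        # HDFS home directory for "hadoop"
./bin/hdfs dfs -mkdir input                  # i.e. /user/hadoop/input
./bin/hdfs dfs -put ./etc/hadoop/*.xml input # upload the xml files to HDFS
./bin/hdfs dfs -ls input                     # list the files just copied
```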
Run the built-in program — hadoop-mapreduce-examples-3.0.0.jar
To see the output
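A sketch of the pseudo-distributed run (this time input and output are HDFS paths, not local folders; the jar name must match your version):

```shell
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0.jar grep input output 'dfs[a-z.]+'
./bin/hdfs dfs -cat output/*   # read the result directly from HDFS
```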
You can also copy the output to the local file system.
Remove the local output folder first.
Copy the output from HDFS to the local file system.
To see the output locally.
To remove the output folder on HDFS.
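These steps can be sketched as:

```shell
rm -r ./output                        # remove the local output folder first
./bin/hdfs dfs -get output ./output   # copy the HDFS output folder to local disk
cat ./output/*                        # view it locally
./bin/hdfs dfs -rm -r output          # remove the output folder on HDFS
```

Removing the HDFS output folder matters: Hadoop refuses to run a job whose output directory already exists.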
To stop Hadoop
To start Hadoop
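Both are single scripts in the sbin folder:

```shell
./sbin/stop-dfs.sh    # stop Hadoop
./sbin/start-dfs.sh   # start it again (no need to re-format the NameNode)
```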
Introduction to MapReduce
What is MapReduce?
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
Algorithm
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.
The MapReduce paradigm is based on sending the computation to where the data resides.
A MapReduce program executes in three stages: the map stage, the shuffle stage, and the reduce stage.
Map stage: the map or mapper’s job is to process the input data. The input data is in the form of file or directory and is stored in the HDFS. The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
Reduce stage: this stage is the combination of the shuffle and reduce steps. The reducer's job is to process the data that comes from the mapper. It produces a new set of output, which is stored in the HDFS.
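The three stages can be illustrated with an analogy built from standard Unix tools (this is not Hadoop itself, just a word-count sketch of the same map/shuffle/reduce pattern):

```shell
# map:     split each input line into one word per line
# shuffle: sort brings identical words next to each other
# reduce:  uniq -c counts each group of identical words
printf 'hello world\nhello hadoop\n' | tr ' ' '\n' | sort | uniq -c
# prints: 1 hadoop / 2 hello / 1 world (one count per word)
```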
Terminology
- NameNode — Node that manages the Hadoop Distributed File system (HDFS).
- DataNode — Node where the actual data blocks are stored in HDFS.
- JobTracker — Schedules jobs and tracks the jobs assigned to the Task Tracker.
- Task Tracker — Tracks the tasks and reports status to the JobTracker.
- Job — An execution of a Mapper and Reducer across a dataset.
- Task — An execution of a Mapper or a Reducer on a slice of data.
- Mapper — Maps the input key/value pairs to a set of intermediate key/value pairs.
Please refer to https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html for more details and examples about Hadoop MapReduce.