Hadoop Fundamentals

Buyung Bahari
Published in sarccom, Feb 6, 2017

I was excited when someone introduced me to an early alpha version of Hadoop, so I want to share a little bit of the fundamentals of Apache Hadoop.

Visit and join the Software Architect Indonesia Community to learn more.

This article covers these five topics:

1. HADOOP OVERVIEW

2. HDFS (MOTHER)

3. YARN (FATHER)

4. MAP REDUCE (THINKER)

5. SPARK (ANOTHER SMART THINKER)

Let’s go through them one by one…

1. HADOOP OVERVIEW

History:

  • Named after a toy elephant belonging to developer Doug Cutting’s son
  • Hadoop’s roots wind back to Nutch, the 2002 project started by Doug Cutting and Mike Cafarella
  • The concept started when Google released its papers:

GFS (Google File System — 2003) = Hadoop HDFS, http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf

MapReduce (2004) = Hadoop MapReduce, http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf

Bigtable (2006) = Apache HBase

Hadoop Purpose:

  • “Framework that allows distributed processing of large data sets across clusters of computers using simple programming models.
  • It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
  • Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”

Hadoop Key Components:

  • Hadoop Common: Common utilities
  • (Storage Component) Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data

Many other data storage approaches are also in use, e.g., Apache Cassandra, Apache HBase, and Apache Accumulo (contributed by the NSA)

  • (Scheduling) Hadoop YARN: A framework for job scheduling and cluster resource management
  • (Processing) Hadoop MapReduce (MR2): A YARN-based system for parallel processing of large data sets

Other execution engines are increasingly in use, e.g., Spark

  • Note: all of these key components are OSS under the Apache 2.0 license

Related projects: Ambari, Avro, Cassandra, Chukwa, HBase, Hive, Mahout, Pig, Tez, ZooKeeper

2. HDFS (MOTHER)

  • Inspired by “Google File System”
  • Stores large files (typically gigabytes-terabytes) across multiple machines, replicating across multiple hosts
  • Breaks up files into fixed-size blocks (typically 64 or 128 MiB) and distributes the blocks across nodes
  • The blocks of a file are replicated for fault tolerance
  • The block size and replication factor are configurable per file
  • Default replication value (3) — data is stored on three nodes: two on the same rack, and one on a different rack
  • File system intentionally not fully POSIX-compliant
  • Write-once-read-many access model for files. A file once created, written, and closed cannot be changed. This assumption simplifies data coherency issues and enables high throughput data access
  • The intent is to add support for appending writes in the future
  • Can rename & remove files
  • The “NameNode” tracks file names and where each file’s blocks are stored (see the client sketch below)
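
As a concrete illustration, here is a minimal sketch using the HDFS Java client API (org.apache.hadoop.fs.FileSystem) to create a file with an explicit per-file replication factor and block size. The NameNode address and the paths are assumptions for illustration; normally fs.defaultFS would come from core-site.xml rather than being set in code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed NameNode address for illustration only
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/example.txt");   // hypothetical path
            short replication = 3;                       // default replication factor
            long blockSize = 128L * 1024 * 1024;         // per-file block size (128 MiB)

            // Block size and replication factor are configurable per file
            try (FSDataOutputStream out = fs.create(file, true, 4096, replication, blockSize)) {
                out.writeBytes("write once, read many\n");
            }
            // Files can be renamed and removed, but not modified in place
            fs.rename(file, new Path("/data/example-renamed.txt"));
        }
    }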

HDFS Architecture

  • Hadoop can work with other distributed file systems, but this loses data locality
  • To reduce network traffic, Hadoop must know which servers are closest to the data; HDFS does this
  • Hadoop job tracker schedules jobs to task trackers with an awareness of the data location

For example, if node A contains data (x,y,z) and node B contains data (a,b,c), the job tracker schedules node B to perform tasks on (a,b,c) and node A would be scheduled to perform tasks on (x,y,z)

This reduces the amount of traffic that goes over the network and prevents unnecessary data transfer

Location awareness can significantly reduce job-completion times when running data-intensive jobs
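
To illustrate where that location information comes from, the HDFS client API can report, for any file, which hosts hold each of its blocks; a scheduler can use this to place tasks on or near those hosts. A minimal sketch (the path /data/example.txt is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/example.txt"));
            // One BlockLocation per block, listing the hosts that store its replicas
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset " + block.getOffset()
                        + " hosts " + String.join(",", block.getHosts()));
            }
        }
    }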

Handling structured data

  • Data is often more structured
  • Google BigTable (2006 paper)
  • Designed to scale to petabytes and thousands of machines
  • Maps two arbitrary string values (row key and column key) and timestamp into an associated arbitrary byte array
  • Tables are split into multiple “tablets” along row boundaries chosen so that each tablet is ~200 megabytes in size (compressed when necessary)
  • Data maintained in lexicographic order by row key; clients can exploit this by selecting row keys for good locality (e.g., reversing hostname in URL)
  • Not a relational database; really a sparse, distributed multi-dimensional sorted map
  • Implementations of the approach include: Apache Accumulo (from the NSA; cell-level access labels), Apache Cassandra, Apache HBase, and Google Cloud Bigtable (released 2015)
  • Sources: https://en.wikipedia.org/wiki/BigTable; “Bigtable…” by Chang
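
To make the data model concrete, here is a minimal sketch using the Apache HBase Java client, one of the BigTable-style stores listed above. The table name, column family, and row key are hypothetical; note the reversed hostname used as the row key for locality, as described above:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("webtable"))) {
                // Reversed hostname as row key keeps pages of the same site close together
                byte[] rowKey = Bytes.toBytes("com.example.www/index.html");

                // (row key, column family:qualifier, timestamp) -> arbitrary byte array
                Put put = new Put(rowKey);
                put.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"),
                        Bytes.toBytes("<html>...</html>"));
                table.put(put);

                Result result = table.get(new Get(rowKey));
                byte[] html = result.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"));
                System.out.println(Bytes.toString(html));
            }
        }
    }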

3. YARN (Yet Another Resource Negotiator) — FATHER

  • Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management technology.
  • YARN is one of the key features in the second-generation Hadoop 2 version of the Apache Software Foundation’s open source distributed processing framework.
  • YARN is a software rewrite that decouples MapReduce’s resource management and scheduling capabilities from the data processing component, enabling Hadoop to support more varied processing approaches and a broader array of applications
  • YARN combines a central resource manager that reconciles the way applications use Hadoop system resources with node manager agents that monitor the processing operations of individual cluster nodes

Why YARN

YARN Architecture

YARN Components

Resource Manager

  • ResourceManager (RM) is the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system. It works together with the per-node NodeManagers (NMs) and the per-application ApplicationMasters (AMs).
  • NodeManagers take instructions from the ResourceManager and manage resources available on a single node.
  • ApplicationMasters are responsible for negotiating resources with the ResourceManager and for working with the NodeManagers to start the containers.

Node Manager

  • The NodeManager (NM) is YARN’s per-node agent, and takes care of the individual compute nodes in a Hadoop cluster. This includes keeping up to date with the ResourceManager (RM), overseeing containers’ life-cycle management, monitoring resource usage (memory, CPU) of individual containers, tracking node health, managing logs, and providing auxiliary services which may be exploited by different YARN applications.

Application Master

  • The ApplicationMaster is, in effect, an instance of a framework-specific library. It is responsible for negotiating appropriate resource containers from the ResourceManager, tracking their status and monitoring progress, and working with the NodeManager(s) to execute and monitor the containers and their resource consumption.
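
As a small illustration of how these components appear to a client, the sketch below uses YARN’s Java client API (org.apache.hadoop.yarn.client.api.YarnClient) to ask the ResourceManager for the registered NodeManagers and the applications it is tracking (each driven by its own ApplicationMaster). It assumes the cluster configuration (yarn-site.xml) is on the classpath:

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class YarnInfoExample {
        public static void main(String[] args) throws Exception {
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new YarnConfiguration());   // reads yarn-site.xml from the classpath
            yarn.start();

            // NodeManagers currently registered with the ResourceManager
            for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
                System.out.println(node.getNodeId() + " capacity=" + node.getCapability());
            }
            // Applications tracked by the ResourceManager
            for (ApplicationReport app : yarn.getApplications()) {
                System.out.println(app.getApplicationId() + " " + app.getName()
                        + " state=" + app.getYarnApplicationState());
            }
            yarn.stop();
        }
    }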

4. MAPREDUCE V2 (THINKER)

  • MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster
  • Programmer defines two functions, map & reduce
  • Map(k1,v1) → list(k2,v2). Takes a series of key/value pairs, processes each, generates zero or more output key/value pairs
  • Reduce(k2, list (v2)) → list(v3). Executed once for each unique key k2 in the sorted order; iterate through the values associated with that key and produce zero or more outputs
  • The system “shuffles” data between map and reduce (so the “reduce” function has the full set of data for its keys) and automatically handles system failures, etc.

MapReduce: Word Count Example (Pseudocode)

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // output_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
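
For reference, here is the same word count written against the Hadoop MapReduce Java API; it closely follows the standard WordCount example from the Hadoop documentation, with the input and output paths taken from the command line:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map(k1,v1) -> list(k2,v2): emit (word, 1) for every word in the input split
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce(k2, list(v2)) -> list(v3): sum the counts for each unique word
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

It can be run with something like hadoop jar wordcount.jar WordCount /input /output (the jar name and paths are illustrative).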

MapReduce Combiner

  • Can also define an optional function, a “Combiner” (to optimize bandwidth)
  • If defined, runs after Mapper & before Reducer on every node that has run a map task
  • Combiner receives as input all data emitted by the Mapper instances on a given node
  • Combiner output sent to the Reducers, instead of the output from the Mappers
  • Is a “mini-reduce” process which operates only on data generated by one machine
  • If a reduce function is both commutative and associative, then it can be used as a Combiner as well
  • Useful for word count — combine local counts
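
Because the summing reducer in the WordCount example above is both commutative and associative, it can double as the Combiner; in the driver this is a single extra line:

    // in WordCount.main(), after setReducerClass:
    job.setCombinerClass(IntSumReducer.class);  // pre-aggregates (word, 1) pairs on each map node

Each map node then ships (word, localCount) pairs to the reducers instead of one (word, 1) pair per occurrence, which is exactly the bandwidth optimization described above.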

MapReduce problems & potential solution

MapReduce problems:

  • Many problems aren’t easily described as map-reduce
  • Persistence to disk typically slower than in-memory work

Alternative: Apache Spark

  • A general-purpose processing engine that can be used instead of MapReduce

5. APACHE SPARK (ANOTHER SMART THINKER)

  • Apache Spark is a cluster computing platform designed to be fast and general-purpose.
  • On the speed side, Spark extends the popular MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. Speed is important in processing large datasets, as it means the difference between exploring data interactively and waiting minutes or hours. One of the main features Spark offers for speed is the ability to run computations in memory, but the system is also more efficient than MapReduce for complex applications running on disk.
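
For comparison with the MapReduce WordCount above, here is the same computation sketched against Spark’s Java API (assuming the Spark 2.x API, where flatMap returns an Iterator). The intermediate results stay in memory rather than being written to disk between stages:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("SparkWordCount");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> lines = sc.textFile(args[0]);              // input path
                JavaPairRDD<String, Integer> counts = lines
                        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                        .mapToPair(word -> new Tuple2<>(word, 1))
                        .reduceByKey((a, b) -> a + b);                      // same role as reducer/combiner
                counts.saveAsTextFile(args[1]);                             // output path
            }
        }
    }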

Spark Architecture

Spark vs. MapReduce Comparison

  • Hadoop MapReduce is meant for data that does not fit in memory, whereas Apache Spark performs better for data that fits in memory, particularly on dedicated clusters.
  • Hadoop MapReduce can be an economical option because of Hadoop-as-a-service offerings (HaaS) and the availability of more personnel. According to benchmarks, Apache Spark is more cost-effective, but staffing would be more expensive in the case of Spark.
  • Apache Spark and Hadoop MapReduce are both failure tolerant, but comparatively Hadoop MapReduce is more failure tolerant than Spark.
  • Spark and Hadoop MapReduce have similar compatibility in terms of data types and data sources.
  • Programming in Apache Spark is easier, as it has an interactive mode, whereas Hadoop MapReduce requires core Java programming skills; however, there are several utilities that make programming in Hadoop MapReduce easier.

START YOUR HADOOP NOW

Thank you for reading.

Join https://www.meetup.com/Software-Architect-Indonesia/
