Harnessing the Power of Big Data with Hadoop and Related Technologies (Hive, MapReduce, and Pig)

If you have not heard about Hadoop yet, chances are you will be hearing a lot more about it in the future.

Hadoop is an open-source framework that allows you to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

In this brief tutorial, we will provide a quick introduction to Big Data and discuss the MapReduce algorithm and HDFS (the Hadoop Distributed File System).

How big is big data?

Big data is a collection of large datasets that cannot be processed using traditional computing techniques. With the advent of new technologies, devices, and means of communication such as social networking sites, the amount of data produced by mankind is growing rapidly every year. The amount of data produced up to 2003 was 5 billion gigabytes; by 2011 the same amount was being generated every two days, and today it is generated roughly every ten minutes.

Big data is not merely data; it has become a complete subject in its own right, involving various tools, techniques, and frameworks.

WHY BIG DATA?

‘Big Data’ is similar to ‘small data’, but bigger in size, and that difference in scale requires different approaches to techniques, tools, and architecture.

The aim is to solve new problems, or to solve old problems in a better way.

Big Data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques.

What defines BIG DATA?

The three main attributes of big data are:

Volume: Today, Facebook ingests 500 terabytes of new data every day, and a Boeing 737 generates 240 terabytes of flight data during a single flight across the US.

Velocity: Click streams and ad impressions capture user behaviour at millions of events per second and high-frequency stock trading algorithms reflect market changes within microseconds.

Variety: Big Data isn’t just numbers, dates, and strings; it also includes geospatial data, 3D data, audio and video, and unstructured text such as log files and social media posts.

These characteristics of Big Data describe not only the size of the data but also the value hidden within it, which feeds Market Intelligence and Business Intelligence. Big Data is the frontier of a firm’s ability to store, process, and access (SPA) all the data it needs to operate effectively, make decisions, reduce risks, and serve its customers.

BIG DATA Analytics: Big data analytics is the process of examining big data to uncover hidden patterns, unknown correlations and other useful information that can be used to make better decisions.

BIG DATA Technologies: Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. The following are a few sample paradigms touched on in this blog:

Hosting — Distributed Servers / Cloud (e.g. Amazon EC2)

Storing — Distributed Storage (e.g. Amazon S3)

Programming — Distributed Processing (e.g. MapReduce)

Indexing — High-performance schema-free databases (e.g. MongoDB)

Operations — Analytic / Semantic Processing (e.g. Machine Learning)

So, what is Hadoop all about?

Hadoop is an Apache Software Foundation project developed as a framework for data-intensive distributed applications. It performs parallel processing across a cluster and is capable of handling petabytes of data.

Hadoop is written in the Java programming language and was inspired by Google’s papers on MapReduce and the Google File System.

Hadoop mainly provides two core things:

Storage: A reliable, distributed file system called HDFS (Hadoop Distributed File System).

Processing: A high-performance, parallel data-processing engine called Hadoop MapReduce.

A set of machines running HDFS and MapReduce is known as a Hadoop Cluster.

The Hadoop framework transparently provides both reliability and data motion to applications. Hadoop implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. It provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the distributed file system are designed so that node failures are automatically handled by the framework. This enables applications to work with thousands of computationally independent machines and petabytes of data. The entire Apache Hadoop platform is commonly considered to consist of the Hadoop kernel, MapReduce, and the Hadoop Distributed File System (HDFS), plus a number of related projects including Apache Hive, Apache HBase, Apache Pig, Apache ZooKeeper, and others.

Let’s look at a few Hadoop ecosystem concepts that are used intensively across the globe:

MapReduce

Hadoop MapReduce is a YARN-based system for parallel processing of large data sets. MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce.

In the MapReduce model, a compute “job” is decomposed into smaller “tasks” (which correspond to separate Java Virtual Machine (JVM) processes in the Hadoop implementation). The tasks are distributed around the cluster to parallelize and balance the load as much as possible. The MapReduce runtime infrastructure coordinates the tasks, re-running any that fail or appear to hang. Users of MapReduce don’t need to implement parallelism or reliability features themselves. Instead, they focus on the data problem they are trying to solve.

The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes.
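
To make this flow of key/value pairs concrete, here is a minimal, self-contained sketch in plain Java (the class and method names are our own, and nothing here uses the Hadoop API) that simulates the map, shuffle, and reduce phases of a word count in a single process:

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// A single-process simulation of the MapReduce data flow for word counting.
// Illustrative only: real Hadoop distributes these phases across a cluster
// and handles partitioning, sorting, and fault tolerance for you.
public class MapReduceSketch {

  // "Map" phase: turn one input line into (word, 1) pairs.
  static List<Map.Entry<String, Integer>> map(String line) {
    List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
    for (String word : line.split("\\s+")) {
      pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
    }
    return pairs;
  }

  // "Shuffle" phase: group every value emitted for the same key.
  static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
    Map<String, List<Integer>> grouped = new TreeMap<>();
    for (Map.Entry<String, Integer> pair : pairs) {
      grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
    }
    return grouped;
  }

  // "Reduce" phase: sum the grouped values for one key.
  static int reduce(List<Integer> counts) {
    int sum = 0;
    for (int count : counts) {
      sum += count;
    }
    return sum;
  }

  public static void main(String[] args) {
    String[] input = { "Hello World Bye World", "Hello Hadoop Goodbye Hadoop" };

    List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
    for (String line : input) {
      mapped.addAll(map(line));
    }

    Map<String, List<Integer>> grouped = shuffle(mapped);
    for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
      System.out.println(entry.getKey() + "\t" + reduce(entry.getValue()));
    }
  }
}

Run against the two sample lines used later in this tutorial, this prints the same counts the real Hadoop job produces; the difference is that Hadoop executes the map and reduce functions on many nodes in parallel and re-runs them when nodes fail.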

Pig

Pig is a platform for constructing data flows for extract, transform, and load (ETL) processing and analysis of large datasets. Pig Latin, the programming language for Pig, provides common data manipulation operations such as grouping, joining, and filtering. Pig generates Hadoop MapReduce jobs to perform the data flows. This high-level language for ad hoc analysis allows developers to inspect data stored in HDFS without having to learn the complexities of the MapReduce framework, thus simplifying access to the data.

The Pig Latin scripting language is not only a higher-level data flow language but also has operators similar to SQL (e.g., FILTER and JOIN) that are translated into a series of map and reduce functions. Pig Latin, in essence, is designed to fill the gap between the declarative style of SQL and the low-level procedural style of MapReduce.

Hive

Hive is a SQL-based data warehouse system for Hadoop that facilitates data summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems (e.g., HDFS, MapR-FS, and S3) and some NoSQL databases. Hive is not a relational database, but a query engine that supports the parts of SQL specific to querying data, with some additional support for writing new tables or files, though not for updating individual records. That is, Hive jobs are optimized for scalability (computing over all rows) but not for latency (when you just want a few rows returned quickly). Hive’s SQL dialect is called HiveQL. Table schemas can be defined that reflect the data in the underlying files or data stores, and SQL queries can be written against that data. Queries are translated to MapReduce jobs to exploit the scalability of MapReduce. Hive also supports custom extensions written in Java, including user-defined functions (UDFs) and serializers/deserializers for reading and optionally writing custom formats, e.g., JSON and XML dialects. Hence, analysts have tremendous flexibility in working with data from many sources and in many different formats, with minimal need for complex ETL processes to transform data into more restrictive formats. Contrast this with Shark and Impala.

Examples work best here, so we are going to use a basic word count program to illustrate how programming works within the MapReduce framework in Hadoop.

MapReduce Source Code:

WordCount.java

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums all the counts emitted for a given word.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures and submits the job.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Execution Steps:

Set Environment Variables as below:

export JAVA_HOME=/usr/java/default
export PATH=${JAVA_HOME}/bin:${PATH}
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar

Compile WordCount.java and create a jar:

$ bin/hadoop com.sun.tools.javac.Main WordCount.java
$ jar cf wc.jar WordCount*.class

Create the input directory in HDFS and copy the sample input files into it (the output directory is created by the job itself and must not already exist), for example as shown below.
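
A minimal sketch using the example paths from this tutorial, assuming file01 exists in the local working directory:

$ bin/hadoop fs -mkdir -p /user/joe/wordcount/input
$ bin/hadoop fs -put file01 /user/joe/wordcount/input/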

Sample text-files as input:

$ bin/hadoop fs -ls /user/joe/wordcount/input/
/user/joe/wordcount/input/file01
$ bin/hadoop fs -cat /user/joe/wordcount/input/file01
Hello World Bye World
Hello Hadoop Goodbye Hadoop

Run the application:
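
Assuming the wc.jar built above and the example paths, the job is launched with the hadoop jar command (again, the output directory must not already exist):

$ bin/hadoop jar wc.jar WordCount /user/joe/wordcount/input /user/joe/wordcount/output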

Output:

$ bin/hadoop fs -cat /user/joe/wordcount/output/part-r-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2

MapReduce snippet (mapper and reducer only):

public class Mapping extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

public class Reducing extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}

Pig and Hive handle the same example more elegantly, dramatically reducing the amount of code thanks to built-in libraries and functions, although they give programmers less low-level control over the processing.

Pig

Below is a sample snippet for the word count problem in Pig.

lines = LOAD '/file01' AS (line:chararray);  -- Load the file, one line per record.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;  -- Split each line into words.
filtered_words = FILTER words BY word MATCHES '\\w+';  -- Keep only word tokens.
word_groups = GROUP filtered_words BY word;  -- Group identical words together.
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;  -- Count each group.
DUMP word_count;  -- Print the result.
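
Assuming the statements above are saved in a script file (wordcount.pig is a hypothetical name), the data flow can be submitted to the cluster with the Pig command-line tool, or run locally for testing with the -x local flag:

$ pig wordcount.pig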

Hive

Below is a sample snippet for the word count problem in Hive.

CREATE EXTERNAL TABLE lines (line STRING);  -- Table over the raw text.
LOAD DATA INPATH 'file01' OVERWRITE INTO TABLE lines;  -- Import the file into the table.
SELECT word, COUNT(*) FROM lines
  LATERAL VIEW explode(split(line, ' ')) ltable AS word GROUP BY word;  -- Split each line into words and count them.
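
Similarly, assuming the HiveQL above is saved to a file (wordcount.hql is a hypothetical name), it can be executed non-interactively from the Hive command line:

$ hive -f wordcount.hql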