You must have heard about it. It is everywhere: everyone in town is talking about it, almost every organisation is using it, and it is one of the hottest things right now. If you guessed Big Data, you are right.
But what is Big Data? What is Hadoop? Why has it become so popular lately? Why do organisations pay such high salaries to Data Scientists and Data Analysts? I had many questions like these in mind, and I assume you do too and wonder what the heck Big Data is.
Note: This article gives a very simple, high-level overview of what Big Data is and how it works, so that anyone can follow it.
What is Big Data?
Data from an enormous number of digital sources needs to be analysed to get the maximum benefit from it. Analysing the data helps companies in numerous ways. Let us see how different organisations use Big Data:
1. Facebook, Google, Yahoo: analyse data to better understand what the user is looking for and show relevant ads.
2. Amazon: recommends relevant products based on your browsing history on the website.
3. Netflix: you may start watching a new TV show because of the recommendation system that Netflix has built from users’ viewing history.
4. Telecom companies: analyse call records and patterns to offer customised plans that the customer may find attractive.
There are many more examples of how different companies use big data and benefit from it.
Big Data is a very subjective term: for you it might be a few gigabytes, and for a big organisation it might be a few petabytes. One reasonable definition is:
It is the data that is too big to be processed on a single machine — Ian Wrigley (Cloudera)
When data becomes gigantic, storing and processing it becomes a challenge. Storing it on a single machine is not feasible, and even if storage is made possible by adding several extra disks, it is very difficult to process that much data in a limited time. We will see shortly how this problem is solved.
Elements of Big Data
Whenever you hear about Big Data, it is common to come across the 3 Vs associated with it: Volume, Velocity and Variety. These Vs are among the reasons why traditional methods of handling data are inefficient at solving big data problems.
Let’s discuss the 3 Vs.
Volume: we have already talked about how big Big Data is.
Velocity: the speed at which data is generated. Data is generated every second, and we need some way to process it that keeps up with the pace. Example: sensors generating readings every second.
Variety: there are different types of data, and some cannot be stored in a relational database.
Types of Data
Structured Data: data in the form of tables, which can easily be stored in a relational database.
Semi-Structured Data: information that does not reside in a relational database but has some organisational properties that make it easier to analyse. Examples: XML, JSON, and data held in NoSQL databases.
Unstructured Data: data that has no defined model or tags from which to extract relevant information. Examples: image, text and audio files.
How Big Data is stored and processed
This is where things get interesting. Let’s get acquainted with a very popular programming framework, “Hadoop”, which solves our Big Data problem.
Hadoop is an open source (free to use and modify), Java-based programming framework that supports the storage and processing of extremely large data sets in a distributed computing environment.
What does the above definition mean? We learned from it that Hadoop is a framework for storing and processing large datasets, but wait, what is a distributed computing environment?
In simple terms, distributed computing means decomposing a large dataset into smaller pieces and distributing them to different machines, so that each machine can process its piece at full capacity, which makes the overall processing faster.
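The idea can be sketched in a few lines of Python. This is only an illustration, not how Hadoop is implemented: worker threads stand in for separate machines, and the function names (`split_into_chunks`, `process_chunk`, `distributed_sum`) are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, num_chunks):
    """Split a list into roughly equal consecutive chunks."""
    chunk_size = max(1, -(-len(data) // num_chunks))  # ceiling division
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def process_chunk(chunk):
    """Work done by one 'machine': here, just add up its numbers."""
    return sum(chunk)


def distributed_sum(data, num_workers=4):
    chunks = split_into_chunks(data, num_workers)
    # Each worker thread stands in for a separate machine.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    # Combine the small partial results into the final answer.
    return sum(partial_results)


readings = list(range(1, 101))
total = distributed_sum(readings)
print(total)  # 5050, the same answer a single machine would give
```

Each worker only ever sees its own chunk, and only the small partial results are combined at the end.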
Hadoop Core Components
The major issues with big data are storing and processing it. The following two components form the base of Hadoop and address these issues.
HDFS (Hadoop Distributed File System): the distributed file system primarily used by Hadoop applications.
Map Reduce: a parallel processing technique in which a program is written to process the data across many machines.
HDFS — Hadoop Distributed File System
In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers — Grace Hopper
Suppose there is a shortage of water in your village and you need to dig a well. You have to dig it really fast or people will suffer. What would you do? You start digging the well yourself with full enthusiasm, but as time passes your enthusiasm vanishes, so you ask a friend who works out a lot. He works better than you, but there are limits to the human body, and everyone gets tired no matter how strong they are. He was the strongest person you knew who could have done the work, and now he has given up too. What now? The villagers saw how hard you and your friend worked, and they were moved by how much you tried to bring water to the village. Villagers of varying strengths teamed up and came to help you out. Every team needs a leader, and you are the leader now. You have to manage them and allocate tasks accordingly. Everyone starts working under your instructions and voila, the well is completely dug within a few hours and everyone is happy.
This is how Hadoop’s HDFS works. It does not matter how powerful a machine is; it will not meet the needs of big data, because every machine has its limits. Distributing the data to different machines and processing it in parallel makes the whole process faster and feasible.
HDFS is a fault-tolerant distributed storage system. It splits a large file into blocks (64 MB by default in Hadoop 1.x, 128 MB in Hadoop 2.x and later) and stores copies of each block (3 replicas by default) on multiple machines connected in a network.
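To make the splitting and replication concrete, here is a toy sketch. The round-robin placement and the node names are assumptions made for illustration; real HDFS uses its own placement policy, and `plan_blocks` is a hypothetical helper, not a Hadoop API.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted in the text
REPLICATION = 3      # replication factor quoted in the text


def plan_blocks(file_size_mb, data_nodes, block_size_mb=BLOCK_SIZE_MB,
                replication=REPLICATION):
    """Return {block_index: [nodes holding a copy]} for one file.

    Placement is simple round-robin, a hypothetical stand-in for the
    real HDFS placement policy.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    plan = {}
    for block in range(num_blocks):
        plan[block] = [data_nodes[(block + r) % len(data_nodes)]
                       for r in range(replication)]
    return plan


nodes = ["node-a", "node-b", "node-c", "node-d"]  # made-up machine names
layout = plan_blocks(200, nodes)                  # a 200 MB file
for block, holders in layout.items():
    print(f"block {block}: {holders}")
```

A 200 MB file needs four 64 MB blocks (the last one only partly full), and each block lives on three different machines, so losing any single machine does not lose any data.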
HDFS has a master-slave architecture. Let us talk about some basic terms associated with it.
Nodes are individual machines that can store and process blocks of data in HDFS. HDFS has two kinds of node: the Name Node and the Data Nodes.
In the village analogy, you are the Name Node and the workers are the Data Nodes.
The name node, also known as the master node, holds all the information about a file. It knows which data node to go to for each file block. It does not store the file data itself but knows where all the parts of the data live. It manages the data nodes by keeping an eye on them: each data node sends a signal to the name node at regular intervals to indicate that it is up and working. This signal is known as a heartbeat. If the name node does not receive a heartbeat from a data node, that data node is presumed dead. Remember, in this basic setup the name node is a single point of failure, so it should be very robust, with high-performance hardware and high bandwidth.
Data nodes, also known as slave nodes, store the blocks of a file, and all the processing of the data is done here. Data nodes may vary in their hardware specifications: one can be a heavy dedicated server and another your laptop, just as the villagers had different strengths.
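The heartbeat idea described above can be sketched as follows. This is a toy model, not Hadoop code: the `NameNode` class, the 10-second timeout and the node names are all assumptions chosen for illustration, and time is passed in as plain numbers to keep the example predictable.

```python
class NameNode:
    """Toy name node that only tracks data node heartbeats."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds  # assumed value, not a Hadoop default
        self.last_heartbeat = {}                # data node name -> time of last heartbeat

    def receive_heartbeat(self, node, now):
        self.last_heartbeat[node] = now

    def dead_nodes(self, now):
        """Nodes that have been silent for longer than the timeout."""
        return sorted(node for node, seen in self.last_heartbeat.items()
                      if now - seen > self.timeout_seconds)


name_node = NameNode(timeout_seconds=10)
name_node.receive_heartbeat("data-node-1", now=0)
name_node.receive_heartbeat("data-node-2", now=0)
name_node.receive_heartbeat("data-node-1", now=8)  # node 1 keeps reporting
print(name_node.dead_nodes(now=15))                # ['data-node-2']
```

Because each block is replicated, a name node that presumes a data node dead can still serve that node's blocks from the other copies.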
Map Reduce is easiest to understand with another analogy. Suppose that in an election you are in charge of counting the votes from all the cities and announcing the winner. How would you do that? One way is to collect all the ballots in one place (transferring all that data to a central location will cost the government) and count the votes each candidate earned. The one with the maximum number of votes wins. But would it be possible for a single person to count all the votes from all the different cities? You would probably go mad. The other way is for the representative at each polling booth to count the votes cast at that booth and record the result as key-value pairs, with the candidate’s name as the key and the votes he or she received at that booth as the value. Note that you cannot determine the winner from the result of just one booth; you need the counts from every booth to find the overall winner. All the key (candidate’s name) and value (votes) pairs come to the central location, where you have to process far less data than if all the ballots had been sent. You quickly add up each candidate’s counts, find the maximum and announce the winner. This is a Map Reduce job. The task of counting votes independently at each polling booth and storing the intermediate result is the Map task. All the intermediate key-value pairs are shuffled and sorted by key (here, alphabetically by candidate’s name) so that the counts for the same candidate arrive together, which makes your counting easier. These intermediate results are what the Reducer gets. The Reducer adds up the votes for each candidate and gives the final result. Thus the task of making the final deduction is the Reduce task. The polling booth representative is the Mapper and you are the Reducer.
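Here is the election written as a small in-memory sketch in plain Python, with no Hadoop involved. The candidate names and ballots are invented, and the function names are only illustrative. Note that each booth reports a count for every candidate and the reducer adds them up, since the overall result depends on the counts from all booths.

```python
from collections import Counter, defaultdict


def map_booth(ballots):
    """Map task: count the votes cast at one booth."""
    return list(Counter(ballots).items())


def shuffle(mapped_outputs):
    """Group the counts by candidate and order the candidates alphabetically."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for candidate, votes in pairs:
            grouped[candidate].append(votes)
    return dict(sorted(grouped.items()))


def reduce_votes(candidate, counts):
    """Reduce task: add up one candidate's counts from all booths."""
    return candidate, sum(counts)


# Invented ballots from three polling booths.
booths = [
    ["Asha", "Ravi", "Asha"],
    ["Ravi", "Ravi", "Meera"],
    ["Asha", "Meera", "Asha"],
]
mapped = [map_booth(ballots) for ballots in booths]
totals = dict(reduce_votes(c, v) for c, v in shuffle(mapped).items())
winner = max(totals, key=totals.get)
print(totals)  # {'Asha': 4, 'Meera': 2, 'Ravi': 3}
print(winner)  # Asha
```

The map step runs independently for each booth, so in a real cluster the booths could be counted on different machines at the same time.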
We write programs to perform the map and reduce tasks. These programs are generally written in Java, but with Hadoop Streaming we can write a Map Reduce program in any language that can read from standard input and write to standard output.
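As a rough sketch of what that looks like, Hadoop Streaming passes data to the mapper and reducer as lines of text, with the key and value separated by a tab by default, and the reducer receives its input sorted by key. In a real job the mapper and reducer would be two separate scripts reading standard input; here they are written as functions over lists of lines (an assumption made so the example runs without Hadoop), and the ballots are invented.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'candidate<TAB>1' line per ballot line."""
    for line in lines:
        candidate = line.strip()
        if candidate:  # skip blank lines
            yield f"{candidate}\t1"


def reducer(sorted_lines):
    """Sum the counts for each candidate; the input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for candidate, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{candidate}\t{sum(int(count) for _, count in group)}"


ballots = ["Asha\n", "Ravi\n", "Asha\n", "\n", "Meera\n"]
# sorted() stands in for the sort that happens between map and reduce.
results = list(reducer(sorted(mapper(ballots))))
print(results)  # ['Asha\t2', 'Meera\t1', 'Ravi\t1']
```

Because the reducer can rely on sorted input, it only needs to remember one candidate at a time, which is what lets it handle far more data than fits in memory.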
Resources I used to learn about Big Data and Hadoop
- Udacity’s Intro to Hadoop and MapReduce
- Many other awesome blogs
In the next part, I will write about how to write Map Reduce code in Python.
I hope this article makes it easy to understand the bare basics of Hadoop and Map Reduce.