Big Data and a brief intro to MapReduce and HDFS in Hadoop

Krishna Kanth
Beginner @ Data Science
5 min read · Aug 7, 2014

--

The present world is turning, or rather reshaping itself, into a repository of everyday information; information of numerous kinds such as text, photographs, videos, numbers, transactional records, etc. With the phenomenal changes in the technology world, these very advancements have become easily available to the common man. Smartphones, for example: devices that never existed some years ago are now redefining the way we interact with technology. And, as we pick up one end of a stick we pick up the other end too: with the fast growth of such smart devices, the information that is shared today has also radically increased.

This increasingly heavy torrent of information, which in technical terms is known as ‘data’, has incarnated itself as Big Data!

Big Data is nothing but this fast-growing everyday data, and its growth is hard to even imagine: it is measured in terabytes and petabytes.

To manage such huge amounts of data, a platform is needed, and hence the existence of Hadoop.

What is Hadoop?

Hadoop is a framework that is capable of processing huge volumes of unstructured data.

It is a platform that provides computational capability and distributed storage for the data. Elaborating upon each of the two terms: firstly, computational capability means the ability to perform computations on the huge amounts of available data, which at its core is unstructured. Unstructured meaning it stands in contrast with present-day relational data stores, where the data is already kept well refined and is then queried (SQL) to dig out relevant information.

Big Data is a collection of unwieldy amounts of information that is dumped into correspondingly large storages, and talking of large storages, the second term, distributed storage, comes into the picture. Storing a data unit in a distributed way means making it available at all the places where it is needed. To look at a real-world example of distributed storage, a movie is released at many places all over the world (distributed) only to make it available to the large numbers of people who are divided by geographical barriers.

These are the concepts of ‘computational capability’ and ‘distributed storage.’

And how does Hadoop implement these two concepts?

With the help of MapReduce and HDFS.

MapReduce:

Firstly, let me confess that a full discussion of the MapReduce concept is beyond the scope of this article, so here I’ll give you a glimpse of what actually happens inside it.

MapReduce is a programming model researched and published by Google and widely used in their web search. The whole paper (“MapReduce: Simplified Data Processing on Large Clusters”) is available here.

MapReduce is a two-step process built around the Map() and the Reduce() functions. They both work on ‘key : value’ pairs. For example, to group the words of a document by their length and count how many words there are of each length, we firstly pass each word as the parameter into the Map() function: Map(food), Map(eat), Map(you), Map(table), Map(towel), Map(place), Map(market), Map(bought), etc.

Now, the output of the function will be ‘key : value’ pairs. The Map() function counts the number of characters in each word, taking the count as the ‘key’ and the word itself as the ‘value.’ So the output would be like:
3 : eat
3 : you
4 : food
5 : table
5 : towel
5 : place
6 : market
6 : bought

Further, they are grouped together as:
3 : [eat, you]
4 : [food]
5 : [table, towel, place]
6 : [market, bought]

Secondly, each of these outputs, containing a key and a list of values, is passed on as the parameter to the Reduce() function. It finally outputs the number of words in the list for each key, again as a key : value pair.
3 : 2
4 : 1
5 : 3
6 : 2

Thus, for each word length, we get a count of the words of that length. This is the kind of job MapReduce can perform. One of the best real-world usages of this concept is, as mentioned above, in Google web search. Words from the whole Web (internet) are brought in by crawlers and spiders and sorted into such key : value pairs using MapReduce, based upon which word indexes are created, a fundamental phase in Google search.
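To make the flow concrete, here is a minimal Python sketch of the same word-length example. This is only an illustration of the map / group / reduce idea, not actual Hadoop code (real Hadoop jobs are usually written in Java against Hadoop’s MapReduce API), and the function names map_fn and reduce_fn are my own.

```python
from collections import defaultdict

# Map(): emit a (key, value) pair for each word,
# where key = character count and value = the word itself.
def map_fn(word):
    return (len(word), word)

# Reduce(): for one key and its list of values,
# emit (key, number of values), i.e. how many words have that length.
def reduce_fn(key, values):
    return (key, len(values))

words = ["food", "eat", "you", "table", "towel",
         "place", "market", "bought"]

# Map phase: one (length, word) pair per word.
pairs = [map_fn(w) for w in words]

# Shuffle/group phase: collect all values under the same key.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# Reduce phase: count the words in each group.
for key in sorted(grouped):
    print(reduce_fn(key, grouped[key]))  # (3, 2), (4, 1), (5, 3), (6, 2)
```

Running it prints exactly the pairs listed above: (3, 2), (4, 1), (5, 3), (6, 2).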

HDFS

HDFS stands for Hadoop Distributed File System. This is a file system built to operate upon the very large data sets which present-day technology is producing in immense proportions. The size of the file units stored in HDFS can range from gigabytes to terabytes, and sometimes even larger.

NameNode:

The NameNode is the repository of the file system’s metadata, meaning that it contains the mappings between the files, the blocks they are split into, and the DataNodes under it where those blocks are located.

DataNodes:

DataNodes are the actual areas where the data/files are stored in the file system. There will be numerous DataNodes linked to one NameNode. Each DataNode periodically sends a signal to the NameNode, by default every 3 seconds. This signal is called a Heartbeat, and it proves that the DataNode that has successfully reported it is safe and secure, alive and active. So, when beats are skipped by a DataNode, the NameNode instantly recognizes the inactivity in that DataNode; and when this inactivity continues for about 10 minutes, the DataNode is declared dead. From then on no I/O is sent to that node, the data present on it is replicated onto other DataNodes, and these new changes are updated in the NameNode’s mappings.
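To see the idea in miniature, here is a toy Python sketch of heartbeat tracking. This is not how the NameNode is actually implemented; the names and the exact timeout here are simplified assumptions.

```python
import time

HEARTBEAT_INTERVAL = 3       # seconds between heartbeats (the HDFS default)
DEAD_NODE_TIMEOUT = 10 * 60  # roughly the window after which a node is declared dead

# last_heartbeat maps a DataNode id to the time of its most recent heartbeat.
last_heartbeat = {}

def receive_heartbeat(datanode_id):
    """Called whenever a DataNode reports in."""
    last_heartbeat[datanode_id] = time.time()

def dead_datanodes():
    """DataNodes whose heartbeats have been missing for too long.

    The real NameNode would stop routing I/O to these nodes and
    schedule re-replication of their blocks onto healthy nodes.
    """
    now = time.time()
    return [node for node, seen in last_heartbeat.items()
            if now - seen > DEAD_NODE_TIMEOUT]
```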

Racks and Replication:

Files are maintained as a series of blocks, and all the blocks of a file are the same size except for the last block. For example, with a 64 MB block size (the old Hadoop default), a 200 MB file becomes three full 64 MB blocks plus one final 8 MB block. The DataNodes that hold these blocks are, in turn, physically grouped into racks.

Replication is performed across the racks, and these replication decisions are taken by the NameNode.

Replicas of a file’s blocks are placed onto different, unique racks to ensure against the possibility of data loss, just in case a complete rack failure should occur. Writing these replicas onto different racks does cost more at write time; but then, in the aftermath of a rack failure, wouldn’t we regret not having written the data to other racks rather than losing all of it!
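As a simplified Python sketch (assuming a made-up three-rack topology; the real HDFS placement policy has more nuances), the NameNode’s placement decision for one block could look like this:

```python
import random

# A hypothetical cluster topology: rack id -> the DataNodes it contains.
racks = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
    "rack3": ["dn7", "dn8", "dn9"],
}

def place_replicas(local_rack):
    """Pick DataNodes for one block, spreading its replicas across racks.

    Mirrors the spirit of HDFS's default policy for 3 replicas:
    one replica on the local rack, two on a single remote rack.
    Two racks in total means one whole rack can fail without
    losing the block, while cross-rack write traffic stays low.
    """
    first = random.choice(racks[local_rack])
    remote_rack = random.choice([r for r in racks if r != local_rack])
    second, third = random.sample(racks[remote_rack], 2)
    return [first, second, third]

print(place_replicas("rack1"))  # e.g. ['dn2', 'dn6', 'dn4']
```

Note the trade-off the sketch encodes: replicas land on two racks instead of three, which keeps the extra write cost down while still surviving the loss of any single rack.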

EditLog:

There will be new files and directories created in the file system. As these changes occur in the metadata, they are all recorded in a log called the EditLog, which the NameNode keeps and later merges into its stored image of the file system.
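To see why an append-only log is enough to rebuild the metadata, here is a purely illustrative Python sketch (nothing like HDFS’s actual on-disk format):

```python
# A toy append-only edit log: each entry records one metadata change.
edit_log = []

def log_edit(op, path):
    edit_log.append((op, path))

def replay(log):
    """Rebuild the set of paths in the namespace by replaying the log."""
    namespace = set()
    for op, path in log:
        if op == "create":
            namespace.add(path)
        elif op == "delete":
            namespace.discard(path)
    return namespace

log_edit("create", "/user/data/file1")
log_edit("create", "/user/data/file2")
log_edit("delete", "/user/data/file1")
print(replay(edit_log))  # {'/user/data/file2'}
```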

HDFS is one of the most advanced distributed file systems, and it is undergoing a quick transformation and coming of age with the advancement of technology. It is going to be a huge requirement for the Big Data generation that is shaping up in front of our very eyes.
