What is Big Data? What is Hadoop?
This article is for anyone who is learning Hadoop or just starting out with it. I am still learning too, so any suggestions about this article are welcome. I'll try to keep it as simple as I can, so that people outside IT (information technology) can follow along if they want to.
What is Big Data? Before answering that, let me explain what data is. Data can be defined as a collection of facts or information from which conclusions may be drawn. For example, .doc, .docx and .xls files are common and can be found on almost any PC or laptop. These files contain information, and that information is data.
Let's take a dive into history. By a widely quoted estimate, the amount of data produced from the beginning of time until 2003 was 5 billion gigabytes. If you piled that data up in the form of disks, it could fill an entire football field. Today the same amount of data is generated in a tiny fraction of that time, and the rate keeps growing.
How are we generating this data? It comes from many sources, such as social networking sites, e-commerce sites and the sensors used by NASA (NASA generates petabytes of data). All of this data contains meaningful information that organisations need. The question is how we can extract that information when the data volume is so huge.
We do have traditional technologies for extracting information, such as MySQL, IBM DB2 and Oracle. But these are relational databases, designed to analyze structured data (relational data organized in rows and columns). Most of the data generated by the sources above is unstructured data (information that either does not have a pre-defined data model or is not organized in a pre-defined manner). Examples: server log files, PDF files, Word documents, etc.
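To make the difference between structured and unstructured data a bit more concrete, here is a small Python sketch. The order record and the log line are made up for illustration, and the regular expression is just one assumed way of pulling fields out of such a line, not a standard parser.

```python
import re

# Structured: every record has the same named columns, like a row in a table.
order_row = {"order_id": 1001, "customer": "alice", "amount": 25.50}
print(order_row["amount"])  # a field can be read directly by its name

# Unstructured: a raw server log line (hypothetical example).
# Nothing tells us where each field is, so we have to parse it ourselves.
log_line = '192.168.1.10 - - [12/Mar/2015:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 1043'

# Assumed layout: ip ... "METHOD path protocol" status size
pattern = re.compile(r'^(\S+) .*?"(\w+) (\S+) [^"]*" (\d{3}) (\d+)$')
match = pattern.match(log_line)
ip, method, path, status, size = match.groups()
print(ip, method, path, status, size)
```

With the table row the database already knows the columns. With the log line, the parsing work has to be done for every single line, and that gets expensive when there are billions of them.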
The other main aspect of Big Data is “velocity”. In a relational database, information can be extracted quickly because the hardware is adequate for the job and the amount of data is not huge. But what about extracting something from a 1-petabyte file? Yes, we can do it, but the query running in the background will take so long to return a result that the result is no longer useful. So velocity also plays a big role in the Big Data problem.
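A rough back-of-the-envelope calculation shows why one machine is too slow. The sketch below assumes a read speed of 100 MB per second per machine, which is only an illustrative figure, and assumes the data can be split evenly across machines.

```python
def scan_time_seconds(data_bytes, bytes_per_second_per_machine, machines=1):
    """Time to read all the data once if it is split evenly across the machines."""
    return data_bytes / (bytes_per_second_per_machine * machines)


PETABYTE = 10**15          # 1 PB in bytes (decimal units)
READ_SPEED = 100 * 10**6   # assumption: 100 MB/s sequential read per machine

single = scan_time_seconds(PETABYTE, READ_SPEED)
cluster = scan_time_seconds(PETABYTE, READ_SPEED, machines=1000)

print(f"1 machine:     {single / 86400:.0f} days")
print(f"1000 machines: {cluster / 3600:.1f} hours")
```

Under these assumptions a single machine needs months just to read the data once, while a thousand machines reading in parallel need a few hours. Real numbers depend on the hardware and the network, but the gap is the point.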
Hadoop was created by computer scientists Doug Cutting and Mike Cafarella in 2006. Apache Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters. The idea for Hadoop grew out of research papers published by Google, in particular the one on MapReduce. MapReduce is a software framework in which an application is broken down into numerous small parts. Any of these parts, which are also called fragments or blocks, can be run on any node in the cluster.
Let me explain this in more depth. Hadoop uses HDFS (Hadoop Distributed File System), a distributed file system designed to run on large clusters (up to thousands of computers) of ordinary, inexpensive machines in a reliable, fault-tolerant manner: if one computer fails, the others continue to work, so the entire process does not fail. HDFS uses a master/slave architecture. The master is a single NameNode that manages the file system metadata (a directory-like tree structure, plus the record of which file block is stored on which node), and one or more slave DataNodes store the actual data. A file in HDFS is split into several blocks (the default block size is 64 MB in Apache Hadoop 1.x and 128 MB in later versions, so with 64 MB blocks a 256 MB file is split into 4 blocks). Those blocks are stored on the DataNodes, and replicas of each block are created on other DataNodes. HDFS provides a shell like any other file system, and a list of commands is available for interacting with it.
MapReduce plays the key role in analyzing this enormous data and giving us a useful result. In Hadoop 1.x the MapReduce part is handled by the JobTracker, which assigns the map and reduce tasks to the TaskTrackers running on the DataNodes. It keeps track of progress with the help of the TaskTrackers, which send it heartbeat signals from time to time.
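Here is a toy Python sketch of the two ideas above: splitting a file into blocks with replicas, and counting words in the map and reduce style. It is not Hadoop code and does not use any Hadoop API. The function names, the node names, the round-robin placement and the replication factor of 3 are all assumptions made for illustration (real HDFS uses a smarter placement policy).

```python
from collections import defaultdict
from math import ceil

BLOCK_SIZE_MB = 64  # block size mentioned in the text
REPLICATION = 3     # assumption: three copies of every block


def plan_blocks(file_size_mb, datanodes,
                block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return {block_index: [datanode, ...]}, a toy stand-in for NameNode metadata."""
    n_blocks = ceil(file_size_mb / block_size_mb)
    copies = min(replication, len(datanodes))
    placement = {}
    for b in range(n_blocks):
        # Simple round-robin placement, only for illustration.
        placement[b] = [datanodes[(b + r) % len(datanodes)] for r in range(copies)]
    return placement


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def reduce_phase(pairs):
    """Reduce: add up the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
print(plan_blocks(256, nodes))

# Each chunk stands for the data held by one node; map runs on each chunk
# independently, then reduce combines all the emitted pairs.
chunks = ["big data is big", "hadoop stores big data"]
pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(pairs))
```

The map step never needs to see the other chunks, which is why it can run on whichever node already holds that piece of the data. Only the reduce step has to bring the results together.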
I am still learning, and I hope I have covered the basic things you need to know about Big Data. Please suggest any improvements to this article. It will help me learn more, and it will also motivate me to write more about Hadoop.