Hadoop: Modules & Process Flow
Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.
Hadoop is designed to scale from a single server to thousands of machines, each offering local computation and storage.
Why is Hadoop required?
As data grows at a large scale and everyone wants to analyze it and extract meaningful information for business use cases, certain challenges come with big data:
- Capturing Data
- Curation
- Storage
- Searching
- Sharing
- Transfer
- Analysis
- Presentation
To address the above challenges, we use the Hadoop modules.
Hadoop Modules
- HDFS — the Hadoop Distributed File System, in which files are broken into blocks and stored on nodes across the distributed architecture (a minimal client sketch follows this list).
- YARN — Yet Another Resource Negotiator, used for job scheduling and for managing cluster resources.
- MapReduce — a framework that helps programs perform parallel computation on data using key-value pairs (more details in a separate article).
- Hadoop Common — the Java libraries that are used to start Hadoop and are shared by the other Hadoop modules.
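To make the HDFS and Hadoop Common pieces concrete, here is a minimal Java sketch of an HDFS client that writes a file and reads it back through the org.apache.hadoop.fs.FileSystem API. The NameNode address (hdfs://localhost:9000) and the path /user/demo/hello.txt are illustrative assumptions for a local cluster; dfs.blocksize and dfs.replication are set explicitly only to show where the block size (128 MB) and replication factor (3) mentioned later are configured.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address -- replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        // 128 MB block size and replication factor of 3 are the usual defaults;
        // they are set here only to show where these knobs live.
        conf.set("dfs.blocksize", "134217728");
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/hello.txt"); // hypothetical path

        // Write a small file; HDFS splits larger files into blocks
        // and replicates each block across DataNodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hadoop".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back through the same FileSystem handle.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}
```

Compile it against the hadoop-client dependency and run it on a machine that can reach the cluster, for example via the hadoop jar command.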
Hadoop Process Flow
Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs:
- Data is initially divided into directories and files. Files are split into uniformly sized blocks of 128 MB (64 MB in older Hadoop versions)
- These blocks are then distributed across the cluster nodes for further processing
- HDFS, which sits on top of the local file system, supervises the processing
- Blocks are replicated across multiple nodes to handle hardware failure
- Checking that the code was executed successfully
- Performing the sort that takes place between the map and reduce stages (see the WordCount sketch after this list)
- Sending the sorted data to the appropriate reducer node
- Writing the debugging logs for each job
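The interplay between these steps (splitting the input, running map tasks, sorting and shuffling the intermediate key-value pairs, and reducing them) is easiest to see in the classic WordCount job from the Hadoop MapReduce examples. The sketch below follows that standard example; the input and output paths are supplied on the command line and are assumptions about your HDFS layout.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce stage: the framework has already sorted and grouped the
    // intermediate (word, 1) pairs by key before this method is called.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A typical submission looks like `hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output`, where both directories are hypothetical HDFS paths; YARN schedules the map and reduce tasks on the cluster nodes holding the input blocks where possible.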
Hadoop MapReduce in detail: https://medium.com/@abhisheksukhwal9/hadoop-map-reduce-in-detail-ac1c8429b03b