Hadoop Ecosystem Components
1. Introduction to Hadoop ecosystem components
a. Hadoop Distributed File System
HDFS is the primary storage system of Hadoop. The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable, fault-tolerant, reliable, and cost-efficient data storage for big data. It is a distributed file system that runs on commodity hardware. HDFS ships with a default configuration that suits many installations, although large clusters usually need additional configuration. Users interact with HDFS directly through shell-like commands.
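To give a feel for the shell-like commands mentioned above, here is a minimal sketch that builds a few common `hdfs dfs` invocations from Python. The paths and the `hdfs_dfs` and `run_all` helpers are hypothetical, chosen only for illustration. The commands are executed only if an `hdfs` client happens to be installed.

```python
import shutil
import subprocess


def hdfs_dfs(*args):
    """Build the argument list for one `hdfs dfs` shell command."""
    return ["hdfs", "dfs", *args]


# Hypothetical paths, used only for illustration.
commands = [
    hdfs_dfs("-mkdir", "-p", "/user/demo/input"),        # create a directory
    hdfs_dfs("-put", "local.txt", "/user/demo/input/"),  # upload a local file
    hdfs_dfs("-ls", "/user/demo/input"),                 # list the directory
    hdfs_dfs("-cat", "/user/demo/input/local.txt"),      # print the file
]


def run_all(cmds):
    """Run each command, but only when the `hdfs` client is on the PATH."""
    if shutil.which("hdfs") is None:
        return False  # nothing was executed
    for cmd in cmds:
        subprocess.run(cmd, check=True)
    return True


# Show the command lines without running them.
for cmd in commands:
    print(" ".join(cmd))
```

Calling `run_all(commands)` on a machine with a configured Hadoop client would execute the four commands in order. Elsewhere it returns `False` without doing anything.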
Components of HDFS:
i. NameNode
It is also known as the master node. The NameNode does not store the actual data or dataset. It stores metadata: the number of blocks, their locations, the rack and DataNode on which each block is stored, and other details. The namespace it maintains consists of files and directories.
Tasks of NameNode
- Manage the file system namespace.
- Regulate clients' access to files.
- Execute file system operations such as renaming, closing, and opening files and directories.
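The NameNode's role as a metadata-only service can be sketched with a toy in-memory model. This is not how HDFS is implemented. `ToyNameNode`, `BlockInfo`, the file paths, and the rack and DataNode names are all hypothetical. The point is only that the namespace maps each file to its blocks and their locations, while file contents live elsewhere.

```python
from dataclasses import dataclass, field


@dataclass
class BlockInfo:
    block_id: int
    locations: list  # (rack, datanode) pairs that hold a replica


@dataclass
class ToyNameNode:
    # Namespace: file path -> ordered list of BlockInfo. No file contents here.
    namespace: dict = field(default_factory=dict)

    def create(self, path):
        """Register a new, empty file in the namespace."""
        if path in self.namespace:
            raise FileExistsError(path)
        self.namespace[path] = []

    def add_block(self, path, block_id, locations):
        """Record where the replicas of one block of the file are stored."""
        self.namespace[path].append(BlockInfo(block_id, list(locations)))

    def rename(self, old, new):
        """A namespace operation: only metadata moves, no data is copied."""
        self.namespace[new] = self.namespace.pop(old)

    def block_locations(self, path):
        """What a client asks for before reading from the DataNodes."""
        return [(b.block_id, b.locations) for b in self.namespace[path]]


nn = ToyNameNode()
nn.create("/logs/app.log")
nn.add_block("/logs/app.log", 1, [("rack1", "dn1"), ("rack1", "dn2"), ("rack2", "dn3")])
nn.add_block("/logs/app.log", 2, [("rack2", "dn3"), ("rack2", "dn4"), ("rack1", "dn1")])
nn.rename("/logs/app.log", "/logs/app-old.log")
print(nn.block_locations("/logs/app-old.log"))
```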
ii. DataNode
It is also known as the slave node. The HDFS DataNode is responsible for storing the actual data in HDFS. It performs read and write operations at the request of clients. Each block replica on a DataNode consists of two files on the local file system: the first holds the data, and the second records the block's metadata, which includes checksums for the data. At startup, each DataNode connects to its NameNode and performs a handshake, which verifies the namespace ID and the software version of the DataNode. If a mismatch is found, the DataNode shuts down automatically.
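The startup handshake can be illustrated with a small sketch. The `NodeIdentity` type, the `handshake` function, and the ID and version values are made up for this example. Strict equality of the version strings is also an assumption, since a real cluster may apply a more lenient compatibility rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeIdentity:
    namespace_id: int
    software_version: str


def handshake(namenode, datanode):
    """Return True if the DataNode may join, False if it should shut down.

    Assumption for this sketch: both fields must match exactly.
    """
    return (
        namenode.namespace_id == datanode.namespace_id
        and namenode.software_version == datanode.software_version
    )


# Made-up identities for illustration.
namenode = NodeIdentity(463031076, "3.3.6")
good_dn = NodeIdentity(463031076, "3.3.6")
stale_dn = NodeIdentity(111111111, "3.3.6")  # e.g. left over from another cluster

for name, dn in [("good_dn", good_dn), ("stale_dn", stale_dn)]:
    status = "joins the cluster" if handshake(namenode, dn) else "shuts down"
    print(name, status)
```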
Tasks of DataNode
- Create, delete, and replicate block replicas as instructed by the NameNode.
- Manage the data storage of the system.
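The two-file replica layout and the DataNode tasks above can be sketched as a toy class. Everything here is a simplification under stated assumptions. `ToyDataNode` is hypothetical, the metadata file is plain text rather than the real on-disk format, and `zlib.crc32` over 512-byte chunks stands in for whatever checksum the cluster is configured to use.

```python
import os
import tempfile
import zlib

BYTES_PER_CHECKSUM = 512  # chunk size chosen for this sketch


def chunk_checksums(data):
    """One CRC32 value per fixed-size chunk of the block data."""
    return [
        zlib.crc32(data[i:i + BYTES_PER_CHECKSUM])
        for i in range(0, len(data), BYTES_PER_CHECKSUM)
    ]


class ToyDataNode:
    def __init__(self, storage_dir):
        self.storage_dir = storage_dir

    def _paths(self, block_id):
        # Simplified naming: one data file plus one ".meta" file per replica.
        base = os.path.join(self.storage_dir, f"blk_{block_id}")
        return base, base + ".meta"

    def create_replica(self, block_id, data):
        data_path, meta_path = self._paths(block_id)
        with open(data_path, "wb") as f:
            f.write(data)
        with open(meta_path, "w") as f:
            f.write("\n".join(str(c) for c in chunk_checksums(data)))

    def read_replica(self, block_id):
        data_path, meta_path = self._paths(block_id)
        with open(data_path, "rb") as f:
            data = f.read()
        with open(meta_path) as f:
            stored = [int(tok) for tok in f.read().split()]
        if chunk_checksums(data) != stored:
            raise IOError(f"checksum mismatch in block {block_id}")
        return data

    def delete_replica(self, block_id):
        for path in self._paths(block_id):
            os.remove(path)

    def replicate_to(self, block_id, other):
        # Read (and verify) the local replica, then create it on the other node.
        other.create_replica(block_id, self.read_replica(block_id))


with tempfile.TemporaryDirectory() as d1, tempfile.TemporaryDirectory() as d2:
    dn1, dn2 = ToyDataNode(d1), ToyDataNode(d2)
    dn1.create_replica(7, b"x" * 1300)   # creation
    dn1.replicate_to(7, dn2)             # replication
    copied = dn2.read_replica(7)
    dn1.delete_replica(7)                # deletion
    remaining = os.listdir(d1)
```

In this sketch the calls are made directly. In HDFS the DataNode would perform them in response to instructions from the NameNode.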
b. MapReduce
Hadoop MapReduce is the core component of Hadoop that provides data processing. MapReduce is a software framework for easily writing applications that process vast amounts of structured and unstructured data stored in the Hadoop Distributed File System.
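The programming model behind the framework can be shown with the classic word count, written here as a single-process sketch in plain Python. It does not use the Hadoop API. The function names and input lines are made up, and a real job would run the map and reduce phases in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


lines = ["big data on hadoop", "hadoop stores big data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```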
