Understanding Hadoop HDFS

ELMASLOUHY Mouaad · Analytics Vidhya · Apr 10, 2020

HDFS (Hadoop Distributed File System) is a distributed file system designed to store and retrieve very large files with streaming data access. It is one of the core components of the Apache Hadoop framework, namely its storage layer. Hadoop itself is one of Apache's top-level projects.

Architecture

Definitions

HDFS Client: acts on the user's behalf, interacting with the NameNode and DataNodes to fulfill user requests.

NameNode: the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where the file data is kept across the cluster. It does not store the data of these files itself; it stores metadata about them.

DataNode: stores the actual data in the Hadoop file system. A functional filesystem has more than one DataNode, with data replicated across them.

Files and blocks: a file is the data we want to store. When a file is stored in HDFS it is broken into blocks; the default block size is 128 MB in Hadoop 2.x (often raised to 256 MB) and 64 MB in Hadoop 1.x.

Block Replication: each block is replicated to provide fault tolerance and availability; the default replication factor is 3.
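The block and replica layout described above is visible through the FileSystem API. Below is a minimal sketch, assuming a reachable cluster and a pre-existing file /data/sample.txt (both hypothetical), that prints a file's replication factor, block size, and the DataNodes hosting each block:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/data/sample.txt")); // hypothetical file

        // Replication factor and block size recorded in the NameNode's metadata.
        System.out.println("Replication: " + status.getReplication());
        System.out.println("Block size : " + status.getBlockSize());

        // One BlockLocation per block, listing the DataNodes holding a replica.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block at offset " + block.getOffset()
                    + " on " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}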

Flow chart of Write Operation

  1. The HDFS client calls the create() method on DistributedFileSystem to create a file.
  2. DistributedFileSystem contacts the NameNode through an RPC call to create a new file in the filesystem namespace, with metadata (file_size, dest_path…) associated with it. (Per-file block settings can be passed here too; see the sketch after this list.)
  3. The NameNode verifies that the file doesn't already exist and that the client has permission to create it. If these checks pass, the NameNode makes a record of the new file, and DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to the DataNodes.
  4. As the client writes data, DFSOutputStream (wrapped in FSDataOutputStream) splits the data into packets and appends them to an internal queue called the data queue. The DataStreamer consumes this queue and asks the NameNode to allocate new blocks by choosing a list of suitable DataNodes to store the replicas.
  5. DFSOutputStream also maintains a second queue of packets, the ack queue, holding packets that are awaiting acknowledgment from the DataNodes.
  6. The HDFS client calls the close() method on the stream when it finishes writing data.
  7. The client then notifies the NameNode that the file write is complete.
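The create() call in step 2 can also carry per-file settings that override the cluster defaults. A minimal sketch, assuming conf is a Configuration pointing at the cluster (the path and values are illustrative):

// Create a file with an explicit replication factor and block size.
FileSystem fs = FileSystem.get(conf);
Path path = new Path("/data/output.bin");   // hypothetical destination
FSDataOutputStream out = fs.create(
        path,
        true,                               // overwrite if the file exists
        4096,                               // I/O buffer size in bytes
        (short) 2,                          // replication factor for this file
        128L * 1024 * 1024);                // block size: 128 MB
out.writeUTF("hello HDFS");                 // data is split into packets and queued
out.close();                                // flushes packets and waits for acks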

Flow chart of Read Operation

  1. The HDFS client calls the open() method on the FileSystem object, which for HDFS is an instance of DistributedFileSystem.
  2. DistributedFileSystem then calls the NameNode over RPC to get the locations of the file's blocks. For each block, the NameNode returns the addresses of the DataNodes closest to the client that hold a copy of it. DistributedFileSystem then returns an FSDataInputStream to the client, from which the client can read the data.
  3. The client calls the read() method on the FSDataInputStream object.
  4. DFSInputStream (wrapped in FSDataInputStream), which holds the DataNode addresses for the file's blocks, connects to the closest DataNode to read the first block of the file.
  5. Upon reaching the end of a block, DFSInputStream closes the connection to that DataNode and finds the best-suited DataNode for the next block. (It also supports seeking to arbitrary offsets; see the sketch after this list.)
  6. When the client has finished reading the data, it calls close() on the FSDataInputStream.
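Because DFSInputStream knows the addresses of all of a file's blocks, the client is not limited to reading sequentially: FSDataInputStream also supports seeking and positioned reads. A minimal sketch (the path and offsets are illustrative):

// Random access on an HDFS file. DFSInputStream transparently connects to
// whichever DataNode holds the block containing the requested offset.
FileSystem fs = FileSystem.get(conf);
FSDataInputStream in = fs.open(new Path("/data/sample.txt")); // hypothetical file
byte[] buf = new byte[256];

in.seek(1024);                          // jump to byte offset 1024, then stream
int n = in.read(buf);

// Positioned read: bytes at an absolute offset, without moving the cursor.
in.readFully(4096, buf, 0, buf.length);
in.close();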

HDFS Commands

  1. hdfs dfs -mkdir /path/directory_name : creates a directory.
  2. hdfs dfs -ls /path : lists the files and directories present in HDFS.
  3. hdfs dfs -put <localsrc> <dest> : copies a file or directory from the local filesystem to the destination in the Hadoop filesystem.
  4. hdfs dfs -get <src> <localdest> : copies a file or directory from the Hadoop filesystem to the destination in the local filesystem.
  5. hdfs dfs -cat /path_to_file_in_hdfs : reads a file in HDFS and prints its content to the console (stdout).
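Each of these commands has a counterpart in the Java FileSystem API introduced in the next section. A minimal sketch of the mapping, assuming a Configuration conf pointing at the cluster (paths are illustrative; IOUtils is org.apache.hadoop.io.IOUtils):

// FileSystem API equivalents of the shell commands above.
FileSystem fs = FileSystem.get(conf);

fs.mkdirs(new Path("/path/directory_name"));                 // hdfs dfs -mkdir
for (FileStatus s : fs.listStatus(new Path("/path"))) {      // hdfs dfs -ls
    System.out.println(s.getPath());
}
fs.copyFromLocalFile(new Path("localfile.txt"),              // hdfs dfs -put
        new Path("/path/localfile.txt"));
fs.copyToLocalFile(new Path("/path/localfile.txt"),          // hdfs dfs -get
        new Path("localcopy.txt"));
IOUtils.copyBytes(fs.open(new Path("/path/localfile.txt")),  // hdfs dfs -cat
        System.out, 4096, false);
fs.close();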

Java API for HDFS

Write to HDFS

// Assumes "conf" is an org.apache.hadoop.conf.Configuration pointing at the
// cluster, and "source" holds the local path of the file to upload.
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fileSystem = FileSystem.get(conf);
Path path = new Path("/path/to/file.ext");
if (fileSystem.exists(path)) {
    System.out.println("File " + path + " already exists");
    return;
}
// Create the file in HDFS and stream the local file into it, 1 KB at a time.
FSDataOutputStream out = fileSystem.create(path);
InputStream in = new BufferedInputStream(new FileInputStream(new File(source)));
byte[] b = new byte[1024];
int numBytes = 0;
while ((numBytes = in.read(b)) > 0) {
    out.write(b, 0, numBytes);
}
in.close();
out.close();
fileSystem.close();

Read from HDFS

// Assumes "conf" is an org.apache.hadoop.conf.Configuration pointing at the cluster.
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fileSystem = FileSystem.get(conf);
Path path = new Path("/path/to/file.ext");
if (!fileSystem.exists(path)) {
    System.out.println("File does not exist");
    return;
}
// Open the file and copy its contents to stdout, 1 KB at a time.
FSDataInputStream in = fileSystem.open(path);
byte[] b = new byte[1024];
int numBytes = 0;
while ((numBytes = in.read(b)) > 0) {
    System.out.write(b, 0, numBytes);
}
in.close();
fileSystem.close();

Thanks for Reading!
