Distributed File Systems and Object Stores on Linode (Part 3) — HDFS

In part 1 of this series, we looked at GlusterFS. In part 2, we looked at the Ceph Object Store. In this concluding part, we look at HDFS, which is arguably the most popular of the three in big data ecosystems and is also quite different from the other two.

HDFS — Hadoop Distributed File System — is a file system that was designed for the Hadoop distributed data processing system. It remains the file system of choice in the big data ecosystems of Hadoop and Spark because it has proved itself in many large deployments of major software companies.

Unlike Gluster and Ceph, which are written in C or C++ and run natively, HDFS is written in Java and runs in a JVM (Java Virtual Machine). While it can be used as a general-purpose distributed file system, its architecture, behavior and tooling are oriented towards scalable distributed data processing of very large datasets ranging from hundreds of terabytes to hundreds of petabytes.

Architecture

In HDFS, DataNodes store data by breaking up large files into smaller stripes, storing those stripes across one or more blocks, and replicating those blocks on multiple nodes.
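To make the split-and-replicate idea concrete, here is a minimal Python sketch. It is only an illustration, not HDFS code: the 128 MiB block size is an assumed value, the function names are made up for this example, and the round-robin placement is a toy policy standing in for the real HDFS placement logic.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size (128 MiB), for illustration only


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks that a file of file_size bytes occupies."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, datanodes, replication=2):
    """Assign each block to `replication` distinct DataNodes (toy round-robin policy)."""
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    return [
        [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    ]


if __name__ == "__main__":
    sizes = split_into_blocks(300 * 1024 * 1024)  # a 300 MiB file
    for index, nodes in enumerate(place_replicas(len(sizes), ["dn1", "dn2", "dn3"])):
        print(f"block {index}: {sizes[index]} bytes on {nodes}")
```

With three DataNodes and a replication factor of 2, every block ends up on two different nodes, so losing any single node still leaves one copy of each block.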

The same DataNodes that serve as storage nodes in the HDFS layer also serve as compute nodes in the Hadoop layer for executing data processing components of Hadoop or its ecosystem. Each DataNode attempts to execute the processing steps on just its own locally stored data blocks. This way, HDFS achieves data locality and reduces network transfers between DataNodes:

HDFS Architecture (non-HA)

A NameNode acts as a central lookup table to map file paths and blocks to DataNodes.
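The lookup-table role can be pictured as two in-memory maps: one from file paths to block IDs, and one from block IDs to the DataNodes holding them. The sketch below is a deliberately simplified model with hypothetical names (`ToyNameNode`, `add_file`, `locate`); the real NameNode tracks far more state.

```python
class ToyNameNode:
    """Simplified stand-in for the NameNode's path -> blocks -> DataNodes mapping."""

    def __init__(self):
        self.file_blocks = {}      # file path -> ordered list of block IDs
        self.block_locations = {}  # block ID -> list of DataNode names

    def add_file(self, path, placement):
        """Register a file; placement[i] lists the DataNodes holding block i."""
        block_ids = []
        for index, nodes in enumerate(placement):
            block_id = f"{path}#blk{index}"  # made-up ID scheme for this example
            self.block_locations[block_id] = list(nodes)
            block_ids.append(block_id)
        self.file_blocks[path] = block_ids

    def locate(self, path):
        """Return (block ID, DataNodes) pairs in file order, as a client would need."""
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]


if __name__ == "__main__":
    namenode = ToyNameNode()
    namenode.add_file("/data/events.log", [["dn1", "dn2"], ["dn2", "dn3"]])
    for block_id, nodes in namenode.locate("/data/events.log"):
        print(block_id, "->", nodes)
```

A client asks the NameNode only for locations; the actual bytes are then read directly from the DataNodes.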

Historically, NameNode was neither scalable nor highly available. In later versions, HDFS Federation was introduced to address the scalability problem and NameNode HA to address the availability problem.

HDFS Federation scales by sharding the file system namespace across many NameNodes.
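One way to picture namespace sharding is a mount table that maps path prefixes to NameNodes, with the longest matching prefix winning. The sketch below is a toy under that assumption; the prefixes and the names `nn1` to `nn3` are hypothetical, and it does not reproduce how any particular Hadoop client resolves paths.

```python
def resolve_namenode(path, mount_table):
    """Return the NameNode responsible for path using longest-prefix matching."""
    best = None
    for prefix in mount_table:
        # Match whole path components only, so "/user" does not capture "/userdata".
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no NameNode is mounted for {path}")
    return mount_table[best]


if __name__ == "__main__":
    # Hypothetical shard layout: each prefix is owned by a different NameNode.
    mounts = {"/user": "nn1", "/logs": "nn2", "/user/archive": "nn3"}
    for p in ("/user/alice/a.txt", "/user/archive/2016", "/logs"):
        print(p, "->", resolve_namenode(p, mounts))
```

Because each NameNode only holds the metadata for its own part of the namespace, adding NameNodes spreads the memory load instead of concentrating it on one machine.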

HDFS NameNode HA Architecture

NameNode HA supports multiple NameNodes in an active-standby configuration with manual or automatic failover, and uses either a dedicated journal or shared NFS storage to keep the standby nodes in sync with the active node. For automatic failover, NameNode HA requires a ZooKeeper cluster.

HDFS storage capacity is highly scalable. Just provision a new DataNode, configure it with the NameNode address, and wait a while for it to announce itself to the NameNode and join the cluster. Despite a centralized metadata architecture and a JVM layer sitting between it and the OS, HDFS is able to scale very well. Installations of 400+ PB and thousands of nodes have been publicly reported by major software companies.
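As a rough illustration of the "configure it with the NameNode address" step, the sketch below generates a minimal core-site.xml that sets the fs.defaultFS property. The hostname and port 8020 are placeholder assumptions, and a real DataNode needs more configuration than this single property.

```python
import xml.etree.ElementTree as ET


def core_site_xml(namenode_host, port=8020):
    """Render a minimal core-site.xml pointing at the given NameNode.

    The host and port are placeholders; adjust them to your own cluster.
    """
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = "fs.defaultFS"
    ET.SubElement(prop, "value").text = f"hdfs://{namenode_host}:{port}"
    return ET.tostring(configuration, encoding="unicode")


if __name__ == "__main__":
    # "namenode.example.internal" is a made-up private hostname.
    print(core_site_xml("namenode.example.internal"))
```

Generating the file from a function like this makes it easy to stamp out the same setting on every new DataNode from a provisioning script.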

Infrastructure Planning for HDFS

Configuration for DataNodes

DataNodes require medium to high storage capacity, plenty of RAM to take advantage of HDFS cache locality, and high network transfer rates to support data locality.

DataNodes also require many CPU cores and plenty of RAM because they act as compute nodes for Hadoop or Spark distributed processing.

Since all the data transfer is within a datacenter, outgoing transfer caps don’t matter for this selection.

The following configurations are suitable:

Linode Plans for HDFS DataNodes

Configuration for NameNodes

NameNodes require high RAM, because all the distributed namespace mappings are loaded into memory. High CPU and storage capacity are desirable for journal log storage and for periodically updating the persisted file system image from the journal entries. RAM usage is proportional to the number of files stored in HDFS. Linode 12GB to 48GB RAM configurations are suitable, depending on the number of files stored.
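Since RAM usage grows with the number of files, a back-of-the-envelope estimate helps when picking a plan. The sketch below assumes roughly 150 bytes of heap per namespace object (one object per file plus one per block), which is a commonly quoted rule of thumb and not a figure measured for this article; treat the result as an order-of-magnitude guide only.

```python
BYTES_PER_OBJECT = 150  # assumed rule-of-thumb heap cost per file or block object


def estimate_namenode_heap_gib(num_files, blocks_per_file=1.0,
                               bytes_per_object=BYTES_PER_OBJECT):
    """Estimate NameNode heap in GiB for num_files files.

    Each file contributes one file object plus blocks_per_file block objects.
    """
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object / 2**30


if __name__ == "__main__":
    for files in (1_000_000, 10_000_000, 100_000_000):
        print(f"{files:>11,} files -> ~{estimate_namenode_heap_gib(files):.1f} GiB heap")
```

Under these assumptions, 10 million single-block files come to about 2.8 GiB of heap, and the estimate scales linearly from there. Real deployments need extra headroom for the JVM and for directories, so size generously.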

For NameNode HA using QJM, JournalNode daemons should be deployed either on the NameNodes themselves or, preferably, on dedicated nodes. JournalNode daemons require a quorum configuration and should therefore run on an odd number of nodes. Since NameNode automatic failover is desirable and is contingent on a ZooKeeper cluster, which also requires a quorum configuration, you can have an odd number of dedicated nodes running both JournalNode and ZooKeeper daemons. Low-end configurations such as Linode 4GB or 8GB are enough for such nodes.
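The preference for an odd number of quorum nodes follows from simple majority arithmetic, sketched below. The function names are just for this example.

```python
def quorum_size(num_nodes):
    """Smallest number of nodes that forms a strict majority."""
    return num_nodes // 2 + 1


def tolerated_failures(num_nodes):
    """How many nodes can fail while a majority is still reachable."""
    return num_nodes - quorum_size(num_nodes)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nodes: quorum={quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Three nodes and four nodes both tolerate only one failure, while five nodes tolerate two. The fourth node adds cost without adding fault tolerance, which is why odd counts are the usual choice.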

DNS

Since Linode does not provide name resolution for private IPs, you should deploy DNS servers to do so. The NameNodes can also act as DNS servers. Split-horizon DNS is not required as long as the clients — that is, the Hadoop or Spark components — are in the same datacenter (that is, in the same private network).

Performance of HDFS on Linode

TestDFSIO benchmarking was performed on a YARN+HDFS cluster of 2x Linode 24GB instances (with 8 cores, 24GB RAM, 384GB of disk space and 2 Gbps outgoing bandwidth on each), and one 2GB master node acting only as HDFS name node and YARN resource manager. Block replication factor was set to 2.
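For context on the test cluster, replication divides the raw disk space by the replication factor. The small sketch below computes that upper bound; it ignores space taken by the OS, logs and any reserved non-HDFS space, so actual usable capacity will be lower.

```python
def usable_capacity_gb(num_datanodes, disk_gb_per_node, replication):
    """Upper bound on unique data the cluster can hold, in GB."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return num_datanodes * disk_gb_per_node / replication


if __name__ == "__main__":
    # The benchmark cluster: 2 DataNodes x 384GB disks, replication factor 2.
    print(usable_capacity_gb(2, 384, 2))
```

For the two-node cluster above, this works out to at most 384GB of unique data, since every block is stored twice.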

Read tests showed that throughput peaked when the number of map tasks matched the number of cores and the file size was 2GB.

For smaller file sizes, throughput was lower simply because the system was not being utilized optimally. For larger file sizes, throughput dropped, probably because they could no longer be cached in memory.

Write throughput was much more consistent across the number of tasks and file sizes.

A few tasks at small file sizes resulted in low throughput, likely because the system was underutilized. Once all cores were engaged or file sizes went beyond RAM size, throughput was consistently in the 120–160 MB/s range, which is around 20% of the raw disk throughput and 50% of the outgoing network throughput.

Deploy on Linode

Hdfs-linode is a set of interactive scripts being developed to automatically create and provision an HDFS cluster on the Linode cloud. They are written in Python, use Ansible for deployment, and support Ubuntu 14.04 / 16.04 LTS, Debian 8 and CentOS 7.

Follow its development at https://github.com/pathbreak/hdfs-linode to get notified when there’s a release version.

Conclusions

HDFS is relatively easy to deploy. Linode instances with high disk capacity and high network throughput should be preferred for HDFS DataNodes. The performance tests showed that YARN should be configured so that the number of tasks matches the number of available cores.

HDFS is neither a fully POSIX-compliant file system like GlusterFS nor an object store like Ceph. Its big plus is that it’s designed to colocate compute and data on the same or a nearby instance, unlike GlusterFS or Ceph, which are not optimized for compute placement. Since most data is located nearby in terms of network hops, network bottlenecks are reduced, and this is the reason HDFS shows much higher throughput numbers.

Credits

A big thank you to Dave Roesch and Keith Craig for providing Linode infrastructure and suggestions that made this article possible.


About me: I’m a software consultant and architect specializing in big data, data science and machine learning, with 14 years of experience. I run Pathbreak Consulting, which provides consulting services in these areas for startups and other businesses. I blog here and I’m on GitHub. You can contact me via my website or LinkedIn.


Please feel free to share below any comments or insights about your experience with Object Storage in general or HDFS in particular. And if you found this blog useful, please consider sharing it through social media.


While Karthik’s views and cloud situations are solely his and don’t necessarily reflect those of Linode, we are grateful for his contributions.
