HDFS is dead?

Chirag Singla
3 min read · Jul 4, 2020


Context: “HDFS is dead, and object stores like S3/GCP buckets/Blob Storage are the way forward” came up in a discussion at our office.

Some of the questions that came to my mind are:
1) Is HDFS really dead? Have all companies moved away from HDFS to object stores?
2) When should one use HDFS and when should one use Cloud Storage?
3) Which one is more cost effective?

Let's answer each of these questions one by one:
1) Is HDFS really dead? Have all companies moved away from HDFS to object stores?
HDFS is not dead. Many companies still use HDFS (or GFS-style file systems) extensively for data storage. That said, companies are moving toward object stores like S3, Blob Storage, and GCS buckets for their primary data because of the cost benefits and the low maintenance overhead of the managed object stores offered by cloud providers. For MapReduce-style workloads, HDFS is still commonly used in the following ways (see the sketch after this list):
1) Caching intermediate results during MapReduce processing
2) Serving workloads that need random I/O
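A minimal PySpark sketch of this division of labor, assuming a cluster where primary data lives in an object store (reachable via the s3a connector) and HDFS is available as scratch space; the bucket name and paths below are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-as-scratch").getOrCreate()

# Primary data lives in the object store: cheap, durable, decoupled from compute.
# (Bucket and paths are hypothetical.)
events = spark.read.parquet("s3a://my-data-lake/events/2020/07/")

# A shuffle-heavy intermediate result is written to HDFS on the cluster's local
# disks, where re-reads and random I/O are fast and free of per-request charges.
daily_counts = events.groupBy("user_id", "event_date").count()
daily_counts.write.mode("overwrite").parquet("hdfs:///tmp/daily_counts/")

# Downstream stages read the intermediate data back from HDFS,
# and only the final result is persisted back to the object store.
intermediate = spark.read.parquet("hdfs:///tmp/daily_counts/")
intermediate.write.mode("overwrite").parquet("s3a://my-data-lake/reports/daily_counts/")
```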

2) When should one use HDFS and when should one use Cloud Storage?
Pros of moving to Cloud Storage:
1) Lower costs: Pay for what you use instead of buying hardware for what you may eventually need.
2) Separation of compute and storage: Setting up and tearing down the clusters becomes super easy.
3) Interoperability: Seamless interoperability between Hadoop, GCP, and other cloud services (see the sketch after this list).
4) No storage management overhead: No routine maintenance or version upgrades required.
5) High data availability: Not vulnerable to NameNode failures.
6) HDFS compatibility with equivalent performance.
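To make the interoperability and HDFS-compatibility points concrete, here is a small sketch: with the right Hadoop filesystem connector on the cluster (s3a for S3, the GCS connector for gs://), the same Spark code runs against HDFS or an object store, and only the path scheme changes. The bucket names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fs-interop").getOrCreate()

# The same read call works against HDFS, S3, or GCS; only the URI scheme
# (and the connector configured on the cluster) differs.
paths = [
    "hdfs:///data/events/",         # on-cluster HDFS
    "s3a://my-data-lake/events/",   # S3 via the s3a connector (hypothetical bucket)
    "gs://my-data-lake/events/",    # GCS via the GCS connector (hypothetical bucket)
]

for path in paths:
    df = spark.read.parquet(path)
    print(path, df.count())
```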

Cons of moving to Cloud Storage:
1) Increased I/O variance: Problematic when the application is backed by HBase or another NoSQL database.
2) File appends and truncates are not supported.
3) Not POSIX-compliant: Directory renames are not atomic, because a “directory” is just a key prefix and renaming it means copying and deleting every object under it (see the sketch after this list).
4) Does not expose all filesystem information: No information about racks or corrupted blocks.
5) Greater request latency: A problem mainly for small jobs, large numbers of files, or long sequences of file operations.
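To illustrate the non-atomic rename point, here is a rough boto3 sketch of what “renaming” a directory on S3 actually involves: since a directory is only a key prefix, every object under it has to be listed, copied, and deleted one at a time, and a failure partway through leaves a half-renamed prefix. Bucket and prefixes are hypothetical:

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-data-lake"                          # hypothetical bucket
src, dst = "staging/run-42/", "final/run-42/"    # hypothetical prefixes

# There is no atomic "rename directory" call: every object under the prefix
# must be copied to its new key and then deleted. A failure halfway through
# leaves the prefix in a mixed state, unlike an HDFS/POSIX rename.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=src):
    for obj in page.get("Contents", []):
        old_key = obj["Key"]
        new_key = dst + old_key[len(src):]
        s3.copy_object(Bucket=bucket,
                       CopySource={"Bucket": bucket, "Key": old_key},
                       Key=new_key)
        s3.delete_object(Bucket=bucket, Key=old_key)
```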

3) Which one is more cost effective?
“S3 and cloud storage provide elasticity, with an order of magnitude better availability and durability and 2X better performance, at 10X lower cost than traditional HDFS data storage clusters” — analysis by Databricks
“A terabyte of cloud object storage costs about $20 a month, compared to about $100/month for HDFS” — analysis reported by VentureBeat
“substantial savings — Enterprise Strategy Group (ESG) research found that organizations that move from an always-on HDFS deployment on-prem to Cloud Storage (on Cloud Dataproc) typically have a 57% lower total cost of ownership.” — as reported on the Google Cloud blog
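Taking the per-terabyte figures quoted above at face value, a quick back-of-the-envelope comparison (the 100 TB data size is just an illustrative assumption; real prices vary by provider, region, and storage class):

```python
# Back-of-the-envelope monthly storage cost, using the rough per-TB figures
# quoted above; the 100 TB data size is an illustrative assumption.
data_tb = 100
object_store_per_tb = 20    # ~$20/TB/month for cloud object storage
hdfs_per_tb = 100           # ~$100/TB/month for an HDFS cluster

object_store_cost = data_tb * object_store_per_tb   # $2,000/month
hdfs_cost = data_tb * hdfs_per_tb                    # $10,000/month

print(f"Object store: ${object_store_cost:,}/month")
print(f"HDFS:         ${hdfs_cost:,}/month")
print(f"Savings:      ${hdfs_cost - object_store_cost:,}/month "
      f"({1 - object_store_cost / hdfs_cost:.0%})")
```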

References:

HDFS use in Dataproc/EMR:

Google and Databricks link:

Other Useful Links:
