The Future of Hadoop

“Hadoop as a concept revolutionized the world of data processing and ushered in the era of Big Data.”

Almost 20 years ago, Doug Cutting faced two issues in creating a web search engine: how to reliably store all that information, and how to create a massive lookup index. Thus was born Hadoop, which included a distributed, highly available file system, as well as the MapReduce framework for massively parallel computations.
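To make the "massive lookup index" idea concrete, here is a minimal, single-process sketch of the map, shuffle, and reduce steps building an inverted index. It is only an illustration of the programming model, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `build_index`) and the sample documents are made up for this example, and a real cluster would spread each phase across many machines.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (word, doc_id) pair for every word in a document."""
    for word in text.lower().split():
        yield word, doc_id


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, doc_ids):
    """Collapse the doc ids for one word into a sorted, de-duplicated list."""
    return word, sorted(set(doc_ids))


def build_index(docs):
    """Run map, shuffle, and reduce over an in-memory dict of documents."""
    pairs = (
        pair
        for doc_id, text in docs.items()
        for pair in map_phase(doc_id, text)
    )
    return dict(reduce_phase(word, ids) for word, ids in shuffle(pairs).items())


# Hypothetical sample data, purely for illustration.
docs = {
    "page1": "hadoop stores data",
    "page2": "spark processes data",
}
index = build_index(docs)
```

Looking up `index["data"]` returns both pages, while `index["hadoop"]` returns only `page1`. Note that adding a single new document means running `build_index` over everything again, which hints at why this batch-oriented model copes poorly with small incremental changes.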

MapReduce was indeed revolutionary: previously intractable problems could now be solved in a matter of minutes. But it did not take advantage of memory to improve performance, and it was terrible at handling incremental changes, e.g., adding the index for a single new tweet to the existing full web index. In time, Hadoop replaced the original MapReduce framework with Tez, which uses a directed acyclic graph for parallel processing, based on Microsoft’s 2007 Dryad paper. But Tez has been upstaged by another product based on Dryad: Spark. Spark’s implementation is more general purpose, e.g., data at various stages of computation can be efficiently checkpointed and restored. Spark can run in the Hadoop ecosystem (where it will soon replace Tez), or it can run in its own standalone environment. More and more projects are choosing Spark as their Big Data solution, and then, as a secondary decision, choosing between Spark on Hadoop or Spark standalone. Over 25 percent of Spark projects today run outside of Hadoop, and the percentage is rising.
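The directed acyclic graph idea, and the value of checkpointing intermediate data, can be sketched in a few lines of Python. This is a toy scheduler under stated assumptions, not how Tez or Spark are implemented: `run_dag`, the stage names, and the `cache` argument (standing in for a checkpoint) are all hypothetical. It needs Python 3.9 or later for the standard library's `graphlib`.

```python
from graphlib import TopologicalSorter


def run_dag(stages, deps, cache=None):
    """Run each stage after its dependencies, skipping checkpointed stages.

    stages: name -> function taking the results of its dependencies, in order
    deps:   name -> list of upstream stage names
    cache:  name -> previously computed result (a stand-in for a checkpoint)
    """
    results = dict(cache or {})
    executed = []
    graph = {name: deps.get(name, []) for name in stages}
    for name in TopologicalSorter(graph).static_order():
        if name in results:
            continue  # restored from the checkpoint, no recomputation
        inputs = [results[d] for d in deps.get(name, [])]
        results[name] = stages[name](*inputs)
        executed.append(name)
    return results, executed


# Hypothetical four-stage job: two branches fan out from "load" and rejoin.
stages = {
    "load": lambda: [3, 1, 2],
    "sort": lambda xs: sorted(xs),
    "total": lambda xs: sum(xs),
    "report": lambda ordered, total: (ordered, total),
}
deps = {"sort": ["load"], "total": ["load"], "report": ["sort", "total"]}

# First run computes every stage.
results, executed = run_dag(stages, deps)

# Second run restores two stages from a checkpoint and computes only the rest.
checkpoint = {"load": results["load"], "sort": results["sort"]}
resumed_results, resumed = run_dag(stages, deps, cache=checkpoint)
```

On the second call only `total` and `report` are executed, because `load` and `sort` come back from the checkpoint. A real engine would also decide which independent branches can run in parallel; this sketch runs them one at a time for simplicity.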

The Hadoop Distributed File System (HDFS) is also showing its age. For example, it requires an active NameNode in order to function, and it uses ZooKeeper to monitor the NameNode’s availability. As a result, it can experience “brown-outs” of up to a minute while ZooKeeper detects that the active NameNode has crashed. Hadoop has evolved mechanisms to improve availability, but other Big Data systems, such as Cassandra, achieve high availability without the need for a master node or an external monitoring facility, thus eliminating the risk of brown-out.

The trend is clear. Hadoop as a concept revolutionized the world of data processing and ushered in the era of Big Data. But Hadoop as a product ecosystem is certainly showing its age, and, for many use cases, it has been upstaged by more modern technologies like Spark, which had the benefit of learning from Hadoop’s growing pains. Spark has a more generic and extensible programming model, which makes it easier to use for analytics. It can also handle Big Data in Motion, via Spark Streaming, and serves as the basis for a powerful graph processing library (GraphX) and a full-featured data science library (MLlib).
