THE UNSTRUCTURED DISRUPTION

Richard Simoes
Published in noiselessdata
Jul 10, 2015

Big data technology has been a true disruption in the data management industry, with open source communities playing a dominant role in shaping its evolution. Although this amazing innovation started inside Google’s dungeons, the organic growth of new frameworks, libraries, and tools has led to a complex ecosystem that could easily confuse the sharpest architect.

In this post I’ll briefly walk through the most significant events that have guided the evolution of the field over the past 12 years, which I consider a great way to set the stage before we move on to the current state-of-the-art processing frameworks.

In October 2003, Google published the Google File System paper, which is considered the starting point of modern big data systems. This seminal work on distributed file systems, along with the MapReduce paper Google published in December 2004, inspired Doug Cutting and his team, who were working on data management problems of a similar scale, to create Hadoop in 2006 while he was working at Yahoo!.

Hadoop (HDFS + MapReduce) hit web scale in February 2008, when Yahoo! launched what it claimed was the world’s largest Hadoop application: a Linux cluster with more than 10,000 cores whose output was used in every Yahoo! web search query. In June 2009, Yahoo! made the source code of the Hadoop version it ran in production available to the public, and real adoption of the framework began.

A myriad of open source projects developed in the subsequent years added new capabilities or enhancements to the core system. Pig and Hive helped overcome the burden of the MapReduce programming model; Kafka, Flume, and Sqoop were introduced to facilitate data movement in and out of the system; Storm offered a new set of primitives for processing data in real time; Mahout exposed libraries for building data mining and machine learning jobs; and Giraph handled graph processing, just to name a few.

As time passed and adoption grew, the community identified several limitations, especially ones associated with the MapReduce model and how difficult it was to express some workloads, along with missed opportunities for performance improvement through techniques that were already well known at the time, such as in-memory computing and better scheduling.

Those kinds of innovations could only take place in the core of the ecosystem, and that is how a second generation of processing engines was born. Apache Spark is the most prominent project of this new breed, followed by Apache Tez and Apache Flink, the latter of which is gaining traction in some European countries, especially Germany, where it was started.

Currently, Apache Spark is well accepted in the community as a good replacement for MapReduce, but the scope of the project doesn’t end there: it also offers several ready-to-use frameworks built on top of its core to address some of the most common big data workloads, namely interactive SQL analysis, machine learning, streaming, and graph processing. I’ll be showing you how to leverage Spark for many big data projects in the following posts.
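To give a taste of why Spark’s API feels so much lighter than raw MapReduce, here is a minimal word-count sketch in Scala using Spark’s RDD API. The application name, master setting, and input path are placeholders for illustration; on a real cluster you would submit the job with spark-submit against your own data.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Local mode for experimentation; on a real cluster you would
    // submit with spark-submit and drop the setMaster call.
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // "input.txt" is a placeholder path; it could just as well be
    // an HDFS URI, since Spark reads from HDFS out of the box.
    val counts = sc.textFile("input.txt")
      .flatMap(line => line.split("\\s+")) // split each line into words
      .map(word => (word, 1))              // pair each word with a count of 1
      .reduceByKey(_ + _)                  // sum the counts per word

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```

The same job written as a classic MapReduce program needs a mapper class, a reducer class, and a driver; here the whole pipeline is a handful of chained transformations, and inserting a .cache() call would keep an intermediate RDD in memory for reuse, which is exactly the kind of optimization the MapReduce model left on the table.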

Does this mean that Hadoop has been completely displaced by these frameworks? No; what is being replaced is MapReduce. Several other core components of the ecosystem, such as HDFS, are still leveraged by the new frameworks, and in fact some other high-level projects like Hive, Pig, and Cascading are being migrated to the Spark ecosystem.
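As an illustration of that coexistence, the sketch below assumes a Spark 1.x deployment with an existing Hive metastore: Spark’s HiveContext runs a HiveQL query against a table (the name logs and its page column are hypothetical) while the underlying data keeps living in HDFS.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveOnSpark {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HiveOnSpark")
    val sc = new SparkContext(conf)

    // HiveContext reads table definitions from the existing Hive
    // metastore; the data files themselves remain in HDFS.
    val hiveCtx = new HiveContext(sc)

    // "logs" and its "page" column are hypothetical placeholders.
    val topPages = hiveCtx.sql(
      "SELECT page, COUNT(*) AS hits FROM logs " +
      "GROUP BY page ORDER BY hits DESC LIMIT 10")

    topPages.collect().foreach(println)
    sc.stop()
  }
}
```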

A final remark about this story: as Cloudera’s Mike Olson has quipped, “We’re lucky to live in an age where there’s a Google. They live about 5 years in the future, and occasionally send messages back to the rest of us.” I didn’t want to go deeper on each project just mentioned, but if you do the research you will notice that the intellectual contributions coming from Google have been a key foundation for most of them.
