There has been a resurgence of the “Hadoop is dead” narrative, and it seems like every so often this pops up in the form of a blog post or contributed article. For several years now, Cloudera has stopped marketing itself as a Hadoop company, but instead as an enterprise data company. And today, Cloudera is in the Enterprise Data Cloud market: hybrid/multi-cloud and multi-function analytics with common security & governance — all powered by open source.
Having said that, it’s challenging to operate in a sea of negativity around the “Hadoop is dead” narrative. Here is my take — which hasn’t changed in the 13+ years I’ve had the insane privilege to be involved with the open-source community in this space — a space we created, together.
What is Hadoop?
Let’s start with the basics — Hadoop started as a single open-source project at the Apache Software Foundation with HDFS and MapReduce for batch applications, but quickly begat a broad, rich, and open ecosystem. Today Cloudera’s “Hadoop distro” (CDH/HDP/CDP) contains more than 30 open source projects across the spectrum of storage, compute platform (e.g. YARN, and in future Kubernetes), batch/real-time compute frameworks (Spark, Flink etc.), orchestration, SQL, NoSQL, ML, security/governance and much more.
So, if you define Hadoop as just MapReduce, then yes, I would agree that MapReduce is in decline. But that decline hasn’t mattered, thanks to the emergence of Spark, Flink, and all the other innovations we embraced as part of our ecosystem — much to the delight of our customers. That’s the beauty, and the strength, of the platform: it can evolve to embrace new paradigms.
So, if Hadoop isn’t a “project” or a set of projects, what is it?
Personally, “Hadoop” is a philosophy — a movement towards a modern architecture for managing and analyzing data.
Uh, come again?
The “Hadoop Philosophy”
The Hadoop Philosophy has always been about the following tenets:
0. A movement towards a disaggregated software stack with each layer (storage, compute platform, compute frameworks for batch/realtime/SQL etc.) built as composable legos and away from monolithic and inflexible software stacks (e.g. a database with its custom storage format, parser, execution engine, etc. in a vertically integrated fashion).
- This is, in particular, aided by the establishment of an open metadata, security and governance platform to harmonize the disaggregated stack.
1. A movement towards leveraging commodity hardware for large-scale distributed systems and away from proprietary/monolithic hardware+software stacks.
- In economic theory, commodity is defined as a good or service that has full or substantial fungibility with wide availability, which typically leads to smaller profit margins and diminishes the importance of factors (such as brand name) other than price.
- See below for a discussion on how this translates very well, architecturally, to the emergence of the public cloud.
2. A movement towards leveraging open data standards & open source technologies and away from proprietary, vendor-controlled technologies. It’s not merely open standards - the standard is the implementation and not just a “specification”.
3. A movement towards a flexible & ever-changing ecosystem of technologies (MRv1 -> YARN -> K8s, MapReduce -> Spark/Flink etc.) and away from one-size fits all monolithic stacks, enabling innovation at every layer.
In some ways, the “Hadoop Philosophy” is to data architectures what the famous Unix Philosophy, which originated with Ken Thompson, is to software development. Many of the 17 rules for Unix elucidated by Eric Raymond in his famous book, The Art of Unix Programming, apply to this space too:
1. Rule of Modularity: Write simple parts connected by clean interfaces.
- HDFS, YARN/K8s, Spark, Hive etc. are composable and rely on each other.
3. Rule of Composition: Design programs to be connected to other programs.
- Impala, Hive, Spark, etc. are highly composable for end-to-end solutions.
4. Rule of Separation: Separate policy from mechanism; separate interfaces from engines.
- HDFS is as much a file system interface as a file system implementation. This is why Spark can talk to S3 via the Hadoop Compatible Filesystem “API”.
6. Rule of Parsimony: Write a big program only when it is clear by demonstration that nothing else will do.
- We avoid “big”/“fat” layers and instead build modular layers that rely on each other, e.g. Phoenix on HBase.
7. Rule of Transparency: Design for visibility to make inspection and debugging easier.
- Open Source FTW!
16. Rule of Diversity: Distrust all claims for “one true way”.
- Our ecosystem provides multiple tools because they make sense for different scenarios and have different strengths (ETL via Spark or Hive; SQL via Hive/Tez, Impala, LLAP, or SparkSQL).
17. Rule of Extensibility: Design for the future, because it will be here sooner than you think.
- It was impossible to predict the emergence of HBase, Hive, Impala, Spark, Flink, Kafka, etc. back at the beginning in 2005–06, yet over the past 13+ years we’ve done really well to make them first-class, key components of the stack.
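The Rule of Separation above is worth a concrete illustration. Hadoop’s FileSystem abstraction lets compute frameworks like Spark address HDFS, S3, or a local disk through one interface, swapping the storage engine without touching the application. As a loose, illustrative sketch of that pattern — these are hypothetical classes for illustration, not Hadoop’s actual Java API — the separation of interface from engine looks like this:

```python
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """The interface: applications program against this, never a concrete store."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...


class InMemoryFileSystem(FileSystem):
    """One 'engine' behind the interface (a stand-in for HDFS, S3, etc.)."""

    def __init__(self):
        self._store: dict[str, bytes] = {}

    def read(self, path: str) -> bytes:
        return self._store[path]

    def write(self, path: str, data: bytes) -> None:
        self._store[path] = data


def word_count(fs: FileSystem, path: str) -> int:
    """A 'compute framework' that only sees the interface, not the storage engine."""
    return len(fs.read(path).split())


fs = InMemoryFileSystem()
fs.write("/data/input.txt", b"hadoop is a philosophy")
print(word_count(fs, "/data/input.txt"))  # -> 4
```

Point another `FileSystem` implementation at the same `word_count` and nothing else changes — which is exactly how Spark reads `s3a://` paths through the same abstraction it uses for `hdfs://` ones.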
What about the Cloud?
The public cloud (along with private cloud) is clearly going to be an integral part of the deployment architecture for enterprises from this time forward.
Public cloud is essentially commodification of the enterprise hardware infrastructure (servers, networking, data-centers etc.). As such, it is perfectly aligned with the tenets of the Hadoop Philosophy — the focus on commodity hardware. Furthermore, the entire Hadoop ecosystem has always been built to ‘shape-shift’ and absorb new influences — Tom White wrote the first S3-Hadoop connector in 2006, Amazon introduced EMR in 2009.
Contrast this with how hard it is for a legacy database vendor to decompose a monolithic, highly engineered/converged hardware/software stack to make it work “natively” in the public cloud.
Unfortunately, as an industry, we have done a poor job of helping the market (especially financial markets) understand how “Hadoop” differs from legacy technologies in terms of our ability to embrace the public cloud. Something to ponder, and fix.
AWS EMR, Azure HDInsight and Google Dataproc are great examples of how “Hadoop” is driving value and business, at scale, for the public cloud giants within their customer bases.
What about Cloudera?
Cloudera is a data company. We empower people to turn data into clear and actionable insights. We do it by embracing the “Hadoop Philosophy.” We built this market — we are proud of our past, but aren’t blinded by it. We adopt technological waves (public cloud, Kubernetes etc.) because they make sense, benefit our customers, and they are aligned with our mission.
I love the Bezos philosophy: Focus on things that don’t change. A hundred years from now, enterprises will still want to turn data into insights. That’s what we do, and will continue to do so.
Some things have definitely changed for us — things we need to heed. Five years ago, when we were the “it” technology, we got a hall pass. All the cool kids wanted to hang with us, brought us all the use-cases they could find, and showed us off to their friends. To some extent, “the answer is Hadoop — what’s the question?” was the prevailing sentiment. This led to expectations that were unrealistic, or simply too early in the product life-cycle. Now we have to work a little harder to convince customers to adopt what we bring to market, such as CDP — but the value we deliver, and the philosophy behind it, are unquestionable. And work with us they do, as evidenced by the thousands of petabytes of data and millions of analytical applications they run on our collective platforms — today!
Essentially, we will continue to thrive by participating in use-cases where users and enterprises want to store/manage/secure/govern/analyze data. We need to be willing to be misunderstood for a while as this narrative resurfaces and recedes — as we deliver results. All great companies are misunderstood from time to time, but the enduring ones persevere.
I saw this comment on social media the other day:
“If I use CDP with Spark running on Kubernetes analyzing data residing in S3, where is Hadoop?”
I actually laughed out loud, and thought:
As long as you use the CDP service… :-)
Gartner analyst Merv Adrian likes to tell a similar story about a customer who said his “favorite Hadoop application” was using Tensorflow with Spark against S3. Merv asked him why that was Hadoop, and the response was that it was “Hadoop” because the Hadoop team built it. Also, the Spark in use did come from a Hadoop distribution. Hence, Merv makes the point: “Hadoop is often in the eye of the beholder.”
The fundamental goal of CDP is to ensure that, as a cloud service, it makes it much simpler for enterprises to derive value from the platform without dealing with the complexities of the powerful underlying technologies. In particular, the experiences we deliver with CDP — native SaaS-like services for data warehousing and machine learning — make it trivial for business users to drive analytics on data stored in cloud object stores. Furthermore, SDX makes it trivial to set up fully secured data lakes with ABAC and fine-grained policies across data stored in object stores and on-prem HDFS, along with lineage and provenance for governance, and encryption (at rest and on the wire). The progress we’ve made on that front is very, very exciting — as we’ve seen from the feedback provided by many, many enterprise customers!
So, is Hadoop dead?
The old way of thinking about Hadoop is dead — done, and dusted. Hadoop as a philosophy to drive an ever-evolving ecosystem of open source technologies and open data standards that empower people to turn data into insights is alive and enduring.
As long as there is data, there will be “Hadoop”.
Hadoop is dead. Long live “Hadoop.”
Apache Hadoop, Apache Spark, Apache Flink, Apache Hadoop HDFS, Apache HBase etc. are trademarks of the Apache Software Foundation.