The Future of Hadoop and Big Data Analytics

Co-written by Shaofeng Shi and Coco Li.

Coco Li
Kyligence
10 min read · Feb 15, 2022


Photo by: Ansgar Scheffold

In April 2021, the Apache Software Foundation announced the retirement of 13 big data-related projects, 10 of which were part of the Hadoop ecosystem, such as Eagle, Sentry, and Tajo. Now, Apache Ambari, born with the mission of managing Hadoop clusters, has become the first Apache project to be retired in 2022.

Apache Ambari has moved into the Apache Attic

As a complete open-source big data suite, Apache Hadoop has profoundly influenced the entire big data world over the past decade. However, with the development of various emerging technologies, the Hadoop ecosystem has undergone tremendous changes. Is Hadoop really dead? If so, what products and technologies will replace it? And what does the future hold for big data analytics?

This article will look back at the history of Hadoop, analyze the emerging technology options under the cloud-native trend, and offer an outlook on big data analytics over the next 10 years.

Contents

  1. Born for big data
  2. The past 10 years of Hadoop
  3. What kills Hadoop? 3 Key Factors
  4. How to face the Post-Hadoop era — tips for practitioners and tech vendors
  5. More future trends of big data and analytics

Born for BIG DATA

For the past two decades, we have been living in an era of data explosion. The amount of data created by traditional business activities, such as orders and warehousing, has grown relatively slowly, and its share of the total data volume has gradually shrunk.

Image by Author

Instead, massive amounts of human data and machine data (logs, IoT devices, etc.) are being collected and stored in quantities far exceeding traditional business data. The huge gap between these data volumes and our ability to process them has spawned a wave of big data technologies. In this context, what we call the era of big data came into being.

According to the industry consensus, a big data system needs to meet the 3V requirements:

Image by Author
  1. Volume: handles huge amounts of data;
  2. Velocity: processes data at high speed;
  3. Variety: supports a wide variety of data, including structured, semi-structured, and unstructured data, and even images and videos.

Hadoop is exactly such a full-featured big data platform. It contains a variety of components to meet different functional requirements: HDFS for data storage, YARN for resource management, MapReduce and Spark for data processing and computation, Sqoop for ingesting data from relational databases, Kafka for real-time data pipelines, HBase for online data storage and access, Impala for online ad-hoc queries, and so on.
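To make that division of labor concrete, here is a minimal sketch of the classic word count, written in PySpark: HDFS holds the input, YARN would schedule the resources, and Spark (standing in for MapReduce) performs the computation. The cluster address and file paths are hypothetical placeholders, not a reference setup.

```python
# Classic MapReduce-style word count, expressed in PySpark.
# The HDFS namenode address and paths below are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

# Read text files stored on HDFS (hypothetical path).
lines = spark.sparkContext.textFile("hdfs://namenode:8020/data/logs/*.txt")

counts = (
    lines.flatMap(lambda line: line.split())  # map: split each line into words
         .map(lambda word: (word, 1))         # map: emit (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)     # reduce: sum the counts per word
)

# Write the result back to HDFS.
counts.saveAsTextFile("hdfs://namenode:8020/output/wordcount")
spark.stop()
```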

Image by Author

Soon after its birth, Hadoop harnessed clusters of commodity machines for parallel computing and broke the sorting benchmark record previously held by supercomputers. With its strength thus proven, it was widely adopted by companies and organizations of all kinds.

The Past 10 Years of Hadoop

Thanks to the "big data" wave and the influential Apache open-source community, Hadoop quickly became popular, and many commercial companies emerged around it.

The top Hadoop distributions in the market came from three vendors: Cloudera, Hortonworks, and MapR. In addition, public cloud vendors provide hosted Hadoop services such as AWS EMR and Azure HDInsight, which account for the majority of Hadoop's market share.

In 2018, however, the market changed drastically. A piece of big news shocked the Hadoop ecosystem: Cloudera and Hortonworks merged.

News by Chris Preimesberger and Daniel Newman

In other words, the №1 and №2 market players embraced each other to survive. Then HPE announced its acquisition of MapR. These M&As indicated that despite Hadoop's extreme popularity, the companies behind it struggled operationally and found it hard to make money.

After merging with Hortonworks, Cloudera announced that it would charge for all product lines, including the previously open-source versions. Its open-source products are no longer available to everyone, but to paying users only.

Image from Cloudera official website: source

The HDP distribution, which used to be available for free, is no longer maintained or available for download; it will be merged into the unified CDP platform in the future.

What kills Hadoop? 3 Key Factors

It's not unexpected that Hadoop has gradually been losing its aura. Any technology goes through a cycle of development, maturity, and decline, and it seems no technology can escape this "lifecycle curse".

Google Trends shows that interest in Hadoop peaked between 2014 and 2017. After that, there is a clear decline in searches for Hadoop.

What are the reasons for Hadoop’s fall? From my point of view, 3 main factors jointly led to the fall of Hadoop:

  1. New market demands for data analytics and emerging technologies
  2. Fast-growing cloud vendors and services
  3. Increasing complexity of the Hadoop ecosystem

1. New Market Demands for Data Analytics and Emerging Technologies

Looking back at the history of Hadoop, the framework emerged out of a strong demand for big data storage and processing. Today, however, users have new demands for data management and analysis, such as fast online analytics, the separation of storage and compute, and artificial intelligence/machine learning (AI/ML) workloads. In these respects, Hadoop offers only limited support and cannot compete with some emerging technologies: Redis, Elasticsearch, and ClickHouse, all very popular in recent years, can each be applied to big data analysis.

For customers, there is simply no need to deploy the complex Hadoop platform if a single technology can meet their demand.

2. Fast-growing Cloud Vendors and Services

From another perspective, cloud computing has developed rapidly over the past decade or so, not only beating traditional software vendors such as IBM and HP, but also encroaching to a certain extent on Hadoop's big data market.

In the early days, cloud vendors simply deployed Hadoop on their IaaS, as with AWS EMR (claimed to be the most widely deployed Hadoop cluster in the world). For users, Hadoop services hosted on the cloud can be started and stopped at any time, and data can be safely backed up on the cloud vendor's storage services, which makes them easy to use and cost-saving.
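To illustrate that "start and stop at any time" model, here is a minimal sketch using boto3, the AWS SDK for Python, to launch a transient EMR cluster that terminates itself once its work is done. Every name, instance type, and bucket in it is an illustrative assumption, not a recommended configuration.

```python
# Launch a transient Hadoop/Spark cluster on AWS EMR with boto3.
# Cluster name, sizes, and the log bucket are hypothetical placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="ephemeral-analytics-cluster",   # hypothetical cluster name
    ReleaseLabel="emr-6.5.0",             # an EMR release bundling Hadoop and Spark
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Terminate the cluster when its steps finish, so you only
        # pay for the time you actually compute.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-emr-logs/",           # hypothetical S3 bucket for logs
)
print("Started cluster:", response["JobFlowId"])
```

Because KeepJobFlowAliveWhenNoSteps is false, the cluster exists only while it has work to do, which is exactly the pay-for-what-you-use economics that a standing on-premises Hadoop cluster cannot match.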

AWS Data Services: source

Beyond that, cloud vendors offer a range of big data services for specific scenarios, forming a complete ecosystem: durable, low-cost data storage with AWS S3; low-latency key-value storage and access with Amazon DynamoDB; serverless ad-hoc queries over big data with Athena; and more.
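Athena in particular shows how far this moves from cluster-centric Hadoop: a query is just an API call, with no cluster to size or maintain. A minimal sketch with boto3 follows; the database, table, and result bucket are hypothetical.

```python
# A serverless ad-hoc query with Amazon Athena via boto3: no cluster
# to manage, the data stays in S3. Database, table, and bucket names
# below are illustrative assumptions.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

execution = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS n FROM web_logs GROUP BY status",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print("Query started:", execution["QueryExecutionId"])
```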

3. Increasing Complexity of Hadoop Ecosystem

In addition to the emerging technologies and the ever-expanding cloud services, Hadoop itself has gradually been showing "fatigue". Its building-block design is flexible in principle, but in practice it makes the components of the Hadoop ecosystem harder to learn and operate together.

Image from Source

As can be seen from the figure above, the Hadoop ecosystem includes 13 (if not more) commonly used components, posing a huge challenge to Hadoop users in terms of learning and O&M.

How to face the Post-Hadoop era — tips for practitioners and vendors

Will Hadoop ultimately be abandoned? I believe this will not happen anytime soon. After all, Hadoop has a large number of users, for whom platform and application migration would come at an exorbitant cost.

Therefore, the current users will continue to use it, but the number of new users will gradually decrease. This is what we call the “post-Hadoop era”.

For data engineers/practitioners, 3 ways to make the tech transition

In the post-Hadoop era, how should its users handle the transition, and what options are available to them? It largely depends on your budget and your technical capabilities.

First of all, a technology vendor like Cloudera/Hortonworks can no longer afford to release a high-quality product for free; their earlier two-pronged "free version + paid version" approach turned out not to work. Cloudera will only offer the paid version of CDP in the future, signaling the end of the free lunch. Whether other vendors are willing to offer free products is unknown, and even if one appeared, the stability and maturity of its product would be unproven. After all, most of Hadoop's core developers work for Cloudera and Hortonworks.

Image by Author

Second, don't forget that Hadoop is an open-source project hosted by the Apache Software Foundation. Apache software is developed for the public good: it can be obtained, used, and distributed by anyone for free. So if you don't want to pay, there is always Apache Hadoop itself, free to use. After all, a large number of Internet companies still run Apache Hadoop (at their scale, only the open-source version is practical). If they can, why can't you?

Image by Author

However, the open-source builds come with no commercial support and no SLA guarantees. Users have to discover and solve problems on their own, or post questions in the community and wait for answers. If you're okay with that, hire a few engineers and try it out. Be warned, though: Hadoop development and O&M engineers are hard to find and expensive to hire.

Image from Hadoop community meetup

As for the future growth of Apache Hadoop, the roadmap above is taken from a Hadoop community meetup. After 3.0, the new features are clearly less compelling; they mainly concern integration with K8s and Docker, which is not that attractive to big data practitioners.

Image by Author

If neither option above is to your liking, there is a last resort: abandon Hadoop and migrate to another technology platform.

For tech vendors, Key Success Factors in the market

How should vendors in the Hadoop ecosystem respond to the new era? The evolution of Apache Kylin and Kyligence is a good example.

Both the Apache Kylin project and Kyligence were born in the Hadoop era, and initially all Kyligence products ran on Hadoop. About 4 years ago, however, Kyligence foresaw that customer needs were slowly shifting toward cloud-native architectures and the separation of storage and compute.

Having seen such industry trends, Kyligence made a large transformation to its original platform system.

Image by Author

In 2019, Kyligence launched Kyligence Cloud and announced its departure from the Hadoop platform. Kyligence Cloud is built on a cloud-native architecture at the bottom tier: it uses cloud object storage services such as AWS S3 and ADLS for storage and containerized Spark for computing, with resources drawn directly from the cloud platform's IaaS services and ECS. Kyligence kept expanding to multiple clouds and fine-tuning the architecture, and announced in 2021 that it had incorporated new technologies such as ClickHouse into the stack.
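For a feel of what this separation of storage and compute looks like in code, here is a minimal PySpark sketch that computes directly over data kept in S3. The bucket and schema are invented, and this is only an illustration of the general pattern, not Kyligence's actual implementation.

```python
# Storage/compute separation: Spark computes over data that stays in
# cloud object storage (S3 via the s3a connector). Requires the
# hadoop-aws package on the classpath; bucket and column names are
# hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cloud-native-sketch")
    # The S3A connector lets Spark treat S3 as a filesystem; credentials
    # would normally come from the environment or an IAM role.
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

# Compute runs in ephemeral containers; the data never leaves S3.
orders = spark.read.parquet("s3a://my-data-lake/orders/")
orders.groupBy("region").count().show()
spark.stop()
```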

The gains in flexibility, maintainability, and TCO brought by the transformed architecture are substantial, and the market feedback has been very positive.

The key success factor (KSF) for tech vendors is to be extremely fast, perceptive, and bold, both in catching trends and in transforming products.

The big data and analytics market, especially in North America, has become extremely hot and competitive, attracting dedicated investors like few other industries. Vendors can never pay too much attention to market trends: listen to users, observe their new needs, and keep iterating products according to these inputs.

More future trends of big data and analytics

Technology will keep progressing, startups with new missions will come and go, and major corporations will remain resilient. I've written a blog about 7 must-know data buzzwords and the emerging trends of 2022. To keep this one short, I'll just list some (not all) of the interesting trends and share a few good articles about them:

  1. Metrics Store will rise as the ultimate solution to keep the "single source of truth".

Further reading: The missing piece of the modern data stack

  2. The design concept of Data Mesh will continue to influence IT decision-makers.

Further reading: Whitepaper: The Data Mesh Shift

  3. Data API / Data-as-a-product is an abstract but real demand from the market.

Further reading: Every product will be a data product; Big Data, Cloud, AI, and Data Analytics Predictions for 2022

Thanks for reading this whole blog! If you have any opinions on the topics I mentioned, please leave a comment, whether you agree or disagree. I'd really appreciate your feedback.


Coco Li
Kyligence

A former strategy & management consultant @Accenture and @Gallup. Data, analytics, and business intelligence market & trend researcher, observer, and writer.