The new data estate, a story of the post-big-data-hype era in the Cloud — part 1: The history

Mehdi Modarressi
MyCloudyPuzzle
7 min read · Jan 30, 2019

I have been very lucky to enter the data industry right when Data Warehousing was at its peak, to experience the pain of setting up, securing and using Hadoop in enterprise environments, and finally to witness first-hand the rise of cloud computing and the modern data era!
With the power of the cloud, big data processing systems are for the first time mature enough to solve enterprise data problems rather than adding to them and becoming a burden! In this series of posts I will go through all of this as a segue into what problems the cloud has solved and how to take the most advantage of it.

If you could travel a couple of decades back in time, you would find yourself in the Data Warehousing era, when companies were heavily investing in integrating their operational data silos into bigger databases, each called a data warehouse. Data warehouses were being built (and still are) to achieve some key outcomes:

  1. To store the history of changes happening in operational systems. The most common (or cliché) example is a customer changing their address. In an operational system the customer address simply gets updated, with no record of the previous value, and frankly the operational system (a billing system, for instance) wouldn't care what the old value was: all the billing system needs is a valid current address to generate and send invoices and bills to. But as data collection matured and organizations came to realize the value of their data, analytics professionals and business users started to identify valuable insights in these changes, and hence a motivation arose to store them somewhere (a.k.a. the DW). A minimal sketch of this pattern, known as a slowly changing dimension, follows this list.
  2. To integrate information from various silos of operational systems (ERP, CRM, etc.) into a single store known as the Enterprise Data Warehouse (EDW) and reveal insights in the data that were not possible to find from the data sources in isolation.
  3. To re-model the data in a way that fits reporting and analytical workloads (write once, read always). Most commonly, data in operational databases is stored in third normal form, but in data warehouses dimensional modelling, or the star schema, is the more common pattern. A more recent methodology, developed by Dan Linstedt, is called Data Vault.
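To make outcome 1 concrete, here is a minimal sketch of a Type 2 slowly changing dimension (the classic warehousing answer to the address-change example) written in plain Python over in-memory rows. The column names (customer_id, valid_from, valid_to, is_current) are illustrative, not from any particular warehouse:

```python
from datetime import date

def apply_scd2(dimension_rows, customer_id, new_address, change_date=None):
    """Type 2 SCD update: close the current row and append a new version."""
    change_date = change_date or date.today()
    for row in dimension_rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["valid_to"] = change_date   # close out the old version
            row["is_current"] = False
    dimension_rows.append({
        "customer_id": customer_id,
        "address": new_address,
        "valid_from": change_date,
        "valid_to": None,                   # open-ended current row
        "is_current": True,
    })

# Example: the customer moves; the warehouse keeps both versions.
dim_customer = [{
    "customer_id": 42,
    "address": "1 Old St",
    "valid_from": date(2015, 1, 1),
    "valid_to": None,
    "is_current": True,
}]
apply_scd2(dim_customer, 42, "2 New Ave", date(2018, 6, 1))
# dim_customer now holds the historical address (closed on 2018-06-01)
# and the current one, which is exactly the history a billing system drops.
```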

Remember that all this happened in the 1990s (Kimball's The Data Warehouse Toolkit was published in 1996), when memory used to cost around $30 per MB versus $0.0068 in 2018 (a reduction of more than 4,000 times), disk cost $0.20 per MB versus $0.0000245 in 2018 (a reduction of more than 8,000 times!), and processing was much, much more limited and expensive. The reason for mentioning all this is to paint a picture of why a lot of these practices developed: to cope with hardware limitations and make the bang for the tech buck go as far as possible, at the cost of design complexity and many other compromises.
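Those ratios are easy to sanity-check from the quoted figures with a quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope check of the cost ratios quoted above ($ per MB).
memory_1996, memory_2018 = 30.0, 0.0068
disk_1996, disk_2018 = 0.20, 0.0000245

print(f"Memory got ~{memory_1996 / memory_2018:,.0f}x cheaper")  # ~4,412x
print(f"Disk got ~{disk_1996 / disk_2018:,.0f}x cheaper")        # ~8,163x
```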

At the same time, hardware and software vendors (like Microsoft, Oracle and Teradata) were also hard at work optimizing their DBMSs with the most powerful hardware and the most efficient software.

The holy separation!

Years passed, and in the early 2010s computing got cheaper and cheaper, to a point where commodity computing became reliable and efficient enough to handle enterprise workloads. With the help of open source software (the power of community), processing data on clusters of these machines became possible, and this was the birth of Hadoop.
Although Hadoop's main agenda was processing bigger volumes, velocities and varieties of data, it introduced the amazing concept of "separation of storage and compute" to data processing, which later gave a significant uplift to related technologies (in my opinion this was Hadoop's most valuable achievement!). Initially, Hadoop was nothing but HDFS (storage) plus MapReduce (processing). This separation made it possible for numerous other ecosystem projects, including Apache Spark, to rise and shine without needing to reinvent the wheel. A short sketch of what this separation looks like in practice follows.
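A minimal PySpark sketch of the idea, assuming pyspark is installed and the relevant storage connectors (e.g. hadoop-aws for the s3a:// path) are on the classpath; the paths and bucket name are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-compute-separation").getOrCreate()

# Because storage is decoupled from compute, the same processing logic
# can point at an on-premises HDFS path ...
events_hdfs = spark.read.parquet("hdfs:///data/events")

# ... or at cloud object storage, with nothing but the URI changing.
events_s3 = spark.read.parquet("s3a://my-bucket/data/events")

events_s3.groupBy("event_type").count().show()
```

The compute layer (Spark here, MapReduce originally) neither knows nor cares where the bytes live, which is exactly what let the ecosystem swap HDFS for cheaper object storage later on.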

What happened to Hadoop?

If you even haven’t been following the tech trend, for sure noticed the Cloudera and Hortonworks merger and that by itself is proof that the Hadoop market is not big enough for two players anymore! I think the below quote from Mathew Lodge summarizes what has happened strongly.

Mathew Lodge, Anaconda's senior vice president of products and marketing, pointed out in a story on VentureBeat how the center of the big data universe shifted away from Hadoop to the cloud, where storing data in object storage systems like Amazon's S3, Microsoft Azure Blob Storage, and Google Cloud Storage is five times cheaper than storing it on HDFS.

https://www.datanami.com/2018/10/18/is-hadoop-officially-dead/

“It’s the Sears-K-Mart merger,” Teradata COO Oliver Ratzesberger told Datanami this week at the vendor’s user conference. “It’s the only way for them to potentially survive this. Hadoop in itself has become irrelevant.”

https://www.datanami.com/2018/10/18/is-hadoop-officially-dead/

There was certainly more to why enterprises started to wash their hands of Hadoop:

  1. Hadoop itself was a community and academic effort to make cluster computing on commodity hardware a possibility. In reality, the big enterprise organizations who carry the primary financial burden of such technological development wanted a robust, secure, reliable, highly tested system similar to what they already had with their DBMSs. Tremendous amounts of effort have gone into making Hadoop such a product, but the nature of open source projects is usually to develop cutting-edge products organically (the power of community, with the least central management) rather than systematically.
  2. Again, Hadoop being a community initiative, different ecosystem projects were developed using different technologies (the main components in Java, Hue in Python, Impala in C++, etc.). This meant that integrating these components with each other, and making sure they remained compatible with every update or patch, was a huge burden on organizations.
  3. Fully securing a Hadoop implementation and integrating it with the enterprise identity solution (Microsoft AD, for instance) required tremendous effort. (In my experience of working closely with all three primary big data vendors, the quote for securing the cluster was always bigger than the one for installing it!)
  4. The main agenda of Hadoop was to enable big data processing on commodity hardware and hence democratize data processing technology, but large organizations prefer dealing with established hardware vendors to managing commodity hardware in their own data centres.
  5. Technology was progressing so rapidly that traditional organizations were failing to keep up with the human resource requirements and organizational process changes. (There was a period in 2016–2017 when new Apache projects appeared almost weekly, each replacing another product; keeping up with the tech was near impossible for a tech-savvy individual, let alone a massive organization.)

Obviously, Hadoop offered lots of extra capabilities over traditional relational databases and could solve various problems that an RDBMS was not capable of handling (unstructured or semi-structured data, schema-on-read, cheaper storage), but when I think about it, the typical enterprise (the big bank or utility provider) operates in a very different way to digital-era companies like Netflix, Facebook or eBay.

Summary

Putting it all together, Hadoop implementations ended up being seen as large and lengthy projects, involving hardware and software investment, for a slower, less secure, less reliable and too-rapidly-changing data store compared to the traditional EDW. Unfortunately, in the majority of organizations the business processes and use cases were not mature enough to take advantage of all the capabilities of a platform like Hadoop, and even where they were, the cost and risk often outweighed the benefit. But most importantly, it was the appearance of public cloud data platforms that made decision makers and technologists start thinking beyond Hadoop. The quote below from Alex Robbio, the President and Co-Founder of Belatrix Software, in an article in Forbes is a great summary:

Businesses today have a range of options for handling and analyzing large amounts of data, particularly on public cloud platforms. We’re also seeing businesses wanting to move to serverless architectures, where an application can use backend-as-a-service and function-as-a-service functionalities provided by third parties. And here there is the added advantage: that businesses just pay for the compute power and storage that they use.

https://www.forbes.com/sites/forbestechcouncil/2018/11/16/are-we-nearing-the-end-of-hadoop-and-big-data/#5fef65a74e04

Finally, my intention in writing this was not to discuss the future of Hadoop, but rather to set the context on what has accelerated the adoption of the cloud and how the cloud is trying to fill in the gaps left by the previous generation of modern data platforms.

Despite all the above, Hadoop remains a core technology for many enterprises. Together Cloudera and Hortonworks will be able to offer customers a more comprehensive set of services and offerings, such as an end-to-end cloud big data offering and support for more complex deployments. However, the technology world continues to move quickly, and many businesses will already be starting to look beyond the Hadoop technology. It’ll be a fascinating space to watch in the coming years.

https://www.forbes.com/sites/forbestechcouncil/2018/11/16/are-we-nearing-the-end-of-hadoop-and-big-data/#5fef65a74e04

Originally published at mycloudypuzzle.azurewebsites.net on January 30, 2019.



Cloud Solution Architect @ Microsoft. Melbourne, Australia. Data is my passion!