Big Data was a passing fad for enterprises. What now?

Raj Samuel · Published in Nerd For Tech · Nov 22, 2021

We’ve heard it time and again. Big data isn’t delivering the value it promised for enterprises. Executives who spent millions on it won’t admit it yet. There are indeed a few web scale companies for which big data technology was not just beneficial but inevitable. This is for the rest of us.

Just to keep us all grounded in reality, Salesforce’s backend is Oracle and StackOverflow runs on 3 SQL Server databases — both relational databases that predate all the hype in the last 20 years.

The past 20 years — how we got here

Big data was the contribution of web2 — the state of the world wide web we’re in today, led mainly by Google and the FAANG (MAANG?) gang.

Starting in 2003, Jeff Dean, Sanjay Ghemawat, and a few other lesser-known people at Google published a series of well-known papers about the bespoke software that runs Google’s blazing-fast infrastructure — GFS, BigTable and MapReduce, along with other software like protocol buffers, containers, Borg, etc.

Bored developers across the world and at other web-scale companies picked these up one by one, starting with GFS (reimplemented as HDFS in Hadoop), and contributed the clones to Apache.

We called that stack big data.

What went wrong?

Was big data worth it? Google has long detached itself from MapReduce even as enterprises were jumping on the bandwagon. Distributed parallel processing of data could only be achieved with lazy ACID control because, well, life is not fair — per the CAP theorem, when a network of data nodes partitions you can’t have both data consistency and availability to all users (Google Spanner being an exception — see the last section below).

But a few use cases fit well for this scale-consistency trade-off.

Example 1: The scale of Google search vs. a user’s tolerance for slightly stale results on a free web search (implemented on top of GFS)

Example 2: Maintaining the world’s product catalog at Amazon vs. a customer briefly seeing a sold-out product as available (via Dynamo — not to be confused with DynamoDB)

Example 3: Relatively slower retrieval of data about billions of people on Facebook vs. fast writes from billions of people (via Cassandra)

Who made a fool out of the enterprise? The use cases mentioned above are not the kind enterprises usually encounter — a banking transaction or a doctor’s appointment is not a good candidate. The enterprise didn’t need big data; it wanted it. Its data wasn’t big — certainly wasn’t web scale. But individuals wanted it on their résumés. Executives started to worry they might be losing out to competitors who adopted it. Every naïve journalist plastered the words “big data” across articles that even remotely mentioned tech.

The occasional outcry from data experts drowned in the noise. That trend continues today with AI-washing, when we call logistic regression “Artificial Intelligence.”

Developers have a non-negligible role in exacerbating this hype. Application developers who think about data integrity only to pass their Java interviews fiercely adopted big data and schema-less storage, finally liberating themselves from the rigid old-school data team. ORMs like Hibernate had always unburdened them from writing SQL anyway (and for good reasons).

In their defense, relational databases weren’t horizontally scalable, and good commercial databases were extremely expensive. MySQL and Postgres, the predominant open source databases, were not as feature-rich at the time. That has changed though.

The present — what’s happening now

Clusters that are unwieldy. Companies that invested in big data clusters regret setting them up on premises instead of on a public cloud. They deal with cluster upkeep and, regardless of how the cluster is hosted, frequently encounter java.lang errors and SparkContext exceptions instead of direct, clear error messages about data integrity violations.
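For contrast, here is a minimal sketch (hypothetical tables and constraint names) of the kind of declarative integrity rules a relational database enforces out of the box:

```sql
-- Hypothetical sketch: declarative data integrity in a relational database.
CREATE TABLE patient (
    patient_id BIGINT PRIMARY KEY,
    full_name  TEXT NOT NULL
);

CREATE TABLE appointment (
    appointment_id BIGINT PRIMARY KEY,
    patient_id     BIGINT NOT NULL REFERENCES patient (patient_id),
    starts_at      TIMESTAMP NOT NULL,
    ends_at        TIMESTAMP NOT NULL,
    CONSTRAINT chk_appointment_window CHECK (ends_at > starts_at)
);

-- A row that breaks a rule is rejected with an error naming the violated
-- constraint, not a stack trace from a distributed runtime.
```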

Unnecessary layers in the data stack. A few of these companies are serving data from big data clusters into data lakes and/or cloud databases like Snowflake. The caveat is that data lakes themselves are a by-product of big data, where data is stored in object storage systems like S3 and catalogued in SQL-compatible big data interfaces like Hive.

So now they have big data (i.e., a data lake) on top of big data (i.e., an on-premises or cloud-based Hadoop cluster), and often another database (like Snowflake or Redshift) for serving analytics. This incurs unnecessary data movement.
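As a rough illustration of that layering (bucket and table names are made up), a Hive-style external table is just a SQL catalog entry pointing at files already sitting in object storage:

```sql
-- Hypothetical sketch: cataloguing Parquet files in S3 behind a SQL interface (Hive DDL).
CREATE EXTERNAL TABLE sales_orders (
    order_id     BIGINT,
    customer_id  BIGINT,
    order_total  DECIMAL(12, 2),
    order_date   DATE
)
STORED AS PARQUET
LOCATION 's3a://example-datalake/curated/sales_orders/';
```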

SQL is preferred for analyzing structured data. A lot of enterprises are using Spark, which has fast in-memory processing and native analytics libraries (for ML and graph) built on top of it. But many use Spark to do just ETL, taking advantage of its fast RDD abstraction, and then serve analytics outside of Spark in traditional record-by-record SQL form.

Some of these unfortunate companies were old-school enterprises that did not begin with Spark — they spent millions to first convert their SQL databases into RDDs or HDFS, only to serve analytics back in SQL. Now that all processing has moved (or is slated to move) to the cloud (and SQL databases on the cloud indeed scale well), they are quietly wondering what to do with the underlying contraption of a big data stack.
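The ETL-only Spark pattern described above often boils down to something like this Spark SQL sketch (database and table names are hypothetical), after which the output is copied into a warehouse and analyzed with plain SQL anyway:

```sql
-- Hypothetical sketch: Spark used purely as an ETL engine, expressed in Spark SQL.
CREATE TABLE curated.daily_revenue
USING PARQUET
AS
SELECT
    order_date,
    customer_id,
    SUM(order_total) AS revenue
FROM raw.sales_orders
GROUP BY order_date, customer_id;
```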

Should you worry as a big data developer? Technology is sticky. Just look at mainframes. Nevertheless, an array of big data projects to migrate away from that contraption will keep you paid in the coming decades. After all, the tech industry has always been about bundling and unbundling complexity in never-ending cycles.

What’s the immediate future?

SQL is soaring back. In the short to medium term it’s clear that SQL is an indispensable tool, because people want to use and analyze data in simple ways. Today, platforms touted as big data invariably provide a SQL interface. CockroachDB, a distributed database pitched as an answer to NoSQL stores like MongoDB yet built around SQL and ACID, is a classic example.

Contrast that with startups born between 2005 and 2015, which invariably chose big data platforms no matter the state of SQL support on those platforms. That trend is reversing.

Glimpse of Notion app’s data model. https://www.notion.so/blog/data-model-behind-notion

Notion’s data model built on Postgres is a great example of what’s possible on SQL databases even when the schema is unpredictable and can hardly be pre-defined.
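As a rough sketch of that idea (this is not Notion’s actual schema; names are simplified), a single Postgres table can hold loosely structured blocks by pushing the variable attributes into a JSONB column:

```sql
-- Hypothetical sketch of a block-based model on Postgres, loosely inspired by Notion's post.
CREATE TABLE block (
    id         UUID PRIMARY KEY,
    type       TEXT NOT NULL,               -- e.g. 'paragraph', 'todo', 'heading'
    parent_id  UUID REFERENCES block (id),  -- blocks nest under other blocks
    properties JSONB NOT NULL DEFAULT '{}'  -- type-specific attributes live here
);

-- A GIN index keeps ad-hoc queries on the flexible part fast.
CREATE INDEX idx_block_properties ON block USING GIN (properties);
```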

Horizontally scaling relational databases seems like a solved problem, very much so at Google Cloud Platform. Several cloud vendors and database vendors have been at it using the Paxos distributed consensus algorithm. But what really sets Google apart is its own atomic clocks, its TrueTime API, and relentless scaling of infrastructure, including undersea cables.

Google Spanner is a globally distributed relational database with strong consistency (not eventual) and full ACID guarantees.

NoSQL won’t go away. It’s well suited as a purpose-built database serving a narrow use case — search, key-value lookups, graph algorithms, etc.

For instance, Snowflake, despite being a SQL database, internally manages metadata using a key-value store called FoundationDB. SingleStore, a distributed SQL database, manages its indexes using a key-value store. Elasticsearch, or a variant of it, is indispensable for fast search on web apps.

However, it’s worth mentioning that within the NoSQL family, document databases seem to have an identity crisis.

Stonebraker in Redbook 5th Ed. http://www.redbook.io/all-chapters.html

Most relational databases natively support JSON, as Michael Stonebraker (database pioneer and creator of Postgres) correctly predicted over a decade ago, which calls into question the rationale for separate document databases like MongoDB or DynamoDB. Scaling cannot be the deciding factor anymore.
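A small illustration (hypothetical table and data) of querying JSON natively in Postgres:

```sql
-- Hypothetical sketch: native JSON storage and querying in Postgres.
CREATE TABLE product_event (
    id      BIGSERIAL PRIMARY KEY,
    payload JSONB NOT NULL
);

INSERT INTO product_event (payload)
VALUES ('{"sku": "A-100", "action": "view", "meta": {"channel": "web"}}');

-- Filter and project on JSON fields with plain SQL operators.
SELECT payload->>'sku'             AS sku,
       payload->'meta'->>'channel' AS channel
FROM   product_event
WHERE  payload->>'action' = 'view';
```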

Abundance of SQL tools. SQL’s resurgence is underlined by tools that have started to democratize data. Presto (now Trino) and dbt (data build tool) are perhaps apt examples.

Presto is a universal SQL query engine that supports querying object storage (flat files, Parquet files, etc.), relational databases, NoSQL databases, and more.
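A short sketch of what that federation looks like (the hive and postgresql catalogs, schemas, and tables here are assumptions, not a fixed Trino setup): one query joins data sitting in S3 with data sitting in Postgres.

```sql
-- Hypothetical sketch: a federated Trino/Presto query joining an S3-backed Hive table
-- with a regular Postgres table in a single statement.
SELECT c.customer_name,
       SUM(o.order_total) AS total_spend
FROM   hive.curated.sales_orders AS o       -- Parquet files in S3, catalogued via Hive
JOIN   postgresql.crm.customers  AS c       -- a plain Postgres table
       ON o.customer_id = c.customer_id
GROUP BY c.customer_name
ORDER BY total_spend DESC
LIMIT 10;
```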

dbt uses SQL plus Jinja templating (borrowed from the Python world) for conditional code, turning data in a warehouse or staging area into data marts (known as models in dbt) that serve analytics. It also moves documentation about the SQL and the data out of enterprise Confluence pages and under a single roof with the SQL itself. All of it is held together by that universal glue: the YAML config file.
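A minimal sketch of a dbt model (model and source names are made up): plain SQL, with Jinja handling configuration and dependencies.

```sql
-- models/daily_revenue.sql (hypothetical dbt model)
-- Jinja handles materialization and dependency wiring; the rest is plain SQL.
{{ config(materialized='table') }}

SELECT
    order_date,
    customer_id,
    SUM(order_total) AS revenue
FROM {{ ref('stg_orders') }}   -- dbt resolves this to the upstream staging model
GROUP BY order_date, customer_id
```

Running dbt run compiles the Jinja into plain SQL against the configured warehouse and builds the models in dependency order.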

Big data platforms aren’t disappearing. Distributed processing of unstructured data is all the more necessary as machine learning, a data-intensive process, becomes mainstream. So are streaming platforms (e.g., Kafka) and object storage (e.g., S3). Spark, though it may not be used optimally by many, has immense applications in ML, graph, batch processing, and streaming.

Data pipeline automation is flourishing. Continuous Integration/Delivery (CI/CD) has long been an arcane concept in data. Enterprise DBAs still struggle with the idea and fight it. But tools like dbt, Liquibase, and Airflow, combined with collaborative development on git platforms, make automated data pipelines easy to build. Database development can now be as agile as application development.
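As one concrete flavor of that (the changeset IDs and DDL below are invented for illustration), Liquibase can track schema changes as plain SQL changelogs that live in git and run from a CI pipeline:

```sql
--liquibase formatted sql

--changeset dev:1
CREATE TABLE customer (
    customer_id   BIGINT PRIMARY KEY,
    customer_name VARCHAR(200) NOT NULL
);
--rollback DROP TABLE customer;

--changeset dev:2
ALTER TABLE customer ADD COLUMN created_at TIMESTAMP;
--rollback ALTER TABLE customer DROP COLUMN created_at;
```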

Summary

Enterprises followed consumer web companies in adopting big data technologies, driven by many factors, some of them questionable. After all, Google did it, right? But enterprise tech is sticky as hell. Enterprises can’t change tooling as quickly as Google or Netflix does.

SQL, the 40+ year-old declarative programming language for data, is gaining back the ground it lost during the NoSQL trend. While NoSQL has its place, SQL remains the essential tool for analyzing structured data. Catching up to this reversal, interesting tools and processes are coming into the data space.
