Most Data Problems are not “Big Data” Problems

Thomas Nield
97 Things
May 20, 2019

When the “big data” buzzword peaked in 2015, I remember NoSQL, Hadoop, MongoDB, and other unstructured data technologies being touted as the future of analytics. Many organizations started collecting data faster than they could organize and store it, so they simply dumped it on a cluster and scaled horizontally as needed. Many companies spent enormous sums migrating off relational databases like MySQL and onto “big data” platforms like Apache Hadoop.

Amidst this movement, I was teaching an O’Reilly online training course on SQL. One participant suggested that relational databases and SQL might be a legacy technology. The argument went like this: if the lack of horizontal scaling was not reason enough, relational databases also carry the pesky overhead of structuring data in a normalized fashion and of enforcing data validation and primary/foreign keys. The internet and the connectivity of devices caused an explosion of data, so scalability became the selling point of NoSQL and “big data”.

The irony is that SQL interfaces were added to these “big data” platforms, and this happened for a reason. Analysts found NoSQL languages difficult and wanted to analyze data in a relational fashion. The great majority of data problems are best modeled as relational database structures. An ORDER has a CUSTOMER and a PRODUCT associated with it. It just makes sense that these pieces of information should be normalized into separate tables rather than stored as a blob of JSON. Even better, there’s peace of mind in knowing the database software will validate an ORDER and check that the CUSTOMER and PRODUCT actually exist, rather than let data corruption quietly creep in due to bugs on the front-end.
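To make the ORDER, CUSTOMER, and PRODUCT example concrete, here is a minimal sketch using Python's built-in sqlite3 module. The choice of SQLite, the table names (customer, product, customer_order), the columns, and the helper try_insert_order are all my own assumptions for illustration and do not come from any particular production schema. It shows the database itself refusing an order that points at a customer who does not exist.

```python
import sqlite3

# In-memory database, used here purely for illustration.
conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this is enabled on the connection.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical normalized schema: one table per entity.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE product (
    product_id  INTEGER PRIMARY KEY,
    description TEXT NOT NULL
);
CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    product_id  INTEGER NOT NULL REFERENCES product(product_id),
    quantity    INTEGER NOT NULL CHECK (quantity > 0)
);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Acme Corp')")
conn.execute("INSERT INTO product VALUES (10, 'Widget')")


def try_insert_order(conn, order_id, customer_id, product_id, quantity):
    """Return True if the database accepts the order, False if it rejects it."""
    try:
        conn.execute(
            "INSERT INTO customer_order VALUES (?, ?, ?, ?)",
            (order_id, customer_id, product_id, quantity),
        )
        return True
    except sqlite3.IntegrityError:
        # Raised when a foreign key or CHECK constraint is violated.
        return False


# Valid order: customer 1 and product 10 both exist.
accepted = try_insert_order(conn, 100, 1, 10, 5)
# Invalid order: customer 999 does not exist, so the database rejects it.
rejected = not try_insert_order(conn, 101, 999, 10, 1)

# Analysts can then join the normalized tables back together.
rows = conn.execute("""
    SELECT o.order_id, c.name, p.description, o.quantity
    FROM customer_order AS o
    JOIN customer AS c ON c.customer_id = o.customer_id
    JOIN product  AS p ON p.product_id  = o.product_id
""").fetchall()
```

The design point is that the validation lives in the schema rather than in application code, so a front-end bug that submits a bad customer ID fails loudly at insert time instead of leaving an orphaned record behind.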

The truth is most data problems are not “big data” problems. Anecdotally, 99.9% of problems I’ve encountered are best solved with a traditional relational database.

There are definitely valid cases for using NoSQL platforms, especially when an enormous amount of unstructured data has to be stored (think social media posts, news articles, and web scrapes). But with operational data, relational databases force you to think carefully about how your data model works and to get it right the first time. It is rare for operational data of this nature to grow so large that a relational database cannot be scaled to handle it.

I will leave you with this table to aid your SQL versus NoSQL decisions:

Author of Essential Math for Data Science (O’Reilly) and Getting Started with SQL (O’Reilly)