Future of NoSQL in Modern Data Stack

Ivan Begtin
May 27, 2022

The Modern data stack is a new concept of interconnected data products, with a different architecture than enterprise all-in-one data platforms. One more interesting topic within the Modern data stack approach is NoSQL products such as MongoDB, ElasticSearch, Redis, Neo4J, ArangoDB, and similar graph and NoSQL databases.

The problem is that most Modern data stack tools are created with two common ideas in mind:

  • everything is a [flat] table
  • SQL everywhere

This problem is not new. Years ago, NoSQL products became popular as instruments for overcoming these limitations. If you wanted to avoid the limits of a SQL DBMS, you could install MongoDB, load JSON data with complex hierarchical objects directly, and access this data using its internal query language (MQL) or JavaScript inside the MongoDB engine.
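To make the gap between hierarchical documents and the "everything is a flat table" idea concrete, here is a minimal Python sketch. The `order` document and the `flatten` and `explode` helpers are hypothetical illustrations, not part of MongoDB or of any tool mentioned here. They show what has to happen to a nested document before a table-oriented tool can work with it.

```python
from typing import Any, Dict, List

# Hypothetical order document, shaped the way it could be stored in MongoDB.
order = {
    "_id": 1,
    "customer": {"name": "Alice", "address": {"city": "Berlin"}},
    "items": [
        {"sku": "A1", "qty": 2},
        {"sku": "B7", "qty": 1},
    ],
}


def flatten(doc: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Collapse nested dicts into dotted column names; other values are kept as-is."""
    row: Dict[str, Any] = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{name}."))
        else:
            row[name] = value
    return row


def explode(doc: Dict[str, Any], list_field: str) -> List[Dict[str, Any]]:
    """Produce one flat row per element of list_field, repeating the parent fields."""
    parent = {k: v for k, v in doc.items() if k != list_field}
    return [flatten({**parent, list_field: item}) for item in doc[list_field]]


rows = explode(order, "items")
for row in rows:
    print(row)
```

One document becomes two rows with repeated customer columns. This is the price of fitting hierarchical data into a flat table.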

This approach is very convenient if you have a lot of data without strict integration rules, or if the integration serves operational rather than analytical use cases.

But the Modern data stack approach is different. It focuses on integration, and several key kinds of products are integrated with each other: data transformation services, orchestration, and observers of data processes.

It’s described very well in Emerging Architectures for Modern Data Infrastructure [1], the most cited text about Modern Data Stack.

And I see a big problem: most Modern data stack tools do not support NoSQL databases.

Data processing

dbt is the brightest example of Modern Data Stack tools. Right now it’s one of the most common, de facto standard approaches to data transformation in ELT. I see more and more products, like Jitsu, Meltano, etc., integrating with dbt even if they have built-in data transformation functions and modules. This happens because users like and request dbt: it’s very popular, and the idea of such a data transformation tool has proven successful.

The idea is to apply data transformation tasks inside the data warehouse using SQL queries. But most NoSQL database engines do not support SQL! Or the support is very limited: it is provided for integration purposes and used for data analytics, since most advantages of NoSQL databases do not fit the SQL query language.
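The "transform inside the warehouse with SQL" idea can be sketched in a few lines of Python. This is only an illustration of the pattern, not dbt's actual implementation. An in-memory SQLite database stands in for the warehouse, and the `raw_orders` and `customer_totals` table names are made up for the example.

```python
import sqlite3

# In-memory SQLite stands in for the data warehouse (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount REAL);
    INSERT INTO raw_orders VALUES
        (1, 'alice', 10.0),
        (2, 'bob', 5.5),
        (3, 'alice', 4.5);
    """
)

# The "model" is just a SELECT statement; the tool materializes it as a table.
model_sql = """
    SELECT customer, COUNT(*) AS orders, SUM(amount) AS total_amount
    FROM raw_orders
    GROUP BY customer
"""
conn.execute(f"CREATE TABLE customer_totals AS {model_sql}")

result = conn.execute(
    "SELECT customer, orders, total_amount FROM customer_totals ORDER BY customer"
).fetchall()
print(result)
```

The whole transformation is expressed as SQL that the database itself executes, which is exactly what an engine without SQL support cannot offer.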

dbt does not support NoSQL query languages, and it is only now getting support for Python as a language for data transformations. But Python is not native to NoSQL either.

Data orchestration and data pipelines

NoSQL databases like MongoDB or ElasticSearch are supported by several data orchestration tools such as Airflow, but this is not common among data orchestration and data pipeline tools. For example, Meltano, a very attractive open-source ELT/ETL engine, can extract data from MongoDB but doesn’t support loading data into MongoDB, and it doesn’t support ElasticSearch at all. Other modern open-source ELT/ETL tools like Prefect or Dagster don’t support MongoDB at all.

I see that MongoDB is outside ETL developers’ field of vision. It is sometimes considered as a data source, but not as a data destination, even though MongoDB is the most popular NoSQL product.

The simplest reason for this is that data loading tasks are much simpler with a common interface, and SQL is this common interface.

Data catalogs

There are a lot of corporate data catalog products right now. Many of them are open-source software, and several open standards are under development, like Open Metadata. But data catalogs are built on SQL-first ideas. I reviewed more than 20 corporate data catalogs in 2021, and only DataHub by LinkedIn supported MongoDB natively. All the others supported SQL databases only. They didn’t support JSON/JSON Lines/XML datasets at all.

Even open standards like Open Metadata support only flat table data descriptions.

It’s not so hard to get the data structure from MongoDB collections, but it requires analyzing the embedded data structures and seeing the data not as the flat tables common for SQL databases.
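As a rough illustration of what such an analysis could look like, the sketch below walks a sample of documents and records the types seen at every nested path. The `infer_schema` and `collect_types` helpers, the path notation (`a.b` for embedded fields, `[]` for array elements) and the sample documents are assumptions made for this example. In practice the documents would be sampled from a real collection.

```python
from collections import defaultdict


def collect_types(value, path, schema):
    """Record the type found at path, then recurse into embedded dicts and lists."""
    if isinstance(value, dict):
        if path:  # the root document itself has no path
            schema[path].add("object")
        for key, child in value.items():
            collect_types(child, f"{path}.{key}" if path else key, schema)
    elif isinstance(value, list):
        schema[path].add("array")
        for child in value:
            collect_types(child, f"{path}[]", schema)
    else:
        schema[path].add(type(value).__name__)


def infer_schema(documents):
    """Return a mapping of nested path -> sorted list of type names seen there."""
    schema = defaultdict(set)
    for doc in documents:
        collect_types(doc, "", schema)
    return {path: sorted(types) for path, types in sorted(schema.items())}


# Hypothetical sample of documents from one collection.
docs = [
    {"name": "Alice", "tags": ["a", "b"], "address": {"city": "Berlin"}},
    {"name": "Bob", "age": 41, "address": {"city": "Oslo", "zip": "0150"}},
]

schema = infer_schema(docs)
for path, types in schema.items():
    print(path, types)
```

The result is a tree of paths rather than a list of columns, and a flat-table metadata model has no natural place for it.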

I see that corporate data catalogs are getting more features to support data lineage, dashboards, data pipelines, etc. But NoSQL is outside of developers’ priorities.

Altogether, this makes the future of NoSQL not so bright. As long as you use it for small products it’s OK, but if your projects grow and you need to build your data stack, NoSQL products are less prepared for integration.

Why did it happen?

I see two primary reasons:

1. Low popularity of NoSQL products [2]. Only MongoDB is quite popular; the others are not, and they have very niche usage.

2. Strong marketing efforts of modern cloud SQL databases like Snowflake, Redshift, Google BigQuery, etc.

What can we do?

Any problem is an opportunity too. If dbt will not support NoSQL, then perhaps other dbt-like products for NoSQL could appear and compete?

Advanced SQL connectors for NoSQL databases are probably something we are missing, and this could be a small but growing market.

Or maybe NoSQL is a legacy approach right now and we should consider migrating from it to modern cloud SQL databases.

Or maybe something else, please share your thoughts.

Links:

[1] https://future.a16z.com/emerging-architectures-modern-data-infrastructure/

[2] https://db-engines.com/en/ranking

#databases #nosql #sql #dbt #moderndatastack

Ivan Begtin

I am the founder of APICrafter. I write about Data Engineering, Open Data, Data, Modern Data stack and Open Government. Join my Telegram channel https://t.me/begtin