De-coupling of Compute & Storage in Modern Databases

Published in

Wrong AI

3 min readNov 24, 2017

Databases started as integrated storage-compute solutions providing both a durable persistence layer (storage) as well as a query processing engine (compute). Given the growing variety of data types, changing analytical processing needs, and the need to space compute and storage independently, a new breed of databases is emerging that decouples compute and storage i.e., query engines that are interoperable with multiple storage persistence solutions. This post uncovers the evolution of this new breed referred to as SQL-on-Everything engines.

Lets start with a brief history: Relational databases emerged as specialized data management solutions in 1980s. The solution combined a storage layer for persisting/retrieving data, as well as a query engine that translated SQL queries into execution plans. The early years of database era focussed on Online Transaction Processing (OLTP) — data consistency in the form of ACID was critical. These ACID guarantees came at the trade-off of scalability (scale-up instead of scale-out), performance, and availability. An interesting study from Stonebraker et. al. showed that only 4% of the CPU cycles were spent doing real-processing, while 96% was spent in latching, locking, recovery management, and bufferpool processing.

Over the years, the data needs of the application have changed beyond OLTP. Enterprises have increasingly relied on data warehousing to extract insights and actionable reports from the data (referred to as Online Analytical Processing of OLAP). OLAP involved ETL of the data from the operational databases, with complex queries running for minutes and hours in a batch processing fashion. With Web 2.0, the role of databases further expanded beyond managing customer transactions into aggregating a large variety of activity data (clickstreams, logs), monitoring sensors, external data sources (social media) — this information is aggregated continuously to derive actionable insights.

Given the changing requirements, database solutions have evolved to handle structured,semi-structured, and unstructured data.
Irrespective of the specifics of these solutions (traditional, SQL, NoSQL), they all consist of some sort of Storage Engine, and a Query Engine. During the era of relational ACID compliant, the query engines were sophisticated, and storage engines were simplistic single server block drivers or file-systems (scale-up model). In the NoSQL era, the storage layer became sophisticated and distributed, while the Query engine got simplified moving away from transactional & joins to point lookups, range scans, filtering & aggregation.

Today, organizations have a growing number of data silos consisting of OLTP databases, OLAP Warehouses/Data Marts, NoSQL stores, Event-stores, In-memory Data Grids, etc. The challenge today is to actively analyze and correlate data across all the silos (also referred to as the Data Fabric). Further, to avoid the learning curve wth custom data analytics APIs, there is an increasing push towards using SQL as the analysis language. These aspects are being addressed by SQL-on-Everything engines that federate across multiple data stores. Apache Drill, Hive, Presto, Spark, Pivotal HAWQ, Apache Phoenix, Apache Terafodin, Snowflake, AWS Athena & Redshift, etc., are examples of this new breed of de-coupled query engines. In these engines, the schema of the data is persisted in a metadata catalog such as Hive catalog or equivalent. The data is not ETL’ed but rather referenced as an external table. Standardization of on-disk data formats is further helping the SQL-on-Everything trend. In addition to data formats, functionality such as query push-down in Parquet and ORCFile optimize the data processing by returning only the filtered data results to the query engine.

In summary, there is no silver bullet in Big Data solutions. Enterprises will continue to have silos of databases and analytical solutions specialized for specific use-cases. SQL-on-Everything engines will play an important role in creating a data fabric that can “join” data from multiple sources.

Originally published on November 24, 2017.

De-coupling of Compute & Storage in Modern Databases

Written by Sandeep Uttamchandani