NoSQL Jungle: A guided tour in the jungle of NoSQL
NOSQL DATABASES
The term NoSQL, meaning “Not Only SQL”, first appeared in the late 1990s in reference to the new systems built to address the requirements of web and cloud data management and overcome the limitations of traditional SQL technology (see our blog post on SQL vs. NoSQL vs NewSQL for a comparison among different approaches). Those limitations are a lack of horizontal scalability, inefficient data ingestion, rigid schema, and difficult support of complex data such as documents and graphs.
As an alternative to traditional SQL databases, which support the standard relational model, NoSQL systems support data models and query languages other than standard SQL. They typically emphasize scalability (at the expense of consistency), flexible schemas, and practical APIs for programming complex data-intensive applications. To provide scalability, NoSQL systems typically use a scale out approach in a shared-nothing cluster (see blog post on shared-nothing) with replication for availability.
NOSQL WITHIN THE BIG DATA SOFTWARE STACK
There are four main categories of NoSQL systems (Özsu & Valduriez, 2020) based on the underlying data model, e.g., key-value data stores, wide column stores, document stores, and graph databases. Within each category, we can find different variations of the data model (unlike the standardized relational data model) and different query languages or APIs. However, for document stores, JSON is becoming the de-facto standard. There are also multi-model data stores that combine multiple data models, typically document and graph, in one system.
KEY-VALUE DATA STORES AND WIDE COLUMN STORES
Key-value data stores and wide column stores are sometimes grouped in the same category as they both are schemaless and share many features. In the key-value data model, all data is represented as key-value pairs, where the key uniquely identifies the value. Key-value systems are schemaless, which yields great flexibility and scalability. They typically provide a simple interface such as put (key, value), value=get (key), and delete (key). Wide column stores can store rows as lists of attribute-value pairs. The first attribute is called major key or primary key, e.g., a social security number, which uniquely identifies one row among a collection of rows, e.g., people. The keys are usually sorted which enables range queries as well as ordered processing of keys. This is a capability of wide column stores that key-value stores typically don’t support.
Key-value data stores have a distributed architecture which scales out linearly in a shared-nothing cluster. The rows of a key-value collection are typically horizontally partitioned, using hashing or range on the key, and are stored over a number of cluster nodes. Key-value data stores are also good at ingesting data efficiently using SSTables (they claim to use LSM trees but actually use SSTables, see our blog post on B+ Trees, LSM trees and SSTables), a data structure and ingestion algorithm that is very efficient at ingesting data by using single I/Os to insert many rows. Key-value data stores address the rigid schema issue, simply by being schemaless, and each row can have a different structure, i.e., an arbitrary set of columns.
These new features of key-value data stores impact performance. First, SSTables are inefficient for querying data, so the cost of reading data in key-value data stores is much higher than in SQL databases that use B+-trees. Second, schema flexibility implies that the data representation becomes more expensive in terms of space because column names have to be coded and stored within the rows. What is more, the techniques generally used to pack columns of same sizes cannot be applied, so the data representation is more inefficient. Thanks to their fixed schema, SQL databases can organize the columns within the rows in a way that it is very efficient in terms of space and access time.
Finally, key-value stores trade strong data consistency for scalability and availability relying on different ways of controlling consistency such as eventual consistency of replicas, conditional writes, and eventually consistent and strongly consistent reads.
Wide column stores combine some of the beneficial properties of SQL databases (e.g., representing data as tables) with the flexibility (e.g., schemaless data within columns) and scalability of key-value stores. Each row in a wide column table is uniquely identified by a key and has a number of named columns. However, unlike a relational table where columns can only contain atomic values (i.e., a binary string), a column can be wide and contain multiple key-value pairs. Wide column stores extend the key-value store interface with more declarative constructs that allow scans, exact-match, and range queries over column families. They typically provide an API for these constructs to be used in a programming language.