Relational and NoSQL databases comparison

Alexey Samoshkin
18 min read · Sep 30, 2023


In this blog post, I want to explore key differences between RDBMS and NoSQL data stores. I will touch on differences in design philosophy and also talk about their distinct features and nuances.

The post consists of two major sections: the first explains everything related to relational databases, and the second covers the NoSQL approach. Each section starts with a summary (aka TL;DR).

Relational databases

Overview

  • Based on a flat, two-dimensional relational model. This model is built according to normalization rules in order to reduce data duplication.
  • The relational model does not require you to know data access patterns and queries upfront when designing the database schema. It allows you to run new, unforeseen queries without changing the schema design.
  • The relational model trades off read performance for ease of writes.
  • Strict schema (also called schema-on-write) means that the schema must be predefined before you can store any data.
  • Developers have to bridge the gap and translate between the relational and object models in order to retrieve and manipulate data.
  • Puts an emphasis on data integrity rules and strong consistency.
  • Provides flexible and powerful querying capabilities based on the SQL standard. JOINs, complex aggregations, subqueries, and recursive queries are supported. Query shape and size are limited only by the developer’s imagination.
  • Query performance is a black box and depends on multiple factors: table size, concurrent queries, presence of indexes, index size.
  • Not designed to scale out, due to the interconnected nature of the data and the ultra-flexible query language. Suitable for vertical scaling only.
  • Supports ACID-compliant transactions that can span multiple tables and rows.
  • Favors strong consistency over availability and latency in replicated scenarios.
  • Can hold part of the business logic on the database side in the form of stored procedures and triggers.
  • Has a connection-oriented model that relies on long-lived connections between the client and the database, and on connection pooling to manage connections efficiently.

Use cases

When to use:

  • For structured relational data with a fixed schema.
  • When data integrity and strong consistency are crucial. This is achieved through strict schema, constraints, and ACID transactions.
  • When your data relies heavily on interconnections and relationships.
  • When powerful query capabilities are needed, particularly if your application often uses joins and aggregations.
  • When you don’t know data usage scenarios upfront, or don’t have a primary access pattern for manipulating the data.

When not to use:

  • If the relational model isn’t suitable for your data, such as with unstructured or semi-structured data.
  • When you prioritize high write throughput and have low latency requirements.
  • When high and predictable read performance is essential.
  • When dealing with large data volumes, and you need to scale horizontally.
  • If you anticipate regular changes to data structures. Continually altering database schema can be cumbersome and can impact runtime performance.
  • If your data consists of small records with complex many-to-many relationships or if you’re representing data as nodes and edges (graph databases are better suited for this).
  • For large-scale data analytics and processing.

Relational model

The relational data model structures data into tables (relations) and rows (tuples). Each tuple represents a collection of name-value pairs. Columns can reference rows from the same table or another, establishing relationships between data points.

Every SQL operation takes and produces relations, adhering to the principles of relational algebra. This uniformity puts some constraints on the data model. Specifically, table structure has to be flat, meaning tuples don’t accommodate nested structures like records or lists. In a nutshell, relational model can be viewed as a set of flat two-dimensional tables.

Relational model focuses on data integrity rules. It ensures that relationships between tables remain consistent (called referential integrity): if a foreign key in one table matches a primary key in another, the corresponding record must exist. This prevents orphaned records and ensures that interlinked data accurately represents real-world associations and remains reliable throughout its lifecycle.

It strives to reduce data duplication and data anomalies through a process called normalization. Normalization is a set of rules on how to break down larger tables into smaller ones, eliminating repetitive or duplicated data. This process conserves storage; more importantly, it makes data easier to update, as any change needs to be made in just one location. While updates become easier, reading data scattered across tables requires extensive use of JOINs and impacts query performance. In other words, the relational model trades off read performance for ease of writes.
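To make the trade-off concrete, here is a minimal sketch using SQLite (table and column names are illustrative): renaming an author is a single-row update, but reading a book together with its author requires a JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
# Normalized schema: author details live in exactly one place.
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""CREATE TABLE books (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    author_id INTEGER NOT NULL REFERENCES authors(id))""")
conn.execute("INSERT INTO authors VALUES (1, 'E. F. Codd')")
conn.execute("INSERT INTO books VALUES (1, 'The Relational Model', 1)")

# Easy write: one UPDATE fixes the author name everywhere.
conn.execute("UPDATE authors SET name = 'Edgar F. Codd' WHERE id = 1")

# The read side pays for it: related data is reassembled with a JOIN.
row = conn.execute("""SELECT b.title, a.name
                      FROM books b JOIN authors a ON a.id = b.author_id""").fetchone()
print(row)  # ('The Relational Model', 'Edgar F. Codd')
```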

The relational model differs from the in-memory data structures used in applications. Developers have to translate data between object and relational representations to address the mismatch between these two worlds. ORM frameworks often handle much of this heavy lifting, but they can introduce extra complexity of their own and sometimes abstract the developer too far from the underlying database. This can result in complex, large queries being generated by the framework without the developer even realizing it.

Strict schema

SQL databases have a strict schema, also called schema-on-write. Before you can store any data, its schema must be predefined. The schema dictates the structure of the database, detailing the tables, their respective columns, and the data types each column can contain.

The goal of a strict schema is to enforce data consistency and integrity rules. It also acts as a form of documentation, so one can quickly grasp the data model from the table definitions without having to extract implicit knowledge of the database structure from application code. Finally, since the database knows the structure of the data upfront, it can store it more efficiently.

There are several drawbacks, though. A single predefined schema demands data uniformity (all records within a table must have the same structure), and it becomes challenging to accommodate non-uniform or semi-structured data. A strict schema is an inflexible schema, which means less agility. As data requirements evolve, making frequent schema changes can be cumbersome, slowing down development and causing negative runtime effects such as holding table-wide locks for the duration of an “ALTER TABLE” operation.
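A small sketch of schema-on-write enforcement with SQLite (table and column names are illustrative): rows that violate declared constraints are rejected before they are ever stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id))""")
conn.execute("INSERT INTO users VALUES (1, 'ada@example.com')")

try:
    # NOT NULL constraint: a user without an email is rejected outright.
    conn.execute("INSERT INTO users VALUES (2, NULL)")
    rejected_null = False
except sqlite3.IntegrityError:
    rejected_null = True

try:
    # Referential integrity: an order pointing at a missing user is rejected.
    conn.execute("INSERT INTO orders VALUES (1, 999)")
    rejected_orphan = False
except sqlite3.IntegrityError:
    rejected_orphan = True

print(rejected_null, rejected_orphan)  # True True
```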

Querying

SQL isn’t just another database language; it has become an international standard and gained wide adoption. Different vendors have their own SQL dialects, but the core essence remains consistent across them.

SQL comes with ultra-powerful querying capabilities like complex joins, aggregations, subqueries, and recursive queries. With JOINs you can easily fetch multiple heterogeneous items from a database within a single network round-trip.

You can go ahead and create a normalized data model without thinking about access patterns upfront. Then you can fetch and analyze data in many different unforeseen ways without needing to modify the database schema design to adopt a new query pattern. This makes relational databases the right choice for applications or startups when you don’t know data usage scenarios upfront. Also, the flexibility of querying makes SQL databases suitable for OLAP operations, where multi-dimensional analytical queries are used to provide answers to complex business questions.
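As a sketch of this flexibility (SQLite, with illustrative table names): the normalized schema below answers an analytical question nobody planned for, using a JOIN plus GROUP BY, with no schema change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 25.0), (3, 2, 5.0);
""")

# An "unforeseen" query: orders and total spend per customer.
rows = conn.execute("""
    SELECT c.name, COUNT(o.id) AS order_count, SUM(o.total) AS spent
    FROM customers c LEFT JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id ORDER BY spent DESC
""").fetchall()
print(rows)  # [('Ada', 2, 35.0), ('Grace', 1, 5.0)]
```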

Performance

The downside of flexible querying is the fact that queries are relatively expensive and don’t scale well in high-traffic and high data volume situations.

Operations like JOINs inherently have poor time complexity because JOINs are computed by the database at query time. Aggregations are heavy as well and often need to traverse vast amounts of data. As your tables grow, these operations get slower and slower.

Also, there’s no inherent limit to the amount of data you can scan in a single request to your database, which means there’s a risk that a single misconfigured or inefficient query can consume disproportionate resources and stall other concurrent operations.

Indexes are commonly used to speed up queries. The most common type of index in a relational database is the B-tree index. When you create an index on a column, the database builds a B-tree data structure over that column’s data, then uses this tree to quickly locate rows containing specific column values. Other index types are usually supported as well: composite, unique, bitmap, and hash indexes. Relational databases come with a query planner that automatically chooses the right index for a given query, so the developer does not have to specify an index explicitly at query time.
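The planner's behavior can be observed directly. The sketch below uses SQLite's EXPLAIN QUERY PLAN (other databases have similar EXPLAIN facilities; the exact output text varies by version): once the index exists, the planner silently switches from a full table scan to an index search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT)")

# Without an index, the only option is scanning the whole table.
before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 42").fetchone()[-1]
print(before)  # e.g. "SCAN events"

conn.execute("CREATE INDEX idx_events_user ON events(user_id)")

# Same query, no hint needed: the planner now picks the B-tree index.
after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 42").fetchone()[-1]
print(after)  # e.g. "SEARCH events USING INDEX idx_events_user (user_id=?)"
```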

To conclude, with relational databases, performance is a black box, subject to multiple factors: table size, index size, concurrent queries, and the presence and design of indexes.

Scaling

Relational databases are not designed to scale out or to run on clusters, due to their highly interconnected data.

Horizontal scaling works best when you can shard the data so that a single request can be handled by a single machine. The flexibility of the query syntax makes it impossible for the database to know which pieces of data will be fetched until the query is executed. It’s unclear how to partition the data so that executing a single JOIN doesn’t require the database to fetch data from multiple servers, triggering numerous network requests. As such, all data needs to be kept local, on the same node, to avoid cross-machine network calls when executing a query.

Also, partitioning the data will break other guarantees that are inherent to relational databases, such as referential integrity and ACID transactions.

Relational databases are suitable for vertical scaling. Nevertheless, there are usually limits on the maximum number and size of the data files used under the hood, which imposes an upper bound on scalability. And at truly high data volumes, it becomes impossible to host all the data on a single machine anyway.

Strong consistency

Ensuring data consistency and integrity is one of the key features of relational databases. Data integrity is enforced by strict schema, explicit data types, constraints and other mechanisms like triggers. Referential integrity is enforced via foreign keys: this ensures that relationships between tables make logical sense and prevents any anomalies due to orphaned records.

Transactions abstract the complexity of concurrency away from the application developer. All concurrent access to data is controlled through transactions.

Transactions are defined by the ACID properties. Atomicity ensures that operations within a transaction are all-or-nothing: a transaction either fully completes, with all its operations successfully executed, or, if any part of it fails, none of the operations take effect, preventing any half-done work. Consistency guarantees that the database remains in a valid state after a transaction, with all relevant consistency checks (outlined above) validated upon the transaction’s completion. Isolation addresses the behavior of the database when concurrent transactions execute at the same time. Durability assures that once a transaction has been committed, the changes are permanent; even in the face of system failures, the database ensures that all committed transactions survive.
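Atomicity can be demonstrated with the classic money-transfer sketch (SQLite; the accounts table and amounts are illustrative). The credit succeeds, the debit then violates a CHECK constraint, and the rollback undoes the already-applied credit, so neither balance changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        # This credit succeeds...
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE id = 2")
        # ...but this debit would overdraw account 1, violating the CHECK.
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 1")
except sqlite3.IntegrityError:
    pass  # the whole transaction is rolled back, including the credit

balances = conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall()
print(balances)  # [(100,), (0,)] — both rows unchanged
```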

Replication technologies are limited and typically choose consistency over availability in the face of a network partition (per the CAP theorem). They traditionally perform synchronous replication in order to provide a strongly consistent view of the data.

Connection model

Relational databases are connection-oriented, meaning that a dedicated connection is established between the client and the server for the duration of a session. If the wire protocol used to talk to the database does not support pipelining, then each connection is exclusively locked for the duration of a query: the client has to wait for the reply to the previous query before it can reuse the same connection to send the next one. Moreover, each connection consumes server resources, and establishing a new connection comes with overhead (TCP and TLS handshakes, database authentication, etc).

To mitigate these issues, connection pooling is often employed. Connection pooling maintains a cache of database connections that can be reused by multiple clients. Instead of establishing a new connection each time, clients retrieve an existing connection from the pool, use it, and then return it to the pool. This reduces latency and improves scalability, especially in applications with a high number of concurrent online users.
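The mechanism fits in a few lines. This is an intentionally minimal, illustrative pool (not production-grade; real pools add health checks, reconnection, and sizing policies): clients borrow an already-open connection instead of paying the connect/authenticate overhead on every request.

```python
import queue
import sqlite3

class ConnectionPool:
    """Toy fixed-size connection pool backed by a thread-safe queue."""

    def __init__(self, size, connect):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(connect())  # pay the connection cost once, upfront

    def acquire(self, timeout=5.0):
        # Blocks until a connection is free; queue.Empty on timeout mirrors
        # hitting the pool's max-connections limit under load.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

pool = ConnectionPool(size=2, connect=lambda: sqlite3.connect(":memory:"))
conn = pool.acquire()
result = conn.execute("SELECT 1").fetchone()
pool.release(conn)  # the connection stays open for the next borrower
print(result)  # (1,)
```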

Note that if you have a highly concurrent environment with long queries and a non-pipelined protocol, each connection from the pool will be exclusively locked to serve a particular query, and each new incoming request will spin up a new connection to the database. After some time you will hit the limit of maximum allowed connections. In this situation the connection pooling mechanism is not going to help you much.

The connection-oriented model works fine for long-running compute instances, but fails to match the nature of serverless compute environments, which consist of ephemeral, short-lived execution contexts.

Integration solution

A relational database can act as an integration/interoperability mechanism connecting multiple applications: the shared integration database.

Relational databases remain the backbone of enterprise IT systems. They are often used as an integration point between various applications and platforms. The standard SQL interface ensures that multiple systems, regardless of their vendor or underlying architecture, can interact with a relational database seamlessly.

Keep in mind, though, that the shared integration database is considered an anti-pattern by modern standards.

NoSQL databases

Overview

  • Designed with scalability and large data volumes in mind, influencing many other design decisions, including data model, query capabilities, and consistency guarantees.
  • Doesn’t restrict data to a one-size-fits-all relational model. Instead, NoSQL databases provide various specialized data models optimized for specific use cases, including key-value, document, column-family, graph, and time-series structures.
  • Favors aggregate-oriented data models, which consist of large records with simple connections between them. Referential integrity is not supported.
  • NoSQL modeling mindset requires you to map out your query patterns up front and then design your NoSQL schema around it. Changes in query patterns might necessitate a schema redesign, making it challenging to accommodate unforeseen queries in the future.
  • Data is often denormalized in order to retrieve related datasets with a single request from a single database node. The responsibility of managing duplicated data is usually pushed to the application side.
  • The responsibility of enforcing data integrity rules is shifted to the application side.
  • For the sake of scalability and performance, querying capabilities are intentionally limited. Operations like JOINs and aggregations, and any other operations that span multiple aggregate boundaries may be unsupported or severely restricted.
  • Loose schema. Does not require you to define a schema upfront. Allows records within the same table to have varying, non-uniform structure.
  • Transactions, if present, typically support atomic operations against a single document only; beyond that, transactions are not fully ACID-compliant.
  • NoSQL databases usually relax strong consistency guarantees in scenarios where data is replicated across multiple nodes.

Use cases

When to use:

  • When the data schema is rapidly evolving or the data structure isn’t predefined, a schema-less database offers enhanced agility and flexibility.
  • When dealing with nonuniform, semi-structured, or unstructured data.
  • For data models of records with simple or limited connections, so it’s easier to draw boundaries between aggregates.
  • When working with data models that don’t fit into a relational approach.
  • For sparse data. Column-family stores are well-suited for datasets where many attributes have null or missing values, as they don’t use up space for null entries.
  • When you anticipate a large amount of traffic or large data volumes that are best distributed across servers, and horizontal scaling is a must.
  • In scenarios demanding a high volume of write operations.

When not to use:

  • When you need strong consistency and ACID-compliant transactions.
  • When complex querying is needed.
  • When data access patterns are unknown upfront or when you need to support unforeseen queries of varying shape in the future (e.g. OLAP systems).
  • When your data fits best in a simple table format or if you have many relationships between your data records.

Non-relational data models

NoSQL databases don’t constrain data into two-dimensional tables. They permit complex nested tree-like objects. This eliminates the need to translate between in-memory and database representations and simplifies database access for developers.
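A sketch of what that looks like in practice (field names are illustrative): a document store keeps the whole nested object as one record, so the stored shape matches the in-memory shape and no object-relational translation is needed.

```python
import json

# One order, stored as a single nested document rather than rows
# scattered across orders, customers, and line-item tables.
order = {
    "_id": "order-1001",
    "customer": {"name": "Ada", "email": "ada@example.com"},
    "items": [
        {"sku": "A-1", "qty": 2, "price": 9.99},
        {"sku": "B-7", "qty": 1, "price": 24.50},
    ],
}

# Serialize and restore the structure as-is; no mapping layer required.
stored = json.dumps(order)
loaded = json.loads(stored)
print(loaded["items"][0]["sku"])  # A-1
```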

There is no universal data model across all NoSQL databases. Instead, you select the one that best aligns with your needs: document-based, key-value, column-family, graph-like, or time-series. The strength and weakness of these models lie in their specialization. Used outside of their intended context, they can become overly complex and inefficient. In contrast, the relational model lets you model a variety of problem domains: key-value storage, hierarchical relationships, highly interconnected data with prevalent many-to-many relationships, time-series data. It might be a more complex and less efficient solution, but it will nevertheless work.

Aggregate-oriented data model

Most NoSQL databases were designed with horizontal scalability in mind, driven by the necessity to operate on clusters. This led to aggregate-oriented data models consisting of large records with simple connections, as opposed to the highly interconnected sets of small records that the relational model promotes.

The term “aggregate” comes from domain-driven design and refers to a collection of related objects that we wish to treat as a unit. Data is read and written as a single unit, which is more efficient for distributed systems. The aggregate is thus both a unit of data retrieval and a unit of consistency: we like to update aggregates with atomic operations and communicate with our data storage in terms of aggregates.

NoSQL databases are optimized for accessing or updating a single document/aggregate at a time. SQL databases, on the other hand, can efficiently manage operations across multiple rows and tables.

What else does aggregate-orientation mean in practice?

  • Lack of JOIN operations. In relational databases, the JOIN operation lets you combine rows from two or more tables based on related columns. In NoSQL databases, there’s no equivalent (or only a very limited surrogate). This means you may have to denormalize data, embedding relevant data directly within a single document or record, leading to potential data redundancy.
  • NoSQL databases have limited support for aggregations. While basic operations like counting or summing might be supported, complex aggregations require additional work or may not be feasible at all.
  • Inter-document relationships are harder to handle than intra-document relationships. You might use reference IDs, much like foreign keys in SQL, but the database does not enforce referential integrity checks.
  • When transactions are supported, they are usually constrained to a single document. Atomic updates within a single document are possible, but operations that span multiple documents are not atomic. This can be painful when a business transaction spans multiple documents and you need strong consistency guarantees.

Choosing appropriate aggregate boundaries in NoSQL databases depends entirely on your anticipated data access patterns. The advice is to group data that is frequently accessed together to optimize retrieval speeds. While a particular decomposition schema can help with specific interactions, it might hinder others. If your querying requirements evolve, you could find yourself having to redesign your schema to accommodate them. In contrast, the relational model is very flexible and allows you to easily look at the data in different ways without changing the schema design. The relational schema is primarily dictated by normalization rules and is not tied to specific access patterns.

Denormalization is more common with the aggregate-based approach, because we want to minimize the number of aggregates we access during a data interaction. A denormalized store allows you to retrieve all information about a specific data item in a single request.

However, denormalization comes with its own challenges, notably data duplication. NoSQL databases sidestep this issue by pushing data integrity responsibilities onto the application side. Developers may find themselves having to update the same piece of data in multiple locations when a change occurs.
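A small sketch of that burden (document shapes are illustrative): the author's name is embedded in every post for fast reads, so a rename must touch every copy, and it is application code, not the database, that keeps the duplicates consistent.

```python
# Denormalized documents: the author's name is duplicated in each post.
posts = [
    {"_id": "p1", "title": "Intro", "author": {"id": "u1", "name": "Ada"}},
    {"_id": "p2", "title": "Follow-up", "author": {"id": "u1", "name": "Ada"}},
]

def rename_author(posts, author_id, new_name):
    # No single UPDATE, no referential integrity: the application must
    # find and rewrite every embedded copy itself.
    for post in posts:
        if post["author"]["id"] == author_id:
            post["author"]["name"] = new_name

rename_author(posts, "u1", "Ada Lovelace")
names = {p["_id"]: p["author"]["name"] for p in posts}
print(names)  # {'p1': 'Ada Lovelace', 'p2': 'Ada Lovelace'}
```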

Data is modeled mostly for reads, rather than for ease of storage and write flexibility.

Loose schema

NoSQL databases allow you to store records without first defining any schema. Multiple records within the same table might have different structures: there’s no requirement for all records of a table to be uniform.

However, the absence of schema is just an illusion. Even when you don’t have to define explicit schema, there’s typically an implicit schema around. This implicit schema is a set of assumptions about the data’s structure present in the application code that manages this data. Essentially, a schema-less database moves the schema definition to the application code, and often this knowledge is scattered all over the codebase (or is at least confined to data access-related modules).

An implicit schema gives you more agility. As application requirements evolve, the table or collection can accommodate records with a new structure without time-consuming database refactoring. This reduces the challenges associated with schema migration and the lock-step rollout of new app and database versions. On the flip side, the application logic now has to deal with multiple versions of the same data entity being stored at the same time.

Another advantage is the ease of handling non-uniform data, where each record can have a distinct set of fields.
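A sketch of both points at once (record shapes are illustrative): two "versions" of the same entity sit side by side in one collection, and the implicit schema lives in the reader code, which must tolerate either shape.

```python
# A schema-less collection holding non-uniform records.
users = [
    {"_id": 1, "name": "Ada"},                           # old record shape
    {"_id": 2, "name": "Grace", "phone": "+1-555-0100"}  # new shape, extra field
]

def contact(user):
    # The "schema" is this assumption in application code: phone may be absent.
    return user.get("phone", "unknown")

contacts = [contact(u) for u in users]
print(contacts)  # ['unknown', '+1-555-0100']
```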

The disadvantage is that without the schema defined on the database side, the database can no longer enforce consistency checks and validate data, raising the risk of accepting inconsistent data. This concern becomes more important when multiple applications, possibly developed by different teams, access the same database (a scenario known as the shared integration database).

Also note, that the flexibility of implicit schema only applies within an aggregate — if you need to change your aggregate boundaries, the migration is as complex as it is in the relational case.

Querying

Query capabilities in NoSQL databases are typically more restricted than in RDBMS. Data can be efficiently queried in a limited set of ways; outside of these, queries can become either very inefficient or are not supported at all.

Inter-aggregate JOINs and aggregations are deliberately unsupported. This minimizes the number of aggregates (documents) a query can access, for the purpose of horizontal scaling. When JOIN operations are necessary, they’re usually handled in the application layer.
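An application-layer join looks roughly like this (data shapes are illustrative): fetch the two aggregates separately, then stitch them together in code.

```python
# Two aggregates fetched independently, e.g. via two separate requests.
users = {"u1": {"name": "Ada"}}
orders = [{"id": "o1", "user_id": "u1", "total": 35.0}]

# The "join" happens in application code, not in the database.
joined = [
    {**order, "user_name": users[order["user_id"]]["name"]}
    for order in orders
]
print(joined)  # [{'id': 'o1', 'user_id': 'u1', 'total': 35.0, 'user_name': 'Ada'}]
```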

There’s no common standard query language like SQL. Each NoSQL database comes with its own query language.

Querying is not as flexible as in an RDBMS. With an RDBMS, you can create a normalized data model without thinking about access patterns upfront. By contrast, in NoSQL, you shouldn’t start designing your schema until you know the questions it needs to answer. Understanding the business problems and the application use cases up front is absolutely essential. You won’t design your tables first and then see what questions you need to answer; you map out your query patterns up front, then design your NoSQL schema around them. You have to do the hard work up front to really think about your data access patterns and how you will satisfy them.

Rather than reassembling your data at read time using JOINs, a NoSQL database requires you to store data in such a way that it is “pre-joined”, keeping related data together. A NoSQL database organizes data so that its shape in the database corresponds to what will be queried.

You design your schema specifically to make the most common and important queries as fast and as inexpensive as possible. Your data structures are tailored to the specific requirements of your business use cases.
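A toy key-value sketch of query-first design (the key layout and names are illustrative, loosely inspired by single-table patterns in wide-column stores): instead of normalized tables, data is stored under keys shaped like the query you will run, here "latest posts by user".

```python
store = {}  # stand-in for a key-value / wide-column store

def save_post(user_id, ts, post):
    # Composite key: zero-padded timestamp makes lexicographic order
    # match chronological order, "pre-sorting" posts per user.
    store[f"user#{user_id}#post#{ts:010d}"] = post

def latest_posts(user_id, limit=2):
    # The target query is a cheap prefix scan — no JOIN, no sort at scale.
    prefix = f"user#{user_id}#post#"
    keys = sorted((k for k in store if k.startswith(prefix)), reverse=True)
    return [store[k] for k in keys[:limit]]

save_post("u1", 1000, "first")
save_post("u1", 2000, "second")
save_post("u1", 3000, "third")
print(latest_posts("u1"))  # ['third', 'second']
```

The schema answers exactly this question quickly; asking a different question (say, "all posts mentioning X") would need a new layout or a secondary index.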

NoSQL databases trade off write simplicity for read performance. While data might be structured in a way that makes it hard to update and manage, this schema gives you good performance for a selected subset of queries.

Scaling

NoSQL databases were designed from day zero to handle horizontal scaling and manage large datasets with high ingestion throughput. This design goal strongly influences other aspects, such as the data model, querying capabilities, and consistency and durability guarantees.

To achieve this, they remove features traditionally found in relational databases that can hinder performance, such as joins and aggregations. Query capabilities and flexibility are deliberately sacrificed in favor of scalability and performance.

Additionally, these databases implement explicit partitioning to distribute data across multiple storage nodes effectively. Partitioning forces you to segment your data (with an explicit partition/shard key), allowing for easier horizontal scaling; all queries are then confined to a single segment. This partitioning mechanism is what allows these databases to scale almost infinitely without performance degradation: as your data volume increases, you just add nodes as needed.
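The routing idea behind a shard key can be sketched in a few lines (node names are illustrative; real systems use consistent hashing to avoid mass rebalancing when nodes change): hashing the key to a node index sends each record, and each query for it, to exactly one storage node.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def node_for(shard_key: str) -> str:
    # Hash the shard key and map it onto a node. Deterministic: the same
    # key always routes to the same node, so single-key queries touch
    # exactly one machine.
    digest = hashlib.md5(shard_key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Every operation on customer "c42" lands on the same node.
print(node_for("customer#c42") == node_for("customer#c42"))  # True
```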

Consistency

With scalability in mind, most NoSQL stores lack true ACID transactions and favor relaxed consistency models.

Usually, NoSQL stores don’t have transactions that span multiple aggregates. Instead, they provide atomic manipulation of a single aggregate at a time: the aggregate acts as the atomic unit for retrieval and updates. This provides a somewhat limited amount of transactional control.

In scenarios where data is replicated across multiple nodes, the application developer has to deal with a new class of anomalies. In a strongly consistent system (more precisely, one that provides linearizability), reads always return the most recent version of the data, and writes are visible to all nodes immediately after they occur. In distributed systems this is no longer true by default, and reads might not reflect the most recent write. Various consistency models (causal consistency, read-your-writes, monotonic reads, etc) describe the guarantees a system provides about the order in which write operations appear to occur to clients. The developer has to know which consistency levels a particular database provides, understand which level is needed for a particular use case or feature, and make an informed decision accordingly.

In the face of a network partition, NoSQL stores let you trade between consistency and availability. In normal operation, they let you trade between consistency and latency. Some databases are opinionated and stick to a particular side of the spectrum, while others give you more control over the desired consistency level, either as a database-wide setting or tweakable on a per-query basis.

Conclusion

Recently, there has been a convergence between relational databases and NoSQL solutions. Relational databases are adopting features previously exclusive to NoSQL, such as complex nested data types like XML and JSON, specialized data models and use cases like time-series data, and geospatial indexes. On the other hand, NoSQL solutions are incorporating multi-document transactions, advanced aggregation operations, and the capability to define validation schemas within the database.

It’s essential to view those two categories described in this post not as a binary choice but as endpoints on a spectrum. Real-world database solutions will often lie somewhere in between these points, leaning more towards one end or the other.

While the line that distinguishes relational from NoSQL databases seems to be blurring, this apparent similarity can be deceptive. It’s important to remember the original design goals of each solution, the purpose for which each database was created, and the specific problems it is meant to address, and to employ each accordingly.
