SQL vs NoSQL: Choosing the Right Database Model for Your Business Needs

Abhirup Acharya
9 min read · Jun 14, 2023


Understanding the need for NoSQL Databases like MongoDB, DynamoDB and Cassandra

Image source: https://www.google.com/amp/s/www.gcreddy.com/2022/10/difference-between-sql-and-nosql.html%3famp=1

A quick brush-up on basic SQL

A SQL (Structured Query Language) database, also known as a relational database, is a type of database management system (DBMS) based on the relational model. It organizes and stores data in a structured manner, using tables with rows and columns to represent entities and their relationships.

SQL databases are defined, manipulated, and queried through SQL itself, which provides a set of commands and syntax for working with the data. With SQL, you can create tables, define relationships between tables using primary and foreign keys, insert and update data, and query the database to retrieve specific information.
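To make that concrete, here is a minimal sketch using Python's built-in sqlite3 module; the `authors` and `books` tables and their columns are hypothetical examples, not taken from any particular application.

```python
import sqlite3

# In-memory database for illustration; pass a file path for persistence.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two related tables with a primary key / foreign key relationship.
cur.execute("""
    CREATE TABLE authors (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    )
""")
cur.execute("""
    CREATE TABLE books (
        id        INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        author_id INTEGER REFERENCES authors(id)
    )
""")

# Insert some rows, then reassemble them with a JOIN.
cur.execute("INSERT INTO authors (id, name) VALUES (1, 'Jane Austen')")
cur.execute("INSERT INTO books (id, title, author_id) VALUES (1, 'Emma', 1)")
cur.execute("""
    SELECT books.title, authors.name
    FROM books JOIN authors ON books.author_id = authors.id
""")
print(cur.fetchall())  # [('Emma', 'Jane Austen')]
conn.commit()
```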

Some popular SQL database systems include:

  1. MySQL → An open-source relational database management system widely used for web applications.
  2. PostgreSQL → An open-source object-relational database known for its extensibility and advanced features.
  3. Microsoft SQL Server → A commercial RDBMS developed by Microsoft, commonly used in Windows-based environments.

A quick brush-up on basic NoSQL

A NoSQL database, also known as a non-relational database, is a type of database management system that differs from traditional SQL (Structured Query Language) databases. Instead of rigid tables, NoSQL databases store data in more flexible structures such as documents, key-value pairs, wide columns, or graphs.
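As a quick illustration of the document model, here is a minimal sketch using pymongo against a local MongoDB instance; the `shop` database, `users` collection, and field names are hypothetical.

```python
from pymongo import MongoClient

# Assumes a MongoDB server is running locally on the default port.
client = MongoClient("mongodb://localhost:27017")
users = client["shop"]["users"]

# Documents are JSON-like structures; no table definition is needed up front,
# and nested fields are fine.
users.insert_one({
    "name": "Asha",
    "email": "asha@example.com",
    "addresses": [{"city": "Pune", "zip": "411001"}],
})

# Query by any field.
print(users.find_one({"email": "asha@example.com"}))
```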

Popular NoSQL databases include:

  1. MongoDB → A document-oriented NoSQL database that provides scalability, flexibility, and rich query capabilities.
  2. Cassandra → A highly scalable and distributed NoSQL database designed for handling large amounts of data across multiple commodity servers.
  3. Redis → An in-memory NoSQL database that supports key-value data storage and provides high-speed data access.
  4. Amazon DynamoDB → A fully managed NoSQL database service provided by Amazon Web Services (AWS), offering scalability, high availability, and low-latency performance.

NoSQL databases are commonly used in scenarios where large volumes of rapidly changing data need to be handled, such as social media platforms and IoT applications.

In this article we’ll look at how databases do and don’t scale. We’ll walk through a few problems with traditional SQL databases and how the introduction of NoSQL databases solves them, and we’ll cover which situations are better suited to SQL databases and which are better suited to NoSQL databases.

The problem with Relational Databases

Relational databases work for a wide variety of use cases and are often the de facto choice for developers when building applications. Moreover, relational databases have SQL, which makes it easy to handle new access patterns, as well as both OLTP and OLAP workloads, in a single database.

OLTP → It stands for Online Transaction Processing. It refers to a class of computer systems and applications that support real-time transaction-oriented workloads. OLTP systems are designed to facilitate and manage the day-to-day operational activities of an organization by processing a large number of short and fast database transactions.

OLAP → It stands for Online Analytical Processing. It refers to a category of computer systems and applications that support complex data analysis, reporting, and decision-making tasks. OLAP systems are designed to handle large volumes of data and provide multidimensional views for interactive and ad-hoc analysis.
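As a rough illustration of the difference, here is a small sketch with two hedged example queries against a hypothetical `orders` table: the OLTP-style lookup touches a single row, while the OLAP-style aggregation scans and summarizes large portions of the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        total      REAL,
        status     TEXT,
        created_at TEXT
    )
""")
conn.execute("INSERT INTO orders VALUES (42, 99.5, 'shipped', '2023-05-14')")

# OLTP: a short, frequent, single-row transaction.
print(conn.execute(
    "SELECT status, total FROM orders WHERE order_id = ?", (42,)
).fetchone())

# OLAP: an analytical aggregation that scans and summarizes many rows.
print(conn.execute("""
    SELECT substr(created_at, 1, 7) AS month, SUM(total) AS revenue
    FROM orders
    GROUP BY month
    ORDER BY month
""").fetchall())
```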

OLTP vs OLAP ← This is a wonderful article that explains very clearly what OLAP and OLTP mean, along with their differences, and provides a crisp comparison between the two. Do take a look if you want to dive in further.

However, to provide this flexibility, relational databases often suffer from the following problems:

  1. Unpredictability of Query performance
  2. Issue with the all-famous JOINS
  3. Difficulty with Horizontal scaling of DB instance

Unpredictability of Query performance

During testing and the early development and release phases of your application’s lifecycle, the queries in your application generally respond fast.

Representation of RDBMS data store with 5GB data

However, as the size of your tables grows, these operations will get slower and slower. Performance is also affected by other queries running at the same time. And if your company’s data analysts are running analytical operations on the same table that is live in production, things get even more screwed up.

Representation of RDBMS data store with 500GB data
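You can see this effect for yourself with a minimal sketch using sqlite3 that times the same un-indexed lookup as the table grows; the `events` table is hypothetical, and real-world numbers will vary with hardware and concurrent load.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")

for target_rows in (100_000, 1_000_000):
    # Grow the table to the target size.
    current = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        ((i, "x" * 50) for i in range(current, target_rows)),
    )

    # The same query: without an index it is a full table scan,
    # so latency grows roughly linearly with table size.
    start = time.perf_counter()
    conn.execute("SELECT id FROM events WHERE payload LIKE '%needle%'").fetchall()
    print(f"{target_rows:>9} rows: {time.perf_counter() - start:.4f}s")
```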

This leads us to the second problem.

Issue with the all-famous JOINS

JOINS in SQL can potentially hamper speed due to several reasons:

Large Result Sets → When performing a JOIN operation, the result set can grow significantly if there are many matching rows between the joined tables. This can lead to larger intermediate result sets that need to be processed, transferred, and stored, resulting in increased memory consumption and longer query execution times.

Lack of Index Usage → Efficient JOIN operations often rely on the use of appropriate indexes on the join columns. If the necessary indexes are missing or not properly defined, the database may need to perform full table scans or excessive disk I/O operations to locate matching rows, leading to slower performance.

Complex Query Plans → The database optimizer needs to determine the most efficient way to execute a JOIN query. In some cases, the query optimizer may need to consider multiple possible join paths, table access methods, and join algorithms. This process can become complex and time-consuming, especially for queries with multiple JOINs, resulting in suboptimal query plans and slower execution.

Lack of Selectivity → Join conditions that involve non-selective predicates (conditions that match a large portion of the data) can hinder performance. Non-selective JOIN conditions can result in larger intermediate result sets and slower query execution, as the database needs to process a significant amount of data to retrieve the desired results.
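The index point in particular is easy to see with sqlite3’s EXPLAIN QUERY PLAN. Below is a minimal sketch (the `customers` and `orders` tables are hypothetical) showing how the plan for a JOIN changes once an index exists on the join column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
""")

join_query = """
    EXPLAIN QUERY PLAN
    SELECT customers.name, orders.total
    FROM customers JOIN orders ON orders.customer_id = customers.id
    WHERE customers.id = 1
"""

# Without an index on the join column (orders.customer_id), the planner
# has to scan the whole orders table for each matching customer.
print(conn.execute(join_query).fetchall())

# With an index on the join column, it can search the index instead.
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")
print(conn.execute(join_query).fetchall())
```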

To tackle these issues, you may want to scale up your database which brings us to the next issue.

Difficulty with Horizontal scaling of DB instance

There are two ways to scale a database:

  • Vertical scaling → By increasing the CPU or RAM of your existing database machine(s)
  • Horizontal scaling → By adding additional machines into your database cluster, each of which handles a subset of the total data.

However, you will eventually hit the ceiling of vertical scaling. At this point you may want to consider horizontal scaling. But this presents its own set of challenges.

  • Data Distribution and Sharding → When scaling horizontally, distributing data across multiple nodes is crucial. In SQL databases, data is often structured in tables with relationships and dependencies. Splitting and distributing such interconnected data can be complex, as maintaining referential integrity and ensuring consistent data across nodes becomes challenging. Sharding techniques are often employed to partition data across nodes, but they require careful planning and may introduce additional complexity.
  • Query Coordination and Joins → SQL databases support complex JOIN operations to retrieve data from multiple tables. When data is distributed across nodes, performing JOIN operations becomes more challenging. JOINs may require coordination between nodes, data transfers across the network, and increased network latency, potentially impacting performance and scalability.
  • Indexing and Query Optimization → SQL databases heavily rely on indexes for efficient data retrieval. With horizontal scaling, indexing becomes more complex, as indexes need to be distributed and maintained across nodes. Query optimization techniques also become more challenging due to the distributed nature of data, potentially affecting query performance.

Horizontal scaling works best when you can shard the data in a way that a single request can be handled by a single machine/node.

The solution via NoSQL Databases

NoSQL Databases try to solve these major problems posed by Relational databases.

  1. Issue with the all-famous JOINS
  2. Unpredictability of Query performance
  3. Difficulty with Horizontal scaling of DB instance

Solving JOINS

Let’s quickly understand what NoSQL even tries to solve here.

SQL databases depend on normalization for their data modelling.

Normalization is a process in database design that organizes and structures data to minimize redundancy, improve data integrity, and optimize database efficiency. It involves breaking down data into smaller, more manageable entities (tables) and establishing relationships between them.
It aims to remove redundancy by storing each piece of data in only one place. This benefits us in the following ways:

  1. Flexible Querying → JOINS allow you to reassemble the required data from different tables via a single operation. With the flexibility of JOINS and the rest of the SQL syntactic sugar, you don’t really need to think about how you’ll access your data ahead of time. You model your entities according to the principles of normalization, then write the queries to handle your needs.
  2. Enhancing Data Integrity → Normalization helps enforce data integrity rules by defining constraints and relationships between tables. Primary keys and foreign keys are used to establish relationships and enforce referential integrity, preventing data inconsistencies and ensuring data integrity.
  3. Improving Storage Efficiency → Redundant data, where the same information is stored in multiple places, can lead to inconsistencies and wasted storage space. Storing each piece of data in only one place reduces the chances of data inconsistencies and improves data accuracy.

All of this, however, comes at a price: the price of CPU and memory, often lots of it.
NoSQL takes an approach that addresses all of these issues.

  1. Solving Flexible Querying
    NoSQL databases avoid the need for flexible data access by requiring you to chalk out all the major data access patterns up front and design your database to handle those patterns specifically.
    Instead of assembling your data at read time, you essentially store a pre-joined version of your data by laying it out in the way it will be read.
  2. Solving Data Integrity
    NoSQL databases push the responsibility of data integrity into the application. You will essentially need to denormalize and duplicate the data in your database. In the event of a change, you may need to update multiple records containing the same data.
  3. Solving Storage Efficiency
    NoSQL databases are less storage efficient than their relational counterparts.
    However, when RDBMSs were designed, storage was extremely expensive compared to compute. Today the situation is reversed: storage prices have dropped like crazy while compute takes the premium crown. Therefore, it makes a lot of sense to model your data to optimize for compute over storage.

Data Storage example in SQL Database

When this data model is converted to a NoSQL database, let’s say DynamoDB, then it looks like the below.

Counterpart data storage example for NoSQL Database (DynamoDB)
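Here is a hedged sketch of that same conversion in plain Python. The customer/order entities, key names, and the `CUSTOMER#`/`ORDER#` prefixes are hypothetical, but they follow a common DynamoDB single-table pattern where related items share one partition key.

```python
# Normalized SQL layout: each fact lives in exactly one table row,
# and reads reassemble them with a JOIN on customer_id.
sql_customers = [{"customer_id": 1, "name": "Asha", "city": "Pune"}]
sql_orders = [
    {"order_id": 101, "customer_id": 1, "total": 49.0},
    {"order_id": 102, "customer_id": 1, "total": 15.5},
]

# Denormalized DynamoDB layout: the customer and her orders share the same
# partition key, so one Query on PK returns the "pre-joined" item collection.
dynamodb_items = [
    {"PK": "CUSTOMER#1", "SK": "PROFILE",   "name": "Asha", "city": "Pune"},
    {"PK": "CUSTOMER#1", "SK": "ORDER#101", "total": 49.0,  "name": "Asha"},
    {"PK": "CUSTOMER#1", "SK": "ORDER#102", "total": 15.5,  "name": "Asha"},
]
```

Notice that the customer’s name is duplicated onto each order item; keeping those copies in sync is now the application’s job, which is exactly the data-integrity trade-off described above.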

Solving Horizontal Scaling

The primary reason relational databases find it so difficult to scale horizontally is the flexibility of their query syntax. Because data access is so flexible, it’s impossible for the system to know which pieces of your data will be fetched until your query is actually executed. As such, all data needs to be kept local, on the same node, to avoid cross-machine network calls when executing a query.

NoSQL databases embrace Horizontal scaling by requiring you to split up your data into smaller segments and perform all queries within one of these segments.

To understand this a little better, let’s take user_id as a Partition Key and follow through in the diagram.

Most NoSQL databases hash the partition key value before assigning it to a node. This helps distribute the data better.

During read and write operations, all queries must include the partition key in order to have quick access to the relevant node directly.
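Here is a minimal, self-contained sketch of that idea (a toy router, not how any specific database implements it): the partition key is hashed to pick a node, so all reads and writes for a given `user_id` always land on the same single node.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def node_for(partition_key: str) -> str:
    # Hash the partition key and map it onto one of the nodes, so keys
    # spread evenly and the same key always routes to the same node.
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Every read/write for a given user_id hits exactly one node.
print(node_for("user#123"))   # e.g. 'node-b'
print(node_for("user#456"))   # a different user may live on another node
```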

Solving Unpredictability of Query performance

In SQL databases, as the size of your tables grow, operations will get slower and slower. The performance is also affected by other queries that are running at the same time.

In NoSQL databases, every read query must include the partition key, and every write can be sent to the node responsible for that chunk of the data without bothering the other nodes in the cluster. As your data volume increases, you can simply keep adding nodes as and when needed. Every operation hits only one node in the cluster. There is no black box in terms of scaling, as there is with SQL databases. So, naturally, query performance barely varies between the test and production environments, or as your application scales.
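For example, with DynamoDB a read looks something like the sketch below, using boto3; the `Orders` table, `user_id` partition key, `order_id` sort key, and region are hypothetical. Because the partition key is mandatory in the key condition, the request can be routed straight to the partition that owns it.

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("Orders")  # hypothetical table

# Writes go to the partition that owns this user_id.
table.put_item(Item={"user_id": "user#123", "order_id": "o-1", "total": 49})

# Reads must name the partition key, so only one partition is touched.
response = table.query(KeyConditionExpression=Key("user_id").eq("user#123"))
print(response["Items"])
```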

Signing Off!!

That’s a wrap on the article. I really hope it was worth your while and that you learned a great deal from it.

More than happy to address any improvements and suggestions. Please feel free to drop a comment.

Connect with me via: LinkedIn | Twitter
