SQL vs NoSQL: 4 Aspects You Need to Know
With simple analogies that even a 5-year-old could understand
If you have experience working with databases, chances are you have heard of SQL and NoSQL. However, it might be a hard decision trying to choose between the two for your application, especially when you are flooded with technical jargons on the internet which might not even be relevant for you! In this article, I am going to highlight the 4 simple aspects you need to consider when deciding between SQL and NoSQL. Along the way I will also introduce a few important concepts accompanied with simple analogies, which I think might help in your decision-making!
Brief Introduction of SQL and NoSQL
SQL stands for Structured Query Language and it is a programming language that is used to manage and access your Relational Database Management System (RDBMS). In RDBMS, data are organized into relations (tables), where each relation is an unordered collection of tuples (rows). Most RDBMS uses SQL, hence RDBMS is also known as SQL databases.
NoSQL stands for Not Only SQL, which can mean any other types of databases other than RDBMS. Some popular ones include document database (MongoDB), key-value store (Redis), column-oriented database (MariaDB) and graph database (Neo4j).
CFLS - The 4 aspects you need to know to choose the right database
These are the 4 most important aspects that I think you should know when deciding between SQL and NoSQL! I called it the CFLS:
- Consistency
- Flexibility
- Locality
- Scalability
Consistency
Does your data need to be consistent at all times?
If your database is going to store critical data like payment / account balance / inventory of a product on an e-commerce site in which any inconsistent view of the database may result in user panicking or business losses, you would probably want to go with a SQL database.
A simple example will be when you make a bank transaction to your friend, it will be unacceptable if the amount has been deducted from your balance but it did not reflect on your friend’s account immediately. This is going to hurt the credibility of the bank substantially.
Strong consistency VS Eventual consistency
SQL databases offers strong consistency — it ensures that all reads made to a certain data item will not return an inconsistent state. On the other hand, NoSQL databases generally offers eventual consistency, which guarantees that if no new updates are made to a data item, eventually all reads to that item will return the latest value. However, it did not guarantee how long it will take to achieve the consistent state.
Here’s a simple analogy. Peter and Jane are a couple and they decided to have a notebook each to record how many apples they own. Let’s say they both started with 10 apples each, and Peter decided to give Jane 5 apples on Valentine’s Day. However, Jane is on a vacation with her family and did not bring her notebook with her.
If they are using a NoSQL approach, Peter would first update his notebook, and Jane would only update her notebook when she is back from her vacation. If someone reads their notebooks while Jane is still on vacation, her notebook will show that she has 10 apples while Peter’s notebook will indicate that he only has 5 apples left — a total of 15 apples, which is inconsistent to the initial state of 20 apples!
What if they decided to use a SQL database approach? In this case, Peter would update his notebook, while asking Jane to make sure that no one else could read her notebook until she is back. This is to prevent someone from reading the updated value from Peter’s notebook together with an outdated value from Jane’s notebook, which results in an inconsistent state as illustrated in the NoSQL approach. In a SQL database approach, a read could only be made to the notebooks after the updates are performed on both notebooks, where Peter’s notebook will record that he has 5 apples, and Jane’s notebook will show that she has 15 apples, a total of 20 apples before and after — a consistent state.
However, if your database has only a single node*, you will not need to worry about consistency issues. This is because the read and write operations are carried out on the same node. To reuse the analogy above, let’s say Peter and Jane decided to only use one notebook to record the number of apples. Reads can only be made to the notebook either before or after the update, hence it can’t possibly return an inconsistent state.
What is a database node?
Node is a running instance of the database. Usually the data in the entire database system are divided across the nodes, for example, node 1 may store User 1–5 while node 2 may store User 6–10. This is also known as horizontal partitioning or sharding.
You can imagine this as having a bookshelf (your database system), where each compartment is a node. The first compartment is used to keep books that have title starting with “A” whereas the second compartment is used to keep books with title starting with “B” and so on.
However, you can also use multiple nodes to store duplicate data in case one of the nodes crashes. For example, node 1 and node 2 can both store data of User 1–10. If node 1 goes down, we still can access the data from node 2.
Flexibility
Do you want to enforce a structure for your data?
If your data have a well-defined structure and you don’t foresee any possible changes to the structure in the future, SQL databases might be a good choice to enforce the structure. On the other hand, if the data does not have a definite structure or you have no control over the structure of the data, then NoSQL databases might be more suitable for your use case.
Schema-on-write VS Schema-on-read
An example of well-defined structured data is a bank transaction. It must have certain attributes such as amount, from_account, to_account and timestamp, and you probably wouldn’t want a bank transaction record to have some random attributes. In this case, SQL databases might just be the right tool as every data that is written to the database must conform to the pre-defined schema. This is known as schema-on-write.
On the other hand, NoSQL databases are generally the preferred choice for the reasons below:
- It is hard to define a uniform structure across all data items.
Log: Each log may have a different structure — an error log which prints stack trace is different from a usual log which records normal operations. - You want to store heterogenous types of data.
Files: Your application allows users to upload files. However, there is no restrictions on the type of files — it can be a .csv / .mp4 / .txt / .docx etc. - You want to have the ease to change the schema in the future without breaking the old ones.
Let’s say you have an existing database which is used to store chat messages. One day, you suddenly have the thought of adding a new feature which allows users to react to a specific message by adding an emoji (like the one in Facebook messenger). If you are using a SQL database, you will need to use theALTER TABLE
command to add a new columnemoji
to yourChats
table, and theALTER TABLE
command is notorious for the amount of time it might take, especially on large datasets. (Refer here for more information)
On the contrary, if you are using a NoSQL database, you can start writing the new chat messages which has an emoji tagged to it into the database straight away! There is nothing that needs to be done whenever you want to add/remove an attribute!
While NoSQL databases are usually known as schema-less, a more accurate term will be schema-on-read, as the data that is being read must has a certain structure, it’s just that the structure is not enforced when the data is being written to the database, instead, the structure is only known when you are reading it from the database.
However, if you are using an object-oriented design in your application with a SQL database, bear in mind that you will need to introduce an ORM* layer between the database and your application code.
What is ORM?
ORM stands for Object-Relational Mapping. It is exactly what its name suggests, mapping from the records in your relational database (SQL database) to the objects in your application code and vice-versa. For example, in your database you might have a table named “User” which has columns “name” and “age”. In order to use it in your application code, you will need a tool to help you transform the rows in your database to an actual “User” object in your application that has “name” and “age” attributes.
Locality
Does your data always appear together?
If certain data items are always retrieved together, you might want to use NoSQL for better data locality.
Let’s use Facebook posts as an example. A Facebook post must be immediately followed by the comments on that post. If you are using SQL databases, you might create two tables — Posts
and Comments
, where the Comments
table will need a column to store the unique identifier of each post, as a foreign key to the table Posts
(so that we know which post this comment belongs to). Hence, in order to display all the comments on a specific post, we will need to do a Join
between these two tables. It becomes more troublesome when the tables are on different nodes. All these operations make a simple request takes much longer, greatly affecting user experience.
On the contrary, if you are using a NoSQL database, you can spare yourselves from all these hassles! Let’s say you are using a document database with JSON format, you can save each post as a JSON object with a comments
property, which stores a list of comments made on that post. Whenever you want to display a post, all you need to do is request the post
object, and it will include all the comments which belong to that post! Faster processing time = Better user experience!
Scalability
Are you expecting your database to scale in the future to handle enormous amount of data?
One of the motivations behind the rise of NoSQL is the ability to scale. However, we need to be more precise when we talk about scaling, as there are two different possible ways to scale.
Vertical scaling VS Horizontal scaling
Here’s a simple analogy to help you understand these two terms. Let’s say a company has an office building with 5 levels, just enough to hold 500 employees. However, they are growing rapidly and expect to have a total of 1000 employees in the near future. They will need to figure out a way to scale their office spaces, and they basically have two options:
- Adding more levels on the existing office building (this is not exactly realistic, but please bear with me here 🙏).
- Building a new office building beside the existing one.
Adding more levels is the same as vertical scaling — the office building grows taller vertically. In computer science terms, it would mean adding more resources (RAMs, SSDs, CPUs) to your existing machine. Vertical scaling is also known as scaling up.
Building a new office building beside the existing one is same as horizontal scaling — the office space expands “horizontally”. In computer science terms, it would mean adding more machines to your existing systems, whereas in database terms, it would mean adding more nodes to your database system. Horizontal scaling is also known as scaling out.
Highly recommend this short and sweet answer on StackOverflow if you are interested to read more about vertical and horizontal scaling.
You can always add more powerful computing resources to your existing database node, which means vertical scalability is usually not an issue. However, SQL databases are typically hard to scale horizontally. Why?
Let’s take a look at the Peter and Jane analogy again. In the analogy above, when Jane is away for a vacation, we said that Peter would ask Jane to make sure that no one else could read her notebook before she is back. Similarly, if we perform an operation which involves two tables on two different nodes of the database, these two nodes will need to communicate properly to ensure that no one else could access the tables until the update is completed. The complexity quickly escalated if your operation involves more than two nodes! Furthermore, things like Foreign Key
in SQL databases will be harder to implement if the two tables are on different nodes, since inserting a row will now requires checking another table from another node!
On the flip side, NoSQL made it easier to scale horizontally, because it does not ensure strong consistency thus free from all the complications mentioned above!
You might wonder, if vertical scaling is easier to implement, why can’t we just stick to vertical scaling and avoid all the troubles that come with horizontal scaling?
This is because horizontal scaling is almost always more preferable over vertical scaling. Let us revisit the office building analogy:
If the company chooses vertical scaling, they will need to evacuate all staffs from the office building while the construction is going on. However, if horizontal scaling is chosen, the employees can continue working in the old office building while the new one is being built. Likewise, if you want to add more resources to your existing node, you will need to stop the node temporarily and restart it later in order for it to utilize the upgraded resources, whereas if you chose horizontal scaling, the existing node can continue to run without interruption while you setup a new node.
Besides that, with vertical scaling, you have a greater risk of downtime. With only one node, no matter how powerful it is, you cannot access the database anymore if it crashes. However, if you have multiple nodes, you still have some other nodes for you to access the database even if one of them crashes.
Rise of NewSQL
Hopefully by now you have a better idea on the different aspects that you need to consider when deciding between SQL and NoSQL!
But the technology world is ever-changing! In recent years, there is a new player in the database world — NewSQL. It tries to combine the best of both worlds from SQL and NoSQL — providing the horizontal scalability of NoSQL while maintaining the wonderful characteristics of SQL databases such as strong consistency! Having said that, many NewSQL databases are proprietary software and are developed for specific use cases, hence it is still not as popular as SQL and NoSQL.
Let me know if you are interested in this topic! We can discuss about NewSQL (specifically TiDB) and concepts like ACID, BASE, OLTP, OLAP in another article!
That’s it for this article! Thanks for reading!
References
- Designing Data-Intensive Applications [Chapter 2]