To SQL or to NoSQL? That is the Question

When I first was learning about SQL and relational databases it seemed like the one of the easiest programming concepts to grasp. Having a database full of tables, each corresponding to a class, and each with columns representing attributes just made sense. How else would you store your data? It seems like the most logical way since we are all used to data being in tables. Well, as progress has been made in computer science as a whole, limitations to this SQL based system have come up. And with these problems have come solutions in the form of NoSQL databases. I first heard about NoSQL databases when looking online about different web development frameworks, specifically MEAN. The database in MEAN is MongoDB, and when looking up what it was I realized it was a NoSQL database. This inspired me to try and figure out exactly what that meant and how it would work.

What is a NoSQL database?

A NoSQL database is one that forgoes the tables with set schema and set attributes for a more flexible and less structured database. One of the main differences between the two types of databases is the way they store the data. Relational databases rely on huge tables full of instances of objects, and these have references, called foreign keys, to link to instances of other classes. In NoSQL databases, most of the time all the data for a single object is stored in one file. In addition to having a reference to another object, they would have that other object within the file. This can be both good and bad, but I’ll get to that later. They also have no set schema, meaning there are no specific parameters that each instance of a class are required to have. Attributes can be added or removed while the database is in use. To interact with a database it is queries similar to SQL queries are made. Each NoSQL database has an API interface in order to access data and write data. There are many different kinds of databases, but I will talk about key value databases and document databases.

A key value database is similar to the hash datatype used in Ruby. Each file in the database corresponds to an object, and within that file there is a hash that holds all the attributes for it. These can contain nested hashes or even nested arrays, depending on the needs of the programmer.

Document based databases are directly related to key value databases, but the main difference is the organization and the ability to read the data. Normally the data in key value databases is opaque, meaning that the database does not know anything about what is in the file. However, in document based systems the database relies on metadata within the file to find the correct information. This metadata is usually in the form of tags to label the information. The data in a document based system is also stored very differently. Instead of being a hash of keys and values it can be stored in different formats such as JSON, XML, YAML, or BSON. This makes it more flexible, and makes querying much easier with a document based system. The program is able to make more specific queries based on the content of the documents.

So why were these databases created?

What problems led to the boom in NoSQL databases? Well one drawback is that relational databases must always have a set schema. Why would this be a drawback? I think we would all agree that the structure of relational databases is what makes it so easy to learn and understand how we can query our database. This actually creates a problem for programs and websites that are constantly updated, and that need to store new data with different datatypes. Every time new information needs to be added to the schema, a migration has to occur. This means migrating the whole database to this new structure which requires the server to be down, possibly for a long time depending on the size, while it is updated. This means as a user of a program, you may not be able to access the program.

Another drawback of SQL databases is how they scale. As the database gets bigger it scales vertically. This means that the data is all on one server, which is responsible for everything. This limits the how big the database can get. It also means that more powerful servers are needed to keep the database running. Although it is possible to shard the database, (spread the work from one server to many), this requires extra programming work in order to make the network of servers.

Caching data is another problem that relational databases can have. Usually a SQL database will not cache the results of queries, but will only cache the query path. A distributed caching system to hold information that is used most often can be added as a layer to a relational database, but this adds more work for programmers and makes the system more complex.

How do they solve Relational database problems?

So how does a NoSQL database fix these problems. A major problem they solve well is the need for an ever changing schema. A NoSQL database can store data in many different ways, and new attributes of an object can be added or removed. This allows for a dynamic schema that can be changed on the fly, without having to take a server offline to migrate a new schema.

Sample document file from MongoDB docs...
{
_id: "joe",
name: "Joe Bookreader",
addresses: [
{
street: "123 Fake Street",
city: "Faketon",
state: "MA",
zip: "12345"
},
{
street: "1 Some Other Street",
city: "Boston",
state: "MA",
zip: "12345"
}
]
}

It also allows for objects of the same class to have differing amounts of information. For example, PersonA might have an address for home and work, while PersonB may only have one for home. PersonB doesn’t have a nil value for his/her work address like a relational database would have, they just don’t have that information. There are no empty fields for PersonB, which greatly reduces memory usage. Even though the table would only have a nil value in PersonB’s work address column, this nil still takes up space in the database.

NoSQL databases naturally scale horizontally, meaning that they can be spread out through multiple servers. Instead of needing extremely powerful servers, a group of commodity servers can evenly split the querying and workload automatically. Not only is this cheaper, but it is possible to use services like AWS (Amazon Web Services) to host a database for an application.

Caching is handled much better in NoSQL databases. Most of these databases support integrated caching. This allows the database to keep the most used and accessed data in memory, at all times, for quicker queries. With this information already kept in memory, the query does not even need to go all the way to the database.

This seems too good to be true!

And it actually is. If NoSQL databases fix all the limitations of SQL databases, why isn’t everyone using them? Well with all the improvements that have been made, there are still several problems with these dynamic databases. One major problem is the lack of ACID transactions. While some NoSQL databases claim to adhere to ACID standards, they are usually not following it fully.

What is ACID, and why is it important? ACID is a set of guidelines that database transactions should abide by, in order to prevent bugs or incorrect data in the database. It stands for Atomicity, Consistency, Isolation, and Durability. Atomicity refers to a transaction either having all parts occur, or no parts at all. If one part of the transaction fails, then all other parts should be rolled back and reverted to their original state. Consistency means that the database should only be changed from one valid state to another. This means the transaction should be valid, given all rules and constraints set forth by the database and programmer. Isolation means that given a certain transaction, the result of it being executed concurrently should be the same as if executed sequentially. Durability refers to the idea that the database should record the transaction and commit it to memory, even if the server loses power or shuts down unexpectedly after the transaction. This is done using non volatile memory stores that do not require energy in order to persist the memory. Without following these four guidelines, the database is susceptible to errors and crashes.

Most NoSQL databases also do not support joins in their querying. There are some ways that work around this, but they can lead to errors. One work around is that instead of doing one query to try and get all the related information needed, it is possible to do many smaller queries. This is possible because NoSQL queries are usually much faster, meaning that even if multiple queries are needed to get certain results, they can still be quicker than doing a join query in relational database. What I mentioned earlier about storing the data of other objects in addition to references to these objects can also solve this lack of joins. This allows one query to a single object to get all relevant data, such as a blog post and all its comments. But this leads to another problem called eventual consistency.

Eventual consistency is an idea that came about because of the lack of ACID standards. So let’s say a NoSQL database has all the information for a blog post in one file. The data in the file, such as the comments and the usernames of people who wrote comments, might exist in multiple places in the database. So if one instance of the data is changed, like the username of someone who commented, it is not always changed everywhere else. This is what eventually consistency refers to. Its the idea that the database will eventually normalize and create the changes everywhere. But within that time that it is not consistent, bad data can be returned and used by the program. While this is usually only a couple of milliseconds long it still can happen.

Another reason these databases aren’t being adopted by companies is that they are already so invested in relational databases. So much money and time has been put into already existing systems that it is not worth the trouble. It would be quite difficult for an already established company, with a huge database, to switch over and move their data into a NoSQL database.

So which should I use?

Overall the choice of which kind of database relies on your needs. NoSQL databases are perfect for programs that require ever changing objects that have new attributes, or information added to it constantly. Companies such as Facebook, Netflix, Google, and Amazon all use NoSQL databases. In fact, many of these companies developed databases for themselves to use, and which they have made available to other companies. This is because these companies require the ability to alter their databases on a constant basis to keep their services running smoothly. Anytime Facebook adds a new feature they can simply add the data necessary to their database without having to take it offline or migrate. However if you know that your program will not need to scale, and will stay relatively small a relational database may be better for you. They are easier to work with because SQL is a much better querying language than any of the APIs used in NoSQL databases. Relational databases are also better in that that they have an agreed upon set of guidelines, which makes them consistent from company to company. So it is really up to you as a programmer to decide which database makes sense for your application.

Sources :