NoSQL or not NOSQL: Learn when you should use it
In the last years, we follow the rising of NoSQL technology and its employ in an even more extended set of application. This article aims to make an objective comparison from SQL and NO-SQL technology and try to clarify some unclear aspect to help people to choose it’s backend knowledgeably.
- What is NoSQL
- NoSQL implementations
- Advantages of NoSQL
- Limits of NoSQL
- Performance comparison
- Points of interest
What is NoSQL
In simple words, NoSQL is a new data storage backends that do not follow the relational DB model. That means we are speaking about a “container” that works differently from traditional SQL based backends.
NoSQL technologies born to cover a set of new need came out while the traditional database was already mature. Of course in last years application needs changes and became even more discerning (big data, clustering, file repository), so this new storage system was designed keeping in mind these new requirements.
But, what I mean for “requirements”? Here a set of cases that NoSQL is designed to support.
- The application works with a large amount of data (Big Data)
- The application works with data that quickly change relationship and data types (semi-structured, unstructured and polymorphic data).
- Developers work in a small team using agile approach: a lot of small spring against long term waterfall iterations
- Application served as service, may be published on the web,
- Application delivered to thousands of user instead of few people inside the company
- No certain about the future load of application: need to be scalable and dynamic, need base software on a backend cluster easily
There are many NoSQL solutions offered by the market, open-source or not. Each of them works a little bit differently, maybe specializing so some particular need, but the basic idea and the common feature is to offer better scalability and performance. To do this they give up to some features of generic RDBMS, introducing new ones, but keeping enough functionality to be useful.
One of the breaking changes from SQL DBs is that, meanwhile, SQL backends are a general purpose storage system, NoSQL distribution is focused on a specific kind of data. This allows DBs to be more efficient on their scope and allow to have a more performant system. In this section are reported some kind of NoSQL database to give an idea of what application they have. Note that they can be used together (also with traditional SQL systems) to get the best from all systems.
This type of database doesn’t need to have a consistent data structure, so they are useful when you have to work with polymorphic data, or data structure changes constantly. This kind of backend can convert normalized entities like key-value dataset or EAV models to simple documents set.
- Goal: Storage non-typed set of “records”, called “documents”
- Examples: MongoDB, CouchDB
- Target: Heterogeneous data, working object-oriented, agile development
We told NoSQL database removes the concept of relationship to achieve better performance. In this kind of db this is not true. Instead, these databases enforce the concept of relationship.
Their goal is to define data by its relation with other data. This kind of database can be useful when most of the data structure is designed to keep the relationship with entities (i.e.. if you have a lot of tables when there are mostly foreign keys columns).
- Goal: describe data relations
- Examples: Neo4j, GiraffeDB.
- Target: Data Mining
This is a kind of db designed to store a big amount of key-value pair data. This can be useful when the database is used to store properties, translations, or caching purpose.
- Goal: Store data in the key-value form
- Examples: Redis, Cassandra, MemcacheDB
- Target: key-value storage
Advantages of NOSQL
We are aware that NoSQL db has some interesting advantages and they can resolve easily simple problems that traditional RDMS can’t. Nowadays their large employment in a critical system, like big cloud systems and some large Saas product, confirm they are mature and useful. But question is, why I should move to them? And in this case, when the move is profitable? We can’t base such kind of decision only on our impression, and read some vendor brochure where it’s granted that NoSQL is very cool is not enough. Moreover, we cannot stay on an inadequate platform just because we fear the change.
In this chapter, I’ll try to explain why this solution could be good enough to move on it and what use case make it more profitable.
As we said NoSQL databases were created in response to the limitations of traditional relational database technology. This means we will find some improvement, or better, some feature that in traditional RDBMS are not present and can’t be added, even if the producer would implement them.
The advantages of NoSQL include the capabilities to handle easily:
- Big data: where with this term we describe data sets containing very large volumes of data.
- Mutable data: Data could be structured, semi-structured, and unstructured as well. NoSQL can also manage the transformation of data.
- Dynamic development: In contexts where we need agile sprints, quick iteration, frequent code pushes, and in summary to be responsive to changes, having a database that embraces dynamism is very helpful.
- Object-oriented: programming that is easy to use and flexible
- Expandable: we can easily implement an efficient, scalable architecture instead of expensive, monolithic architecture. Even in traditional db, we can do that but it is harder and limited.
- Open Source: most of the solution is open source, so with no cost for licenses
NoSQL databases are more scalable and provide better performances, and their data model it’s closer to the domain model used inside the application. Today, companies starting project basing on NoSQL databases are growing. NoSQL databases also tend to be open-source and that means a relatively low-cost way of developing, implementing and sharing software.
Limits of NO-SQL
In evaluating the limitations of NoSQL databases, it’s important to bear in mind that the NoSQL world is a diverse ecosystem. Not all NoSQL storage products are subject to all drawbacks to the same extent. And that’s a good thing since it means that organizations have a lot of options when weighing the pros and cons of different NoSQL solutions in order to decide which one is best for their specific needs. In this chapter are summarized some feature you may miss using a NoSQL solution.
By reading the article you’ll find this chapter expanded more than the advantages one. This won’t be a way to discourage to use NoSQL. This chapter would be an impartial description of all limits of NoSQL technologies and simply want to let you know every possible problem you could find employing them. Many points could be varied by the implementation (i.e.. when I said there are few tools in support, it could be right for most of these but not for all ), so consider them just an overview that will alert you about the possible risk you could find. What I expect is that, after you choose a NoSQL product to employ, you can use this chapter as a checklist to understand if this issue is present in you specific db and if is relevant for the application.
Security is something everybody wants but that’s hard to reach. Theoretically, there could be security issues on every technology There was, and maybe there is security issue also on SQL system. So why I mark this as a possible issues on NoSQL? There is no real issue on NoSQL “concept” related to security, but we can have security issues related with the maturity of the product we employ. Security issues come out while the product grow and they are fixed. Intuitively, a young product could have many unknown security issues. Moreover, a young product is on the market for few time, so consultants hadn’t time to make the experience on them and many security constraints could be neglected. So, the problem is the youngness of most NoSQL platform. For business use, I suggest employing only mature solutions, with a vendor behind.
When we start learning RDBMS, they have taught us that ACID transaction is the best option to make operations that remains consistent across the whole database. Well, most of NoSQL technology doesn’t implement such kind of transaction. NoSQL systems are based on the principle of Eventual consistency. In practices, embracing a little risk of consistency (one node could be out of sync with others), they gain some performance boost. Yes, it is a compromise, but we cannot have everything.
I have to mention that some NoSQL implementation, like FoundationDB, allows ACID-like transactions keeping NoSQL performances high. By the way, while we stay on NoSQL, data consistency remains a critical part: basing on the application you are developing this could be an issue or not.
When you speak to someone that is trying to convert you to NoSQL technology, one of the first benefits you could hear from they is the benefit of performance due to the removal of relations. All we agree that relation may bring performance reduction, but what we lose removing them? It’s like you was in dip rise with a big and heavy backpack on your back. Of course, dropping it you will go faster. It is convenient to do that? Depends on what this pack contains, depends on what is the value of the backpack content has for you. If it contains the tent for the night, maybe is better to reach a destination one hour later but sleep warm then go faster. If you are bringing useful disposable things, maybe you can do the opposite.
Following this parallelism, can we accept to loose consistency to get performance? It is convenient?
To make a step back, I’ll start from the origin of joins. RDBMS use the relation to link data from one table to another in order to keep data in a single place and do not replicate them. Join is constructed that allows to reconnect them in the query. Of course, make a join between tables is an additional computational cost, respect to finding data directly inside the table you are querying. But this cost is necessary to keep relation (no replication, consistency).
It’s clear that while such construct has an acceptable overhead, this is ok, an probably the best choice. But what when it slows down everything or require too much hardware? This issue allows NoSQL developer to claim the lack of JOIN to a feature, but NoSQL is the solution?
Not always. Sometimes we just need to redesign database structure, maybe removing some relation or make restructure data. Yes, we will lose some relation o we will replicate some part of the date, but it could be acceptable ( in NoSQL we will lose all of the relationship).
Another problem is with consistency. Think about category and products. We may have a nested tree of categories, with many products as the leaf of the tree. In traditional RDMS, change the category tree is just an update on a foreign key (self-relationship) on the category table. This changes automatically reflect on all child categories and products. In the NoSQL way, we could have redundant data on all category\product and a change will need a massive update on child elements.
Let me assume that our application can renounce to JOIN for gaining speed and in our case, it is an acceptable trade-off. We yet said that in may NoSQL implementations it is hard to keep the various entries consistent. When you work without transaction you can make many operations in sequence but after a crush you get inconsistency. This is true for firsts implementations of NoSQL and some new technology try to give out of the box something that offers out of the box transactions. You can also think to manage transaction at the application level, trying to rollback dirty data, but it could be very hard to manage in any case.
Missing standard across vendor technologies
SQL is a standard language. There could be many variations that bring to specific dialects, but this is complex doesn’t forbid to abstract data access. Think about Hibernate, NHibernate, Doctrine, Entity Framework or the other ORM you like: they are the proof that the difference between SQL dialects is not so relevant. We can conclude that SQL is a standard language, even if many vendors implement different database technologies. Also in case you are not based on ORM layer, if you produce a query for a db, most of the code could be reused in others. This makes migration easier and developers can adapt quickly to different DB solutions.
On the other side, in the NoSQL world, there is more confusion. Each vendor implements their specific syntax without involving any shared standard. This means it is harder to migrate an application between different NoSQL implementations. This means it is harder to find a programmer that knows well many NoSQL technologies.
Schema flexibility could be a trouble
One of the peculiarities of NoSQL system is that they do not require a schema. In practices is the programmer that decide data structure in the moment he saves it. So there is no place where it’s written how data is structured of what is the meaning of data. Even if you could easily recreate a DB model starting from data relation using some automated tool, this could be something missing in traditional applications. Moreover, what if a bug occurs? We know there could be situations where something is wrong with the code. Traditional RDMS are scaffolded, so in case you switch some fields or you are wrong with field format they protect you from inconsistency. In NoSQL case there is no help from the DB, because without any schema defined, there isn’t any information about data should be saved: nobody can say if data is wrong or not. The worst side effect is that process bring a lot of power and a lot of responsibilities to the developer, that often don’t know all the process or the structure.
Moreover, even in case you now know what is saved where do you think you’ll remember everything the next month? and the next year? Not all project are subject to continuous development, there could be the business application that stays as-is for years, before we need to make some changes. Anyway, in IT, companies often commission a project to some supplier so this part has to be taken in account to ensure an easy handover at the end of the project, maybe asking for accurate documentation about data is structured and what each field\collection means. The last problem related with schema flexibility is that every member of the team could not work in the project for all its life, so turnover is critical on little teams were not all the members have full knowledge of data structure or there isn’t adequate documentation.
Saving many nested data inside single documents you may lose analytic features like “SUM”, “COUNT” and so on. The bad things that this could not be a problem during first application development, but someone could ask later for some report, so what to do in this case? It is hard to change data structure after the database is filled, and doing that could have may unpredictable effects due to the leak of a well-defined data structure. Analytics is a hard point for NoSQL.
Moreover, while there are many commercial tools you can connect to your traditional DB to manage analytic part, there is limited support for NoSQL systems.
Another solution could be taken is to replicate some sort of “relationship” with unstructured data inside NoSQL DB, maybe creating many collections and linking object one with others. If you are planning to follow this path to allow analytics reporting, keep in mind that this could slow down performances to be comparable with standard SQL systems. This could be acceptable when the part involved in this DB is minimal and the record count is limited. Anyway, even from my experience, construct that allow to join data on NoSQL query, because there aren’t well-defined relationship behind, are very limited and performance are not so good as we expect (i.e.. at the time this article was written, MongoDB do not support inner join and can evolve only to table each time without creating temp tables).
We spoke about the lack of standardization on NoSQL query language and syntax. This problem may be reflected also in tools, together with youngness of most platforms. I’m speaking about tools for querying, but also for migrating data between database, manage backup, etc..
Of course, most NoSQL projects are growing, and we expect tools will grow with them, so this problem will be automatically solved in the near future.
The lack of standardization makes hard for third-party vendors to build tools that can support multiple NoSQL solutions. Moreover, young platforms mean fewer users, less customer, and less time to develop mature tools.
It is important to specify how comparison is made. First of all, I needed to place both solutions in the same condition. This means, for example, to use the same hardware and have the same level of tuning. So I installed MongoDB (last version) and SQLServer Express on the same machine. Because we are not interested in performance inside the database itself I built my benchmark using C# code basing on the standard framework.
Upon this two way to save data, everything is shared (entities, logic, data generation) to ensure equity.
The list of all operation we will compare:
- mass insert
- transaction (or better, in NoSQL case, transaction simulation)
Mass operation on a single entity
This benchmark consists of a big set of object to insert to be performed in less time is possible. This test is replicated using a growing number of items to save to prove how performance scale in both systems. This benchmark measures execution time in ms and insist on a single table\collection.
This benchmark is focused on query feature. We separate the following pattern:
- CASE 1 get one entity using primary key: this pattern is used to fetch a single entity from DB using is a unique identifier
- CASE 2 full scan with fail: when you are looking for a deleted element and database have to scan all index before reply “no”.
- CASE 3 Paged query: a complex query where you have some filters, one order condition, and you want to take just a page of data.
I created some benchmark simulating different ratio of patterns above. In example, the first benchmark assumes 5% of queries of kind 1, 70% of kind 2 and 25% of kind 3. This benchmark measure execution time in ms. This benchmark insists on a single table \ collection.
You can find all the code used to perform these test on git-hub.
Benchmark description CASE 1CASE 2CASE 3Benchmark 15%70%25%Benchmark 210%45%45%Benchmark 315%8%77%
First test is on a “small” set of data, about 2.500.000 rows.
Performance comparisonbenchmarkNoSQL(ms)SQL(ms)Benchmark 147244Benchmark 2113Benchmark 3112
The second test on a “bigger” data set, about 5M rows.
Performance comparisonbenchmarkNoSQL(ms)SQL(ms)Benchmark 156287Benchmark 2217Benchmark 3218
This benchmark highlights a big improvement of performance on query over the index, but when MongoDB is used to read a set of data the gain is reduced and kept stable over data increase.
We know transaction on NoSQL world is mostly unsupported. We also understood that renouncing to the transaction could benefit from performance, a question is: how much I gain from this? I built this benchmark to compare insert in the transaction of one master row related with many children. Benchmark is focused on execution time, expressed in ms.
Performance comparison#transactionsSQL (ms)NoSQL (ms)10998910010047641000110257309
This benchmark is focused on analytics. Suppose we have a categorized master-detail data model, where you want to:
- export: whole join overall data tree
- report: sum all items in a category for all category, i.e. give invoice amount for all customers
- KPI: sum all mater totals summing detail subtotals
On a base of 4M row after inner join:
Performance comparisontestSQL (ms)NoSQL (ms)KPI11761403REPORT8051363
Points of interest
Revolutions are inevitable in a tech context. A new technology comes and bring some revolutionary features, but often have to defeat developer preconceptions. Sometimes are misunderstood so their weaknesses come out after they are employed.
About innovation we have to “opposite” cluster of people:
- “enthusiastic” people: that want to embrace change unconditionally and are ready to dispose of everything done in the past to work with last technology;
- “conservatives” people: that hate changes and prefer to stay on its habits, rejecting any new technology.
On real life, we have to stay on the middle, so it is important to know and understand what new technology could do for us and be ready to employ new technology before they are needed on projects. “Trial on the job” is a bad habit that can lead to bad results.
The same principle is applied to NoSQL technologies. Because we looked at NoSQL some times ago, now we know pro and cons, so we can take advantages of such kind of tools. When we analyze that technology, we cannot stop looking at the things we miss from our traditional habits, like transactions, schemas, and standards. We need to study and familiarize with these technologies that until some years ago was young and new, but now are a concrete option. Study, learn, understand, employ: this is the nature of progress.
When I should use a NoSQL Db?
Why NoSQL would it be better than using a SQL Database? After you have read this article I’m sure you’ll understand that NoSQL isn’t a replacement of SQL database, but just different storage system with different features and are usefully in some particular fields. So the answer cannot be different from “it depends”. Because it really depends on many characteristics of the project.
To be honest and cautious, NoSQL is the best solution when all the following statement is true:
- when your project needs to scale, or it could in the future.
- when you have to work on big data or your data will be big in the closer future
- when analytics component in the application is simple, or not so important
- when your application needs fit database purpose (i.e.. you are saving data in graph and database do it)
In some cases, NoSQL could be a good alternative, but not essential to build a durable infrastructure. Of course, if in your application a NoSQL system cover the 99% of all needs, there is no reason to couple it with an RDBMS. But if you need relations, transaction, and other features of standard RDBMS maybe it’s better to employ them as the main storage system and use NoSQL to cover only the critical parts (may be derived from the size of data).
How much better are performances in the case above?
It depends on the specific use case. From one side we have lots of benefits on large tables or massive usage, but from the other side, we have some performance loss using lookup instead of joining on a small dataset. A realistic estimation, if there is the basis to employ a NoSQL DB, we can improve performance from a 10 to 100 factor. Of course, this estimation takes into account all aspect of an application and it is related to the final user experience. That means you could measure better speedup on the db layer, but the final user experience is distorted by many factors that reduce the gap (cache, network delay, page rendering). Just to explain what I’m telling, take in the example the limit case when there is a page that makes a query and returns data. Assume for this page 500 ms for getting the result using the traditional database, 50ms using NoSQL, and 200ms for rendering the page and 1 second for transfer data over the internet. Improvement of performance on the db layer is -90%, but for the final user, there is only 450ms gained in 1700, so only 26%. With this example, I would explain that is hard to measure improvement on a complex system and in many cases, NoSQL is not enough to resolve the performance issue. To be more direct, if you think to resolve the performance issue due to bad design in an application moving to NoSQL, you are not in the right path.
But the big question is: what I lose to get this performance? Because in some case it is not possible to renounce to some feature like transaction or relationship. This is very important to understand before the move.
NoSQL system is mature to be employed in a production environment?
Depends mainly on your needs, or better on the project requirements. We can state that NoSQL is certainly mature enough to use. So, if you need, you can use without any fear. But not all application needs to work with big data or to scale that massively. Most of Saas product does, also much critical application inside the enterprise context does, but most of the application nowadays are still very simple. On my experience, it’s hard to find tables with more than 100 000 rows in a database. Think to your DB, exclude the 2–3 bigger table you have on it and look at the row count. How much big they are? Usual DB structure on DB application count lot of “little” table (less than 100 000 rows) related. For such kind of application, a traditional RDBMS is sufficient and it will be forever. What is important, instead of starting to employ it, is to understand the benefit and development pattern to be ready when they will be needed on your case.
Is SQL obsolete?
When men invented the airplane, cars became obsolete? No, of course. Also if airplanes are faster than cars. They simply are two different systems to move people, with different characteristics. Basing on the kind of travel you are starting, the time you have to spend on the travel and the budget, you’ll decide what’s the best suitable alternative for you. In the same way, NoSQL coming don’t make obsolete SQL. They are simply two different way to store data, with different characteristics. You’ll decide what’s best for you based on your need.
There are problems SQL isn’t suitable for, so You don’t have to start big data project with it. It will be like to try to reach an island with the car instead of by plane. But SQL still got its strengths. Lots of data models are simply best represented as a collection of tables which reference each other. It will be like trying to go to buy the milk using a plane. NoSQL databases aren’t a replacement for SQL, but they are an alternative.
Is the market ready to NoSQL?
The key point to answer to this question is closer to the experience that the developer achieve. Most database programmers were trained for a year to think of data in a relational way. How they could change to thought differently in so few time? It is not so easy, especially when the developer has to work in many projects together and some of them are SQL and other NoSQL. The temptation of reproducing the same pattern on SQL to NoSQL system is hard to defeat and often lead to bad results.
So actually, there is much more know-how for SQL around, there is more developer well skilled on RDBMS than on NoSQL. Meanwhile, there are DBA that spends most of their time into focusing on relational databases, we can’t expect to find the same on technology born less than ten years ago. SQL is reached at school and in universities, NoSQL is starting to be.
After this first point, a secondary one is that because these systems are newer there are fewer development tools, or they are not so advanced like others, but I’m sure this is not a real problem. There is some “enterprise ready” solution, that comes out with tools that manage all basics need and all we hope that these tools will be growing with the growth of their platform.
What is the best solution?
There is not the best solution that covers any case. The answer is simple and still the same: “it depends”. With this article, I hope to have given you an overview of all capabilities of such systems and some basics to understand when they could be useful