Selecting an application database

I would name selecting a data storage technology one of the most important technical decisions a project has to make. A poorly chosen data store can prevent you from meeting your software requirements at all. The choice deserves an investment in thoughtful design and research, and it should be made with full responsibility. Here I want to share my knowledge and thoughts on how to choose the right database for your task.

Understand the difficulty of the problem

While some applications need to store petabytes and serve them with sub-millisecond latency, others need simple storage for data that fits in the memory of your laptop and can wait seconds to become available. Not all applications are equally demanding, and the technology choice is not equally important for all of them. Mission-critical systems require thoughtful design: feature comparison, prototype implementation, performance testing. Not every system needs these steps; for simple tasks you can stick with technologies your team has already adopted. Invest time proportional to the value of the problem.

Gather requirements

Everything in software engineering starts here. Try to understand your data model and the ways your application will interact with it. Estimate the size of your data. Get throughput and latency requirements. Understand which consistency guarantees are required and whether you need linearizability.
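
As a minimal sketch of such an estimate, here is a back-of-envelope calculation with made-up numbers (the event rate, payload size and retention period below are hypothetical); even this rough arithmetic tells you which class of storage problem you have:

```python
# Back-of-envelope sizing; all numbers are hypothetical, plug in your own.
events_per_second = 2_000      # assumed peak write rate
avg_event_bytes = 1_200        # assumed average payload size
retention_days = 365           # assumed retention period

writes_per_day = events_per_second * 86_400
raw_size_bytes = writes_per_day * avg_event_bytes * retention_days

print(f"writes per day: {writes_per_day:,}")
print(f"raw size: {raw_size_bytes / 1e12:.1f} TB over {retention_days} days")
# Roughly 2k writes/s and ~75 TB/year is a very different problem
# from a few gigabytes updated a couple of times per minute.
```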

Consider SQL by default

Today SQL vs NoSQL is not really about the SQL language. Traditional SQL databases like PostgreSQL or MySQL already support unstructured document storage in JSON and other formats, while NoSQL databases have evolved query languages that come close to SQL in capability. The main difference is that SQL databases are built by default for a non-sharded setup with a single write instance and a few read replicas. That is how full ACID guarantees can be provided without heavy 2PC-like protocols. Consider anything other than a “30-year-old legacy” SQL database only if SQL cannot meet your requirements in terms of data size, throughput, latency or query capabilities. With SQL you can be sure you can query almost anything and keep your data durable and consistent at all times. You stay on the safe side if requirements change or you realise you made wrong assumptions about your data model.
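
To illustrate the point about document support in relational databases, here is a rough sketch of storing and querying JSON documents in PostgreSQL through psycopg2 (the connection string, table and fields are hypothetical):

```python
import json
import psycopg2  # assumes PostgreSQL and the psycopg2 driver are available

conn = psycopg2.connect("dbname=app user=app")  # hypothetical connection string
cur = conn.cursor()

# A relational table with a JSONB column for unstructured payloads.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        id      bigserial PRIMARY KEY,
        payload jsonb NOT NULL
    )
""")
cur.execute("INSERT INTO events (payload) VALUES (%s::jsonb)",
            [json.dumps({"type": "click", "user_id": 42, "page": "/pricing"})])

# Query inside the document: containment (@>) plus field extraction (->>).
cur.execute("SELECT payload->>'page' FROM events WHERE payload @> %s::jsonb",
            ['{"type": "click"}'])
print(cur.fetchall())

conn.commit()
conn.close()
```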

Anything better comes with a price

To do anything better than an SQL database, we have to sacrifice something. Usually that something is consistency, durability, or the ways you can access and manipulate your data. Need to write more data faster? Don’t flush to disk immediately, or don’t update indexes immediately. Writes cannot be handled by a single instance? Drop consistency between documents and spread them across a cluster. Want faster access by key? Change how the data is stored and give up some query capabilities. These are the rules of the game, dictated by real-world hardware limitations. Every characteristic in which a NoSQL database outperforms SQL is paid for with some SQL database capability. There is no free lunch here.
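
A small way to see the durability-for-speed trade-off on your own machine is the sketch below, using SQLite from the Python standard library: with synchronous writes turned off, commits stop waiting for the disk and get faster, but a power loss can now lose acknowledged data (the file name and row count are arbitrary):

```python
import sqlite3
import time

def time_inserts(synchronous_mode: str, rows: int = 1_000) -> float:
    """Insert `rows` records, committing each one, and return elapsed seconds."""
    conn = sqlite3.connect("tradeoff_demo.db")
    conn.execute(f"PRAGMA synchronous = {synchronous_mode}")
    conn.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, payload TEXT)")
    start = time.perf_counter()
    for i in range(rows):
        conn.execute("INSERT INTO events (payload) VALUES (?)", (f"event-{i}",))
        conn.commit()  # per-row commit makes the fsync cost visible
    elapsed = time.perf_counter() - start
    conn.close()
    return elapsed

# FULL waits for the data to reach disk on every commit; OFF hands it to the OS and hopes.
print("synchronous=FULL:", round(time_inserts("FULL"), 2), "s")
print("synchronous=OFF: ", round(time_inserts("OFF"), 2), "s")
```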

The right question contains the answer

Assuming the problem is big enough or specific enough, classifying it already points to an answer. Need to store key/value data? Look at key/value stores. Write-intensive workload? Look at write-optimised solutions. Want to store relationships? Look at graph databases. But again, only if your problem is big enough or specific enough to justify it.
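
For example, if the workload really is nothing more than lookups by key, the whole access pattern looks like this sketch with redis-py against a local Redis instance (the key and value are hypothetical), and a key/value store handles it with far less machinery than a relational schema:

```python
import redis  # assumes a local Redis server and the redis-py client

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# The entire "query language" this workload needs: put and get by key.
r.set("session:4f2a", "user_id=42;expires=1735689600")
print(r.get("session:4f2a"))
```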

CAP classification is not that helpful

http://blog.nahurst.com/visual-guide-to-nosql-systems

While the CAP theorem is absolutely valid, classifying databases into CP, AP and CA is usually not that useful. Most NoSQL databases can be tuned to the level of consistency you require. Focus instead on understanding the database architecture and which problems and use cases it typically solves, and look for the experiences of others.
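
As one example of such tuning, Cassandra, often labelled “AP”, lets you pick the consistency level per statement. A rough sketch with the Python driver (contact point, keyspace and table are hypothetical):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster   # assumes the cassandra-driver package
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])         # hypothetical contact point
session = cluster.connect("app")         # hypothetical keyspace

# The same write can be fire-and-forget (ONE) or wait for a majority of replicas (QUORUM):
# stronger consistency costs latency and tolerance of node failures.
write = SimpleStatement(
    "INSERT INTO events (id, payload) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write, (42, "click"))
cluster.shutdown()
```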

Don’t fully trust the marketing

According to its white papers, every database is the best, fastest and most reliable in the world. No one puts limitations on the front page or in the whitepaper; they are usually buried in the documentation, or hidden so well that revealing them comes as a surprise to many people. The “Jepsen” analyses by Kyle Kingsbury are a good example of exploring database limitations in hands-on tests. Measure, don’t trust. Benchmarks are a tricky business: you can find one where database X outperforms database Y and another where Y outperforms X. Only your own setup and your own data can give a confident answer.
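
In that spirit, here is a minimal sketch of a benchmark harness you could point at your own candidates and your own data. It times batched inserts and reports the median over several runs; the table, row counts and connection factory are placeholders to swap for the databases you are actually comparing:

```python
import sqlite3
import statistics
import time

def bench_inserts(connect, rows: int = 10_000, runs: int = 5) -> float:
    """Return the median seconds taken to insert `rows` records per run."""
    timings = []
    for _ in range(runs):
        conn = connect()  # placeholder: a factory for your candidate database
        cur = conn.cursor()
        cur.execute("CREATE TABLE IF NOT EXISTS bench (id INTEGER, payload TEXT)")
        start = time.perf_counter()
        cur.executemany("INSERT INTO bench VALUES (?, ?)",
                        ((i, f"payload-{i}") for i in range(rows)))
        conn.commit()
        timings.append(time.perf_counter() - start)
        conn.close()
    return statistics.median(timings)

# In-memory SQLite as a trivial baseline; repeat with your real candidates and real data.
print("median seconds:", bench_inserts(lambda: sqlite3.connect(":memory:")))
```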

Understand the limitations of free editions

I cannot name a single popular database without a company making money behind it. Even technologies born and open-sourced at companies like Facebook and Google already have consultancies and vendors offering commercial editions. Trickier still are the databases born out of the NoSQL trend: they are developed by companies that make money on these technologies, which is fair enough. That is why the free edition may be heavily constrained, pushing you towards the commercial one. Understand the limitations of the edition you select. They frequently concern security, replication, monitoring and bug fixes, which are vital to many systems. Do this early in your project and align it with your budget.

Understand the limitations of hosted options

There are many databases provided as a service, such as Heroku Add-ons. You don’t have to deal with infrastructure, replication setup or backups; everything is ready for you to start implementing the application. Hosted databases are usually more expensive than bare-metal machines on which you can set up any database and license you like, but the price difference is not the only cost. Maintenance as a service keeps some administrative capabilities out of your hands: you might not be able to create multiple users with different permissions, install extensions, or even change the durability/consistency level. What you get for sure is compliance with the database communication protocol, but not necessarily all the features. Evaluate these limitations carefully.

Is it the right time to play?

Experience matters. Knowledge inside the team matters. Old and known beats new and unknown. Heterogeneity brings complexity. Distributed systems are tough. Building databases is tough. Old and tested beats new and fancy. Trying a new technology is always a risk; trying several new technologies at the same time is a big risk. Not every task and not every circumstance allows you to take that risk. Decide whether you can afford to try a new technology that promises to be better than the one you have already adopted. But not every problem is a nail you can fix with the hammer you have. There is no “best” database; there is only the most suitable one for your problem.