Choose the Right Database
Understand the fundamental differences of databases and how to leverage them.
MySQL, Cassandra, HBase, Redis, etc. There are so many databases. One might wonder why do we need so many? In this blog I will take you through the practical limitations that require us to have these DBs. It should provide you with a starting point for choosing a DB.
Traditionally and even now, industry extensively uses RDBMSes like MySQL, Postgres, etc. It’s a simple concept to understand and works brilliantly for most use-cases. The problem occurs when we need to scale. RDBMS can only scale vertically. i.e. getting a more powerful machine. But there is a limit to how big a machine you can buy. For e.g. at the time of this writing AWS provides 3TB RAM and 128 core instances. It sounds big but what if you need to manage billions of active records and millions of read writes per second? The only solution is to scale horizontally by using a clustered DB.
In clustered databases, data is partitioned across multiple nodes which are interconnected over a network. In such databases, under heavy workloads, some nodes may go down or there may be network latency, network failures, etc. The DB is expected to work well even during such cases. This is the partition tolerant nature of the DBs.
Let’s consider a DB table with partition tolerant nature. It receives a write request on one of its node. Now during this operation a read request was initiated. The DB has two options on how to execute this read operation.
- Wait for the write to complete and its success or failure to be communicated to the cluster. It may then initiate the read.
- Return the partially updated data even when the write was not complete.
In the first case the DB is waiting for the data to get consistent across cluster before initiating a read. Such databases are consistent under network partition. However, if the write takes too long, it may return a timeout error.
For databases taking second approach they will be providing availability under network partition. As the name implies the data is always available irrespective of whether it’s getting modified or not. The diagram below should explain this approach:
We summarise these concepts below:
- Partition tolerance: Return data despite network delays, failures and node failures.
- Consistency: Return the most recent write else an error.
- Availability: Return the data that the request sees during query execution.
CAP theorem is just the observation we made above. Under network partitioning a database can either provide consistency (CP) or availability (AP). Note that a DB running on a single node under a some number of requests and duration execution time will provide both consistency and availability.
Here we always mean strong consistency whenever we say consistency. Some databases refer to eventual consistency. Such databases are not consistent under CAP Theorem and will fall under AP.
Choosing a Database
How will CAP theorem help us? It depends on your use cases.
Use-case of a CP Database
Let’s say you have a stock brokerage service. The users of this service are supposed to be shown the most recent price. Showing an older price is not desirable. Worse yet if the prices are not consistent then some of the triggers, to buy or sell stocks, may not hit because of which the users may suffer a serious loss! In such cases a CP database is desirable. Also note that since only the stock market writes the new prices in negligible time, for individual stocks, we don’t have to worry about waiting for writes to complete.
Take another use case. Let’s say we have to create a customer order history page for a e-commerce website. Using an AP database will display less number of orders if a node with inconsistent number of orders were hit. This may result in the customers calling the customer care to find out what’s really going on. However, a CP database can return a ‘please try again later’ which is OK.
Use-case of an AP Database
Now consider a use case where an e-commerce company has a table with product info with of millions of products. A developer is asked to do a bulk update to product prices. In such cases the update will take good part of an hour. We will have to wait for all prices to be updated before any read query takes place. This is not desirable as the users of the company will face a downtime resulting in loss of business. We will have to return the partially updated data till the update is done. We will have to pick an AP database.
As a rule of thumb, while choosing a DB you really should break down your use-case into the read and write durations and whether you have so much data to warrant a clustered DB in the first place. If the data and number of requests is not large enough and the queries have low response times consider a CA database. For clustered DB identify cases which can cause your DB to timeout. If you can live with the timeouts go with CP databases. Else the only remaining option is an AP database.