Designing a System with the Right Data Store

Published in

Hash#Include

5 min read · Feb 21, 2020

This article talks about the factors to consider while choosing a database. Choosing the right data store plays an important role in designing a large scale system, because it becomes difficult to change later, once the system is big and handling heavy traffic. You can’t afford to mess with your system at that stage, and a migration might even require changing the architecture of the entire system. Here, system refers to a domain which can have multiple bounded contexts, each of which can have multiple microservices. So we need to be very careful about migrating a data store, and we need to know the impact of doing it.

As a system designer, it is essential to understand the internal workings of various data stores. That means knowing about indexing strategies, storage engines, read and write query performance, sharding strategies, local vs global indexes (under sharding/partitioning), partition rebalancing, replication strategies, etc. We also need to understand whether the system is read heavy or write heavy, how we access domain entities, and whether we access all related entities together. We can’t design a scalable system if we don’t understand these concepts, and we might have to face the consequences of choosing the wrong database.

Now let’s talk about the sequence of steps which can be followed while designing a new system.

Understand the business problem, and come up with initial microservices : As system designers, we are always solving a business problem. We might end up designing something else if we don’t understand it completely. Always try to understand the business problem first, and then think about designing the domain, sub-domains, bounded contexts, and microservices around it. We can take help from someone who is an expert in the given business domain. This is a very critical part of designing a new system: if we get it wrong, we might end up with services which are not loosely coupled or autonomous, which require a lot of chatty communication, and which lack a clear separation of concerns. Try to avoid two-way communication between services; there are ways to avoid it. Always try to follow standard design principles and patterns, as this helps a lot in future when adding more features, doing maintenance, or dividing the system into smaller microservices. If the code base is not properly organised, we will have a hard time splitting it into multiple services.
Find domain entities : Once we finish step 1, we have a clear responsibility for each service, and it is the right time to think about domain entities. Now try to model the various entities for the given service. Your entities should be extensible enough for future changes. We should also understand whether each relationship is one-to-one, one-to-many, or many-to-many. This is one of the crucial parts which will help you choose the right data store. (There are various kinds of data stores available, eg: relational/sql and nosql (key-value store, document store, graph store etc). I will mostly talk in terms of sql vs nosql here.)
Choosing the right data store : Steps 2 and 3 might go together, as your entity design might change with the data store, eg: having 3 different tables vs a self-contained document. Try to understand how consumers interact with the domain entities, i.e. the read and write paths of your services. Also check how the interaction works in background jobs. If we understand the read and write patterns of our system, we are in a good position to choose the right data store. We don’t always have to choose a relational db. If your application needs related entities together most of the time, and the relationships are mostly one-to-many, a nosql document store can be a good fit, as it provides storage locality: an entity and its related data are stored together. There are many other factors which need to be considered while choosing the data store, eg: indexing strategies, storage engines, read and write query performance, sharding strategies, local vs global indexes (under sharding/partitioning), partition rebalancing, and replication strategies. I will be focusing on this part in the next series of posts, where we will take a business problem and try to design a system for it.
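To make the "3 different tables vs self-contained document" trade-off concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the customers/orders/order_items schema, the sample data, and the plain dict standing in for a document store. It only shows the shape of the two read paths, not how any particular database behaves.

```python
import json
import sqlite3

# Relational layout: a customer, their orders, and the order items
# live in three separate tables (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id));
    CREATE TABLE order_items (order_id INTEGER REFERENCES orders(id),
                              sku TEXT, qty INTEGER);
    INSERT INTO customers VALUES (1, 'Asha');
    INSERT INTO orders VALUES (10, 1);
    INSERT INTO order_items VALUES (10, 'pen', 2), (10, 'notebook', 1);
""")


def load_order_relational(order_id):
    """Reassemble one order by joining the three tables."""
    rows = conn.execute(
        """
        SELECT c.name, i.sku, i.qty
        FROM orders o
        JOIN customers c ON c.id = o.customer_id
        JOIN order_items i ON i.order_id = o.id
        WHERE o.id = ?
        ORDER BY i.sku
        """,
        (order_id,),
    ).fetchall()
    return {
        "order_id": order_id,
        "customer": rows[0][0],
        "items": [{"sku": sku, "qty": qty} for _, sku, qty in rows],
    }


# Document layout: the same order stored as one self-contained document,
# keyed by order id. A dict of JSON strings stands in for a document store.
document_store = {
    10: json.dumps({
        "order_id": 10,
        "customer": "Asha",
        "items": [{"sku": "notebook", "qty": 1}, {"sku": "pen", "qty": 2}],
    })
}


def load_order_document(order_id):
    """One key lookup returns the order together with its items."""
    return json.loads(document_store[order_id])
```

Both functions return the same order, but the relational version needs two joins while the document version is a single lookup. The flip side, in this sketch, is that the customer name is copied into every order document, so renaming a customer would mean touching many documents.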
LSM, B-Tree : We should also understand the underlying data structures used in various data stores, eg: LSM tree, B-Tree. An LSM tree is a log-structured merge tree, an append-only kind of data structure: it doesn’t mutate an existing stored entry, but adds a new one instead. It maintains a memtable and SSTables internally. The memtable can be understood as an in-memory balanced binary search tree which stores incoming data. After a certain threshold, it gets flushed to disk as an SSTable, which stores the entries sorted by key. There can be multiple SSTables on disk, which get merged by a configured compaction process. LSM tree based data stores allow you to write data faster, as writes are sequential appends. Reads are not as good as writes: a read might have to check the memtable, followed by multiple SSTables, which is not that efficient. There are ways to optimise this part, eg: Bloom filters, which let a read skip SSTables that cannot contain the key. In a data store based on a B-tree, reads generally perform better than writes, as lookups are efficient there.
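The memtable, SSTable, and compaction flow described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any real storage engine is implemented: the class name `TinyLSM` and the `memtable_limit` parameter are hypothetical, a dict that is sorted at flush time stands in for the balanced tree, and in-memory sorted lists stand in for SSTable files on disk.

```python
import bisect


class TinyLSM:
    """Toy LSM tree: an in-memory memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.memtable = {}   # stand-in for a balanced tree; sorted on flush
        self.sstables = []   # oldest first; each is a sorted list of (key, value)

    def put(self, key, value):
        # Writes only touch the memtable; existing SSTables are never modified.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Persist the memtable as a new sorted run (a list stands in for a file).
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Reads check the memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            # (key,) sorts just before (key, value), so this finds the entry.
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None

    def compact(self):
        # Merge all runs into one, keeping only the newest value per key.
        merged = {}
        for table in self.sstables:  # oldest first, so newer values overwrite
            merged.update(table)
        self.sstables = [sorted(merged.items())] if merged else []
```

A short usage example: after `put("a", 1)`, `put("b", 2)`, `put("a", 3)`, `put("c", 4)` with the default limit, there are two runs and the old value of `"a"` still sits in the first one. A `get("a")` has to consult the newest run first to return 3, which is why reads get slower as runs pile up, and `compact()` collapses them back into a single run.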
Most applications use a distributed data store nowadays. First, consider a data store which provides a leader (master) and multiple replicas, where you perform writes on the leader node. It replicates the incoming write log to the other nodes. Reads can be served from the leader or from replica nodes. It is better to split reads and writes in your system, but this comes with eventual consistency: you might not get the latest data when reading from a replica. This setup suits a system which doesn’t need to store a huge amount of data, i.e. the data fits on a single server. There are two types of replication at a high level: synchronous and asynchronous. Synchronous replication blocks the write operation until replication completes on the other nodes. It can have a considerable impact on write apis, as the write doesn’t return success until it gets acknowledgement from the other nodes. Asynchronous replication doesn’t wait for acknowledgement from replicas: the leader returns success as soon as it has applied the write itself. A middle ground, often called semi-synchronous replication, waits for acknowledgement from at least one replica.
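The difference between synchronous and asynchronous replication, and the stale reads that come with the latter, can be illustrated with a small in-process simulation. This is only a sketch: the `Leader` and `Replica` classes, the `synchronous` flag, and `replicate_pending` (a stand-in for the background replication stream) are hypothetical names, and there is no real network or failure handling involved.

```python
from collections import deque


class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class Leader:
    def __init__(self, replicas, synchronous):
        self.data = {}
        self.replicas = replicas
        self.synchronous = synchronous
        self.pending = deque()  # replication log entries not yet shipped

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # Do not report success until every replica has applied the write.
            for replica in self.replicas:
                replica.apply(key, value)
        else:
            # Report success immediately; replicas catch up later.
            self.pending.append((key, value))
        return "ok"

    def replicate_pending(self):
        # Stand-in for the background replication stream.
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica.apply(key, value)
```

With `synchronous=True`, a read from any replica right after `write` sees the new value, at the cost of a slower write. With `synchronous=False`, the write returns at once, but a replica read returns stale (here, missing) data until `replicate_pending()` has run, which is the eventual consistency mentioned above.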
Second, consider a data store where you need to store a huge amount of data. It is difficult to store all the data on a single node, and we need better read and write performance, so these data stores make use of partitioning. If we are using a data store which relies on partitioning, consider understanding the various partitioning strategies, eg: range partitioning and hash based partitioning, along with hot spots and skewed nodes, data consistency (this gets complicated here), and local vs global indexes.
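As a rough illustration of range partitioning, hash based partitioning, and how a hot spot can appear, here is a small Python sketch. The partition count, the range boundaries, and the `user_NNNN` key format are all made up for the example. Real data stores pick boundaries and hash functions differently, and md5 is used here only because it gives a stable hash across runs.

```python
import bisect
import hashlib
from collections import Counter

NUM_PARTITIONS = 4

# Range partitioning: partition i holds keys below RANGE_BOUNDS[i], and the
# last partition holds everything else. The boundaries are hypothetical.
RANGE_BOUNDS = ["g", "n", "t"]


def range_partition(key):
    """Pick the partition whose key range contains the key."""
    return bisect.bisect_right(RANGE_BOUNDS, key)


def hash_partition(key):
    """Pick a partition from a stable hash of the key, modulo the count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


# Keys that share a prefix are neighbours in sort order.
keys = [f"user_{n:04d}" for n in range(100)]

range_load = Counter(range_partition(k) for k in keys)
hash_load = Counter(hash_partition(k) for k in keys)
```

In this sketch every `user_...` key sorts after `"t"`, so `range_load` puts all 100 keys on the last partition (a hot spot, or skewed node), while `hash_load` spreads them across partitions. The price of hashing is that neighbouring keys no longer live together, so a range scan over `user_0000` to `user_0099` has to query every partition.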

Thanks for reading my article. I love working on complex business problems and designing large scale systems. I have tried summarising my learnings from designing systems here. In the next series of articles, I will take a business problem and try to design a system for it, choosing data stores and entities. Feel free to reach out to me on
Linkedin : https://www.linkedin.com/in/bhagwati-malav-684b6a5a/
Gmail : bhagwati20malav@gmail.com

References :

Microsoft cloud design patterns : https://docs.microsoft.com/en-us/azure/architecture/patterns/

Microservices : https://microservices.io/

Books :

Building Microservices, by Sam Newman
Domain-Driven Design, by Eric Evans
Implementing Domain-Driven Design, by Vaughn Vernon
Designing Data-Intensive Applications, by Martin Kleppmann (must-read book for system design)

Written by Bhagwati Malav