Distributed Datastore

Jolly Srivastava
System Design Concepts
5 min read · Dec 25, 2020

Previously in this series, we have already discussed scaling concepts pretty well.

Let’s now learn about distributed datastores.

In the previous blogs, we have already seen that if our database is hosted on one single server, we end up with a single point of failure and issues like high latency and heavy load on that one server. A better option is the “master-slave” architecture, where all writes go to the master and reads happen from the slaves. The data from the master gets synced to the slaves asynchronously.

Master-slave architecture

Consider a situation where we have updated the value of variable ‘y’ to 10, i.e., y=10. The master will propagate this value to the slave DBs asynchronously. What will happen if I try reading the value of ‘y’ before the master has updated it in the slave DBs? We will get stale data. So consistency is one of the issues with the master-slave architecture.
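To see the problem concretely, here is a toy sketch of asynchronous replication. The in-memory dicts and the explicit sync() step are stand-ins for real replication machinery, not how an actual database does it:

```python
# Toy simulation of asynchronous master-slave replication.
master: dict[str, int] = {}
slave: dict[str, int] = {}
replication_queue: list[tuple[str, int]] = []

def write(key: str, value: int) -> None:
    # Writes always go to the master; the slave is synced later.
    master[key] = value
    replication_queue.append((key, value))

def read(key: str) -> int | None:
    # Reads always go to the slave.
    return slave.get(key)

def sync() -> None:
    # The asynchronous replication step; until it runs, slave reads are stale.
    while replication_queue:
        k, v = replication_queue.pop(0)
        slave[k] = v

write("y", 10)
print(read("y"))  # None -- the slave hasn't been synced yet
sync()
print(read("y"))  # 10
```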

The next strategy to scale the DB is sharding.

In this technique, we have multiple DBs, and we shard our data on a particular key. Figuring out the sharding key is a very important step. As shown in the above image, let’s say I have sharded my data on the basis of username: all the users with first names starting with ‘A–M’ are on db1 and ‘N–Z’ are on db2. Here we now have 2 instances (we can have any number of DB instances running), so we have scaled read and write capacity by 2 times. What happens if most of the names start with letters between A and M? db1 becomes a hotspot, and we end up scaling db1 alone, which is a tedious job. Another issue is that performing a join across shards requires a network call.
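A minimal sketch of this routing logic might look like the following; the shard names (db1, db2) and the A–M / N–Z boundaries just follow the example above:

```python
# Range-based sharding on the first letter of the username.
SHARDS = {
    "db1": ("A", "M"),
    "db2": ("N", "Z"),
}

def pick_shard(username: str) -> str:
    """Route a username to the shard that owns its first letter."""
    first = username[0].upper()
    for shard, (low, high) in SHARDS.items():
        if low <= first <= high:
            return shard
    raise ValueError(f"no shard owns usernames starting with {first!r}")

print(pick_shard("alice"))  # db1
print(pick_shard("nina"))   # db2
```

Notice that the hotspot problem is baked into the boundaries themselves: if usernames are skewed toward A–M, only changing the ranges (and moving data) rebalances the load.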

Time to bring in the CAP theorem.

In reality, we can’t have all three of these properties (Consistency, Availability, and Partition tolerance) at once; we can only pick two.

In practice, we can often give up consistency (C) and concentrate more on availability (A) and partition tolerance (P). For example, on Instagram, A and P are what matter: we want the system to be always available, and it’s okay if my data is not visible to my friend for a few milliseconds. Such a system is known as an eventually consistent system, i.e., after a few milliseconds our data is consistent across all the nodes.

How does a distributed datastore work?

Nodes

Responsibility: Each node is responsible for storing part of the data that we want to save in our database. Every node is treated equally (no master-slave concept applies here). Each node stores some part of the data plus a backup of some other node’s data. If any one node goes down, its data is still saved on other nodes, which ensures that even if a machine goes down our data is safe. Splitting the data across nodes like this is known as data partitioning, and keeping copies on other nodes is replication. Physically, the nodes can be located anywhere, e.g., three nodes in one continent and two in another.

Before discussing rebalancing, let us get introduced to yet another important concept: “consistent hashing”.

Just like a primary key in an RDBMS, every NoSQL DB has its own key, known as the partition key, which decides where the data goes depending on the hash value of this partition key.

For example, as depicted in the image below, hash values ranging from 1000–2000 sit on node 6.

Inserting the data with partition key value ‘1’

Let’s say we want to insert data with partition key ‘1’ and value ‘data’. The hashed value of the key is 1200. All the data with hashed values ranging from 1000–2000 gets stored on node 6 (n6), hence this data will get stored on node 6. This is the primary responsibility of a node: to store the data whose hash falls in its designated range.
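As a sketch, the lookup can be a simple search over the ring’s range boundaries. Only the 1000–2000 range for n6 comes from the example above; the other ranges and the ring order are made up for illustration:

```python
import bisect

# Upper bound of each node's hash range, in ring order.
# Only the 1000-2000 -> n6 range comes from the example; the rest is illustrative.
RING = [
    (1000, "n3"),
    (2000, "n6"),
    (3000, "n4"),
    (4000, "n5"),
    (5000, "n1"),
    (6000, "n2"),
]

def owner(hash_value: int) -> str:
    """Find the node whose hash range covers this value."""
    bounds = [upper for upper, _ in RING]
    idx = bisect.bisect_left(bounds, hash_value)
    # Values past the last bound wrap around to the first node.
    return RING[idx % len(RING)][1]

print(owner(1200))  # n6 -- partition key '1' hashed to 1200
```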

On the basis of the value of the replication factor, the data gets replicated on other nodes. If the replication factor is 3, we have to back up the data on two other nodes. One strategy is to keep the backups on the next available nodes on the ring, i.e., n4 and n5 will keep a copy of ‘data’.
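Continuing the sketch, picking replicas can be as simple as walking clockwise from the primary. The ring order below is chosen so that n4 and n5 follow n6, matching the example; real datastores use more elaborate placement strategies:

```python
# Nodes in (illustrative) ring order; n4 and n5 follow n6 as in the example.
NODES = ["n3", "n6", "n4", "n5", "n1", "n2"]

def replicas(primary: str, replication_factor: int = 3) -> list[str]:
    """The primary plus the next (replication_factor - 1) nodes clockwise."""
    start = NODES.index(primary)
    return [NODES[(start + i) % len(NODES)] for i in range(replication_factor)]

print(replicas("n6"))  # ['n6', 'n4', 'n5']
```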

Rebalancing:

How do we scale this cluster up? We can add new nodes to the cluster, and the other nodes will learn about them through a process known as “gossiping”. Every node in the cluster knows about every other node and its responsibility; this is how they stay up to date about each other. If one of the nodes goes down, gossip messages from that node stop arriving, and the other nodes learn that it is no longer present in the cluster. They automatically reshuffle the hash ranges and the data.
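Here is a heavily simplified sketch of gossip-based membership: each node records when it last heard from every peer, and peers that stay silent past a timeout are treated as down. The class, method names, and timeout are all invented for illustration:

```python
import time

FAILURE_TIMEOUT = 5.0  # seconds of silence before a peer is considered down

class Membership:
    """Track cluster peers via gossiped heartbeats (illustrative only)."""

    def __init__(self) -> None:
        self.last_heard: dict[str, float] = {}

    def on_gossip(self, peer: str) -> None:
        # Called whenever any gossip message arrives from a peer.
        self.last_heard[peer] = time.monotonic()

    def live_peers(self) -> list[str]:
        # Peers whose gossip stopped arriving drop out of the live set;
        # their hash ranges would then be reshuffled among the survivors.
        now = time.monotonic()
        return [p for p, t in self.last_heard.items() if now - t < FAILURE_TIMEOUT]
```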

I have covered only the basics of distributed datastores here. For more detail, one should really read about Cassandra. I will try to come up with a new blog covering Cassandra in detail.

Till then, stay tuned!!!
