Understanding Distributed database/system using Cassandra
When I first came across the term distributed systems or database, the very reaction that came to my mind was that there are more than one computer where the data will be stored in order to make sure that even if one computer goes down, my data will be present and my application will keep running.
But before that there comes few questions.
- How does my one application connect to multiple computers at once?
- How does my application know that if one system is down another one could be used?
- How does the data will be synced across those computers or will be consistent?
- How does the data which I am gonna update will reach all the computers at once?
- Do I really need to copy all of my data in all the computers?
And so on….
If the answer to any one of these questions seems ambiguous to you, then bear with me for some time. My attempt in this article will be to clear all such questions for you.
This is what you get when you google what a distributed system?
A distributed system is a system with multiple machines that communicate and coordinate actions in order to appear as a single coherent system to the end-user.
There are multiple benefits of using such systems such as, Reliability, Horizontal Scaling, Performance and more.
Let’s see how all this works using cassandra. If you have ever used MongoDB or MySQL, cassandra is similar to those. It’s a database engine which is distributed by nature. It has its own Query language very much similar to SQL.
So, if you have a website say facebook.com, it needs a database with proper availability, consistency and performance. We can’t use a single mysql server and expect our feeds to load seamlessly in India when that mysql DB is sitting in the USA.
So, in such cases we will use something like cassandra where few nodes(computers) will be sitting in India and others in the USA or UK.
But how will this work?
Our application, let’s say nodejs server or spring boot server will connect to cassandra database.
The programmer will say to cassandra(configure) that I have 5 data centers and in each data centers, I have 10 nodes(computers).
All these configurations occurs at 2 places. Either at the seed node(main node) of the cassandra which keeps information about other nodes or in our server side code while creating the connection with the Database.
How to setup cassandra to use 5 data centers and in that multiple nodes is a whole new story of creating a multi cluster environment or we can use services provided by cloud services like AWS.
Suppose you go to facebook.com and upload your latest pic with a caption. Once you click on the upload button, the data will come to your server and it will send that data to cassandra.
Now, Cassandra will check the configurations and will observe that you want it to store this across all data centers and in at least 3 nodes. (because we need to replicate data to ensure availability).
Suppose there are 5 nodes in a data center, Imagine those nodes(computers) in a circle equally separated. Cassandra assigns a range of numbers to each node in a cluster. And once data comes, it calculates the hash of that data and gets a value and then it checks in which node range this value belongs. It then stores that data in the respective node and also copies that data in say 3(Replication factor which we set) more nodes in clockwise direction. This ensures that the system is Reliable (facebook.com will work even if one node goes down) and also we can scale our system horizontally (by adding more nodes).
How does Cassandra ensure consistency?
Again, we as a programmer configure in the database engine that we need a consistency level(any, quorum).
So whenever we wish to get the data from cassandra, it will fetch the same data from each of the replicated nodes and return the data according to the configuration, for instance latest data (timestamp) if at all there is any inconsistency.
If you are still confused about how reading and writing takes place in Cassandra, read on.
In Cassandra, we make tables but there is a slight difference than that of relational tables. Suppose we have an employee table like this.
In cassandra, while creating this table we specify 2 things:
1. Partition Key which is responsible for data distribution across your nodes.
2. Clustering Key which is responsible for data *SORTING* within the partition (on node)
Both these keys are part of table only i.e. one of its column.
So, if in this table we set “subject” as partition key and we have 2 nodes then,
Teacher-1 and Teacher-3 will be stored in node1 (since both belongs to “Database” partition key) and
Teacher-2 and Teacher-4 will be stored in node2.
And these data will be sorted by clustering key(here id).
Now, whenever a write operation will come, cassandra can easily decide in which node it has to store the data, it will see the value of its partition key, and then calculate the hash and hence will figure out the node information.
Also, while reading we need to provide these keys in order to let cassandra calculate the node information.
So, this was the case in which we understood how a distributed database works, but now-a-days every system is deployed in a distributed manner. You can read more on how distributed caching works and how servers are deployed in a distributed manner and serves requests or balances the load between those servers using load balancers.
I hope this answers all the questions which I mentioned above.
I am very new to this, your suggestions and doubts are welcome.