NoSQL-Apache Cassandra Architecture

Published in

Analytics Vidhya

7 min read · Aug 22, 2020

We all started learning about databases with a formal definition, right?
“A database is a collection of data that is organized so that it can easily facilitate the storage, retrieval, modification, and deletion of data in conjunction with various data-processing operations.”

But why is knowing database architecture so crucial?

Database architecture concerns the design, maintenance, and management of the software that handles the data underlying a program. Good design decisions improve the efficiency of the database and its ability to scale up.

In this article we dig into the Cassandra architecture. Before that, let's cover the basics of NoSQL, its types, and the CAP theorem.

NoSQL

We often hear the term ‘non-relational database’ which is nothing but NoSQL. Some say the term “NoSQL” stands for “non SQL” while others say it stands for “not only SQL.”

NoSQL data models allow related data to be nested within a single data structure, so related data doesn’t have to be split between tables. They can easily process structured, semi-structured, and unstructured data.
Some NoSQL databases, such as Cassandra, use a masterless, peer-to-peer architecture in which all nodes play the same role, so the failure of any single node does not make the data unavailable. This makes it easy to scale as the data volume of an application grows. It also allows zero downtime, because the data is distributed with multiple copies across different nodes.
Instead of the ACID (Atomicity, Consistency, Isolation, and Durability) properties, NoSQL systems are said to have BASE (Basically Available, Soft state, and Eventually consistent) properties.

Types of NoSQL Databases

Database Types Image Source: https://www.improgrammer.net/most-popular-nosql-database/

Document databases store data in documents similar to JSON (JavaScript Object Notation) objects. Each document contains pairs of fields and values. MongoDB is ranked as the most popular document database by DB-Engines.
Key-value databases store items that each consist of a key and a value. A value can typically only be retrieved by referencing its key. Common use cases include storing user preferences or caching. Redis and DynamoDB are popular key-value databases.
Wide-column databases store data in tables, rows, and dynamic columns. Wide-column stores provide a lot of flexibility over relational databases because each row is not required to have the same columns. Commonly used for storing Internet of Things data and user profile data. Cassandra and HBase are two of the most popular wide-column stores.
Graph databases store data in nodes and edges. Commonly used when we need to traverse relationships to look for patterns such as social networks, fraud detection, and recommendation engines. Neo4j and JanusGraph are examples of graph databases.
Nodes → store information about people, places, and things
Edges → store information about the relationships between the nodes
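To make the four models above more concrete, here is a rough sketch of how similar user data might look in each of them, using plain Python structures. All field names and values are invented for illustration and are not taken from any real database.

```python
# Illustrative only: every name and value below is made up.

# Document model: one self-contained, JSON-like document per user.
document = {
    "_id": "user-1",
    "name": "Asha",
    "address": {"city": "Pune", "country": "IN"},  # nested, no second table
}

# Key-value model: a value that is looked up by its key.
key_value = {"user-1:theme": "dark"}

# Wide-column model: rows of the same table may carry different columns.
wide_column = {
    "user-1": {"name": "Asha", "city": "Pune"},
    "user-2": {"name": "Ben", "last_login": "2020-08-22"},
}

# Graph model: nodes hold the things, edges hold the relationships.
nodes = {"user-1": {"name": "Asha"}, "user-2": {"name": "Ben"}}
edges = [("user-1", "FOLLOWS", "user-2")]


def neighbours(node_id):
    """Return the ids reached from node_id by following outgoing edges."""
    return [dst for src, _label, dst in edges if src == node_id]


print(key_value["user-1:theme"])       # look up a value by its key
print(sorted(wide_column["user-2"]))   # this row has its own set of columns
print(neighbours("user-1"))            # traverse a relationship
```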

What is the CAP Theorem?

The CAP theorem, also known as Brewer’s theorem, was proposed by Eric Brewer. Distributed databases are commonly classified by the trade-offs this theorem describes.

The theorem states that “Though it’s desirable to have consistency, high availability and partition tolerance in every system, unfortunately, no system can achieve all three at the same time.”

• Consistency: All database clients will read the same value for the same query, even given concurrent updates.
• Availability: All database clients will always be able to read and write data.
• Partition Tolerance: The database can be split across multiple machines and can continue functioning when the network between them breaks.

Cassandra Architecture & Replication Factor Strategy

Apache Cassandra is an open-source, distributed, NoSQL database. It presents a partitioned wide column storage model with eventually consistent semantics.

Let's start with the basic components of Cassandra, which will make the rest easier to follow.

Node: A single Cassandra instance is called a node. This is where the data is stored.
Datacenter: A group of nodes is called a datacenter.
Cluster: A group of datacenters is called a cluster.
Cassandra can scale horizontally (adding more nodes) as well as vertically (adding more resources, such as CPU, memory, or disk, to existing nodes).

Database Structures

Cassandra stores data in tables, and each table is organized in rows and columns, as in other databases. A Cassandra table was formerly called a column family. A keyspace groups tables that serve a similar purpose. From a business perspective, for example, all transactional tables, all metadata tables, or all information tables could each be grouped under one keyspace.

Apache Cassandra’s structure is “built-for-scale” and can handle large amounts of data and concurrent users across a system. With no single point of failure, the system offers true continuous availability, avoiding downtime and data loss through data replication.

Data replication is defined per keyspace in terms of
→ replication factor per data center
→ the replication strategy

Replication Factor
The total number of replicas placed on different nodes is determined by the replication factor.
1 replication factor = only a single copy of data
3 replication factor = three copies of the data on three different nodes

The first replica is placed on the node that owns the token of the row's partition key. Cassandra places the remaining replicas on other nodes according to the replica placement strategy.

Replication Strategy
The replication strategy is set at the keyspace level. There are two strategies:
1. SimpleStrategy: Used for temporary and small clusters. Replicas are placed on consecutive nodes around the ring.
2. NetworkTopologyStrategy: Used when the cluster spans more than one data center. The number of replicas is set for each data center separately.
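
The two strategies can be sketched in a few lines of Python. This is a simplified model, not Cassandra's implementation: the ring, the node names, and the datacenter names are hypothetical, and the per-datacenter variant deliberately ignores rack awareness.

```python
from bisect import bisect_left

# Hypothetical ring: (token, node, datacenter), sorted by token.
RING = [
    (10, "A", "dc1"),
    (30, "B", "dc2"),
    (50, "C", "dc1"),
    (70, "D", "dc2"),
    (90, "E", "dc1"),
]
TOKENS = [token for token, _node, _dc in RING]


def walk_from(token):
    """Yield ring entries clockwise, starting at the owner of `token`."""
    start = bisect_left(TOKENS, token) % len(RING)
    for step in range(len(RING)):
        yield RING[(start + step) % len(RING)]


def simple_strategy(token, replication_factor):
    """Take the owner and the next nodes clockwise, ignoring datacenters."""
    nodes = [node for _t, node, _dc in walk_from(token)]
    return nodes[:replication_factor]


def per_datacenter(token, rf_by_dc):
    """Simplified NetworkTopologyStrategy: fill the quota of each
    datacenter while walking clockwise. Racks are left out on purpose."""
    placed = {dc: [] for dc in rf_by_dc}
    for _t, node, dc in walk_from(token):
        if dc in placed and len(placed[dc]) < rf_by_dc[dc]:
            placed[dc].append(node)
    return placed


print(simple_strategy(45, 3))                     # ['C', 'D', 'E']
print(per_datacenter(45, {"dc1": 2, "dc2": 1}))
```

With a replication factor of 3, the simple variant just takes three consecutive nodes, whereas the per-datacenter variant skips nodes until each datacenter has its own number of replicas.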

What is a Coordinator?

The coordinator is the node that the client chooses to receive a particular read or write request to the cluster.

Any node can coordinate any request
Each client request may be coordinated by a different node
The coordinator manages the Replication Factor: onto how many nodes should a write be copied?
The coordinator also applies the Consistency Level: how many nodes must acknowledge a read or write request?
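
As a small illustration of how the consistency level relates to the replication factor, the sketch below computes how many replicas must respond for a few common levels. The function names are hypothetical, and only a handful of levels are covered. A quorum is a majority of the replicas, and a read is guaranteed to overlap with the latest write when the read and write counts together exceed the replication factor.

```python
def required_acks(level, replication_factor):
    """Number of replicas that must respond for a given consistency level."""
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        needed = fixed[level]
    elif level == "QUORUM":
        needed = replication_factor // 2 + 1   # a majority of the replicas
    elif level == "ALL":
        needed = replication_factor
    else:
        raise ValueError(f"level not covered by this sketch: {level}")
    if needed > replication_factor:
        raise ValueError("level asks for more replicas than exist")
    return needed


def reads_see_latest_write(write_level, read_level, replication_factor):
    """True when read and write replica sets must overlap (R + W > RF)."""
    w = required_acks(write_level, replication_factor)
    r = required_acks(read_level, replication_factor)
    return r + w > replication_factor


print(required_acks("QUORUM", 3))                        # 2
print(reads_see_latest_write("QUORUM", "QUORUM", 3))     # True
print(reads_see_latest_write("ONE", "ONE", 3))           # False
```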

What is a partitioner and how does it work?

A range of tokens is defined for each node in the Cassandra cluster.

The partitioner is the factor responsible for deciding how the data can be spread across cluster nodes. Cassandra distributes data across the cluster using a Consistent Hashing algorithm, given the partition key of a row.

Let’s consider a simple example: suppose a request is issued to node B (that is, node B is the coordinator for this request) with a row containing the partition key “data science”. Suppose the partitioner applies the hash function to the partition key “data science” and gets the token 87. Node C's token range includes 87, so node C will be the one handling the request.
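
The lookup in this example can be sketched as follows. The three-node ring and the token space of 0 to 99 are assumptions chosen so that token 87 lands on node C, and `toy_token` is only a stand-in hash: the default partitioner in Cassandra uses Murmur3 over a much larger token space.

```python
import hashlib
from bisect import bisect_left

# Hypothetical ring over a toy token space 0..99. Each node owns the range
# that ends at its token: A owns 0..33, B owns 34..66, C owns 67..99.
RING = [(33, "A"), (66, "B"), (99, "C")]
TOKENS = [token for token, _node in RING]


def toy_token(partition_key):
    """Stand-in for the partitioner: hash the key into 0..99.
    This is SHA-256 modulo 100, not the Murmur3 hash Cassandra uses."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % 100


def owner_of(token):
    """Find the first node whose token is >= the given token (wrapping)."""
    index = bisect_left(TOKENS, token) % len(RING)
    return RING[index][1]


print(owner_of(87))                          # token from the example: C
print(owner_of(toy_token("data science")))   # owner under the toy hash
```

Whichever node coordinates the request, the same key always hashes to the same token, so every coordinator arrives at the same owner.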

Cassandra offers three partitioners:
Murmur3Partitioner (default), RandomPartitioner, ByteOrderedPartitioner

How are virtual nodes useful?

With virtual nodes, each node holds a large number of small token ranges rather than one large range. This makes it easier to rebalance tokens and helps avoid cluster hotspots, that is, some nodes storing far more data than others. Virtual nodes also make it easier to add and remove nodes in the cluster, because token assignment is managed automatically instead of new token ranges having to be calculated and assigned for each node by hand.
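
The effect of many small token ranges can be explored with a toy simulation. The sketch below assigns random tokens to each node and measures what fraction of the token space each node ends up owning. The token space size, the seed, and the function names are assumptions made for illustration, and real token allocation in Cassandra is more involved than random sampling.

```python
import random
from collections import Counter


def build_ring(nodes, tokens_per_node, seed=0, token_space=2**16):
    """Give every node `tokens_per_node` random tokens; return a sorted ring."""
    rng = random.Random(seed)
    tokens = rng.sample(range(token_space), len(nodes) * tokens_per_node)
    ring = [(token, nodes[i % len(nodes)]) for i, token in enumerate(tokens)]
    return sorted(ring)


def owned_share(ring, token_space=2**16):
    """Fraction of the token space owned by each node.
    A token owns the range from the previous token (exclusive) up to itself."""
    share = Counter()
    previous = ring[-1][0] - token_space   # wrap around from the last token
    for token, node in ring:
        share[node] += (token - previous) / token_space
        previous = token
    return dict(share)


# Compare one token per node with many tokens per node.
for tokens_per_node in (1, 64):
    ring = build_ring(["A", "B", "C"], tokens_per_node)
    print(tokens_per_node, owned_share(ring))
```

Running it with different seeds gives a feel for how the shares tend to even out as the number of tokens per node grows.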

Write Path

Commit Log: To ensure data protection and data integrity, Cassandra has a backup mechanism called the commit log. All data is written to the commit log so that it is not lost.
Mem-table: After data is written to the commit log, it is indexed and written to a mem-table. There is one active mem-table for every table.
SSTable: When a mem-table hits its limit, its data is flushed to disk and becomes an immutable SSTable. A flush is also triggered when the commit log is full, in which case the mem-table data is written to SSTables.
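
The three steps above can be modeled with a small in-memory class. This is only a toy: the class name, the `memtable_limit` threshold, and the use of Python lists in place of files on disk are all assumptions, and the flush triggered by a full commit log is left out.

```python
class ToyTable:
    """Toy model of the write path of one table; not Cassandra's real code."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []      # append-only record of every write
        self.memtable = {}        # in-memory structure for recent writes
        self.sstables = []        # immutable snapshots standing in for disk
        self.memtable_limit = memtable_limit   # hypothetical threshold

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. commit log first
        self.memtable[key] = value             # 2. then the mem-table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                       # 3. flush when the limit is hit

    def flush(self):
        # Freeze the mem-table into a sorted, read-only SSTable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()   # flushed data no longer needs replaying

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):   # newest SSTable first
            for k, v in sstable:
                if k == key:
                    return v
        return None


table = ToyTable(memtable_limit=2)
table.write("a", 1)
table.write("b", 2)     # reaches the limit and triggers a flush
table.write("a", 3)     # newer value lives in the fresh mem-table
print(table.sstables, table.memtable, table.read("a"))
```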

Read Path

There are three types of read requests that a coordinator sends to replicas:
Direct request
Digest request
Read repair request

The coordinator sends a direct request to one of the replicas.
Then, the coordinator sends digest requests to the number of replicas stated by the consistency level and checks whether the returned data is up to date.
The coordinator subsequently sends digest requests to all the remaining replicas. If any node returns an out-of-date value, a background read repair request updates that data. This process is called the read repair mechanism.
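
A minimal sketch of this flow is shown below, assuming each replica is a Python dict that maps a key to a (timestamp, value) pair and that every replica holds the key. The function names are hypothetical, and the sketch only repairs the replicas contacted for the consistency level; the background check of the remaining replicas is not modeled.

```python
import hashlib


def digest(entry):
    """Short fingerprint of a (timestamp, value) entry."""
    return hashlib.sha256(repr(entry).encode("utf-8")).hexdigest()


def coordinate_read(key, replicas, consistency_level):
    """Toy read. `replicas` is a list of dicts: key -> (timestamp, value).
    Assumes every replica holds the key."""
    contacted = replicas[:consistency_level]
    full = contacted[0][key]                           # direct request
    expected = digest(full)
    others = [digest(r[key]) for r in contacted[1:]]   # digest requests
    if all(d == expected for d in others):
        return full[1]                                 # replicas agree
    # Mismatch: compare full entries and keep the newest timestamp.
    newest = max((r[key] for r in contacted), key=lambda entry: entry[0])
    for r in contacted:
        if r[key] != newest:
            r[key] = newest                            # read repair
    return newest[1]


r1 = {"k": (1, "old")}
r2 = {"k": (2, "new")}
r3 = {"k": (1, "old")}
print(coordinate_read("k", [r1, r2, r3], 2))   # r1 is repaired, r3 is not
print(r1, r3)
```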

In this article, we have seen how the Cassandra architecture is designed to deliver scalability, reliability, and efficiency. This complex architecture requires careful configuration and tuning, so understanding its components is crucial to using Cassandra effectively. When you need to store and handle vast volumes of data across several servers, Cassandra may be a good option. It suits companies that cannot afford to lose data or to lose access to their database because of a single server outage.