The Distributed Architecture Behind the Cassandra Database

Bruno Tinoco
10 min read · Jan 20, 2017

An introduction to the Apache Cassandra database architecture

When a client asked me a year ago to start researching NoSQL databases, I could not imagine how rich that experience would be in terms of technical breadth and depth. After some research, we found that Apache Cassandra fit the requirements we were looking for, and that is why I would like to share some technical information from my journey learning about Cassandra.

The NoSQL world and the birth of Cassandra

The database management software world changed some time ago, driven mainly by high-tech companies that handle huge amounts of distributed data over clusters of commodity servers and that must deal with the usual availability issues while serving a high volume of simultaneous users. For that reason, Facebook engineers decided to build a new solution for their users’ inbox search problem: a new distributed storage system combining the best features of two existing systems, Amazon’s Dynamo and Google’s Bigtable. The idea was to create a system for managing structured data that is designed to scale across many servers with no single point of failure, so that it survives common service outages and avoids negative impact on end users.

Cassandra was therefore designed to fall in the “AP” intersection of the CAP theorem, which states that a distributed system can guarantee at most two of the following three properties at the same time: Consistency, Availability and Partition tolerance. This makes Cassandra a good fit for a solution that needs a distributed database with high availability and that keeps working when some nodes in the cluster are offline or cannot reach each other, which is common in distributed systems.

Figure 1. [Databases according to the CAP diagram]

Basic data structure
Cassandra is classified as a column-based database, which means that its basic structure for storing data is a set of columns, each consisting of a column key and a column value.

Every row is identified by a unique key, called the partition key (or row key), which can be up to 64 KB long. A set of rows and their columns is called a column family, similar to a relational database table.

The following relational model analogy is often used to introduce Cassandra to newcomers:

Table comparing Relational vs Cassandra terms

This analogy helps make the transition from the relational to non-relational world. But don’t use this analogy while designing Cassandra column families. Instead, think of the Cassandra column family as a map of a map: an outer map keyed by a row key, and an inner map keyed by a column key. Both maps are sorted.

SortedMap<RowKey, SortedMap<ColumnKey, ColumnValue>>

Column key/value structure

Why?
A nested sorted map is a more accurate analogy than a relational table, and will help you make the right decisions about your Cassandra data model.

How?
A map gives efficient key lookup, and the sorted nature gives efficient scans. In Cassandra, we can use row keys and column keys to do efficient lookups and range scans. The number of column keys is unbounded. In other words, you can have wide rows. A key can itself hold a value. In other words, you can have a valueless column.
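
The nested sorted map can be sketched in a few lines of Python. This is only an illustration of the analogy, not how Cassandra stores data internally: the `SortedMap` class, the `write` helper and the sample keys are all made up for this example, and a sorted list of keys stands in for a real sorted map.

```python
from bisect import bisect_left, bisect_right, insort


class SortedMap:
    """A tiny sorted map: a dict for lookups plus a sorted key list for scans."""

    def __init__(self):
        self._data = {}
        self._keys = []  # kept sorted at all times

    def put(self, key, value=None):
        if key not in self._data:
            insort(self._keys, key)
        self._data[key] = value

    def get(self, key):
        return self._data[key]

    def scan(self, start, end):
        """Return (key, value) pairs with start <= key <= end, in key order."""
        lo = bisect_left(self._keys, start)
        hi = bisect_right(self._keys, end)
        return [(k, self._data[k]) for k in self._keys[lo:hi]]


# Outer map keyed by row key, inner map keyed by column key.
column_family = SortedMap()


def write(row_key, column_key, value=None):
    """Hypothetical helper: create the row on first use, then set one column."""
    try:
        row = column_family.get(row_key)
    except KeyError:
        row = SortedMap()
        column_family.put(row_key, row)
    row.put(column_key, value)


write("user:42", "email", "ana@example.com")
write("user:42", "name", "Ana")
write("user:42", "visited:2017-01-20")  # valueless column: the key is the data
write("user:42", "visited:2017-01-18")

row = column_family.get("user:42")            # efficient lookup by row key
visits = row.scan("visited:", "visited:~")    # range scan over column keys
```

Because the column keys are kept sorted, the scan returns the two visits in date order even though they were written out of order.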

Cassandra cluster topology
A Cassandra instance stores one or more tables, as defined by the user. The common topology for a Cassandra installation is a set of instances installed on different server nodes, forming a cluster also known as the Cassandra ring. Each node in the ring is responsible for storing a portion of the column families’ data, determined by the partition key and the configured replication factor.

Figure 2. [Cassandra ring with 3 nodes and key distribution]

Figure 2 shows a Cassandra ring with three nodes storing different keys. Cassandra applies a hash function to each partition key to decide the location of the data and its replicas. The replication factor is configured per keyspace, and a common value is 3, which means that every piece of data stored on node 1 is also replicated (copied) to nodes 2 and 3.

The colors in the ring represent the set of keys stored on each node according to the range in which the value returned by the hash function falls. In this example all keys from 1 to 38 are placed on node 1, so key 1 is stored on node 1 and copies of it are stored on nodes 2 and 3, given a replication factor of three.

If an application needs to access key 1 while node 1 is down, Cassandra will try to get a copy of it from node 2 or node 3.
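
The placement and fallback described above can be sketched as a toy ring. Everything here is an assumption made for illustration: the token space of 0 to 99, the token values 38, 70 and 99, the MD5-based `token_for` function and the node names. Real Cassandra uses a much larger token space and its own partitioner.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 100  # toy token space, for illustration only

# Each node owns the tokens up to and including its own token (made-up values).
ring = [(38, "node1"), (70, "node2"), (99, "node3")]
tokens = [token for token, _ in ring]


def token_for(partition_key: str) -> int:
    """Hash a partition key onto the toy token space."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def replicas_for(token: int, replication_factor: int) -> list:
    """Start at the node owning the token and walk the ring clockwise."""
    start = bisect_left(tokens, token) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


def read(token: int, replication_factor: int, down: set) -> str:
    """Return the first live replica for a token, falling back to copies."""
    for node in replicas_for(token, replication_factor):
        if node not in down:
            return node
    raise RuntimeError("no live replica for token %d" % token)
```

With a replication factor of 3 on three nodes, `replicas_for(1, 3)` lists all three nodes, and `read(1, 3, down={"node1"})` is served by another replica instead of failing.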

Another important aspect is that there is no master node in the cluster. Any node can act as the coordinator node: when a client connects to a Cassandra node, that node becomes the coordinator for the request and is responsible for reading data from, or writing data to, the nodes that own the keys.

It is also worth mentioning that Cassandra supports configuration specific to data center deployments, so you can specify which nodes are located in the same data center and even in which rack. This configuration helps increase availability and reduce read latency, because clients can read data from the nearest node.

For this last feature there is a specific replication strategy called NetworkTopologyStrategy, set in the keyspace definition.
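
As a rough illustration, a keyspace using this strategy is created with a CQL statement that lists a replication factor per data center. The small Python helper below only builds that statement as a string; the helper name, the keyspace name `shop` and the data center names `dc_east` and `dc_west` are hypothetical, and in a real cluster the data center names must match the ones your nodes are configured with.

```python
def create_keyspace_cql(keyspace: str, replicas_per_dc: dict) -> str:
    """Build a CREATE KEYSPACE statement that uses NetworkTopologyStrategy."""
    options = ["'class': 'NetworkTopologyStrategy'"]
    # One entry per data center: data center name -> number of replicas there.
    for dc_name, replication_factor in sorted(replicas_per_dc.items()):
        options.append("'%s': %d" % (dc_name, replication_factor))
    return "CREATE KEYSPACE %s WITH replication = {%s};" % (
        keyspace,
        ", ".join(options),
    )


# Hypothetical example: 3 replicas in one data center, 2 in the other.
statement = create_keyspace_cql("shop", {"dc_east": 3, "dc_west": 2})
print(statement)
```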

Figure 3. [Cassandra cluster configured with 8 nodes deployed in different datacenters]

Node communication
All nodes in Cassandra communicate with each other through a peer-to-peer protocol called the Gossip Protocol, which spreads information about the data and health of the nodes. One interesting thing about this protocol is that it does in fact gossip! I mean, one node does not need to talk to every other node to know something about them. To avoid communication chaos, when one node talks to another it provides not only information about its own status but also the latest information about the nodes it has communicated with before. Through this process network load is reduced, more information is retained, and information gathering becomes more efficient.

Figure 4. [Gossip peer-to-peer communication between Cassandra nodes]
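
The idea that a node learns about peers it never talked to can be shown with a small simulation. This is a simplified sketch, not Cassandra’s actual gossip implementation: the `GossipNode` class, the `(version, status)` tuples and the `run_round` helper are assumptions made for the example.

```python
import random


class GossipNode:
    def __init__(self, name):
        self.name = name
        # What this node believes about every node: name -> (version, status).
        self.view = {name: (1, "UP")}

    def heartbeat(self):
        """Bump this node's own version so peers can tell the news is fresh."""
        version, status = self.view[self.name]
        self.view[self.name] = (version + 1, status)

    def gossip_with(self, peer):
        """Exchange views; both sides keep the entry with the higher version."""
        for name in set(self.view) | set(peer.view):
            mine = self.view.get(name, (0, None))
            theirs = peer.view.get(name, (0, None))
            newest = max(mine, theirs, key=lambda entry: entry[0])
            self.view[name] = newest
            peer.view[name] = newest


def run_round(nodes, rng):
    """Every node bumps its heartbeat and gossips with one random other node."""
    for node in nodes:
        node.heartbeat()
        peer = rng.choice([n for n in nodes if n is not node])
        node.gossip_with(peer)


a, b, c = GossipNode("a"), GossipNode("b"), GossipNode("c")
a.gossip_with(b)   # b now knows about a
b.gossip_with(c)   # c learns about a through b, without ever talking to a
```

After these two exchanges `c.view` contains an entry for `a`, while `a` has not yet heard about `c`; a few calls to `run_round` with a `random.Random` instance spread the remaining information.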

The gossip protocol is also used for failure detection. It behaves much like the TCP handshake, waiting for an acknowledgement before considering a node up or down. For instance, when Node 1 sends a SYN message to Node 2, it expects to receive an ACK message back, and it then sends a second ACK message to Node 2, completing the three-way exchange. If Node 2 does not reply to the SYN message, it is marked as down. Even when a node is down, the other nodes keep pinging it periodically, and that is how failure detection happens.

This protocol is also important for providing the client driver with information about the cluster, so that the driver can choose the best available node to connect to, balance connections across nodes, and find the nearest and fastest path to the required data.

Tunable Consistency
Now that we have covered the overall architecture on which Cassandra is built, let’s go deeper into how data is written and read. We already mentioned that Cassandra is focused on high availability and partition tolerance, but that does not mean there is no data consistency. In fact, the consistency level in Cassandra is tunable by the user.

You can choose anything from a low to a high level of consistency. The most common consistency levels, available for both reads and writes, are ONE, QUORUM and ALL.

For example, with a replication factor of three (3) defined for a keyspace and a consistency level of ALL, a write operation waits until the data has been written and acknowledged by all 3 replica nodes before replying to the client. QUORUM means a majority of the replicas, that is floor(N/2) + 1, where N is the replication factor; with N = 3 a quorum is 2 nodes.
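
The arithmetic behind these levels is small enough to sketch. The function names below are made up for this example. The last helper encodes the commonly cited rule that a read is guaranteed to overlap with the latest acknowledged write when the read and write replica counts together exceed the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(N / 2) + 1."""
    return replication_factor // 2 + 1


def required_acks(consistency_level: str, replication_factor: int) -> int:
    """How many replicas must answer before the coordinator replies."""
    levels = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return levels[consistency_level]


def reads_see_latest_write(write_level: str, read_level: str, replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    w = required_acks(write_level, replication_factor)
    r = required_acks(read_level, replication_factor)
    return r + w > replication_factor
```

With a replication factor of 3, QUORUM writes combined with QUORUM reads satisfy the overlap rule (2 + 2 > 3), while ONE and ONE do not (1 + 1 is not greater than 3).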

From a cluster perspective, when a client connects to a node to write some data, that node acts as the coordinator: it first determines which nodes own the partition key of that data and then sends the data to them. Depending on the consistency level defined by the user, the coordinator waits for the required number of replicas to respond before replying to the client; with a consistency level of ALL it waits for all of them. So it is up to the user to define which consistency level is suitable for each part of the solution.

Figure 5. [Client communication path in a Cassandra cluster]

Write and Read Path
From a single-node perspective, when a client asks a Cassandra node to write data, the request is first persisted to a commit log file on disk and then the data is written to an in-memory table called the memtable. When the memtable is full, that is, when it reaches a preconfigured threshold, it is flushed to disk as an immutable structure called an SSTable. Each table in Cassandra has its own memtable and one or more SSTables.

Figure 6. [Cassandra Node write path]
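
The write path can be sketched as a toy model. This is a deliberately simplified illustration under stated assumptions: the `ToyNode` class is made up, Python lists and dicts stand in for files on disk, and the threshold counts keys rather than bytes.

```python
class ToyNode:
    def __init__(self, memtable_limit=3):
        self.commit_log = []   # stands in for the append-only file on disk
        self.memtable = {}     # in-memory and mutable
        self.sstables = []     # each flush appends one immutable, sorted table
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. durability first
        self.memtable[key] = value            # 2. then the in-memory table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. flush once the threshold is hit

    def flush(self):
        # A tuple of sorted pairs models the immutable, sorted SSTable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs replaying

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # newest SSTable first
            for stored_key, value in sstable:
                if stored_key == key:
                    return value
        return None


node = ToyNode(memtable_limit=2)
node.write("a", 1)
node.write("b", 2)   # reaches the limit and triggers a flush
node.write("a", 3)   # newer value lives in the memtable and shadows the SSTable
```

Note that the flushed SSTable is never modified: the updated value for `a` sits in the memtable and wins on read because the memtable is checked first.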

Internally, each Cassandra node moves data between memory and disk using mechanisms that keep disk access to a minimum. To do that it uses a set of in-memory caches and indexes that make it faster to find the data in the right location. One of them is the Bloom filter, an in-memory structure that checks whether an SSTable may contain the requested row before that SSTable is read from disk. Another is the partition index, which stores a list of partition keys and the start position of their rows in the data file written to disk. This process is represented in Figure 7.

Figure 7. [Internal node read path]
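
To show why a Bloom filter saves disk access, here is a minimal sketch. It is not Cassandra’s implementation: the `BloomFilter` and `ToySSTable` classes, the filter size, the number of hashes and the SHA-256 based hashing are all assumptions chosen to keep the example short. A Bloom filter can return false positives but never false negatives, so a negative answer lets the read skip the SSTable entirely.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit keeps the sketch simple

    def _positions(self, key: str):
        # Derive several bit positions from one key by salting the hash input.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(("%d:%s" % (seed, key)).encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))


class ToySSTable:
    def __init__(self, rows: dict):
        self.rows = dict(rows)   # stands in for the data file on disk
        self.bloom = BloomFilter()
        self.disk_reads = 0
        for key in self.rows:
            self.bloom.add(key)

    def get(self, key: str):
        if not self.bloom.might_contain(key):
            return None          # definitely absent: skip the disk entirely
        self.disk_reads += 1     # stands in for the index lookup and disk seek
        return self.rows.get(key)


sstable = ToySSTable({"user:1": "Ana", "user:2": "Bo"})
found = sstable.get("user:1")
missing = sstable.get("user:999")
```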

Data Modelling
My intent in this article is to focus on the architectural building blocks of Cassandra, but I would like to add a comment about how data modeling works in Cassandra. This is one of the hardest parts of working with Cassandra, mainly because it is a paradigm shift from the well-known relational database world.

The first thing you notice as a Cassandra newcomer is the lack of JOINs between tables. Then you start wondering how to model the entities of your business domain to reach the desired solution. The answer is that you do not model based on key entities and their relationships in order to normalize the data; instead, you model based on the queries your application needs to serve its user interface, creating a denormalized model. So if you need a new view of the same data, the recommended practice is to create a new table (column family) for it. This technique is called Query Based Modeling: first you think about the queries you need to execute, and then you model the tables around them. Yes, you will end up with duplicated data, but you are trading disk space for read performance, and disk space is comparatively cheap.
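
A small sketch can make the idea concrete. The two dictionaries below stand in for two tables that hold the same orders, each keyed for one query; the table names, the `record_order` helper and the sample data are hypothetical and only illustrate the duplication trade-off.

```python
from collections import defaultdict

# One "table" per query, each keyed by what that query filters on.
orders_by_customer = defaultdict(list)  # answers: which orders did a customer place?
orders_by_product = defaultdict(list)   # answers: which orders contain a product?


def record_order(order_id, customer, product, amount):
    """Write the same order twice, once per query, instead of joining later."""
    row = {"order_id": order_id, "customer": customer, "product": product, "amount": amount}
    orders_by_customer[customer].append(dict(row))  # duplicated on purpose
    orders_by_product[product].append(dict(row))


record_order(1, "ana", "keyboard", 50)
record_order(2, "ana", "mouse", 20)
record_order(3, "bo", "keyboard", 55)

# Each query is now a single key lookup, with no join needed.
anas_orders = [row["order_id"] for row in orders_by_customer["ana"]]
keyboard_buyers = [row["customer"] for row in orders_by_product["keyboard"]]
```

Every order is stored twice, which costs space, but each of the two queries reads exactly one key.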

Conclusion
One of the things that impresses me about Cassandra is the number of configuration options available to tune its behavior to fit your solution. As a distributed database, it is built to give you a high level of system availability with no single point of failure. It also gives you low latency for writes, and you can find detailed benchmarks against other NoSQL products on the internet.

Cassandra Node Data Read and Write process details

You may have heard of Apache Cassandra and found it interesting for your project. I recommend that you first evaluate your business requirements and verify whether your project really demands this type of database management system; otherwise you may face many implementation difficulties that could have been avoided by using a traditional relational database. Keep in mind that Cassandra was created to solve specific problems of availability and speed when writing and accessing large volumes of data. Some use cases have been tested and are well served by Cassandra, such as time series data storage, immutable event persistence and analytical databases. In the financial industry there are companies using Cassandra as part of a fraud detection system.

So my advice for those thinking about using Cassandra is to start by reading the learning resources available on community websites and from companies that support Cassandra development, such as DataStax, and then proceed with a proof of concept on a small cluster with a specific use case in mind. In parallel, research successful implementations that use Cassandra as distributed persistent storage. This will surely help you make clear and confident decisions to build a good solution.

References

Apache Cassandra website
Planet Cassandra Community
DataStax website

Bruno Tinoco

Software Architect, Consultant, Engineer and enthusiast of people working in the Information Technology