Cassandra: Distributed key-value store optimized for write-heavy workloads

Ameya
Coinmonks
6 min read · Aug 20, 2018



Introduction

Cassandra is a popular distributed key-value store, built initially at Facebook on commodity servers to let users search through their inbox messages. While TAO, which I covered here, was optimized for reads, Cassandra is optimized for write-heavy workloads while maintaining good read performance. Some of the key design objectives of Cassandra seem to be:

  1. While writes are on the order of billions and are best served by append-only logs, reads can be very geographically diverse. Hence cross-data-center replication is necessary and important for good read efficiency.
  2. Scaling the platform by adding commodity servers as more users join.
  3. Fault tolerance should be the norm.
  4. Giving clients of the database a simple interface and more control over schema, availability, and performance.

Google’s Bigtable serves a somewhat similar overall purpose. But the key difference is that Bigtable was built on top of GFS, which provides durability of data. In Cassandra, durability is built in via explicit replication mechanisms.

Data Model

Cassandra, like Bigtable, provides a fairly simple data model. Keys are small, tens of bytes in size, and in whatever structured format the client likes. Each key maps to a set of columns grouped into column families. The Cassandra data model can thus be viewed as a multi-dimensional key-value map.

In Cassandra, any row mutation is atomic: when updating a key and its corresponding columns/column families, either all of the data is written or none of it is. Columns can be sorted by time, e.g. so that the latest messages appear at the top.

The simple (and fairly self-explanatory) API for accessing Cassandra is:

insert(table, key, rowMutation)

get(table, key, columnName)

delete(table, key, columnName)
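To make the semantics of these three calls concrete, here is a toy in-memory sketch. This is purely illustrative: the table and key names are made up, and a real client talks to a Cassandra cluster over the network rather than to a local dict.

```python
# Toy in-memory model of the insert/get/delete API above.
# Purely illustrative: a real client talks to a Cassandra cluster.
class ToyStore:
    def __init__(self):
        # table -> key -> {column: value}
        self.tables = {}

    def insert(self, table, key, row_mutation):
        # A row mutation is atomic in Cassandra: all columns or none.
        row = self.tables.setdefault(table, {}).setdefault(key, {})
        row.update(row_mutation)

    def get(self, table, key, column_name):
        return self.tables.get(table, {}).get(key, {}).get(column_name)

    def delete(self, table, key, column_name):
        self.tables.get(table, {}).get(key, {}).pop(column_name, None)

store = ToyStore()
store.insert("inbox", "user42", {"msg1": "hello", "msg2": "world"})
assert store.get("inbox", "user42", "msg1") == "hello"
store.delete("inbox", "user42", "msg1")
assert store.get("inbox", "user42", "msg1") is None
```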

Request Flow for a key

In Cassandra, for both reads and writes, an end-user request can land on any node. That node knows which replicas “own” the keyspace for the requested key and routes the request to them.

In case of a write operation, replication is important for the durability of the data. Hence the node where the request initially landed waits to establish a quorum from the responses of the replicas.

For reads, this requirement can be relaxed depending on the needs of the application. A client app that is not too worried about strong consistency can respond as soon as the first replica answers, or it can wait to establish a quorum.
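A small sketch of the coordinator’s wait logic. The level names are assumptions for illustration (modern Cassandra exposes similar per-request consistency levels such as ONE and QUORUM); the point is the arithmetic: with quorum reads and quorum writes, the two sets of replicas always overlap.

```python
# How many replica responses the coordinator waits for, as a sketch.
# Level names are illustrative of per-request consistency choices.
def responses_needed(n_replicas, consistency):
    if consistency == "ONE":      # first replica reply is enough
        return 1
    if consistency == "QUORUM":   # majority of replicas
        return n_replicas // 2 + 1
    if consistency == "ALL":      # every replica must respond
        return n_replicas
    raise ValueError(consistency)

# With 3 replicas, a quorum write (2) plus a quorum read (2) always
# overlap in at least one replica, so the read sees the latest write.
assert responses_needed(3, "QUORUM") == 2
assert responses_needed(3, "ONE") == 1
```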

Architecture

Any large distributed storage system needs to worry about persistence or durability of data, fault tolerance, partitioning of the data, addition or removal of nodes, and scaling. Let’s talk through some of the important aspects. One key fact to keep in mind is that all nodes in Cassandra are aware of each other via gossip-based protocols.

Partitioning — Mapping keys to Nodes/Servers

Cassandra needs to be able to scale by adding more servers and also needs to adjust to failures of nodes without compromising performance. Hence Cassandra uses consistent hashing for mapping keys to servers/nodes. The advantage of consistent hashing is that if a node is added to or removed from the system, the existing key-server mappings are NOT all impacted, as they would be with traditional hashing. Only the neighboring nodes are affected, and keys are redistributed among those neighbors.

In this scheme, each node gets assigned some keyrange. Assume that the entire keyrange is mapped onto the circumference of a circle. When a key arrives, it gets mapped to some point on the circumference, and the algorithm walks clockwise until it finds the next node that is mapped onto the circle. That node is the coordinator for this key. If this node goes down for some reason, the next clockwise node on the circumference becomes the coordinator for the related keyrange. Since different nodes on the circle will carry different load, Cassandra allows lightly loaded nodes to move closer to heavily loaded ones.
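The clockwise walk above can be sketched in a few lines. This is a minimal ring, assuming an MD5-based hash and made-up node names (the paper does not pin down the exact hash function); note the defining property that adding a node only remaps keys onto the new node, leaving all other assignments untouched.

```python
import bisect
import hashlib

# Minimal consistent-hashing ring sketch. Hash choice and node names
# are assumptions for illustration.
def ring_hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node occupies one point on the circle, sorted by hash.
        self.points = sorted((ring_hash(n), n) for n in nodes)

    def coordinator(self, key):
        h = ring_hash(key)
        # Walk clockwise: first node at or after the key's position,
        # wrapping around to the start of the circle if necessary.
        i = bisect.bisect_left(self.points, (h, ""))
        return self.points[i % len(self.points)][1]

ring = Ring(["node-a", "node-b", "node-c"])
owner = ring.coordinator("user42")
assert owner in {"node-a", "node-b", "node-c"}

# Adding a node only affects keys between its position and its
# predecessor; every other key keeps its old coordinator.
bigger = Ring(["node-a", "node-b", "node-c", "node-d"])
assert bigger.coordinator("user42") in {owner, "node-d"}
```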

Replication — Ensuring durability of the data

The coordinator node for a given key, mentioned in the last section, is responsible for replicating the data onto other nodes. The coordinator stores the key itself and on n-1 other replicas. There are different schemes for selecting these replicas. For example, for availability and scalability, one may want a replica outside the data center where the coordinator resides, or at least outside its rack. This policy is dictated by the application.
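One simple replica-selection scheme is to take the coordinator plus the next n-1 distinct nodes walking clockwise around the ring. The sketch below shows that scheme only, with rack and data-center awareness deliberately omitted; the hash and node names are illustrative.

```python
import bisect
import hashlib

# Sketch of "coordinator plus next n-1 nodes clockwise" replica
# placement. Rack/data-center-aware policies are omitted for brevity.
def ring_hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def replicas_for(key, nodes, n):
    points = sorted((ring_hash(node), node) for node in nodes)
    i = bisect.bisect_left(points, (ring_hash(key), ""))
    # Walk clockwise from the coordinator, collecting n nodes.
    return [points[(i + j) % len(points)][1] for j in range(n)]

reps = replicas_for("user42", ["a", "b", "c", "d", "e"], 3)
assert len(set(reps)) == 3  # coordinator plus two more distinct replicas
```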

Cassandra elects a leader using ZooKeeper. The leader is responsible for placing nodes onto the circle, and Cassandra also informs nodes which key ranges they are replicas for. For better availability, it is generally preferred to store replicas across multiple data centers connected via high-speed links.

Cluster membership

Cluster membership is established using the Scuttlebutt protocol, which is CPU-efficient and doesn’t place a heavy burden on the communication channel.

For detecting faults and nodes that are down, Cassandra uses “accrual failure detection”. The basic idea is that instead of a binary up/down verdict, the detector emits a suspicion level (roughly, a probability that the node is down) against which a threshold is applied. An aggressive threshold reacts to short silences, while a conservative threshold waits longer before declaring a node down. To make this work, every node maintains a sliding window of inter-arrival times of messages exchanged via the gossip protocol. This kind of scheme adjusts very well to dynamic load conditions.
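A rough sketch of the idea: the suspicion value phi grows as the silence since the last heartbeat stretches beyond the mean inter-arrival time observed in the sliding window. The exponential model below is a simplification of the real detector, which fits a distribution to the window.

```python
import math

# Simplified phi-accrual sketch. Assuming exponentially distributed
# heartbeat inter-arrival times, P(no heartbeat yet) = exp(-t/mean),
# and phi is defined as -log10 of that probability, which reduces to:
def phi(time_since_last, intervals):
    mean = sum(intervals) / len(intervals)
    return time_since_last / (mean * math.log(10))

# Recent gossip inter-arrival times (seconds), kept in a sliding window.
window = [1.0, 1.1, 0.9, 1.0]

# The longer a node stays silent relative to its usual cadence,
# the higher the suspicion; the application picks the threshold.
assert phi(1.0, window) < phi(5.0, window)
```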

How does a new node join the cluster?

A new node gets assigned a random position on the circle. Any new node that comes up is instantiated with a few existing seed nodes that it can contact. Through these seeds, and via gossip, the new node learns about the other nodes and the other nodes learn about it. This is why any node can route a request for a key to its rightful owner.

Scaling the cluster

When new nodes are added to the cluster, it makes sense to introduce them near the most heavily loaded nodes. This leads to data redistribution from the loaded node to the newly added node. Apparently this is done via kernel file-copying mechanisms, presumably sendfile.

Read/Write performance improvements

Let’s look at some of the ways by which read and writes are optimized in Cassandra.

Write optimizations: Typically the best way to optimize disk writes is to do sequential writes to a log, as in a log-structured file system. So a write involves appending to a commit log on a dedicated disk and then updating an in-memory structure. Once certain thresholds are reached, the in-memory data structures can be dumped to disk and the corresponding indices generated.
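The write path above can be sketched as follows. This is a minimal model, not the real implementation: the lists stand in for an append-only commit-log file and immutable sorted on-disk dumps, and the flush threshold here counts entries rather than bytes.

```python
# Sketch of the write path: append to the commit log first (cheap
# sequential I/O), then update the in-memory table; dump the memtable
# to a sorted on-disk structure once it passes a size threshold.
class WritePath:
    def __init__(self, flush_threshold=3):
        self.commit_log = []   # stands in for an append-only log file
        self.memtable = {}     # in-memory structure updated on write
        self.sstables = []     # immutable sorted on-disk dumps
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            # Dump sorted data; indices would be generated here too.
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable.clear()

wp = WritePath()
for k in ("k1", "k2", "k3"):
    wp.write(k, "v")
assert len(wp.sstables) == 1  # threshold reached, memtable flushed
assert wp.memtable == {}
```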

Read optimizations: Reads are first served from memory. Disk lookups are further reduced using bloom filters, which can cheaply rule out data that is definitely not on disk. There is also some locality between a key and its columns, but within large column families certain columns might be far away on disk; this is optimized using a column index.
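To show why a bloom filter saves disk seeks, here is a minimal one. The sizing and hash construction are arbitrary choices for the sketch: a “no” answer is always definitive and skips the disk entirely, while a “maybe” still requires the actual lookup.

```python
import hashlib

# Minimal Bloom filter sketch: m bits, k hash probes per key.
# A definite "no" avoids a disk seek; a "maybe" still reads disk.
class Bloom:
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _hashes(self, key):
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, key):
        for h in self._hashes(key):
            self.bits |= 1 << h

    def might_contain(self, key):
        return all(self.bits >> h & 1 for h in self._hashes(key))

bf = Bloom()
bf.add("user42")
assert bf.might_contain("user42")  # never a false negative
# Most absent keys are rejected without touching disk; false
# positives are possible but rare at a sensible m/k sizing.
```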

Journaling

As we saw in the write-optimizations section, Cassandra uses a log to record operations for increased durability. In such systems, when a machine comes up it scans the log alongside the last saved state and rebuilds a more current state from the recorded entries. The log obviously needs to be purged every so often; in Cassandra this is done based on size. When a log reaches 128MB, a new log begins, and the previous log can be deleted once the data covered by that commit log has been dumped to disk.
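The size-based rollover can be sketched like this. The lists stand in for log files on disk, and the purge rule assumes the caller knows how many old segments have been fully flushed; only the 128MB limit comes from the text above.

```python
# Sketch of size-based commit-log rotation: roll to a new segment
# once the current one would exceed 128 MB, and drop old segments
# after their data has been flushed to disk.
SEGMENT_LIMIT = 128 * 1024 * 1024  # bytes

class CommitLog:
    def __init__(self):
        self.segments = [[]]   # each list stands in for one log file
        self.segment_size = 0

    def append(self, record_bytes):
        if self.segment_size + len(record_bytes) > SEGMENT_LIMIT:
            self.segments.append([])   # roll over to a fresh segment
            self.segment_size = 0
        self.segments[-1].append(record_bytes)
        self.segment_size += len(record_bytes)

    def purge_flushed(self, flushed_count):
        # Old segments are deletable once their data is safely on disk.
        self.segments = self.segments[flushed_count:]

log = CommitLog()
log.append(b"x" * (100 * 1024 * 1024))
log.append(b"x" * (100 * 1024 * 1024))  # would exceed 128 MB: new segment
assert len(log.segments) == 2
```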

Conclusions

I found this paper to be a fairly good read. Cassandra is one of the most widely used key-value stores, and learning more about it was useful. I also found that the system is put together quite well from various established systems techniques. Gossip protocols like Scuttlebutt also sound very interesting, and I may cover them in upcoming posts.

