A Glance at Apache Cassandra

Halil Ertan
12 min read · Apr 25, 2020


Apache Cassandra is a distributed, column-oriented NoSQL database that delivers very high performance on vast amounts of data when used correctly. It was initially designed at Facebook as a combination of Amazon’s Dynamo distributed storage and replication techniques with Google’s Bigtable data and storage engine model. In this post, I will walk through its internal working principles and some key points to take into account, in a simple manner.


Cassandra is a distributed DB, generally deployed as a cluster consisting of many nodes. Every node in Cassandra is independent; the structure is decentralized. Cassandra stores data in partitions in a distributed manner, and the partition key of a table determines which partition each row belongs to. Within a partition, rows are stored in the order of their clustering keys.

A key goal that you will see as you begin creating data models in Cassandra is to minimize the number of partitions that must be searched in order to satisfy a given query. Because the partition is a unit of storage that does not get divided across nodes, a query that searches a single partition will typically yield the best performance.

Cassandra can span multiple data centers, generally located in different regions, and each data center can in turn consist of multiple racks of nodes. For simplicity, in this post we can think of a Cassandra cluster composed of one data center and one rack.

Cassandra embraces a peer-to-peer approach, and all nodes in a Cassandra cluster are equal. The node that receives a request is called the coordinator node; it directs the read/write request to the relevant nodes. Each node runs a snitch, which knows the network topology (which data center and rack every node belongs to) and helps the coordinator route requests efficiently. The snitch is also used to spread replicas (I will mention replicas later) around the cluster to avoid correlated failures, for example by not placing two replicas on the same rack. Through the gossip protocol, every node periodically (once per second) exchanges several kinds of state information, such as node status, location in the topology, schema version, and load, with a few other nodes; each round also passes along information learned in previous rounds, so iteratively all nodes in the cluster converge on up-to-date information about each other.
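You can observe a slice of the information gossip propagates by querying the system tables from cqlsh. A minimal sketch; the exact column set of system.peers varies slightly between Cassandra versions:

```
-- Topology and schema information about the other nodes,
-- as learned through gossip.
SELECT peer, data_center, rack, release_version, schema_version
FROM system.peers;
```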

Differences from a relational DB

The difference between a relational DB and Cassandra is one of the first things to mention, since most misuse of Cassandra I have observed derives from this point: Cassandra has a query language, CQL, that looks very similar to SQL, which misleads newcomers. The differences are the following:

- No joins
- No foreign keys
- Denormalized structure, unlike a relational DB
- Query-first design
- Designing for optimal storage
- Sorting is a design decision

Cassandra is actually very far from a relational DB. It does not even support prominent relational features like joins and foreign key relations. Unlike relational DBs, it encourages denormalization of the data, meaning the same data is duplicated across different tables, one per query pattern. In a relational DB you are mostly not interested in where the data will be stored physically; in Cassandra it is a primary concern. The most critical point to keep in mind is to design your Cassandra tables according to the queries you will actually run. Otherwise, you may find you cannot run even very simple queries efficiently. Moreover, you cannot order your data by an arbitrary column in your table: you must first decide which columns you will query and sort on, and design the table accordingly.

CQL

CQL is the query language of Cassandra. It has a syntax very similar to SQL, but far more limited capabilities.

Figure 1 — Create Cassandra table and insert rows

In Figure 1, you can see how to create a table and insert new rows. As I said before, it looks very similar to SQL syntax at first glance. I created a very simple table that records the shopping activities of people in a particular month. I assume that I will mainly query by customer name in my scripts, therefore I set the primary key to “CUSTOMER_NAME” and “MARKET_NAME”: the former serves as the partition key and the latter as the clustering key. Together they form the primary key. Both partition keys and clustering keys can be made composite with the help of additional parentheses. Determining the primary key is crucial given the query-first design principle. If I were focused on what people are buying, I would instead make “ITEM_NAME” the partition key. And, totally unlike relational databases, if I cared about both customer name and item name at the same time, I would create and maintain two different tables containing exactly the same data, as sketched below.
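Since Figure 1 is an image, here is a sketch of what it shows. The shopping_by_customer_name table name and the key columns follow the prose; the “ITEM_NAME” and “PRICE” columns and the inserted values are illustrative assumptions:

```
-- Shopping activities, partitioned by customer name.
CREATE TABLE test.shopping_by_customer_name (
    "CUSTOMER_NAME" text,
    "MARKET_NAME"   text,
    "ITEM_NAME"     text,
    "PRICE"         decimal,
    PRIMARY KEY ("CUSTOMER_NAME", "MARKET_NAME")
);

INSERT INTO test.shopping_by_customer_name
    ("CUSTOMER_NAME", "MARKET_NAME", "ITEM_NAME", "PRICE")
VALUES ('Alice', 'FreshMart', 'apple', 3.50);

-- The same data, duplicated into a hypothetical companion table
-- keyed by item name, to serve item-centric queries.
CREATE TABLE test.shopping_by_item_name (
    "ITEM_NAME"     text,
    "CUSTOMER_NAME" text,
    "MARKET_NAME"   text,
    "PRICE"         decimal,
    PRIMARY KEY ("ITEM_NAME", "CUSTOMER_NAME", "MARKET_NAME")
);
```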

Cassandra stores the rows of a partition in the order given by the clustering key. In our example, each customer name represents one partition, and rows within a partition are stored ordered by market name. Here, ‘test’ is the keyspace name; a keyspace in Cassandra is equivalent to a database or schema in relational databases.

Figure 2 — Select, where, order by examples in Cassandra

Cassandra enforces row uniqueness on the columns of the primary key. When a row with an already-existing customer name and market name is inserted in our example, Cassandra does not insert a new record; instead, it updates the existing record with the new values (an upsert), as shown at the beginning of Figure 2.
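Since Figure 2 is also an image, here is a sketch of that upsert behavior, using the illustrative columns from the table above:

```
-- Two inserts sharing the same primary key ('Alice', 'FreshMart'):
INSERT INTO test.shopping_by_customer_name
    ("CUSTOMER_NAME", "MARKET_NAME", "ITEM_NAME", "PRICE")
VALUES ('Alice', 'FreshMart', 'apple', 3.50);

INSERT INTO test.shopping_by_customer_name
    ("CUSTOMER_NAME", "MARKET_NAME", "ITEM_NAME", "PRICE")
VALUES ('Alice', 'FreshMart', 'banana', 1.25);

-- Only one row comes back; the second insert overwrote the first.
SELECT * FROM test.shopping_by_customer_name
WHERE "CUSTOMER_NAME" = 'Alice' AND "MARKET_NAME" = 'FreshMart';
```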

Another important point: you can use a where clause on a clustering key column only as long as the partition key columns are also present in the where clause. Columns outside the primary key are not allowed for filtering at all.

If you insist on querying by a column that is not part of the primary key, you can append “ALLOW FILTERING” to the query. However, be aware that needing “ALLOW FILTERING” is a sign that the query does not fit the design of the table.
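The following sketch shows which where clauses Cassandra accepts against the illustrative table above:

```
-- OK: the partition key is restricted.
SELECT * FROM test.shopping_by_customer_name
WHERE "CUSTOMER_NAME" = 'Alice';

-- OK: clustering column together with the partition key.
SELECT * FROM test.shopping_by_customer_name
WHERE "CUSTOMER_NAME" = 'Alice' AND "MARKET_NAME" = 'FreshMart';

-- Rejected: "ITEM_NAME" is not part of the primary key.
SELECT * FROM test.shopping_by_customer_name
WHERE "ITEM_NAME" = 'apple';

-- Runs, but scans all partitions; a sign the table design
-- does not match the query.
SELECT * FROM test.shopping_by_customer_name
WHERE "ITEM_NAME" = 'apple' ALLOW FILTERING;
```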

Ordering is also a design decision and works completely differently from relational DB queries. If you want ordered results, not only do you have to restrict the partition key in the where clause, but you must also respect the declared order of the columns in the clustering key.
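For the illustrative table above, ordering works only within a partition and only along the clustering column:

```
-- Valid: partition restricted, ordering by the clustering column.
SELECT * FROM test.shopping_by_customer_name
WHERE "CUSTOMER_NAME" = 'Alice'
ORDER BY "MARKET_NAME" DESC;
```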

Replication Factor & Consistency Level

Understanding the replication factor and the consistency level is essential for getting insight into Cassandra’s architecture; they are closely related concepts. In Cassandra, data is replicated among different nodes in a cluster in order to preserve availability in case of node failure. There are two main replication strategies in Cassandra, the simple strategy and the network topology strategy, and the strategy is specified when a keyspace is created. While the network topology strategy is used for deployments spanning multiple data centers and racks, the simple strategy is meant for a single cluster of nodes with no topology awareness. The goal in both cases is to store replicas on different nodes, and in more comprehensive designs across different racks and data centers. For instance, if we set the replication factor to 3, the same data is stored on 3 different nodes, so Cassandra stays available if a node fails.
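As a sketch, the replication strategy and factor are declared at keyspace creation; the keyspace and data center names here are illustrative:

```
-- Simple strategy: 3 replicas, no rack or data-center awareness.
CREATE KEYSPACE test
WITH replication = {'class': 'SimpleStrategy',
                    'replication_factor': 3};

-- Network topology strategy: replica counts per data center.
CREATE KEYSPACE prod
WITH replication = {'class': 'NetworkTopologyStrategy',
                    'dc1': 3, 'dc2': 2};
```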

For read and write operations, Cassandra offers a number of consistency-level options, and the consistency level directly affects read and write performance. You can set different consistency levels for reads and writes. For instance, assume the replication factor is 3. If we select the consistency level ‘ONE’, a successful write response is returned to the client as soon as the data is written to at least one of the replica nodes; subsequent operations can continue while the data is still being written to the other 2 replicas in the background. From the read perspective, reading from just one of the 3 replicas is enough to return a successful response to the client. If the consistency level is ‘TWO’, the data must be written to at least 2 replicas, and read from at least 2 replicas, before a success response is returned.
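In cqlsh the consistency level is a session-level setting (drivers also let you set it per statement). A minimal sketch:

```
-- Show the current consistency level.
CONSISTENCY;

-- All subsequent reads and writes in this session use QUORUM.
CONSISTENCY QUORUM;
```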

Obviously, keeping the consistency level low increases Cassandra’s read and write performance, but not without compromise. Write operations are always sent to all replicas, regardless of consistency level; the consistency level simply controls how many acknowledgments the coordinator waits for before responding to the client. Cassandra commits to eventual consistency, so keep in mind that not all nodes necessarily hold up-to-date data at any given moment. Have a look at the following figure: after a write, what happens if a read for the same data arrives while the writes to the other two replicas have not completed yet? Cassandra cannot guarantee consistency in such a situation; in other words, it might return a stale result (the older data) to the client.

Figure 3 — A typical erroneous scenario with a low consistency level

What is the solution for these kinds of scenarios? The ‘QUORUM’ consistency level requires that write or read operations be acknowledged by a majority of the replica nodes before returning to the client. It is calculated as floor(RF / 2) + 1; for the scenario above with a replication factor of 3, that is 2. A write is then acknowledged by at least two different nodes, and a read consults at least two different nodes. Since the write set and the read set each cover two of the three replicas, they must overlap in at least one node, so every read is guaranteed to reach at least one node holding the up-to-date data. While reading from two different nodes, Cassandra also compares the timestamps of the returned data: if one node has older data, Cassandra returns the data with the most recent timestamp to the client and issues a read repair to the node holding the stale copy.

Consistent Hashing & Token Ring

Cassandra partitions data according to the partition key of the table and stores rows with the same partition key on the same node. To decide where each partition lives, a hashing strategy is used: the hash function always produces the same token for the same partition key, each node owns a range of tokens, and data is placed according to the token the hash function yields. However, two main problems arise when distributing data among nodes: first, distributing the data in a balanced way; second, minimizing data movement when a node is added to or removed from the cluster. To address these issues, Cassandra uses consistent hashing together with virtual nodes, and the architecture is visualized as a ring, as depicted in the following figure.
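The token() function exposes the hash value that places a row on the ring, so you can see the mechanism from cqlsh against the illustrative table above. With the default Murmur3Partitioner, tokens fall in the range -2^63 to 2^63 - 1:

```
-- The token computed from the partition key decides node placement.
SELECT token("CUSTOMER_NAME"), "CUSTOMER_NAME", "MARKET_NAME"
FROM test.shopping_by_customer_name;
```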

Figure 4 — Token Ring

If a node is removed, the data it held is taken over by the next node clockwise on the ring. Similarly, a newly added node takes over part of the data of its neighboring node. With consistent hashing, only about k/n keys need to be remapped on average, where k is the number of keys and n is the number of nodes.

Figure 5 — Adding and removing a node from Cassandra Cluster

To balance the load across nodes, Cassandra introduces virtual nodes. Instead of assigning one large contiguous token range to each physical node, many smaller token ranges are assigned to each node on the ring; these smaller ranges are called virtual nodes (vnodes). In practice, each node is simply given many tokens instead of one. This gives Cassandra great flexibility when distributing data: by default each node gets 256 virtual nodes, and with their help, adding and removing nodes and distributing load in proportion to node capacity can be managed easily.
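A node’s vnode tokens can be inspected through the system tables; with the default num_tokens of 256, the set below contains 256 entries:

```
-- The many small token ranges (vnodes) owned by the local node.
SELECT tokens FROM system.local;
```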

The token computed from the partition key determines the node that stores the first replica. The placement of the subsequent replicas is determined by the replication strategy: the simple strategy places each subsequent replica on the next node clockwise, while the network topology strategy is data-center and rack aware and tries to avoid storing replicas on the same rack.

Figure 6 — Virtual Nodes

Write Path

On a write request, data is written to the commit log on disk and to a memtable in memory simultaneously. The commit log stores data durably but without regard to ordering or data modeling, while the memtable stores it ordered according to the data model but non-durably. If the memtable’s contents are lost, the commit log can be replayed to rebuild it. Once the data is in both the memtable and the commit log, a success response is returned to the client. Later, when the memtable fills up, its contents are flushed to an SSTable on disk, where data is stored in a sorted, durable, and immutable format, and the corresponding commit log segments and memtable data are cleared. Because reading fewer SSTables means faster reads, multiple SSTables are periodically merged into one (compaction), which keeps storage efficient and up to date.

Read Path

A read operation may involve both memtables and SSTables. A Cassandra table’s data might be spread across a number of SSTables, so one memtable and several SSTables are consulted for the same read. If data with the same primary key is found in more than one place, the version with the latest timestamp is returned; typically that is the data in the memtable that has not been flushed to an SSTable yet.

Reading from SSTables is a bit more complex; Cassandra has various caching and indexing mechanisms to speed it up. It consults the Bloom filter, key cache, partition summary, and partition index, in that order. A Bloom filter indicates that data is either definitely not in a given SSTable or might be in it, which avoids unnecessary disk I/O. The key cache stores, for frequently accessed data, the byte offset at which the data begins. The partition summary and partition index serve a similar purpose: they record where each partition begins in the SSTable (the summary being a sampled, in-memory version of the index), so Cassandra can jump directly to the corresponding part of the file.

Notes

  • A secondary index can also be created in Cassandra on non-primary-key columns (see the sketch after this list). Note that you don’t need an index if you design your tables correctly in the first place.
  • Materialized views can be thought of as an alternative to secondary indexes, especially given the issues with indexes on high-cardinality columns. A materialized view is based on an existing table and extends the primary key of the base table by promoting an additional column into the partition key. It stays dynamically in sync with the base table it was created from.
  • Cassandra configuration file is located at conf/cassandra.yaml.
  • Timestamps can be entered in CQL as strings in the format ‘yyyy-mm-dd HH:MM:SS’.
  • CQL supports 3 kinds of collections: maps, sets, and lists. For instance, suppose you must keep all phone numbers of a person in the same record: a collection lets you store all of them in a single column.
  • Cassandra nodetool provides several kinds of commands in order to manage and give insight into the Cassandra cluster.
  • CQL supports uuid (Universally Unique Identifier) as a data type for generating unique values; it is mostly used as part of a table’s primary key.
  • Counters, a special data type that can only be incremented or decremented over time, are also supported in Cassandra, but in a limited way: they cannot be part of the primary key, cannot be indexed, and deletion of counter columns is problematic.
  • If you need to store a value in a column for a certain time temporarily, TTL can be used.
  • If you want to find out when a column was last written, writetime() can be used in a select query. Note that it works only on regular (non-primary-key) columns, for example select writetime("PRICE") from shopping_by_customer_name with the illustrative table above.
  • To switch keyspaces, use the ‘use’ command. Furthermore, ‘describe keyspace’ and ‘describe tables’ give information about keyspaces and tables.
  • Cassandra tombstones can cause performance and disk problems, so they should be kept to a minimum. The main sources of tombstones are deletions, inserting null values, inserting into collection columns, and TTL usage. A tombstone can be thought of as a marker stating that a row has been deleted, or will expire at a certain time in the case of TTL.
  • To browse Cassandra visually, you can use DBeaver.
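The following sketch bundles CQL examples for several of the notes above (secondary index, materialized view, collections, counters, TTL, and writetime). It reuses the illustrative shopping table; all other names are hypothetical:

```
-- Secondary index on a non-primary-key column.
CREATE INDEX item_idx ON test.shopping_by_customer_name ("ITEM_NAME");

-- Materialized view that re-partitions the base table by item name;
-- all base primary key columns must appear in the view's key.
CREATE MATERIALIZED VIEW test.shopping_mv_by_item AS
    SELECT * FROM test.shopping_by_customer_name
    WHERE "ITEM_NAME" IS NOT NULL
      AND "CUSTOMER_NAME" IS NOT NULL
      AND "MARKET_NAME" IS NOT NULL
    PRIMARY KEY ("ITEM_NAME", "CUSTOMER_NAME", "MARKET_NAME");

-- A set collection keeps all phone numbers in a single column.
CREATE TABLE test.person (
    id     uuid PRIMARY KEY,
    name   text,
    phones set<text>
);
INSERT INTO test.person (id, name, phones)
VALUES (uuid(), 'Alice', {'555-0101', '555-0102'});

-- Counter table: every non-key column must be a counter,
-- and counters are modified only through UPDATE.
CREATE TABLE test.page_views (
    page  text PRIMARY KEY,
    views counter
);
UPDATE test.page_views SET views = views + 1 WHERE page = '/home';

-- TTL: this row expires automatically after one day (86400 s).
INSERT INTO test.shopping_by_customer_name
    ("CUSTOMER_NAME", "MARKET_NAME", "ITEM_NAME", "PRICE")
VALUES ('Carol', 'FreshMart', 'milk', 1.10) USING TTL 86400;

-- writetime() works on regular (non-primary-key) columns.
SELECT writetime("PRICE") FROM test.shopping_by_customer_name;
```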

