Tuning Cassandra performance

Benoit Tellier
Linagora Engineering
May 17, 2017
Administering Cassandra can sometimes feel like a war story…

Recently, we encountered performance problems with our out-of-the-box Cassandra. We managed to tune it a bit, and we decided to write this article to share what we learned. We discovered a lot of room for improvement at the system administration level.

How does Cassandra perform reads and writes?

The way Cassandra manages its data is based on a simple observation: caching can easily improve read speed. So the storage engine is designed for fast writes, while reads are avoided or optimised with caches. Writes should be sequential, and random access should be avoided.

The way this is achieved in Cassandra is with SSTables. The core concept is that deletes are write operations (writing tombstones) and updates are writes too. A table is stored on disk in several immutable files called SSTables. New writes eventually lead to the creation of new SSTables, so writing is very cheap. However, reading a key might require scanning the SSTables.

To optimise writes even further, we can batch them: data is first kept in memory, in a Memtable. When a threshold on the Memtable size is reached, its content is flushed to a new SSTable. The Memtable merges several writes on the same key so that, when flushed, only a single entry is written to the SSTable.
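
For illustration, you can force such a flush manually with nodetool (using our James keyspace and one of its tables as an example):

nodetool flush apache_james message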

To avoid losing writes, Cassandra uses a commit log that is replayed in case of a crash. Note that the commit log is the price to pay for using a Memtable. Writing to it is efficient because only the update instructions are appended to the commit log: you don’t need to write the full row.

Of course, we cannot afford a full SSTable scan on each read. Several mechanisms are implemented by default in Cassandra to speed up reads.

The first one is Bloom filters. A Bloom filter is a probabilistic data structure that checks for the existence of a key. It allows Cassandra, almost all of the time, to avoid reading from disk for keys that do not exist.

Cassandra also has a Key Cache, storing the on-disk location of a given key. Thus we can save a lot of time when reading it.

Finally, Cassandra has a Row Cache, storing specific row values in memory. It manages to keep ‘hot’, frequently accessed values in it. It allows finer-grained caching than the file system cache, and can avoid disk reads entirely.

One last concept: compaction! SSTables are immutable. However, update and delete heavy workloads will generate many SSTables. Sometimes we need to remove entries and their associated tombstones. Sometimes we want to merge an entry and its updates into a single entry. And we also want the rows of a partition to be located in the same SSTable. So Cassandra will, on a regular basis, read and re-write the SSTables to make our reads cheaper and save disk space.

Note that, by default, all SSTables are stored compressed using the LZ4 algorithm. Compression is about reducing data size (less space, better I/O) while compaction is about how data gets reorganized.

Analyse your Cassandra server: useful commands

The first command you need to know is the one that shows the table level parameters discussed later. In CQL:

USE apache_james ;
DESCRIBE TABLE message;

There you can check:
— compression
— compaction
— Bloom filter settings
— Cache settings

To see node level metrics (global cache hit rates, load, etc.):

nodetool info

To see detailed statistics about your tables, including:
— Read/Write request counts
— Read/Write latencies
— Space on disk
— SSTable count
— Bloom filter statistics

nodetool cfstats

This command, combined with the definition of the schema, will guide the later, personalized analysis.

You can also monitor detailed percentile latencies, per table, using:

nodetool cfhistograms apache_james messageidtable

How can I manage this, as an administrator?

The first thing you can do is tune your tables!

Adding row cache

You can enable row caching. It can remove significant READ load from your disks. First, in cassandra.yaml, define the amount of memory you want to dedicate to row caching. Then you can activate it per table and set how much data to keep per partition key in the row cache.

Of course, the row cache is better suited to read intensive, size limited tables. For example, on James, this perfectly fits our denormalized metadata tables.

War story: you need to restart the node when enabling the row cache through the row_cache_size_in_mb cassandra.yaml configuration parameter. nodetool will not be enough.
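
For instance, this is the kind of line we are talking about in cassandra.yaml (the 256 MB value is only an illustration; size it according to your available memory):

row_cache_size_in_mb: 256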

Once this (annoying) configuration parameter is set, you can enable the row cache per table, using CQL:

use apache_james ;
ALTER TABLE modseq
WITH caching = {'keys': 'ALL', 'rows_per_partition': '10'} ;
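
You can then check the row cache hit rate reported by nodetool info and adjust rows_per_partition accordingly.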

Tune compaction

Reads on SSTables, and thus on disks, cannot always be avoided. In this case, we want to limit the amount of I/Os as much as possible.

Our problem is that writes to a given key might be spread across several SSTables. For instance, this is the case for tables with many updates, deletes, or with clustering keys.

Fortunately, we can trade compaction time and I/Os for read efficiency. The idea is to switch the compaction algorithm from Size Tiered Compaction Strategy to Leveled Compaction Strategy. Leveled Compaction Strategy will, by its structure, limit the number of SSTables a given key can belong to. Of course, read intensive tables with updates, deletes or clustering keys will benefit a lot from this change. Avoid it on immutable, non-clustered tables, as you will not get the collocation benefits but will pay the cost of more expensive compactions.

Date Tiered Compaction Strategy stores data written within a certain period of time in the same SSTable. It is very useful for time series tables relying on TTLs, where entries expire.
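
As a sketch, switching a hypothetical time series table to this strategy would look like the following (note that recent Cassandra versions provide Time Window Compaction Strategy as its successor):

use apache_james ;
ALTER TABLE my_time_series
WITH compaction = { 'class' : 'DateTieredCompactionStrategy' };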

Updating the compaction strategy of your tables can be done without downtime, at the cost of running compactions. Warning: this consumes I/O and memory, and can thus decrease performance while the compaction is running.

You need to modify the CQL table declaration to change the compaction strategy:

use apache_james ;
ALTER TABLE modseq
WITH compaction = { 'class' : 'LeveledCompactionStrategy' };

For the changes to take effect on existing data, you need to re-compact the SSTables. To force this, use nodetool:

nodetool compact keyspace table

Be careful not to omit the keyspace or table, if you do not want to trigger a global compaction…

To follow the progress of compactions on large tables, you can use:

nodetool compactionstats

The rule of thumb for estimating compaction time on our hardware (16 GB of RAM, HDDs) is approximately one hour per GB stored in the table.

Bloom filters

If you have a high false positive rate, you might consider increasing the memory dedicated to Bloom filters. Again, this parameter can be set per table. I will not detail it here, as it was not a problem for us.
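
For reference, the table property involved is the target false positive probability of the filter. A sketch of lowering it (which increases the memory used by the filter) on one of our tables, with a purely illustrative value:

use apache_james ;
ALTER TABLE modseq
WITH bloom_filter_fp_chance = 0.01 ;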

Compression

Compression leads to less data being read and less data being written. If you are I/O bound, this is a nice trade-off. It turns out the default behaviour is to compress SSTables by chunks using LZ4. Of course, this can be tuned at the table level.
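
As a sketch, here is how the chunk size could be changed on one of our tables (CQL syntax for Cassandra 3.x; older versions name the options sstable_compression and chunk_length_kb, and the 64 KB value is only illustrative):

use apache_james ;
ALTER TABLE modseq
WITH compression = { 'class' : 'LZ4Compressor', 'chunk_length_in_kb' : 64 };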

Commitlog

An obvious tip is to store your commit log on a different disk than the SSTables. That way, I/Os are spread across disks.
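
In cassandra.yaml, this translates into pointing the following directories at different physical disks (the paths below are only examples):

data_file_directories:
    - /disk1/cassandra/data
commitlog_directory: /disk2/cassandra/commitlog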

What these optimisations brought us

As a conclusion, Cassandra offers a data model that can be optimised in detail. It requires knowing well how Cassandra works. It also demands knowing your data model well. But significant improvements can be gained.

By applying these configuration changes at the table level, we achieved a 3x reduction in read latencies on some of our read intensive tables. We have a 62% row cache hit rate, and our Cassandra now seems to handle read load better. This tuning session has been both instructive and promising. We have even identified new room for improvement, for example on blob storage. Note that with a bad schema, good performance cannot be achieved.


Also, please note that, as Cassandra is a JVM application, single node performance is also impacted by your garbage collection settings. We decided not to cover this aspect of Cassandra configuration, as we did not have enough free memory to switch to the G1 garbage collector, and we would have ended up describing minor settings. You can read this blog post, which covers the topic pretty well.
