The Tombstones of Cassandra

David Faizulaev
Published in ZOOZ Engineering
Apr 3, 2019
Photo by Aiden Marples on Unsplash

Not too long ago, the word ‘tombstones’ did not bring up any associations with the Cassandra database. If anything, it brought to mind the legendary professional wrestler ‘The Undertaker’ and his ‘Tombstone Piledriver’… but then I found out that there’s another type of tombstone: Cassandra tombstones.

Seemingly out of nowhere, services integrated with Cassandra started throwing various alerts. As these alerts were occurring in Production, and since Cassandra is an integral part of our system, we, at Zooz, immediately started investigating the matter. Not long after, we found out that tombstones were cluttering and slowing down our databases.

In order to tackle this issue, we dove into the problematic service to understand the cause of these alerts.

How does Cassandra write?

If there’s one thing to understand about Cassandra, it is that it is optimised for writes, really fast ones. Cassandra does not check whether data already exists before writing, as that would slow things down. Instead it just performs an upsert: if the record already exists, the new write carries a newer timestamp and supersedes the older version, which stays on disk until compaction cleans it up.

When we write data to Cassandra, it’s written to both the commit log (on disk) and the memtable (in memory). Once the memtable is full, its data is flushed to an SSTable on disk.

The memtable stores data sorted by partition key and clustering columns, while the commit log stores data sequentially: every new record is appended to its end. The commit log is used to recover data if a node crashes, while the memtable also serves reads of recently written data.
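The write path described above can be modeled in a few lines of Python. This is only a toy, in-memory sketch and not how Cassandra is implemented: the `ToyNode` class, its `flush_threshold` parameter, and the dict and list structures are all assumptions made for illustration.

```python
import time


class ToyNode:
    """Toy model of the write path: commit log + memtable + SSTables."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # append-only, in arrival order
        self.memtable = {}     # key -> (timestamp, value), held in memory
        self.sstables = []     # immutable snapshots of flushed memtables
        # Hypothetical stand-in for "the memtable is full".
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        ts = time.time_ns()
        # 1. Append to the commit log so the write can be replayed after a crash.
        self.commit_log.append((ts, key, value))
        # 2. Upsert into the memtable; nothing is read before writing.
        self.memtable[key] = (ts, value)
        # 3. Flush once the memtable is "full".
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # An SSTable is sorted and never modified after it is written.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        # The flushed writes no longer need to be replayed after a crash.
        self.commit_log.clear()


node = ToyNode(flush_threshold=2)
node.write("user:1", "alice")
node.write("user:1", "alice-v2")  # upsert: replaces the entry in the memtable
node.write("user:2", "bob")       # second distinct key triggers a flush
print(len(node.sstables), node.memtable)  # 1 {}
```

Note that the second write to `user:1` never checks for an existing value; it simply lands with a newer timestamp, which is what keeps writes cheap.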

How does Cassandra read?

Now let’s address how Cassandra reads data. Cassandra must combine results from the active memtable and (potentially) multiple SSTables. Cassandra processes data at several stages on the read path to discover where the data is stored.

Cassandra’s read path

If that diagram is a bit scary, do not fear, let’s simplify.

Cassandra READ works in stages:

  1. Check memtable
  2. Check row cache, if enabled
  3. Check Bloom filter
  4. Check partition key cache, if enabled
  5. Check compression offset map
  6. Locate the data on disk using the compression offset map
  7. Fetch the data from the SSTable on disk

What are tombstones?

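Well, tombstones are markers that Cassandra writes in place of data that has been deleted, has expired, or has been explicitly set to NULL. The data they cover is not removed right away, so the markers and the obsolete data sit on disk side by side for a while. As you can deduce, this can cause a few issues.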

First of all, tombstones take up space and can increase the amount of storage your data requires. Second, querying tables which contain a large number of tombstones causes performance issues.

As mentioned previously, Cassandra is built for fast writes, so it makes sense that Cassandra’s process for deleting data doesn’t actually delete any data. It just marks the data for deletion, and the marked data is physically removed later, during compaction, once the gc_grace_seconds period has passed.

Therefore, we can conclude that one of the primary suspects for performance issues (timeouts, latency, heap memory pressure) in the read path of Cassandra is the presence of unnecessary tombstones.

Consider a scenario in which you have a table with a single valid record but with thousands of tombstones. In this scenario, a simple READ can take much longer than expected.

At Zooz we encountered such a scenario. Here’s one of the alerts generated by our Cassandra:

You are reading this correctly: 15610 rows were scanned while just a single valid record exists.

Now multiply that by thousands or even hundreds of thousands of rows and subsequent tombstones, and you can easily see the numerous performance issues that may occur.
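To make the cost concrete, here is a minimal sketch of a read that has to merge several SSTables. It is a toy model under stated assumptions: `read_partition`, the `TOMBSTONE` sentinel, and the generated data are hypothetical, and the Bloom filters, caches, and other read-path stages are left out entirely.

```python
TOMBSTONE = object()  # sentinel standing in for a deletion marker


def read_partition(sstables):
    """Merge cells from several SSTables; the newest timestamp wins per key.

    Returns (live_rows, tombstones_scanned).
    """
    newest = {}
    tombstones_scanned = 0
    for sstable in sstables:
        for key, (ts, value) in sstable.items():
            if value is TOMBSTONE:
                # The marker still has to be read and kept in memory
                # until the merge is finished.
                tombstones_scanned += 1
            if key not in newest or ts > newest[key][0]:
                newest[key] = (ts, value)
    live = {k: v for k, (_, v) in newest.items() if v is not TOMBSTONE}
    return live, tombstones_scanned


# 15,610 rows written at timestamp 1, then all but one deleted at timestamp 2.
old = {f"row:{i}": (1, f"value-{i}") for i in range(15610)}
new = {f"row:{i}": (2, TOMBSTONE) for i in range(1, 15610)}

live, scanned = read_partition([old, new])
print(len(live), scanned)  # 1 15609
```

The query returns one row, but the work done is proportional to everything that had to be scanned and discarded along the way.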

How do we get tombstones?

Below are a few scenarios that will create tombstones in your data.

  1. DELETE: as SSTables are immutable, data cannot be removed in place. Instead, Cassandra writes a deletion marker for that record, a ‘tombstone’. So instead of removing data, we are actually adding data.
  2. INSERT with NULL: inserting NULL into a column when creating a new row creates a tombstone for that column.

This is a JSON representation of the inserted row in the database:

JSON representation of row in Cassandra

The ‘d’ flag represents deleted data, so even though we inserted data, because we inserted NULL under the ‘status’ column, Cassandra will treat that value as deleted.

  3. UPDATE: updating an existing column value to NULL creates a tombstone. Understandably, at times this cannot be avoided.

  4. Collections: inserting a whole new collection (such as a set, list, or map) instead of updating or appending to the existing one.

  5. TTL: setting a TTL (Time To Live) is an alternative to deleting data explicitly, but it results in the same tombstones. Once the TTL expires the data is treated as deleted, but it isn’t physically removed until its grace period has passed and compaction runs. (Note that in CQL a TTL of 0 means the data never expires.)

How do we track and avoid tombstones?

I know you have been waiting for this part… well, here at Zooz we like to take matters into our own hands and once the source of tombstones was discovered we (both developers and DBAs) implemented numerous steps in order to track and avoid such issues in the future.

So what can you do to escape these scary tombstones?

Here are a few solutions:

  1. Avoid inserting NULL values when creating new rows. Remember the example above? Here is how to do it right:

Leave ‘status’ out of the query: the column is simply left unset, so nothing is written for it and no tombstone is created.
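In application code, one way to apply this is to drop empty fields before the statement is built. The helper below is a minimal sketch, not part of any driver: `build_insert` and the `payments` table are hypothetical names, and the `?` bind markers assume the statement will be used as a prepared statement.

```python
def build_insert(table, row):
    """Build a CQL INSERT that leaves out columns whose value is None.

    Returns (statement, params). Table and column names are interpolated
    directly, so they must come from trusted code, never from user input.
    """
    present = {col: val for col, val in row.items() if val is not None}
    if not present:
        raise ValueError("nothing to insert")
    columns = ", ".join(present)
    markers = ", ".join("?" for _ in present)
    statement = f"INSERT INTO {table} ({columns}) VALUES ({markers})"
    return statement, list(present.values())


stmt, params = build_insert(
    "payments",  # hypothetical table name
    {"id": 42, "amount": 10.5, "status": None},
)
print(stmt)    # INSERT INTO payments (id, amount) VALUES (?, ?)
print(params)  # [42, 10.5]
```

Because ‘status’ never appears in the statement, no NULL is bound to it and no tombstone is written for it.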

  2. When working with collections, add and remove elements using the + and - operators rather than overwriting the whole collection.
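The difference between the two styles is easiest to see in the CQL itself. The sketch below only builds statement strings; the function names, the `users` table, and the `emails` and `id` columns are hypothetical, and `?` again assumes a prepared statement.

```python
def append_to_collection(table, column, key_column):
    """CQL that adds elements to the existing collection."""
    return f"UPDATE {table} SET {column} = {column} + ? WHERE {key_column} = ?"


def remove_from_collection(table, column, key_column):
    """CQL that removes elements from the existing collection."""
    return f"UPDATE {table} SET {column} = {column} - ? WHERE {key_column} = ?"


def overwrite_collection(table, column, key_column):
    """CQL that replaces the whole collection: the pattern to avoid."""
    return f"UPDATE {table} SET {column} = ? WHERE {key_column} = ?"


print(append_to_collection("users", "emails", "id"))
# UPDATE users SET emails = emails + ? WHERE id = ?
print(overwrite_collection("users", "emails", "id"))
# UPDATE users SET emails = ? WHERE id = ?
```

The overwrite form has to clear whatever was stored before writing the new value, which is where the tombstone comes from; the append form only adds the new elements.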

  3. Cassandra configuration: this relates to the way Cassandra is set up in your environment.

There are a few configuration changes that you can make in order to not only monitor but also to avoid tombstones.

a. Tuning table-level compaction options such as:

  • tombstone_threshold
  • tombstone_compaction_interval
  • unchecked_tombstone_compaction

b. Using the ‘TombstoneScannedHistogram’ metric, a histogram of tombstones scanned in queries on a specific table.

c. You can check the Cassandra log and see how many tombstone entries were read compared with live data entries.

d. Routine repairs must be run on clusters where deletions occur (they may occur even if you don’t explicitly delete anything, as was explained) to avoid, among other things, deleted data becoming live again.

e. Set up alerts and change the tombstone warning / failure thresholds: there are two tombstone threshold settings in Cassandra that are helpful for detecting a large number of tombstones affecting performance:

  • tombstone_warn_threshold (default: 1000): if the number of tombstones scanned by a query exceeds this number, Cassandra will log a warning (which will likely propagate to your monitoring system and send you an alert).
  • tombstone_failure_threshold (default: 100000): if the number of tombstones scanned by a query exceeds this number, Cassandra will abort the query. This is a mechanism to prevent one or more nodes from running out of memory and crashing.

These values should only be changed upwards if you are really confident about the memory use patterns in your cluster.
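The decision the two thresholds describe boils down to a pair of comparisons. The function below is a hedged sketch of that logic for use in, say, your own monitoring scripts; `classify_read` is a hypothetical name, and the numbers passed in are example values that should be replaced with whatever your cassandra.yaml actually sets.

```python
def classify_read(tombstones_scanned, warn_threshold, failure_threshold):
    """Mimic the warn / abort decision for a single query."""
    if tombstones_scanned > failure_threshold:
        return "abort"  # the query is rejected to protect the node
    if tombstones_scanned > warn_threshold:
        return "warn"   # the query succeeds, but a warning is logged
    return "ok"


# Example values only; read the real ones from your cassandra.yaml.
WARN = 1_000
FAIL = 100_000

for scanned in (500, 5_000, 250_000):
    print(scanned, classify_read(scanned, WARN, FAIL))
# 500 ok
# 5000 warn
# 250000 abort
```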

  4. Garbage collection: each table has a property called gc_grace_seconds.

It sets the number of seconds after which deleted data becomes eligible for garbage collection. Cassandra will not execute hints or batched mutations on a tombstoned record within its gc_grace_seconds period. The default value leaves a great deal of time for Cassandra to maximize consistency before the data is actually removed.

The default value of gc_grace_seconds is 864000 seconds (10 days). The expiration time for a tombstone is the time of its creation plus the value of gc_grace_seconds.
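As a quick worked example of that arithmetic, the sketch below computes when a tombstone becomes eligible for removal. `tombstone_expiration` is a hypothetical helper, and the creation date is made up for the example.

```python
from datetime import datetime, timedelta, timezone

GC_GRACE_SECONDS = 864000  # the default quoted above (10 days)


def tombstone_expiration(created_at, gc_grace_seconds=GC_GRACE_SECONDS):
    """Earliest moment at which the tombstone may be garbage-collected."""
    return created_at + timedelta(seconds=gc_grace_seconds)


created = datetime(2019, 4, 3, 12, 0, tzinfo=timezone.utc)
print(tombstone_expiration(created))  # 2019-04-13 12:00:00+00:00
```

Until that moment passes, the tombstone stays on disk and keeps being scanned by reads that touch it.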

In a single-node cluster, this property can safely be set to zero. This value can also be reduced for tables whose data will not be explicitly deleted — for example, tables containing only data with TTL, or tables with default_time_to_live.

Make sure you can run repairs more frequently than the smallest gc_grace_seconds value among all your tables.

Conclusion

Tombstones are among the most misunderstood features of Cassandra.

They can cause significant performance problems. Developers must pay attention when adding tables and writing queries, whilst DBAs must do the same in addition to proper database maintenance and configuration.

Tombstones are a Cassandra feature after all, and part of how Cassandra manages to work so fast. Knowing that they exist, and managing them wisely, will make your DBAs’ and developers’ lives much easier (Boo).
