Tombstones in Apache Cassandra

Published in Walmart Global Tech Blog

8 min read, Jul 5, 2020

Apache Cassandra is a distributed database in which data is always distributed, and usually replicated, across a cluster of machines referred to as nodes. Deleting data in Cassandra does not work the way it does in a relational database. Instead of removing the data immediately, Cassandra records the delete operation as a marker on that data, called a tombstone. This makes propagating deletes across replicas a challenge, and tombstones that are not handled properly can become a performance bottleneck. It is therefore crucial to understand how Cassandra handles deletes.

The Need for Tombstones

Apache Cassandra is a distributed database that offers high availability and partition tolerance with eventual or tunable consistency. It uses log-structured merge-tree storage, which means that writes are always appended, while reads merge the multiple fragments of a row by picking the latest version of the data for each column. Any node in a Cassandra cluster can process a write operation, and every write is sent to all replica nodes for that data. Now consider a situation in which one of the replica nodes is down when a delete operation is performed: the node simply misses the delete request. If deletes removed data in place, then once the downed node came back online it would mistakenly conclude that the nodes where the delete was applied had missed a write, and repair would send the deleted data back to them. This would resurrect deleted data in the cluster. In-place deletes therefore do not work in a distributed system like Cassandra, which needs a more sophisticated mechanism for handling deletes: tombstones.
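The resurrection scenario can be illustrated with a toy model. The sketch below is not how Cassandra implements repair; the `repair` and `read` helpers, the three dictionaries standing in for replicas, and the integer timestamps are all assumptions made for illustration. It only shows why a delete marker with a newer timestamp survives reconciliation while an in-place delete does not.

```python
TOMBSTONE = None  # stand-in for a delete marker in this toy model

def repair(replicas):
    """Naive repair: for every key, copy the newest (timestamp, value) to all replicas."""
    keys = set()
    for replica in replicas:
        keys.update(replica)
    for key in keys:
        newest = max((r[key] for r in replicas if key in r), key=lambda v: v[0])
        for replica in replicas:
            replica[key] = newest

def read(replica, key):
    """Return the live value for key, or None if it is absent or tombstoned."""
    entry = replica.get(key)
    return None if entry is None else entry[1]

def fresh_cluster():
    # Three replicas that all hold the same row, written at timestamp 1.
    return [{"CA102": (1, "item102")} for _ in range(3)]

# Case 1: in-place delete on a and b while replica c is down.
a, b, c = fresh_cluster()
del a["CA102"]
del b["CA102"]
repair([a, b, c])
resurrected = read(a, "CA102")      # the deleted row is back on a

# Case 2: the delete is written as a tombstone with a newer timestamp.
a, b, c = fresh_cluster()
a["CA102"] = (2, TOMBSTONE)
b["CA102"] = (2, TOMBSTONE)
repair([a, b, c])
after_tombstone = read(c, "CA102")  # the tombstone wins on every replica

print(resurrected, after_tombstone)
```

In the first case the replica that missed the delete pushes its stale row back to the others; in the second, the tombstone is simply the newest version and is what gets propagated.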

A tombstone is a special marker that Cassandra writes as a placeholder for deleted data. This means that data is not immediately removed from your data store, so you will not see the size of the data store shrink right after a delete operation. If any node did not receive the delete request, the tombstone can be replayed to that node when it is back online. Each node keeps track of the age of all its tombstones. The configurable table option gc_grace_seconds sets how long a node retains a tombstone; once that time has passed, the tombstone becomes eligible for garbage collection, and a later compaction removes it and frees the disk space on the node. The default value for gc_grace_seconds is 864,000 seconds (10 days).
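The arithmetic behind gc_grace_seconds can be sketched as follows. The `purgeable_after` helper and the deletion time are hypothetical, chosen only to show that 864,000 seconds is 10 days and how the earliest purge time follows from the time of the delete. Actual removal still depends on when compaction next processes the SSTable holding the tombstone.

```python
from datetime import datetime, timedelta, timezone

GC_GRACE_SECONDS = 864_000  # default value quoted above

def purgeable_after(local_delete_time, gc_grace_seconds=GC_GRACE_SECONDS):
    """Earliest time at which a compaction may drop a tombstone (illustrative only)."""
    return local_delete_time + timedelta(seconds=gc_grace_seconds)

# Hypothetical deletion time, used only for the example.
deleted_at = datetime(2020, 7, 3, 23, 11, 58, tzinfo=timezone.utc)
eligible = purgeable_after(deleted_at)

print(timedelta(seconds=GC_GRACE_SECONDS).days)  # 10
print(eligible.isoformat())                      # 2020-07-13T23:11:58+00:00
```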

Tombstones and reads

Every piece of data written to Cassandra is stored with an associated timestamp, and this applies to tombstones as well. Cassandra scans tombstones while servicing a read request: the fragments of data from the memtable and the SSTables are merged together along with the tombstones, and the correct value is chosen with the Last-Write-Wins (LWW) algorithm. As a result, read performance degrades as the number of tombstones in your tables grows.
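A minimal sketch of the merge described above, under simplifying assumptions: each source (memtable or SSTable) is modelled as a dictionary from column name to a (timestamp, value) pair, timestamps never tie, and `TOMBSTONE`, `merge_cell`, and `read_row` are names invented for this example rather than Cassandra internals. It shows that a tombstone still has to be read and compared even though it contributes nothing to the result.

```python
TOMBSTONE = object()  # sentinel standing in for a delete marker

def merge_cell(fragments):
    """Last-Write-Wins: keep the fragment with the highest write timestamp."""
    _, value = max(fragments, key=lambda fragment: fragment[0])
    return None if value is TOMBSTONE else value

def read_row(sources, columns):
    """Merge one row from several sources; also count the tombstones scanned."""
    row = {}
    tombstones_scanned = 0
    for column in columns:
        fragments = [src[column] for src in sources if column in src]
        tombstones_scanned += sum(1 for _, v in fragments if v is TOMBSTONE)
        if fragments:
            value = merge_cell(fragments)
            if value is not None:
                row[column] = value
    return row, tombstones_scanned

memtable = {"product_code": (30, "p101-v2")}
sstable_1 = {"product_code": (10, "p101"), "replacements": (10, "item101-r")}
sstable_2 = {"replacements": (20, TOMBSTONE)}

row, scanned = read_row([memtable, sstable_1, sstable_2],
                        ["product_code", "replacements"])
print(row, scanned)
```

Here the newest product_code wins, the replacements column is hidden by the newer tombstone, and one tombstone was scanned to reach that answer.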

How to view tombstones on a table?

Tombstones are captured in SSTables as memtables are flushed to disk. An SSTable is a binary representation of your data on disk, not a human-readable one, so you need a utility called sstabledump to view its contents as JSON. The sstabledump utility can be found in the tools directory under your Cassandra home directory. In this section, let's look at how to view the tombstone information from an SSTable on disk.

Assuming a keyspace already exists, we’ll create a table to store the item price information based on stores.

CREATE TABLE pricing_svc.item_price (
 store_number text,
 item_id text,
 price float,
 replacements frozen<set<text>>,
 product_code text,
 PRIMARY KEY (store_number, item_id, price)
) WITH CLUSTERING ORDER BY (item_id ASC, price ASC);

Now, let's populate the table with a few records and execute a delete on the table:

USE pricing_svc;

INSERT INTO item_price (store_number, item_id, price, replacements, product_code) VALUES ('CA101', 'item101', 1.50, {'item101-r', 'item101-r2'}, 'p101');
INSERT INTO item_price (store_number, item_id, price, replacements, product_code) VALUES ('CA102', 'item102', 2.50, {'item101-r', 'item102-r'}, 'p102');
DELETE FROM item_price WHERE store_number = 'CA102';

Flush the memtables to disk as SSTables using nodetool flush. Once the SSTables are created, you can view them under the data directory.

The contents of the SSTable can be viewed in a JSON format using the sstabledump utility.

In this case, the following will be printed:

[
 {
  "partition" : {
     "key" : [ "CA102" ], "position" : 0, 
     "deletion_info" : { 
        "marked_deleted" : "2020-07-03T23:11:58.785298Z",
        "local_delete_time" : "2020-07-03T23:11:58Z" 
      }
  },
  "rows" : [ ]
 },
 {
  "partition" : {
     "key" : [ "CA101" ],
     "position" : 20
  },
  "rows" : [
    {
     "type" : "row",
     "position" : 95,
     "clustering" : [ "item101", 1.5 ],
     "liveness_info" : { "tstamp" : "2020-07-03T23:10:40.326673Z" },
     "cells" : [
       { "name" : "product_code", "value" : "p101" },
       { "name" : "replacements", "value" : ["item101-r", "item101-r2"] }
      ]
    }
   ]
 }
]

Similarly, the following would be the contents of an SSTable with a tombstone for a record that was inserted with a TTL that has since expired.

[
 {
  "partition" : {
     "key" : [ "CA103" ],
     "position" : 0
   },
  "rows" : [
     {
      "type" : "row",
      "position" : 74, 
      "clustering" : [ "item103", 3.0 ],
      "liveness_info" : { 
         "tstamp" : "2020-07-03T23:23:39.440426Z", "ttl" : 30,
         "expires_at" : "2020-07-03T23:24:09Z", "expired" : true 
        },
       "cells" : [
         { "name" : "product_code", "value" : "p103" },
         { "name" : "replacements", "value" : ["item101", "item101-r"] }
        ]
     }
   ]
 }
]
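The relationship between the tstamp, ttl, and expires_at fields in the output above can be checked with a little arithmetic. The sketch below is an illustration only: the `expires_at` and `is_expired` helpers are hypothetical, and dropping the fractional seconds is an assumption made so that the result has the same second-level precision as the expires_at value that sstabledump printed.

```python
from datetime import datetime, timedelta, timezone

def expires_at(tstamp, ttl_seconds):
    """Write time plus TTL, with sub-second precision dropped (an assumption)."""
    write_time = datetime.strptime(tstamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    write_time = write_time.replace(tzinfo=timezone.utc)
    return (write_time + timedelta(seconds=ttl_seconds)).replace(microsecond=0)

def is_expired(tstamp, ttl_seconds, now):
    """True once the current time has reached the expiry time."""
    return now >= expires_at(tstamp, ttl_seconds)

# Values taken from the liveness_info block shown above.
expiry = expires_at("2020-07-03T23:23:39.440426Z", 30)
print(expiry.isoformat())  # 2020-07-03T23:24:09+00:00
```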

Types of tombstones

Tombstones can be created in a number of ways, some of them implicit. Such tombstones stay hidden from the programmer until they surface as an issue affecting the cluster, so it is important to understand the types of tombstones that Cassandra can create:

Cell tombstones
Insert statements can create tombstones when a certain cell value is set as null in the query. This can happen when the database abstraction layer or an ORM framework abstracts the query with object-level representation and the null values get implicitly sent down in the actual query to Cassandra. For instance, consider the following CQL query:

INSERT INTO item_price (store_number, item_id, price, replacements, product_code) VALUES ('CA104', 'item104', 2.50, null, 'p104');

This would create a cell tombstone for the replacements column for the record with store_number CA104.
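Now consider the following delete query, which names a single column and identifies the row by its full primary key:

DELETE replacements FROM item_price WHERE store_number = 'CA104' AND item_id = 'item104' AND price = 2.50;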

This would also create a cell tombstone for the corresponding record.
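One way to keep implicit nulls out of an INSERT is to leave out the columns that have no value before the statement is built. The helper below is a hypothetical sketch, not part of any ORM or driver: `build_insert` is an invented name, the `%s` placeholder style is an assumption about the client library in use, and the table and column names are assumed to be trusted identifiers.

```python
def build_insert(table, values):
    """Build a parameterised INSERT that omits columns whose value is None."""
    present = {column: value for column, value in values.items() if value is not None}
    columns = ", ".join(present)
    placeholders = ", ".join("%s" for _ in present)
    statement = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    return statement, list(present.values())

statement, params = build_insert("item_price", {
    "store_number": "CA104",
    "item_id": "item104",
    "price": 2.50,
    "replacements": None,   # omitted from the statement, so no cell tombstone
    "product_code": "p104",
})
print(statement)
print(params)
```

Note that omitting a column is not equivalent to writing null when the row already exists: the omitted column keeps whatever value it had, so this approach suits cases where the null carried no intent to delete.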

Row tombstones
An entire row is marked as a tombstone as a result of a delete query that identifies a row. For example:

DELETE FROM item_price WHERE store_number = 'CA101' AND item_id = 'item101' AND price = 1.80;

sstabledump would show a deletion_info at the row level, alongside the row's clustering column values, within the partition:

[
 {
  "partition" : {
    "key" : [ "CA101" ],
    "position" : 0
  },
  "rows" : [
   {
    "type" : "row",
    "position" : 37,
    "clustering" : [ "item101", 1.8 ],
    "deletion_info" : {
       "marked_deleted" : "2020-07-05T07:26:52.233374Z",
       "local_delete_time" : "2020-07-05T07:26:52Z"
     },
    "cells" : [ ]
   }
  ]
 }
]

A large number of row tombstones can be an indication of a poor data model whereby your application is frequently deleting records from a table. In such cases, consider revisiting your data model and redesigning tables based on query patterns and the cardinality.

Range tombstones
A range tombstone is created when a delete targets an entire range of rows, using a WHERE clause with the partition key and a range on a clustering column. For instance:

DELETE FROM item_price WHERE store_number = 'CA101' AND item_id = 'item101' AND price > 2.0;

sstabledump would show:

[
 {
  "partition" : {
    "key" : [ "CA101" ],"position" : 0
   },
  "rows" : [
   {
    "type" : "range_tombstone_bound",
    "start" : { 
      "type" : "inclusive",
      "clustering" : [ "item101", "*" ],
      "deletion_info" : { "marked_deleted" : "2020-07-05T06:53:50.671654Z", "local_delete_time" : "2020-07-05T06:53:50Z" }
     }
   },
   {
    "type" : "range_tombstone_bound",
    "end" : { 
     "type" : "inclusive",
     "clustering" : [ "item101", "*" ],
     "deletion_info" : { "marked_deleted" : "2020-07-05T06:53:50.671654Z", "local_delete_time" : "2020-07-05T06:53:50Z" }
     }
   }
  ]
 }
]

sstabledump prints range tombstones with a type of range_tombstone_bound, with start and end entries giving the clustering key bounds of the range of rows that were tombstoned within the partition. Range tombstones are also created when an entire non-frozen collection is replaced with an INSERT or UPDATE query. It is therefore recommended to add or remove specific elements of a collection rather than replacing the entire collection.

Partition tombstones
Tombstones of this type are created when a delete query is fired using only the partition key in the WHERE clause. For example:

DELETE FROM item_price WHERE store_number = 'CA102';

As you have undoubtedly guessed, this deletes the entire partition CA102, and sstabledump shows a deletion_info attribute with a marked_deleted timestamp at the partition level.

[
 {
  "partition" : {
    "key" : [ "CA102" ],
    "position" : 0,
    "deletion_info" : { 
       "marked_deleted" : "2020-07-05T22:11:48.367057Z",
       "local_delete_time" : "2020-07-05T22:11:48Z" 
     }
   },
  "rows" : [ ]
 }
]

TTL tombstones
These are tombstones created automatically when the time-to-live expires for a particular row or cell. However, they are marked differently than normal tombstones.

The following insert statement would create a TTL tombstone after 20 seconds.

INSERT INTO item_price (store_number, item_id, price, replacements, product_code) VALUES ('CA103', 'item103', 3.0, {'item101-r', 'item101'}, 'p103') USING TTL 20;

The sstabledump would show:

[
 {
  "partition" : {
    "key" : [ "CA103" ],
    "position" : 78
  },
  "rows" : [
    {
    "type" : "row", 
    "position" : 155,
    "clustering" : [ "item103", 3.0 ],
    "liveness_info" : { "tstamp" : "2020-07-05T06:47:51.458099Z", "ttl" : 20, "expires_at" : "2020-07-05T06:48:11Z", "expired" : true },
    "cells" : [
     { "name" : "product_code", "value" : "p103" },
     { "name" : "replacements", "value" : ["item101", "item101-r"] }
    ]
   }
  ]
 }
]
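The JSON shapes shown in this section can be told apart programmatically. The sketch below is an assumption-laden illustration: `classify_tombstones` is a hypothetical helper, it relies only on the fields that appear in the outputs above (deletion_info, range_tombstone_bound, and liveness_info with expired), it does not attempt to detect cell tombstones, and the sample input is a hand-assembled, abridged combination of those outputs rather than the dump of a single real SSTable.

```python
import json

def classify_tombstones(dump):
    """Count tombstone markers by type in parsed sstabledump-style JSON."""
    counts = {"partition": 0, "row": 0, "range_bound": 0, "expired_ttl": 0}
    for entry in dump:
        if "deletion_info" in entry.get("partition", {}):
            counts["partition"] += 1
        for row in entry.get("rows", []):
            if row.get("type") == "range_tombstone_bound":
                counts["range_bound"] += 1
            elif "deletion_info" in row:
                counts["row"] += 1
            elif row.get("liveness_info", {}).get("expired"):
                counts["expired_ttl"] += 1
    return counts

# Abridged sample assembled by hand from the outputs shown in this section.
sample = json.loads("""
[
  {"partition": {"key": ["CA102"], "position": 0,
                 "deletion_info": {"marked_deleted": "2020-07-05T22:11:48.367057Z"}},
   "rows": []},
  {"partition": {"key": ["CA101"], "position": 0},
   "rows": [
     {"type": "row", "clustering": ["item101", 1.8],
      "deletion_info": {"marked_deleted": "2020-07-05T07:26:52.233374Z"}, "cells": []},
     {"type": "range_tombstone_bound",
      "start": {"type": "inclusive", "clustering": ["item101", "*"]}},
     {"type": "range_tombstone_bound",
      "end": {"type": "inclusive", "clustering": ["item101", "*"]}}
   ]},
  {"partition": {"key": ["CA103"], "position": 78},
   "rows": [
     {"type": "row", "clustering": ["item103", 3.0],
      "liveness_info": {"tstamp": "2020-07-05T06:47:51.458099Z", "ttl": 20,
                        "expires_at": "2020-07-05T06:48:11Z", "expired": true},
      "cells": []}
   ]}
]
""")

counts = classify_tombstones(sample)
print(counts)
```

A count like this could be run over the sstabledump output of a table to get a rough picture of which kind of tombstone dominates before revisiting the data model.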

Summary

Cassandra treats a delete query internally as an update operation that adds a marker, called a tombstone, to the data to be deleted.
Tombstones are retained for a configurable period (gc_grace_seconds) and are cleaned up during compaction after that period has passed.
The sstabledump utility can be used to view the contents of an SSTable file in a human-readable format.
Avoid writing null values to your tables, as these can create tombstones. Take care with queries that are generated by your ORM layer.
Prefer a single range tombstone over many individual cell or row tombstones where the data model allows, since it stores only the range boundaries and saves disk space.
Avoid replacing an entire non-frozen collection (set, list, or map) with INSERT and UPDATE queries, as this can create range tombstones.

Written by Ginoy George
