Database Papers: Anti-Entropy without Merkle Trees, Deletes without Tombstones

This week’s reading was a paper by the awesome HASLab research group. Paper addresses a problem of anti-entropy overhead, generated by Merkle Trees and introduces a framework for per-object causal consistency. The paper itself can be found here, another paper describing the clock implementations mentioned in paper in more details can be found here, the database implementation is published here and the clocks library can be found here.

This post is not meant as a replacement for reading the papers, but might help to quickly glance through the concepts before diving into it. Another reason to publish this is to open a discussion and learn more about the concepts discussed in the paper and related concepts.

Background Concepts

Anti-Entropy: in systems without strong coordination, consistency guarantees are somewhat loosened. In order to keep nodes in sync in presence of network partitions, anti-entropy protocols are used in order to “repair” the data and reconcile missing records.

Merkle-Trees are a binary hash-tree, holding hierarchies of hashes, often used for conflict resolution. Each inner node stores the hash of its children hashes; the leaf nodes store a list of key-hash pairs. By comparing hashes from top to the bottom, it is possible to locate the difference between trees and, subsequently, the information they represent.

Causal Consistency — a model that captures causal relationship between two operations and guarantees that all processes can observe these operations in common causal order. This is an important model in distributed systems, since even perfect physical clocks, which cannot be achieved in practice, will still fail to capture the causality between updates.

Distributed Deletes: given a delete request, completely removing an object from storage is normally not possible without losing causality information which may lead to that object resurfacing via delayed replication messages or synchronisation with outdated nodes. In order to solve this problem, tombstones serve as placeholders for a key, indicating that the record was removed.

Introduced Concepts

The paper introduces NDC (Node-wide Dot-based Clocks). Here, “dot” means a globally unique identifier, a pair of Node ID and a node-wide monotonically incrementing counter. Every node keeps the log of operations, occurred locally or replicated, per peer. During anti-entropy, a node compares it’s own local Node Clock (represented in a compact way as a Bitmapped Version Vector) with remote Node Clock.

Node Clock summaries the local storage history, causality metadata for each object version. It represents which update events this node has seen, directly (coordinated by itself) or transitively (received from others). Concretely, the node clock groups dots per peer node, factoring out the node id part from the dots. For each node, the node clock represents the set of counters in two parts: an base counter representing the contiguous sequence starting from 1 (as in Version Vectors), and a set of non-contiguous counters. It can be thought of as indicated below:

Node Logical Clock, implemented as a Bitmap Version Vector as depicted in original paper.

In order to map dots to the actual keys they represent, a special Dot-Key map is stored. So when a certain dots are missing during repair or anti-entropy process, they’re filled using this map.

In order to keep the clocks compact, node base versions are “stripped” and replicated versions are removed from causal context.

Conclusion

In the paper title, one of the biggest highlights is deletes without tombstones. One of the problems with tombstones is correctly setting their GC period, when they can be removed from storage: usually this period should be longer than the longest maintenance/network partition period, otherwise a non-propagated tombstone will get garbage collected and, when the node holding a record that should’ve been shadowed by the tombstone comes back, the record will resurface.

Explicit tombstones are indeed not required anymore on the storage level: as soon as object clock is stripped and no value remains, the entry can be safely removed from storage, so the Node Clock will serve in place of tombstone. Moreover, this approach seems to be more reliable (always correct, as paper describes it) than the tombstones as it relies on causal information rather than timestamp shadowing.

Second exciting statement is Anti-Entropy without Merkle Trees. The paper notes that Merkle Trees don’t work well with consistent hashing, as the keys are spread across nodes, which this destroys any locality patterns in the key space. It might be that “recency” is a property that might be worth considering: during the network partition, there’s a higher chance of temporally local records not reaching the node. Another downside of Merkle Tree is that it requires frequent updates of a hash tree and presents a trade-off between hash tree size and risk of false positives. Framework proposed in the paper allows for a much more compact repair mechanism, also presenting format optimisations such as bitmap.

Paper also discusses Causal Consistency vs Timestamp-Based Conflict Resolution, suggesting that the timestamp-based approaches might be preferred because of lower metadata overhead and simplicity, but methods suggested in the paper make the former option much more attractive.

If you like reading database papers, join Databass Telegram or Slack. You can also follow me on Twitter.