Understanding Merkle Trees

the quintessence of Git, Bitcoin, and DynamoDB

Kumar Swapnil
Geek Culture
Published in
5 min readFeb 10, 2021

--

A hash tree or Merkle tree is a tree in which every leaf node is labeled with the cryptographic hash of a data block, and every non-leaf node is labeled with the cryptographic hash of the labels of its child nodes. Because of this characteristic, Merkle Trees are used to verify that two or more parties have the same data without exchanging the entire data collection.

Here are some of the major benefits of Merkle trees:

  • They provide a way of proving the integrity and validity of data. Even a small tamper of data will result in a completely different hash.
  • Their proofs only require tiny amounts of information to be transmitted across networks. They separate the validation of data from the data itself.

Let’s see some of the interesting use cases of the Merkle tree:

Git

Before understanding how Git leverages the Merkle tree, let’s first see how Git stores our source code.

Git stores content in a manner similar to a UNIX filesystem, but a bit simplified. All the content is stored as tree and blob objects, with trees corresponding to UNIX directory entries and blobs corresponding more or less to inodes or file contents. A single tree object contains one or more entries, each of which is the SHA-1 hash of a blob or subtree with its associated mode, type, and filename. For example, the most recent tree in a project may look something like this:

The master^{tree} syntax specifies the tree object that is pointed to by the last commit on your master branch. Notice that the lib subdirectory isn’t a blob but a pointer to another tree:

Conceptually, the data that Git is storing looks something like this:

Let’s consider we have 8 files (including trees, blobs, and metadata). The Merkle tree would look something like below:

INITIAL COMMIT

Now, if we change the contents of the first file, the corresponding hash would change, and this change would go all the way to the root of the tree.

FILE 1 CHANGED

This helps Git to maintain two versions for a repository efficiently and this is how it stores the entire history spanning thousands of commits locally for millions of lines of code.

When you take a pull from remote or push your changes, git will check if the hash of the root are the same or not. If it’s different, it will check for the left and right child nodes and will repeat it until it finds exactly which leaf nodes changed and then only transfer that delta over the network.

Note: The metadata includes information such as a pointer to a parent commit as well as author information and a commit message as a commit object.

Bitcoin

The ingenious invention of Satoshi Nakamoto, Bitcoin, relies on the Merkle tree to keep growing at scale. In Bitcoin, a Block consists of the hash of the previous block (chaining of blocks), a Nonce, and the Transactions hashed in a Merkle tree (as shown in the diagram). This helps in two things:

  • First, periodically old blocks are compacted by pruning transactional branches off the tree. A block header with no transactions would be about 80 bytes. If we suppose blocks are generated every 10 minutes, 80 bytes * 6 * 24 * 365 = 4.2MB per year. With average RAM size increasing year by year, storage should not be a problem even if the block headers must be kept in memory.
  • Second, It makes it possible to verify payments without running a full network node (Simplified Payment Verification). SPV nodes don’t download the entire blockchain, they just download the block headers which enables them to verify if particular transactions are included in a block.

DynamoDB

Outside of the hype of cryptocurrencies, NoSQL databases like DynamoDB and Cassandra relies on Merkle trees to detect the inconsistencies between replicas faster and to minimize the amount of transferred data.

Dynamo uses Merkle trees for anti-entropy as follows: Each node maintains a separate Merkle tree for each key range (the set of keys covered by a virtual node) it hosts. This allows nodes to compare whether the keys within a key range are up-to-date. In this scheme, two nodes exchange the root of the Merkle tree corresponding to the key ranges that they host in common. Subsequently, using the tree traversal scheme described in the Git part of this post, the nodes determine if they have any differences and perform the appropriate synchronization action. The disadvantage with this scheme is that many key ranges change when a node joins or leaves the system thereby requiring the tree(s) to be recalculated. This issue is addressed, however, by the refined partitioning scheme described in Section 6.2 of this post.

Conclusion

As stated at the beginning of the post, Merkle trees separate the validation of data from the data itself and this opens up an array of optimizations and use cases. Apart from the above use cases, Cryptographic auditing and integrity checking often rely on Merkle trees. Certificate transparency is a security technology that relies on Merkle trees to check the validity of TLS/SSL certificates. These are only a handful of use cases and I believe there would be many interesting ideas and implementations out there, leveraging this awesome data structure known as Merkle trees.

References

--

--