Distributed Hash Tables in Decentralized Environments, Part I

Eric Wang
archoncloud
Published in
6 min readJul 31, 2019

Eric Wang is a Co-Founder @ Archon Cloud, a Decentralized Cloud Storage Service, where he leads research and other efforts.

TLDR: The ability to find who holds data and where the data is held is vital in peer-to-peer systems. This article is a technical introduction to one of the most popular methods for peer discovery, Distributed Hash Tables (DHTs). DHTs allow users to find data very efficiently with very small overhead, but they suffer from security issues that prevent them from being used in a decentralized system for files of high value. A more secure yet very efficient methods are needed for high value files.

Introduction

Peer-to-peer platforms like BitTorrent have made it very easy for users to disseminate content. But underneath the process of finding people and data in a p2p network is a complex science in itself. This process is called peer discovery, a very hot topic in academic research and blockchain. This article describes and explores Distributed Hash Tables (DHTs), one of the most popular peer discovery methods used in systems such as BitTorrent and IPFS. Ultimately, DHTs have numerous strong use cases, but they do not work well for many more important use cases. This article is divided into two parts. Part I of this article describes the development of DHTs as well as how they work and their strengths. Part II of this article discusses vulnerabilities and weaknesses of DHTs, which prevents DHTs from being adopted by more public networks.

Older BitTorrent versions serves as a good illustration of peer discovery in a network. To the end user, there is a seemingly simple procedure:

  1. Load a .torrent file in order to start each download.
  2. Download starts.

Sounds simple, but there are many things that need to happen in the background.

[Image taken from https://www.flickr.com/photos/activesteve/36859460762]

Participants need to discover who else is participating within the system and what kind of data these other users have. Then, they need to download the correct data from the relevant users. The .torrent file was Bittorrent’s way of getting that correct data; it tells us information about where trackers, who hold information about seeders, are located. However, since trackers usage was cumbersome (the user needs to download a .torrent file and load it each time) and centralized (trackers are centralized entities), people developed easier and less centralized ways to discover peers.

A way to discover peers and files

Enter Distributed Hash Tables. Distributed Hash Tables are a powerful and lightweight method to discover peers. With minimal hardware requirements to load and a relevant file hash, users can quickly find the file they are looking for. BitTorrent realized the benefits to DHTs, adopting the tech with its “Mainline DHT” implementation, which rendered .torrent files mostly obsolete. Dozens of types of DHTs have since popped up, each with different advantages and disadvantages. With the rising popularity of blockchains, people began to experiment with ways to adopt DHTs for use in conjunction with immutable ledgers. For many of these blockchains, DHTs are a crucial component to ensure the network functions correctly. To understand DHTs’ pros and cons, why they work well for BitTorrent, and why they are not suitable for use in a high-value decentralized network, we must understand exactly how DHTs work.

What exactly are Hash Tables, and how do they work?

In order to understand Distributed Hash Tables, we must first examine what hash tables are. Hash tables are used in many centralized systems to store information such that the user’s computer can quickly reference them, saving time and computational resources. They are tables that store a hash as a key, and additional information corresponding to that hash as a value; it is a type of key-value database. A hash is a fixed-size representation of a piece of data.

For example, one can imagine that ISBN-10 numbers are hash values of a book the user might want to buy. By searching up the ISBN-10 value on a website, we can quickly find the book it corresponds to. Opening up the book will allow the user to learn the contents corresponding to the ISBN-10. Knowing the hash will allow the user to efficiently find and retrieve any arbitrary amount of data. The type of hashes we deal with in DHTs are cryptographic hashes. Cryptographic hashes are hashes where one input almost always generates a unique output, making them cryptographically secure. This means that the user cannot give the user incorrect data, as if they check it against the correct hash, the incorrect data will not hash into the correct hashed output.

What exactly are Distributed Hash Tables, and how do they work?

Distributed Hash Tables, then, are distributed key-value stores. In DHTs, no node individually has complete information. Each node stores a different subset of key-value pairs and can contact other nodes to obtain key-value pairs they might not have. Since most DHTs work in a similar manner, we will look at the most popular DHT available right now — Kademlia.

Variations of Kademlia have been utilized in networks such as Storj and IPFS, as well as earlier P2P networks such as BitTorrent. Kademlia Benefits include:

  1. the storage requirements are minimal. That means that DHTs are quite efficient in terms of storage requirements, and normal consumers can use them.
  2. looking up any node within the network only requires O(log n) steps, which means that if there are 10,000 nodes in the network, it should only take at most log2 (10,000) steps to locate a node the user is looking for. As a result, DHTs can scale to handle millions of participants or more and discover files quickly.

In order to understand Kademlia, there are two concepts we should first introduce: Node ID and XOR distance. Once we understand these, we can see how DHTs allow us to find users and information with O(log n) speed.

  1. Every valid user in Kademlia has a Node ID. This is essentially an identification badge that tells others who the user is. Node IDs are defined in 0s and 1s (bitspace). So a 160-bit node ID would look like either 160 0s and 1s or an amalgamation of 20 letters and numbers (since there are 8 bits per byte). A DHT designed for a blockchain system can have the user’s 160-bit hashed account address serve as the Node ID, for example.
  2. To find a way to order users within the system, Kademlia introduces the “XOR distance” metric. The XOR distance is the following: The XOR operation applied to the two node IDs represented by 0s and 1s tells the user how different their bits are, and a quantification of this difference is the XOR distance between those two IDs. A review of XOR arithmetic can be found in [1].

In Kademlia, there are 160 bits for every node ID, so the difference between the user’s node and any other node can be between 0 and 160 bits. They thus create various groups, based on the node ID distance between the user and others, called k-buckets. The “k” in a k-bucket designates how many nodes there are in each group, and because of this limitation, the nodes in each k-bucket often change, allowing only active nodes to be on the list.

Example: Decentralized P2P Storage System

Let’s look at an example of a decentralized P2P storage system. To learn more about decentralized storage, please view my article in [2]. In this example, files can be indexed by a certain subset of nodes, once again using the XOR distance metric. Files are first hashed to create a hash that is the same length as the node IDs, and the closest nodes to these files (based on the XOR distance of the hash to the node ID) will then index them. Tying these things all together, given the hash of a file, the user:

  1. Pings ~log(n) nodes until the user finds one of the nodes within the k-bucket that stores and indexes that file.
  2. Contacts that node individually with a secure connection and retrieve the file.

Alternatively, the user can use the DHT such that the closest nodes to a file hash will serve an IP list of all nodes who store that file, as opposed to the actual file itself; updates to and queries for the IP list can be made by contacting these closest nodes.

I hope part I of the article explains why DHTs are important and how they work. In part II, we will explain how DHTs fall short for certain decentralized applications, and discuss why it’s important to make file discovery secure and efficient.

Sources

[1] https://hackernoon.com/xor-the-magical-bit-wise-operator-24d3012ed821

[2]https://hackernoon.com/combining-blockchain-and-file-storage-might-just-decentralize-the-internet-d94c2701a8fa

--

--