Distributed Hash Tables

Data storage and routing in decentralized applications

Prashanth Irudayaraj
Keep Network
7 min read · Dec 12, 2017


This is the first article in a series exploring the underlying concepts behind modern decentralized systems, including the Keep network.

How would you go about quickly and efficiently storing and retrieving large amounts of data? And when I say large, I mean on the order of exabytes. This is part one of a two-part series about storing and retrieving data for the Keep network.

Storing data is an enormous problem, and it's getting worse as more digital content is generated and more data is stored in the cloud. The solution can be found in something called distributed hash tables, or DHTs. In this first part of our series, we will explore what DHTs are and how they are used by well-known entities such as BitTorrent and Storj, and newer networks such as IPFS.

DHTs are currently an area of significant innovation and are one of the core building blocks of distributed technology. They form the backbone of file storage and node discovery, both essential needs of the Keep client. Clients need to be able to find each other and share data to be productive members of the network. A good DHT is therefore essential to Keep's success.

Before we do a deep dive into distributed hash tables, let’s quickly review what a hash table is and how it works.

Hash Table

A hash table, also known as a hash map, is a type of data structure. To understand how one works, let's use a library bookshelf as an example. In a typical library, the Dewey Decimal System (DDS) is used to assign each book a storage ID called a call number, which gives the book a specific location for storage and retrieval. The call number is generated by applying the rules of the DDS to a book based on the category it falls under. The library shelves are numbered to correspond to the range of call numbers available through the DDS.

Dewey Decimal System Categories

You may remember from your days in school that this made it very easy to find books. All you had to do was look up the call number of the book, then navigate your way through the library using a map showing where rows of bookshelves with specific call number prefixes were located. When you found a row with the correct prefix, you could locate the book by following the numbers on the individual shelves sequentially until you found the one you wanted.

Let's return from the land of books to the land of bits. The Dewey Decimal System of assigning call numbers is in many ways a very simple hash function, and the call number is a hash value. Books can be viewed as "data", and the shelves are analogous to data storage locations with hash values to uniquely identify them. Finally, your brain and spatial mapping ability are the equivalent of a routing algorithm that enables you to use the hash value of the book to retrieve it from its storage location.

The hash function, hash value, and routing algorithm are the key elements of a hash table, which uses them to store and retrieve large amounts of data quickly and reliably. To accomplish this, the hash function, hash value, and routing algorithm each need some very specific properties.

The hash function must be deterministic: the same input always produces the same output, so a given key always maps to the same location. The hash value needs to belong to an ID space large enough to accommodate the amount of data being stored. And to retrieve data quickly, the routing algorithm must be efficient; i.e., it needs to store and retrieve data from a location in the fewest possible operations.

A database of phone numbers as an example of a hash table
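To make this concrete, here is a minimal sketch of a hash table in Python, in the spirit of the phone-number example above. The bucket count and the toy hash function are illustrative choices, not how production hash tables are tuned:

```python
# A toy hash table: a deterministic hash function, a fixed array of
# buckets (the "shelves"), and chaining to handle collisions.

NUM_BUCKETS = 8  # illustrative; real tables use far larger ID spaces

buckets = [[] for _ in range(NUM_BUCKETS)]

def hash_key(key):
    # Deterministic: the same key always lands in the same bucket.
    return sum(ord(c) for c in key) % NUM_BUCKETS

def put(key, value):
    bucket = buckets[hash_key(key)]
    for i, (k, _) in enumerate(bucket):
        if k == key:
            bucket[i] = (key, value)  # overwrite an existing entry
            return
    bucket.append((key, value))

def get(key):
    for k, v in buckets[hash_key(key)]:
        if k == key:
            return v
    return None  # not found

put("Tom", "303-555-0187")
print(get("Tom"))  # -> 303-555-0187
```

Because `hash_key` is deterministic, a lookup jumps straight to the one bucket that could hold the entry, rather than scanning every shelf.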

From Hash Tables to Distributed Hash Tables

To obtain a high-level understanding of how DHTs work, we can extend our library analogy to include libraries across countries and continents. Let's assume we wanted to give library users access to books located anywhere in the world. To do this, I'd like to introduce you to our librarian, Tom.

As a librarian, Tom is responsible for storing books and finding books for library goers. He works at a local library in Denver, Colorado. This library can store a fair number of books, but not all the books the local library goers could want.

If our library system were structured like a distributed hash table, Tom's library would be one node in the network. To find a book in the network, Tom would have a list of the nearest libraries he can reach out to first, to see if a requested book is available. If none of these had the book, they would reach out to their nearest libraries, and so on, until the book was found or no libraries remained.

Further, since not all books can be stored at every library, the system might make certain libraries responsible for some books and not for others. Responsibility could be assigned based on books’ metadata such as country of origin, language, etc.

Once again, moving from the land of books to the land of bits: in a typical distributed hash table, data is stored across many nodes (libraries) that enter and leave the system. The DHT protocol determines how data is distributed among these nodes, ensuring that it can be easily searched for and retrieved.
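Here is a hedged sketch of that assignment rule: each key is hashed into the same ID space as the nodes, and the node whose ID is closest becomes responsible for it. The library names are invented, and the "numerically closest" rule is a simplification; real DHTs such as Kademlia use XOR distance and per-node routing tables:

```python
import hashlib

def key_id(key):
    # Hash the key into the same ID space the node IDs live in.
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

# Hypothetical node IDs -- in a real DHT each node derives its own.
nodes = {
    "denver":  key_id("denver-library"),
    "tokyo":   key_id("tokyo-library"),
    "nairobi": key_id("nairobi-library"),
}

def responsible_node(key):
    # The node whose ID is closest to the key's ID stores the value.
    kid = key_id(key)
    return min(nodes, key=lambda name: abs(nodes[name] - kid))

print(responsible_node("moby-dick"))  # e.g. 'tokyo'
```

The appeal of this scheme is that every node can compute responsibility locally, with no central index deciding who stores what.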

The Evolution of Distributed Hash Tables

The advent of mainstream distributed hash tables occurred in the late 1990s and early 2000s, coinciding with the emergence of P2P systems such as Gnutella, Freenet, Napster, and BitTorrent. These systems sought to take advantage of distributed storage and computation resources, and developed their own protocols to suit their needs.

The following are some notable, but not comprehensive, events in DHT history:

Napster

Napster used a central indexing server to provide a "host list" of users who were currently logged into the system. The server did not store any files itself, but kept a list of the files hosted by each user in the host list. When a user searched for a file with the Napster client, the server would look through the host list to see who had the file and return a list of those hosts to the requester. The user's Napster client would then connect to a selected host and initiate a download, terminating the connection once the download completed.
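In code, Napster's design amounts to a single dictionary living on one server; something like the toy sketch below, where the hostnames and filenames are invented for illustration:

```python
# A toy central index in the style of Napster: one server-side map
# from filename to the hosts currently sharing it. If the machine
# holding this map goes down, the whole network loses search.

host_list = {}  # filename -> set of hosts sharing it

def log_in(host, shared_files):
    for filename in shared_files:
        host_list.setdefault(filename, set()).add(host)

def log_out(host):
    for hosts in host_list.values():
        hosts.discard(host)

def search(filename):
    # The server only returns *who* has the file; the download itself
    # happens peer-to-peer between the requester and a chosen host.
    return sorted(host_list.get(filename, set()))

log_in("alice-pc", ["song.mp3"])
log_in("bob-pc", ["song.mp3", "album.mp3"])
print(search("song.mp3"))  # -> ['alice-pc', 'bob-pc']
```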

If we apply our library analogy to Napster, it would be like having a central library responsible for storing the location of books at all the libraries that had ever interacted with it. If a user wanted a book, they would need to contact the central library first, retrieve the information of the library where the book was held, and then contact that library to get the book.

Napster’s key drawback as a DHT was its central indexing server, which created a single point of failure for the system.

BitTorrent

BitTorrent brought the modern DHT to the mainstream, launching the Kademlia-based Mainline DHT protocol in 2005. Mainline DHT is an improvement over the Napster system because there is no central point of failure. The system organically adapts to nodes entering and leaving. This aspect is key to the functioning of BitTorrent, because churn is high, with users around the world joining and leaving the system at unpredictable times.

The BitTorrent system is closer to a true DHT, and the distributed library system described above is a good high-level view of how it works.

Storj

Storj connects people with storage needs to storage providers through smart contracts over the Storj protocol. The Storj protocol uses a modified version of the Kademlia protocol, which shares many of the features of the BitTorrent Mainline DHT.
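Kademlia's defining trick is its distance metric: the distance between two IDs is simply their bitwise XOR, which lets a node home in on a key by repeatedly querying the closest peers it knows of. A minimal sketch of that metric follows; the 8-bit peer IDs are invented for readability, whereas real Kademlia uses 160-bit IDs and iterative lookups over the network:

```python
def xor_distance(a, b):
    # Kademlia distance: the bitwise XOR of the two IDs.
    return a ^ b

# Toy 8-bit node IDs; real Kademlia IDs are 160 bits.
known_peers = [0b00010110, 0b10100001, 0b01111000, 0b11001100]

def closest_peers(target, k=2):
    # Return the k known peers closest to the target under XOR distance.
    return sorted(known_peers, key=lambda p: xor_distance(p, target))[:k]

print([bin(p) for p in closest_peers(0b01110000)])
# -> ['0b1111000', '0b10110'] (distances 8 and 102)
```

Because XOR distance is symmetric and IDs sharing a longer common prefix are closer together, each hop in a lookup roughly halves the remaining search space.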

IPFS

Similar to the world wide web, IPFS (the InterPlanetary File System) is a distributed file system that links content across nodes. Unlike the web, content is not stored on individual servers; instead it is distributed across many nodes. To distribute this content, IPFS makes use of a BitTorrent-like DHT protocol developed by Protocol Labs. IPFS is open source and was recently used by the Catalan Pirate Party to host websites blocked by the Constitutional Court of Spain.
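One idea worth calling out is IPFS's content addressing: a file's identifier is derived from the file's bytes themselves, so any node holding the same bytes serves the same address. The sketch below is a deliberately simplified stand-in; IPFS's real identifiers are multihash-encoded CIDs, not raw hex digests:

```python
import hashlib

def content_address(data):
    # Simplified content addressing: the key *is* the hash of the data.
    # Real IPFS wraps this digest in a multihash/CID encoding.
    return hashlib.sha256(data).hexdigest()

page = b"<html>hosted by whoever holds these bytes</html>"
print(content_address(page))  # same bytes -> same address, on any node
```

This is why blocking a server does not take down IPFS content: the address points at the data, not at any particular machine.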

Filecoin

Filecoin is a distributed storage network that uses elements of the IPFS protocol in conjunction with an incentive layer for storage providers. The protocol for Filecoin is currently being developed by Protocol Labs, the same group developing IPFS.

Keep

DHTs are a critical tool in the Keep toolbox. Much like the use cases listed above, Keep clients will be spread around the world and will experience churn. They may also face attacks from malicious entities, and the Keep protocol must continuously evolve to resist them.

Keep will use DHTs to create a robust distributed protocol that can meet these needs and be amenable to open source development. Being open source will enable a global network of independent developers to innovate on the protocol. One of the first efforts of the Keep team will be to use the open source libp2p network stack to enable discovery of Keep clients. Stay tuned for part two of our decentralized education series, by Keep's very own Nik Grinkevich, about the libp2p network stack.

Learn More

For more information about the Keep Network:
