Exploring Distributed Hash Tables with Beehive

Andrew Jones
Princeton Systems Course
9 min read · Jan 26, 2020

Andy Jones and Grigory Chirkov

Introduction

Distributed hash tables (DHTs) are a type of peer-to-peer distributed system designed to store data across multiple nodes efficiently. At a general level, a hash function is used to assign data objects to nodes, and lookup queries are routed across several nodes to find an object of interest.

In unstructured DHTs, lookup messages are flooded to all nodes in the network, which results in poor performance. Structured DHTs — such as Chord, Pastry, and Beehive — instead use routing tables to forward lookup messages from node to node, locating the relevant node far more efficiently.

A major goal and challenge in building any DHT is minimizing the time required to perform a lookup query. Typically, the efficiency of a DHT is measured by the number of “hops” — or messages between nodes — required to answer a query. Unstructured DHTs require O(N) hops to perform lookups in a network of N nodes because an independent query is sent to each node in the system. Structured DHTs use more efficient routing schemes to achieve O(log N) hops per query; Chord, for example, uses finger tables that halve the remaining distance to the target identifier at each hop.

In this project, we examined a DHT called Beehive that is designed to reduce latency even further — down to a time that is constant in the number of nodes. We explore this system in theory and practice below.

System overview

Beehive is a structured DHT that guarantees a constant average number of hops per lookup, i.e. O(1). It achieves this efficiency by replicating data across nodes in a way that exploits the typical distribution of queries. Specifically, many query workloads follow a Zipf law, a popular empirical model of query popularity across objects. In a Zipf distribution, an object's popularity (i.e., the fraction of all queries directed to it) decays roughly as a power law of its popularity rank.
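
To make the popularity skew concrete, here is a small illustrative sketch (our own, not taken from the Beehive paper or our OverSim code) of a Zipf popularity distribution, where the popularity of the rank-k object is proportional to 1/k^α:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Illustrative sketch: Zipf popularity over numObjects objects with
// exponent alpha. popularity(k) is proportional to 1 / k^alpha,
// normalized so the probabilities sum to 1.
std::vector<double> zipfPopularity(int numObjects, double alpha) {
    std::vector<double> p(numObjects);
    double norm = 0.0;
    for (int k = 1; k <= numObjects; ++k)
        norm += 1.0 / std::pow(k, alpha);
    for (int k = 1; k <= numObjects; ++k)
        p[k - 1] = (1.0 / std::pow(k, alpha)) / norm;
    return p;
}

int main() {
    // With alpha close to 1, a small set of objects receives most of the
    // queries -- exactly the skew Beehive exploits when deciding what to replicate.
    std::vector<double> p = zipfPopularity(10000, 0.9);
    std::printf("rank-1 object:   %.4f of all queries\n", p[0]);
    std::printf("rank-100 object: %.6f of all queries\n", p[99]);
    return 0;
}
```

With α near 1, a handful of objects accounts for a large share of all queries, which is what makes it worthwhile to replicate only the most popular objects aggressively.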

Fig. 1: The Zipf distribution at various parameter values. In our case, the horizontal axis “k” represents the popularity rank of objects, and the vertical axis represents the objects’ popularity. Image credit to the Zipf Wikipedia page.

Beehive estimates the empirical form of the Zipf distribution at a given point in time, and then computes an optimal replication scheme for the DHT that minimizes query latency. In a typical structured DHT such as Chord, a query is initially sent to a (random) node in the system and is then routed from node to node using each node's routing table until it reaches the object's “home node”, where the object is stored. Beehive, instead of storing each object only at its home node, replicates each object onto nodes that precede the home node on lookup paths, with the extent of replication determined by the object's current popularity; this reduces the number of hops required to reach the object. Highly popular objects are replicated many hops back along these paths, while unpopular objects receive little or no replication. Under this scheme, the most popular objects are reachable within a small number of hops from anywhere in the network, and the overall average number of hops stays low.

Replication protocol

The replication protocol follows two phases: analysis and replication.

Nodes and objects are represented by numerical identifiers, and Beehive uses “prefixes” to identify objects within the system. For example, if an object has identifier 0121, its prefixes include 012*, 01*, 0*, and * (the empty string).
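
As a small illustration (our own helper, not part of the Beehive protocol itself), enumerating the prefixes of an identifier is straightforward:

```cpp
#include <string>
#include <vector>

// Sketch: enumerate the prefixes of an object identifier, from the longest
// proper prefix down to the empty prefix. For "0121" this returns
// {"012*", "01*", "0*", "*"}.
std::vector<std::string> prefixesOf(const std::string& id) {
    std::vector<std::string> prefixes;
    for (int len = static_cast<int>(id.size()) - 1; len >= 0; --len)
        prefixes.push_back(id.substr(0, len) + "*");
    return prefixes;
}
```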

Analysis phase

In the analysis phase, each node independently estimates the popularity of each object and then determines each object's appropriate replication level. In particular, for each level i, each node estimates the fraction of objects that should be replicated at level i or lower — call this fraction p_i. In preparation for replication, it then sorts the objects by popularity and locally updates the metadata so that the top p_i fraction of objects is marked for replication at level i or lower. (For brevity, we omit the mathematical details of the estimation procedure here.)
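
To make this concrete, here is a minimal sketch of the level-assignment step, assuming each node already has per-object access counts and the fractions p_i from the (omitted) estimation step. All names and structures below are ours rather than the paper's, and the fractions are assumed to be non-decreasing in i:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Sketch of the per-node analysis step. Objects are sorted by estimated
// popularity, and the most popular fraction[i] of them is marked for
// replication at level i or lower (a lower level means replication on
// more nodes; maxLevel means "stored at the home node only").
struct ObjectStats {
    std::string id;
    long long accessCount;  // popularity estimate aggregated at this node
    int level;              // replication level assigned by the analysis
};

void assignReplicationLevels(std::vector<ObjectStats>& objects,
                             const std::vector<double>& fraction,  // fraction[i] = p_i
                             int maxLevel) {
    std::sort(objects.begin(), objects.end(),
              [](const ObjectStats& a, const ObjectStats& b) {
                  return a.accessCount > b.accessCount;
              });
    for (std::size_t rank = 0; rank < objects.size(); ++rank) {
        double share = static_cast<double>(rank + 1) / objects.size();
        int level = maxLevel;  // default: stored at the home node only
        for (int i = 0; i < maxLevel; ++i) {
            if (share <= fraction[i]) { level = i; break; }  // inside the top p_i objects
        }
        objects[rank].level = level;
    }
}
```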

Replication phase

During replication, each node communicates with the nodes one hop away from its own position. In particular, each node A follows this protocol. A sends each of its successors B a list of the identifiers of objects for which B is the “deciding node”; a deciding node determines whether an object's replication should increase, decrease, or stay the same. B then sends back a list of objects that A must now replicate and a list of objects that A no longer needs to replicate. Finally, A stores the objects in the first list and deletes the objects in the second list.
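
A rough sketch of this exchange, with hypothetical message and function names (the real implementation exchanges OverSim RPC messages), looks like this:

```cpp
#include <climits>
#include <map>
#include <string>
#include <vector>

// Sketch of the messages exchanged in the replication phase. The names and
// fields are our own, not the actual message definitions from our code.
struct ReplicationRequest {
    std::vector<std::string> objectIds;  // objects for which the receiver is the deciding node
};

struct ReplicationResponse {
    std::vector<std::string> toReplicate;  // objects the predecessor should now store
    std::vector<std::string> toDrop;       // objects the predecessor no longer needs
};

// Runs at the deciding node B. For each object the predecessor A asked about,
// compare the object's freshly analyzed level with the level A sits at and
// decide whether A should hold a replica (a lower level means wider replication).
ReplicationResponse decide(const ReplicationRequest& req,
                           int predecessorLevel,
                           const std::map<std::string, int>& analyzedLevel) {
    ReplicationResponse resp;
    for (const std::string& id : req.objectIds) {
        auto it = analyzedLevel.find(id);
        int level = (it != analyzedLevel.end()) ? it->second : INT_MAX;
        if (level <= predecessorLevel)
            resp.toReplicate.push_back(id);  // popular enough to be replicated at A
        else
            resp.toDrop.push_back(id);       // A may delete its replica, if it has one
    }
    return resp;
}
```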

The figure below illustrates how an object with identifier 0121 is replicated. At first it is stored only at its home node, E, at level 3 (E's identifier matches 0121 in the prefix 012*, i.e. in 3 digits). At some point, nodes B, E, and I send a message to E indicating that E is the deciding node for object 0121. If E decides that object 0121 should be replicated at level 2, it pushes the object to nodes B and I (E already stores it). Later, nodes A and C contact B, D and F contact E, and G and H contact I to check whether object 0121 should be replicated further. Nodes B, E, and I independently decide whether to replicate the object to level 1 — they need not all perform the same action.

Fig. 2: Example predecessor-successor relationships, where arrows point from successor to predecessor. Beehive uses numerical identifiers and prefixes to route queries through the network and perform replication. Figure credit to the Beehive paper (see references).

This protocol happens at regular intervals throughout the life of the system. By repeatedly performing the analysis and replication phases, the system is able to adapt to objects entering the database and to changes in object popularity over time.
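
Schematically, each node's maintenance cycle looks like the following sketch (the function names are placeholders, not our actual OverSim handlers):

```cpp
// Hypothetical sketch of the periodic maintenance driver.
void runAnalysisPhase();                    // re-estimate popularity, reassign levels
void runReplicationPhase();                 // exchange push/drop lists with successors
void scheduleNextMaintenance(int delayMs);  // re-arm the timer

void onMaintenanceTimer() {
    runAnalysisPhase();
    runReplicationPhase();
    scheduleNextMaintenance(1000);  // the 1000 ms replication delay used in our experiments
}
```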

Implementation

We wrote our implementation of Beehive primarily in C++, and we used the OverSim framework to simulate the DHT through a series of “put” and “get” operations. In a deviation from the authors of Beehive, we decided to build Beehive on top of Chord rather than on top of Pastry: Chord is simpler in its design and implementation, and we were more familiar with it from the class lectures.

Description of Oversim

OverSim is an open-source overlay and peer-to-peer network simulation framework. It is based on the OMNeT++ simulation environment, which means that the simulation runs on top of an accurate network simulator. OverSim's structure is shown in the diagram below:

Fig. 3: Overview of OverSim. Image credit to the OverSim webpage.

OverSim consists of three tiers: Underlay, Overlay, and Application. The Underlay tier is responsible for simulating all low-level network communication. The Overlay tier provides a routing interface and maps keys to the nodes responsible for them. The Application tier acts as a load generator for the whole system. OverSim is highly modular, meaning that the modules in each tier can be chosen independently. The tiers are strongly isolated from one another, and communication between them happens through predefined abstract interfaces. This had a large effect on our implementation, as described below.

Implementation details

Our initial goal for this project turned out to be too ambitious, because we did not account for the effort of learning and extending such a complex simulator. We therefore simplified our work in two ways:

  1. We decided to build Beehive on top of Chord (instead of Pastry). This starting point was simpler and involved less sophisticated code.
  2. One of the paper's main contributions is a popularity-based replication level chosen separately for each object. Gathering and aggregating popularity statistics is a large task in its own right, so our minimum viable product was a version of Beehive that uses a constant replication level for all objects (see the sketch after this list).
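
In code, the simplification amounts to replacing the popularity-driven level from the analysis phase with a fixed constant. The names below are hypothetical, not the identifiers from our OverSim code:

```cpp
#include <string>

// Sketch of our simplification: every object gets the same fixed replication
// target instead of a popularity-based one.
constexpr int kReplicationLevel = 8;  // one of the values {2, 4, 8, 16, 32} we evaluate below

int replicationTargetFor(const std::string& /*objectId*/) {
    // Full Beehive would derive a per-object level from the estimated Zipf
    // popularity; our minimum viable product returns a constant for all objects.
    return kReplicationLevel;
}
```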

Implementing the Beehive model in OverSim highlighted the main disadvantage of this protocol: it mixes the Overlay and Application layers. In Beehive, the presence of data on a node affects routing, because a get request should be routed not to the home node but to the closest node that already holds a replica of the data. This forced us to break many abstractions between tiers, and it was the most time-consuming part of the implementation. We also believe this complication would arise in a real-world prototype, so the drawback is not specific to the simulator.
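
The crux of the issue can be sketched as follows (our own simplified C++, not the OverSim API; the helper functions stand in for calls that cross the Overlay/Application boundary):

```cpp
#include <string>
#include <unordered_map>

// Simplified sketch of why the Overlay and Application tiers must cooperate
// in Beehive: a get is answered by the first node on the route that already
// holds a replica, not only by the home node.
struct BeehiveNode {
    std::unordered_map<std::string, std::string> localStore;  // replicated objects on this node

    // Hypothetical helpers standing in for calls that cross the tier boundary.
    void replyToClient(const std::string& key, const std::string& value);
    void forwardTowardHomeNode(const std::string& key);

    void handleGet(const std::string& key) {
        auto it = localStore.find(key);
        if (it != localStore.end())
            replyToClient(key, it->second);   // answer locally: the lookup stops here
        else
            forwardTowardHomeNode(key);       // fall back to normal Chord-style routing
    }
};
```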

Evaluation

We evaluated our implementation primarily by measuring the latency of queries to the system. OverSim has several test applications that perform a series of “put” and “get” operations, so we put our implementation to the test in these simulations.

As a first test, we simulated Beehive's performance at several levels of replication. The “replication level” in this context is the number of predecessors that store replicas of each object (recall that our system is an approximation of Beehive and uses a constant replication level for all objects). We ran Beehive at five replication levels (2, 4, 8, 16, and 32), repeating each five times, and extracted the mean latency of “get” operations for each run. The replication delay — the interval between successive replication rounds — was set to 1000 milliseconds in all experiments. Figure 4 shows the results: as expected, queries become faster at higher replication levels. Interestingly, the latency appears to decrease linearly as the replication level increases exponentially, i.e., roughly logarithmically in the replication level.

Fig. 4: Mean latency for Beehive queries across different levels of replication. “Replication level” here refers to the number of predecessors on which each object is replicated.

To benchmark our implementation, we also measured query latency for Chord, the system on top of which we built Beehive, using OverSim's built-in implementation of Chord. Figure 5 shows the query latency for Chord and for Beehive (with a replication level of 32). As expected, Beehive's query latency is lower than Chord's: replicating objects onto predecessor nodes shortens the path that queries must follow before returning to the client.

Fig. 5: Mean latency for queries in Beehive and Chord. Replication level is 32 here.

Although Beehive's replication protocol greatly reduces latency, it comes at the cost of the overhead spent coordinating and executing the replication. Figure 6 shows the number of bytes per second sent to maintain replication. Clearly, heavy replication incurs a significant cost.

Fig. 6: Bytes sent per second on “maintenance messages”, which includes messages used to coordinate replication, as a function of the replication level.

Discussion

In this project, we reimplemented a version of Beehive, a structured distributed hash table that ensures low-latency queries through a clever replication strategy. During the course of the project we came face-to-face with some difficult technical challenges, and although we were not able to implement a full version of Beehive with all its complexities, we implemented a core piece of it: the replication protocol.

Overall, although Beehive's theoretical guarantees are quite promising, implementing a fully functioning version of Beehive proved more challenging than we expected. As mentioned above, a major challenge was properly interfacing between the overlay and application layers. OverSim is designed to keep these two layers largely independent of one another, but Beehive requires constant communication between them. Working out how to coordinate the two layers was difficult, but we were able to do so.

Overall, the challenges ranged from practical development ones, such as ensuring that RPCs were coordinated at the right times to the right nodes, to more fundamental questions about the system’s design, such as how to identify the optimal time interval to wait before replicating again. The latter type of challenge, in which the paper doesn’t fully specify the system’s design, could make for interesting research questions in themselves.

If we were to continue with this line of work, there are a few questions we would be interested in exploring further. First, it would be helpful to understand Beehive’s behavior when the distribution of object popularity is not Zipf. Virtually all of the latency benefits of Beehive are derived from the assumption of a Zipf distribution. Second, it would be interesting to explore the benefits of DHTs in the face of malicious nodes. The security of DHTs seems to be a major selling point, but we were not able to explore this direction.

Altogether, we learned a lot about the practical implementation of distributed systems, and about distributed hash tables in particular. We were especially struck by how much better we understood the system after attempting to implement it, as opposed to simply reading the paper.

Code

Our code is available on Github: https://github.com/grigoriy-chirkov/oversim.

References

  • Ramasubramanian, Venugopalan, and Emin Gün Sirer. “Beehive: O(1) Lookup Performance for Power-Law Query Distributions in Peer-to-Peer Overlays.” Proceedings of NSDI, 2004.
  • Stoica, Ion, et al. “Chord: A scalable peer-to-peer lookup service for internet applications.” ACM SIGCOMM Computer Communication Review 31.4 (2001): 149–160.
