Implementing and Evaluating Performance of Chord and Koorde

Elena Gribelyuk
Princeton Systems Course
May 8, 2024

Elena Gribelyuk, Peter Halmos, Stephen Newman

Introduction

A central question in peer-to-peer (P2P) systems concerns how key-value pairs should be distributed and accessed across a decentralized network. For instance, these key-value pairs may represent file identifiers and their contents. Nodes in the network (also called “users”) locally maintain the key associated with each value, such as a movie or audio file, while the value itself is stored on some remote server. Thus, a natural question is the following:

How can nodes access values remotely while incurring minimal cost in state (space) and data-access across the network (time)?

Without a distributed topology for storing or accessing these key-value pairs, one must either incur cost in storing system state or in dataset access through a network. Previous systems, such as Napster, maintain centralized servers which store the location (IP address) of every server that owns a key-value pair. When failure rates are extremely low, some versions of this approach are feasible (in particular, key-hashing allows for load-balancing across machines). However, this approach incurs O(N) state and suffers from the communication bottleneck of routing all accesses through a centralized database. Moreover, we often wish (in decentralized ledgers, file-sharing systems, etc.) to store files among a constantly changing (and often failing) set of machines. More decentralized systems which improved on the state cost, such as Gnutella, maintain constant per-node state but require O(N) messages per lookup by “flooding” the network. A naive intermediate between these two approaches is to simply maintain databases (either at the client or at separate query-helper servers) of which key-value pairs are stored on which machines. Unfortunately, when the number of machines (or even the size of the dataset) becomes too large, this approach also becomes intractable. In particular, it is impossible to efficiently maintain information about which machines have which keys, since the sets of both keys and machines are constantly changing.

Background: Chord and Koorde

Chord (Stoica et al., 2003) and its variant Koorde (Kaashoek and Karger, 2003) propose a novel solution to this problem. Nodes in a network are embedded in a graph which splits ownership of the keyspace among machines. By embedding both keys and machines in a single static graph and defining where key-value pairs are stored in it, this allows for both easy search of key-value pairs and robustness to node joins and failures. In particular, by regularly checking and ensuring that the properties of the graph are maintained through keyspace ownership changes, the topology itself offers guarantees on both access time and state. As an example, Chord and Koorde respectively rely on binary lookup trees and de Bruijn graphs. For both, every node (vertex) stores coarser data about the locations of distant machines and finer data about nearby machines. As such, these distributed hash tables (DHTs) abstract a hop through the network, implemented via recursive RPCs between machines, as either a binary search (Chord) or a traversal on a de Bruijn graph (Koorde). This allows the client to identify the node which owns a key-value pair using a number of hops which is only logarithmic in the total number of nodes in the network.

In Chord, all nodes and keys are SHA-hashed to integers modulo a large number Q, allowing one to imagine the keyspace as a modular circle. If a node N immediately succeeds a key K (mod Q), so that no other node is between them, it stores the value associated with key K. Notably, hashing balances the load uniformly between nodes. Chord’s central data structure for search is the “finger table,” which stores m = O(log Q) entries that recursively split the remaining keyspace in half. For j = 0, …, m−1, the finger table stores the address of the first node whose ID is ≥ (N + 2^j) mod Q. Additional information, such as the locations of its predecessor and several of its successors, is maintained for both search and robustness in the event of failures. Thus, for a key (address) k, one can “hop” on the network to a server which is at least halfway to k in the keyspace from the starting server, and recursively invoke another RPC to search for k. Therefore, the total number of such jumps that must be made before reaching a server storing that key is at most 1 + log₂ Q. To ensure low collision probability, it suffices to set Q to a relatively small (greater than 2) power of the number of servers, allowing for speedy lookups.
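To make the finger-table search concrete, here is a minimal, non-networked Python sketch of the lookup logic. The Node class, the between helper, and all names here are illustrative choices for this post (not our actual implementation), assuming correctly populated fingers:

M = 25                 # identifier bits
Q = 2 ** M             # keyspace size

def between(x, a, b, inclusive_right=False):
    # True if x lies in the circular interval (a, b), or (a, b] if requested.
    x, a, b = x % Q, a % Q, b % Q
    if a < b:
        return (a < x < b) or (inclusive_right and x == b)
    return (x > a or x < b) or (inclusive_right and x == b)

class Node:
    def __init__(self, node_id, successor=None):
        self.id = node_id
        self.successor = successor or self
        self.predecessor = None
        # finger[j] should hold the first node whose ID is >= id + 2^j (mod Q)
        self.finger = [self.successor] * M

    def closest_preceding_finger(self, key):
        # Farthest finger that still precedes the key.
        for f in reversed(self.finger):
            if between(f.id, self.id, key):
                return f
        return self

    def find_successor(self, key):
        # Each hop at least halves the remaining distance to the key,
        # so roughly log2(Q) iterations suffice.
        node = self
        while not between(key, node.id, node.successor.id, inclusive_right=True):
            node = node.closest_preceding_finger(key)
        return node.successor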

Koorde exploits the structure of de Bruijn graphs to obtain a lookup-speed improvement over that of Chord. In a d-dimensional de Bruijn graph there are Q = 2^d nodes, and each node with identifier n is connected to its own predecessor and successor, as well as to the nodes with IDs 2n and 2n+1 (mod Q). So, in order to search for another node with ID n’, we can simply traverse the de Bruijn graph by left-shifting (<<) the initial ID n and iteratively shifting in the top bits of n’ from the right. For example, to search for a target node 110 from initial node 010, we would simply traverse the de Bruijn graph by hopping 010 → 101 → 011 → 110. Thus, it follows that we can search for any node n’ using at most O(log Q) hops. More generally, if we consider an embedding space of length-d words on a k-letter alphabet, then node n can store the node IDs for nodes k*n, k*n+1, …, k*n+(k−1), and search for other nodes n’ as described previously (via left-shifts and shifting in the digits of the target ID n’). Indeed, we observe that this allows us to reach any of the Q = k^d servers using only d = log_k Q hops, and setting k = log Q allows us to store only O(log Q) neighbor locations while traversing in only O(log Q / log log Q) hops, a nontrivial improvement over Chord. However, in reality, we do not have servers at every point in the de Bruijn graph (since we again locate them by consistent hashing); instead, by simply letting servers act as “imaginary nodes” for points between themselves and the next server, we can resolve this with constant-fraction expected overhead. Therefore, the above search algorithm should be adjusted as follows: for each node (with identifier n), degree-2 Koorde maintains the immediate successor and predecessor of n, as well as the predecessor of 2n (as noted by the authors, maintaining a pointer for 2n+1 is not necessary, as the predecessor of 2n+1 is very likely the same as the predecessor of 2n). As mentioned earlier, this achieves the same asymptotic runtime as the ideal traversal of the de Bruijn graph.
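In the ideal (fully populated) case, the traversal is nothing more than this shifting procedure; the tiny Python sketch below is purely illustrative and mirrors the example above:

def de_bruijn_hops(start, target, d):
    # Route from start to target in a 2^d-node de Bruijn graph by
    # left-shifting start and shifting in the bits of target, most
    # significant bit first.
    q = 1 << d
    path = [start]
    cur = start
    for j in reversed(range(d)):
        bit = (target >> j) & 1
        cur = ((cur << 1) | bit) % q        # one de Bruijn hop
        path.append(cur)
    return path

# The example from the text: 010 -> 101 -> 011 -> 110
assert de_bruijn_hops(0b010, 0b110, 3) == [0b010, 0b101, 0b011, 0b110]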

System Overview

In both protocols, servers join over time by notifying existing servers. A joining server A first announces itself to an established server B. Using B’s lookup subroutines, the protocol finds the correct neighbors of A: for example, B might find A’s predecessor and successor nodes in Chord, or the nodes one bit-shift away in the de Bruijn graph in Koorde, to place A in the topology. This also determines the keyspace under A’s ownership. Moreover, B finds other metadata which A must store, such as the existing servers needed to populate A’s finger tables. Lastly, A finds all existing servers which should be informed of its existence, migrates key-value information, and begins accepting queries as a server in the DHT.
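As a rough illustration of the join path in Chord style (reusing the illustrative Node and find_successor sketch from above, and ignoring finger-table population and failure handling), the core of a join can be sketched as follows; this is a simplified reading of the protocol, not our production code:

def join(new_node, bootstrap):
    # Ask any established node to locate new_node's position on the ring.
    successor = bootstrap.find_successor(new_node.id)
    new_node.successor = successor
    new_node.predecessor = None      # learned later via stabilization
    # new_node now owns the keys in (predecessor(new_node), new_node.id];
    # its successor hands over those key-value pairs before new_node
    # begins serving queries.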

As Chord and Koorde maintain a DHT subject to dynamic node failures, they must periodically self-stabilize. In particular, they must update their lookup data structures to maintain a correct representation of the network topology. Furthermore, both protocols require servers to periodically check their finger tables or other known nodes to make sure that they are live, and if they are not, use the joining functionality to repair their finger tables. They also propagate knowledge of the failure to other servers, allowing them to store the dropped key-value pairs to maintain redundancy.
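Our periodic repair follows the shape of the Chord paper’s stabilize/notify routines; in sketch form (again reusing the illustrative helpers from above, and eliding the liveness checks on fingers):

def stabilize(node):
    # Ask our successor who it thinks its predecessor is; if someone has
    # joined between us, adopt them as the new successor.
    x = node.successor.predecessor
    if x is not None and between(x.id, node.id, node.successor.id):
        node.successor = x
    notify(node.successor, node)

def notify(node, candidate):
    # node learns that candidate might be its predecessor.
    if node.predecessor is None or between(candidate.id, node.predecessor.id, node.id):
        node.predecessor = candidate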

Our Implementation

We implement partial versions of Chord and Koorde (without redundant storage) in Python, and we further extend Chord for use in the SimGrid environment to evaluate relative performance (since we lack access to a real network of sufficient size to observe meaningful performance scaling). For various reasons (covered in greater detail in the discussion section below), we were not able to fully integrate Koorde into SimGrid, so we only provide performance graphs for our Chord implementation (in SimGrid) to compare against those of the Chord paper. To compare the performance of Koorde and Chord directly, we construct a simple emulator from scratch, which essentially performs many node joins and lookup operations for both. In the next subsection, we describe the fundamental setup of our approach for integrating Chord into the SimGrid emulator.

Integrating DHTs into SimGrid

We simulated networks of actors (of various sizes) as well as clients querying under the closed-loop model. We implemented Chord under the simple assumption of a single successor pointer, and our implementation of Koorde uses k = 2, i.e., search is simulated on the degree-2 de Bruijn graph. We chose to implement these versions in order to better compare the performance of Koorde and Chord while controlling for the same asymptotic runtime.

While the simulation environment rendered experimentation much less costly, it introduced substantial complexity and problems into the implementation. SimGrid presents three core abstractions: Hosts (machines), Actors (processes), and Mailboxes (for networking). In the next paragraph, we describe how restrictions on the latter two abstractions caused substantial issues over the course of our implementation.

In some ways, Actors provide an excellent abstraction for individual servers. Unfortunately, they are also required to be single-threaded. While this is not an issue for some aspects of the implementation (such as clients), individual tasks that servers must perform often require a sequence of continuation-blocking RPCs to other actors. However, to avoid deadlock, servers must remain responsive to several simultaneous RPCs while their own RPCs are blocked, thus requiring that each server either be represented by a collection of actors or be intensely stateful (i.e., knowing what blocking RPCs it has outstanding, what it wants to do with the results, etc.).

We proceeded with the former approach. This may have been a mistake. While collections of actors can nominally achieve all of the same functionality as multithreading, it is difficult to share state between two actors as it changes. As an example of where this becomes problematic, consider the stabilize operation of Chord and Koorde: a node must perform several (recursive) RPCs to other nodes to determine successors, and then update its own state with the results. We cannot have a primary thread perform these RPCs in a blocking manner, but neither can we have a secondary thread without allowing it to update the state of the primary. Moreover, this state update should not always occur: stale data should be ignored, and so on. We eventually implemented a sufficiently robust scheme of RPCs, imperative remote updates, etc. to effectively allow us to layer threading on top of the actor-mailbox system. From here, we perform nearly all blocking operations by spawning a transient thread to complete said communication and die.

The difficulty of the above was compounded by several idiosyncrasies of the networking primitive, Mailboxes. A SimGrid mailbox, identified by name, allows actors to either put() or get() messages to/from it, with corresponding non-blocking versions. Messages are delivered in order. In many ways, this abstraction is extremely helpful. However, it interacts poorly with transient actors. In particular, if actors die as they are sending (or, more commonly, receiving) messages, the simulator crashes, providing either an unspecified NetworkError (if you get lucky) or a segfault (if you get unlucky), and preventing you from recovering any information about simulation state at that point or prior. This made debugging challenging.
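To give a flavor of the pattern, here is a heavily simplified sketch of issuing one blocking RPC from a transient actor via named mailboxes. It assumes the SimGrid Python bindings (simgrid.Actor, simgrid.Mailbox) roughly as in their documented examples; the payload format and helper names are ours, and this is not our production code:

from simgrid import Actor, Mailbox

def rpc_worker(server_mailbox_name, request, reply_mailbox_name, on_reply):
    # Transient actor: send one request, block for the reply, hand the
    # result back to the owner via a callback, then die.
    reply_box = Mailbox.by_name(reply_mailbox_name)
    server_box = Mailbox.by_name(server_mailbox_name)
    # The request carries the name of the mailbox to answer on.
    server_box.put({"req": request, "reply_to": reply_mailbox_name}, 64)
    result = reply_box.get()       # blocks only this transient actor
    on_reply(result)               # imperative update of the owner's state

def issue_rpc(owner_host, server_mailbox_name, request, on_reply, tag):
    # Each outstanding RPC gets its own uniquely named reply mailbox, so
    # that concurrent replies cannot be consumed by the wrong get().
    Actor.create(f"rpc-{tag}", owner_host, rpc_worker,
                 server_mailbox_name, request, f"reply-{tag}", on_reply)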

Mailboxes also exhibit some performance penalties. To better simulate actual networking, mailboxes may be bound to receiving hosts, allowing messages to be transmitted before the get() request is made. However, doing this prevents effective garbage collection (especially when mailboxes must be accessed by multiple actors, which seems to confuse the GC). Mailbox GC becomes an even more substantial performance problem when spawning many transient actors, all of which need their own mailboxes (since accessing a mailbox removes a message, we need per-actor reception mailboxes in most cases to guarantee receipt). Many of these mailboxes are also accessed (programmatically, via information in the query message received from the transient actor) by stable actors. This prevents effective garbage collection of the mailboxes after the transient actors die, since the stable actors are still alive and the garbage collector is neither clever enough to know that the mailboxes won’t be used again nor kind enough to expose manual collection.

Failure to implement Koorde in SimGrid

Despite substantial effort, and despite our successful SimGrid implementation of Chord, we were unable to produce a working SimGrid implementation of Koorde. While it is still unclear what went wrong, we believe that the issue stems from subtle bugs in our implementation of update_others (the function which, as a node joins the network, updates the next pointers of the relevant nodes). As discussed elsewhere, the original Koorde paper does not describe update_others, and we were unable to find an existing implementation of Koorde with a working update_others. While we believe that our implementation of update_others is correct, there may be a subtlety in the pointer assignments which was not accounted for while integrating the code into SimGrid. Debugging this function was difficult for the reasons described above: after completing the rest of the implementation, we spent ~15 person-hours working exclusively on this before conceding. In brief, we failed to reproduce fully distributed Koorde due to the lack of specification of update_others and a failure on our part either to accurately determine or to accurately implement its desired behavior in the simulator (we are still unsure which).

Evaluation and Results

Our experimental results can be divided into two parts:

  1. First, we show the performance of our SimGrid-integrated Chord implementation by computing the latencies and number of hops per lookup as the number of live nodes in the network increases.
  2. Next, we compare the performance of Koorde and Chord for a locally-run simulation.

Our implementation and experiments may be found at the following GitHub link: https://github.com/peterhalmos/COS518_FinalProject

Performance of SimGrid-Integrated Chord Implementation

The graphs for the first point are given below. Specifically, we validate the performance of our Chord implementation in SimGrid by plotting the lookup latency as the number of nodes in the network increases, as well as the expected number of hops as the number of nodes increases. Notably, both of these plots closely replicate those given in the original Chord paper.

Comparison of Chord and Koorde on a local simulation

Next, we consider two main evaluation benchmarks to compare Chord and Koorde using our local simulation. Specifically, we first evaluate the load-balancing of the two DHTs in assigning keys to the various nodes in the network. In both cases, we expect the number of keys per node to be Poisson-distributed. We plot the results for Chord and Koorde (using identical keys and hashes) below:
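The measurement behind these plots is straightforward; the following self-contained Python sketch shows the kind of computation involved (hash the keys, assign each to its successor node, and histogram the per-node counts; the helper names are illustrative, not our exact code):

import hashlib
from bisect import bisect_left
from collections import Counter

M = 25
Q = 2 ** M

def h(x):
    # Consistent hash of an arbitrary string into the keyspace.
    return int(hashlib.sha1(x.encode()).hexdigest(), 16) % Q

def keys_per_node(node_names, keys):
    node_ids = sorted(h(name) for name in node_names)
    counts = Counter()
    for key in keys:
        kid = h(key)
        # The key's successor: the first node ID >= kid, wrapping around.
        idx = bisect_left(node_ids, kid) % len(node_ids)
        counts[node_ids[idx]] += 1
    return counts

# e.g. 256 nodes and 10*256 keys, as in the experiment above
counts = keys_per_node(["node%d" % i for i in range(256)],
                       ["key%d" % i for i in range(10 * 256)])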

The plots above were created using 256 nodes and 10*256 random keys, with M = 25 and Q = 2²⁵. Importantly, since the node IDs and keys are identical between the two tests, this demonstrates that Chord’s and Koorde’s find_succ functions return the same successor node for any given key. Moreover, the bucketing follows a Poisson distribution as expected, implying no errors in network connectivity or key generation/distribution. This also demonstrates that our Koorde successor-finding works, although our lookup implementation requires an excessive number of hops, as seen in the plots below:

Above, our results begin to deviate from expectation. While Chord demonstrates expected behavior, the number of hops required in Koorde is far higher than it should be — linear in the number of nodes, as opposed to the logarithmic scaling predicted by the paper. We give a more thorough discussion of a possible reason for this outcome in the discussion section below. Finally, we provide the distribution of the number of hops taken by a Chord lookup; the graph below is for a key-space of size 2²⁵ and 256 nodes in the network. Again, Chord performs as expected in this case, as we see that the distribution is concentrated around the expected number of hops, which is O(log N).

Discussion

Discussion of Results for Chord in SimGrid & Local Chord

Our Chord implementation exhibits expected behavior across all tests, conforming to the results and predictions of [1]. In particular, the number of hops to answer a query is (symmetric-) binomially distributed, and both the expected number of hops and the query time scale logarithmically in network size. The only deviation in our tests is the high initial lookup latency (even for very small networks) exhibited in the SimGrid tests. This results from a specific aspect of our SimGrid implementation: every find_succ query causes the receiving node to spawn a single actor, responsible for both carrying out the sequence of calls needed to answer the query and returning the result. Spawning an actor in SimGrid appears to be fairly expensive in simulated computation time (though the corresponding action in a non-simulated implementation, spawning a thread, is extremely cheap), resulting in the consistent bump in cost.

Discussion of Results for Local Koorde

Our simulation results illustrate substantial changes from our expectations for the performance of Koorde. First, note that several components (key distribution and core connectivity) are as expected. However, although our implementation correctly finds the successor, predecessor, and “next” (i.e. predecessor of 2*ID) pointers for each node, we observe a total breakdown of the fast search procedure, with many queries taking O(N) hops, as opposed to the desired O(log N). We believe this may be due to the ambiguity in the lookup algorithm and “imaginary” node selection provided in the Koorde paper.

To better describe the main difficulty in the Koorde implementation (and also why our implementation requires a linear number of hops, far more than we expected), we give some more background on the Koorde algorithm described in the paper. As described in our introduction, recall that the Koorde lookup is defined as follows: if the current node has ID n, then node n can perform a lookup of node n’ by simply traversing the de Bruijn graph, i.e., by “shifting in” the most significant bits of n’ one at a time. Notably, this corresponds to hopping from n → (n || top_bit(n’)) mod Q → … → n’, where x || y means that we adjoin bit y to bit-string x on the right, so that x || y = (2x + y) mod Q. When every node of the de Bruijn graph is present in the network, this solution is clean and simple: it is relatively easy to see that O(log Q) hops suffice to reach any destination n’ from n. However, the main challenge is handling the case where not every de Bruijn graph node on the path from n to n’ actually appears in the network; indeed, this sparse case is the more important (and realistic) one to consider for our setting of DHTs.
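For reference, here is a rough Python transcription of how we read the paper’s degree-2 lookup. It assumes a Koorde node object with id, successor, and next (the predecessor of 2*ID) attributes, reuses the between helper and the constants M and Q from the Chord sketch above, and uses plain function calls in place of RPCs; treat it as our interpretation, not the paper’s exact pseudocode:

def top_bit(x):
    return (x >> (M - 1)) & 1

def lookup(node, k, kshift, i):
    # k: target key; i: current imaginary de Bruijn node;
    # kshift: the not-yet-consumed bits of k, most significant bit first.
    if between(k, node.id, node.successor.id, inclusive_right=True):
        return node.successor                  # our successor owns k
    if between(i, node.id, node.successor.id, inclusive_right=True):
        # We "host" the imaginary node i: simulate one de Bruijn hop and
        # forward along our de Bruijn (next) pointer.
        next_i = ((i << 1) | top_bit(kshift)) % Q
        return lookup(node.next, k, (kshift << 1) % Q, next_i)
    # Otherwise, walk the ring until some node hosts the imaginary node i.
    return lookup(node.successor, k, kshift, i)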

To handle this case, the paper describes extending the above (simplified) algorithm as follows: instead of performing hops on the real de Bruijn graph, the algorithm selects an imaginary starting node i and, at each step, simulates a hop from i to i || top_bit(n’), thus shifting in the correct bit-values for the target node n’. While this approach seems plausible, many crucial details that would allow the algorithm to be implemented as intended were omitted. For instance, over the course of studying this approach, it became clear that the way in which the “best imaginary node” is selected before performing the simulated lookup(n’, i) operation greatly affects the number of hops needed to arrive at the target node n’. This procedure is not provided by the paper; the paper briefly mentions one possible approach, which we tried, but it did not seem to improve the asymptotic number of hops for a lookup. Notably, the lookup function (excluding the procedure to select the imaginary node i) was the only function explicitly given in the paper, and we followed its specification exactly. However, after many hours of debugging and attempting to fix our implementation, we believe that the lookup function (as well as the selection of the imaginary node i) would need to be adjusted significantly from the presentation in the paper in order to achieve the desired performance, and we suspect our current selection of the best “imaginary” node is the culprit.

On a separate (but related) note, we had difficulties implementing the procedure update_others, which should update the “next” pointers (self.next := predecessor of 2 * self.ID) of all nodes that precede the joining node n and for which n should be their new “next.” While we believe we implemented this function correctly, it was also not provided in the paper, which made debugging challenging.
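For concreteness, here is one plausible reading of update_others, under our assumption that a node m should have next = n exactly when n has become the predecessor of 2m, i.e. when 2m mod Q lands in [n.id, n.successor.id). This is our interpretation rather than anything specified in the paper; it reuses the between helper and Q from above and assumes, for simplicity, that the interval does not wrap:

def update_others(n):
    # Nodes whose doubled ID lands in [n.id, n.successor.id) should now
    # point their next pointer at n. That interval has two preimages
    # under doubling mod Q, one in each "half" of the ring.
    lo, hi = n.id, n.successor.id
    for offset in (0, Q):
        start = ((lo + offset + 1) // 2) % Q   # smallest m with 2m >= lo + offset
        m = n.find_successor(start)
        first = m
        while True:
            doubled = (2 * m.id) % Q
            if not (doubled == lo or between(doubled, lo, hi)):
                break
            m.next = n
            m = m.successor
            if m is first:                     # guard against tiny rings
                break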

References

  1. Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications. Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan. 2003.
  2. Koorde: A Simple Degree-Optimal Distributed Hash Table. M. Frans Kaashoek, David R. Karger. 2003.
