Kademlia: the P2P System Behind Ethereum and BitTorrent Networks

Iev Strygul
13 min read · Jul 23, 2021


All the fuss around blockchain technology made me curious about how it looks under the hood. So I started to read about the implementation of the Ethereum network and found a system, genius in its simplicity and efficiency, that solves the communication problems of peer-to-peer (P2P) networks. It is called Kademlia, and I want to share with you how it works.

Kademlia is usually described as a distributed hash table. In my understanding, it is more a specification for a P2P network, one with significant benefits over alternative systems in terms of efficiency, stability, and security. And if you ever wondered how P2P networks like blockchains and torrents work, or wanted to build one yourself… This article, unfortunately, won't tell you, simply because the subject is large enough to fill a few dozen books. It will, however, explain the communication protocol behind these systems, without which one cannot have a complete understanding of how these P2P networks work.

It took me quite some time to digest all the information on how Kademlia works, so I hope I can offer an easier explanation of the subject than the ones you may come across on the Internet. I also made some illustrations that should help you wrap your head around it. I should make a disclaimer that I am not an expert in the field, just an enthusiastic amateur trying to understand how things work, who wants to help others do it faster and easier. So if you find any inaccuracies, please let me know, and we can improve the material together.

P2P Network And Its Problems

Briefly, for those who are new to P2P networks: in their simplest form, they are just networks of two or more computers that share resources and communicate with each other directly, rather than through a common server, which is how most communication on the Internet happens today.

Since P2P networks have no centralized server, a set of communication problems arises. For example, how do you ensure that information stored in a network whose participants constantly go online and offline remains available?

A good illustration of the problem is the BitTorrent network, which allows uploading and downloading resources to/from other users of the network. If we simplify things, technically it works as follows: someone uploads a resource that you want to download (a new movie with Justin Timberlake), so you load a .torrent file into your BitTorrent client; the client connects to the computers of other network participants (called "network peers") using the information in the .torrent file, and then downloads the pieces of the resource you need from them.

Cool, right? The peer-discovery part (when we find the computers of other peers that have the movie you want to download) is quite tricky and complex. Considering that the network could be tremendous, containing thousands or even millions of peers that hold different resources and randomly go offline and online, finding a subset of computers (which we will call "nodes" from here on) that are available on the network at a particular moment, and that, on top of that, store the resource you are looking for, is a fairly challenging task.

Another problem is that when a network is huge, it is hugely inefficient to go to each computer and ask whether it has the piece of information you are looking for.

In addition, public P2P networks, which anyone can easily join, are prone to DDoS attacks. Such an attack can knock out a group of nodes, making the information stored on those nodes unavailable.

This is why Kademlia became so popular among developers of P2P networks: it offers a highly efficient way for internode communication while ensuring resiliency against attacks, downtime, and central points of failure.

Let's have a closer look at it.

Her Majesty Kademlia

The idea was originally introduced by Petar Maymounkov and David Mazières in 2002 in their paper "Kademlia: A Peer-to-peer Information System Based on the XOR Metric" (which I strongly recommend to everyone who wants a good understanding of the subject). Since then, quite a few variations of the protocol have appeared that simplify it (as in Ethereum) or make it even more efficient.

From the beginning, Kademlia attracted quite some attention in the computer science community because it was the first distributed hash table to simultaneously offer latency-minimizing routing, a symmetric unidirectional topology, consistency, and high performance. Today, it is probably the most popular implementation of distributed hash tables.

To understand Kademlia better, I suggest looking at it from two perspectives: how it stores data (the data structure) and how it defines communication between nodes (the communication protocol).

Data structure

At the heart of Kademlia's data structure is the routing table. This is where each Kademlia node stores all the information it needs to communicate with other nodes. It is like a telephone book with the numbers of the other computers on the network.

There are three things a node needs to know about another node to be able to connect to it:

a) IP address

b) UDP port

c) node ID

This information is stored in key-value format, with the node ID as the key. The triples are kept in the routing table in special buckets called "k-buckets". The maximum number of buckets in each routing table equals the number of bits in the node ID, and this limit is the same for every Kademlia node.

To give an example, assume that you have a system with 16 nodes. Their IDs lie in the range between 0000 and 1111. The routing table of every node in the system would contain the triples of data (IP, port, ID) of the other nodes in one of at most 4 buckets (there could be fewer, but never more, because the maximum number of buckets in the routing table is always ≤ the number of bits in the node ID).
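To make this concrete, here is a minimal sketch in Python of what the triple and a freshly created routing table might look like. The names (`Contact`, `KBucket`, `new_routing_table`) and the parameter values are my own illustration, not part of the Kademlia specification:

```python
from dataclasses import dataclass, field

ID_BITS = 4  # 4-bit node IDs, as in the example above
K = 2        # system-wide replication parameter (illustrative value)

@dataclass
class Contact:
    node_id: int  # the key of the triple
    ip: str
    udp_port: int

@dataclass
class KBucket:
    low: int   # inclusive lower bound of the ID range this bucket covers
    high: int  # exclusive upper bound
    contacts: list = field(default_factory=list)  # holds at most K contacts

def new_routing_table() -> list:
    # A new table starts with a single bucket covering the whole ID space.
    return [KBucket(low=0, high=2 ** ID_BITS)]
```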

When Kademlia decides which bucket to put a node in, it evaluates the node's ID using a notion called "distance". This has nothing to do with the actual geographic distance between the nodes; it refers to the bitwise XOR operation.

In other words, to define the distance between two nodes, A and B, we retrieve the result of:

(node A id) XOR (node B id) = Kademlia distance

To give an example, if node A had id = 0101 (which is equal to decimal 5), and node B had id = 1100 (decimal 12), the distance between them in the Kademlia terms would be 1001 (decimal 9):

0101 XOR 1100 = 1001
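In code, the metric is literally a single XOR. A quick sketch reproducing the example above:

```python
def distance(a: int, b: int) -> int:
    """XOR distance between two node IDs."""
    return a ^ b

assert distance(0b0101, 0b1100) == 0b1001  # 5 XOR 12 == 9
```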

This notion of distance allocates nodes among the k-buckets: bucket i holds the nodes that are between 2ⁱ and 2ⁱ⁺¹ away from the node itself, where 0 ≤ i < (number of bits in the node ID).

Each k-bucket contains a list of up to k nodes, where k is a system-wide replication parameter. This is where the buckets take the name "k-buckets" from.
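The bucket a contact falls into follows directly from the distance: a node at distance d, with 2ⁱ ≤ d < 2ⁱ⁺¹, belongs in bucket i. In Python that index is just the position of the highest set bit; the helper name below is my own:

```python
def bucket_index(own_id: int, other_id: int) -> int:
    """Index i of the k-bucket covering 2**i <= distance < 2**(i+1)."""
    d = own_id ^ other_id
    assert d > 0, "a node does not store itself"
    return d.bit_length() - 1

assert bucket_index(0b0101, 0b1100) == 3  # distance 9 falls in [8, 16)
```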

Nodes Distribution and Routing Table

How will node IDs be distributed in the routing table of node 0101?

Initially, the routing table contains only one bucket that covers the whole ID space. So the first node added will end up in this single bucket, whatever ID it has.

When a node u learns about a new node, it adds it to one of the existing buckets. If that bucket is not full, the new node is simply added. However, if it is full, u checks whether the bucket's ID range includes u's own node ID. If it does not, the new node is dropped. If it does, and if the number of existing buckets has not yet reached the number of bits in the node ID space, the bucket is split in two.

To better understand how nodes are distributed among the different k-buckets and how bucket splitting occurs, let's have a look at an example: node 0101's routing table.

All k-buckets in Kademlia's routing table are stored in a tree data structure. In the beginning, the routing table contains only one bucket that covers the full range of node IDs, so when we add a new node, it is added to this bucket.

When the bucket is full and we try to add a new node to it, Kademlia splits the single bucket in two, based on the hosting node's ID: all the nodes whose IDs start with 0 (from 0000 to 0111) fall into the left bucket, and all the nodes whose IDs start with 1 (from 1000 to 1111) fall into the right bucket. In terms of XOR distance, the left bucket contains all the nodes that are less than 1000 away from the hosting node's ID, and the right bucket contains all the nodes that are 1000 or more away.

The same scenario then repeats for the left bucket, whose ID range includes the hosting node's own ID, 0101: when it is full, it is split, then one of its children is split, and so forth, until the routing table reaches the maximum number of allowed k-buckets.
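Here is a condensed sketch of that insertion-and-splitting rule, reusing the `KBucket` shape from the earlier snippet. It is a simplification: real implementations also PING the least-recently-seen contact of a full bucket before dropping the newcomer, which is omitted here:

```python
def add_contact(table: list, own_id: int, contact: Contact) -> None:
    # Find the bucket whose ID range covers the new contact.
    bucket = next(b for b in table if b.low <= contact.node_id < b.high)
    if any(c.node_id == contact.node_id for c in bucket.contacts):
        return  # already known
    if len(bucket.contacts) < K:
        bucket.contacts.append(contact)
        return
    # Bucket is full: split only if it covers our own ID and the
    # table has not yet reached its maximum number of buckets.
    if bucket.low <= own_id < bucket.high and len(table) < ID_BITS:
        mid = (bucket.low + bucket.high) // 2
        left = KBucket(bucket.low, mid)
        right = KBucket(mid, bucket.high)
        for c in bucket.contacts:
            (left if c.node_id < mid else right).contacts.append(c)
        idx = table.index(bucket)
        table[idx:idx + 1] = [left, right]
        add_contact(table, own_id, contact)  # retry after the split
    # Otherwise the new contact is simply dropped.
```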

Since the maximum number you can express with 4 bits is 15 (binary 1111 = decimal 15), a Kademlia system with a 4-bit ID range can contain up to 16 nodes, with IDs ranging from 0000 (decimal 0) to 1111 (decimal 15).

Therefore, a node's routing table will contain up to 15 other nodes, distributed among up to 4 k-buckets (the maximum number of k-buckets is limited by the number of bits in the ID range, which is 4 in our case). The 4 possible k-buckets contain nodes that are at least 2⁰ (=1), 2¹ (=2), 2² (=4), and 2³ (=8) away, respectively, from the node that hosts the routing table, based on the XOR distance between its ID and the ID of the node it stores.

If you look at the contents of the buckets in the picture above and calculate the XOR distance ranges for each of them, you will come to the following:

Bucket A contains node IDs that are ≥ 4 and < 8 far (in terms of XOR distance) from the hosting node 0101.

Bucket B contains node IDs that are ≥ 1 and < 2 far.

Bucket C contains node IDs that are ≥ 2 and < 4 far.

And bucket D contains node IDs that are ≥ 8 far.

You can also see that the largest bucket is bucket D, into which 1/2 of all node IDs fall. Then comes bucket A with 1/4 of all nodes and bucket C with 1/8. The smallest bucket is bucket B, which contains only the few nodes closest to the hosting node.
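You can check these proportions with the `bucket_index` helper from earlier; a quick verification over the whole 4-bit ID space:

```python
from collections import Counter

host = 0b0101
counts = Counter(bucket_index(host, other)
                 for other in range(16) if other != host)
# Counter({3: 8, 2: 4, 1: 2, 0: 1}): half of all IDs land in the
# farthest bucket, a quarter in the next one, and so on.
print(counts)
```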

This property is very important for the routing table because it allows the node to know all of its closest nodes while knowing only a subset of the nodes that are further away. Thus a request can arrive at one node and be routed onward to a node that is closer to the resource you requested, and from there to an even closer node, and so on, until it hits the node that contains the requested resource.

In the image below, you can see an example of how the node with ID 0011 finds the node with ID 1110 by successively learning of and querying closer and closer nodes. It first accesses node 101 (1), already known to it, and gets from it information about the next node, 1101; it looks that node up (2) and gets information about node 11111; it then looks up node 11111 (3) and finally learns the location of the node it was looking for: 1110.

Protocols

We have seen Kademlia's routing table, which stores node IDs based on their XOR distance. It allows a node to store information about all (or almost all) of the nodes closest to it while keeping only a subset of the nodes that are further away. This design favors routing requests from one node to another until the request finds the node that holds the requested data.

The only piece missing to understand how Kademlia works is the protocol, or in other words: how the nodes interact with each other in an unstable environment.

Nodes constantly join and leave the network. To ensure that requests always find the destination node (the one that holds the requested resource), there must be a way to keep information about all live nodes up to date. Kademlia does this with a protocol that consists of four RPCs (sketched in code after the list):

  • PING: verifies that a node is alive.
  • STORE: stores a key/value pair on a node.
  • FIND_NODE: returns the k nodes from the recipient's k-buckets that are closest to the requested node ID.
  • FIND_VALUE: returns the value for the requested key if the recipient stores it; otherwise it behaves like FIND_NODE.
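As a rough illustration, the four RPCs can be modeled as handlers on a node object. This is a bare-bones sketch building on the earlier snippets; a real node would answer these over UDP, and the `storage` dict here stands in for the local key/value store:

```python
class Node:
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.table = new_routing_table()
        self.storage = {}  # local key/value store

    def ping(self) -> bool:
        return True  # simply proves the node is alive

    def store(self, key: int, value: bytes) -> None:
        self.storage[key] = value

    def find_node(self, target_id: int) -> list:
        """Return up to K known contacts closest to target_id."""
        contacts = [c for b in self.table for c in b.contacts]
        return sorted(contacts, key=lambda c: c.node_id ^ target_id)[:K]

    def find_value(self, key: int):
        if key in self.storage:
            return self.storage[key]
        return self.find_node(key)  # otherwise act as FIND_NODE
```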

Bootstrapping

Let's see how a new node joins the network. To be able to join, a node must know the IP address and port of a node that is already participating in the network (the "bootstrap node").

As the first step, the joining node puts the metadata of the bootstrap node (the triple of IP, ID, and port, as you recall) into one of its own k-buckets. The bootstrap node, for its part, checks all the key/value pairs it stores, and if the new node's ID is among the k closest nodes to a key, it issues a STORE request to the new node and transfers the key/value pair. To avoid unnecessary RPC calls, it issues STORE requests only if the new node's ID is further from the key than its own ID (but still among the k nodes closest to the key).

Then, the joining node executes a recursive operation known as a "lookup", sending a FIND_NODE request for its own ID to the bootstrap node.

The bootstrap node returns the IDs of the nodes closest to the joining node and puts the joining node's ID into one of its own k-buckets.

The joining node receives the IDs of the nodes that lie between the bootstrap node and itself, puts their metadata in its routing table, takes the α nodes that are closest to the lookup key and that it has not queried yet (where α is a system-wide concurrency parameter), and conducts a "lookup" again against these α nodes. Nodes that fail to respond quickly are not written into the routing table until (and unless) they respond; in this way, the joining node populates its routing table only with nodes that are active on the network. The lookup terminates when the joining node has queried, and received responses from, the k closest nodes it has seen.

In this way, the joining node populates its k-buckets with nodes whose IDs lie between the bootstrap node's ID and its own.
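Here is a simplified, synchronous sketch of that lookup procedure; a real implementation fires the α requests in parallel and handles timeouts. `rpc_find_node` is a placeholder of mine for the actual network call to a remote node:

```python
ALPHA = 3  # system-wide concurrency parameter

def lookup(self_node: Node, target_id: int, rpc_find_node) -> list:
    """Iteratively query closer and closer nodes to target_id."""
    shortlist = self_node.find_node(target_id)  # seed from our own table
    queried = set()
    while True:
        candidates = [c for c in shortlist if c.node_id not in queried]
        if not candidates:
            return shortlist  # the k closest nodes seen were all queried
        for contact in candidates[:ALPHA]:
            queried.add(contact.node_id)
            for learned in rpc_find_node(contact, target_id):
                add_contact(self_node.table, self_node.node_id, learned)
                if all(c.node_id != learned.node_id for c in shortlist):
                    shortlist.append(learned)
        # Keep only the K closest nodes seen so far.
        shortlist.sort(key=lambda c: c.node_id ^ target_id)
        shortlist = shortlist[:K]
```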

Then, the joining node needs to populate the k-buckets that are further away than the bootstrap node's k-bucket. To do so, it conducts a "lookup" of a random key that falls in the range of each remaining k-bucket.

Storing A Key/Value Pair

To store a key/value pair, a node locates the k nodes closest to the key (using the same lookup described earlier) and sends each of them a STORE request.
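Publishing then composes the pieces we already have: run a lookup for the key and send a STORE to each of the k nodes it returns. `rpc_store`, like `rpc_find_node`, is a placeholder for the network call:

```python
def publish(self_node: Node, key: int, value: bytes,
            rpc_find_node, rpc_store) -> None:
    closest = lookup(self_node, key, rpc_find_node)
    for contact in closest:
        rpc_store(contact, key, value)  # STORE on each of the k closest nodes
```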

Periodically, each node republishes its key/value pairs:

a) to ensure persistence: nodes can leave the network, so there must always be nodes that keep the pair

b) to ensure that new nodes that are closer to the key also hold the pair, so that requests to them do not fail

As said earlier, when a new node joins the network, it receives all the key/value pairs whose keys are closest to it. In this way, Kademlia ensures that data is not lost when the nodes keeping a key/value pair go offline, and that new nodes always hold the key/value pairs whose keys are closest to their IDs.

Keeping the System Up to Date

The request traffic between nodes ensures that the k-buckets always contain nodes recently confirmed to be alive. However, situations can arise where IDs in a particular range are not queried for some time. To solve this problem, nodes periodically "refresh" their buckets: they pick a random ID in the bucket's range and do a node lookup for that ID.
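Refreshing a bucket is just a lookup of a random ID inside that bucket's range; a small sketch reusing the pieces above:

```python
import random

def refresh_bucket(self_node: Node, bucket: KBucket, rpc_find_node) -> None:
    """Pick a random ID in the bucket's range and look it up."""
    random_id = random.randrange(bucket.low, bucket.high)
    lookup(self_node, random_id, rpc_find_node)
```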

Since nodes constantly join and leave the network, a situation can arise where all the nodes containing a particular key/value pair leave the network. Moreover, new nodes can join that are closer to the key than the nodes on which the key/value pair was originally published. To solve this, Kademlia republishes the keys once an hour. There are some optimizations that make the process more efficient; see http://gleamly.com/article/introduction-kademlia-dht-how-it-works for details.

P.S.

There is obviously much more to learn about Kademlia than this article covers. I tried to explain the key elements of the system in a way that is more accessible than most of the material I could find online. Whether I succeeded or failed is yours to judge. If there is anything I could improve in the article, I would be glad to hear about it in the comments.

And if you got this far reading the article, thank you! Your time is the biggest compliment I could get for all the effort I put in.
