Designing a Durable P2P Storage System — Part 1

Akira
7 min read · Oct 5, 2023

Storage systems are critical infrastructure for virtually all forms of computing. They span a wide spectrum of deployment models and scales: OS file systems for local storage, distributed storage for cluster- and data-center-scale systems, cloud storage for multi-tenant environments, and P2P storage for global deployment without centralized control.

Regardless of the deployment scenario, a fundamental requirement of a storage system is durability. Durability means that once a user stores a piece of information (a text file, a photo, a movie, or just a blob of data) into the system, they can retrieve the same data at any point in the future. The requirement makes intuitive sense: The reason users want to use a storage system in the first place is to make sure their data is never lost.

Durability is important for web2 cloud storage (e.g., Amazon S3, Dropbox, Google Cloud Storage), as data loss can result in user dissatisfaction and loss of customers. It is even more critical for blockchain and web3 applications. Unlike traditional storage systems, the data stored and exchanged in these applications represent high-value assets. For instance, account records in Ethereum are worth more than $233 billion in total; the value of non-fungible tokens (NFTs) on blockchains such as Solana, Polygon, and Arbitrum can reach tens of billions of dollars by a conservative estimate. Imagine how you would feel if your Bored Ape, worth millions of dollars, suddenly vanished due to data loss!

Why would data loss happen in a storage system? There can be many reasons, but most of them are related to a concept called failure. Suppose you are writing a report for your physics class in Microsoft Word. You click the Save button, and the report file is now stored on the local disk of your laptop (the actual process is much more complex, but we will ignore the details for now). Unfortunately, you later drop your laptop, and the disk is completely damaged. Now you have lost your report and 10 hours of precious time! This is an example of a disk failure. Disk failure is just one type of failure that can lead to data loss. Other examples include corruption failures, machine failures, or even failure of an entire data center due to natural disasters (and yes, such events have actually happened before).

You may think failures happen rather infrequently. If you look at individual devices, yes, that might be true. Hardware manufacturers make sure the devices you get are fairly reliable. For instance, a hard disk usually has a mean time to failure (MTTF) of a few years. Things, however, get trickier at larger scales. When a storage system uses tens of thousands of disks, even though each individual disk is reliable, the MTTF of the entire system drops to months or even days, meaning you should expect some disk to fail every few days. That is just basic mathematics!
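
To see why, here is a rough back-of-the-envelope sketch. It assumes disk lifetimes are independent and exponentially distributed, so the expected time to the first failure in a fleet of N disks is the per-disk MTTF divided by N; the per-disk MTTF values and fleet sizes below are purely illustrative.

```python
# Back-of-the-envelope: expected time until *some* disk in a fleet fails,
# assuming independent, exponentially distributed disk lifetimes.
# With N disks, the expected time to the first failure is MTTF_disk / N.

HOURS_PER_YEAR = 24 * 365

def fleet_mttf_hours(disk_mttf_years: float, num_disks: int) -> float:
    return disk_mttf_years * HOURS_PER_YEAR / num_disks

for mttf_years in (3, 100):           # illustrative per-disk MTTF values
    for fleet in (1, 1_000, 50_000):  # illustrative fleet sizes
        hours = fleet_mttf_hours(mttf_years, fleet)
        print(f"MTTF {mttf_years:>3}y, {fleet:>6} disks -> "
              f"first failure expected in ~{hours:,.1f} hours")
```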

Obviously, a storage system can’t simply lose user data when some disk fails. How do storage systems deal with this issue? The most common strategy is redundancy. By replicating a piece of data on multiple devices, the likelihood that all of these devices fail at the same time becomes much smaller (again, by basic probability theory). As long as one of these devices is still functioning, the user can retrieve their data. This simple replication technique is illustrated below.

Basic replication-based redundancy scheme. By replicating the same file on three different disks, even when two of the disks have failed, the file can be retrieved from the remaining functional disk.
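
As a minimal sketch of the probability argument, suppose each replica is lost independently within some time window with probability p (the value below is a made-up placeholder). The file is lost only if every replica is lost:

```python
# Probability of losing a file under n-way replication, assuming each
# replica is lost independently within the window with probability p.

def loss_probability(p: float, replicas: int) -> float:
    return p ** replicas  # the file is gone only if *all* replicas are gone

p = 0.05  # hypothetical per-replica loss probability
for n in (1, 2, 3):
    print(f"{n} replica(s): P(data loss) = {loss_probability(p, n):.6f}")
# 1 -> 0.050000, 2 -> 0.002500, 3 -> 0.000125
```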

Replication is the simplest form of redundancy. It is not hard to see that to tolerate one device failure without losing a file, the scheme requires storing two copies of the file. That is a 2x storage overhead. Can we do better than that? It turns out we can use a technique from information theory called erasure coding to reduce the redundancy level. We will not go into the nitty-gritty of erasure coding in this blog (there are many online materials), but will use the following simple example to illustrate the concept.

We can divide a file into equal-sized chunks and number them in sequence. For each pair of chunks (e.g., chunk 1 and chunk 2, chunk 3 and chunk 4), we apply XOR to create a third chunk called the parity chunk. We then store all the odd-numbered chunks on one disk, all the even-numbered chunks on a second disk, and all the parity chunks on a third disk. Now, suppose the first disk fails after some time. How can we retrieve the original data? We read all the even-numbered and parity chunks from the remaining two disks and apply XOR again. The properties of XOR ensure that the resulting chunks are exactly the original odd-numbered chunks!

But you may ask, how does all this complexity reduce the storage overhead? Well, the first two disks combined store the original data, so we only need to count how many parity chunks we have created. Since each pair of data chunks generates one parity chunk, the scheme effectively stores 50% extra data. That is a 1.5x storage overhead, a significant improvement over simple replication (2x). Of course, there are many more advanced erasure coding schemes that can further reduce the redundancy level.
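
Here is a minimal sketch of the XOR scheme just described. The byte strings stand in for equal-sized file chunks, and the three lists play the role of the three disks:

```python
# XOR-parity sketch: odd chunks on "disk 1", even chunks on "disk 2",
# and one parity chunk per pair on "disk 3".

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

odd_chunks  = [b"chunk-1.", b"chunk-3."]                             # disk 1
even_chunks = [b"chunk-2.", b"chunk-4."]                             # disk 2
parity      = [xor(o, e) for o, e in zip(odd_chunks, even_chunks)]   # disk 3

# Suppose disk 1 fails: rebuild its chunks from disks 2 and 3.
recovered = [xor(e, p) for e, p in zip(even_chunks, parity)]
assert recovered == odd_chunks  # XOR gives back the lost odd chunks exactly
```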

A redundancy-based approach only works when the number of failures stays below the tolerance level. For instance, the erasure coding setup above can only tolerate one disk failure; if two disks fail simultaneously, the user data can no longer be recovered from the remaining disk.

How does a storage system ensure that the number of concurrent failures stays below the tolerance level (or at least does so with very high certainty)? Here, centralization helps. If the entire storage system is centrally managed by an organization, the operator can strategically place redundant data at locations with low failure correlation. For example, different copies of a file (erasure coding works similarly) can be stored on storage servers in different racks or clusters in a data center; they can even be placed in different data centers across multiple continents. Operators can further diversify the manufacturer and model of the disks storing those copies to minimize correlated failures (e.g., due to a bad batch of disks).
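
The placement idea can be sketched as a simple greedy rule. The server list and the rack and data-center labels below are made up for illustration; this is not how any particular provider organizes its fleet:

```python
# Greedy placement sketch: never put two copies in the same rack.
# Server metadata below is illustrative only.

servers = [
    {"id": "s1", "rack": "r1", "dc": "us-east"},
    {"id": "s2", "rack": "r1", "dc": "us-east"},
    {"id": "s3", "rack": "r2", "dc": "us-east"},
    {"id": "s4", "rack": "r7", "dc": "eu-west"},
]

def place_copies(servers, copies=3):
    chosen, used_racks = [], set()
    for s in servers:
        if s["rack"] in used_racks:
            continue  # skip servers that share a rack with an existing copy
        chosen.append(s["id"])
        used_racks.add(s["rack"])
        if len(chosen) == copies:
            break
    return chosen

print(place_copies(servers))  # ['s1', 's3', 's4'], three distinct racks
```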

Another important technique to ensure durability is repair. The central operator actively monitors the health of the disks and servers storing the data copies. If the number of live copies drops below a critical threshold (set no lower than the tolerance level), the operator creates new copies of the file and stores them on healthy servers. If the repair process is done in a timely manner, we have high confidence that there will always be enough copies of the file.
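
A schematic of such a repair loop might look like the following. The is_alive and copy_file functions are hypothetical placeholders for the operator's real health checks and data transfers:

```python
# Schematic repair loop: keep the number of live copies at the target level.
# is_alive() and copy_file() are placeholders, not a real monitoring stack.

import random

TARGET_COPIES = 3

def is_alive(server: str) -> bool:
    return random.random() > 0.1   # stand-in for probing the server

def copy_file(src: str, dst: str) -> None:
    print(f"repair: copying file from {src} to {dst}")

def repair_cycle(holders: list[str], spares: list[str]) -> list[str]:
    alive = [s for s in holders if is_alive(s)]
    if not alive:
        raise RuntimeError("all copies lost; repair came too late")
    while len(alive) < TARGET_COPIES and spares:
        new = spares.pop()
        copy_file(alive[0], new)   # read from any surviving copy
        alive.append(new)
    return alive

print(repair_cycle(["serverA", "serverB", "serverC"], ["serverD", "serverE"]))
```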

A production-grade storage system obviously involves many more complexities, but the description above should give a sense of the high-level techniques commonly employed to ensure data durability. They work quite well in practice: data loss at major cloud storage providers is a relatively rare event. So why am I writing this blog post?

Because our goal is to move away from centralization, and provide a storage service in a decentralized, open environment. Decentralized storage addresses many issues of a centralized deployment, e.g., data breaches, censorship, content availability, and high storage cost. Unfortunately, decentralization also makes guaranteeing data durability much harder. Why is that the case?

First, in an open, decentralized network, there is no central party to control who can participate in the system. This is commonly referred to as the permissionless model. An implication of this property is that failure correlation among participants and their storage devices is much harder to control. Simply put, it is now hard to accurately predict which sets of participating nodes are more likely to fail together, unlike in the centralized setting. Placing data copies (or erasure coding chunks) so as to minimize simultaneous failures therefore becomes more challenging.

Second, such networks often exhibit much higher churn. Churn here refers to nodes joining and leaving the system; non-recoverable failures also count, as they are equivalent to nodes leaving the system. Without central management or barriers to entry and exit, participants tend to join and leave the system more frequently. As a real example, a study of the public InterPlanetary File System deployment revealed that 87.6% of participating nodes stay in the system for less than 8 hours. That is significantly shorter than the average time for a hard disk to fail (remember, disk MTTF is measured in years). As with the first point, such high churn can lead to more frequent and unpredictable simultaneous losses of data copies or erasure coding chunks.
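
To get a feel for the gap, here is a rough, hand-wavy comparison of the chance that all three holders of a chunk disappear within a one-hour repair window. The per-hour departure rates are loose stand-ins, derived from a 3-year MTTF on one side and the 8-hour session figure on the other, and departures are assumed independent:

```python
# Rough comparison: probability that all 3 holders of a chunk disappear
# within a one-hour repair window. Rates are loose stand-ins; departures
# are assumed independent.

def all_copies_gone(p_depart_per_window: float, copies: int = 3) -> float:
    return p_depart_per_window ** copies

p_disk  = 1 / (3 * 365 * 24)  # ~3-year MTTF -> tiny chance per hour
p_churn = 1 / 8               # sessions shorter than 8 hours -> high chance

print(f"data-center disks  : {all_copies_gone(p_disk):.2e}")
print(f"high-churn P2P nodes: {all_copies_gone(p_churn):.2e}")
# roughly 5e-14 versus 2e-3: many orders of magnitude apart
```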

Third, unlike centralized systems where all entities are under the same administrative domain, participants in a decentralized network come from a diverse set of organizations (or individuals). Many of them have malicious motivations for joining the network: some have large financial gains in mind, and some are just interested in hacking the system. Regardless of the motives, these participants exhibit adversarial behavior in the system, meaning they not only refuse to follow the specified protocol but also actively attempt to break the system. For instance, they can secretly delete data copies from their local disks while claiming the copies are still there. Strong adversaries can launch attacks such as BGP poisoning or DDoS to bring down other nodes; they can even compromise and take over honest participants in the system. The implication is that they can remove data copies or erasure coding chunks in a dynamic and targeted fashion.

Lastly, we mentioned that repair is a critical technique for maintaining data durability. Repair requires global information and coordination: some entity needs to notice when the number of live data copies or erasure coding chunks drops below a threshold, and then coordinate with the remaining storage nodes to restore the intended tolerance level. This is easy to do in a centralized environment, but significantly harder in a decentralized network.

How does our Vault protocol address these durability challenges? Stay tuned for the second part of our durable P2P storage series.
