Thanks for C.C. Liang’s for providing materials and ideas.

# Overview

In current blockchain design, most of them have shard chains design which can increase transaction speed. However, higher transaction speed means larger data size, and the larger data size means fewer people can run the full nodes (because the hardware and maintenance costs are higher). For example, Ethereum 2.0 is designed with 1024 chains in the beginning (now is 64 shard chains). It’s impossible to download all shards, and it also loses the meaning of sharding. But what if a validator of shard A wants to get some data in shard B?! It’s not a good idea to download all the blocks form shard B. It takes too much time and space, and if doing so, everyone will eventually download all the shards. How to verify transactions without downloading all shards? This is our topic today, data availability.

Before we discuss data availability problem, let’s start from what is “fraud proof”.

There are two ways to verify data: validity proof and fraud proof. A validity proof is how blockchains work now — “Verify the correctness of data and then submit on chain”. For example, when you want to transfer money to someone, miners need to verify your balance first (verify data). Only if your balance is sufficient, miners pack your transaction. Fraud proof is in opposite way. Validators receive your transaction first, and waits for challenges. If no one raises a challenge for a period of time, your transaction is accepted. The cost is lower with fraud proof so most of L2 solutions choose fraud proof as data validation solution.

The light clients only download block headers, how can it be as reliable as the full nodes? Let’s check the following example first.

# Fishermen

The fishermen is a very common design. Fishermen observe the transactions and check if anyone is doing bad in the network. If so, sending a fraud proof to get a reward.

The incentive structure here is simple:

1. If someone sends an invalid fraud proof, the one loses their deposit

2. If someone sends a valid fraud proof, the one who creates the invalid block will lose deposit. Some small portion of the deposit goes to the one who sends the fraud proof.

It sounds a simple and working solution. However, fishermen is a solution here. Let’s consider two examples here,**Case 1:**

T1: a malicious node, V1, publishes a block with missing data

T2: a good node, v2, raises a fraud proof

T3: the malicious publisher, v1, publishes remaining data

**Case 2:**

T1: a good node, v1, publishes a block with all data

T2: malicious node, v2, raises an invalid fraud proof

What if a fisherman take a nap, miss what happened at T2, and then wake up at T3. It is impossible to distinguish the difference between the two situations (all data were published in both cases at T3.) In the above example, is v2 in case1 rewarded? 1. If so, attackers can make money by raising invalid fraud proof (because the two are indistinguishable.) 2. If you want to punish v2, then no one is willing to make fraud proof. 3. If doing nothing, it provides a free DOS attacks. Therefore, there is a fundamental problem in this mechanism, that is, it can not effectively prevent the attacker from hiding information (because one got caught, then publishing the remaining). There is no way to distinguish between good guys and bad guys by fishermen mechanism.

# Reed-Solomon Erasure Code

Continuing the above problem, what if we randomly choose part of data to verify? We split a block into N chunks, and a light client randomly picks M (M < N) chunks to verify (by verifying the merkle root in the block header). If any M of the N chunks are correct, light clients accept the block. With very high probability that the majority of the block is available.

Attacking aside, let’s say there are 99% data availability, which means still 1% may not be available. It causes that blocks can’t be recovered even only dozen bytes missing. So, If we just randomly sample chunks, it needs 100% data availability. Which means still need to download a whole block. However, we can alleviate this problem by erasure codes!

What is erasure codes?!

Erasure codes allow partial data missing but still recoverable. Commonly used in network communication, DVD or storage like RAID. You can simply imagine that just appending more data to original data so it could tolerate some pieces missing. There are many types of erasure codes, and they are also applicable to different fields. In Vitalik’s proposal, using RS (Reed-Solomon) erasure code which is a simple code method. Conceptually, Lagrange interpolation is used to create redundant backup data.

Let’s say a block is divided into M chunks, and then extends the M chunk to N chunks(N > M) by RS erasure code. Therefore, only any M of the N chunks can restore the data. If nodes do not provide data or lost part of the data, the blocks can still be restored. If N=2M, it means that only 50% data availability comparing the original 100%.

For example, assuming M = N, the block size is 1 MB, and every 256 bytes is split into one chunk. Now, we have 4096 (M = 4096) chunks and then combine these chunks into a merkle tree (there would be 12 layers deep). Then, randomly choosing 20 chunks (there is a relationship between a number of light clients and sample numbers) to validate, the size of light client proof is

12,800 bytes(20 branches * (256 bytes + 12 Merkle tree intermediate hashes * 32 bytes per hash) )

The size of a fraud proof is about1.5MB(4096 chunks * 12 Merkle tree intermediate hashes * 32 bytes per hash)

If using 1D erasure code, the size of a fraud proof will be larger than or equal to the size of the entire erasure code (in the above example, the data size is 1 M, but a fraud proof is about 1.5 M). If the data is extended to multiple dimensions, such as 2D, although the size of the entire erasure code becomes larger, other benefits can get. We interpret the data as a square (sqrt(M) * sqrt(M), x = 0~sqrt(M) - 1, y = 0~sqrt(M) - 1), and then extend the data to bigger square with 2D RS erasure code. We can get N = 2 * sqrt (M), as follows:

Instead of a single merkle tree, now we have 4*sqrt(M) merkle trees（one for each row and one for each column，as below diagram）

Let’s get back to the above example, the block size is 1MB, and each chunk is 256bytes. We can get a 64x64 square and have 4*64 merkle trees. In this case, we need 48 branches. The size of light client proof would be :

25,600 bytes( 48 branches * (256 bytes + 6 Merkle tree intermediate hashes * 32 bytes per hash)+ (128 Merkle roots * 32 bytes per hash) )

The size of a fraud proof :12KB

Here, we see that the entire data size become larger, but the size of fraud proof is greatly reduced. Plus the number of layers of the merkle tree is reduced, the performance is greatly improved. The following figure is the relationship among the number of samples, the number of chunks and the number of light nodes (S: number of samples, k: sqrt(M)). Check the paper for more details if you’re interested.

However, in general cases, you know who your peers are. Attackers can stop responding some requests from some specific peers to trick these nodes. We can overcome by using onion routing network to hide where the requests come from. Plus the challenge of fraud proof, only one honest full node in the network is enough throughout the design.

The protocol between the full nodes and a light node is as follows.

- The light client receives a new block header from one of the full nodes it is connected to, and a set of row and column merkle roots
- The light client randomly chooses a set of unique (x, y) and sends them to one or more of the full nodes it is connected to (x, y corresponding to different row and column)
- Once a full node gets the request, it provides corresponding data and specify if it is associated with a row or column root （there are two possible Merkle proofs for each share; one from the row roots, and one from the column roots）
- For each share that the light client has received, the light client verify the merkle root.
- Each verified share would be gossiped to all the full nodes that the light client is connected to if the full nodes don’t have the share. Then, keep gossiping until every full nodes have it.
- If no shares are missing from step2, then the block is acceptable as available if no fraud proof within a period of time.

Here is the current proposal of Ethereum 2.0 light client. #1194 is a new thread about inefficiencies of data availability proofs. The main issue here is 2D erasure code may on average double the amount of data. And due to EIP1559, on average a block will be half full, which means that a lot of padding zero are required. So, there would be a lot of meaningless data inside. This is still an unsolved problem.

*Thanks for the help and suggestions from C.C. Liang (Ethereum Foundation Researcher) to complete this article*

References：

A note on data availability and erasure coding

Fraud and Data Availability Proofs

Unsolved Problems in Blockchain Sharding

詐欺證明與有效證明 (it’s Chinese)