WTF is Data Availability?
and why solving the “data availability problem” is crucial if we want blockchains to scale
You’ve probably heard that Ethereum’s roadmap for sharding has essentially scrapped execution sharding and is now exclusively focused on data sharding to maximize Ethereum’s data space throughput.
You might also have seen the recent discussion about modular blockchains, dove into rollups and learnt about volitions/validiums, then heard about “data availability solutions.”
But maybe you found yourself a little confused at one point, scratched your head, and asked yourself WTF is data availability?
Before we dive in, it might be helpful to have a little refresher on the basics of how most blockchains work.
Transactions, Nodes, and the Infamous Blockchain Trilemma:
When you come across a new OHM fork with an APY that could make your grandparents have a heart attack, your next course of action is to smash that “stake” button, naturally. But what happens when you actually submit that transaction from Metamask?
In simple terms, your transaction goes into the mempool, and assuming your bribe to miners/validators was high enough, it gets included in the next block and added to the blockchain for posterity. The block containing your transaction is then broadcast to the blockchain’s network of nodes. Full nodes download this new block, execute/compute every transaction included within it (including yours), and make sure they are all valid. In the case of your transaction, for instance, full nodes would verify, among other things, that you are not stealing funds from anyone else and that you actually had enough ETH to pay for gas. Full nodes therefore perform the important task of enforcing the blockchain’s rules on miners/validators.
It is because of this mechanism that traditional blockchains run into scaling issues. Since full nodes check every transaction to verify they follow the rules of the blockchain, blockchains cannot process more transactions per second without increasing the hardware requirements of running a full node (better hardware = more powerful full nodes = full nodes can check more transactions = bigger blocks containing more transactions are allowed). But if the hardware requirements of running full nodes increased, there would be fewer full nodes and decentralization would suffer — it would be dangerous if there are fewer people checking the work of miners/validators to keep them honest (as trust assumptions would increase)!
This mechanism also explains why guaranteeing data availability matters in traditional monolithic blockchains: block producers (miners/validators) must broadcast and make available the transaction data from the blocks they produce so that full nodes can check their work. If block producers withhold that transaction data, full nodes can’t check their work and keep miners/validators honest by enforcing the blockchain’s ruleset!
Now that you understand why data availability is important in traditional monolithic blockchains, let’s move on to how it comes into play with everybody’s favorite scalability solution — rollups.
The Importance of Data Availability in the Context of Rollups:
Let’s first revisit how rollups help solve the scalability problem: Instead of raising the hardware requirements of running a full node, why don’t we just reduce the number of transactions that full nodes have to check are valid instead? We can do this by shifting transaction computation and execution away from full nodes to a much more powerful computer called a sequencer.
But doesn’t this mean we have to trust the sequencer? If full node hardware requirements are to be kept low, full nodes will definitely lag behind the sequencer when trying to check its work. So how can we ensure new blocks proposed by this sequencer are valid (i.e., that the sequencer is not stealing everyone’s funds)? Given that this has been harped on to death already, I’m sure you already know the answer, but just bear with me here (if you need a refresher, see Benjamin Simon’s piece for an ELI5, or Vitalik’s piece to go deeper):
For Optimistic Rollups, we rely on something called fraud proofs to keep the sequencer honest (we assume the sequencer is behaving unless someone submits a fraud proof showing that the sequencer included an invalid/malicious transaction). But for others to be able to compute fraud proofs, they need the data behind the transactions the sequencer executes. In other words, the sequencer must make transaction data available, as otherwise no one would be able to keep the optimistic rollup’s sequencer(s) honest!
With ZK Rollups, keeping the sequencer(s) honest is much simpler — the sequencer must submit a validity proof (a ZK-SNARK/STARK) when it executes a batch of transactions, and this validity proof guarantees that none of the transactions were invalid/malicious. Moreover, anyone (even a smart contract) can easily verify the proof that was submitted. But making data available is still extremely important for the ZK-Rollup’s sequencer. This is because, as users of said rollup, we need to know our account balances on the rollup if we want to ape into shitcoins. If transaction data is not made available, we can’t know our account balances and will be unable to interact with the rollup anymore.
Notice that the above lets us see exactly why people keep shilling rollups. Given full nodes don’t have to be able to keep up with the sequencer, why not just make it a very powerful computer? This would let the sequencer execute a ridiculous amount of transactions per second, making gas fees low and keeping everyone happy. But remember how the sequencer needs to make transaction data available? This means that even if the sequencer were an actual supercomputer, the number of transactions per second it can actually compute will still be limited by the data throughput of the underlying data availability solution/layer it uses.
Put simply, if the data availability solution/layer used by a rollup is unable to keep up with the amount of data the rollup’s sequencer wants to dump on it, then the sequencer (and the rollup) can’t process more transactions even if it wanted to, leading to higher gas fees like we see in Ethereum today.
This is exactly why data availability is extremely important — guaranteeing data availability allows us to ensure rollup sequencers behave, and maximizing the data space throughput of a data availability solution/layer is crucial if rollups are to maximize their transaction throughput.
But you, the observant reader, might realize that we haven’t actually fully solved our problem of ensuring sequencers behave. If full nodes of the “parent” blockchain on which the rollup settles do not need to keep up with the sequencer, the sequencer can choose to withhold a large portion of transaction data. How can nodes of the parent blockchain enforce that sequencers actually dump the data onto the data availability layer? If nodes can’t enforce this, we haven’t actually made any progress on scalability at all, because we would then be forced to trust sequencers or to all buy supercomputers ourselves!
This problem is known as “The Data Availability Problem.”
Solutions to “The Data Availability Problem”:
The obvious solution to the data availability problem would just be to force the full nodes to download all the data dumped by the sequencer onto the data availability layer/solution — but we know this gets us nowhere since it would require the full nodes to keep up with the sequencer’s rate of transaction computation, thereby raising the hardware requirements of running a full node and worsening decentralization.
It is therefore clear we need a better solution to this problem, and we do have one!
Enter Data Availability Proofs:
Every time the sequencer dumps a new block of transaction data, nodes can verify that the data was indeed made available by “sampling” it using something called a data availability proof.
How these data availability proofs actually work is very math-y and jargon-y, but I’ll try my best to explain anyway (h/t John Adler).
We can first require that the block of transaction data dumped by the sequencer be erasure-coded. This basically means the original data is expanded to double its size, with the extra half encoded as redundant pieces (this redundant encoding is what we call the erasure code). Thanks to the erasure code, we can recover the entirety of the original data from any arbitrary 50% of the erasure-coded data.
Notice that erasure coding the block of transaction data forces a misbehaving sequencer to withhold more than 50% of the block’s data. If the block had not been erasure-coded, the sequencer could have misbehaved by withholding just 1% of the data — so by erasure coding the data, we already have a big improvement in the confidence full nodes can have that the sequencer is indeed making data available.
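To make the “any 50% recovers everything” property concrete, here is a toy Reed-Solomon-style erasure code over a small prime field. This is a hypothetical sketch, not how any production DA layer implements it (real systems use 2D Reed-Solomon schemes over much larger fields), but the core property is the same: any k of the 2k encoded chunks reconstruct the original k chunks.

```python
P = 65537  # a prime modulus; all chunk values live in GF(P)

def _interpolate_at(xs, ys, x):
    """Lagrange interpolation: evaluate, at point x, the unique
    polynomial passing through the points (xs[i], ys[i]), mod P."""
    total = 0
    for xi, yi in zip(xs, ys):
        num, den = 1, 1
        for xj in xs:
            if xj != xi:
                num = num * ((x - xj) % P) % P
                den = den * ((xi - xj) % P) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P  # P prime => pow gives inverse
    return total

def encode(chunks):
    """Treat the k data chunks as values of a degree-(k-1) polynomial
    at x = 0..k-1, then evaluate at x = 0..2k-1 to double the data."""
    k = len(chunks)
    return [_interpolate_at(list(range(k)), chunks, x) for x in range(2 * k)]

def recover(samples, k):
    """Reconstruct the original k chunks from ANY k (index, value) pairs."""
    xs = [x for x, _ in samples[:k]]
    ys = [y for _, y in samples[:k]]
    return [_interpolate_at(xs, ys, x) for x in range(k)]

data = [101, 202, 303, 404]   # 4 original chunks
coded = encode(data)          # 8 erasure-coded chunks
# A node holding any 4 of the 8 chunks -- say indices 1, 3, 6, 7 --
# can recover everything, including anything the sequencer withheld:
assert recover([(i, coded[i]) for i in (1, 3, 6, 7)], k=4) == data
```

So to truly hide even one original chunk, the sequencer must withhold more than half of the coded chunks — exactly the property the sampling argument below relies on.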
Obviously, though, we want as strong a guarantee as possible that the sequencer is making all the data available. Ideally, we want to be as confident as we would be if we had downloaded the entire block of transaction data directly — and indeed, this is possible: a full node can randomly choose to download some piece of data from the block. Remember that a sequencer trying to withhold data must withhold more than 50% of the erasure-coded data, so the full node has a <50% chance of getting fooled, i.e. of randomly downloading a piece of data that happens to be there.
Notice that this means full nodes can drastically reduce the likelihood of getting fooled by simply repeating the process. By randomly choosing another chunk of data to download a second time, the chance of getting fooled drops below 25%. In fact, by the seventh random sample, the chance of failing to detect that the sequencer is withholding data falls below 1%.
This process is called sampling with data availability proofs, or just data availability sampling. It is incredibly efficient because a node can download just a small portion of the full block of data published by the sequencer on the parent blockchain and still have guarantees essentially identical to downloading and checking the entire block (the node can use Merkle roots on the parent blockchain to figure out what/where to sample). Just to make sure I am really hammering this point home: imagine if going for a 10-minute stroll around the neighborhood burned as many calories as a 6 mi/10 km run. That’s how groundbreaking data availability sampling is!
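The confidence numbers above are just repeated halving. A minimal sketch of the arithmetic (the helper names and parameters here are made up for illustration, not any real client’s API), plus a quick Monte Carlo sanity check of the same idea:

```python
import random

def chance_of_being_fooled(k, available=0.5):
    """Upper bound on the probability that k independent random samples
    all land on available chunks even though >50% are withheld."""
    return available ** k

for k in (1, 2, 7):
    print(f"{k} samples -> fooled with probability < {chance_of_being_fooled(k):.2%}")
# 1 sample -> < 50.00%, 2 samples -> < 25.00%, 7 samples -> < 0.78%

def simulate(num_chunks=256, available=127, samples=7, trials=100_000):
    """Monte Carlo check: how often do `samples` random draws (without
    replacement) all miss the withheld chunks? Here chunks 0..available-1
    are published and the rest (just over 50%) are withheld."""
    fooled = 0
    for _ in range(trials):
        picks = random.sample(range(num_chunks), samples)
        if all(p < available for p in picks):
            fooled += 1
    return fooled / trials
```

Running `simulate()` lands just under the 0.78% analytic bound, since sampling without replacement only makes detection more likely.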
By giving full nodes of the parent blockchain the ability to do data availability sampling, we have now solved our earlier dilemma of how we can ensure that rollup sequencers do not misbehave. We can now all be happy because we can be confident that rollups are indeed able to scale our favorite blockchains — but wait, before you close this tab and find better uses of your time than watching me struggle to translate nerd-speak into plain English, remember how we still need to find a way to scale data availability itself? If we want blockchains to onboard the world’s population so our bags can be pumped (pls ser, mi famiglia), we need rollups; and if we want rollups to scale blockchains, we need not only to neuter the ability of sequencers to do evil, but also to scale data space throughput so that sequencers have a cheap place to dump their transaction data.
Data Availability Proofs are also the key to scaling data space throughput:
Currently, the most notable L1 with a roadmap focused on scaling data space throughput is Ethereum. Ethereum hopes to do this through data sharding, which essentially means that not every validator will continue to download the same transaction data as nodes do currently (validators also have to run nodes). Instead, Ethereum will split its network of validators into different partitions called “shards.” Let’s say Ethereum has 1000 validators that currently all store the same transaction data: if you split them into 4 groups of 250 validators that each store different data, you have suddenly 4xed the amount of space available for rollups to dump data! Seems simple enough, right?
The problem, however, is that validators within a shard will only download and store the transaction data that is dumped to their shard. But this means validators within one shard won’t have guarantees that the entirety of the data dumped by a sequencer was indeed made available — they would only have guarantees that the data dumped to their own shard was made available, not that the data dumped to other shards was.
This means we run into a situation where validators in one shard cannot be sure that the sequencer was not misbehaving because they do not know what is happening in other shards — this is where our friend data availability sampling comes in handy again. If you are a validator in one shard, you can simply sample for data availability in every other shard using data availability proofs! This gives you essentially the same guarantees as if you were a validator for every shard, thereby allowing Ethereum to safely pursue data sharding.
There are also other blockchains hoping to scale to massive amounts of data space throughput, namely Celestia and Polygon Avail. Unlike most other blockchains, both Celestia and Polygon Avail seek to do only 2 things: order blocks/transactions, and serve as a data availability layer. This means that to keep Celestia/Polygon Avail’s validators honest, all that is needed is a decentralized network of nodes ensuring that the validators are indeed storing and ordering transaction data correctly. But since this data does not need to be interpreted (i.e. executed/computed), you don’t need to be a full node to have guarantees that the validators are behaving! Instead, a light node that performs data availability sampling has essentially the same guarantees as a full node, and having many light nodes sampling with data availability proofs would be sufficient to hold validators accountable for guaranteeing data availability. This means that so long as there are enough nodes sampling for data availability using data availability proofs (and this is easy, given that data availability proofs can be computed even by phones), you can make block sizes bigger and increase the hardware requirements of validators, thereby increasing data space throughput.
Now, to recap: the data availability problem is perhaps the crux of the blockchain trilemma, impacting all of our scaling efforts. Luckily, we are able to solve the data availability problem through the core technology of data availability proofs. This allows us to scale data space throughput massively, giving rollups a cheap place to dump enough transaction data to process enough transactions to onboard the global population. Moreover, data availability proofs mean we don’t have to trust rollup sequencers, and that we can keep them honest and verify they are behaving instead. This hopefully now helps you understand exactly why data availability is so crucial for rollups to reach their full potential.
Want to go deeper? I’d suggest the following rabbit hole:
The original paper which proposes a fraud and data availability proof system to increase light client security and to scale blockchains (by Mustafa Al-Bassam, Alberto Sonnino, and Vitalik Buterin)