Can SNARK compress blockchain data in practice?

Jake Jang
Tokamak Network
Jul 3, 2024

Can SNARK compress blockchain data?

This question was motivated by Ian Miers’ presentation ‘Scalability is Boring, Privacy is Dead: ZK-Proofs, What are They Good for?’ at Devcon6 held in Bogota, Colombia in 2022.

In short, my answer to this question is as follows:

SNARKs can compress blockchain data, but doing so inevitably degrades either blockchain security or decentralization.

This article explains the reasoning behind this answer from the perspective of information theory.

Information Theory?

Claude E. Shannon (picture from Wikipedia)

Information theory is a branch of applied mathematics, studied since the 1920s, that quantifies the amount of information contained in random variables. It begins by measuring the amount of information, called entropy, of a variable X. Claude E. Shannon, the father of information theory, defined entropy as follows:
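$$H(X) = -\sum_{x} p(x)\,\log_2 p(x)$$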

Here, the sum ranges over the set of all possible values that X can take, and p(x) is the probability of each value. The unit of entropy is the bit, the same unit we use in computer science to measure word size (other units can be obtained by changing the base of the logarithm). Entropy is designed to grow as the variable becomes more uncertain, that is, as the probability distribution over its values becomes more uniform. Conversely, if the variable is deterministic, its entropy is 0 bits.

Let X be a random variable that outputs either 0 or 1. This graph shows the entropy of X as a function of the probability that X = 1 (image from Wikipedia)
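In formula form, writing p for the probability that X = 1, the curve in the graph is

$$H(X) = -p\log_2 p - (1-p)\log_2(1-p),$$

which reaches its maximum of 1 bit at p = 1/2 and drops to 0 bits when p = 0 or p = 1.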

Information theory gives a theoretical limit on the compression rate of non-deterministic data: in order to represent data without losing any of the uncertainty implied by a non-deterministic source X, the size of the representation medium (e.g., the length of a bit string or the dimension of a vector) must be at least the entropy of X.

Examples of using entropy

For example, suppose we roll an 8-sided die and observe the value on the top face, and let X be the random variable for that value. Since each of the 8 outcomes is equally likely, the entropy of X is 3 bits. Information theory tells us that recording an observation of X requires a binary string at least 3 bits long.
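Plugging the uniform distribution over the 8 faces into the definition above:

$$H(X) = -\sum_{i=1}^{8} \tfrac{1}{8}\log_2\tfrac{1}{8} = \log_2 8 = 3 \text{ bits.}$$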

Here is another example. Suppose we record observations of a random variable Y as 3-bit binary strings, and the following rules hold for every observation: 1) at least one of the three bits must be 1; 2) if the first or the second bit is 1, the next bit must be 0. The only strings satisfying these rules are 001, 010, 100, and 101. Assuming these outcomes occur equally likely, each with probability 1/4, the entropy of Y is 2 bits. If you want to shrink a 3-bit string storing an observation of Y by any signal processing, the result must still be at least 2 bits long.
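A quick way to check both claims is to enumerate the strings allowed by the two rules and compute the entropy of a uniform distribution over them; a minimal sketch in Python:

```python
from math import log2
from itertools import product

# Keep the 3-bit strings satisfying the two rules:
# 1) at least one bit is 1; 2) if the first or second bit is 1, the next bit is 0.
valid = []
for b1, b2, b3 in product([0, 1], repeat=3):
    rule1 = b1 + b2 + b3 >= 1
    rule2 = (b1 == 0 or b2 == 0) and (b2 == 0 or b3 == 0)
    if rule1 and rule2:
        valid.append(f"{b1}{b2}{b3}")

print(valid)  # ['001', '010', '100', '101']

# Entropy of Y, assuming the four valid strings are equally likely
p = 1 / len(valid)
print(-sum(p * log2(p) for _ in valid))  # 2.0 (bits)
```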

These results for one-dimensional random variables extend to multidimensional random vectors (vectors of random variables). The limit on lossless compression of a vector (an array, in programmers’ terminology) is determined by the entropy of the vector. In other words, compressing a vector into one whose dimension is smaller than its entropy inevitably loses information, meaning the original vector can no longer be recovered. Familiar lossless compression algorithms include zip, rar, and 7z; an example of lossy compression is converting a high-quality TIFF image into a low-quality JPEG.
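As a small illustration of the lossless case, a general-purpose compressor such as zlib (the algorithm behind zip) shrinks data only to the extent that the data is predictable, and decompression restores every byte exactly; a minimal sketch using only the Python standard library:

```python
import zlib

# Highly structured (low-entropy) data compresses well and round-trips exactly.
structured = b"0010100101" * 1000             # 10,000 bytes of a repeated pattern
packed = zlib.compress(structured, level=9)
print(len(structured), "->", len(packed))     # 10000 -> a few dozen bytes
assert zlib.decompress(packed) == structured  # lossless: the original is fully recovered
```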

Are SNARKs compression algorithms?

Let’s return to the question “Can SNARK compress blockchain data in practice?” To answer it, let’s first assume that SNARK provers are compression algorithms. So, which type does the SNARK prover fall into: lossy compression or lossless compression? To check this, let us first look into the uncertainty of the “witness,” the data that the SNARK prover is trying to compress.

The witness is a private vector that a SNARK prover creates and discards during proof generation. Under the assumption that the SNARK prover is a compression algorithm, the witness is the original data before compression. From the verifier’s perspective, the witness is non-deterministic and cannot be predicted at all without hints, so it can be said to have maximal entropy. That leaves no room for lossless compression: according to the information theory discussed above, it should be impossible for a SNARK verifier to restore the witness from the much smaller proof without loss.
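Running the earlier compression experiment on data that looks uniformly random to the compressor, as a witness does to a verifier without hints, shows this wall directly; a sketch (the 1,000 random bytes here are just a stand-in for a witness, not real SNARK data):

```python
import os
import zlib

# A maximal-entropy "witness": uniformly random bytes, unpredictable without hints.
witness = os.urandom(1000)
packed = zlib.compress(witness, level=9)

# No lossless compressor can shrink uniformly random data on average;
# zlib typically returns slightly more than the original 1,000 bytes here.
print(len(witness), "->", len(packed))
```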

A goal of SNARKs

Information theory, then, is pessimistic about SNARKs’ compression capability. However, SNARKs’ goals may be different, so let’s check them carefully. Which type of compression does the SNARK prover aim for: lossless or lossy? The answer can be found in the definition of knowledge soundness, the second of SNARK’s three security properties: completeness, knowledge soundness, and zero-knowledge. The point of knowledge soundness is: “Is there a hypothetical extraction algorithm (extractor) that can restore a very-large-dimensional witness from a very-small-dimensional proof with overwhelming probability, close to 1?” In other words, the SNARK prover aims for lossless compression.

Since most SNARK protocols prove knowledge soundness in their papers, at first glance it seems as if they have designed a prover that performs lossless compression even though it deals with data that, information-theoretically, cannot be losslessly compressed. Most of you who have heard of zk-rollups will agree with that. However, there is a fatal assumption hidden here.

A fatal assumption behind SNARKs for blockchain data compression

An implicit assumption behind SNARKs is that the verifier has access to oracle polynomials. A set of these oracle polynomials is called the verification circuit, and it encodes the rules that the witness must satisfy. The verification circuit consists of vectors or matrices that themselves have nonzero entropy, so this assumption implies that the verifier already holds a very large amount of information. The SNARKs’ hypothetical extractors that restore a witness from a proof can exist only under this assumption (we are talking only about existence; in reality, the verifier does not perform the extraction).

Example of a verification circuit
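For intuition only (a hypothetical toy, not a production SNARK circuit), the rules for Y from the earlier example can be written as a small arithmetic constraint system over the bits b1, b2, b3:

$$
\begin{aligned}
b_i(1-b_i) &= 0 \quad (i=1,2,3) && \text{(each } b_i \text{ is a bit)} \\
b_1 b_2 &= 0 && \text{(if } b_1 = 1 \text{ then } b_2 = 0\text{)} \\
b_2 b_3 &= 0 && \text{(if } b_2 = 1 \text{ then } b_3 = 0\text{)} \\
(1-b_1)(1-b_2)(1-b_3) &= 0 && \text{(at least one bit is 1)}
\end{aligned}
$$

Any witness satisfying these constraints is one of 001, 010, 100, or 101; the constraints themselves are the large prior information the verifier is assumed to hold.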

SNARKs’ knowledge soundness can also be explained in information-theoretic terms. Although a witness without any hints carries the maximum amount of information, once additional information such as the verification circuit (and a proof built against it) is given, the conditional entropy of the witness drops to 0 bits. This is because the verification circuit determines all the rules for the values a witness can take. In other words, the SNARK prover is a lossless compression algorithm that exploits the mutual information between the witness and the verification circuit.
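To see how that mutual information does the work, here is a deliberately naive toy (an analogy, not an actual SNARK): the “proof” is just a 2-bit index into the list of circuit-satisfying witnesses, so a party holding the circuit, and only such a party, can losslessly recover the 3-bit witness from the 2-bit proof:

```python
# Toy analogy (not a real SNARK): lossless "compression" of a witness that works
# only because prover and extractor share the circuit's rules.

def satisfies_circuit(w: str) -> bool:
    """The 'circuit': the two rules on 3-bit strings from the earlier example."""
    b1, b2, b3 = (int(c) for c in w)
    return b1 + b2 + b3 >= 1 and (b1 == 0 or b2 == 0) and (b2 == 0 or b3 == 0)

# Both sides can derive this list, but only if they hold the circuit.
valid_witnesses = [f"{i:03b}" for i in range(8) if satisfies_circuit(f"{i:03b}")]

def prove(witness: str) -> int:
    """Prover: 'compress' the 3-bit witness into a 2-bit index."""
    return valid_witnesses.index(witness)

def extract(proof: int) -> str:
    """Extractor: recover the witness from the proof (possible only with the circuit)."""
    return valid_witnesses[proof]

proof = prove("101")            # a 2-bit value in 0..3
assert extract(proof) == "101"  # lossless, thanks to the shared circuit
```

Without the list, that is, without the circuit, a 2-bit proof cannot pin down a 3-bit witness of maximal entropy, which is exactly the problem discussed next.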

In an application where the SNARK prover compresses data for blockchain scalability, this assumption of prior information is very unrealistic. One of the goals of blockchain scalability is to enable validator nodes to verify the correctness of large-dimensional original data using only very-small-dimensional compressed data. Replacing a validator node with a SNARK verifier does not meet this goal, because the validator node would then already have to hold large-dimensional data, namely the circuit.

Practical ways to apply SNARK to blockchain scalability

One practical way to overcome this problem is to add a new assumption: that there exists a trusted third party who publishes the correct verification circuit. This assumption allows validator nodes to perform their role simply by referencing the third party’s publication, without the need to carry verification circuits themselves. However, this assumption inevitably implies a degradation of blockchain decentralization.

Another way is to use distributed key generation based on threshold cryptography, such as Powers-of-tau (Snarky ceremonies). Simply put, this is an event in which multiple participants contribute to publishing honest verification circuits over a certain period of time, for example two weeks. This method eliminates the need for a trusted third party, but it introduces a new assumption that at least one (or more than half, depending on the protocol) of the participants is honest. This degrades blockchain security.

Here, blockchain security and decentralization are elements of the Scalability trilemma defined by Vitalik Buterin in 2021.

The Scalability Trilemma (Image from Vitalik’s blog)

Conclusion

To summarize the story so far, the SNARK prover can act as a lossless compression algorithm for blockchain scalability, but this requires sacrificing either security or decentralization.

Then, is there any way to compress blockchain data without losing either information or security? The approach of Optimistic Rollup is one such attempt. Optimistic Rollup compresses Layer 2 data using a traditional lossless compression algorithm (or a variation thereof) and then uploads it to Layer 1. Assuming the compressed data is not deleted (for example, there is no blobbing), this is arguably the best way to compress without compromising security. Unfortunately, this approach does not give Layer 1 validator nodes a cost-effective way to verify the correctness of the original data without restoring it.

This conclusion is in no way meant to diminish the value of SNARK-based scalability solutions (e.g., zk-rollups). I deeply agree with Vitalik Buterin’s recent blog post about the need for diversity in Layer 2. Since Layer 2 solutions also cannot escape the shackles of the scalability trilemma, there is, in my opinion, great value in Layer 2 giving users the option of better performance at the expense of some security or decentralization. In other words, given the security guaranteed by Layer 1, Layer 2 can provide users with the opportunity to experience a wider variety of Dapps.

The Tokamak zk-EVM and zk-rollup that we are developing take a new approach that combines a SNARK with a consensus protocol. It is completely different from existing zk-rollups such as Scroll and Linea, which rely entirely on SNARKs with complex verification circuits, and it is also completely different from Optimistic Rollup, which relies only on a challenge-based consensus protocol. We hope that our attempt will provide a new basis vector in the diversity space that can be defined over the scalability trilemma.
