Cryptographic Accumulators

There is a lesser-known technique on the crypto-developer’s tool belt. A cryptographic accumulator is a primitive with several exotic properties that can be used to build various zero-knowledge proof systems. Let’s explore the concept, the maths, and an example application.

Applications: Accumulators were originally proposed for timestamping purposes [5], i.e., to record the existence of a value at a particular point in time. Over time, other applications such as membership testing, distributed signatures, accountable certificate management [8] and authenticated dictionaries [24] have been proposed. Accumulators are also used as building block in redactable [35, 36], sanitizable [14], P-homomorphic signatures [2], anonymous credentials [40], group signatures [41], privacy-preserving data outsourcing [39] as well as for authenticated data structures [22]. Moreover, accumulator schemes that allow to prove the knowledge of a (non-membership) witness for an unrevealed value in zero-knowledge (introduced for off-line e-cash in [38]) are now widely used for revocation of group signatures and anonymous credentials [13]. Quite recently, accumulators were also used in Zerocoin [30], an anonymity extension to the Bitcoin cryptocurrency.
Revisiting Cryptographic Accumulators, Additional Properties and Relations to other Primitives.

An recent development is that proof verifications can now be implemented in Ethereum smart contracts. Indeed, the main type of accumulators is similar to RSA and is based on modular exponentiation, which is supported since the Byzantium fork (EIP-198).

We can think of a crypto accumulator as a super-charged hash function that works on sets. A regular hash function, like SHA-3, takes a single message and outputs a fixed-size hash. An accumulator, however, takes a set of values and turns them into a single number, also of constant size. In a sense, accumulators are the asymmetric cryptography cousin of Merkle trees and Bloom filters.

Consider the simplest commit-reveal protocol: Alice has a secret message and publishes its hash, called a commitment. Later, she reveals the whole message to Bob, who can verify that it does match the commitment. Now, if Alice used an accumulator instead of a hash, she may choose to reveal only one, some, or all parts of the message.

Proof of Membership

Accumulators work with sets of values. We will see several examples of encoding data as set memberships later. For now, let’s just assume that Alice turned her message into a set of words and published the associated accumulator (which looks like a hash).

She can choose some of the words and produce a proof, which is another constant-size number. It allows Bob to verify the integrity of the revealed words: that they do belong to the committed set. However, Bob does not learn anything about the other parts that were kept secret: this property is called zero-knowledge.

Proof of Nonmembership

Surprisingly, the opposite operation is also possible: Alice can prove to Bob that some words were not part of the committed set.

An example: Bob asks whether the first word was “cat”, or maybe “dog”. Alice answers with a proof that it was neither, and Bob can verify this fact. But he still has no other hint of what the word could be.

Now that is a very unique property. Compare with hash functions and Merkle trees. Either the hash of a value is not zero-knowledge: one can take any guess and check its hash. Or the value is protected by an unguessable salt, in which case one cannot prove the non-equality or nonmembership of a particular guess.

What a paradox that accumulators actually resolve. Note that this type of proofs does grow in size with the amount of data to disprove.

The maths

There are many variants of accumulators (a comparison here) and formal studies. Let’s focus on one based on RSA. I’ll assume some knowledge of RSA (Wikipedia primer) and present the basic ideas.

The first idea is to work with sets by computing the product of all values in a set. If we map the input data to prime numbers, their product will uniquely represent the set. Otherwise, there would be confusions: {2,6} would give the same 12 as {3,4} for instance. Sets will be written with curly braces like {x} and their product with capital letters like X.

Let’s pick a modulus N (the product of two large primes), a generator G (any other prime). We’ll come back to it below.

For a set of secret values {u} and their product U, the accumulator C is computed by modular exponentiation. C is a number about as large as N.

C = G^U % N

Next, let’s take a subset {r} of values from {u} to be revealed. To compute a proof, we actually need all the other values of {u}, those that remain secret, noted {s}. In product form, we have R * S = U. The proof P is:

P = G^S % N

Then we reveal {r} and P to Bob. He will compute C’ and verify that it equals the commitment C:

C’ = P^R % N

By replacing the P, we see that C’ must equal C:

C’ = G^S^R = G^(S*R) = G^U = C % N

The system relies on the same assumptions as RSA:

  • the hardness of the discrete logarithm (Bob finding U from C, or S from P)
  • the RSA problem (Alice forging a false proof P)
  • the hardness of integer factorization (finding the factors of N)

This also assumes that S is very large, otherwise Bob could brute-force it. To prevent it, Alice can add a large random prime into {u}, which we could call a salt.

Proofs of non-memberships are a bit trickier, but here is the idea. Take a set {x} that contains some elements that do not belong to the committed {u}, in addition to those that do (the set {r} above). The GCD (greatest common divisor) of X and U will be R. Alice will use the extended Euclidean algorithm to provide the coefficients that allow Bob to verify this fact. See the paper above for details.

Trusted setup

A drawback of this system is that it requires the modulus N to be the product of two primes, but Alice must not know those factors. There are two approaches there.

Bob can generate N and ask Alice to make her proofs with it, and he will trust them. But Bob can use his factors to forge accumulators and proofs, therefore others will not trust them, and Alice has deniability.

But what we really want is to use the system publicly, and for that we need a trusted setup. This means that some computer has to generate N from random factors, and forget the factors. If one believes that the factors were not saved by anybody, then he can trust the system.

There is in fact such a number: RSA-2048. Well, most likely. As far back as 1991, the RSA Laboratories published a list of numbers and challenged anyone to factor them, with a bounty of $200,000. Here is Ron Rivest (the R in RSA) talking about it.

N = RSA_2048 = 25195908475657893494027183240048398571429282126204032027777137836043662020707595556264018525880784406918290641249515082189298559149176184502808489120072844992687392807287776735971418347270261896375014971824691165077613379859095700097330459748808428401797429100642458691817195118746121515172654632282216869987549182422433637259085141865462043576798423387184774447920739934236584823824281198163815010674810451660377306056201619676256133844143603833904414952634432190114657544454178424020924616515723350778707749817125772467962926386356373289912154831438167899885040445364023527381951378636564391212010397122822120720357;

Example: classified documents

How do we actually implement something useful? We have to be a bit creative to encode data as sets. Here is one way and a sample implementation.

Let’s take the example of a governmental agency that archives classified reports on its activities. As some scandal breaks out, the public requests a particular report to verify what the agency had been up to at that time. However, the agency argues that some details are too sensitive, like the identity of its agents, and will edit them out with a metaphorical black marker.

Let’s have a process where the agency publishes the accumulators associated with classified documents as it produces them day after day, as commitments. When the time comes, the agency publishes parts of the content, and the proof that it is a faithful extract of a committed document.

Report 123                                          1 January 1970
Today, agent ##### totally did the right thing.

Let’s represent the document as a sequence of bits, using some encoding: 00001101 …

Next, we attribute a unique number to each bit. Our accumulator works with prime numbers, so we enumerate primes starting with 2, 3, 5… We sort the numbers into two lists, depending on the value of the bit:

Document: 0  0  0  0  1  1  0  1  …
Zeros: 2 3 5 7 — — 17 — …
Ones: — — — — 11 13 – 19 …

Using the accumulator, we commit to the list of numbers associated with the zeros (2, 3, 5, 7, 17, …), but not to the others (11, 13, 19 are not included).

The numbers for a full document will go well into the millions, but that is fine. The time complexity is O(n.log(n)) for n bits of message. In practice we’re looking at a few Kbits/s with a naive implementation.

Using the commitment alone, we cannot say whether any number was included or not, so we learn nothing of the content of the document.

Now, the agency wants to publish parts of this document, but not others. Let’s mark some bits to remain hidden:

Reveal: 0  0  0  #  #  1  0  1  …

Along with the partial document, it publishes a proof to establish that:

- 2, 3, 5 and 17 belong to the committed set, representing 0s.
- 13 and 19 do *not* belong to the set, representing 1s.
- But it says nothing about 7 and 11: they may or may not belong to the set.

From there, we can verify that the revealed parts of the document are authentic, but we learn nothing of the parts that remain classified.

Example: Zero-knowledge non-interactive proofs

There is a more elaborate demo application involving Sudokus, the Fiat-Shamir heuristic, and proof-of-works. Stay tuned for a future article on this.