Why hashed personally identifiable information (PII) on the blockchain can be safe

Quantum computers can’t solve everything

Ben Longstaff
Meeco
4 min read · Feb 13, 2018


Different data can hash to the same output

There is a misconception in the identity community about hashing that I want to set straight. It is important to clear this up so that it does not hinder General Data Protection Regulation (GDPR) solutions.

This tweet from the Twitter discussion on “why putting hashed PII on any immutable ledger(blockchain) is a bad Idea” sums up the misconception well.

The Quantum argument

Let’s assume quantum computing is a practical thing. Let’s go one step further and assume someone found a way to implement a universal quantum computer.

Awesome.

If you have access to unlimited computation, what does that mean for encryption?

Can we now find the private key to Bitfinex’s cold wallet? Yes.

Can we find your PII that was hashed onto the Blockchain? No, not if the data is sufficiently padded.

Let’s start with finding the Bitfinex cold wallet’s private key

Your bitcoin private key is 32 bytes of data. This means that there is a constraint on what the data can be, which makes the number of potential values finite.

32 bytes is 256 bits. A bit can be either a 0 or a 1, which means that there are 2 to the power of 256 different private keys. Written down, that is

115792089237316195423570985008687907853269984665640564039457584007913129639936 (78 digits)

That’s a really big number. For context on how big, estimates of the number of atoms in the observable universe are around 1 with 80 zeros after it.

But it is a finite number of possibilities.
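As a quick sanity check on the numbers above, here is a minimal sketch that computes the size of the private key space. The variable names are my own, and the atom count is only the rough estimate quoted above.

```python
# Size of the key space for a 32-byte private key.
key_bits = 32 * 8            # 256 bits
key_space = 2 ** key_bits    # number of distinct 256-bit values

print(key_space)
print(len(str(key_space)), "digits")  # 78 digits

# Rough estimate quoted in the text: a 1 with 80 zeros after it.
atoms_estimate = 10 ** 80
print(key_space < atoms_estimate)     # True: huge, but smaller than the estimate
```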

The (uncompressed) public key has 65 bytes, which is 520 bits. This is an even bigger number:

3432398830065304857490950399540696608634717650071652704697231729592771591698828026061279820330727277488648155695740429018560993999858321906287014145557528576 (157 digits)

There are 2 to the power of 264 (520 − 256) different potential public key values for each private key.

Which looks like

29642774844752946028434172162224104410437116074403984394101141506025761187823616 (80 digits)
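The same arithmetic works for the public key. This sketch only compares the sizes of the two spaces, using the 65-byte and 32-byte lengths from the text. It says nothing about how Bitcoin actually derives a public key from a private key.

```python
private_bits = 32 * 8   # 256-bit private key
public_bits = 65 * 8    # 520-bit public key

public_space = 2 ** public_bits
# How many times larger the public key space is than the private key space.
ratio = 2 ** (public_bits - private_bits)

print(len(str(public_space)), "digits in 2^520")   # 157 digits
print(public_bits - private_bits)                  # 264
print(len(str(ratio)), "digits in the ratio")      # 80 digits
```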

But here’s the thing, you have infinite computing available.

You can process every conceivable private key until you find the public key that matches Bitfinex’s wallet.

Say hello to your newfound wealth.
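To make the exhaustive search concrete, here is a toy sketch over a 16-bit key space. Everything in it is an assumption for illustration: `toy_public_key` is a hypothetical stand-in built on SHA-256, not Bitcoin's real key derivation, and a real 256-bit space is far too large to loop over on any ordinary computer.

```python
import hashlib

KEY_BITS = 16  # toy key space: 65,536 possible private keys


def toy_public_key(private_key: int) -> str:
    """Hypothetical stand-in for key derivation (NOT Bitcoin's algorithm)."""
    return hashlib.sha256(private_key.to_bytes(2, "big")).hexdigest()


def brute_force(target_public_key: str):
    """Try every private key in the finite space until one matches."""
    for candidate in range(2 ** KEY_BITS):
        if toy_public_key(candidate) == target_public_key:
            return candidate
    return None


secret = 0xBEEF                  # the "cold wallet" private key in this toy
target = toy_public_key(secret)  # the attacker only sees this value
found = brute_force(target)
print(found == secret)           # True
```

The point is only that a finite key space can always be walked from end to end if you have enough computation.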

So why are on-chain hashes fine?

Well, let’s look at a simple example. I have some data I want to store hashed on the blockchain:

{
name: Ben Longstaff,
address: Westminster, London SW1A 0AA, UK,
}

1 character equals 1 byte, so this secret data could be stored with 70 bytes. To make it more secure, let’s pad our data out with an additional 99,930 bytes to a nice round 100 kB. So now our data looks like

{
name: Ben Longstaff
address: Westminster, London SW1A 0AA, UK
random:72151965a34b4b59151965a34
...
properties:4b4b59031290a031290a
}

So the input space could potentially be made up of 800,000 bits (100 kB), which gives 2 to the power of 800,000 possible values. That is roughly a 9 followed by another 240,823 digits. I won’t paste the value in here as it’s over 4000 lines of data; needless to say, it’s really big.
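You don't need to print the number to see how big it is. A minimal sketch that counts its decimal digits with a logarithm (recent Python versions refuse to convert an integer this large to a string by default, so the digits are counted this way instead):

```python
import math

input_bits = 100_000 * 8  # 100 kB of data is 800,000 bits

# A positive integer n has floor(log10(n)) + 1 decimal digits,
# and log10(2 ** k) is k * log10(2).
digits = math.floor(input_bits * math.log10(2)) + 1

print(input_bits)  # 800000
print(digits)      # 240824
```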

When we run our data through the hashing function, it outputs a 512-bit hash value.

dfdec888b72151965a34b4b59151965a34b4b59031290a031290adfdec888b72
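Here is a minimal sketch of the pad-then-hash step. The text only says the hash is 512 bits, so using SHA-512 is my assumption, as are the JSON encoding, the `pad_and_hash` helper name and appending the random bytes after the record rather than as extra properties inside it.

```python
import hashlib
import json
import secrets

TARGET_SIZE = 100_000  # bytes, the "nice round" size from the text


def pad_and_hash(record: dict, target_size: int = TARGET_SIZE):
    """Pad the encoded record with random bytes, then hash the result."""
    core = json.dumps(record, sort_keys=True).encode("utf-8")
    padding_len = target_size - len(core)
    if padding_len < 0:
        raise ValueError("record is larger than the target size")
    padded = core + secrets.token_bytes(padding_len)  # random padding
    return padded, hashlib.sha512(padded).hexdigest()


record = {
    "name": "Ben Longstaff",
    "address": "Westminster, London SW1A 0AA, UK",
}
padded, digest = pad_and_hash(record)
print(len(padded))       # 100000 bytes of input
print(len(digest) * 4)   # 512 bits of output
```

Only the digest would go on-chain. The padded data stays off-chain with whoever needs to prove it later.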

The output space is made up of 2 to the power of 512 possibilities which is

13407807929942597099574024998205846127479365820592393377723561443721764030073546976801874298166903427690031858186486050853753882811946569946433649006084096 (155 digits)

This is unlike the bitcoin wallet scenario, where there are many potential output values (public keys) for one input (private key). In the case of hashing our data, there are many input values for one output value.
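The "many inputs for one output" idea is easy to see at toy scale. In this sketch `tiny_hash` is a made-up 8-bit hash (the first byte of SHA-256), chosen only so that every possible 16-bit input can be enumerated.

```python
import hashlib
from collections import defaultdict


def tiny_hash(data: bytes) -> int:
    """Toy 8-bit hash: the first byte of SHA-256 (illustration only)."""
    return hashlib.sha256(data).digest()[0]


# Group every 16-bit input by the 8-bit output it hashes to.
buckets = defaultdict(list)
for value in range(2 ** 16):
    buckets[tiny_hash(value.to_bytes(2, "big"))].append(value)

total_inputs = sum(len(inputs) for inputs in buckets.values())
print(total_inputs)                  # 65536 inputs in total
print(len(buckets))                  # at most 256 distinct outputs
print(total_inputs / len(buckets))   # average number of inputs per output
```

With 16 bits in and 8 bits out, each output is shared by roughly 2 to the power of 8 inputs, which is the same subtraction of exponents used in the text.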

In this example there are 2 to the power of 799,488 (800,000 − 512) possible inputs for each output. So somewhere in that really big number is

{
name: Ben Longstaff
address: Westminster, London SW1A 0AA, UK
random:72151965a34b4b59151965a34
...
properties:4b4b59031290a031290a
}

Bingo, right? Well, no. You still have to find the needle in the haystack. With some smarts you could narrow down the possibilities using a dictionary.

Dictionary Attacks

Most of that huge number of potential inputs is going to be garbage and will be missing the property names that the smart contract needs. So you could use the property names to cut down the search space. You could also use a dictionary of names for every person on the planet.

While running your dictionary through all the possibilities you’re also going to get matches with

{
name: Katryna Dow
address: Bennelong Point, Sydney NSW 2000
random:72151965a34b4b59151965a34
...
properties:4b4b59031290a031290a
}
{
name: Derek Munneke
address: 3790 S Las Vegas Blvd, Las Vegas, NV 89109, US
random:172151965aa34b4b59034
...
properties:34b4b59151965290a031290a
}
and EVERYONE else on the planet

This is thanks to the padded garbage data.

Hashing can be used to verify integrity, not authenticity.

There is no way to evaluate which of the possibilities was the original data, which is why your hashed PII is safe on the blockchain.
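A small sketch of why the padding matters against a dictionary. It assumes SHA-512, a three-name dictionary taken from the examples above, and random padding that the attacker never sees. The helper names are hypothetical.

```python
import hashlib
import secrets

# Tiny stand-in for "a dictionary of names for every person on the planet".
NAMES = ["Ben Longstaff", "Katryna Dow", "Derek Munneke"]


def digest(name: str, padding: bytes = b"") -> str:
    """Hash a name followed by optional padding bytes."""
    return hashlib.sha512(name.encode("utf-8") + padding).hexdigest()


def dictionary_attack(target: str, dictionary):
    """Hash each guess on its own and compare it with the on-chain value."""
    for guess in dictionary:
        if digest(guess) == target:
            return guess
    return None


unpadded = digest("Ben Longstaff")
padded = digest("Ben Longstaff", secrets.token_bytes(64))

print(dictionary_attack(unpadded, NAMES))  # Ben Longstaff
print(dictionary_attack(padded, NAMES))    # None
```

Without the padding, a correct guess is confirmed straight away. With unknown padding, no entry in the dictionary can be confirmed by its hash alone.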

To summarise

There are two types of scenarios where hashing is useful:

  • When there are many potential output values for each input value (e.g. public key encryption)
  • When there are many input values for each output value (e.g. checksum)

In the face of infinite computational resources, the first would be susceptible to revealing PII and the second (if implemented correctly) will not.

In Conclusion

The hash of sufficiently padded data can be stored on the blockchain and be GDPR compliant.


Ben Longstaff
Meeco

Playing at the intersection of privacy and personalisation. Fascinated by the state of trust in a world with leaky data.