Blockchain Ep04: Unhashing the Hash Functions — Part 2

Alabhya Mishra
Coinmonks
6 min readDec 4, 2021

--

In the second post of this series, we saw the first mention of Hashing Functions and in the previous post we took a deeper look into how the functions get some of those features. The features we have seen so far are —

  • Irreversible due to Modular Arithmetic
  • Fixed Length Output partially due to Modular Arithmetic
  • Pseudo-randomness, meaning Output doesn’t look anything like Input due to Hashing (cut, dice, mix, blend etc.)
  • Extreme input sensitivity due to Hashing

In this post we will look at how the fourth feature of Hashing Functions can ensure that two different inputs don’t produce the same output — Collision Resistance. Unlike the last post, I would recommend that you read this one. Its not as complicated, and we will be building on this one in the next post. But first, a new concept…

Pigeon Hole Principle (PHP)

If you have more pigeons than you have holes in the wall to house the pigeons, at least one of the holes will have to house more than one pigeon

Where am I supposed to go?

From Feature 2 in the last post we know that Hash Output is a fixed length number. Which means there is an inherent limit to the total number of different Hash Outputs we can generate. For example, if the Hash Output is limited to 4-digit number, then we only have 10000 possible Hash Outputs (0000–9999). However, there is no such limit on the number of possible inputs, there are more inputs than there are outputs. Thus by PHP, some inputs will have to map to the same Hash Output.

How Collision happens when mapping an infinite set of inputs to a limited set of outputs

This is called Collision, two inputs collide at the same output. And we do not want collision in our Hashing Function.

Feature 4: Collision Resistance

Probability of two different inputs producing the same output is practically 0, each input produces a unique output, like a fingerprint

We can’t really get around PHP, its a fact, an unavoidable truth about the world. But we can design things such that it can be avoided for as long as possible. Specifically there are two ways —

Method 1: Increase the number of all possible outputs

We saw that if we limit ourselves to 4-digit Hash Output, we have 10000 or 10⁴ possible outputs available. If we use a 78 digit long output, we would have 10⁷⁸ possible Hash Outputs, which is larger than the total number of atoms in the known universe. This is very doable and that is what SHA-256 does.

SHA-256 or Secure Hashing Algorithm-256 is the Hashing Function used by the Bitcoin Blockchain. BAM! We just connected our make believe ‘blockchain’ with a real world Blockchain!

Method 2: Probing

Just because there are many possible Hash Outputs, doesn’t mean that we have avoided Collision. It is still possible for 2 random inputs to generate the same Hash Output. But it is possible to get around this. Lets store all the Hash Outputs ever generated in a Hash Storage Table, HST. Then we do the following for every new input —

  • If new input generates a new & unseen Hash Output, we add it to the bottom of the HST
  • If new input generates a Hash Output which is already in the HST (ah Collision happened!), we change the input slightly to generate a new Hash Output, which (hopefully! ) avoids collision
  • We keep modifying the input slightly until we find a collision free Hash Output
Probing in action

The Modifier is a number appended to the end of the original input. We usually start with 0 and keep repeatedly increasing the modifier by 1 until we are able to find a Hash Output that avoids collision. And this modifier is stored separately for each input, because we want to keep the original input intact. This helps in reproducibility — if the same input is provided again, we know what modifier needs be applied to reproduce the output.

Storing Modifier in Blocks

As we have already seen in prior posts, each block in the chain stores 3 different Hash Outputs. The Data ID, the Previous Block ID and the Current Block ID. Which means, for every new block in the chain, we are adding 2 new Hash Outputs — Data ID and Current Block ID.

Two possible Collision points. Reminder that these Hash Outputs are written in the Hexadecimal System

We don’t really care much about the collision in Data ID. If two block’s have the exact same data, which is entirely possible and allowable, they are going to have the same Data ID. That is fine!

We can’t have collision in the Current Block ID though, because two blocks cannot have identical Current Block ID. So we need to perform Probing, and we need to store the modifier somewhere. The current block structure doesn’t have space to store it!

Author’s Note: Why can’t two different blocks have identical Current Block ID?

A modified block structure to store the probing modifier!

To start with, Modifier is always empty for a new block. If collision occurs, we put 0 in the Modifier and recalculate the Current Block ID using Data ID, Previous Block ID and Modifier as inputs to the Hash Function. By including Modifier, we have a different input which changes the Hash Output, hopefully this solves collision. If, however, the collision still persists, we will increase the Modifier value to 1. And we will keep on increasing Modifier value by 1 each time until the collision is solved. Thus Modifier increases by 1 until there is no Collision in the block’s Current Block ID. The chain is now collision resistant!

In Conclusion

Honestly, we do not need Probing for real world blockchains. Given the large number of possible Hash outputs (~10⁷⁸ possible outputs) and the smart and complex randomness built in Hash Functions, collisions never happen. I do not have space nor the mathematical ability to prove it to you, but it is true! We don’t really need Modifier to prevent collisions in the real world. So why did I even bring it up?

The blocks in the blockchain still have a space for Modifier, called Nonce which stands for Number Oncebecause each number is tried once! Every block has a Nonce, and a whole lot of time, money and energy is spent on finding the right value of Nonce. In the next article, why Nonce exists and what purpose it really serves. See you then…

Cliffhanger

What is Proof-Of-Work? Why are Bitcoin and Ethereum said to consume a lot of electricity?

Join Coinmonks Telegram Channel and Youtube Channel learn about crypto trading and investing

Also, Read

--

--

Alabhya Mishra
Coinmonks

Working in finance, data science and analytics. Interested in learning Blockchain