The underlying technology behind cryptocurrencies known as Blockchain has recently begun to get a lot more attention. There have been a lot of explainer articles written on it, in fact, I wrote one myself, Blockchain and The Impact on Our Future(I highly recommend reading that before this unless you already have a basic understanding of blockchain technology), but there are almost no free ways to go deeper and learn more about Blockchain unless you count parsing through long academic papers.
This article will be going deeper into one of the more unexplained parts of Blockchain: the hash function. I will mainly be discussing the SHA-256 hash function, as at this time it is the most widely used. If you don’t have a basic understanding of Blockchain, I recommend learning more about it and reading my article (linked here).
What is the hash function? Most articles tend to skip over this piece of information, as it is too complex for a high-level article. The high-level version that I gave was, “A hash is just the output of a function that has inputs of specific information from the transaction, like the amount, public keys, timestamp, etc.”
A hash really is just an output of a function. In the case of cryptocurrencies, the function is a cryptographic algorithm. However, this isn’t just any algorithm. This algorithm actually has a certain set of properties that are required for it to be considered secure.
There are multiple properties required for a regular cryptographic algorithm, but the ones unique to a hash function are as follows.
Fixed Output Length
“This is a test” returns a completely different value than “this is a test”
The output of a cryptographic hash function must have a fixed length. In the case of SHA-256, this is 256 bits. This might not seem useful when your input is, “Hi,” but when sending large amounts of data, this becomes very useful. Because the number of possible inputs is infinite, this is very useful for large inputs, but presents some problems for the unrefined algorithms.
Side note: H(x) means hash of x.
H(Hi) — 0x9feB72F82D05aB448936055F7FA34aCc2afB614A
H(Library of Congress: Item 1……………(terabytes of information)) — 0x1dwR24E51W05aB488346055F7FA34aDd2axB614S
With infinite inputs but a fixed output, a cryptographic hash function must have some cases in which two different inputs returned the same output. An example of this would be if I inputted GRAPEFRUIT and ORANGE separately into the function, and both returned PUNCH (I know it has to be 256 bits, bear with me). This is called a collision. It is basically impossible to have a collision-free cryptographic function, which is why we use the term resistant.
“Michael” and “Toby” both return 2 in this example of a collision.
Fortunately, there is a very low chance of a collision with cryptographic functions. If you look at the math, there is only a 50–50 chance of a collision at ²¹²⁸ inputs, or approximately 3.4 * 1⁰²⁸ inputs. Just putting that into perspective. Collisions are there, but really hard to find, so with a secure function like SHA-256, it is safe to assume that if I had two unknown inputs and both returned STRAWBERRY, they are the same input.
The hiding property means that given an unknown value r chosen from a distribution with high-min entropy and unknown value x, knowing Y, it is infeasible to find x such that Y = H(r|x). However, x should still exist, and it should be easy to calculate Y knowing x and r. This property is extremely important as it is the logic behind hashing itself.
The steps of hashing are basically just:
- Hash the input. This is just putting the input into a function, like the SHA-256 hash function (also known as Bitcoin’s hash function).
- Generate a nonce. This is a completely random string, with no logic/thought process behind it.
- Concatenate the nonce with the hash.
- Hash the new string.
We can see that Y is the output of step four in hashing, the final hash. r is the nonce, and x is the hash of the input. Applying the hiding property to hashing basically gives us: Given an unknown nonce chosen from a distribution with high min-entropy and unknown hash of original input x, knowing the final output Y, it is infeasible to find the hash of x. However, x and its hash should still exist.
This means that (prepare to be confused):
For every output Y, if k is chosen from a distribution with high min-entropy it is infeasible to find an input x such that H(k|x) = Y.
Let’s break this down. Y is the output of the function, or the hash. High min-entropy means that the distribution from which a value is chosen is huge. For example, low min-entropy would be picking an integer from 1 to 5. High min-entropy would be picking a number from one to 37 trillion. We know that H(X) means hash of x. The “|” denotes concatenation. For example, blue|bird = bluebird and k|x = kx.
So, what this really means is: For every output Y, if k is chosen from a huge distribution, it is infeasible to find an input x such that the hash of k added to x = Y. For every hash, if a random string is chosen from a large number of strings, it is infeasible to find an input x such that the hash of x concatenated with k is equal to Y.
This is extremely useful when mining as it makes a miner’s output verifiable. This is because it is really hard to find x, the input, but when it is found, it is easy to check. Without this, it would take much longer than 10 minutes for a new block to be created as every miner would have to verify the first miner’s work by guessing and checking.
Cryptographic hash functions are the logic behind blockchain’s security. Without these, Bitcoin would exist as a mere joke. You basically know everything that makes a cryptographic hash function what it is, its fundamental properties. This is the kind of knowledge that you get from reading research papers, or this article. Time to do something cool with it!