The State of Hashing Algorithms — The Why, The How, and The Future
One of the first terms newcomers hear when learning about blockchain is the hash, along with the hashing algorithm that produces it, both of which seem to be ubiquitous in security. Running a decentralized network and consensus machine such as Bitcoin or Ethereum, with tens of thousands of nodes connected peer-to-peer, requires both “trustlessness” and efficient verification. That is, these systems need ways to encode information in a compact format that allows for secure, fast verification by their participants.
The main primitive handled by both Bitcoin and Ethereum is the notion of a block, which is a data structure encompassing transactions, a timestamp, and other important metadata. A critical part of their security involves being able to condense large chunks of information about the global state of the network into a short, standardized message, known as a hash, which can be efficiently verified if need be.
Cryptographic hashes are used everywhere, from password storage to file verification systems. The basic idea is to use a deterministic algorithm that takes in an input of any size and produces a fixed-length string every time. That is, using the same input will always result in the same output.
Determinism is not the only important property of a hash: changing even a single bit of the input should produce an entirely different hash.
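As a minimal sketch of these two properties, the snippet below uses SHA256 from Python's standard hashlib module. The helper names (sha256_hex, differing_bits) and the sample messages are made up for illustration; they are not part of any protocol.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a 64-character hex string."""
    return hashlib.sha256(data).hexdigest()

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

msg_a = b"hello world"
msg_b = b"hello worle"  # 'd' (0x64) vs 'e' (0x65): exactly one bit differs

digest_a = hashlib.sha256(msg_a).digest()
digest_b = hashlib.sha256(msg_b).digest()

# Determinism: hashing the same input twice gives the same output.
print(sha256_hex(msg_a) == sha256_hex(msg_a))

# Fixed length: 32 bytes (256 bits) no matter how large the input is.
print(len(digest_a), len(hashlib.sha256(b"x" * 10_000).digest()))

# Avalanche: a one-bit change in the input flips roughly half of the output bits.
print(differing_bits(msg_a, msg_b), "input bit differs")
print(differing_bits(digest_a, digest_b), "of 256 output bits differ")
```

The exact number of flipped output bits depends on the messages chosen, but for a well-behaved hash it should land close to 128.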
A problem with hashing algorithms is the inevitability of collisions. That is, the fact that hashes are fixed-length strings means that for every input we can imagine, there are other possible inputs that will result in the same hash. Collisions are bad: if an attacker is able to create collisions on demand, the attacker can pass off malicious files or data as having a correct, proper hash and have them masquerade as legitimate. The goal of a good hash function is to make it extremely difficult for attackers to find ways of generating inputs that hash to the same value.
Computing a hash should not be too cheap, as that makes brute-force searches for collisions easier for attackers. Hashing algorithms also need to be resistant to “pre-image attacks”. That is, given a hash, it should be extremely difficult to retrace the deterministic steps taken and recover the value that created the hash (i.e. the pre-image).
Given s = hash(x), finding x should be near impossible.
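To see why pre-image resistance depends on the size of the input space, here is a small sketch that recovers x from s by sheer guessing. It only works because the secret is assumed to be a 4-digit PIN (10,000 candidates); the function name find_preimage and the PIN itself are illustrative.

```python
import hashlib
from itertools import product
from typing import Optional

def find_preimage(target_hex: str, alphabet: str, length: int) -> Optional[str]:
    """Try every string of the given length until one hashes to target_hex."""
    for candidate in product(alphabet, repeat=length):
        guess = "".join(candidate)
        if hashlib.sha256(guess.encode()).hexdigest() == target_hex:
            return guess
    return None  # no candidate in this search space matches

secret = "4821"                                   # hypothetical secret input x
s = hashlib.sha256(secret.encode()).hexdigest()   # s = hash(x)

# Nothing in s reveals x; the only generic strategy is to guess and check.
print(find_preimage(s, "0123456789", 4))          # prints 4821
```

With an input space of realistic size, for example random 32-byte values, the same loop would have to run on the order of 2^256 times, which is what "near impossible" means in practice.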
To recap, “good” hashing algorithms have the following 3 properties:
- Changing one bit in an input should create an avalanche effect and result in an entirely different hash altogether
- Collisions should have a very low probability
- Computation should have some degree of efficiency that does not sacrifice collision resistance
One of the original hashing algorithm standards was MD5, which was widely used for file integrity verification (checksums) and for storing hashed passwords in web application databases. Its functionality is quite simple: it outputs a fixed, 128-bit string for every input and uses simple one-way operations across several rounds to compute its deterministic output. Its short output length and simplicity of operations made MD5 easy to break and susceptible to what is known as a birthday attack.
What the heck is a “Birthday Attack”?
Ever heard the fact that if you put 23 people in a room, there is a 50% chance two of them will have the same birthday? Bringing the number up to 70 in a room gives you a 99.9% chance. This is known as the birthday paradox, and it is related to what we call the pigeonhole principle, which states that if 100 pigeons are placed into 99 boxes, at least 1 box must hold 2 or more pigeons. That is, a fixed output length means there is a fixed number of possible outputs, so collisions must exist, and they turn up far sooner than intuition suggests. For a hash with n output bits, a collision is expected after roughly 2^(n/2) attempts, which for MD5's 128 bits is about 2^64.
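The birthday figures can be checked with a few lines of arithmetic. This sketch assumes 365 equally likely birthdays and ignores leap years; the function name is made up for illustration.

```python
def birthday_collision_probability(people: int, days: int = 365) -> float:
    """Probability that at least two of `people` share a birthday."""
    p_all_distinct = 1.0
    for i in range(people):
        # The (i+1)-th person must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct

p23 = birthday_collision_probability(23)
p70 = birthday_collision_probability(70)
print(f"23 people: {p23:.3f}")   # just over 0.5
print(f"70 people: {p70:.4f}")   # above 0.999
```

The same reasoning applies to hashes if you replace the 365 days with the number of possible outputs, which is why the cost of finding a collision grows with the square root of the output space rather than with the output space itself.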
In fact, MD5's collision resistance is so weak that a simple, household 2.4GHz Pentium processor can compute artificial hash collisions within seconds. Moreover, its extensive usage in the earlier days of the web has left tons of leaked MD5 pre-images online, which can be found with a simple Google search of their hash.
Diversity and Evolution of Hashing Algorithms
The Beginning: SHA1 & SHA2
The NSA (yes, that NSA) has long been a pioneer of hashing algorithm standards, with its initial proposal of the Secure Hash Algorithm, or SHA1, creating 160-bit fixed-length outputs. Unfortunately, SHA1 merely improved upon MD5 by increasing the output length, the number of one-way operations, and the complexity of these one-way operations; it does not provide any fundamental improvement against more powerful machines attempting different attack vectors.
So how could we do any better?
In 2007, the National Institute of Standards and Technology (NIST) announced a competition to find an alternative to SHA2 that would be fundamentally different in its internals and could become a standard. Thus SHA3 was born, drawn from a family of hashing algorithms known as KECCAK (pronounced ketch-ak).
Despite the name, SHA3 differs greatly in its internals through a mechanism known as a sponge construction, which repeatedly applies a fixed, random-looking permutation to an internal state in order to absorb input data and then squeeze out output, so that each chunk absorbed affects everything that follows.
SHA3 maintains an internal state with more bits of information (1600) than its output, which allows it to overcome the limitations of previous algorithms. NIST made it a standard in 2015.
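The absorb and squeeze phases can be observed from the outside using the SHA3 and SHAKE functions in Python's hashlib. This is only a sketch of the behaviour, not of the internals, and the sample message is arbitrary.

```python
import hashlib

message = b"the quick brown fox"

# Absorbing: feeding the input in pieces gives the same digest as feeding it at once.
h = hashlib.sha3_256()
h.update(b"the quick ")
h.update(b"brown fox")
incremental = h.digest()
one_shot = hashlib.sha3_256(message).digest()
print(incremental == one_shot)

# SHA3-256 squeezes out 32 bytes, far fewer than its internal state holds.
print(len(one_shot))

# Squeezing: SHAKE128 (same sponge family) can emit as many bytes as requested,
# and a shorter output is simply a prefix of a longer one.
short = hashlib.shake_128(message).digest(16)
longer = hashlib.shake_128(message).digest(64)
print(longer[:16] == short)
```

Because the digest is only a slice of the internal state, seeing the output does not hand an attacker the full state of the function.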
Hashing and Proof of Work
When it came to integrating a hashing algorithm into blockchain protocols, Bitcoin, being older, opted for SHA256, while Ethereum used a modified SHA3 (KECCAK256) for its proof of work algorithm. An important quality when choosing a hash function for a blockchain using Proof of Work, however, is the efficiency of computing said hash.
Bitcoin's SHA256 implementation can be computed extra efficiently by specialized hardware known as Application Specific Integrated Circuits (or ASICs). Much has been written on the use of ASICs in mining pools and how they make a protocol tend towards centralization of computation. That is, Proof of Work incentivizes groups of computationally efficient machines to aggregate into pools and increase what we call “hash power”, a measure of the number of hashes a machine can compute per time interval.
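To make “hash power” concrete, here is a toy proof of work loop. It is a simplified sketch, not Bitcoin's actual block format or difficulty encoding: the block contents, the 8-byte nonce, the single SHA256 pass, and the names mine and is_valid are all assumptions made for illustration.

```python
import hashlib
from typing import Tuple

def is_valid(block_data: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Check that the hash, read as an integer, is below the target."""
    digest = hashlib.sha256(block_data + nonce.to_bytes(8, "little")).digest()
    target = 1 << (256 - difficulty_bits)  # more difficulty bits = smaller target
    return int.from_bytes(digest, "big") < target

def mine(block_data: bytes, difficulty_bits: int) -> Tuple[int, int]:
    """Try nonces 0, 1, 2, ... and return (winning nonce, number of hashes tried)."""
    nonce = 0
    while not is_valid(block_data, nonce, difficulty_bits):
        nonce += 1
    return nonce, nonce + 1

block = b"toy block: alice pays bob 1 coin"   # hypothetical block contents
nonce, attempts = mine(block, difficulty_bits=16)
print(f"nonce {nonce} found after {attempts} hashes")

# Verifying the work takes a single hash, however long the search took.
print(is_valid(block, nonce, 16))
```

Each extra difficulty bit doubles the expected number of hashes, so whoever can compute the most hashes per second finds valid nonces most often. That asymmetry between expensive search and cheap verification is what specialized hardware exploits.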
Ethereum opted for a modified SHA3 known as KECCAK256. Additionally, Ethereum’s proof of work algorithm, Dagger-Hashimoto, was designed to be memory-hard, which makes it harder to speed up with specialized hardware.
Why does Bitcoin use a double SHA256?
Bitcoin has an interesting way of hashing data with SHA256, as it runs two iterations of the algorithm in its protocol. Note, this is NOT a countermeasure for birthday attacks, as it is clear that if hash(x) = hash(y) then hash(hash(x)) = hash(hash(y)). Instead, a double SHA256 is used to mitigate a length-extension attack.
Essentially, this sort of attack lets a malicious actor who knows a hash value and the length of its input, but not the input itself, resume the hash function from the internal state that the hash value exposes and compute a valid hash for the original input with attacker-chosen data appended. SHA256, being in the SHA2 algorithm family, suffers from this pitfall, and Bitcoin mitigates it by computing the hash twice.
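The double hash itself is a one-liner. In this sketch the function name double_sha256 is my own, and the input is an arbitrary byte string rather than a real block header.

```python
import hashlib

def double_sha256(data: bytes) -> bytes:
    """Hash the data with SHA-256, then hash the resulting 32 raw bytes again."""
    first_pass = hashlib.sha256(data).digest()
    return hashlib.sha256(first_pass).digest()

payload = b"example payload"   # stand-in for serialized block or transaction data
single = hashlib.sha256(payload).digest()
double = double_sha256(payload)

print(single.hex())
print(double.hex())
```

Note that the second pass hashes the raw 32-byte digest, not its hex text. The published value is then the hash of a fixed-length digest rather than the final internal state reached after reading the message, so it cannot be used as a starting point for extending that message.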
Ethereum 2.0 and BLAKE
SHA3 was not the only breakthrough that came out of the NIST hashing competition. Even though KECCAK won and became SHA3, an algorithm known as BLAKE was among the five finalists. For Ethereum 2.0’s sharding implementation, more efficient hashing is almost a feature requirement and something being taken very seriously by the research teams. The BLAKE2b hashing algorithm, which is a highly upgraded version of the BLAKE from the competition, is being heavily explored for its fantastic efficiency when compared to KECCAK256 while maintaining a high degree of security.
Calculating BLAKE2b is actually 3 times faster than KECCAK on a modern CPU.
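If you want to measure this on your own machine, a rough timing sketch follows. Two caveats: hashlib ships the standardized SHA3-256, which differs in padding from the KECCAK256 variant Ethereum uses (the speed should be comparable, but that is an assumption), and the result depends heavily on the CPU and on how Python was built. The helper name best_time and the 1 MiB buffer are arbitrary choices.

```python
import hashlib
import time
from typing import Callable, Dict

def best_time(hash_once: Callable[[bytes], bytes], data: bytes, repeats: int = 5) -> float:
    """Return the fastest of several wall-clock timings (seconds) for one hash call."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        hash_once(data)
        best = min(best, time.perf_counter() - start)
    return best

# All three are configured for a 32-byte digest so the comparison is like for like.
algorithms: Dict[str, Callable[[bytes], bytes]] = {
    "blake2b": lambda d: hashlib.blake2b(d, digest_size=32).digest(),
    "sha3_256": lambda d: hashlib.sha3_256(d).digest(),
    "sha256": lambda d: hashlib.sha256(d).digest(),
}

data = b"\x00" * (1 << 20)  # 1 MiB of input
timings = {name: best_time(fn, data) for name, fn in algorithms.items()}
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:9s} {seconds * 1000:8.3f} ms per MiB")
```

Treat the numbers as a sanity check rather than a benchmark; a serious comparison would pin the CPU frequency, vary the message size, and use optimized native implementations.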
Moving Forward: The Future of Hashing Algorithms
It seems like no matter what we do, we’re just either (1) increasing the complexity of internal hash operations, or (2) increasing the length of the hash output hoping attackers’ computers will not be fast enough to compute collisions effectively.
We are relying on the ambiguity of the pre-images of one-way operations for the security of our networks. That is, the security goal of a hashing algorithm is to make it as hard as possible for anyone to find two values that hash to the same output, despite there being an infinite number of possible collisions.
What about a quantum computing future? Will hashing algorithms be safe?
The short answer and current understanding is that yes, hashing algorithms will stand the test of time against quantum computing. What quantum computation will be able to break are those problems which have rigorous, underlying mathematical structure built on neat tricks and theory, such as RSA encryption. Hashing algorithms, on the other hand, have less formal structure in their internal constructs.
Quantum computers do give a speedup on unstructured problems such as inverting a hash (Grover's algorithm offers a quadratic one), but at the end of the day, they would still be brute forcing an attack the same way a computer today would attempt to do so.
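As a back-of-the-envelope sketch, the arithmetic below assumes the attacker uses Grover's algorithm, the standard quantum search for unstructured problems, which needs on the order of the square root of the classical number of guesses. It counts only generic brute force and ignores any weakness specific to a given algorithm; the function name is made up.

```python
def preimage_work_bits(output_bits: int, quantum: bool = False) -> float:
    """Approximate log2 of the hash evaluations a generic pre-image search needs."""
    # Classical guessing: about 2^n tries. Grover's search: about 2^(n/2).
    return output_bits / 2 if quantum else float(output_bits)

for name, bits in [("SHA256", 256), ("SHA3-512", 512)]:
    classical = preimage_work_bits(bits)
    grover = preimage_work_bits(bits, quantum=True)
    print(f"{name}: 2^{classical:.0f} classical guesses, about 2^{grover:.0f} with Grover")
```

Halving the exponent is a large speedup, but 2^128 evaluations is still far out of reach, and moving to a longer output restores the original margin.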
Regardless of what algorithms we choose for our protocols, it is clear that we are headed towards a computationally-efficient future, and we must use our best judgment to pick the right tools for the job and ones that will hopefully stand the test of time.