Hashing for non-technical people
A simple explanation of how hashing works and its applications
A soon-to-be lawyer asked me recently: what should I know about the latest technological advancements to be a well-rounded professional going forward?
We talked a bit about everything: AI, blockchain, and all the other buzzwords. Naturally, we also touched on hashing, a concept people usually get stuck on.
At a high level, encryption, for example, is quite intuitive. You “lock” something with a key and then unlock it by using the appropriate key.
Hashing, however, is hard to grasp even at a high level. Hashes are one-way functions that generate a deterministic output from a given input, such that the output (hash) can be used to identify and validate the input, without revealing anything about it.
What’s difficult about the concept is that the function is one-way only. You run an input through a mathematical function and get an output that cannot be reversed. But how could that be? If I know the function and I know the output, I can surely derive the input, no?
Rethinking mathematical operators
For the majority of us, when we think about mathematical operators, we think of operators for addition, subtraction, multiplication, and division. Hence, the concept of hashing seems unreasonable.
Let’s say we have a function f(x) = x + 2.
If I give you the output, say y = 4, you can easily work backwards through the function and find that x = 2.
But while that is true for the four operators mentioned above, it is not true for all operators in existence.
Introducing the modulo operator
The modulo operator, usually written with the symbol % (yup, that’s right), gives us the remainder of a division of two whole numbers.
Here are a few examples:
4 % 2 = 0 (pronounced ‘four modulo two equals zero’)
5 % 2 = 1
6 % 4 = 2
Did you get the hang of it? First, we figure out the maximum number of whole times the second number fits into the first. Then, the modulo operator gives us whatever is left over.
2, for instance, fits twice into 5, and the remainder is 1. Hence, 5 % 2 = 1.
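If you want to try this yourself, here is a minimal sketch in Python, which happens to use the same % symbol. The list name `examples` is just something I made up for illustration; the pairs are the three examples from above.

```python
# The % operator returns the remainder of a whole-number division.
examples = [(4, 2), (5, 2), (6, 4)]

remainders = []
for a, b in examples:
    times = a // b       # how many whole times b fits into a
    remainder = a % b    # whatever is left over
    remainders.append(remainder)
    print(f"{a} % {b} = {remainder}  ({b} fits {times} time(s) into {a})")
```

The loop prints one line per pair, showing both how many times the second number fits and what remains.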
I get it, now why is this important?
Remember our function f(x) from the beginning?
Now let’s say f(x) = x % 2.
I tell you y = 0. Can you find x?
While your brain might intuitively think x = 2, the truth is: x can be 2, or any one of infinitely many other numbers.
Think about it: 4 % 2 is also 0, so is 10 % 2 and so is 935628038264 % 2.
Hence, knowing only the output, you could never figure out which x was used, because infinitely many values fit. Every even number modulo 2 equals 0.
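The contrast between the two versions of f(x) can be sketched in a few lines of Python. The function names `add_two`, `undo_add_two`, and `mod_two` are my own labels for illustration, not anything standard.

```python
def add_two(x):
    """f(x) = x + 2: easy to reverse."""
    return x + 2

def undo_add_two(y):
    """Reverses add_two: given the output, recover the input."""
    return y - 2

def mod_two(x):
    """f(x) = x % 2: many inputs share the same output."""
    return x % 2

# Addition can be undone: output 4 leads back to exactly one input.
print(undo_add_two(4))  # 2

# Modulo cannot: every even number gives 0, so the output alone
# does not tell us which input was used.
candidates = [x for x in range(1, 21) if mod_two(x) == 0]
print(candidates)  # [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
print(mod_two(935628038264))  # 0
```

Even in the small range from 1 to 20, ten different inputs produce the output 0, and the list only grows as the range does.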
Linking it back to hashing
While modulo operators are far from the only thing at work in secure hash functions, they help explain quite well how we can know a function and its output but still be unable to find the input.
As such, hashes are not a form of encryption. Encrypting something implies it can also be decrypted, which is not the case for hashes. Rather, they are a commitment to some specific data.
To grasp this concept of commitment, think about a blind auction. Instead of submitting a bid, participants could submit a number that identifies their bid, which establishes their commitment to it. The number can only be generated by that specific bid. Then, when all bids are revealed, it is impossible for participants to change their own after seeing everyone else’s, since they had previously submitted a number that can only be generated by their specific bid, and no other. Think about hashes as that identifying number.
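Here is a rough sketch of the blind auction idea in Python, using the standard `hashlib` module. The functions `commit` and `verify` are hypothetical names of mine. I am also assuming one extra ingredient the story above does not mention: a random value (often called a nonce) mixed into the bid, so that nobody can simply hash every plausible amount and compare.

```python
import hashlib
import secrets

def commit(bid):
    """Return (commitment, nonce) for a bid given as text.

    The random nonce is an assumption added for this sketch: it stops
    others from guessing the bid by hashing every likely amount.
    """
    nonce = secrets.token_hex(16)
    commitment = hashlib.sha256(f"{nonce}:{bid}".encode("utf-8")).hexdigest()
    return commitment, nonce

def verify(commitment, nonce, revealed_bid):
    """Check that the revealed bid matches the earlier commitment."""
    recomputed = hashlib.sha256(f"{nonce}:{revealed_bid}".encode("utf-8")).hexdigest()
    return recomputed == commitment

# Before the reveal: the bidder submits only the commitment.
commitment, nonce = commit("1000")

# At the reveal: the bidder discloses the bid and the nonce.
print(verify(commitment, nonce, "1000"))  # True: the honest bid checks out
print(verify(commitment, nonce, "1001"))  # False: a changed bid is caught
```

The bidder keeps the bid and the nonce private until the reveal, and everyone else only ever sees the commitment.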
Hashes work well for this purpose because they are irreversible, meaning that hashing private data is not problematic, but also because there are so many possible hash outputs for a given hash function that it’s highly unlikely that you will find two sets of data that give you the same output.
In fact, that is one of the requirements for a secure hash function. It should be computationally infeasible to find two inputs that give the same hash. So when I say highly unlikely, I essentially mean impossible by our current computing standards. But we should never use that word.
To put things into perspective, one of the most common hashing functions used today, SHA256, allows for 2^256 different hashes, a number roughly 78 digits long (about 1.16 × 10^77). If you’re still doubting how secure that is, you should check out the video below, which explains how even if you left a supercomputer working for hundreds of billions of years, it would still not have a great chance of breaking SHA256:
Since two inputs cannot feasibly generate the same output, a hash can be treated as a unique fingerprint, i.e. it can be used to identify a piece of data. Hash functions are also deterministic: the same input always gives the same hash. Yet one tiny change to the data will generate a completely different hash. And that’s another rule for a secure hash function: you should never be able to figure out a pattern from the outputs. For example, if we run the words “hash” and “Hash” through a hash function, here’s what we get:
SHA256(“hash”) = d04b98f48e8f8bcc15c6ae5ac050801cd6dcfd428fb5f9e65c4e16e7807340fa
SHA256(“Hash”) = a91069147f9bd9245cdacaef8ead4c3578ed44f179d7eb6bd4690e62ba4658f2
Any tiny change in the input will completely change the output, which is a very important property for the applications you are going to learn about now.
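You can reproduce this experiment with Python’s built-in `hashlib` module. This is only a small sketch; the helper name `sha256_hex` is my own, and it assumes the words are encoded as UTF-8 text before hashing.

```python
import hashlib

def sha256_hex(text):
    """Return the SHA256 hash of a piece of text as a hexadecimal string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

lower = sha256_hex("hash")
upper = sha256_hex("Hash")

print(lower)
print(upper)

# Count the positions where the two hexadecimal strings disagree.
differing = sum(1 for a, b in zip(lower, upper) if a != b)
print(f"{differing} of {len(lower)} hex characters differ")
```

Running it twice gives the same two hashes both times, while changing a single letter of the input changes the hash beyond recognition.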
So now that we know how hashing works, let’s talk about why we need it.
A use of hashes that is commonly talked about nowadays is as the key ingredient of Proof of Work, the algorithm that secures Bitcoin and other blockchains. However, hashes can do, and have been doing, a lot of other things since their invention decades ago.
Two of the main ways hashes are used today work to protect you from attackers on the internet — even if you might not know it.
First, they are used by most websites nowadays to store your passwords. In fact, if a website does not hash its passwords, it is a very dangerous one.
Proper websites (referred to as just ‘websites’ from now on) will never store your password as is. Rather, they will hash your password and store the hash. So every time you log in to a website and input your password, it is hashed again and the two hashes are compared. If they match, you are logged in. Otherwise, you are not granted access.
By doing things this way, websites prevent even their own database administrators from finding out your password, as well as protect your privacy from attackers who get access to the database. A hacker will only get a hash of your password, which cannot be used to log in.
(This is why having a strong password is important: it makes it less likely that a hacker will find it in a rainbow table, a precomputed list of hashes for common passwords.)
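The store-and-compare routine described above might look roughly like this in Python. It is a sketch under a few assumptions of mine, not how any particular website does it: the function names `hash_password` and `check_password` are made up, and I use `hashlib.pbkdf2_hmac` with a random per-user value (a salt) and an illustrative iteration count, which goes a step beyond the plain hashing described in the text and is a common defence against rainbow tables.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative value chosen for this sketch

def hash_password(password, salt=None):
    """Return (salt, digest). Only these two values are stored, never the password."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt for each user (an added assumption)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest

def check_password(attempt, salt, stored_digest):
    """Hash the login attempt the same way and compare it with the stored hash."""
    _, attempt_digest = hash_password(attempt, salt)
    return hmac.compare_digest(attempt_digest, stored_digest)

# Sign-up: the website keeps the salt and the hash.
salt, stored = hash_password("correct horse battery staple")

# Log-in: the typed password is hashed again and the hashes are compared.
print(check_password("correct horse battery staple", salt, stored))  # True
print(check_password("correct horse battery stapl", salt, stored))   # False
```

Because every user gets a different salt, two people with the same password end up with different stored hashes, so a precomputed table of common password hashes is of little use.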
Additionally, hashes are used to prevent you from downloading malicious software. Software producers, upon releasing a new version of a program, will run the entire release through a hash function and publicly disclose the resulting hash (called a checksum). This way, when you download the software, you can run the file you got through the same function and check if you get the same hash. If you do, the software you downloaded is exactly the software that was produced. But if the hashes do not match, either your download failed (which is dangerous) or you got software that has been modified (even more dangerous), so you should not use it.
Generally, just like with passwords, this process happens without the user knowing. Apple’s App Store, for instance, will do this for you and only allow you to download software that passes the checksum check. However, if you download software directly from the web, you should ideally do this yourself, which is not as complicated as it may sound.
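To give you an idea of how simple the manual check can be, here is a sketch in Python. The function names `sha256_of_file` and `matches_published_hash` are hypothetical, as is the file name `installer.bin`, and I am assuming the publisher lists a SHA256 hash in hexadecimal. The demo creates a stand-in file rather than downloading anything.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path, chunk_size=65536):
    """Hash a file piece by piece so that large downloads fit in memory."""
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

def matches_published_hash(path, published_hash):
    """Compare the file's hash with the one the publisher disclosed."""
    return sha256_of_file(path) == published_hash.strip().lower()

# Demo with a stand-in "download" in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "installer.bin")
    with open(path, "wb") as f:
        f.write(b"pretend this is the program")
    published = sha256_of_file(path)  # what the producer would publish

    intact = matches_published_hash(path, published)

    with open(path, "ab") as f:
        f.write(b"!")  # a single extra byte, as if the file were modified
    tampered = matches_published_hash(path, published)

print(intact, tampered)  # True False
```

In real use you would skip the demo part and call `matches_published_hash` with the path of your download and the hash copied from the publisher’s website.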
Hashes are great for these two applications because any minor change will result in a non-matching hash, offering a great degree of protection. For example, a password with one wrong character will not log you in, and a piece of software with one wrong character anywhere in its massive codebase will not have a matching checksum.
As technology moves forward, multiple industries are being disrupted by the latest advancements.
And now, more than ever, new applications are coming about for hashes. From the aforementioned algorithm which secures Bitcoin to implications for Intellectual Property, hashes can have a multitude of applications that are still not fully explored.
Essentially, hashes are great proofs for data — they provide a way to verify a piece of data without having to disclose it, providing security and privacy to users. They also help to authenticate the data when it is disclosed.
Let’s say hashes are applied to legal contracts (this is already happening). If two parties originally attest to signing Document A with hash X, any change to Document A will produce a hash that is not X, hence proving that the contract has been doctored.
Additionally, imagine that a game requires participants to guess a secret, and the correct guesser wins a prize. If the secret was simply stored somewhere, it could easily be tampered with to prevent someone or everyone from winning. However, if a hash of the secret is publicly disclosed before the game starts, it provides a commitment to the secret without revealing any hint about it. This way, the secret cannot be changed during the game, and the correct guesser will definitely win the prize.
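The guessing game can be sketched in a few lines of Python as well. Everything here is made up for illustration: the secret phrase, the helper `sha256_hex`, and the function `is_correct`. The only real ingredient is SHA256 from the standard `hashlib` module.

```python
import hashlib

def sha256_hex(text):
    """Return the SHA256 hash of a piece of text as a hexadecimal string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Before the game: the organizer publishes only the hash of the secret.
secret = "purple giraffe on a unicycle"  # hypothetical secret, kept private
published_hash = sha256_hex(secret)

def is_correct(guess):
    """Anyone can check a guess against the published hash."""
    return sha256_hex(guess) == published_hash

print(is_correct("green giraffe on a unicycle"))   # False
print(is_correct("purple giraffe on a unicycle"))  # True
```

Since the hash was published up front, the organizer cannot quietly swap the secret later: any other secret would produce a different hash.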
Now imagine that instead of a game, we were talking about medical records or legal documents. A lot of fraud could be prevented by providing public commitments to private data, ensuring the validity of the data when it comes to light in the future.
But at this point, we’ve covered a lot of ground, so I’ll stop here and let your imagination flow about the endless possibilities of hashing — which, by now, I hope you know something about.