What is hashing?
Take a look at this poem:
Recognise it? It’s Shakespeare’s Sonnet 18. There aren’t many first lines more famous than “Shall I compare thee to a summer’s day?”.
Not a fan of poetry? How about the Book of Genesis?
To be fair, it’s difficult to recognise the story from just Chapter One, but the line “In the beginning…” is very memorable.
Still not for you? Then try this one…
OK yes, this is all gibberish. But it may be recognisable gibberish to some people. And if anyone recognised the last one, I’d be worried.
You see, you might spot the last series of digits and letters in a database. If those numbers and letters can be paired with someone’s email address, that person would be in trouble. That’s because this gibberish is known as a hash. And this hash tells us their password is “123456”.
What is hashing?
In cooking, to hash something means to chop it and then mix it. It’s from this we get “hash browns” — the shredded-and-fried potato king of an English breakfast. A hash function is where a computer takes an input of any length and content (e.g. letters, numbers, and symbols) and uses a mathematical formula to chop it, mix it up, and produce an output of a specific length. This is how we can take all 14 lines in Shakespeare’s sonnet, or the 31 verses in Chapter 1 of Genesis, and produce an output of only 32 characters.
We can create hashes of nearly any digital content: a document, an image, a song, anything. It may be easier to understand when you remember that any digital content is, at its lowest level, just a string of zeros and ones. Everything can be represented in binary: “hello!” in UTF-8 binary is 01001000 01100101 01101100 01101100 01101111 00100001. It’s this binary code that a hash function is scrambling.
There are lots of different mathematical formula you can use to produce hashes. In fact, hashes don’t even need to be long, as this origin’s story shows. Each hash function chops up and mixes in different ways and so will produce different outputs. For the earlier hashes, I used a hash function called MD5 which stands for “Message Digest”. Some hash functions produce very long hashes: SHA3–512 (“Secure Hash Algorithm 3”) creates a hash 128 characters long.
The same content run through the same hash function will always produce the same output. In this way, this series of 32 characters acts a bit like the content’s digital fingerprint. Take a look at what happens when I change just one word in Shakespeare’s sonnet (“summer’s” to “winter’s” in the first line):
They’re completely different, right? You can’t see any connection, other than the series are both hexadecimal (i.e. using 16 symbols, 0–9 and a-f, a process commonly used in computing as a simple way to represent binary). Just like humans have fingerprints that are unique to each individual, even a slight change to a piece of content will result in a whole new hash to represent it.
This is extremely useful when establishing something called “message integrity”. If two things produce the same hash, we can be pretty confident that the pieces of content are identical. We can therefore use hashing to quickly check whether changes have been made. Let’s imagine a politician wants to release a political statement to the public. The politician approves the message, but wants to be sure nothing, absolutely nothing, is edited by any unscrupulous staff before it gets to the printers. If the politician hashes the content of the statement, she can give the resulting hash to the printers over the phone. If the printers hash the statement they’ve been instructed to print and get the same result as the politician, then the message can be approved and released. It hasn’t been tampered with.
It can be useful to prove message integrity in lots of situations. Take blockchain — blockchain is the software on which many new applications and cryptocurrencies are built. People can set up blockchains to work in different ways, but fundamentally a blockchain is a way to store and track information over a network. It’s great for any systems that require auditing as it’s very hard to cheat a blockchain system that’s set up properly.
When records are submitted to the blockchain, they’re collected into groups known as blocks and the information in each block is hashed to give it a unique fingerprint (we’ll call the first group, Block A). The hash for Block A is then also added to the next group of data to be submitted (Block B), for which a different hash is created. If you wanted to go back and remove something that was recorded in Block A, you’d change the hash for Block A. You’d then need to change the hash for Block B, because Block B uses Block A’s hash, and so it now has a new fingerprint. If you imagine the system had just submitted Block M and was adding a new block every few minutes, you can see that any hacker would be playing a constant game of catch-up, trying to make all the hashes match up. This is a very simplified explanations, but shows the value of hashes when you want to prove the integrity of a system.
Another useful thing about hash functions is that they are one-way: we cannot reverse a hash to find the original content. To use the cooking analogy again, if you chopped up your potatoes to make hash browns you cannot reverse the process and get your whole potato back (though hash browns are infinitely better than plain potatoes so I don’t know why you’d want to).
One useful application of this irreversibility is password storage. Imagine you create an account for a new service — we’ll call it Chop and Change Recipes. Chop and Change want you to have an account so you can save recipes you like, contribute to online discussions, and access member-only content. This account is unique to you, so they require an email address (a unique identifier, as an email address can only be used by one member). They also require a password so you can verify that you are the person in charge of the account (if you do give your password to someone else, you’re essentially also giving them the right to use that account).
If I was running Chop and Change, I don’t want to know what your password is. According to a survey by Keeper Security, over 80% of people reuse their passwords. Imagine if you used the same password for Chop and Change as you do for your Facebook account, or to log on to your computer at work — that’s a big responsibility. This is tricky — I need your password to allow you access to your account, but I don’t want to actually know what your password is.
This is where hashing comes in. Imagine you create “password” as your password (pro-tip, don’t). Chop and Change would run a hash function on “password” and spit out hash code (5f4dcc3b5aa765d61d8327deb882cf99). It would then send the hash, not the word “password” to Chop and Change’s database to store against your records. Every time you wanted to login to your Chop and Change account, the password you give will be hashed using the same function: if the output matches the hash Chop and Change has recorded, then it will let you in. Even a slight change (e.g “Password”) will result in an entirely different hash (dc647eb65e6711e155375218212b3964) and Chop and Change would reject that password as incorrect.
This way, Chop and Change never has to store what my password actually is. They just need the fingerprint the password creates. But, as we mentioned above, that doesn’t mean it’s impossible to figure out. The word “password” is still (!) one of the world’s most common passwords. If you were to steal Chop and Change’s user records, you might get a download of lots and lots of hashes instead of plaintext passwords. But if you start to notice lots of the same hashes creeping up again and again, it wouldn’t take long for a hacker to figure out which common password this hash represented.
“Pass the salt”
Of course, just changing the hash function would not be enough in password security. There are gigabytes and gigabytes of password information on the DarkWeb. In the hacker toolkit, lists of common hashed passwords are known as a “rainbow tables”. The password in the intro, “123456”, was the most common password of 2017 according to a study by internet security firm SplashData. SplashData look at big data breaches like those experienced by Equifax and Yahoo to find out that as many as 3% of users used “123456” as their password in 2017. As humans, we’re not good at creating random strings and more often rely on patterns or dictionary words to create our passwords.
To beef up security, cryptographers recommend certain actions are taken behind the scenes when hashing. And, just like chefs, they know everything is better with a dash of salt. Passwords are often “seasoned” with something to avoid common passwords being easy to identify. These easy-to-implement tricks are intended to slow down and increase costs for hackers, hopefully putting off the opportunists.
One way is to “salt” the password. In this way, a random string of letters and numbers are added to the front or end of the user’s password before it’s hashed. It might look something like this: hash(salt|password) or hash(“8£@jP”|”123456”). In this way, a once common password “123456” become much more differentiated as “8£@jP123456”. This salt is unique to each user and could even be saved against their records. If a different salt is used for each user, a hacker won’t easily be able to identify whether that user has a vulnerable common password. To check each hashed password against a list of 20 common passwords, they’d have to run a new “salted” hash function for each user. This makes the process of guessing passwords much more costly in terms of computing power.
Anything that wants salt could do with pepper. In cryptography, “pepper” is a term used for an addition to a password that, like a salt, is generated randomly when the user first creates their password. However, unlike salt, a pepper is not stored. There is only a finite number of possibilities a pepper could be (e.g. 000–300) and so when the user types in their password, the system has to generate a hash for each possible variant (000, 001, 002, 003…). It does this until it finds the one that matches the stored hash. This should take only a few milliseconds to process. But if you were running a brute-force attack, trying to guess someone’s password, this would greatly increase the time it takes you to find it out. As a comparison, if you entered your password without a pepper, a hacker would have to make one hash per password guess. With a pepper, you might have 300 hashes per password guess. If it took you 1 day to work out password, it could take you 300 days with a pepper.
The Birthday Problem
You may have noticed a flaw in the hashes I’ve been using so far.
If we can take any digital content of any length, we have an almost infinite number of possibilities (with the right machine it would be infinite as we can just keep making the content longer). With a 32-character hexadecimal hash, we have a finite number of possibilities from 00000000000000000000000000000000 to ffffffffffffffffffffffffffffffff. It’s still a very large number, 1,208,925,819,614,629,174,706,176 possible combinations to be exact (that’s a septillion if you wanted to know — a 1 with 24 zeros after it). But it’s not infinite, so we will end up with duplicates.
This is one vulnerability the system has. In cryptographic terms, two or more pieces of content producing the same hash is known as a collision. The assumption is that it’s infeasible to run a process to identify a piece of content that has the same hash as yours. You’d need to run through all possible variants of content to find one that’s the same. This is known as a “brute force” attack and would require a lot of time and computing power to guarantee success. For some, that one piece of content can produce the same hash as another completely different piece of content is considered an interesting fact, but harmless in practice. The likelihood of being able to discover a collision like this is so remote.
Or is it? There is a problem if you don’t care which two hashes collide. This leverages what’s known as the “birthday problem” in mathematics. If you take a class of 25 students, the likelihood that a student was born on a specific day of your choosing, say 16th October, is fairly small (around 7% chance, excluding leap years for simplicity). However, the probability that any two people share the same birthday is much higher, around 50%. In fact, you only need a group of 70 people to be 99.9% confident two people will share the same birthday.
We can use this “birthday problem” to cheat a system protected by hashing. In this example, researchers created two PostScript files (PostScript is a language that describes the appearance of a printed page): one, a letter of recommendation for an intern, and another, a letter authorising the intern to access confidential files. When viewed with a document viewer, they look different. However, there were hidden additions in the code, undetectable unless you looked. The researchers had played with enough variants of code in both documents to find a hash output that was the same. The intern’s boss would sign the letter of recommendation using its hash as confirmation that this was the attached document he was approving. All the intern would need to do is swap in the authorisation letter and she’d have approved access to all the files.
We can avoid this type of attack by using more collision-resistant hash functions. SHA3–512 (the hash function I mentioned earlier) produces hashes that are a lot longer than those produced by MD5, so it’s harder to find collision opportunities.
Don’t believe me? See the difference for yourself:
Now the question is, what did I say? There are online generators you can play around with: MD5 and SHA3–512. Have a go and let me know. You never know, we might find a collision.