# Hashing Algorithms

Let’s say you have an important file to send and you want to ensure it gets to the addressee without any changes, in one piece. You could use some trivial methods, like sending it multiple times or contacting the addressee to verify the file together, and so on… but there’s a much better approach, and it’s called a hashing algorithm.

### Hash

The goal of a hashing algorithm is to generate a safe hash, but what is a hash? A hash is a value computed from a base input (a number, in our example) using a hashing function. In short, the hash value is a summary of the original data. Think of a paper document that you squeeze and squeeze until you can no longer read the content. It’s (in theory) almost impossible to restore the original input without already knowing what the starting data was.

Let’s take an example of a hashing algorithm:

We could discuss whether it’s a secure algorithm. We can help you: it isn’t. Of course, every input number produces its own distinct hash (we’ll talk more about this in the sections below), but it’s easy to guess how the function works and therefore to undo it. Still, it shows the idea.
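To make the idea concrete, here is a minimal sketch of a toy hash of this kind. It is an illustrative assumption, not necessarily the function from the example above: the names `toy_hash` and `invert_toy_hash` and the constant `143` are hypothetical. Every input number gets its own output, yet anyone who guesses the rule can undo it with a single division.

```python
MULTIPLIER = 143  # hypothetical constant, chosen only for illustration


def toy_hash(number: int) -> int:
    """Toy, insecure 'hash': multiply the input by a fixed constant."""
    return number * MULTIPLIER


def invert_toy_hash(hash_value: int) -> int:
    """Undo toy_hash; possible because the rule is trivial to guess."""
    return hash_value // MULTIPLIER


hash_value = toy_hash(10667)
print(hash_value)                   # 1525381
print(invert_toy_hash(hash_value))  # 10667
```

A real cryptographic hash function is built so that no such shortcut back to the input exists.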

### Hashing algorithm

A hashing algorithm, in the sense used here, is a cryptographic hash function: a mathematical algorithm that maps data of arbitrary size to a hash of a fixed size. It’s designed to be a one-way function, infeasible to invert. Over time, however, many hashing algorithms have been compromised. This happened to MD5, for example, a widely known function designed as a cryptographic hash function, for which collisions are now so easy to produce that it should only be used for verifying data against unintentional corruption.
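The "arbitrary size in, fixed size out" property can be checked with Python’s standard `hashlib` module. This is only a sketch: SHA-256 is picked as an example algorithm and `sha256_hex` is a hypothetical helper name.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


# Inputs of very different sizes all produce a 256-bit (64 hex digit) hash.
for data in (b"", b"a", b"a" * 1_000_000):
    digest = sha256_hex(data)
    print(f"{len(data):>9} bytes in -> {len(digest) * 4} bits out: {digest[:16]}...")
```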

It’s easy to figure out what the ideal cryptographic hash function should be like:

1. It should be fast to compute the hash value for any kind of data.
2. It should be infeasible to regenerate a message from its hash value (brute force being the only option).
3. It should avoid hash collisions: it should be infeasible to find two different messages with the same hash.
4. Every change to a message, even the smallest one, should change the hash value so that it looks completely different. This is called the avalanche effect.

Even the smallest change (one letter) makes the whole hash different (SHA-1 example)
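A small sketch of the avalanche effect, using SHA-1 from Python’s `hashlib`. The two sentences and the helper names `sha1_hex` and `differing_bits` are assumptions made for this illustration; the inputs differ by a single letter.

```python
import hashlib


def sha1_hex(text: str) -> str:
    """Return the SHA-1 digest of text (UTF-8 encoded) as hex."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions in which two hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


first = sha1_hex("The quick brown fox jumps over the lazy dog")
second = sha1_hex("The quick brown fox jumps over the lazy cog")

print(first)
print(second)
print(differing_bits(first, second), "of 160 bits differ")
```

With a good hash function, roughly half of the output bits are expected to flip for any change in the input.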

### What do we use it for?

Cryptographic hash functions are used throughout IT. We can use them for digital signatures, message authentication codes (MACs), and other forms of authentication. Hash functions in general are also used for indexing data in hash tables, for fingerprinting, identifying files, detecting duplicates, or as checksums (to detect whether a sent file suffered accidental or intentional data corruption). We’ll show you an example of the last use.
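Before getting to checksums, here is a brief sketch of one of the other uses listed above, a message authentication code. It uses HMAC with SHA-256 from Python’s standard `hmac` module; the key, the messages and the helper names `make_tag` and `verify_tag` are hypothetical.

```python
import hashlib
import hmac


def make_tag(key: bytes, message: bytes) -> str:
    """Compute an HMAC-SHA256 tag for message under a shared secret key."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_tag(key: bytes, message: bytes, tag: str) -> bool:
    """Check a tag using a constant-time comparison."""
    return hmac.compare_digest(make_tag(key, message), tag)


key = b"shared-secret-key"  # hypothetical key known to both parties
tag = make_tag(key, b"transfer 100 to Alice")

print(verify_tag(key, b"transfer 100 to Alice", tag))  # True
print(verify_tag(key, b"transfer 900 to Alice", tag))  # False
```

Unlike a plain checksum, the tag can only be produced by someone who knows the key.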

Video: Hash Tables, by CS50

### Example

So… how does it work? Let’s get back to our example. We’re sending a file to a friend. It’s a really important file and we want to make sure it arrives in one piece. That’s where our hashing algorithm comes in. But first, let’s think about how the file transfer would look without it:

We can come up with some trivial ideas. You could, for instance, call User2 and check the file content together. But then what’s the point of sending the file? Checksums are a godsend here.

Before sending the file, User1 uses a hashing algorithm to generate a checksum for it and sends the checksum alongside the file. User2 receives both and runs the same hashing algorithm on the received file. What’s the point? A hashing algorithm is deterministic: no matter how many times you run it on the same file, it always gives the same result. And with a secure algorithm, it is practically impossible for a different file to produce the same hash. So User2 can compare the two hashes. If they match, the received file is, for all practical purposes, identical to the one that was sent; if they differ, the file was changed on the way. (To guard against deliberate tampering, the checksum itself has to reach User2 over a channel the attacker can’t modify.)

This way User2 can verify that the file isn’t corrupted in any way. Easy? Certainly.
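The exchange between User1 and User2 can be sketched in a few lines of Python. This is a minimal illustration under some assumptions: SHA-256 is used as the hashing algorithm, a temporary file stands in for the file being sent, and `file_checksum` and `verify_file` are hypothetical helper names.

```python
import hashlib
import os
import tempfile


def file_checksum(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: str, expected_checksum: str) -> bool:
    """Recompute the checksum and compare it with the one that was sent."""
    return file_checksum(path) == expected_checksum.strip().lower()


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "report.txt")
    with open(path, "wb") as f:
        f.write(b"important content")

    sent_checksum = file_checksum(path)            # User1, before sending
    intact_ok = verify_file(path, sent_checksum)   # User2, after receiving

    with open(path, "ab") as f:
        f.write(b"!")                              # simulate corruption in transit
    corrupted_ok = verify_file(path, sent_checksum)

print(intact_ok, corrupted_ok)  # True False
```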

### Popular hashing algorithms

MD5

Before we go any further: MD5 is completely broken! If you learned a programming language some time ago, you almost certainly know this algorithm; it’s one of the best known and was used for many years. It’s still widely used, but although it was designed as a cryptographic hash function, extensive vulnerabilities have compromised it. We already know that a secure hashing algorithm must make collisions infeasible to find, yet with MD5 it’s fairly easy to produce two different documents, for example a harmless one and one with malicious code injected, that share the same hash! One of the things that killed it was its popularity. It was used so much that the best tool for cracking MD5 hashes is now… Google. Type the hash of a common input, such as a weak password, into the search box and you’ll often get the original value back within seconds!

Now let’s look at this example:

You might think you’re secure if your passwords are stored as MD5 hashes, but if somebody gets access to your database, they can simply type a hash into Google and, for a common password, get its real value! If you want to know more about hashing passwords and its security, see our previous article, How to Store Passwords Safely.
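Why a search engine can "crack" such hashes is easy to sketch: unsalted MD5 always maps the same password to the same hash, so a precomputed lookup table is enough. The word list and the names `md5_hex`, `LOOKUP_TABLE` and `crack` below are assumptions for this illustration; real tables contain billions of entries.

```python
import hashlib


def md5_hex(password: str) -> str:
    """Return the unsalted MD5 hash of a password as hex."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


# A tiny, hypothetical list of common passwords, hashed in advance.
COMMON_PASSWORDS = ["123456", "password", "qwerty", "letmein"]
LOOKUP_TABLE = {md5_hex(p): p for p in COMMON_PASSWORDS}


def crack(stored_hash: str):
    """Look a leaked hash up in the table; None if it is not in there."""
    return LOOKUP_TABLE.get(stored_hash)


leaked_hash = "5f4dcc3b5aa765d61d8327deb882cf99"  # as found in a leaked database
print(crack(leaked_hash))  # password
```

Nothing is being inverted here; the attacker simply recognizes a hash they have seen before.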

The CMU Software Engineering Institute considers MD5 essentially “cryptographically broken and unsuitable for further use”. It was accepted for many years, but it’s now mainly used for verifying data against unintentional corruption.

SHA-family

The Secure Hash Algorithm (SHA) is a family of cryptographic hash functions; SHA-0, SHA-1 and SHA-2 were designed by the United States’ NSA. SHA-0 (published in 1993) was compromised many years ago. SHA-1 (1995) produces a 160-bit (20-byte) hash value, typically rendered as a 40-digit hexadecimal number. It was compromised in 2005, when theoretical collision attacks were found (https://sites.google.com/site/itstheshappening/), but the real “death” occurred in 2010, when many organizations started to recommend its replacement. The big three (Microsoft, Google and Mozilla) have all announced that their browsers will stop accepting SHA-1 SSL certificates by 2017. This process might end sooner, however, as there have been multiple successful attacks (https://it.slashdot.org/story/15/10/09/1425207/first-successful-collision-attack-on-the-sha-1-hashing-algorithm). SHA-1 was built on principles similar to those used in the design of MD4 and MD5, though with a more conservative approach.

Safer, for now, is SHA-2. SHA-2 includes a lot of important changes. Its family has six hash functions with digests of 224, 256, 384 or 512 bits: SHA-224, SHA-256, SHA-384, SHA-512, SHA-512/224, SHA-512/256. It’s a lot more complicated and is still considered safe. Unfortunately, we can expect that it will be compromised someday, and the next option for the future should be…

SHA-3 (Secure Hash Algorithm 3), designed by Guido Bertoni, Joan Daemen, Michaël Peeters and Gilles Van Assche. Their algorithm, Keccak, won the NIST competition in 2012 and was adopted as the official SHA-3 algorithm. The standard was released by NIST on August 5, 2015. Keccak is reported to be faster than SHA-2 in some implementations (from 25% to 80%, depending on the implementation). It uses the sponge construction: the data is first “absorbed” into the “sponge” and the result is then “squeezed” out. While absorbing, message blocks are XORed into a subset of the state, which is then transformed as a whole. While squeezing, output blocks are read from the same subset of the state, alternating with state transformations. Keccak’s authors have proposed additional features, such as an authenticated encryption system and a tree hashing scheme, but they aren’t standardized yet. As of now, no practical attacks against SHA-3 are known.
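Python’s `hashlib` ships SHA-3, so the fixed-length hash and the "squeezing" idea can both be tried out. This is a usage sketch, not an implementation of the sponge itself; the message and variable names are assumptions. SHAKE128 is an extendable-output function from the same standard, which lets the caller decide how many bytes to squeeze out.

```python
import hashlib

message = b"hashing algorithms"

# Fixed-length SHA-3 hash: always 256 bits (64 hex digits).
sha3_hex = hashlib.sha3_256(message).hexdigest()
print(sha3_hex)

# SHAKE128: the caller chooses how many bytes to squeeze out of the sponge.
shake = hashlib.shake_128(message)
short_output = shake.hexdigest(16)  # 16 bytes
long_output = shake.hexdigest(64)   # 64 bytes from the same absorbed state

# Squeezing more output only extends the stream; the beginning stays the same.
print(long_output.startswith(short_output))  # True
```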