Part 5: Hashing with SHA-256
An overview of SHA-256, a standard secure hash function, as described in official document FIPS 180–4.
This continues a series on bitwise operations and their applications, written by a non-expert, for non-experts. Follow Biffures for future updates.
Hash functions transform arbitrary large bit strings called messages, into small, fixed-length bit strings called message digests, such that digests identify the messages that produced them with a very high probability. Digests are in that sense fingerprints: a function of the message, simple, yet complex enough that they allow identification of their message, with a very low probability that different messages will share the same digests.
In SHA-256, messages up to 2⁶⁴ bit (2.3 exabytes, or 2.3 billion gigabytes) are transformed into digests of size 256 bits (32 bytes). For perspective, this means that an object 7 times the size of Facebook’s data warehouse in 2014 passed to SHA-256 would produce a chunk of data the size of a 32-letter string of ASCII characters, and that string would the object’s very special fingerprint.
A prominent use case of hashing is data integrity verification of large files, which relies on the comparison of actual and expected message digests, or checksums. Another is hashing as part of the encryption/decryption journey. Before a message can be encrypted with an algorithm like RSA, it needs to be hashed. In the rest of this article, we explore what hashing does to a message, with a view to later develop a better understanding of RSA.
Step by step hashing with SHA-256
This article now draws heavily from FIPS 180–4 all the while trying to offer some simplifications, but for the full details you may want to refer to the source material instead. With this in mind:
Pre-processing
1. Padding. If we note M the message to be hashed, and l its length in bits where l < 2⁶⁴, then as a first step we create the padded message M’, which is message M plus a right padding, such that M’ is of length l’, a multiple of 512. Specifically, we use a padding P such that M’ is:
2. Blocks. M’ is parsed into N blocks of size 512 bits, M¹ to Mᴺ, and each block is expressed as 16 input blocks of size 32 bits, M₀ to M₁₅.
3. Hash initialization. The initial hash value H⁰ of length 256 bits (8 input blocks of 32 bits) is set by taking the first 32 bits of the fractional parts of the square roots of the first eight prime numbers:
Those values can be reproduced with the below snippet in most browsers and Node ≥8.2.1 (ECMAScript 2017):
Algorithm
The hash is produced by processing each message block Mⁱ of M’ in order. For each of message block Mⁱ:
1. Message schedule. We create a message schedule Wⁱ, consisting of four 512-bit message blocks (each made of 16 input blocks). The first block of Wⁱ is message block Mⁱ, and the next three blocks are variations of Mⁱ, obtained through the formulas in the illustration below:
A detail: note that if we align all message schedules Wⁱ vertically, the first column reads from top to bottom as the complete message M’ = M¹‖..‖Mᴺ.
2. The big shuffle. The input blocks of message schedule W are fed, one after the other, to a function represented below as a graph. The graph takes as inputs a hash ωⁱ(t) and a message schedule input block Wⁱ(t), and outputs a hash ωⁱ(t+1). The initial hash ωⁱ(0) fed to the graph is the intermediate hash Hⁱ⁻¹: in the case of W¹, it’s H⁰ defined in the pre-processing step. ωⁱ(0) and Wⁱ(0) produce ωⁱ(1); in turn ωⁱ(1) and Wⁱ(1) produce ωⁱ(2), etc., until ωⁱ(63) is produced.
3. New hash. After all input blocks from Wⁱ have been used and we ω(63) has been created, we can create the new hash Hⁱ such that each input block of Hⁱ is the sum of the corresponding input block of Hⁱ⁻¹ plus the corresponding input block of ωⁱ(63):
Hⁱ(j) = Hⁱ⁻¹(j) + ωⁱ(63)(j) where + is the addition modulo 2ⁿ
If other message blocks Mⁱ remain, repeat the process (message schedule, big shuffle, creation of the new hash Hⁱ)
If Wⁱ was the last message schedule, then Hⁱ = H is message M’s final hash or digest — its so very special fingerprint.
This concludes the overview of SHA-256 as described in FIPS 180–4. A few closing thoughts:
- SHA-256 projects into B²⁵⁶, a space of ~1e77 possible values, which is lots of potential digests: a good thing that provides the intuition that collisions are unlikely
- That being said, we do not prove here that collisions are unlikely, we even know they exist given the surjective nature of the hashing function. We know however that (i) given a limited amount of things to hash, we’re unlikely to find collisions (ii) no collisions have been found to date for SHA-256.
- Finally, coding your own SHA-256 for production use is probably not a good idea…
…but just for fun and education purposes, here is a version I wrote in JavaScript. Thanks for reading, hope you enjoyed, and see you next time for a review of RSA.