Python Implementation of SHA-256 from Scratch

Paulo Doms
14 min readJul 20, 2021

--

Photo by Garett Mizunaka on Unsplash

Have you ever driven a car and wondered how it actually works? Under the hood, that is? Well, you don’t have to in order to drive a car and neither do you need to be able to re-create the code that computes the Secure Hash Algorithm II, or, in our case SHA-256, but it is definitely fun. I want to share with you the code in an attempt to motivate other beginners and more advanced developers to look behind the libraries that are so often used in any given programming language. Even though, needless to say, that these libraries are intended for production, while this implementation is not, so bear that in mind.

A beginner’s knowledge of Python should suffice. Math is not really required, but having knowledge of the different Basic Representations of data, i.e., converting characters to integers, Base 2 and Base 10 and Base16 conversions, is definitely an advantage, even though we will look into those as well.

The code is designed in such a way that it is easy to read and follow (e.g., little nesting or hardly any list comprehensions are utilized), and the lines in the code blocks are annotated, which I always find very helpful. Anyways, this should also allow you to skip all the explanations in-between and just take the code into account, if that’s what you prefer.

Full code/git repo can be found here.

I won’t get into detail about the nature of hashing or the background of the SHA-256 algorithm, as this is all readily available online. For example here at Qvault or the official specification from the NIST.

Edit 13.02.2022: Urfasurfas pointed out that the preprocessMessage() function does not handle message lengths between 449 and 512bits, which has now been corrected.

1) Data Types and Data Representations

Secure Hash Algorithms work on the bit level. That’s why understanding the relevant datatypes is almost mandatory, as we want to take a message and hash it, and, finally, at least for this implementation, will present the hash in a hexadecimal notation.

a) From Character to Bits

Taking a string as an input, each character needs to be first translated to its corresponding numeric representation. Either ASCII or Unicode will do. Luckily, Python has a built-in function ord() that does this for us. For Example, ord(‘H’) corresponds to the integer 72.

Python’s built-in integer to binary converter bin() is helpful as well. However, we will slightly alter the result as we need to rely on standardized bit lengths for each character representation (in our case it’s a byte or 8 bits) and want to use lists of zeros and ones instead of Python’s binary strings.

print(bin(72))

Output:

'0b1001000'

Where ‘0b’ let’s Python know it’s a binary string. So we will have to chop this indicator off, and then we are left with 7 bits. We thus need to prepend another zero. And finally, we will cast each character to an integer. The bellow function takes care of all the conversions and returns a list of zeros and ones.

While it’s necessary to make sure that each character is presented by 8 bits, we will have only a one dimensional list, as opposed to a list of lists of 8 bits as this will help further on. For example,

print(translate('XO'))

will yield:

[0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

b) From Bits to Hex

It is quite common, but not mandatory, that the hash value is presented in hexadecimal notation. Here is where the SHA-256 gets its name from, the length of its output in bits is 256, (there is also, for example, SHA-512, with a, well, you guessed it, 512 bits length hash value). In more detail, it’s actually 8 hashed values of each 32 bits (more on that later). Since we operate up to that point on the bit level, we also need a function that converts from Base 2 to Base 16.

For example,

print(b2Tob16([0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1]))

puts out:

'584f'

Note that the output is without the Python hex-indicator for strings. As we will concatenate all 8 hash values in the end, we don’t need it.

c) Utility Functions

To be able to accommodate both our needs of set bit lengths and utilizing the built-in Python functions, as well as being able to conveniently fill words with zeros, we’ll us the below utility function to simply append or prepend a required number of zeros to a list.

Lastly, we will also need to be able to divide a list of bits into desired chunks of 512 bits and later to 32 bits. We’ll use the function below:

That’s it for now with data types and presentation and we will move on to talk about some hardcoded values and constants that the algorithm requires.

2) Initial Hash Values and Round Constants

a) Initial Hash Values

The initial hash values, h, are usually hardcoded and defined by the NIST (see link above). We take them as they are in hexadecimal values and translate them to our preferred binary representation — a list of 32 bits.

h = ['0x6a09e667', '0xbb67ae85', '0x3c6ef372', '0xa54ff53a', '0x510e527f', '0x9b05688c', '0x1f83d9ab', '0x5be0cd19']

The values are the first 32 bits of the fractional parts of the square roots of the first 8 primes. It took me a while to fully understand what these values actually are, so I think it’s a good idea to give an example of the calculation and conversion of one of these values (h[0]):

You won’t need to use this code as we will use the hardcoded values and initialize them with an initializer function, we will define further below.

b) Round Constants

A set of constants (k), which will be used to mix into the hash digest, are the first 32 bits of the fractional parts of the cubic roots of the first 64 prime numbers. Here are the corresponding hex values. These are hardcoded as well.

k = ['0x428a2f98', '0x71374491', '0xb5c0fbcf', '0xe9b5dba5', '0x3956c25b', '0x59f111f1', '0x923f82a4','0xab1c5ed5', '0xd807aa98', '0x12835b01', '0x243185be', '0x550c7dc3', '0x72be5d74', '0x80deb1fe','0x9bdc06a7', '0xc19bf174', '0xe49b69c1', '0xefbe4786', '0x0fc19dc6', '0x240ca1cc', '0x2de92c6f','0x4a7484aa', '0x5cb0a9dc', '0x76f988da', '0x983e5152', '0xa831c66d', '0xb00327c8', '0xbf597fc7','0xc6e00bf3', '0xd5a79147', '0x06ca6351', '0x14292967', '0x27b70a85', '0x2e1b2138', '0x4d2c6dfc','0x53380d13', '0x650a7354', '0x766a0abb', '0x81c2c92e', '0x92722c85', '0xa2bfe8a1', '0xa81a664b','0xc24b8b70', '0xc76c51a3', '0xd192e819', '0xd6990624', '0xf40e3585', '0x106aa070', '0x19a4c116','0x1e376c08', '0x2748774c', '0x34b0bcb5', '0x391c0cb3', '0x4ed8aa4a', '0x5b9cca4f', '0x682e6ff3','0x748f82ee', '0x78a5636f', '0x84c87814', '0x8cc70208', '0x90befffa', '0xa4506ceb', '0xbef9a3f7','0xc67178f2']

c) Initialize the Values

To initialize them in the algorithm and convert them to a list of bits, we’ll take the below code:

3) Preparing the Message (Padding)

The first step will actually always be to prepare, or pad the message. Which can be summarized in the following three steps:

  1. append a single ‘1’ to the end of the message
  2. determine the bit length of the message
  3. chunk it in a way that the total bit length is a multiple of 512

So much for the theory, the below code does all of this, let’s have a look.

Note the three distinct ways we handle the length of the message. If shorter than 448 bits, we simply append the 1 and add the length in bits (see below). If between 449 and 512 bits, we go to the next multiple of 512, which is 1024 bits and follow procedure. Finally, if the message is longer than 512bits, we determine the total length in a while loop.

Let’s feed the function with ‘hello world’ and print the result.

m = preprocessMessage('hello world')
print(m)

Output:

[[0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0]]

In order to not make this part explode in zeros and ones, we have chosen an input that stays below the 448 (including the appended ‘1’) bits. ‘hello world’ has a bit length of 88 (11 characters, each one byte). You can see, m[0][88] is the appended ‘1’ and the m[0][448:] represent the length of 88 bits in 64 bit big endian binary representation. Exactly what SHA-256 requires, and thus, we can move on to the actual algorithm.

4) The Algorithm

Putting the algorithm in rather casual terms, we can say that there is a main loop, which iterates over the 512-bit chunks created in the preprocessing function. Each chunk is then sort of shuffled with itself and a lot of zeros, and then subjected to heavy digestion and compression. Lastly, the initial hash values are updated with the current hash/digest. Which are, after the last iteration of the main loop returned in hexadecimal representation.

a) Utilities

Before we start, we need to get two set of functions out of the way. These are merely utility functions and required the way they are because of using lists instead of Python’s built-in binary representations.

These utils are basically logic gates or boolean functions, and we will use them bit wise — see the functions where the names are in caps-lock. I recommend spending some time with these, as a) they are fun, b) basic knowledge of computer science, c) there are multiple ways of creating the same results and d) are easy to test.

We will also utilize a full binary adder and need to use shift and rotate operations.

The shift and rotation functions are pretty straight forward as they are merely list indexing and concatenation operations. The full binary adder takes two lists of binaries as an input and adds them bitwise. Given that the two bits exceed 2, a carry over is taken into account.

Here’s a simple addition ():

print(add([1,0,1], [1,1,1]))

The sum in binary is:

[1, 0, 0]

The full adder, as compared to the half adder, by default takes a carry over bit. If the last carry over, is a ‘1’, it will, as in our case, be dismissed as we stick to the predefined lengths. In the example above, we added 5 + 7, which is 12. However, [1,0,0] is the base 2 representation for 4, as the carry over bit got neglected.

b) Pre-Main-Loop Preparations

To handle scope and to be able to update after each iteration in the main loop, we will initialize the hash values (h0, h1, … h7) and the round constants k within our Python function, which we will piece together later. To do so, the initializer function defined earlier will do its job here.

c) The Main-Loop

As mentioned above, the main loop iterates over the 512-bit chunks, i.e., the whole message. Within the iterations the following steps are performed:

i) Message Schedule

Below is the first part of the main loop, in which the message schedule (w) is created.

The 512 message block is divided into words of 32 bits, and then 48 more words of 32 zeros are appended. The length of w becomes thus 64. To get an idea, here is a snippet of the result of the ‘hello world’ example at this state:

[[0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0], [0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1], [0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ...[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

Starting from w[16] — the first zero-only list — we will iterate over the message schedule, i.e., its 32-bit words, and perform all of the following operations on each iteration i. Let’s start with a run-through in pseudocode:

sigma0 = (w[i-15] rotate right by 7) xor 
(w[i-15] rotate right by 18) xor
(w[i- 15] shift right by 3)
sigma1 = (w[i-2] rotate right by 17) xor
(w[i-2] rotate right by 19) xor
(w[i-2] shift right by 10)

And then w[i] becomes the sum of w[i-16], sigma0, w[i-17] and sigma1, like so:

w[i] = w[i-16] + sigma0 + w[i-17] + sigma1

Here’s the Python code:

Below you can find selected snippets of the first iteration, which show intermediate states of the 32-bit words.

By tracing the steps of the algorithm, it should become clear what is meant with shuffling the 32-bit words. The rotations and shifts, do their part, so does the xor operation and finally the replacing value of indices of the message-chunk list.

Then we will initialize the intermediary variables a, b, c, d, e, f, g, h with the initial/current hash values h0, h1, h2… h7, before we move on to the next loop.

By the way, we are still in the main loop. That is, within a 512-bit message chunk. The previous loop, the working variables initialization, the compression loop immediately below, and the updating of the hash values are each at the same indent level (level 2, if counting function definition, which we haven’t written down yet).

ii) Compression

I dare say, it gets wild now! This part of the algorithm is heavily cooking up the bits and words. The constants from k will mix in as well.

This loop will iterate the range 0 to 63, each iteration j involves the following operations:

Sigma1 = (e rotate right by 6) xor 
(e rotate right by 11) xor
(e rotate right by 25)
choose = (e and f) xor (not e and g)temp1 = h + Sigma1 + change + k[j] + w[j]Sigma0 = (a rotate right by 2) xor
(a rotate right by 13) xor
(a rotate right a by 22)
majority = (a and b) xor (b and c) xor (b and c)temp2 = Sigma0 + majority

Two things to note here. First, the variable choose stands in for a so called Choose Function, also referred to as if-then-else function. And second, the result of the boolean function that is assigned to the variable majority, is basically just another way of writing the maj(i,j,k) (i,j,k being the function parameters, not the variables used above) utility function we defined in 4).

Next is a round of re-assigning the intermediary variables in the following fashion:

h = g
g
= f
f
= e
e
= d + temp1
d
= c
c
= b
b
= a
a
= temp1 + temp2

Here is the actual Python code of these steps:

And again, a last set of run-throughs, namely the operations that lead to temp1 and temp2:

The heavy part of the algorithm is thus done. We are almost there!

iii) Updating the Hash Values

What follows now is merely another round of reassigning the hash values h0 to h7 (in this example, the initial hash values, if the length of the message ≥ 448 and this would be iteration i+1, they become the current hash values) to the results of the compression loop, i.e., the state of the variables a through h when j = 63. We add the initial/current hash values with the intermediary variables and make the results the current/new hash values. Ok, maybe that sounds a little confusing, but here is the Python code (I spare you the pseudo code and the code snippets this time).

This whole process from creating the message schedule, compressing the data and re-assigning values is done for each message block. We leave the main loop now and are left with eight 32-bit words that are essentially the finished hash. However, we will still digest it to a hexadecimal representation, because it’s sort of best practice, and easier to the eyes.

iv) Hex Digest

Since we have already established the function to convert lists of 32 bits to a hexadecimal value, we simply utilize this code to append the hash values h0 to h7 to each other.

What is worth mentioning is that we are performing this last step after the main loop has finished, the return value of digest is the return value of the algorithm.

v) The sha256 Function

This should help to gain orientation with the indentations and when what operation is computed.

Test

We will keep it brief in this domain, but it is worth mentioning that test vectors are readily available, e.g., here. We will run the following tests:

#example used above
hash_example = sha('hello world')
print('example: ', hash_example)
#bit length: 0
hash_0 = sha('')
vector_hash_0 = 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'
print('bit size 0: ', hash_0 == vector_hash_0)
#bit length: 24
hash_24 = sha('abc')
vector_hash_24 = 'a7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad'
print('bit size 24: ', hash_24 == vector_hash_24)
#bit length: 448
hash_448 = sha('abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq')
vector_hash_448 = '248d6a61d20638b8e5c026930c3e6039a33ce45964ff2167f6ecedd419db06c1'
print('bit size 448: ', hash_448 == vector_hash_448)
#bit length: 896
hash_896 = sha('abcdefghbcdefghicdefghijdefghijkefghijklfghijklmghijklmnhijklmnoijklmnopjklmnopqklmnopqrlmnopqrsmnopqrstnopqrstu')
vector_hash_896 = 'cf5b16a778af8380036ce59e7b0492370b249b11e8f07a51afac45037afee9d1'
print('bit size 896: ', hash_896 == vector_hash_896)

Will/should print the following:

hash_example: b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9
bit size 0: True
bit size 24: True
bit size 448: True
bit size 896: True

Conclusion

There is not much to say other than ‘you have now successfully written a Python script that computes a Secure Hash with an output length of 256 bits’. Wow. You will probably never need to compute it yourself (it’s also not secure to do so!), but then you will most likely never have to build your own car; anyways, it is somewhat satisfying to understand what happens to a password when you sign up or log in somewhere, isn’t it?

--

--