AES Algorithm and Its Hardware Implementation on FPGA: A Step-by-Step Guide

Aug 21, 2020

In this post we go step by step through the implementation of the AES-128 algorithm on an FPGA/ASIC platform using Verilog. The post is divided into two sections: the background and working of the AES algorithm, and the block-level implementation of AES in Verilog.

1. Background

Encryption is the process of converting ordinary data (plaintext) into unintelligible text (ciphertext). Encrypted data must be decrypted before the recipient can read it; this is the decryption process. Over the past few years a lot of research has gone into using encryption more efficiently in multimedia applications, and many encryption standards have come into existence, both asymmetric and symmetric, ranging from DES to AES.
Previously DES was used, but it could be broken easily. In 1999, at DES Challenge III, it took only 22 hours to break ciphertext encrypted with DES using a brute-force attack! The main reason DES is not secure is its short key length of only 56 bits. After DES came 3DES, a variation of DES that is more secure but still does not provide adequate performance. Then AES came into the picture as a more feasible and reliable approach.
After AES was included in the ISO/IEC 18033-3 standard and became the first public cipher approved by the NSA, it attracted more and more researchers and engineers to apply it in real-time applications. AES also enables faster encryption than DES, which suits software, firmware and hardware that require low latency or high throughput. It is therefore used in many protocols such as SSL/TLS and can be found in many modern applications and devices.

In this cutting-edge era, our objectives also include low hardware utilisation, higher speed and low power consumption. To satisfy these requirements, we move from AES in software to AES in hardware.

1.2 AES algorithm

AES-128/192/256 requires 10/12/14 rounds respectively to complete the full operation. For AES-128, the input data is 128 bits, the input key is also 128 bits, and each round requires 1 clock cycle to complete. The AES architectural flow is shown below:

Internally, the AES algorithm’s operations are performed on a two-dimensional array of bytes called the State. So, at the beginning of the Cipher or Inverse Cipher, the input array, ‘in’, is copied to the State array according to the scheme: s[r, c] = in[r + 4c]. The four bytes form 32-bit words in each column of the State array, where the row number r provides an index for the four bytes within each word. Accordingly, the state can be represented as a one-dimensional sequence of 32-bit words (columns), w0 … w3, where the column number c provides an index. State can be considered as an array of four words, as follows:

$w_0 = s_{0,0}\, s_{1,0}\, s_{2,0}\, s_{3,0}$

$w_1 = s_{0,1}\, s_{1,1}\, s_{2,1}\, s_{3,1}$

$w_2 = s_{0,2}\, s_{1,2}\, s_{2,2}\, s_{3,2}$

$w_3 = s_{0,3}\, s_{1,3}\, s_{2,3}\, s_{3,3}$
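The mapping from input bytes to the State can be sketched in a few lines of Python. This is only a software reference model, not the Verilog from this post; the function names bytes_to_state and state_to_words are made up for illustration, and packing s[0][c] into the most significant byte of each word is an assumption.

```python
def bytes_to_state(block):
    """Copy 16 input bytes into a 4x4 state using s[r][c] = in[r + 4c]."""
    assert len(block) == 16
    return [[block[r + 4 * c] for c in range(4)] for r in range(4)]


def state_to_words(state):
    """Return the columns w0..w3 as 32-bit integers.

    Assumption: row 0 of each column lands in the most significant byte.
    """
    words = []
    for c in range(4):
        w = 0
        for r in range(4):
            w = (w << 8) | state[r][c]
        words.append(w)
    return words


block = bytes(range(16))          # in[0] = 0x00 ... in[15] = 0x0f
state = bytes_to_state(block)
words = state_to_words(state)
```

With this input, each word simply holds four consecutive input bytes, which is why a 128-bit bus can be sliced into four 32-bit columns directly.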

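Each round of the AES algorithm is built from the few steps shown below (the final round, round 10 for AES-128, skips Mix columns). Add round key is applied once before the first round and then again at the end of every round: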

Add round key
Substitute bytes
Shift rows
Mix columns

Let's dive into each step:

1. ADD Round key: In the AddRoundKey() transformation, a Round Key is added to the State by a simple bitwise XOR operation. This is the first step of the AES algorithm: the 128-bit plaintext and the 128-bit key are XORed bit by bit, as shown below:

The matrix of 16 bytes is treated as 128 bits and XORed with the 128 bits of the round key. If this is the last round, the result is the 128-bit encrypted output. Otherwise, these 128 bits go into the next round, again as 16 bytes.
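As a quick check of this step, here is a minimal Python sketch (a reference model, not the Verilog module; the name add_round_key is an assumption). The input values are the plaintext and key from the FIPS-197 Appendix B example.

```python
def add_round_key(state, round_key):
    """XOR a 16-byte state with a 16-byte round key, byte by byte."""
    assert len(state) == 16 and len(round_key) == 16
    return bytes(s ^ k for s, k in zip(state, round_key))


plaintext = bytes.fromhex("3243f6a8885a308d313198a2e0370734")
key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
after_initial_xor = add_round_key(plaintext, key)
```

Because XOR is its own inverse, applying the same round key a second time returns the original state, which is why decryption can reuse this block unchanged.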

2. SUB-BYTES Transformation: This is a non-linear transformation in which each byte is replaced with a value from the S-box. The S-box is fixed in advance by the algorithm.

The S-box is used to substitute data; we can simply see it as a lookup table. Each byte of the block holds 8 bits of data: the upper 4 bits are used as the row index and the lower 4 bits as the column index, and with these two indices we read the substituted value from the S-box.
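If you want to cross-check the table you typed into your Verilog case statement, the S-box can be regenerated in software. The sketch below is a Python reference model under the usual AES definition (inverse in GF(2^8) followed by the affine transform); the helper names are hypothetical, and the slow exponentiation loop is chosen for clarity, not speed.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def gf_inv(a):
    """Multiplicative inverse in GF(2^8); 0 maps to 0 by convention."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):          # a^254 equals a^-1 in this field
        result = gf_mul(result, a)
    return result


def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def make_sbox():
    """Build the 256-entry S-box: field inverse, then the affine transform."""
    table = []
    for x in range(256):
        b = gf_inv(x)
        table.append(b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63)
    return table


SBOX = make_sbox()


def sbox_lookup(byte):
    """Look up a byte using the upper nibble as row and the lower nibble as column."""
    row, col = byte >> 4, byte & 0x0F
    return SBOX[16 * row + col]
```

For example, the byte 0x53 selects row 5, column 3, and the lookup returns 0xED.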

3. SHIFT ROWS: In this operation, each row of the state is cyclically shifted to the left, depending on the row index. The 1st row is shifted 0 positions to the left. The 2nd row is shifted 1 position to the left. The 3rd row is shifted 2 positions to the left. The 4th row is shifted 3 positions to the left.
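The row shifts are pure rewiring, which a short Python model makes easy to see. This sketch assumes the state is stored as 16 bytes in the same order as the input (index r + 4c); the name shift_rows is illustrative only.

```python
def shift_rows(state):
    """Rotate row r of the state left by r positions.

    The state is 16 bytes with element (r, c) stored at index r + 4c.
    """
    out = bytearray(16)
    for r in range(4):
        for c in range(4):
            out[r + 4 * c] = state[r + 4 * ((c + r) % 4)]
    return bytes(out)


shifted = shift_rows(bytes(range(16)))
```

In hardware this costs no logic at all, since each output byte is just a wire to a different input byte.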

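4. MIX COLUMN: The Mix Columns transformation operates on the State column by column, treating each column as a four-term polynomial. The columns are considered as polynomials over $GF(2^8)$ and multiplied modulo $x^4 + 1$ with a fixed polynomial $a(x)$, given by
$a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}$

$s'(x) = a(x) \otimes s(x)$
The above equation can be described in matrix form as below:

$\begin{bmatrix} s'_{0,c} \\ s'_{1,c} \\ s'_{2,c} \\ s'_{3,c} \end{bmatrix} = \begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix} \begin{bmatrix} s_{0,c} \\ s_{1,c} \\ s_{2,c} \\ s_{3,c} \end{bmatrix}$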

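Written out row by row, the matrix multiplication for column c is:

$s'_{0,c} = (\{02\} \cdot s_{0,c}) \oplus (\{03\} \cdot s_{1,c}) \oplus s_{2,c} \oplus s_{3,c}$

$s'_{1,c} = (\{02\} \cdot s_{1,c}) \oplus (\{03\} \cdot s_{2,c}) \oplus s_{0,c} \oplus s_{3,c}$

$s'_{2,c} = (\{02\} \cdot s_{2,c}) \oplus (\{03\} \cdot s_{3,c}) \oplus s_{0,c} \oplus s_{1,c}$

$s'_{3,c} = (\{02\} \cdot s_{3,c}) \oplus (\{03\} \cdot s_{0,c}) \oplus s_{2,c} \oplus s_{1,c}$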
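To see the four equations at work on real numbers, here is a small Python reference sketch that multiplies one column by the fixed matrix. The names MIX_MATRIX, gf_mul and mix_column are my own; the column db 13 53 45 is a commonly used MixColumns test vector.

```python
# Rows of the fixed MixColumns matrix, one row per output byte.
MIX_MATRIX = [
    [0x02, 0x03, 0x01, 0x01],
    [0x01, 0x02, 0x03, 0x01],
    [0x01, 0x01, 0x02, 0x03],
    [0x03, 0x01, 0x01, 0x02],
]


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def mix_column(col):
    """Apply the MixColumns matrix to one 4-byte column [s0, s1, s2, s3]."""
    out = []
    for row in MIX_MATRIX:
        acc = 0
        for coeff, byte in zip(row, col):
            acc ^= gf_mul(coeff, byte)   # addition in GF(2^8) is XOR
        out.append(acc)
    return out


mixed = mix_column([0xDB, 0x13, 0x53, 0x45])
```

A column of four identical bytes comes out unchanged, because the coefficients in every row XOR to {01}; that makes a handy sanity check in a testbench.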

Key Generation: The AES key expansion algorithm takes a 4-word key as input and produces a linear array of 44 words. Each round uses 4 of these words. Each word contains 32 bits (4 bytes), which means each sub-key is 128 bits long. To generate the expanded key from the actual key, the key is first copied into the first four words of the expanded key.

The remainder of the expanded key is filled in four words at a time. Each added word w[i] depends on the immediately preceding word, w[i − 1], and the word four positions back w[i − 4]. In three out of four cases, a simple XOR is used. For a word whose position in the array is a multiple of 4, a more complex function is used.

The symbol g represents that complex function. The function g consists of the following sub-functions:

RotWord performs a one-byte circular left shift on a word. This means that an input word [b0, b1, b2, b3] is transformed into [b1, b2, b3, b0].
SubWord performs a byte substitution on each byte of its input word, using the s-box described earlier.
The result of steps 1 and 2 is XORed with round constant, Rcon[j].
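The whole key schedule fits in a short Python reference model, which is useful for producing expected values for a testbench. This is a sketch, not the Verilog key scheduler: the helper names are assumptions, the S-box is regenerated on the fly to keep the snippet self-contained, and the key is the one from the FIPS-197 key expansion example.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo 0x11B."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def make_sbox():
    """Build the AES S-box from the field inverse and the affine transform."""
    def rotl8(x, n):
        return ((x << n) | (x >> (8 - n))) & 0xFF
    table = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):
                inv = gf_mul(inv, x)
        table.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return table


SBOX = make_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def rot_word(w):
    """[b0, b1, b2, b3] -> [b1, b2, b3, b0] on a 32-bit word."""
    return ((w << 8) & 0xFFFFFFFF) | (w >> 24)


def sub_word(w):
    """Apply the S-box to each of the four bytes of a word."""
    return (SBOX[w >> 24] << 24) | (SBOX[(w >> 16) & 0xFF] << 16) \
        | (SBOX[(w >> 8) & 0xFF] << 8) | SBOX[w & 0xFF]


def expand_key(key):
    """Expand a 16-byte key into the 44 words w[0..43]."""
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]
    for i in range(4, 44):
        temp = w[i - 1]
        if i % 4 == 0:                       # the function g
            temp = sub_word(rot_word(temp)) ^ (RCON[i // 4 - 1] << 24)
        w.append(w[i - 4] ^ temp)
    return w


words = expand_key(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
```

Round key n is then the concatenation of words[4n] to words[4n + 3].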

2. Hardware implementation using Verilog:

To implement AES-128, the design is first written in Verilog. It is a complete 128-bit design that has been synthesised and verified. It consists of an encrypt module and a key scheduler under the top module. Let's first look inside each block's implementation.

Before implementation, it's important to understand how the data is organised:

128-bit data is treated as 4 words, where each word = 32 bits

Total words = 44, the number of words produced by key generation

Total round keys = 11

1. S-box: The S-box takes a lot of space in hardware, as it needs a lot of logic resources to hold the predefined values of the AES S-box table. The input and output signals are 8 bits wide and are named data and dout respectively. A snippet of the code is shown below:
Note: It's not the complete code. Full code is on my Github account and the link is shared at the bottom.

Tip: always use blocking assignments for combinational circuits and keep non-blocking assignments for clocked logic, so that simulation matches the synthesised hardware.

2. Sub-bytes: To replace a byte with a value from the S-box, we have to instantiate the S-box inside our sub-byte module. Since each step in every round operates on 128-bit data, the S-box has to be instantiated 16 times in the sub-byte module.

The Verilog code of sub-byte is shown below:

Here tmp_out is used as an internal wire for the data coming out of each S-box, and it is finally assigned to data_out as 128 bits.

3. Shift Rows: As each row of the state is cyclically shifted to the left, depending on the row index, we have to shift each byte as shown in the code below:

The input is the 128-bit data coming out of the sub-byte block, and the output is sent to the mix column block.

4. Mix column: In the mix column block, considering the state matrix shown above, we have to multiply each byte of the state matrix by a fixed coefficient in $GF(2^8)$. To achieve this, the first thing is to define the multipliers. For example, if we consider x as an 8-bit input, we have to generate the 2x and 3x outputs and finally integrate them in one module to generate our polynomial.

$a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}$

In hardware, multiplication by 2 is done by shifting the bits left by 1; when the bit shifted out is 1, the result is also XORed with {1b} to reduce it modulo the AES polynomial. 3x is then simply 2x XORed with x. The code below shows the implementation.
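The same shift-and-reduce logic can be modelled in Python to produce expected values for the Verilog multipliers. The names mul_2, mul_3 and mul_32 follow the module names used in this post, but the bodies are my own sketch, and placing row 0 of the column in the most significant byte of the 32-bit word is an assumption.

```python
def mul_2(x):
    """Multiply a byte by {02}: shift left, XOR 0x1B if the MSB was set."""
    shifted = (x << 1) & 0xFF
    return shifted ^ 0x1B if x & 0x80 else shifted


def mul_3(x):
    """Multiply a byte by {03}: 3x = 2x XOR x."""
    return mul_2(x) ^ x


def mul_32(word):
    """Mix one 32-bit column.

    Assumption: the column is packed as {s0, s1, s2, s3}, s0 in the top byte.
    """
    s = [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]
    out = 0
    for r in range(4):
        byte = mul_2(s[r]) ^ mul_3(s[(r + 1) % 4]) ^ s[(r + 2) % 4] ^ s[(r + 3) % 4]
        out = (out << 8) | byte
    return out
```

For instance, mul_2(0x57) gives 0xAE without any reduction, while mul_2(0xAE) overflows and is reduced to 0x47.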

The next step is to instantiate both modules in the mul_32 module so that it operates on 32 bits. First, the 32-bit data is divided into 8-bit tmp values so that each byte can be processed. Here m2_tmp_out and m3_tmp_out hold the outputs of the mul_2 and mul_3 modules, which are XORed according to the polynomial equations shown in the mix column explanation:

For the final integration, the operation is performed on the 128-bit data by instantiating the mul_32 module 4 times. The code for Mix_column is shown below:

5. Key generation: Apart from the encryption and decryption modules, the other main component is the key expansion schedule. The security of the AES encryption/decryption standard depends heavily on this part. The AES algorithm specifies that the user key is first XORed with the original plaintext/ciphertext; from the next round onward, the expanded keys from the key schedule are XORed with the data. The expansion algorithm of AES is fixed.

The modules inside are Rcon and subword (together, the g function) and one key expansion module. Let's first talk about Rcon and subword.

Tip: If you want to speed up key generation, a pipelined architecture is a good choice, but keep the hardware resource utilisation in mind.

Rcon: The round constant, also called Rcon, supplies a different constant for each round of key generation.
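The ten constants needed for AES-128 do not have to be memorised: each one is the previous one multiplied by {02} in GF(2^8), starting from {01}. A minimal Python sketch (the names xtime and rcon_table are assumptions, not taken from the Verilog code):

```python
def xtime(b):
    """Multiply a byte by {02} in GF(2^8) modulo 0x11B."""
    b <<= 1
    return (b ^ 0x11B) if b & 0x100 else b


def rcon_table(rounds=10):
    """Return the round-constant bytes for rounds 1..rounds."""
    table, value = [], 0x01
    for _ in range(rounds):
        table.append(value)
        value = xtime(value)
    return table


RCON = rcon_table()
```

Only the first byte of the 32-bit Rcon word is non-zero, so in hardware the constant is XORed into just one byte of the word.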

Subword: SubWord() is a function that takes a four-byte input word and applies the S-box to each of the four bytes to produce an output word. The catch here is that a subword is generated by the S-box from a word, and the next word is generated using that subword. So each subword depends on a word, and the following words depend on the previous subword.

Confused? Let's first understand how a subword is written in the code. The S-box input starts with w3[23:0] (RotWord moves the top byte of w3 to the end), and the output is subword[31:0].

Key expansion: Let's talk about the input first. The module takes only the 128-bit key as input and generates 11 keys, so 128*11 = 1408 bits are produced as output from key expansion: the first key for the initial add round key module and 10 keys for the 10 rounds. The outputs are named key_s0 … key_s10.

Internally, 32-bit word registers are used, four for each generated key.

For example:

key_s0={w0,w1,w2,w3}; // four words are used

…. key_s10={w40,w41,w42,w43};

All the operations are performed on words.

A total of 44 words are generated, and each word requires the previous key/word, with every fourth word also using a subword generated from the S-box and rcon. Initially w0..w3 are used as they are in add round key, before the first round happens. w3 is then used in the generation of subword1.

Now move to w4..w7. w4 is generated by XORing w0 with subword1, which comes from the S-box instances u0, u1, u2, u3, and finally with the constant rcon; w5..w7 are each the XOR of the previous word and the word four positions back. w7 is used in the generation of subword2, and the same process continues until all 44 words are generated and concatenated into the final 11 keys.

Round: The round module integrates the sub-modules into a single round. A total of 10 rounds are used in 128-bit AES. Having discussed the implementation of each step module, let's finally instantiate them to form one round. Each round in the code goes through 4 steps, except the last one, which uses 3 steps (Mix column is skipped).

AES_Main: This is the top module, where key expansion and the rounds are instantiated to produce the final AES encryption output. The reverse is done in AES decryption.
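Before trusting the waveform, it helps to have a software model that produces the expected ciphertext for the same inputs. Below is a compact Python reference sketch of AES-128 block encryption that chains the steps in the same order as the top module; all names are my own, it is written for readability rather than speed or side-channel safety, and the test values come from FIPS-197.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo 0x11B."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def make_sbox():
    """Build the S-box: field inverse followed by the affine transform."""
    def rotl8(x, n):
        return ((x << n) | (x >> (8 - n))) & 0xFF
    table = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):
                inv = gf_mul(inv, x)
        table.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return table


SBOX = make_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def expand_key(key):
    """Return the 11 round keys, 16 bytes each."""
    w = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    for i in range(4, 44):
        t = list(w[i - 1])
        if i % 4 == 0:
            t = [SBOX[b] for b in t[1:] + t[:1]]   # RotWord, then SubWord
            t[0] ^= RCON[i // 4 - 1]               # Rcon hits the first byte
        w.append([a ^ b for a, b in zip(w[i - 4], t)])
    return [bytes(sum(w[4 * r:4 * r + 4], [])) for r in range(11)]


def add_round_key(s, k):
    return bytes(a ^ b for a, b in zip(s, k))


def sub_bytes(s):
    return bytes(SBOX[b] for b in s)


def shift_rows(s):
    # Element (r, c) lives at index r + 4c; row r rotates left by r.
    return bytes(s[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4))


def mix_columns(s):
    out = bytearray()
    for c in range(4):
        a = s[4 * c:4 * c + 4]
        for r in range(4):
            out.append(gf_mul(2, a[r]) ^ gf_mul(3, a[(r + 1) % 4])
                       ^ a[(r + 2) % 4] ^ a[(r + 3) % 4])
    return bytes(out)


def aes128_encrypt_block(plaintext, key):
    """Encrypt one 16-byte block with a 16-byte key."""
    keys = expand_key(key)
    state = add_round_key(plaintext, keys[0])          # initial key addition
    for rnd in range(1, 10):                           # rounds 1..9: all 4 steps
        state = add_round_key(mix_columns(shift_rows(sub_bytes(state))), keys[rnd])
    return add_round_key(shift_rows(sub_bytes(state)), keys[10])  # round 10


ciphertext = aes128_encrypt_block(
    bytes.fromhex("3243f6a8885a308d313198a2e0370734"),
    bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"),
)
```

Feeding the same plaintext and key into the testbench should give the same 128-bit output as this model; if it does not, comparing the intermediate state after each round usually points to the faulty block.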

For the complete code, check the GitHub link below:

Gourav0486/AES-Core-engine- (github.com)

AES encrypt core: 128-bit data, 128-bit key, single pipelined approach, ECB mode.

The waveform after applying test bench is shown below:

The salient features of the AES Encryption/Decryption are summarised in the following manner:

HIGH THROUGHPUT: A pipelined implementation increases throughput, even with larger key sizes, but silicon area/hardware utilisation also increases.
PARAMETER FLEXIBILITY: The underlying Rijndael cipher accommodates any combination of key sizes and block sizes that are multiples of 32 bits, with the number of rounds changing accordingly. AES itself fixes the block size at 128 bits and allows 128, 192 or 256-bit keys.
IMPLEMENTATION FLEXIBILITY: Decryption can be implemented in the same structure as encryption (though with different components).
NO KNOWN PRACTICAL ATTACK: No practical attack on the full cipher is publicly known, although AES has received criticism for its simple mathematical structure.

Kudos! If you have reached this far, I assume you are ready to get your hands dirty with the code.