Symmetric Cryptography in Depth

13 min readApr 7, 2024

In TLS, the actual process of sending and receiving of data is done using Symmetric Encryption. Let us take a look at ChaCha20 and AES, two popular symmetric encryption mechanisms.

AES [7]

In AES Encryption, A Block x (x= 128 bits) is passed through the AES Function for n rounds (n=10,12,14 for Block Lengths of 128,192 and 256 bits) to produce an output block y of 128 bits.

The AES Function itself consists of 4 transformations as shown below. The input of 128 bits is fed as 4 input groups of 32 bits each.

Internal AES Structure

The S-Box transformation (shown as S in the above diagram) looks like this.

So S-Box of the byte C2 will be (25)hex for example.

Shift Rows Sublayer

The ShiftRows transformation cyclically shifts the second row of the state matrix by three bytes to the right, the third row by two bytes to the right and the fourth row by one byte to the right. The first row is not changed by the ShiftRows transformation. If the input of the ShiftRows sublayer is given as a state matrix B = (B0,B1,…,B15), output is as given below.

Mix Column Sublayer

This is merely the multiplication of the above output by a constant matrix. The multiplication is modulo irreducible polynomial P(x) = x⁸ +x⁴ +x³ +x+1.

For example if the input is (25,25,25,25) the output is also (25,25,25,25). The Polynomial Math looks like the below.

The Ci does not need the modulo as its power is less than 8. 25 is first converted to a hex value (11001)hex. Then it is converted to a polynomial depending on which bits are set. The max value of the polynomial with all bits set (11111111)hex is x⁸+x⁷+x⁶+x⁵+x⁴+x³+x²+x¹+1. Thus it follows that 25 that has only the first,4th and 5th bit set will be x⁵+x²+1.Multiplication of polynomial for the 3rd row is actually 02 in polynomial format = x * 25 in polynomial format = x⁵+x²+ 1 which gives x⁶+x³+x.Adding of coefficients vertically in polynomial addition is done modulo 2. So x⁶+x⁶= (2modulo2)x⁶ is equal to zero so the power doesnot exist.

Key Addition Layer

The two inputs to the Key Addition layer are the current 16-byte state matrix and a subkey which also consists of 16 bytes (128 bits). The two inputs are combined through a bitwise XOR operation.

SubKey Calculation

As seen below, the first subkey is the original key. The subsequent subkeys are derived from the previous key.

The Function g () rotates its four input bytes, performs a byte-wise S-Box substitution, and adds a round coefficient RC to it. Round coefficient for each round is given Below. Again addition is modulo 2.

RC[1] = x⁰ = (00000001),

RC[2] = x¹ = (00000010),

AES with 192-bit key has 12 rounds and, thus, 13 subkeys of 128 bit each. The subkeys require 13*16Bytes=13*4 words = 52 words, which are stored in the array elements W[0],…,W[51].

Note the number of subkeys is equal to the number of rounds +1.

Decryption Mechanism

Note that the Decryption mechanism has some differences.

The matrix used in InvMixColumn is different for multiplication.

InvShiftRows shifts the rows in the opposite direction.

Inverse S-Box is again a reverse lookup table.

Decryption key schedule is calculated by computing the entire schedule and storing the subkeys.

Now we dive into variations of the AES Cipher and what is recommended in production.

AES ECB

AES ECB is the most naive AES Cipher where it divides the input into blocks of 128 bits each, pads then with a mechanism like #PCKS7[1] and then runs the Cipher.

The problem with this approach is that a block cipher always encrypts the same contents the same way, given the same key.

The most famous example of this is the ECB penguin shown below. The left side is the unencrypted image and the right side is the encrypted one. As you can see the outlines of the image is still preserved.

To encrypt more than 128 bits and to resolve the above problem, a cipher block chaining mechanism can be used used where a randomly generated IV vector is xored with the Plaintext of 128 bits and passed through the AES Block with a key of 128 bits as earlier. The resulting ciphertext is used to also xored with the next plaintext block to derive the subsequent cipher text blocks.

AES CBC Decryption works in reverse and the ciphertext is first decrypted and then either the result of the previous block or the IV is xored with it.

Note there are multiple problems here.

This process is too slow and cannot be parallelized because of the chaining.
Making a random IV is prone to errors and you might end up reusing the IV leading to a weak cipher in practice..
There is no authentication yet. A MITM attack can happen with someone intercepting the traffic.

The third point is solved by adding a MAC [4] like HMAC to the message. This involves a cryptographic hash function and a secret cryptographic key. As with any MAC, it may be used to simultaneously verify both the data integrity and authenticity of a message.

The Ciphertext message including the IV as an input ensures that a change to the input will change the resultant MAC code and thus preserves data integrity.

The cryptographic key forms the second input ensuring that the key is required for generating the MAC and thus preserving authenticity. No one without a key could have generated the message.

AES GCM

A more Modern mechanism is called AES GCM and it uses a combination of a random nonce (similar to an IV above which would be different for each block) and appending it with a counter block. The resultant block is passed through the AES Block and the output is xored with the cipher text block.

The main advantage of this approach is that it is parallelizable as there is no chaining so each CipherText Block can be calculated in parallel.

Again the main disadvantage is the nonce reuse. Let us say the nonce is reused for messages m1 and m2 and one of the plaintexts p1 and ciphertext c1 is known. The second ciphertext c2 is also known but the plain text p2 is not known. Let us assume a single block and the output of the encryption block is b1 and b2. b1 will be equal to b2 as the nonce is reused. An attacker could xor the two ciphertexts together.

So c1 xor c2 = p1 xor b1 xor p2 xor b2 = p1 xor p2 = known.

Now if p1 is known p2 can be calculated.

GCM is also known as a stream cipher because it generates a keystream independent of the plaintext blocks before running the xor.

The second interesting part of AES is the calculation of a (G)MAC.

AES GCM with GMAC. Ek is the GCM Function. H is the GHASH Function for calculating GMAC. From wikipedia.

The lenghts of the cipher text and the Auth Data(Additional Data shortened as AD) are generally concatenated. AD is generally in plaintext along with the IV. An additional data is used to provide context to the data to prevent an attacker from replaying a message sent in one context in a different context. For example, if the confirmation for “Login confirmation” message is reused in a “Buying a Product” confirmation. The different AD prevents these kind of replay attacks.

Note H will always be of size 128 bits irrespective of the original key size. Because of Grover’s quantum algorithm[5] it will take 2⁶⁴ attempts for an attacker to reveal H .If H is hacked an attacker could use H and modify both ciphertext1 or additional data (move some bits from additional data to ciphertext) to come up with the same auth tag. The len(A)||len(C) parameter prevents this from happening.

ChaCha20-Poly1305

ChaCha20 is a stream cipher like AES. Its main advantage over AES is that it is much faster for ARM mobile devices where support for an AES specific instruction set (AES-NI) is missing. Note at the same time it also hobbles up servers which have support for AES-NI universally.

It first arranges the following in a 4*4 matrix of 32 bit words. The first 4 words are a text constant (“expand 32 byte k). The second and third rows are the 256 byte key. Following that a 128 byte counter and 64 byte nonce are placed as shown below.

Assume indexes of the matrix as shown below.

It uses a function called Quarter Round which takes 4 arguments (a,b,c,d) and carries out the below computation.

 a += b; d ^= a; d <<<= 16;
 c += d; b ^= c; b <<<= 12;
 a += b; d ^= a; d <<<= 8;
 c += d; b ^= c; b <<<= 7;

ChaCha does 10 iterations of a double round as shown below. Each double round consists of an odd and an even round.

 // Odd round
 QR(0, 4,  8, 12) // 1st column
 QR(1, 5,  9, 13) // 2nd column
 QR(2, 6, 10, 14) // 3rd column
 QR(3, 7, 11, 15) // 4th column
 // Even round
 QR(0, 5, 10, 15) // diagonal 1 (main diagonal)
 QR(1, 6, 11, 12) // diagonal 2
 QR(2, 7,  8, 13) // diagonal 3
 QR(3, 4,  9, 14) // diagonal 4

Again this is a stream cipher and generates the keystream in parallel of 64 bytes which is then xored with the plaintext blocks of 64 bytes each.[3]

But there is still a problem. Below is the complete ChaCha20 function from wikipedia.

void chacha_block(uint32_t out[16], uint32_t const in[16])
{
	int i;
	uint32_t x[16];

	for (i = 0; i < 16; ++i)	
		x[i] = in[i];
	// 10 loops × 2 rounds/loop = 20 rounds
	for (i = 0; i < ROUNDS; i += 2) {
		// Odd round
		QR(x[0], x[4], x[ 8], x[12]); // column 0
		QR(x[1], x[5], x[ 9], x[13]); // column 1
		QR(x[2], x[6], x[10], x[14]); // column 2
		QR(x[3], x[7], x[11], x[15]); // column 3
		// Even round
		QR(x[0], x[5], x[10], x[15]); // diagonal 1 (main diagonal)
		QR(x[1], x[6], x[11], x[12]); // diagonal 2
		QR(x[2], x[7], x[ 8], x[13]); // diagonal 3
		QR(x[3], x[4], x[ 9], x[14]); // diagonal 4
	}
	for (i = 0; i < 16; ++i)
		out[i] = x[i] + in[i]; // Last Block of addition
}

Note the last Block. Here the Block is added to itself after 20 rounds. This ensures just reversibility of the rounds would not break ChaCha.If this were not there one would just reverse the rounds and get back the actual values of the key if one knows the plaintext at any instant of time.

Poly 1305

Poly 1305 is a universal hashing function which requires two inputs r and k. It does a polynomial multiplication of the message with a high degree polynomial, adds k to the result and computes the result over mod p.

mac = (m1 r⁴ + m2 r³ + m3 r² + m4 r + k) mod p

mac = (r (r (r (r m1 + m2) + m3) + m4) + k) mod p

h = 0;
h += m1; h *= r; h %= p;
h += m2; h *= r; h %= p;
h += m3; h *= r; h %= p;
h += m4; h *= r; h %= p;
mac = (h + k) % p;

This seems to be quite simple, but note we are doing divisions here which are quite expensive in practice.

Let us understand a simple math trick which helps us here. Let us say we want to calculate 10000 mod 9973. This is equal to 27

10,000 mod 9973 = 27
10000 = 9973+27
//Assume a number is multiplied by 10000.
y*10000 = y*9973 + y*27
//Substituting this back to 1
(y*9973 + y*27) mod 9973 = 27
(y*9973 mod 9973 + y*27 mod 9973)mod 9973 = 27
y*27mod9973 = 27

//So multiplying a number by 10000 is equivalent to multiplying it by 27 if we have to 
// module it by 9973

So effectively multiplying 4 digit numbers mod 9973 works as below

              8  7  6  5
           ×  4  3  2  1
           -------------
           |  8  7  6  5
        16 | 14 12 10  .
     24 21 | 18 15  .  .
  32 28 24 | 20  .  .  .

This is equivalent to the below. multiplying by 1000 is treated as multiplying by 27.

 8   7   6   5
×  4   3   2   1
----- --- -- ---
   8   7   6   5
  14  12  10   .
  18  15   .   .
  20   .   .   .
             432
         648 567
     864 756 648

Generalizing this to binary arithmetic: when we move the overflowing numbers to the right after 130 bits ( multiple of 26 bits= length of a0, a1, ..,.. a4 and of b0 b1, b2,..b4) , we need to multiply them by 5 to compensate.

Each of these multiplications can also be done in parallel using SIMD.

     a4      a3      a2      a1      a0
×    b4      b3      b2      b1      b0
---------------------------------------
  a4×b0   a3×b0   a2×b0   a1×b0   a0×b0 
+ a3×b1   a2×b1   a1×b1   a0×b1 5×a4×b1
+ a2×b2   a1×b2   a0×b2 5×a4×b2 5×a3×b2
+ a1×b3   a0×b3 5×a4×b3 5×a3×b3 5×a2×b3
+ a0×b4 5×a4×b4 5×a3×b4 5×a2×b4 5×a1×b4
--------------------------------------
   p[4]   p[3]   p[2]    p[1]      p[0]
// Here each a is 2 26 bit integer.

In C Code,

void mult(uint64_t p[5], const uint32_t a[5], const uint32_t b[5])
{
    uint64_t a0 = a[0];  uint64_t b0 = b[0];
    uint64_t a1 = a[1];  uint64_t b1 = b[1];
    uint64_t a2 = a[2];  uint64_t b2 = b[2];
    uint64_t a3 = a[3];  uint64_t b3 = b[3];
    uint64_t a4 = a[4];  uint64_t b4 = b[4];
    p[0] = a0*b0 + a1*b4*5 + a2*b3*5 + a3*b2*5 + a4*b1*5;
    p[1] = a0*b1 + a1*b0   + a2*b4*5 + a3*b3*5 + a4*b2*5;
    p[2] = a0*b2 + a1*b1   + a2*b0   + a3*b4*5 + a4*b3*5;
    p[3] = a0*b3 + a1*b2   + a2*b1   + a3*b0   + a4*b4*5;
    p[4] = a0*b4 + a1*b3   + a2*b2   + a3*b1   + a4*b0  ;
}

Here is how the carry function works. The carry is propagated until there is no carry.

uint64_t propagate_carry(uint64_t p[5], uint64_t carry)
{
    p[0] += carry * 5;  carry = p[0] >> 26;  p[0] -= carry << 26; // For the first iteration, carry will be zero.
    p[1] += carry    ;  carry = p[1] >> 26;  p[1] -= carry << 26;
    p[2] += carry    ;  carry = p[2] >> 26;  p[2] -= carry << 26;
    p[3] += carry    ;  carry = p[3] >> 26;  p[3] -= carry << 26;
    p[4] += carry    ;  carry = p[4] >> 26;  p[4] -= carry << 26;
    return carry;
}

Credits to Loup Valiant for this implementation. [2]

This is how the combined ChaCha20 Poly 1035 looks. Again the lengths are appended for the same reason as in GHASH.

Disk Encryption (AES-XTS)

AES-XEX (Xor- Encrypt-Xor) with XTS (CipherText Stealing) is a popular mode of Disk Encryption supported by Software like TrueCrypt.

This is how AES-XES looks from Wikipedia. As earlier it is a Stream based Cipher so the Encryption of the Disk Blocks can be done in parallel.

Apart from this it uses two keys.

One is a master key key2 which changes once per sector. This is called a master tweak ensures plaintexts on two different sectors dont end up with the same ciphertext.

Second is a key key1 which changes once per Block and is called a subtweak..

In terms of formulae

Ek(I) is the Encryption Function which is multiplied by the polynomial alpha ^j to make it different for each Block.

This Result X is then Xored with the Plaintext in parallel

The result of the Xor is then encrypted by the Block Cypher and again xored with X to give the cipher text C

Why not use AES CBC?

Because if a Block is corrupted all further sequential blocks would not be recoverable.

Why not use AES GCM + GMAC?

Random Nonce doesnt serve any purpose. No requirement for an authentication code which is chain calculated and has the same issues as AES CBC.

This has a problem of large overhead because the last block is often padded to be a multiple of the Block size. So in practice an approach called CipherText Stealing is used which mixes the last two blocks

[1]In #PCKS The value of each pad byte is the total number of pad bytes that are added. Of course, the total number of pad bytes depends on the block size.

For example the padded bytes are shown below in bold.

0x10 0x11 0x36 0x67 0x38 0xBC 0x03 0x21 0xEF 0x03 0x03 0x03. Other mechanisms are shown here. http://www.crypto-it.net/eng/theory/padding.html

[2] https://loup-vaillant.fr/tutorials/poly1305-design

[3] Relative to just 16 bytes for an AES keystream.

[4] https://medium.com/aws-in-plain-english/how-aws-s3-requests-are-authenticated-hmac-sha256-in-depth-87c32fb4fc9a

[5] https://en.m.wikipedia.org/wiki/Grover’s_algorithm

[7] My reference for the mathematically inclined would be this reference from Christof Paar

Understanding Cryptography: A Textbook for Students and Practitioners https://amzn.in/d/daXUjwI