Cracking Caesar Cipher with Frequency Analysis and the implementation in Python

In the 9th century, the Arab philosopher Al-Kindi pioneered the field of cryptanalysis, particularly the breaking of substitution ciphers using frequency analysis.

5 min read · Jul 28, 2023

Following my previous article on the Caesar Cipher, this story covers how to crack it using frequency analysis, the method described by Al-Kindi in the 9th century.

Al-Kindi the father of cryptanalysis

Illustration of scholar. Photo by Thomas Kelley on Unsplash

In his 9th-century work “Risalah fi Istikhraj al-Mu’amma” (Treatise on Deciphering Cryptographic Messages, also known as A Manuscript on Deciphering Cryptographic Messages), Al-Kindi published various cryptographic methods, including the frequency analysis method for cracking substitution ciphers.

This work laid the foundation of cryptanalysis and earned him the title “father of cryptanalysis” in recognition of his pioneering efforts to break ciphers and to study the methods of encryption and decryption.

Frequency Analysis to crack substitution ciphers

Frequency analysis is a technique for breaking simple substitution ciphers such as the Caesar Cipher. The method relies on the fact that in any language, certain letters appear more frequently than others. For example, in English, ‘E’ is the most commonly used letter. We will check this in the implementation section.

In the context of the Caesar Cipher, we can determine the shift distance by finding the most frequent letter in the encrypted message. For example, if the most frequent letter in an encrypted English message is ‘G’, we can conclude that the encryption used a shift distance of 2: the most frequent letter in English is ‘E’, and ‘G’ comes 2 letters after ‘E’.
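
The arithmetic above can be sketched in a few lines of Python. The helper name `estimate_shift` is hypothetical, and the sketch assumes ASCII letters and that ‘E’ is the most frequent letter of the original message. The modulo keeps the result between 0 and 25 even when the shift wraps around the end of the alphabet.

```python
def estimate_shift(most_common_cipher_letter, most_common_plain_letter="e"):
    """Return the forward shift that maps the plain letter onto the cipher letter."""
    diff = ord(most_common_cipher_letter.lower()) - ord(most_common_plain_letter.lower())
    # the modulo handles wrap-around, e.g. 'e' shifted by 25 becomes 'd'
    return diff % 26

print(estimate_shift("g"))  # 2
print(estimate_shift("d"))  # 25
```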

Implementation of frequency analysis in Python

To implement frequency analysis in Python, I will use a dictionary with each letter as the key and its number of appearances as the value.

def frequencyAnalyzer(text):
    counts = {}
    # go through each character of the string
    for c in text:
        # count only alphabetic characters, ignoring case
        if 'a' <= c <= 'z' or 'A' <= c <= 'Z':
            c = c.lower()
            counts[c] = counts.get(c, 0) + 1

    # return a list of (letter, count) tuples, most frequent first
    return sorted(counts.items(), key=lambda x: x[1], reverse=True)

Let’s test the function using a simple string:

print(frequencyAnalyzer("testing with UPPERCASE"))
output: [('t', 3), ('e', 3), ('s', 2), ('i', 2), ('p', 2), ('n', 1), ('g', 1), ('w', 1), ('h', 1), ('u', 1), ('r', 1), ('c', 1), ('a', 1)]

The output is a list of tuples, each holding a letter and its number of appearances, sorted from the most to the least frequent.
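
The same count can also be written more compactly with `collections.Counter` from the standard library. This is only an alternative sketch, not the function used in the rest of the article; the name `count_letters` is hypothetical, and it assumes only ASCII letters should be counted.

```python
from collections import Counter

def count_letters(text):
    # keep ASCII letters only, lowercased so 'E' and 'e' are counted together
    letters = [c.lower() for c in text if c.isascii() and c.isalpha()]
    # most_common() returns (letter, count) tuples, most frequent first
    return Counter(letters).most_common()

print(count_letters("testing with UPPERCASE")[:2])  # [('t', 3), ('e', 3)]
```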

Analyzing a well-known work of English literature

To check the claim that ‘E’ is the most frequent letter in English, I downloaded Romeo and Juliet by William Shakespeare from Project Gutenberg as a txt file and analyzed it with the same algorithm.

Here is the function to read the txt file:

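def txtToString(file):
    # the with-statement closes the file even if reading fails;
    # assuming the file is UTF-8 encoded
    with open(file, "r", encoding="utf-8") as f:
        return f.read()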

And here is how to use it to count the letter appearances:

romeoAndJulietWordsFrequency = frequencyAnalyzer( txtToString("romeoAndJuliet.txt"))
print(romeoAndJulietWordsFrequency)

output:
[('e', 14827), ('t', 11325), ('o', 10242), ('a', 9369), ('i', 8106), ('r', 7755), ('n', 7628), ('s', 7456), ('h', 7351), ('l', 5530), ('d', 4499), ('u', 4260), ('m', 3693), ('y', 2992), ('c', 2970), ('w', 2892), ('f', 2429), ('g', 2289), ('p', 2039), ('b', 1995), ('v', 1240), ('k', 980), ('j', 383), ('x', 158), ('q', 76), ('z', 35)]

As you can see, the most used letter in Romeo and Juliet is ‘E’.

With the following function, we can show the counts as a histogram sorted by the number of appearances of each letter:
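
Raw counts depend on the length of the text, so it can help to convert them to percentages before comparing two texts. Here is a minimal sketch, assuming the input is a list of (letter, count) tuples like the one printed above; the name `to_percentages` is hypothetical and the sample counts are toy values, not taken from Romeo and Juliet.

```python
def to_percentages(frequencies):
    """Convert (letter, count) tuples into (letter, percentage) tuples."""
    total = sum(count for _, count in frequencies)
    if total == 0:
        return []
    return [(letter, round(100 * count / total, 2)) for letter, count in frequencies]

# toy counts for illustration only
sample = [("e", 5), ("t", 3), ("o", 2)]
print(to_percentages(sample))  # [('e', 50.0), ('t', 30.0), ('o', 20.0)]
```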

import matplotlib.pyplot as plt

def plot_histogram(frequencies):
    # accepts either a dictionary or a list of (letter, count) tuples
    sorted_items = sorted(dict(frequencies).items(), key=lambda item: item[1], reverse=True)
    keys = [item[0] for item in sorted_items]
    values = [item[1] for item in sorted_items]

    plt.bar(keys, values)
    plt.xlabel('Characters')
    plt.ylabel('Number of appearances')
    plt.title('Histogram of the most used letters in Romeo and Juliet')
    plt.show()

plot_histogram(romeoAndJulietWordsFrequency)

As we can see, ‘E’ is the most used letter, followed by ‘T’, ‘O’, and so on.

Frequency Analysis to crack Caesar Cipher

With the letter frequencies from Romeo and Juliet above, we can try to decipher any English text encrypted with a random shift distance. The encrypt and decrypt functions can be found in my previous article on the Caesar Cipher.
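
Since the previous article is not reproduced here, the following is a minimal sketch of what `encrypt` and `decrypt` could look like. It is an assumption, not necessarily the original implementation: it shifts ASCII letters forward by `n` positions, preserves case, and leaves every other character untouched, while `decrypt` shifts back by the same amount.

```python
def encrypt(text, n):
    """Shift every ASCII letter forward by n positions, wrapping around the alphabet."""
    result = []
    for c in text:
        if c.isascii() and c.isalpha():
            base = ord('A') if c.isupper() else ord('a')
            result.append(chr((ord(c) - base + n) % 26 + base))
        else:
            # spaces, digits and punctuation are kept as they are
            result.append(c)
    return "".join(result)

def decrypt(text, n):
    """Undo encrypt by shifting every letter back by n positions."""
    return encrypt(text, -n)

print(encrypt("Hello, World!", 3))  # Khoor, Zruog!
print(decrypt("Khoor, Zruog!", 3))  # Hello, World!
```

Because of the modulo, a shift of 8865 behaves exactly like a shift of 25.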

import random

# encrypt and decrypt come from the previous article on Caesar Cipher

# choose a random shift number to verify that this method works
# whatever the distance is
n = random.randint(0,10000)

# original text message
text = "I am currently working on an article about classic cipher methods, are you interested ?"
# uncomment the line below if you want to use
# Kim by Kipling from https://www.gutenberg.org/ebooks/2226
# text = txtToString("Kim - Kipling.txt")

# encrypt the message with a shift distance of n
encryptedMessage = encrypt(text, n)

# obtaining the letters frequency from both texts
encryptedMessageWordsFrequency = frequencyAnalyzer(encryptedMessage)
romeoAndJulietWordsFrequency = frequencyAnalyzer( txtToString("romeoAndJuliet.txt"))

# we are looking for the distance between the most frequent letter in our text
# and the most frequent letter in English (sample obtained from Romeo and Juliet).
# the modulo keeps the direction of the shift, which abs() would lose;
# this assumes decrypt(message, k) shifts each letter back by k positions
nInversed = (ord(encryptedMessageWordsFrequency[0][0]) - ord(romeoAndJulietWordsFrequency[0][0])) % 26

print("the shift distance obtained from a randomizer: ", n)
print("original message: ", text)
print("the encrypted message: ", encryptedMessage)
print("the distance calculated: ", nInversed)
print("the decrypted message: ", decrypt(encryptedMessage, nInversed))

output:

the shift distance obtained from a randomizer:  8865
original message:  I am currently working on an article about classic cipher methods, are you interested ?
the encrypted message:  H zl btqqdmskx vnqjhmf nm zm zqshbkd zants bkzrrhb bhogdq ldsgncr, zqd xnt hmsdqdrsdc ?
the distance calculated:  25
the decrypted message:  I am currently working on an article about classic cipher methods, are you interested ?

The calculated distance is 25 rather than 8865 because the alphabet has 26 letters, so shifting by 8865 is the same as shifting by 8865 mod 26 = 25.

Note: in this example, the decryption works only if the most common letter in the original message is ‘E’. The example is not perfect, but it gives you the idea of how the frequency analysis method works.
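
One way to relax this limitation is to try all 26 shifts and keep the candidate whose overall letter distribution looks most like English, here scored with a chi-squared statistic. This is a sketch of a different technique from the single-letter comparison used above. The names `ENGLISH_FREQ`, `shift_back`, `chi_squared` and `crack` are hypothetical, and the frequency table holds approximate, commonly cited percentages for English rather than the counts measured from Romeo and Juliet.

```python
from collections import Counter

# approximate letter frequencies in English, in percent (assumed reference values)
ENGLISH_FREQ = {
    'a': 8.17, 'b': 1.49, 'c': 2.78, 'd': 4.25, 'e': 12.70, 'f': 2.23,
    'g': 2.02, 'h': 6.09, 'i': 6.97, 'j': 0.15, 'k': 0.77, 'l': 4.03,
    'm': 2.41, 'n': 6.75, 'o': 7.51, 'p': 1.93, 'q': 0.10, 'r': 5.99,
    's': 6.33, 't': 9.06, 'u': 2.76, 'v': 0.98, 'w': 2.36, 'x': 0.15,
    'y': 1.97, 'z': 0.07,
}

def shift_back(text, k):
    """Shift every ASCII letter back by k positions, keeping case and other characters."""
    out = []
    for c in text:
        if c.isascii() and c.isalpha():
            base = ord('A') if c.isupper() else ord('a')
            out.append(chr((ord(c) - base - k) % 26 + base))
        else:
            out.append(c)
    return "".join(out)

def chi_squared(text):
    """Lower scores mean the letter distribution is closer to the English table."""
    letters = [c.lower() for c in text if c.isascii() and c.isalpha()]
    total = len(letters)
    if total == 0:
        return float("inf")
    counts = Counter(letters)
    score = 0.0
    for letter, percent in ENGLISH_FREQ.items():
        expected = total * percent / 100
        score += (counts.get(letter, 0) - expected) ** 2 / expected
    return score

def crack(ciphertext):
    """Try all 26 shifts and return (shift, candidate plaintext) with the best score."""
    best = min(range(26), key=lambda k: chi_squared(shift_back(ciphertext, k)))
    return best, shift_back(ciphertext, best)

key, message = crack("H zl btqqdmskx vnqjhmf nm zm zqshbkd zants bkzrrhb bhogdq ldsgncr, zqd xnt hmsdqdrsdc ?")
print(key, message)
```

Scoring the whole distribution tends to be more robust on short messages than comparing a single letter, although very short or unusual texts can still mislead it.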