Nancy the cavewoman and frequency analysis-1

Hemanth Chitti
The Fun Of Cryptography
7 min read · Jun 12, 2020

We talked about monoalphabetic ciphers in our last post (https://bit.ly/2MddkzM). Now we are going to discuss how to break them.

Necessary reading — https://bit.ly/2Xf2yPO

What we know about monoalphabetic ciphers is that they simply map each letter of a language's alphabet to the letter in the same position of some permutation of that alphabet. This means that in the worst case a brute-force attack would have to try every possible permutation, which would take a really, really long time. That made them much better than a Caesar cipher.
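To put rough numbers on that, here is a small sketch in Python. The guess rate of one billion keys per second is purely an assumed figure for illustration, not a measurement of any real attacker.

```python
import math

ALPHABET_SIZE = 26

# A Caesar cipher only has 25 useful shifts (a shift of 0 changes nothing).
caesar_keys = ALPHABET_SIZE - 1

# A monoalphabetic cipher can use any permutation of the 26 letters.
mono_keys = math.factorial(ALPHABET_SIZE)

# Assumed guess rate, chosen only for illustration.
GUESSES_PER_SECOND = 1_000_000_000
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

worst_case_years = mono_keys / GUESSES_PER_SECOND / SECONDS_PER_YEAR

print(f"Caesar keys: {caesar_keys}")
print(f"Monoalphabetic keys: {mono_keys}")
print(f"Worst-case brute force: about {worst_case_years:.1e} years")
```

Even at that generous assumed rate, trying every key is hopeless, which is why the attack discussed below does not try keys at all.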

But we also mentioned that there was a weakness to monoalphabetic ciphers. In fact, exploiting it is a pretty standard attack against classical ciphers in general (more on what those are later). The weakness is that,

Monoalphabetic ciphers do not change the structure of the words one bit.

As always, we look at an anecdote to explain.

Meet Nancy.

Cavewoman with stone computer
Nancy, the tech-savvy cavewoman (Source — friendlystock)

Nancy is a cavewoman from 25000 BC who was frozen in an ice slab. She was discovered by the military in 2000. To their surprise, she’s still alive and running (hehe this idea is totally original). Of course, she became an instant celebrity.

While she has no plan of fighting supervillains, she wants to learn more about the world. She’s told she has to get educated for that. So she decides to sit down and learn how to read and write.

However, the instructor assigned to teach her is jealous of all the attention she gets. And if she actually completes a valid education and, God forbid, a degree, she’d become the first cavewoman in all of history to get a Bachelor’s degree! “No, this won’t do. I’m going to have to nip her education in the bud,” he says to himself.

So he gets the alphabet books and is told to teach her the basics. What he does instead is mix up the positions of the letters. So where we learn the alphabet as “ABCDEFGHIJKLMNOPQRSTUVWXYZ”, she learnt it as “BACEDIGHFMLKJQOPMUSTRZWYVX” instead. However, to be fair, he did teach her some basic speaking so that he wouldn’t fall under suspicion.

Thus, though she is sincere about her education, she fails even the most basic tests because she can’t read. The scientists conclude that she is stupid, and that maybe cavewomen can’t be educated after all, so they discontinue her education.

The day she goes to the military base to have her education formally terminated, she sees them trying to crack some monoalphabetic cipher (please assume that this is some noob military xD). The ciphertext is IOOKFSH_JFKFTBUV.

But she doesn’t understand what the problem is! She immediately reads it out loud as “FOOLISH_MILITARY”! Everyone thinks it is an insult but they quickly realize that she is able to read it off the wall as if it were normal English. Soon they discover the instructor’s conspiracy and fire him, making sure that he never gets a job again.
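What Nancy does in her head can be sketched as a simple symbol-for-symbol translation. One assumption to flag: the alphabet string as printed earlier in this post repeats ‘M’ and omits ‘N’, so it is not quite a permutation. The sketch below assumes the 17th symbol was meant to be ‘N’. That guess does not affect this particular ciphertext, which uses neither symbol. The names `to_nancy` and `from_nancy` are made up for this example.

```python
import string

OUR_ALPHABET = string.ascii_uppercase
# Assumption: 'N' in the 17th position (the printed string has a second 'M'
# there), so that the key is a valid permutation of the alphabet.
NANCY_ALPHABET = "BACEDIGHFMLKJQOPNUSTRZWYVX"

# Translation tables: position i in one alphabet maps to position i in the other.
TO_NANCY = str.maketrans(OUR_ALPHABET, NANCY_ALPHABET)
FROM_NANCY = str.maketrans(NANCY_ALPHABET, OUR_ALPHABET)

def to_nancy(text: str) -> str:
    """Write our text using Nancy's symbols (this is the encryption step)."""
    return text.upper().translate(TO_NANCY)

def from_nancy(text: str) -> str:
    """Read Nancy's symbols back into ours (this is the decryption step)."""
    return text.upper().translate(FROM_NANCY)

print(from_nancy("IOOKFSH_JFKFTBUV"))  # FOOLISH_MILITARY
```

Characters outside A to Z, such as the underscore, are left untouched by `translate`, so word boundaries survive the translation as well.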

Now Nancy has to learn how to read from the start. Moreover, she has already assigned a meaning to those symbols in her head, so it will take much more effort to rewire all those brain connections. But she has slowly become a voracious learner, and she keeps bothering everyone for material to read. Until they complete her re-education, the military decides to translate the words into her alphabet. They assign this task to their cryptographers, hoping that by observing Nancy’s language they will learn something vital to breaking more such ciphers.

While in the process of translating they realize something amazing. They write down their formulations and observations as below:

The Frequency Theory:

(i) An alphabet is a set of symbols in a fixed order. Combining these symbols in different ways gives rise to words, which build a language. The alphabet of a given language also has a fixed size, and this size is a fundamental characteristic of the language. That is, if a language is known to have an alphabet size of n, then any alphabet used to represent words in it must also have size n.

(ii) A letter is defined to be a position in the alphabet, and it is represented by different symbols in different alphabets. Thus the letter is not tied down to the representation used by a particular alphabet; it is again a fundamental characteristic of a language.

We denote a letter by bold text. For example we represent the first letter in English as A. We represent this A by the symbol ‘A’ and Nancy represents it by ‘B’. The symbol used to represent it has changed but the letter itself has not. A letter is thus a conceptual element tied to a language itself.

(iii) Monoalphabetic ciphers do not modify the structure of a word. The letters appear in the same order in any alphabet, and only the symbols change. This is how Nancy reads and understands the cipher immediately; though the word IOOKFSH is gibberish to us, it is perfectly sensible to Nancy because the spelling, or the order in which letters appear, has not changed. When we convert the symbols to our alphabet the word makes sense to us.

(iv) There is a certain statistical distribution of letters in text, as follows:

Unigram distribution of letters in English represented by bar graph
Distribution of letters in the English alphabet

‘E’ is the most commonly used letter in English text, as seen from the above graph. After that comes ‘T’, then (in most published frequency tables) ‘A’ and ‘O’, and so on.

Another example from previous posts is that in English words, ‘q’ is almost always followed by ‘u’. So if I suspected a symbol in the ciphertext to be ‘q’, I would have a very good guess at the letter after it. In other words, knowing one letter can give you another for free, just from the patterns found in the English language!

Going further, the most common combination of two letters is “th”. We will look into this more in the next post.

Statistically speaking, in a long enough text the letters appear with more or less the same frequencies as shown above. So the most common symbol in a ciphertext would correspond to ‘E’, the second most common to ‘T’, etc.

Remember how I said in the post on linguistics that the text has to remain meaningful on decryption? That simple fact means the plaintext has to follow the rules of the language, and since the cipher preserves the structure of the text, the ciphertext is vulnerable to attack. This is what I meant.

This blog post is long enough to carry out a successful attack*, so we can use it here as an example.

Nancy doesn’t have to learn our alphabet to read this blog post. She can use frequency analysis instead. She’d go through the article and note down how often each symbol appears. The most common symbol she finds here would most likely correspond to Nancy’s ‘E’ (in other words, whatever our ‘E’ corresponds to in her alphabet). And she could map all the other letters similarly.
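The counting-and-matching process described above can be sketched in Python as follows. The ordering in `ENGLISH_BY_FREQUENCY` is an assumption: it is one commonly quoted ranking, and published tables disagree slightly in the middle and tail. The function names are made up for this example, and this is only the naive rank-matching idea, not a description of how any particular solving tool works.

```python
from collections import Counter
import string

# Assumption: one commonly quoted ordering of English letters, from most to
# least frequent. Other tables differ slightly.
ENGLISH_BY_FREQUENCY = "ETAOINSHRDLCUMWFGYPBVKJXQZ"

def letter_frequencies(text: str) -> Counter:
    """Count only A-Z, ignoring case, spaces and punctuation."""
    return Counter(ch for ch in text.upper() if ch in string.ascii_uppercase)

def guess_key(ciphertext: str) -> dict:
    """Pair the k-th most common cipher symbol with the k-th most common English letter."""
    ranked = [symbol for symbol, _ in letter_frequencies(ciphertext).most_common()]
    return dict(zip(ranked, ENGLISH_BY_FREQUENCY))

def apply_guess(ciphertext: str, key: dict) -> str:
    """Replace every symbol we have a guess for; leave everything else alone."""
    return "".join(key.get(ch, ch) for ch in ciphertext.upper())

# On a two-word ciphertext the guess is poor, because most symbols are tied.
short_cipher = "IOOKFSH_JFKFTBUV"
short_key = guess_key(short_cipher)
print(short_key["F"])                        # 'F' is the most common symbol here
print(apply_guess(short_cipher, short_key))  # not readable English
```

The output for the short ciphertext is gibberish, since nearly every symbol appears only once or twice and the ranking is decided by ties. On a text the length of this post the counts separate much more clearly, and the first few guesses become far more reliable.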

Now someone could say, “Hey, this attack has flaws! Look at the example you gave above which you said Nancy decrypted, ‘IOOKFSH_JFKFTBUV’! You get the wrong answer on performing this attack, so this entire thing fails right there!”

My answer is that this is only statistically true. Notice how I mentioned that this blog post is long enough to carry out a successful attack, and that is why we proceeded with confidence. That is because the more you increase the size of the plaintext, the more the frequency ratios approach the bar graph I attached above. So with this blog post, it is likely to be true, and with an entire book, it would be very likely to give the correct answer. Notice how I say ‘likely’ in both cases; I can never say it with 100% confidence unless the plaintext is of infinite size.

So if you give me just a few words, it might not be true. The very example given above gave the wrong answer when I put it in a substitution tool at https://quipqiup.com/.

The most likely plaintext from the analysis was shown to be “FEELING_MILITARY”, while the second most likely was indeed “FOOLISH_MILITARY”.
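Both candidates are possible because each can be reached from the ciphertext by some one-to-one substitution of symbols. A minimal sketch of that check is below. The function name `is_consistent` and the third candidate are made up for illustration, and the check only tells us whether a candidate is possible, not how likely it is.

```python
def is_consistent(ciphertext: str, candidate: str) -> bool:
    """True if some one-to-one symbol substitution turns ciphertext into candidate."""
    if len(ciphertext) != len(candidate):
        return False
    forward, backward = {}, {}
    for c, p in zip(ciphertext, candidate):
        # setdefault stores the first pairing seen and returns the stored one
        # afterwards, so any later contradiction shows up as a mismatch.
        if forward.setdefault(c, p) != p or backward.setdefault(p, c) != c:
            return False
    return True

CIPHERTEXT = "IOOKFSH_JFKFTBUV"
for candidate in ("FOOLISH_MILITARY", "FEELING_MILITARY", "FOOLISH_SOLDIERS"):
    print(candidate, is_consistent(CIPHERTEXT, candidate))
```

The first two candidates pass and the third fails, because it would need two different cipher symbols to stand for ‘S’. With only sixteen characters, many English phrases fit the same pattern of repeated symbols, so something else has to break the tie.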

Now, this is where contextual clues come into play. Seeing that the message was FOOLISH_MILITARY, it would appear that the message came from the enemies of the Armed Forces. In that context, FOOLISH_MILITARY makes way more sense to be a message from their enemies than FEELING_MILITARY. Moreover I don’t even know what it means to feel military. Sure, the frequency analysis returned it as the most probable plaintext but when we consider the context in which the message has come, the probability of the second most likely one being the true message increases.

Thus frequency analysis is not a magic fix to everything and we have to combine it with a contextual analysis as well to decipher the text. Sure, even this combination doesn’t always work, but even if we don’t recover the entire text, recovering a small portion is enough for many attackers. Even getting one character of your password is a huge security breach, and smart attackers can leverage this to extract even more information (which will be explained in further posts).

So monoalphabetic ciphers are definitely not good enough for the real world. But the tool used against them, frequency analysis, can be used against many more ciphers than just monoalphabetic ones, which is why we will devote an additional post to understanding it better.

*- I used the monoalphabetic substitution tool at dcode.fr to encode this entire blog post with Nancy’s alphabet and obtain a ciphertext. Then, while I could have used dcode.fr itself to decode it, I like quipqiup for simple monoalphabetic cipher decoding, so I copy-pasted the ciphertext into it. It returned the correct answer, and unlike with the two-word ciphertext example above, only one plaintext was returned. That is because with such a long ciphertext, the other possible alphabets it could have been encrypted with produce a high number of non-English words and are discarded, while only the correct one gives a meaningful plaintext.
