How to distinguish between gibberish and valid English text

Vaibhav Garg
Published in Analytics Vidhya
5 min read · Apr 14, 2021

The world of Text Fitness Functions

In the popular trope of hacker movies, there is sometimes an encrypted text that needs to be decrypted to save the day. The protagonist does some quick shenanigans on their computer, and out pops the decrypted text; at the very last moment, just in time to save the world.

It would not be wise to take these ideas seriously, as the technology shown in movies and shows is often way off the mark, laughably infeasible and overdramatized; but that is a topic for another day.

However, if you were to write a program to decrypt an encrypted piece of text, you definitely need to know when you are done. More specifically, the computer needs to know that! As humans, we can readily tell legitimate text from gibberish, but how do you transfer those discerning powers to a high-speed moron (your computer)?

This question had been lingering in the back of my head for many years when I stumbled upon this excellent video on the Computerphile channel on YouTube, which introduced me, as a tangent, to the world of text fitness functions.

According to Wikipedia, a fitness function is a particular type of objective function used to summarize, as a single figure of merit, how close a given design solution is to achieving the set aims.

Concretely, as applied to text, it summarizes how close the statistical properties of a particular piece of text are to those of typical text in the language under consideration. This is an entire field in itself, but I would like to introduce you to two fitness functions that I replicated in Python.

Index of Coincidence

From Wikipedia:

The index of coincidence provides a measure of how likely it is to draw two matching letters by randomly selecting two letters from a given text. The chance of drawing a given letter in the text is (number of times that letter appears / length of the text). The chance of drawing that same letter again (without replacement) is ((appearances − 1) / (text length − 1)). The product of these two values gives you the chance of drawing that letter twice in a row. One can find this product for each letter that appears in the text, then sum these products to get a chance of drawing two of a kind. This probability can then be normalized by multiplying it by some coefficient, typically 26 in English.

Expressed as a formula, summing over the letters of the alphabet:

IoC = c · Σ ni(ni − 1) / (N · (N − 1))

where N is the length of the text and n1 through nc are the frequencies (as integers) of the c letters of the alphabet (c = 26 for monocase English). The sum of the ni is necessarily N.

The important point is that the IoC for English text is close to 1.7, whereas the IoC for random text should be close to 1.
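As a quick sanity check of the "close to 1" figure: for long texts the IoC is approximately 26 · Σ pi², where pi is the relative frequency of letter i, and for uniformly random letters every pi is 1/26.

# for uniformly random letters, every p_i = 1/26, so the normalized IoC comes out to 1
expected_random_ioc = 26 * sum((1 / 26) ** 2 for _ in range(26))
print(expected_random_ioc)   # 1.0 (up to floating-point rounding)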

The first step is to get a random text and some valid text.
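These end up in two variables, randomtext and teststring, that are used below. How exactly they were obtained is not shown here, so the snippet that follows is just a minimal sketch, assuming the gibberish is uniformly sampled lowercase letters and the valid text is any reasonably long English passage.

import random
import string

# gibberish: uniformly random lowercase letters (one possible way to get "random text")
randomtext = "".join(random.choices(string.ascii_lowercase, k=10000))

# valid English: any sufficiently long passage will do; longer texts give more stable statistics
teststring = (
    "However, if you were to write a program to decrypt an encrypted piece of text, "
    "you definitely need to know when you are done."
)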

The text needs to be prepared for analysis. Preparation means stripping out all punctuation, spaces, digits, and special characters, and converting everything to a uniform case (lowercase here).

import string

def prep_str(iptext):
    """Lowercase the text and strip punctuation, whitespace, digits and special characters."""
    iptext = iptext.lower()

    # remove all ASCII punctuation
    for punc in string.punctuation:
        iptext = iptext.replace(punc, "")

    # remove characters that string.punctuation does not cover
    iptext = iptext.replace("’", "")
    iptext = iptext.replace(" ", "")
    iptext = iptext.replace("\n", "")
    iptext = iptext.replace("…", "")

    # remove digits
    for i in range(10):
        iptext = iptext.replace(str(i), "")

    return iptext
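For instance, running the helper on a short made-up string:

print(prep_str("Hello, World! It's 2021…"))   # -> helloworldits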

The IoC calculation below uses this helper function to prepare the text.

from collections import Counter

def calc_ioc(iptext):
    '''Index of coincidence: https://en.wikipedia.org/wiki/Index_of_coincidence'''
    iptext = prep_str(iptext)
    cnt = Counter(iptext)                # frequency of each letter
    total = 0
    N = len(iptext)
    for x in cnt:
        total += cnt[x] * (cnt[x] - 1)   # ways to draw this letter twice without replacement
    # normalize by N(N-1) and multiply by the alphabet size (26)
    return total * 26 / (N * (N - 1)), cnt

The results are in, and they are promising:

print(calc_ioc(randomtext)[0])
print(calc_ioc(teststring)[0])
1.0002113411341134
1.7642129827559332

This code can clearly distinguish between gibberish and English language text.

Quadgram Fitness Calculation

Based on the n-gram data released by Google Books under a permissive license, we can calculate the relative frequencies (and therefore, probabilities) of any n-letter sequence. For n = 4, the n-gram is called a quadgram.

Quoting from this link:

Having these frequencies, technically, we can estimate the probability to find given text in whole text corpus (which is a good candidate for fitness measure). For example, let our text be the word “MENTION”. It consists of the following quadgrams: MENT — ENTI — NTIO — TION. So,

P(MENTION) ≈ P(MENT) · P(ENTI) · P(NTIO) · P(TION)

The real problem here, however, is that the probabilities are quite small, so multiplying them quickly leads to even smaller values, introduces rounding errors, and is not really usable. The solution is well known: apply the logarithm function.
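To see why, here is a rough sketch with made-up quadgram probabilities (the numbers are placeholders, not real frequencies): the product shrinks towards the limits of floating point, while the sum of log10 values stays in a comfortable range and carries the same information.

import math

# hypothetical quadgram probabilities, purely for illustration
probs = [1e-4, 3e-5, 2e-5, 8e-5]                 # e.g. P(MENT), P(ENTI), P(NTIO), P(TION)

product = math.prod(probs)                        # 4.8e-18, and it shrinks fast for longer texts
log_sum = sum(math.log10(p) for p in probs)       # about -17.32, numerically well behaved

print(product, log_sum)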

I found a data file on GitHub that already contained the log10 probabilities of the quadgrams, and I read it in.

Here is what the contents of the file looked like: each line holds an upper-case quadgram and its log10 probability, separated by a comma.
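Roughly like this (the numbers below are placeholders, not the actual values from the file):

TION,-2.918
NTHE,-3.042
THER,-3.161
THAT,-3.288
OFTH,-3.301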

Uh Oh! These are in upper case, need to handle that...

Note that the log probabilities are all negative since the probability will always be between 0 and 1, and its log10 will be negative.

def quadgram_fitness(iptext):
    '''Average absolute log10 probability of the quadgrams in the text.'''
    a = prep_str(iptext)
    # slide a window of length 4 over the prepared text
    quadtext = [a[idx:idx + 4] for idx in range(len(a) - 3)]

    # load the quadgram -> log10 probability table
    quaddict = {}
    with open("quadgrams.txt") as f:
        for line in f:
            quad, logprob = line.split(",")
            quaddict[quad] = float(logprob)

    total = 0
    for quad in quadtext:
        # quadgrams not present in the table contribute 0
        total += quaddict.get(quad.upper(), 0)
    return abs(total) / len(quadtext)

Running this over the two strings above, we get:

print(quadgram_fitness(randomtext))
print(quadgram_fitness(teststring))  # the larger the better: this is the average absolute log likelihood per quadgram

0.5542850898467538
3.7183183502414807

Bingo!

Needless to say, there are many, many more text fitness functions that you can implement and play with. Depending on the use case, you can also consider a suitably weighted ensemble of fitness functions to give your predictions more confidence.
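As a rough illustration of that idea (the reference points and weights below are arbitrary choices for this sketch, not tuned values), the two functions defined above could be combined like this:

def looks_like_english(text, ioc_weight=0.5, quad_weight=0.5):
    """Weighted ensemble of the two fitness functions defined above."""
    ioc = calc_ioc(text)[0]
    quad = quadgram_fitness(text)
    # map each measure to a 0..1 score using rough reference points:
    # IoC is ~1.0 for gibberish and ~1.7 for English; quadgram fitness was ~0.55 vs ~3.7 above
    ioc_score = max(0.0, min(1.0, (ioc - 1.0) / 0.7))
    quad_score = max(0.0, min(1.0, (quad - 0.55) / (3.7 - 0.55)))
    return ioc_weight * ioc_score + quad_weight * quad_score

print(looks_like_english(randomtext))   # close to 0
print(looks_like_english(teststring))   # close to 1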

The code for the above experiments is fully self-contained and is available here:

The Quadgrams data file is available here:
