Meme stash part 2: OCR preprocessing

Péter Hatvani.
5 min read · Sep 25, 2022


In this part, OCR stands for Optical Character Recognition, a technique widely used to read text from codices, as mentioned in part 1.

So we start with the image:

WONGA MEME

Most memes use this distinct white font, and we want to deal with it. Inverting the colours makes the text close to readable for humans, but at the cost of leaving large solid surfaces behind. Those surfaces would not be a big deal if Tesseract, my OCR engine of choice, didn't throw a hissy fit over them and spew out garbage as a result.

First I wanted to fix the garbage characters. For the grayscale image I got:

MM Bou, YOU WROTE AN ARTICLE ||\n5 ABOUT MEMES? k\n. i\n\nay Bos\n\n2\n\nPLEASE, TELL ME MORE ABOUT HOW\nORIGINAL YOU ARE\n\n \n\x0c

The numbers and the newlines are uncalled for. Luckily, Tesseract lets its users pass custom settings, such as whitelisting characters. This is the setting I currently use. The OEM flag selects the character recognition engine: it uses the LSTM (Long Short-Term Memory, a type of machine learning algorithm) engine when available and falls back to the legacy one otherwise. The PSM flag orders Tesseract to find as much text as possible, for there are no safe assumptions to be made about where text will be on a meme or what format it will be in.

"--oem 3 --psm 11 -c tessedit_char_whitelist='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

Next up, we try to get rid of the surfaces to make Tesseract's life easier.

Binary inverse thresholding with morphological opening

When I saw the white text and the vibrant background, my first thought was: simply invert the colours with a threshold and we are done. That method did not work out as planned; the text was too fragmented to gather meaningful tags from it. Naturally, I searched for image processing algorithms. The first ones I found were the morphological transformations: they are used to find shapes in images.

A transformation is made up of a kernel and an image. The kernel is a moving grid that iterates over the image during the transformation, usually thresholding the pixels, thus implementing a dynamic thresholding solution. There are many transformations, most notably erosion, dilation and their combinations, such as opening and closing. You can read more about transformations on this OpenCV reference page.
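A minimal sketch of this step, assuming OpenCV's Python API (the threshold value and kernel size are illustrative, not the post's exact parameters):

import cv2

gray = cv2.imread("meme.png", cv2.IMREAD_GRAYSCALE)

# Inverse threshold: bright pixels (the white font) become black on white.
_, binary = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY_INV)

# Opening (erosion followed by dilation) removes small specks of noise.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)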

Binary inversion with Otsu’s method

Another powerful preprocessing step is Otsu's method. This automatic thresholding technique establishes a single global intensity threshold that distinguishes between foreground and background. Both of these methods, along with the Gaussian one below, work on images converted to grayscale, removing the colour and leaving only the intensity.
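A sketch of binary inversion with Otsu's method, under the same OpenCV assumptions as above:

import cv2

gray = cv2.imread("meme.png", cv2.IMREAD_GRAYSCALE)

# With THRESH_OTSU the threshold argument (0 here) is ignored;
# Otsu's method computes the optimal value and returns it.
otsu_value, binary = cv2.threshold(
    gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU
)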

Gaussian adaptive thresholding

Gaussian adaptive thresholding, making an educated guess from the name, operates on the image with a predetermined kernel. For every pixel, the technique computes the weighted sum of the neighbourhood covered by the kernel, cross-correlated with a Gaussian window function, and subtracts a constant from it to obtain the local threshold.
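A sketch of the corresponding OpenCV call, with illustrative blockSize and C values:

import cv2

gray = cv2.imread("meme.png", cv2.IMREAD_GRAYSCALE)

# blockSize is the kernel size (must be odd); C is the constant subtracted
# from the Gaussian-weighted neighbourhood sum.
binary = cv2.adaptiveThreshold(
    gray,
    255,                             # value assigned to "on" pixels
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,  # Gaussian-weighted local threshold
    cv2.THRESH_BINARY_INV,           # keep the inverted polarity
    31,                              # blockSize
    10,                              # C
)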

Gaussian adaptive thresholding with Otsu

The Gaussian thresholding was promising, but the breakthrough happened when it was combined with Otsu’s method.
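The post doesn't spell out how the two were combined; one common recipe that fits the description, sketched here purely as an assumption, is Gaussian smoothing followed by Otsu's global threshold:

import cv2

# Smooth first, then let Otsu pick the global threshold on the blurred image.
gray = cv2.imread("meme.png", cv2.IMREAD_GRAYSCALE)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)
_, binary = cv2.threshold(
    blurred, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU
)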

Stroke width transformation for grayscale text

After a bit of looking around, I found another method that yields great results. Unfortunately, this transformation, the stroke width transformation, excels on images with black text and a white background; see the previous link for amazing results. With this, I had three methods for the bright-background, white-text case and one for the white-background, black-text case.

Stroke width for black and white

Merging the found texts

After replacing the newline characters with spaces and stripping the strings, we get both separate words and words written together from the preprocessed images. I used a funnel: every transformed/preprocessed image went through OCR. After gathering all the strings from the process, the basic text processing steps took place: stripping whitespace, splitting words and splitting words written together.
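A sketch of what such a funnel might look like; the function name and structure are illustrative, not the post's actual code:

import pytesseract

def ocr_funnel(variants, config):
    # Run OCR on every preprocessed variant and pool the cleaned words.
    words = set()
    for image in variants:
        raw = pytesseract.image_to_string(image, config=config)
        cleaned = raw.replace("\n", " ").strip()
        words.update(cleaned.split())
    return words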

Splitting words written together

This was one of the hardest challenges so far. I tried searching for real English words from corpora: I used nltk's brown and words corpora, then wordninja, and lastly enchant.

The first try: my hypothesis was that, if you have a million words in a corpus, you can choose the 10,000 most common ones that are not stopwords (the most common English words that are not useful in searching, e.g. a, an, he, she). This hypothesis soon failed, as the search for valid words returned single letters and short incomprehensible strings that WERE part of the corpus but don't qualify as valid words. For example, the possessive <'s> is considered a word in the corpus, and it was found many times for every image.
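A sketch of that first attempt, assuming nltk's brown corpus and stopword list:

from collections import Counter
from nltk.corpus import brown, stopwords

stop = set(stopwords.words("english"))
counts = Counter(w.lower() for w in brown.words() if w.lower() not in stop)
common = {word for word, _ in counts.most_common(10_000)}

# The flaw described above: single letters and possessive markers survive
# this filter, because the corpus counts them as words.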

Second run: wordninja is a statistical word-slicing tool. It comes with a corpus that can be edited/expanded, but the words in its dictionary have to be ordered in proportion to the probability of their occurrence. This was the fastest solution, but the dictionary's word preferences were a poor fit for the memes' language, and thus words like enlightenment, which can be a keyword, were not found.
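A quick illustration of wordninja's interface; the split is driven purely by the frequencies in its bundled dictionary, so the exact output may vary between versions:

import wordninja

print(wordninja.split("tellmemorehoworiginalyouare"))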

Final solution: using the spellchecking module pyenchant. I use this tool with a little modification. To find the longest valid word starting at each position, I iterate from the end of the string towards that position; when a valid word is found, I add it and continue the search from the end of that word.

def _split_long_string(long: str, dictionary) -> set[str]:
    """Greedily pull the longest dictionary words out of a run-together string."""
    length = len(long)
    words: set[str] = set()
    longest = 0
    for i in range(0, length - 2):
        if i < longest:
            # Skip positions inside the word we already extracted.
            continue
        for j in range(length, i + 1, -1):
            # Try the longest candidate first, down to two characters.
            if dictionary.check(long[i:j]):
                words.add(long[i:j])
                longest = j
                break
    return words
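A usage sketch, assuming pyenchant's US English dictionary is installed; the exact words extracted depend on the dictionary's contents:

import enchant

dictionary = enchant.Dict("en_US")
print(_split_long_string("pleasetellmemore", dictionary))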

At the end of the module, I collect every word into one HashSet and search for synonyms for each of them to broaden the tags. Synonyms are found in WordNet.
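A sketch of how WordNet synonyms can be gathered through nltk; the function name and the example tags are illustrative:

from nltk.corpus import wordnet

def synonyms(word: str) -> set[str]:
    # Collect every lemma from every synset of the word.
    found = set()
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            found.add(lemma.name().replace("_", " "))
    return found

tags = {"original", "article"}
broadened = set(tags)
for tag in tags:
    broadened |= synonyms(tag)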

In the next part, I will describe the object detection and face/emotion detection steps of the tagging process.
