Voila! Getting rid of Tesseract Failure Cases!

Train2Test · Published in The Startup · Jun 13, 2020 · 6 min read

Anyone who ever wanted to perform Optical Character Recognition (OCR) must have heard about Tesseract.

Why? Well, because: one, it was developed by Google; two, it is state of the art; and what's more? It's free!

So why are we even writing this post? Why are you guys even reading it? Well, that’s because even Tesseract has some major drawbacks.

Tesseract works at its best when the images are of a document or are organized as a document.

Here's what we mean by that. Tesseract thrives when there is a constant font size/type, constant line/word spacing, and no special characters: all of these make an ideal scenario for Tesseract. But OCR often needs to be performed on images that aren't so well organized: multiple fonts and font sizes, unequal line spacing, special characters, etc. This is where Tesseract might fail.

The Standard Approach: EAST API along with Tesseract

Text Detection: EAST (Efficient and Accurate Scene Text Detector) is one of the best text detection algorithms available to date. Trained on thousands of images, it is known to yield fast and accurate text detection in natural scenes. It creates a bounding box around each piece of detected text.

Text Recognition: Pytesseract is a Python wrapper for Google's Tesseract-OCR Engine. It translates the text in an image into characters.

Adrian Rosebrock has written a beautiful tutorial on how to implement EAST along with Tesseract.

We implemented this combination on some images. Sample Test Results:

[Four sample results: each original image shown alongside its EAST+Tesseract detection]

Quite evidently, if there is less noise in our images, or if the font type and size are consistent, EAST+Tesseract works well (first two images).

But as you can see, EAST+Tesseract gives gibberish results for the last two images.

While playing around with more such sample images, we noticed that whenever the bounding box was detected correctly, the text was recognized correctly.
In other words, Tesseract worked where EAST worked, and failed where EAST failed.

We thought of completely eliminating the bounding box from the equation, i.e., eliminating EAST.

That means now we were left with two options: either find another deep learning approach to detect text or proceed without the bounding boxes.
For the time being, we decided to move forward with the latter.

Tweaking Tesseract

In another blog of his, Adrian Rosebrock has explained how to use just Tesseract for text recognition. Dive right in if you haven’t set up Tesseract on your system yet!

The pytesseract command looks something like this:

The parameters under “config” can be modified in accordance with the use case.
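The actual command appeared as an image in the original post; here is a minimal sketch of what a pytesseract call looks like (assuming pytesseract and Pillow are installed and the tesseract binary is on your PATH; the helper names and the image path are ours, not from the post):

```python
def build_config(oem=3, psm=6):
    """Assemble the Tesseract config string: OCR engine mode + page segmentation mode."""
    return f"--oem {oem} --psm {psm}"

def ocr_image(path, lang="eng", oem=3, psm=6):
    """Run Tesseract on a single image and return the recognized text."""
    # Imported lazily so build_config works even before Tesseract is set up.
    import pytesseract
    from PIL import Image
    return pytesseract.image_to_string(Image.open(path), lang=lang,
                                       config=build_config(oem, psm))

# Hypothetical usage:
# text = ocr_image("receipt.png", lang="eng", oem=1, psm=6)
```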

Thumb Rules to take care of while working with Tesseract:

  • Language
    Tesseract can detect over 130 languages and over 35 scripts. Remember to specify the language you want to detect in the config command; if none is specified, English is assumed. Multiple languages may be specified, separated by plus signs. Tesseract uses 3-character ISO 639-2 language codes.
  • OCR Engine Modes
    Tesseract has several engine modes with different performance and speed. Tesseract 4.0 introduced an additional LSTM neural-net mode, which often works best. Make sure you're using this mode!
OEM Values
  • Page Segmentation Mode
    Know the data on which you wish to test, and set Tesseract to run only a subset of layout analysis, assuming a certain form of image.
    Trust us, this simple tweak really improves the results!
PSM Values
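Putting the language rule above into code, a multi-language run might look like this (a hypothetical sketch; it assumes pytesseract is set up and the relevant traineddata files, e.g. eng and hin, are installed):

```python
def make_lang_spec(*codes):
    """Join ISO 639-2 language codes the way Tesseract expects, e.g. 'eng+hin'."""
    return "+".join(codes)

# Hypothetical usage:
# import pytesseract
# from PIL import Image
# text = pytesseract.image_to_string(Image.open("bilingual.png"),
#                                    lang=make_lang_spec("eng", "hin"),
#                                    config="--oem 1 --psm 6")
```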

Digits Only
Do all your images consist of only digits? You can make Tesseract detect just that with this code:
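The code screenshot from the original post is lost; one common way to do this, assuming pytesseract is set up, is a character whitelist passed through the config string:

```python
# Restrict Tesseract's output alphabet to the digits 0-9.
DIGITS_CONFIG = "--oem 3 --psm 6 -c tessedit_char_whitelist=0123456789"

def ocr_digits(path):
    """Run Tesseract with the output alphabet restricted to digits."""
    # Imported lazily: requires the tesseract binary on PATH.
    import pytesseract
    from PIL import Image
    return pytesseract.image_to_string(Image.open(path), config=DIGITS_CONFIG)
```

One caveat: early Tesseract 4.x LSTM builds ignored the whitelist, so on those versions you may need the legacy engine (`--oem 0`) for it to take effect.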

Though our use case was just digits, this modification didn't seem to do wonders for the results.

While trying the images on just Tesseract, it showed promising results early on. Images that were not detected at all by Tesseract+EAST were at least partially detected by Tesseract alone. It was certainly a better approach :)

Tweaking the Images

But the good times didn’t last long.
We were able to spot inconsistencies in this approach too. So, in the search for better results, we decided to use some image pre-processing techniques:

  • Binarization
  • Padding

Padding

In some images, numbers covered the majority of the frame, as in the image given below. So, to bring the numbers toward the center of the image, we decided to use padding. It did improve performance, but the amount of padding needed on each image (for the image to be detected) was variable.

Results after Padding
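The padding step above can be sketched with plain NumPy (a tiny toy array stands in for the real images; a white border is assumed since the samples have light backgrounds):

```python
import numpy as np

def pad_image(img, pad, value=255):
    """Add a constant border (default white) around a grayscale image so the
    digits sit nearer the center of the frame."""
    return np.pad(img, pad_width=pad, mode="constant", constant_values=value)

# A 2x2 "image" grows to 6x6 with a 2-pixel white border on every side:
tiny = np.zeros((2, 2), dtype=np.uint8)
padded = pad_image(tiny, 2)
```

The variable-padding problem the post mentions shows up here as the `pad` argument: the right value differs per image.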

Binarization

We decided to use binarization as it would increase the contrast between the text and the background.
How is that possible?
By converting our color images to black and white images.

Skimage has more than 10 types of binarization. You can try out a couple and choose the one that best fits your use case.
(Binarization works best on grayscale images, so first and foremost convert your image to grayscale.)
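As a minimal sketch of the idea, here is a simple global threshold in plain NumPy; `skimage.filters` (e.g. `threshold_otsu`, `threshold_local`) provides the smarter variants mentioned above:

```python
import numpy as np

def binarize(gray, threshold=None):
    """Global-threshold binarization: pixels above the threshold become white
    (255), the rest black (0). With no threshold given, the image mean is used;
    swap in skimage.filters.threshold_otsu(gray) for a smarter choice."""
    if threshold is None:
        threshold = gray.mean()
    return np.where(gray > threshold, 255, 0).astype(np.uint8)

# A toy "gradient" image splits cleanly around its mean (112.5):
gradient = np.array([[10, 20, 200, 220]], dtype=np.uint8)
bw = binarize(gradient)
```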

Take a look at all of them applied to a sample image:

Binarizations

A binarization algorithm that works on one image might not work on another. We really recommend trying each type of binarization on your own use case.

Global Threshold on Sample Image 1
Global Threshold on Sample Image 2

Quite evidently, the global threshold works on the first image but not on the second. It all depends on how your original colored image is.

The results are slightly better in some cases:

Without Binarization
With Binarization

Quite evidently, the main price, $5.99, wasn't detected in either case below, most probably because of the unequal font size. "$3.00 per L", despite being in a small font, was accurately detected in the binarized version!

Top: Original Bottom: Padded+Binarized

But again, binarization doesn't always work:

Conclusion

Though Tesseract has built-in commands to preprocess images, carrying out binarization and padding before sending them to the model sometimes gives improved results.

But again, these methods aren't completely foolproof.
Perhaps there is no way to make pre-trained Tesseract (with or without EAST) give consistent results on noisy images.

Have you experienced similar issues with Tesseract? Have you come up with your own hack of dealing with it?

We’d be happy to hear about it in the comments!

This blog was written by Nikita and Naitik.
