Tesseract Assisting a Smart OCR

Published in

DiveDeepAI

6 min read · Jul 1, 2021

What is Optical Character Recognition (OCR)?

OCR, or Optical Character Recognition, is the process of recognizing text inside images and converting it into a machine-readable form. The images may contain printed text such as documents, handwritten content, receipts, screenshots, name cards, and so on.

OCR has two parts. The first is text detection, where the region of the image that contains text is located. This localization matters for the second part, text recognition, where the text is extracted from the image. Using the two techniques together is how you extract text from any image. Nothing is perfect, and OCR is no exception. However, with the advent of deep learning and artificial intelligence, it has become possible to build better and more general solutions to this problem.

Techniques for text detection before the Deep Learning Era

a. SWT (Stroke Width Transform)

b. MSER (Maximally Stable Extremal Regions)

Neither technique gave clean results. With the first, boxes were drawn around areas of the image that contained no text. With the second, the text was not detected properly.

Using OpenCV for text Detection

Text detection with OpenCV is the classic way of doing things. You can apply techniques such as blurring, thresholding, morphological operations, and resizing to clean up the image.

The results were obtained with minimal preprocessing and contour detection, followed by text recognition. The contours did not pick out the text every time. Text detection with OpenCV is also tedious work that requires a lot of parameter tuning, and it does not generalize well.

Tesseract OCR

Tesseract is an open-source OCR engine and one of the best free OCR engines available today. It is released under the Apache License. The most recent version at the time of writing (v4) uses artificial intelligence for text recognition: it is built on an LSTM (Long Short-Term Memory) model, a type of neural network. It currently supports text recognition in more than 100 languages.

Tesseract exposes a C++ API and can be called from Python through wrappers. We can use pytesseract to run OCR on images, and the output can then be stored in a text file. Pytesseract is a Python wrapper for the Tesseract-OCR engine, and it is what we use to call Tesseract from Python code.

Tesseract is generally used with Pytesseract (Python wrapper for Tesseract OCR Engine) and OpenCV (Open-source Computer Vision Library).
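The basic workflow described in this section can be sketched as follows, assuming the Tesseract binary and the `pytesseract` and `Pillow` packages are installed. The helper names `run_tesseract` and `ocr_to_text_file` are hypothetical. The `ocr` parameter exists only so that the OCR step can be swapped out; by default it calls `pytesseract.image_to_string`.

```python
from pathlib import Path


def run_tesseract(image_path, lang="eng"):
    """Call Tesseract through pytesseract and return the recognized text."""
    # Imported here so the module loads even where Tesseract is not installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path), lang=lang)


def ocr_to_text_file(image_path, out_path, ocr=run_tesseract):
    """Run OCR on one image, save the text to out_path, and return it."""
    text = ocr(image_path)
    Path(out_path).write_text(text, encoding="utf-8")
    return text
```

With an image on disk, a call such as `ocr_to_text_file("invoice.png", "invoice.txt")` would write the recognized text next to it (both file names are placeholders).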

Obtain digitized data using Tesseract

Some image processing may be needed to get the best out of Tesseract. Depending on the kind of images provided, we may need to deskew the image, remove noise, convert it to grayscale, or adjust brightness and contrast.

For reliable reading, the image should have a resolution of at least 300 dpi. The ‘image_to_string()’ and ‘image_to_data()’ functions work well when the smallest letter in the image is around 20 pixels high. The output of ‘image_to_data()’ includes a column with the height of each word that was read. If the words are too small or too large, the image should be scaled up or down so that the median word height is close to 20 pixels, which gives more accurate text.

Edge-detection algorithms may also be needed to find the edges of words or characters so that Tesseract can interpret the image more accurately. These image processing techniques fall into different categories, and the OpenCV library provides APIs for all of the functions mentioned above.
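The rescaling rule can be sketched as a small helper that reads the `height` column and returns a resize factor. It assumes the input is a dictionary shaped like the result of `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`; the function name `scale_for_target_height` is hypothetical.

```python
from statistics import median


def scale_for_target_height(data, target_height=20):
    """Return the resize factor that brings the median word height to target.

    `data` maps column names ("text", "height", ...) to equal-length lists.
    """
    # Rows with empty text describe pages, blocks, and lines, not words.
    heights = [
        h for h, text in zip(data["height"], data["text"]) if str(text).strip()
    ]
    if not heights:
        return 1.0  # nothing recognized; leave the image unchanged
    return target_height / median(heights)
```

The factor could then be applied with `cv2.resize(image, None, fx=factor, fy=factor, interpolation=cv2.INTER_CUBIC)` before running OCR a second time.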

Challenges for block/column-based data

Tesseract cannot always detect text correctly across blocks and columns. It always tries to join text across blocks that are placed far apart. If the image is skewed, contains scattered, discontinuous text, or mixes fonts and font sizes, as on bills or medical reports where the company or clinic name is printed in a larger font than the body, the output of the ‘image_to_string()’ function comes out misaligned and jumbled.

Solution

In such cases, use both the ‘image_to_string()’ function and the ‘image_to_data()’ function to get the right data. The method is to take the top and left coordinates from the ‘image_to_data()’ output, match them with the first few words of each line in the ‘image_to_string()’ output, and then realign the sentences in the ‘image_to_string()’ output according to the top values of the words in the ‘image_to_data()’ output. This little trick helps untangle the output when the image contains blank columns.
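The sketch below shows a simplified variant of this idea rather than the exact matching procedure: it rebuilds lines directly from the word coordinates, grouping words whose `top` values lie within a tolerance of each other and ordering each group by `left`. The input is assumed to be a dictionary shaped like `pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)`; the function name `rebuild_lines` and the 10-pixel tolerance are assumptions.

```python
def rebuild_lines(data, tolerance=10):
    """Group recognized words into visual rows using top/left coordinates."""
    words = [
        (top, left, text.strip())
        for top, left, text in zip(data["top"], data["left"], data["text"])
        if text.strip()
    ]
    words.sort()  # by top, then by left
    rows = []  # each entry: [reference_top, [(left, text), ...]]
    for top, left, text in words:
        # A word joins the current row if it sits at roughly the same height.
        if rows and abs(top - rows[-1][0]) <= tolerance:
            rows[-1][1].append((left, text))
        else:
            rows.append([top, [(left, text)]])
    # Within a row, read the words from left to right.
    return [" ".join(t for _, t in sorted(row)) for _, row in rows]
```

Because rows are formed from coordinates alone, a label in the left column and its value in a distant right column end up on the same output line even when a blank column separates them.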

Tesseract and its Page Segmentation Modes

Tesseract provides several modes for running OCR on small regions or blocks only, or on text in various orientations. The command-line argument ‘--psm’ selects the page segmentation mode.

A list of the page segmentation modes (PSM) supported by Tesseract, with the number passed to ‘--psm’:

0: Orientation and Script Detection (OSD) only
1: Automatic page segmentation with Orientation and Script Detection (OSD)
2: Automatic page segmentation, but no OSD or OCR
4: Assuming a single column of text of variable sizes
5: Assuming a single uniform block of vertically aligned text
6: Assuming a single uniform block of text
7: The image is treated as a single text line
8: The image is treated as a single word
9: Treating the image as a single word in a circle
10: Treating the image as a single character
11: Sparse text. Find as much text as possible in no particular order
12: Sparse text with OSD
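Each mode is selected by an integer passed to `--psm`; in Tesseract 4 the valid range is 0 to 13 (for example, 6 assumes a single uniform block of text and 7 a single text line). When Tesseract is called through pytesseract, the flag goes into the `config` string. The helper below is a hypothetical convenience for building that string, not part of either library.

```python
def tesseract_config(psm, oem=None):
    """Build the config string that pytesseract forwards to the tesseract CLI."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    parts = [f"--psm {psm}"]
    if oem is not None:
        # --oem selects the OCR engine mode (for example, 1 for LSTM only).
        parts.insert(0, f"--oem {oem}")
    return " ".join(parts)
```

A call would then look like `pytesseract.image_to_string(image, config=tesseract_config(6))`.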

Tesseract’s limitations

As a free tool, it is not as accurate as some commercial solutions currently on the market (Amazon Rekognition, Google Vision API), which is why we use the Google Vision API alongside it.
Dark or heavily shaded images are hard to read.
If the words are smaller than 10 pixels, the accuracy of the recognized letters and words will be low.
The accuracy of handwriting recognition is very low, almost unacceptable for most handwriting types.
If the noise level is high, it may recognize the noise as ASCII characters.
It may fail to read images with a lot of artifacts.
It may fail to read words or characters that are crossed by lines, as in many bills and reports.
Tesseract is not always able to read text across columns or blocks. It will always try to join text across blocks or columns placed far apart.
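Some of these failure modes can be caught after the fact by inspecting the per-word output. The sketch below drops words that are shorter than 10 pixels or that Tesseract reports with low confidence, which is often how noise read as stray characters shows up. It assumes a dictionary shaped like `pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)`; the function name `reliable_words` and the confidence threshold of 60 are assumptions, not recommended values.

```python
def reliable_words(data, min_conf=60.0, min_height=10):
    """Keep words that are tall enough and above a confidence threshold."""
    kept = []
    for text, conf, height in zip(data["text"], data["conf"], data["height"]):
        # conf is -1 for non-word rows; some versions return it as a string.
        if text.strip() and float(conf) >= min_conf and height >= min_height:
            kept.append(text.strip())
    return kept
```

Words that fail the check are candidates for a second pass, for example after rescaling the image or by sending the region to another OCR service.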

Google Cloud Vision API

This is one of the most popular cloud-based technologies available today, and it returns some of the most accurate results. Google Vision offers much more than OCR and is more than an image processing framework. Besides OCR, it also provides object detection, logo detection, face detection, landmark detection, image search, violence detection, nudity detection, sentiment detection, and more.

Building a smart OCR App using Tesseract

DiveDeepAI developed an intelligent web application: a new document transcription service that uses Tesseract OCR to transcribe data from scanned image files and process that data. It can recognize and transcribe 95% or more of all form text. It supports a variety of form structures and is flexible enough to interface easily with customers' existing systems, such as databases and Excel spreadsheets. The website can also track usage by user and by company, with timestamps.

This makes it possible to scale the service in the future and to estimate the value it provides to customers. The smart app also blocks unauthorized users: every attempt to access the software is subject to user identification and password control. If the organization's license has expired, the system prevents the user from accessing the software's functionality.

Tesseract OCR, an application that assists text recognition and digitization, is proof that advances in technology are helping humankind in all fields of life.