How does Tesseract-OCR work with Python?

latif vardar
10 min read · Feb 21, 2020


This article is a guide for you to recognize characters from images using Tesseract OCR, OpenCV and Python.

OCR

Optical Character Recognition (OCR) recognizes text inside images, such as scanned documents and photos, and converts any kind of image containing written text into machine-readable text data.

Early versions needed to be trained with images of each character and worked on one font at a time. Advanced systems capable of a high degree of recognition accuracy for most fonts are now common, with support for a variety of digital image file formats as input.

Many optical character recognition tools are available, such as Tesseract, OCRopus, Ocular, and SwiftOCR. There are no vast quality differences between them.

In this tutorial, we will focus on Tesseract OCR.

Tesseract OCR

Tesseract is an open source text recognition (OCR) Engine, available under the Apache 2.0 license. It can be used directly or by using an API to extract text from images. It supports a wide variety of languages. Tesseract is compatible with many programming languages and frameworks through wrappers that can be found here. It can be used with the existing layout analysis to recognize text within a large document, or it can be used in conjunction with an external text detector to recognize text from an image of a single text line.

Tesseract 4.00 includes a new neural network subsystem configured as a text line recognizer. The neural network system in Tesseract pre-dates TensorFlow but is compatible with it, as there is a network description language called Variable Graph Specification Language (VGSL), that is also available for TensorFlow.

To recognize an image containing a single character, we typically use a Convolutional Neural Network (CNN). Text of arbitrary length is a sequence of characters, and such problems are solved using Recurrent Neural Networks (RNNs); LSTM is a popular form of RNN. You can find more detailed information about LSTM via this link.

OpenCV

OpenCV (Open source computer vision) is a library of programming functions mainly aimed at real-time computer vision. Originally developed by Intel, it was later supported by Willow Garage then Itseez (which was later acquired by Intel). The library is cross-platform and free for use under the open-source BSD license.

OpenCV supports the deep learning frameworks TensorFlow, Torch/PyTorch and Caffe.

1. Installation

Tesseract can be installed in different ways. In this chapter, we will install the requirements via pip on Windows. You can check the required steps via these links ([1] and [2]); they will help you while installing Tesseract, and they also cover installation on other operating systems.

Tesseract 4.1.1 will be used in this article.

GitHub Files

Firstly, download all the files at this link to your computer and extract them to a directory in an easily accessible location. Remember that the test images will be saved to the same directory.

Python

Python version 3.6 or 3.7 should be installed on your computer. Python 3.7 was used for this tutorial.

Python Modules

Depending on your Python version and operating system, install the required packages via pip. First of all, install the Python wrapper for Tesseract.
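The wrapper used throughout this tutorial is pytesseract, which can be installed from PyPI:

```shell
pip install pytesseract
```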

The Tesseract library ships with a handy command-line tool called tesseract. We can use this tool to perform OCR on images, and the output can be stored in a text file. If you would like to integrate Tesseract into your C++ or Python code, you should use Tesseract's API.

Our modules and their versions are: tensorflow 1.13.0rc0, opencv-python, matplotlib, and shapely.

When a specific version is required, we will install a module with a pinned pip command:

Example:
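For instance, pinning the TensorFlow version used in this article would look like this:

```shell
pip install tensorflow==1.13.0rc0
```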

When the version is not important, we will simply run:

Example:
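For instance, the remaining modules can be installed without pinned versions (note that the OpenCV package on PyPI is named opencv-python):

```shell
pip install opencv-python matplotlib shapely
```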

Finally, the Pillow module (a fork of the Python Imaging Library) should be installed for image processing with the following command:
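A single pip command installs it:

```shell
pip install Pillow
```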

Language Data Files:

OCR language data files contain pretrained language data for the OCR engine. They are required during the initialization of the API call. There are three types of data files: tessdata (standard), tessdata_best (best accuracy, needed for retraining), and tessdata_fast (fastest).

We will be using the standard tessdata. The standard model works with Tesseract 4.0.0 and contains both the legacy engine (--oem 0) and the LSTM neural-network-based engine (--oem 1). You can use the link to download the tessdata files; after downloading, place them in your project directory.

2. Preparing Test Images

Saving Images:

There are multiple ways to get test images efficiently. The one we will use in this tutorial is taking photographs and then filtering each photograph with 10 different filters, thereby multiplying the number of test images.

“If you have more test data, your machine learns faster.”

Finally, the test images should be saved to the directory of your project.

Adjusting Images and Creating Box Files:

The test images should be given a suitable shape to improve the accuracy.

Please check out the link for different methods of improving image quality.

By default, Tesseract expects a page of text when it segments an image. If you're just seeking to OCR a small region, try a different segmentation mode using the --psm argument.

There are several ways a page of text can be analyzed. The Tesseract API provides 14 page segmentation modes, useful if you want to run OCR on only a small region, in different orientations, etc.

These page segmentation modes are:

0: Orientation and script detection (OSD) only.
1: Automatic page segmentation with OSD.
2: Automatic page segmentation, but no OSD, or OCR.
3: Fully automatic page segmentation, but no OSD. (Default)
4: Assume a single column of text of variable sizes.
5: Assume a single uniform block of vertically aligned text.
6: Assume a single uniform block of text.
7: Treat the image as a single text line.
8: Treat the image as a single word.
9: Treat the image as a single word in a circle.
10: Treat the image as a single character.
11: Sparse text. Find as much text as possible in no particular order.
12: Sparse text with OSD.
13: Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.

To change your page segmentation mode, change the --psm argument in your custom config string to any of the above mode codes.

There is also one more important argument, the OCR engine mode (oem). Tesseract 4 has two OCR engines, the legacy Tesseract engine and the LSTM engine, and there are four modes of operation chosen with the --oem option:

0: Legacy engine only.
1: Neural nets LSTM engine only.
2: Legacy + LSTM engines.
3: Default, based on what is available.

-Pre-processing for Tesseract

To get the best results and ideal accuracy with Tesseract, you need to pre-process the images. This includes thresholding, erosion, deskewing, etc.

We will use OpenCV to implement the pre-processing. You can visit the link for more details.

-Whitelisting Characters

If you would like to detect certain characters in the images and ignore the others, you can create your whitelist by using the following config.
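A minimal sketch using the pytesseract wrapper; the image file name is hypothetical, and the actual OCR call is shown commented out because it requires the Tesseract binary to be installed:

```python
# Whitelist config: only the listed characters may appear in the output.
# '--psm 6' assumes a single uniform block of text.
custom_config = r'--psm 6 -c tessedit_char_whitelist=0123456789'

# import pytesseract
# text = pytesseract.image_to_string('test_image.png', config=custom_config)
```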

-Blacklisting Characters

If you would not like some characters to appear in your output, you can create a blacklist by using the following config.
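The same sketch with a blacklist instead; here we exclude the digit '2', which (as discussed in the conclusion) is useful when '2' cannot occur in our passwords. The image name is again hypothetical:

```python
# Blacklist config: the listed characters will never appear in the output.
custom_config = r'--psm 6 -c tessedit_char_blacklist=2'

# import pytesseract
# text = pytesseract.image_to_string('test_image.png', config=custom_config)
```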

After resizing and adjusting the images, we should create box files. Tesseract needs a 'box' file to go with each training image. The box file is a text file that lists the characters in the training image, in order, one per line, each with the coordinates of its bounding box.

Open up the terminal and type the command line for each of your training images.

For example:
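Assuming a hypothetical training image named lang.timesfont.exp0.tif (the lang.font.expN pattern is Tesseract's training-file naming convention; the font name timesfont is made up for illustration), the command would be:

```shell
tesseract lang.timesfont.exp0.tif lang.timesfont.exp0 batch.nochop makebox
```

This writes lang.timesfont.exp0.box next to the image.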

Correcting the Box Files:

Now, we have to correct the box files.

Remember that not every photo will be taken properly, and the positioning of the characters can be very hard to guess.

There are some tools to help us. One of them, and the most useful, is jTessBoxEditor. Thanks to this box editor, we can correct our characters manually.

You can get jTessBoxEditor via this link.

3. Training

It is time to train our data. We will run the following command for each of our TIF/box pairs.

For example:
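Assuming a hypothetical training image lang.timesfont.exp0.tif with its box file alongside, the training command for one pair would be:

```shell
tesseract lang.timesfont.exp0.tif lang.timesfont.exp0 box.train
```

This produces a lang.timesfont.exp0.tr feature file used by the later steps.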

4. Creating Our Own Language

Creating the language requires a few steps to be completed. These are:

Creating the unicharset files:

Tesseract’s unicharset file contains information on each symbol (unichar) the Tesseract OCR engine is trained to recognize.

Run the following command with each of the box files as a parameter.
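A hypothetical invocation for one box file (unicharset_extractor writes the unicharset file to the current directory; the file name follows the made-up timesfont example):

```shell
unicharset_extractor lang.timesfont.exp0.box
```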

You can get unicharset files from GitHub via this link.

Creating the font_properties file:

Firstly, we create a new file in the directory and name it lang.font_properties. In this file, we create a row for each font we used in the training files.

Each row starts with the name of the font, followed by a 0 or 1 value for each of the possible font properties (italic, bold, fixed, serif, fraktur).

For example:
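A hypothetical row for a serif training font named timesfont, following the `<fontname> <italic> <bold> <fixed> <serif> <fraktur>` format:

```
timesfont 0 0 0 1 0
```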

Clustering:

After that, we cluster the features of the trained fonts according to their shapes.

Please enter the following commands in the terminal:
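A hypothetical sequence, assuming a language named lang and a training font named timesfont (shapeclustering writes the shapetable file; mftraining then produces inttemp and pffmtable):

```shell
shapeclustering -F lang.font_properties -U unicharset lang.timesfont.exp0.tr
mftraining -F lang.font_properties -U unicharset -O lang.unicharset lang.timesfont.exp0.tr
```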

Normproto:

After we run the following command in the terminal, our language is ready.
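Assuming a hypothetical training file lang.timesfont.exp0.tr, cntraining reads it and writes the normproto file:

```shell
cntraining lang.timesfont.exp0.tr
```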

5. Combining

Before the combining step, we should check our language files: they should all have the language prefix (e.g. lang.unicharset, lang.inttemp, lang.normproto, lang.pffmtable, lang.shapetable).

Now we run the final command in the terminal, which combines everything into a single traineddata file.
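A hypothetical final sequence for a language named lang; the rename step gives every training output the language prefix that combine_tessdata expects (note the trailing dot in the last command):

```shell
mv inttemp lang.inttemp
mv normproto lang.normproto
mv pffmtable lang.pffmtable
mv shapetable lang.shapetable
combine_tessdata lang.
```

The output is lang.traineddata.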

We should move our newly combined language file into the tessdata directory; Tesseract-OCR needs it there to recognize our characters.

Limitations of Tesseract

* It works best when there is a clean segmentation of the foreground text from the background.

* The better the image quality (size, contrast, lighting), the better the recognition result. (The DPI of the images should be greater than 70.)

* The images usually need to be de-skewed and reoriented if they are not aligned properly.

* The images need to have as much contrast as possible.

* It is not capable of recognizing handwriting.

* If a document contains languages outside of those given in the -l lang argument, results may be poor.

* It is not always good at analyzing the natural reading order of documents. For example, it may fail to recognize that a document contains two columns and may try to join text across columns.

* It does not expose information about what font family the text belongs to.

Results

Input:

Output:

Conclusion

Tesseract is perfect for scanning clean documents and comes with pretty high accuracy and font variability, since its training is comprehensive. The results in the section above were obtained by extracting passwords from cup covers, and they are successful enough. While working on our OCR code, we faced four problems:

* In some situations, the machine cannot distinguish certain letters from numbers, and vice versa. For instance, it can read the letter 'Z' as the number '2'. You can cope with this problem by collecting more data. Another option is to use the blacklist feature of Tesseract if there is no number '2' in your passwords.

* Sometimes the input images can be particularly difficult, because their readability depends on conditions such as day or night, camera quality, and flash mode. We dealt with this problem by applying different filters to the images.

* The Makefile of the lanms module may cause a problem for you. This link helps you to fix it.

* If you have too much data, it can cause problems in some cases. When you face this problem, you can run your code on Ubuntu instead of Windows or macOS.

You can find detailed information in the sections above.

Thanks for reading this tutorial.
