Day 16 of 100DaysofML

Charan Soneji
Published in 100DaysofMLcode
4 min read · Jul 2, 2020

Tesseract OCR. I was working on a tool for the Flipkart Grid hackathon, which is meant for UG students. The problem statement asked us to take an invoice in picture format, read all of its contents, and enter the data into an Excel sheet. So essentially the conversion is from a picture file to a text file to an Excel sheet.

Now, there are several tools available on the market that can be integrated to solve this problem. I thought of combining the OCR tool with a neural network to improve the results, but the drawback we faced was that the text was a little too small to be read.

I’ll start with the first tool I came across, which I feel a lot of people might find useful because of its robustness. The installation may be a little complex, but I’ll guide you through it.

Installation of Tesseract-OCR
The official GitHub link for the software is:

Once you open this link, click the installer link that matches your system (32-bit or 64-bit). When the installer opens, keep everything at the defaults and proceed with the installation. If you are using a Windows machine, the next thing you need to do is set your Path variable. Click Start, open the edit path variable option, and then look for Path in the lower column.

Edit Path variable → Path → Edit → Add → Name of directory of installation

The installation directory is the folder you chose when installing Tesseract OCR. For instance, in my case, it is
C:\Program Files\Tesseract-OCR

Steps for setting path variable
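
A quick way to confirm that the Path entry took effect is to ask Python where it finds the executable. This is only a minimal sketch that assumes the command is named tesseract, as it is with the default installer. The helper names find_tesseract and tesseract_version are my own and are not part of Tesseract.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(executable):
    """Return the first line printed by `tesseract --version` (empty if nothing is printed)."""
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some builds print the version banner to stderr instead of stdout.
    lines = (result.stdout or result.stderr).splitlines()
    return lines[0] if lines else ""


# Example use:
#   path = find_tesseract()
#   print(path if path else "tesseract is not on PATH, re-check the Path entry")
```

If find_tesseract() returns None, open a new CMD window and try again, because changes to Path only apply to windows opened after the change.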

The documentation for Tesseract OCR is generated with Doxygen, and it can also be found at the link below:

How exactly does Tesseract-OCR work?
Tesseract-OCR works in 2 main stages. The first stage performs character recognition. The second stage then fills in any letters the first stage was not sure of, using letters that match the word or sentence context.

One of the major disadvantages of OCR shows up when the lighting varies. If the lighting is bad, the overall recognition result turns out to be very bad. This is why the image should be preprocessed using a library such as OpenCV (cv2). Also make sure to use the latest version of Tesseract (v4), because the earlier versions have comparatively lower accuracy.

Comparison of accuracy based on versions

In the picture above, which I found on the internet, you can see how the accuracy differs with the version and with the lighting, so both directly impact how well the text is identified.

Tesseract has Unicode (UTF-8) support and can recognize more than 100 languages “out of the box”. Okay, so what does “out of the box” mean?
This means that trained data for these languages is already available, so you do not have to train anything yourself before Tesseract can read and identify the characters and sentences of a language.
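
To see which languages a particular installation can actually use, Tesseract offers the --list-langs option. The sketch below calls it from Python and parses the result. It assumes the output is a header line starting with "List of" followed by one language code per line, which may vary between versions. The function names are hypothetical.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the assumed "List of available languages" header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.append(line)
    return langs


def installed_languages():
    """Ask the local Tesseract install which language data files it has."""
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=True
    )
    # Depending on the version, the list is written to stdout or stderr.
    return parse_list_langs(result.stdout + result.stderr)


# Example use:
#   print(installed_languages())
```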

Here is a real example to make this concrete. I’ve taken an invoice, which you can see below:

Invoice demo

Now I’m going to run Tesseract OCR, and we shall compare the output with this file.

To run Tesseract, open CMD (as admin if you wish) and run the command on its own: tesseract.

On doing so, you get a list of the options that you can use along with the command to run it on a given file.

One of the important steps is to move into the folder that contains the picture by using the cd and cd.. commands.
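
The same step can also be driven from Python instead of typing into CMD. The basic command-line form is tesseract IMAGE OUTPUTBASE, with -l selecting the language, and Tesseract writes the recognized text to OUTPUTBASE.txt. This is only a minimal sketch. The file names and the helper names build_tesseract_command and run_tesseract are hypothetical.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for `tesseract IMAGE OUTPUTBASE -l LANG`."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def run_tesseract(image_path, output_base, lang="eng"):
    """Run Tesseract and return the recognized text.

    Tesseract appends ".txt" to OUTPUTBASE itself, so the text ends up in
    OUTPUTBASE.txt.
    """
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


# Example use ("invoice.png" and "invoice_text" are hypothetical names):
#   print(run_tesseract("invoice.png", "invoice_text"))
```

Passing the arguments as a list avoids quoting problems when the folder or file name contains spaces.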

Check out the GIF below to get an understanding of how it works.

Running Tesseract
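
For the last leg of the picture to text to Excel pipeline described at the start, one simple option is to write the recognized text to a CSV file, which Excel can open. The sketch below is a rough illustration and not what we submitted. It assumes the columns in the OCR text are separated by a tab or by two or more spaces, which depends heavily on the invoice layout and on Tesseract's settings, so real invoices will need more careful parsing. The function names are hypothetical.

```python
import csv
import re


def text_to_rows(text):
    """Split OCR text into rows of cells.

    Assumes columns are separated by a tab or by two or more whitespace characters.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore empty lines between blocks of text
        rows.append(re.split(r"\s{2,}|\t", line))
    return rows


def write_csv(rows, csv_path):
    """Write the rows to a CSV file that Excel can open."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)


# Example use ("invoice_text.txt" and "invoice.csv" are hypothetical names):
#   with open("invoice_text.txt", encoding="utf-8") as f:
#       write_csv(text_to_rows(f.read()), "invoice.csv")
```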

I just wanted to give an overview of Tesseract OCR and its application in detecting text in images. That’s it for today. Keep learning.

Cheers.
