Quantrium Guides

Installing and using Tesseract 4 on Windows 10

Bharath Sivakumar
Quantrium.ai
Published in Quantrium Guides · 7 min read · Jul 8, 2020

Tesseract is an optical character recognition (OCR) engine that runs on various operating systems. It is free software, released under the Apache License. Tesseract was originally developed by Hewlett-Packard as proprietary software in the 1980s and was released as open source in 2005. Since 2006, its development has been sponsored by Google. In this guide, I will take you through the steps I followed to install Tesseract on my Windows 10 machine. I will also show you how to use Tesseract from the command line once you have installed it.

Installing Tesseract 4 on a Windows Machine using .exe File:

To install Tesseract 4 on our Windows system, go to the following link:

Download the Windows installer by clicking the hyperlink titled tesseract-ocr-w64-setup-v4.1.0.20190314.exe. Your browser will ask you to save an .exe file with that name. Save it wherever you have enough storage space.

Open the .exe file. If Windows asks you “Do you want to allow this software to make changes to your system”, click Yes. You will be taken to the installer.

Click Next, then click I Agree to accept the terms and conditions. Choose who you want to install Tesseract for (anyone using this computer, or just you; either option works) and click Next.

Tick the boxes labelled “ScrollView”, “Training Tools”, “Shortcuts creation” and, importantly, “Language data”. These should be ticked by default, but tick them yourself if they are not.

If you want to run OCR on languages other than English, such as Japanese, Chinese, Kurdish, or Indian languages like Hindi, Tamil and Bengali, tick “Additional script data” and “Additional language data” as well. If you only need English, you can leave these two options unticked.

Click Next and select the directory where you want to install Tesseract. The default for me was C:\Program Files\Tesseract-OCR, and that is where I installed it. You can choose a different location, but take note of the path you install to: you will need it when setting the Path variable later.

Now you can select the Start Menu folder in which the program's shortcuts will be created. I created them in a folder called “Tesseract-OCR”. If you want a new folder, just type its name in the text box right under the “Select the Start Menu folder in which you would like ….” text.

You can also tick the “Do not create shortcuts” box in the bottom left if you don’t want any shortcuts. Once you have chosen your preferred option, click Install. The installation should take a few minutes.

Once the installation is over, note the directory where you installed Tesseract. We want to use Tesseract from the Windows command line, and to do that we have to add this directory to the Path environment variable.

To do so, click the Start button and search for “environment variable”. You will see a result called “Edit the system environment variables”. Click on it. You should now be in the “Advanced” tab of “System Properties”, with a button called “Environment Variables ….” visible at the bottom right. Click that button.

You will now see two tables. One is named “User variables for <username>”, where <username> stands for the user currently logged in to the PC. The other is called “System variables”. In the “System variables” table, click on the variable called “Path” and then click the “Edit” button right above the “OK” button, as shown in the screenshot below.

Set path variable for Tesseract on Windows

Once you’re done with this, you will see a window called “Edit environment variable”. At the top right there is a button called “New”. Click it, and you will get a blank line where you can add some text. Here, enter the path of the directory where you installed Tesseract (C:\Program Files\Tesseract-OCR in my case).

Once you have keyed in the directory path, hit “Enter” and check that it has been added to the list in the “Edit environment variable” window. Once it has been, click “OK”. Click “OK” again in the “Environment Variables” window, and once more in the “System Properties” window. You should now have exited all the settings windows.

Open a new command prompt, type tesseract --version and hit Enter. You will see something like this:

Output of the tesseract --version command after Tesseract was successfully installed

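If you see an error such as “'tesseract' is not recognized as an internal or external command”, the Tesseract directory is most likely not on your Path. First make sure you opened a new command prompt after changing the variable, since command prompts that were already open keep the old Path. If that does not help, go back through this guide, find where you went wrong and fix it. Alternatively, you can repeat the whole process.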
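If the command is not found, a quick way to narrow the problem down is to check whether the install directory is really part of your Path. Below is a minimal Python sketch of that check. Python is my own choice here and is not needed by Tesseract; the helper name dir_on_path and the TESSERACT_DIR value are assumptions for illustration, so change the directory to wherever you installed Tesseract.

```python
import os
import shutil


def dir_on_path(directory, path_value):
    """Return True if `directory` is one of the entries of a Windows-style
    Path string (entries separated by semicolons)."""
    def normalize(entry):
        # Windows paths are case-insensitive; also ignore surrounding
        # spaces, quotes and trailing slashes when comparing.
        return entry.strip().strip('"').rstrip("\\/").lower()

    target = normalize(directory)
    entries = [e for e in path_value.split(";") if e.strip()]
    return any(normalize(e) == target for e in entries)


# Assumption: the installer's default directory mentioned in this guide.
TESSERACT_DIR = r"C:\Program Files\Tesseract-OCR"

print("Directory on Path:", dir_on_path(TESSERACT_DIR, os.environ.get("PATH", "")))
# shutil.which returns the full path of the executable, or None if the
# command cannot be found through the current Path.
print("Executable found at:", shutil.which("tesseract"))
```

If the first line prints False, the Path entry is missing or misspelled. If it prints True but the second line prints None, check that the directory really contains tesseract.exe.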

Great! Now you have Tesseract installed on your machine. You can start playing around with it and explore it further.

How to use Tesseract 4 using Command Line on a Windows Machine

First, make sure you have a handwritten or typed document in the form of an image. Let’s say you have a PNG photo called handwritten_photo_1.png on your Desktop and want to test Tesseract with it. Open your command prompt. You will start in this directory:

C:\Users\username>

where username is your username on that system. I need to go to the desktop directory. So I use the following command:

C:\Users\username> cd Desktop

Now I am in the Desktop directory, where my image is located. To see the text that Tesseract recognizes in the image, use the following command:

C:\Users\username\Desktop> tesseract handwritten_photo_1.png stdout -l eng

Tesseract will print the recognized text directly in the command prompt. The -l parameter specifies the language. Here we have set it to English, which is the default anyway, so using -l eng was redundant in this case. If you want to use some other language for OCR, check this link, which has the .traineddata files for the supported languages:

Say you have a document written in Hindi. Go to the link above, click on the file titled hin.traineddata and download it. Then move the downloaded file to the “tessdata” folder, which is inside the directory where you installed Tesseract. Once you have done that, you can perform OCR on Hindi documents by using the following command:

C:\Users\username\Desktop> tesseract hindi_image.png stdout -l hin
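Moving the language file can also be scripted. The following is only a sketch in Python: the function names install_traineddata and installed_languages are hypothetical helpers I made up for illustration, and the paths in the usage note are assumptions based on the default install directory.

```python
import shutil
from pathlib import Path


def install_traineddata(downloaded_file, tesseract_dir):
    """Copy a downloaded .traineddata file into <tesseract_dir>/tessdata
    and return the path of the copy."""
    source = Path(downloaded_file)
    if source.suffix != ".traineddata":
        raise ValueError(f"not a .traineddata file: {source.name}")
    tessdata = Path(tesseract_dir) / "tessdata"
    if not tessdata.is_dir():
        raise FileNotFoundError(f"tessdata folder not found: {tessdata}")
    # Copying into an existing directory keeps the original file name.
    return Path(shutil.copy2(source, tessdata))


def installed_languages(tesseract_dir):
    """Return the language codes that have a .traineddata file in tessdata."""
    tessdata = Path(tesseract_dir) / "tessdata"
    return sorted(p.stem for p in tessdata.glob("*.traineddata"))
```

For example, install_traineddata(r"C:\Users\username\Downloads\hin.traineddata", r"C:\Program Files\Tesseract-OCR") would copy the Hindi file into place. Writing to C:\Program Files usually needs administrator rights, so you may have to run the script from an elevated prompt. You can also run tesseract --list-langs on the command line to see which languages Tesseract itself finds.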

Instead of displaying the OCR output on the command line itself, let’s say you want your OCR output to be stored in a text file. In that case you can enter the following command instead:

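tesseract handwritten_photo_1.png output

Tesseract adds the .txt extension to the output name itself, so the text in handwritten_photo_1.png will be stored in a text file called output.txt, located in your present working directory, which was Desktop in my case. (If you pass output.txt as the name, the file will be called output.txt.txt.)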
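If you would rather call Tesseract from a script than type the commands by hand, the same calls can be assembled in Python. This is a minimal sketch, not an official API: build_ocr_command and run_ocr are names I invented for illustration, and run_ocr assumes that tesseract is on your Path and that the image exists.

```python
import subprocess


def build_ocr_command(image, output_base="stdout", lang="eng"):
    """Build the argument list for one Tesseract call.

    output_base is "stdout" to print the text, or a file name WITHOUT
    extension, because Tesseract appends ".txt" to it.
    """
    return ["tesseract", str(image), str(output_base), "-l", lang]


def run_ocr(image, lang="eng"):
    """Run Tesseract on one image and return the recognized text."""
    result = subprocess.run(
        build_ocr_command(image, "stdout", lang),
        capture_output=True,   # collect what Tesseract prints
        text=True,
        encoding="utf-8",
        check=True,            # raise an error if Tesseract fails
    )
    return result.stdout


# Only builds and prints the commands; nothing is executed here.
print(build_ocr_command("handwritten_photo_1.png"))
print(build_ocr_command("hindi_image.png", "output", "hin"))
```

Passing the arguments as a list means file names with spaces need no extra quoting. Calling run_ocr("handwritten_photo_1.png") from the Desktop directory should return the same text that the stdout command prints.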

Tesseract can also take a text file as input, where the text file contains the absolute paths of all the images that you want to process, one path per line.

This is especially useful when you have several images to process. Let’s say you have two images of English handwriting called handwritten_photo_1.png and handwritten_photo_2.png in the C:\Program Files directory, and in your present working directory you have a text file called input.txt whose contents are:

C:\Program Files\handwritten_photo_1.png
C:\Program Files\handwritten_photo_2.png

These are the first and second lines of the file, respectively.
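With many images, typing such a list by hand gets tedious. A list file of this kind can be generated with a few lines of Python. This is a sketch under my own assumptions: write_image_list is a hypothetical helper name, and the file name input.txt simply follows the example above.

```python
import os
from pathlib import Path


def write_image_list(image_paths, list_file="input.txt"):
    """Write one absolute image path per line to list_file and return
    the path of the list file."""
    lines = [os.path.abspath(str(p)) for p in image_paths]
    list_path = Path(list_file)
    list_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return list_path
```

For example, write_image_list(sorted(Path(r"C:\Program Files").glob("*.png"))) would list every PNG file in that folder in alphabetical order, which is also the order in which the results will appear in the output.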

Now, if you want to store the contents of these two handwritten photos in a text file, you can just do the following:

tesseract input.txt output -l eng

output.txt will have the OCR contents of both handwritten_photo_1.png and handwritten_photo_2.png, in that order. Note that input.txt was in the current working directory here. You can also use a text file that is not in your present working directory by including its full path, like this:

tesseract "C:\Program Files\input.txt" output -l eng

The path is in quotes because it contains a space. output.txt will again be located in the present working directory. You can do this for more than two photos as well. Note that in output.txt the prediction for each new photo will be preceded by a symbol (Tesseract's page separator, a form feed character by default), like this:

Tesseract output of an input text file with 5 lines of image locations

So in this case, “Viral Calic” is the prediction for the first image, “CY am the king of the world” is the prediction for the second image, “Com and Serr” is the prediction for the third image, and so on. You can go through the output for all your input images and check the accuracy of the predictions.
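If you want to check the predictions image by image in a script, you can split the combined output on that separator symbol. The sketch below is in Python and rests on two assumptions of mine: that the symbol is the form feed character ("\f"), and that Tesseract writes it after the text of each image. split_pages is a hypothetical helper name.

```python
def split_pages(ocr_text, separator="\f"):
    """Split combined OCR output into one string per input image."""
    chunks = [chunk.strip() for chunk in ocr_text.split(separator)]
    # Assumption: the separator follows each image's text, so the last
    # chunk is an empty leftover and can be dropped.
    if chunks and chunks[-1] == "":
        chunks.pop()
    return chunks


# Sample text modelled on the output described above.
sample = "Viral Calic\n\fCY am the king of the world\n\f"
for number, text in enumerate(split_pages(sample), start=1):
    print(number, text)
```

To use it on a real run, read the file first, for example with open("output.txt", encoding="utf-8").read(), and pass the result to split_pages. An image in which Tesseract found no text shows up as an empty string, so the positions still line up with the lines of input.txt.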

That’s it! Congratulations, you are now all set and ready to use Tesseract on your Windows 10 system.

Bharath Sivakumar
Quantrium.ai

A Machine Learning enthusiast who wants to make Machine Learning tools accessible to everybody