Tesseract is an optical character recognition (OCR) engine that runs on various operating systems. It is free software, released under the Apache License. Tesseract was originally developed by Hewlett-Packard as proprietary software in the 1980s and was released as open source software in 2005. Since 2006, its development has been sponsored by Google. In this guide, I will take you through the steps I followed to install Tesseract on my Windows 10 machine. I will also show you how to use Tesseract from the command line once you have successfully installed it.
Installing Tesseract 4 on a Windows Machine using .exe File:
To install Tesseract 4 on a Windows system, go to the download page titled “Index of /tesseract”. The executables on that page are provided by Mannheim University Library and are licensed under the Apache License, Version 2.0.
Download the Windows executable by clicking the hyperlink titled tesseract-ocr-w64-setup-v4.0.0.20190314.exe. A prompt asking you to save the .exe file will appear. Save it wherever you have enough storage space.
Open this .exe file. If Windows asks you “Do you want to allow this software to make changes to your system”, click Yes. You will be taken to the installer.
Hit Next and click “I Agree” to accept the terms and conditions. Then choose who Tesseract should be installed for (anyone using this computer, or just for me; either one works) and click Next.
Tick the boxes that say “ScrollView”, “Training Tools”, “Shortcuts creation” and, importantly, “Language data”. These should be ticked by default, but tick them yourself if they are not.
Now, if you want to run OCR on languages other than English, such as Japanese, Chinese, Kurdish, or Indian languages like Hindi, Tamil and Bengali, tick “Additional script data” and “Additional language data” as well. If you only need English, you don’t have to tick these options.
Click on Next. Select the directory where you want to install Tesseract. By default it showed C:\Program Files\Tesseract-OCR for me, and that’s where I installed it. You can install it wherever you like, but do take note of the path where you installed Tesseract on your machine. This is important.
Now you can select the Start Menu folder in which you would like to create the program’s shortcuts. I created them in a folder called “Tesseract-OCR”. If you want a new folder, just type its name in the blank space right under the “Select the Start Menu folder in which you would like ….” text.
You can also tick the “Do not create shortcuts” box in the bottom left if you don’t want any shortcuts. Once you have selected your preferred option, click Install. The installation should take a few minutes.
Once the installation is over, go to the directory where you installed Tesseract and confirm that the files are there. We want to use Tesseract from the Windows command line, and to do that we have to add Tesseract’s directory to the Path environment variable.
To do so, click on the Start button and search for “environment variable”. You will see a result called “Edit the system environment variables”. Click on it. You should now be in the “Advanced” section of “System Properties”, and a button called “Environment Variables ….” should be visible at the bottom right. Click on that button.
Now, you will see two tables here. One is named User variables for <username>, where <username> stands for the name of the user currently logged in to the PC. The other table is called “System variables”. In the “System variables” table, click on the variable called “Path” and then click on the “Edit” button right above the “OK” button.
Once you’re done with this, you will see a window called “Edit environment variable”. At the top right, you will see a button called “New”. Click on it. You will get a blank row where you can add some text. Here, enter the directory where your Tesseract-OCR files are stored (the install path you noted earlier).
Once you have keyed in the directory name, hit “Enter” and check that it has been added to the list in the “Edit environment variable” window. Once it has been, click “OK”. Click “OK” again in the “Environment Variables” window, and once more in the “System Properties” window. You should now have exited all the settings windows.
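If you would rather double-check the Path entry with a script than by eye, here is a minimal Python sketch. It assumes the default install folder C:\Program Files\Tesseract-OCR (change `TESSERACT_DIR` if you picked another one), and the helper name `dir_in_path` is my own, not something Tesseract or Windows provides.

```python
import os
import shutil

# Assumption: the installer's default folder. Change this if you installed elsewhere.
TESSERACT_DIR = r"C:\Program Files\Tesseract-OCR"


def dir_in_path(directory, path_value, sep=os.pathsep):
    """Return True if `directory` is one of the entries of a PATH-style string."""

    def norm(entry):
        # Windows paths are case-insensitive; also ignore quotes and trailing slashes.
        return entry.strip().strip('"').rstrip("\\/").lower()

    wanted = norm(directory)
    return any(norm(entry) == wanted for entry in path_value.split(sep) if entry.strip())


if __name__ == "__main__":
    # Both checks only look at the environment of the current process.
    print("Directory on Path:", dir_in_path(TESSERACT_DIR, os.environ.get("PATH", "")))
    print("Executable found at:", shutil.which("tesseract"))
```

Run it from a freshly opened terminal, since a process that was already running when you edited the variable keeps the old Path.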
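Open a new Command Prompt window (one that was already open before the change will not see the updated Path), type tesseract --version and hit Enter. If everything worked, Tesseract prints its version information.
If you instead see an error saying that the tesseract command was not found (on Windows the message reads: 'tesseract' is not recognized as an internal or external command), most probably a step in this guide was missed or the wrong directory was added to Path. Go back, find where things went wrong and fix it. Alternatively, you can repeat the whole process.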
Great! Now you have Tesseract installed on your machine. You can start playing around with it and explore it further.
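The same version check can be done from Python, which is handy if you plan to call Tesseract from scripts later. This is only a sketch: `tesseract_version` is a helper name I made up, and it simply runs the `tesseract --version` command described above.

```python
import shutil
import subprocess


def tesseract_version(executable="tesseract"):
    """Return the first line printed by `<executable> --version`, or None if it cannot be found."""
    exe_path = shutil.which(executable)
    if exe_path is None:
        # Same situation as the "not recognized" error in the Command Prompt.
        return None
    result = subprocess.run([exe_path, "--version"], capture_output=True, text=True)
    # Depending on the build, the banner may arrive on stdout or stderr, so check both.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    print(tesseract_version() or "tesseract was not found on Path")
```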
How to use Tesseract 4 using Command Line on a Windows Machine
First, make sure you have a handwritten or typed document in the form of an image. Let’s say you have a PNG photo called handwritten_photo_1.png on your Desktop and want to test Tesseract with it. Open your Command Prompt. You will typically start in the C:\Users\username directory, where username is your username on that system. I need to go to the Desktop directory, so I use the following command:
C:\Users\username> cd Desktop
Now I am in the Desktop directory, where my image is located. You can see what text Tesseract predicts for the document using the following command:
C:\Users\username\Desktop> tesseract handwritten_photo_1.png stdout -l eng
Tesseract will output the text directly on the command line. The -l parameter is used to specify the language. Here we have specified English, which is the default anyway, so -l eng was redundant in this case. If you want to use another language for OCR, you need its .traineddata file, which holds the data for that language. The page hosting these files describes them as follows: “These language data files only work with Tesseract 4.0.0. They are based on the sources in tesseract-ocr/langdata on…”
Say you have a text document written in Hindi. Go to that page, click on the file titled hin.traineddata and download it. Once you have downloaded it, move it to the “tessdata” folder, which is inside the directory where you originally installed Tesseract. Once you have done that, you can perform OCR on Hindi documents using the following command:
C:\Users\username\Desktop> tesseract hindi_image.png stdout -l hin
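Before running OCR in a new language, it can help to confirm that the .traineddata file really landed in the tessdata folder. Here is a small Python sketch for that. The folder path assumes the default install location, and `installed_languages` and `has_language` are hypothetical helper names of mine.

```python
from pathlib import Path

# Assumption: tessdata sits inside the default install directory.
TESSDATA_DIR = Path(r"C:\Program Files\Tesseract-OCR\tessdata")


def installed_languages(tessdata_dir):
    """List the language codes that have a <code>.traineddata file in tessdata_dir."""
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        return []
    return sorted(p.stem for p in folder.glob("*.traineddata"))


def has_language(tessdata_dir, lang):
    """Return True if <lang>.traineddata exists in tessdata_dir."""
    return lang in installed_languages(tessdata_dir)


if __name__ == "__main__":
    print("Hindi data installed:", has_language(TESSDATA_DIR, "hin"))
```

Tesseract can also report this itself: running tesseract --list-langs prints the languages it finds.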
Instead of displaying the OCR output on the command line, let’s say you want it to be stored in a text file. In that case, pass an output name without the extension, because Tesseract appends .txt to it on its own:
tesseract handwritten_photo_1.png output
The text in handwritten_photo_1.png will be stored in a text file called output.txt, located in your present working directory, which was Desktop in my case.
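If you want to drive the same commands from a script, a thin Python wrapper around the command line is enough. The sketch below only assembles and runs the commands shown in this guide. The function names are my own, and it assumes tesseract is on your Path.

```python
import subprocess


def build_command(image, output_base="stdout", lang="eng"):
    """Build the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


def output_file_for(output_base):
    """Tesseract appends .txt to the output base when it writes plain text."""
    return f"{output_base}.txt"


def ocr_to_file(image, output_base, lang="eng"):
    """Run Tesseract and return the name of the text file it is expected to write."""
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(build_command(image, output_base, lang), check=True)
    return output_file_for(output_base)
```

For example, ocr_to_file("handwritten_photo_1.png", "output") runs the command above and hands back the name output.txt. Passing the arguments as a list also means paths containing spaces need no extra quoting.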
Tesseract can also take a text file as input. This text file needs to list the absolute paths of all the images that you want to process, one per line.
This is especially useful when you have several images. Let’s say you have two images handwritten in English called handwritten_photo_1.png and handwritten_photo_2.png in the C:\Program Files directory. Now, in your present working directory, you have a text file called input.txt whose contents are:
C:\Program Files\handwritten_photo_1.png
C:\Program Files\handwritten_photo_2.png
in the first and second line respectively.
Now if you want to store the contents of these two handwritten photos in a text file, you can just do the following:
tesseract input.txt output -l eng
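Typing the list file by hand gets tedious with many images, so here is a rough Python sketch that generates one from a folder. The function name `write_image_list` and the set of image extensions are my own assumptions; adjust them to your files.

```python
from pathlib import Path

# Assumption: the image types you want to include in the list.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def write_image_list(folder, list_path):
    """Write one absolute image path per line to list_path and return the paths."""
    paths = sorted(
        str(p.resolve())
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    # One path per line, in sorted order, which is the order Tesseract will process them.
    Path(list_path).write_text("".join(path + "\n" for path in paths), encoding="utf-8")
    return paths
```

For instance, write_image_list(r"C:\Program Files", "input.txt") would produce a list file for the images in that folder, ready to be passed to tesseract.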
output.txt will have the OCR contents of both handwritten_photo_1.png and handwritten_photo_2.png, in that order. Here, you should note that input.txt was in the current working directory. You can also use Tesseract on a list file that is not in your present working directory by including its location, wrapped in quotes if the path contains spaces:
tesseract "C:\Program Files\input.txt" output -l eng
output.txt will again be located in the present working directory. You can do this for more than two photos as well. Note that in output.txt, the prediction for each new photo is set off by a separator symbol. In my run with three images, Viral Calic was the prediction for the first image, CY am the king of the world the prediction for the second image, Com and Serr the prediction for the third image and so on. You can check the output for all your input images and judge the accuracy of the predictions.
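To get one prediction per image back out of the combined file, you can split it on the separator. The Python sketch below assumes the separator is the form feed character (\f), which as far as I know is Tesseract’s default page separator. If your output uses a different symbol, pass it in as `separator`. Both function names are hypothetical.

```python
from pathlib import Path

# Assumption: Tesseract separates the text of consecutive images with a form feed.
PAGE_SEPARATOR = "\f"


def split_predictions(text, separator=PAGE_SEPARATOR):
    """Split combined OCR text into one stripped string per image, dropping empty chunks."""
    return [chunk.strip() for chunk in text.split(separator) if chunk.strip()]


def read_predictions(output_file, separator=PAGE_SEPARATOR):
    """Read a Tesseract text output file and return the per-image predictions."""
    return split_predictions(Path(output_file).read_text(encoding="utf-8"), separator)
```

Keep in mind that this drops empty chunks, so an image for which Tesseract recognised no text at all would shift the numbering of the predictions that follow it.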
That’s it! Congratulations, you are now all set and ready to use Tesseract on your Windows 10 system.