Quantrium Guides

Training Tesseract on your custom dataset using Qt Box Editor

Bharath Sivakumar
Quantrium.ai
Jul 12, 2020

In this guide, I will take you through the steps I followed to train Tesseract using the Qt Box Editor and improve its predictions on certain types of images on which it was performing poorly. Before we begin, these are the tools I had at my disposal; if you want to follow everything in this tutorial without errors, I suggest you have the same:

  • A Windows 10 PC with Tesseract installed
  • A Google Cloud account, with a Compute Engine instance running Ubuntu 18.04.4 that you can SSH into

If you haven’t already installed Tesseract on your Windows 10 system, check out my other post on Medium titled “Installing and using Tesseract 4 on Windows 10”, where I walk you through the process.

Using QT Box Editor on Windows 10 to Edit Tesseract Box Files

Now, to train Tesseract on images where it is doing poorly, we need to train it using something known as a box file. A box file is a file that Tesseract generates, showing which characters it predicted and where in the image it detected them.
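
To give you an idea of what is inside, each line of a box file holds one recognised character followed by the pixel coordinates of its bounding box (left, bottom, right, top, with the origin at the bottom-left of the image) and a page number. The excerpt below is purely hypothetical; the actual characters and coordinates will depend on your image:

T 36 92 59 118 0
h 62 92 83 118 0
e 86 92 105 110 0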

In these files, boxes are drawn around what Tesseract thinks are characters, along with what it believes is the most likely character inside each box. We train Tesseract by editing these box files and telling it where it has gone wrong. Before we can edit them, though, we need to generate them, and before that you need to do one thing: make sure that the names of all the images you are going to use for training follow this format:

<language name>.<fontname>.exp<file number>

For example, if your images are all English text in the Arial font, and you want to train Tesseract using 5 images, make sure your files follow the naming format:

eng.arial.exp0
eng.arial.exp1
eng.arial.exp2
eng.arial.exp3
eng.arial.exp4
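
If your scans currently have arbitrary names and you work in a bash-style shell (for example Git Bash on Windows), a small loop can rename them into this pattern. This is just a sketch assuming your images are .jpg files in the current directory; adjust the language, font and extension to your case:

i=0
for f in *.jpg; do
  mv "$f" "eng.arial.exp${i}.jpg"   # rename into the <language>.<font>.exp<number> pattern
  i=$((i+1))
done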

Now we are ready to generate the box file for each of your images. The box files can be obtained by running the following command for each image:

tesseract -l eng eng.arial.exp0.jpg eng.arial.exp0 batch.nochop makebox

Run this command for all your images. The only part that changes for each image is the .exp<number> part. For your second image the command will be:

tesseract -l eng eng.arial.exp1.jpg eng.arial.exp1 batch.nochop makebox

For third image:

tesseract -l eng eng.arial.exp2.jpg eng.arial.exp2 batch.nochop makebox

and so on.
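
If you have many images and are working in a bash-style shell (for example Git Bash on Windows), a small loop can generate all the box files in one go. This is just a sketch assuming five images named as in the example above:

for i in 0 1 2 3 4; do
  tesseract -l eng "eng.arial.exp${i}.jpg" "eng.arial.exp${i}" batch.nochop makebox
done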

Each box file will be generated in the present working directory. Make sure that all the images and their corresponding box files are in the same directory. This is important.

Now, to edit these box files, we will need a piece of software called the Qt Box Editor, which can be downloaded from the following URL:

I downloaded “qt-box-editor-1.12rc1b-portable.zip” for my 64-bit Windows system. After downloading the zip file, extract its contents to wherever you have storage space. Among the extracted contents you will find an exe file called “qt-box-editor-1.12rc1.exe”; run it.

Before opening any image, go to “Edit”, click on “Settings”, then click on the Tesseract section and make sure that the directory specified in “TESSDATA_PREFIX” is the directory where you originally installed Tesseract-OCR. For me, it was C:/Program Files/Tesseract-OCR/.

In the language section, select the language that your images are written in. For me it was English. Click OK after selecting the language. Now, just for verification, close the application, open it again, and check that the “TESSDATA_PREFIX” and “Language” settings have kept the values you just set.

This software is a little buggy, so you might have to repeat the procedure above a few times before everything is in order.

Now you want to open each image and edit it. Note that you want to open the image here, not the box file itself. To open an image in the app, click on File in the top left and hit Open. Select a photo whose box file you have already generated. You will see boxes drawn around the text in the image, and the predicted output for each box listed on the left side. You should see something like this:

The red boxes highlighted here show where Tesseract thinks each character appears, and the panel lists its prediction for each box. You can edit a box by clicking on its letter, and extra boxes can be added by clicking on the plus button at the bottom left.

If you don’t see these highlighted boxes and only the image opens, then something has gone wrong somewhere: either your Tesseract directory is not set properly in the Qt Box Editor, or your image file names do not follow the format I described, or your box files haven’t been generated properly. Find which one of these is the problem and fix it. Once you have edited the box files for all the images, you are ready to start training.

Training Tesseract on the Box Files on Ubuntu 18.04:

We shall be using the script from the following tutorial to train our Tesseract:

The script we need is in the section of that article called “Time to train Tesseract to recognize letters properly”; we will modify it to our needs.

You can write this script directly on your Ubuntu system in the cloud and execute it there, but I suggest you first copy it into something like Notepad++ on your Windows system and modify it to your needs there. This is easier, since writing and editing scripts directly from the command line is a little difficult if you haven’t done it before; of course, you can do so if you wish.

Once you paste the script into Notepad++, go to Edit, then “EOL Conversion”, and click on “Unix (LF)”. If you don’t do this, your Ubuntu system won’t interpret the line endings in the script correctly and won’t execute the file.
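
If you forget this step, you can also fix the line endings later on the Ubuntu machine itself. One option (assuming your script file is called train) is the dos2unix utility:

sudo apt install dos2unix
dos2unix train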

Now, before you hit save, modify this file according to your needs. First, change the N in the seventh line: if you are going to train Tesseract on 7 images, make sure N is 6; if you have 20 images, make sure N is 19; and so on. Then, in the 10th line, which is commented out, turn every pol. into eng. if your documents are in English, jpn. if they are in Japanese, and so on.

You don’t have to uncomment this line on the first run, but if you rerun the script, you have to uncomment it first.

In my case, Tesseract is being trained on images in png format, in English, with the Vivaldi font, so line 13 has to change from pol.ocrb.exp$i.tif to eng.vivaldi.exp$i.png, and similarly pol.ocrb.exp$i has to become eng.vivaldi.exp$i in the same line. Change it according to your own format and needs.

Similarly, pol.ocrb.exp on line 15 needs to be changed to eng.vivaldi.exp. In line 16, change ocrb to the font of your document and modify the numbers depending on that font.

The font used for the images in this guide was Vivaldi, so I changed ocrb 0 0 1 0 0 to vivaldi 1 0 0 0 0, since my font, Vivaldi, is italic only and is not bold, monospace, serif or fraktur (the five flags in a font_properties entry stand for italic, bold, fixed-pitch, serif and fraktur, in that order).

Similarly, in the last part of line 17 I had to change pol.ocrb.exp to eng.vivaldi.exp, and in line 18 I changed pol.ocrb.exp to eng.vivaldi.exp as well. In lines 20, 21, 22 and 23, I modified pol.inttemp, pol.normproto, pol.pffmtable and pol.shapetable to eng.inttemp, eng.normproto, eng.pffmtable and eng.shapetable respectively.

In the 24th and final line, change the language prefix to match yours. Since all my text is in English, I write combine_tessdata eng. (note the trailing dot). If your text is Japanese, make it combine_tessdata jpn., and so on.
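
To make the above modifications easier to follow, here is a rough sketch of what my modified script ended up looking like for English images in the Vivaldi font. This is not the exact script from the linked tutorial, just an approximation of the same legacy training flow, so the line numbers will not match exactly; treat it as a reference for what each modified line is doing and adapt the names, extension and N to your own case:

#!/bin/bash
# Highest exp<number> among the training images (here 5 images: exp0..exp4)
N=4

# Uncomment on re-runs to clear the output of a previous run:
# rm eng.unicharset eng.inttemp eng.normproto eng.pffmtable eng.shapetable eng.traineddata *.tr

# Generate a .tr training file from each image and its edited box file
for i in $(seq 0 $N); do
  tesseract eng.vivaldi.exp$i.png eng.vivaldi.exp$i box.train
done

# Build the character set from all the box files
unicharset_extractor eng.vivaldi.exp*.box

# Font properties: <fontname> <italic> <bold> <fixed> <serif> <fraktur>
echo "vivaldi 1 0 0 0 0" > font_properties

# Cluster shapes and train the classifiers
shapeclustering -F font_properties -U unicharset eng.vivaldi.exp*.tr
mftraining -F font_properties -U unicharset -O eng.unicharset eng.vivaldi.exp*.tr
cntraining eng.vivaldi.exp*.tr

# Rename the outputs so combine_tessdata picks them up
mv inttemp eng.inttemp
mv normproto eng.normproto
mv pffmtable eng.pffmtable
mv shapetable eng.shapetable

# Bundle everything into eng.traineddata
combine_tessdata eng.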

As already mentioned a few times, you need to make these changes according to your own setup. With all the necessary modifications to the script completed, we are ready to save the file. When saving, right before you hit Save, you will see an option called “Save as type” that you can change. Click on it and select “Unix script” as the save type. Name the file whatever you want (I named it “train”) and save it.

So far so good! Everything you need to configure on your Windows system is done. Now, go to the Google Cloud console:

Log in using your Google Cloud ID and create a Compute Engine instance with Ubuntu 18.04.4 LTS. Once you have created the instance, SSH into it.
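
If you prefer the command line over the web console, something like the following gcloud commands should create and connect to an equivalent instance. The instance name and zone here are just placeholders I chose for illustration, and this assumes the Google Cloud SDK is installed and a project is configured:

gcloud compute instances create tesseract-train \
  --zone=us-central1-a \
  --image-family=ubuntu-1804-lts \
  --image-project=ubuntu-os-cloud

gcloud compute ssh tesseract-train --zone=us-central1-a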

Upload the following files from your Windows system onto the cloud instance (one command-line way to do this is sketched after this list):

  • All the images that you want to train Tesseract on, along with their edited box files.
  • The Unix script file that you created at the end, which I called “train”.
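
If you created the instance with gcloud as sketched above, you can copy the files over with gcloud compute scp instead of uploading them through the browser. The instance name and zone are the same hypothetical placeholders as before; list your own file names:

gcloud compute scp eng.vivaldi.exp0.png eng.vivaldi.exp0.box train tesseract-train:~/ --zone=us-central1-a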

Once the upload is complete, you need to install Tesseract and its training tools on this Ubuntu machine. You can do that using the following commands:

sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
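
On a fresh instance the package index may be out of date, so if the installs above fail, you may need to refresh it first:

sudo apt update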

That’s it. You have installed the tools needed to train Tesseract. Now let’s check the Tesseract version to confirm it is installed and runs properly:

tesseract --version

You should see output similar to what you saw when you ran the same command on your Windows system.

At this point you do not yet know where Tesseract and its data have been installed. To find out, use the following command:

sudo find / -name "tessdata" 

This should give you the directory where tessdata is located. For me it is /usr/share/tesseract-ocr/4.00/tessdata. “tessdata” is the folder where Tesseract stores the trained data for the various languages it supports. Remember this path; you will need it later.
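
You can also peek inside to see which language files are already installed; if your path is the same as mine, that would be:

ls /usr/share/tesseract-ocr/4.00/tessdata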

Now go back to your home directory using the command:

cd /home/<username>

This is the directory where all your uploaded files reside. Make sure all your files (images, box files and the script file) are in this folder. Now you just need to run the script file. Do this by executing the following two commands in sequence:

chmod +x train
./train

Here, train is the name of my script file; if yours is called something else, put that name in place of “train” in both lines. Once the script has executed successfully, run the ls command to check the files in your present working directory.

You will see an extra .tr file for every image file that you used for training, as well as six other files: eng.inttemp, eng.normproto, eng.pffmtable, eng.shapetable, eng.unicharset and, finally, eng.traineddata. Of course, the eng. prefix will change depending on the language of your text.

If you don’t see these six files, something has gone wrong. Go back and check where the error is; it is most likely a mistake in the script. Try to find and fix it.
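
One quick way to check is to list exactly the files the script should have produced; anything missing will show up as a “No such file or directory” error:

ls *.tr
ls eng.inttemp eng.normproto eng.pffmtable eng.shapetable eng.unicharset eng.traineddata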

This eng.traineddata file is what you have been trying to create: your trained Tesseract data file. It basically contains the data about the mistakes Tesseract made on your images, together with your corrections. Now you want to move this file to the tessdata folder.

However, tessdata already contains a file called eng.traineddata, built from thousands of images of English text; it is the language file Tesseract normally uses to detect English text in your images.

Here, you want to add capability to your Tesseract rather than replace it, so you should not remove or overwrite this file. Instead, while moving your eng.traineddata file, rename it to eng1.traineddata using the following command:

mv eng.traineddata /usr/share/tesseract-ocr/4.00/tessdata/eng1.traineddata

Here, mv stands for move, but because we give a new file name at the destination it also renames the file while moving it. The directory entered after eng.traineddata in the command above is wherever your tessdata folder is located.
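
If the command fails with a “Permission denied” error, the tessdata directory is probably owned by root (as it usually is for a system-wide install), in which case prefix the command with sudo:

sudo mv eng.traineddata /usr/share/tesseract-ocr/4.00/tessdata/eng1.traineddata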

Congratulations, your training of Tesseract is now complete. To use Tesseract with your custom training, type the following command to try it out:

tesseract eng.vivaldi.exp0.png stdout -l eng1

Remember to specify the language as eng1. This refers to your custom trained data file, which will be used to predict the characters in the image.
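
As far as I know, you can also ask Tesseract to consider the stock English data and your custom data together by joining the language codes with a plus sign, which is sometimes worth comparing against eng1 alone:

tesseract eng.vivaldi.exp0.png stdout -l eng+eng1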

You should train Tesseract only when it is making poor predictions in your specific use case, for example when you are using it to detect text in some very obscure font. In my case, I was making predictions on the Vivaldi font, an italic font on which stock Tesseract performs very poorly.

I hope you find this guide helpful for training Tesseract for your specific use case and font type, and that you are successful in following and implementing it.
