Training Tesseract for labels, receipts and such


So you decided to take up OCR scanning for your project?

- Good for you!

However intimidating the official documentation might look, it’s actually quite satisfying once you’re done and have a very fast and accurate scanner in your app.

For me, the official doc did tell me everything I needed to know about training Tesseract, but it took me quite some time to find a good way to approach the training and the best tools for my purpose. So I decided to write a more condensed guide for anyone else who needs to train with bad or insufficient data.

If you do have a good font file to install on your computer, you can get away with far less than this; there are even automatic training tools available out there.

But if you end up with just some receipts, a label from a label printer or something, where you just can’t create good material from a real font that you can install on your computer, this guide is for you.

This entire document is meant to be a step by step guide, and you will need to complete each step to make it work.

Some things will be explained in more detail, and some will just be a command for you to type into your terminal.

It does help to read this entire document once before actually starting.

1. Install Tesseract on your computer

This step will be different depending on how you like to work and what system you are on.

I used brew to install Tesseract on my Mac.

!Important note

Treat ALL WARNINGS as ERRORS when you train. If you get a warning somewhere, it means that the final build WILL NOT WORK and you have to redo the entire process from the failing step, and chances are that you won’t remember where it went wrong.

So if anything gives you a warning, fix it and run that step again!

2. Get your TIFs in order

First of all, you need some sample data to train the scanner with. Try to get images as clean as possible, shape them up with Photoshop or equivalent software, make the background white and the text black.

The more samples you have — the better.

At a minimum, every character that you need to recognize has to appear in the images.

Cut out everything else from the image and put the characters on a single line.

Make sure the characters are separated by enough space that they don’t bleed into each other.

Create at least five different images containing the characters, each in a different order.

Remember that Tesseract will try to learn to recognize words. If you plan to scan codes, be careful not to put the characters in the same order in every sample.

Create a folder somewhere on your filesystem to keep all the training files in one place.

Name the images using the pattern [language].[fontname].exp[samplenumber].tif, for example:

eng.strangelabelmachinefont.exp0.tif

Example image with a phone number
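If you already have a pile of cleaned-up TIFs with random names, a small shell helper can put them into the naming scheme above. This is just a sketch: rename_samples is a made-up function name, and it assumes the files are already TIFs and that you are happy with the sample numbers following the order of the arguments.

```shell
# rename_samples LANG FONT FILE...
# Hypothetical helper: renames the given images in place to
# LANG.FONT.exp0.tif, LANG.FONT.exp1.tif, ... in argument order.
rename_samples() {
  lang=$1
  font=$2
  shift 2
  n=0
  for src in "$@"; do
    mv -- "$src" "$(dirname -- "$src")/$lang.$font.exp$n.tif"
    n=$((n + 1))
  done
}

# Example (not run here):
#   rename_samples eng strangelabelmachinefont scan_a.tif scan_b.tif
```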

3. Create the box files

Now, for each of the sample files, run Tesseract to create the box files.

A box file is a register of all the characters that Tesseract recognizes and at which position that character is.

Open up that good ol’ terminal and, for each of the TIFs, type in:

tesseract [language].[fontname].exp[samplenumber].tif [language].[fontname].exp[samplenumber] batch.nochop makebox

or as in our case:

tesseract eng.strangelabelmachinefont.exp0.tif eng.strangelabelmachinefont.exp0 batch.nochop makebox
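With more than a couple of samples, typing this for every file gets old quickly. Here is a minimal sketch of a loop that runs the same makebox command on every .tif in a folder. make_boxes is a hypothetical helper name, and it assumes all training images live in one directory and end in .tif.

```shell
# make_boxes DIR
# Hypothetical helper: runs the makebox command from this step on every
# .tif in DIR and stops at the first file where tesseract fails.
make_boxes() {
  for tif in "$1"/*.tif; do
    [ -e "$tif" ] || continue        # no .tif files: the glob stays literal
    base=${tif%.tif}                 # strip the extension for the output name
    tesseract "$tif" "$base" batch.nochop makebox || return 1
  done
}

# Example (not run here):
#   make_boxes .
```

The loop only stops on a hard failure, so you still have to read the output of each run for warnings yourself.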

4. Correct the box files

Now it’s time to take a look at the box files we created in the previous step.

+ 27 21 57 50 0

4 65 20 89 55 0

? 92 20 116 55 0

5 119 23 142 58 0

8 147 23 171 58 0

0 175 23 199 58 0

0 202 22 226 57 0

1 228 22 243 57 0

0 248 22 271 57 0

2 274 22 299 57 0

0 302 22 326 57 0

0 330 22 354 57 0

2 358 22 382 57 0

6 388 22 413 56 0

6 417 21 441 56 0

The leftmost character on each row is the character that Tesseract thinks it found.

The rest are the box coordinates in pixels: left, bottom, right and top, measured from the bottom-left corner of the image, followed by the page number.

As you can see it made a mistake with the character “7” and guessed it to be a “?”.
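Before opening an editor, it can help to get a quick overview of how big each box is, since a box that is much narrower or wider than its neighbours is probably one to look at first. The sketch below assumes each row has the layout "symbol left bottom right top page"; box_sizes is a made-up helper name.

```shell
# box_sizes FILE
# Hypothetical helper: prints each symbol followed by the width and
# height of its box, assuming rows of "symbol left bottom right top page".
box_sizes() {
  awk '{ printf "%s %d %d\n", $1, $4 - $2, $5 - $3 }' "$1"
}

# Example (not run here):
#   box_sizes eng.strangelabelmachinefont.exp0.box
```

For the first row of the file above, "+ 27 21 57 50 0", this prints "+ 30 29".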

The positioning of the characters would be VERY hard to guess unless you have an amazing talent for imagining pixels in your head.

Luckily there are some tools available to help you with this step.

The only tool that I found to work and/or to be useful is jTessBoxEditor.

You can get it here: http://vietocr.sourceforge.net/training.html

Correct the characters that were wrong, and make sure that each box surrounds its entire character and sits in the right place. If not, correct the values in the top row.

Once you are happy and done, press save and move on to the next file.

5. Training time

Now that you have some good boxes, it’s time to start the actual training of the scanner.

For each of your TIF/box pairs, run the following command:

tesseract [language].[fontname].exp[samplenumber].tif [language].[fontname].exp[samplenumber] box.train

or

tesseract eng.strangelabelmachinefont.exp0.tif eng.strangelabelmachinefont.exp0 box.train
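Since training only makes sense for images that have a corrected box file next to them, a loop for this step can check for the pair first. Again a sketch under assumptions: train_boxes is a hypothetical name, and the .tif and .box files are expected to sit side by side in one folder.

```shell
# train_boxes DIR
# Hypothetical helper: runs box.train for every .tif in DIR, but refuses
# to continue if an image has no matching .box file.
train_boxes() {
  for tif in "$1"/*.tif; do
    [ -e "$tif" ] || continue        # no .tif files: the glob stays literal
    base=${tif%.tif}
    if [ ! -f "$base.box" ]; then
      echo "missing box file for $tif" >&2
      return 1
    fi
    tesseract "$tif" "$base" box.train || return 1
  done
}

# Example (not run here):
#   train_boxes .
```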

6. Create the unicharset file

Run unicharset_extractor with each of the box files as a parameter:

unicharset_extractor eng.strangelabelmachinefont.exp0.box eng.strangelabelmachinefont.exp1.box…

You will probably not need to edit the resulting unicharset file, unless you are on some strange old system like Windows 95.

7. Create the font_properties file

Create a new file and name it font_properties, which is the name the commands in the following steps refer to.

In this file, create a row for each font you are using in your training files.

If you are like me, trying to scan a receipt or label with a strange unknown font, you will likely just need one row.

Each row starts with the name of the font, then it will have a boolean value for each of the possible font properties.

<fontname> <italic> <bold> <fixed> <serif> <fraktur>

Example:

strangelabelmachinefont 0 0 1 0 0

Important — Make sure to add an extra line break at the end of the file.
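If you would rather not fight your text editor over that last line break, you can let printf write the file. A minimal sketch, where write_font_properties is a made-up helper and the font name is the one used in the sample file names earlier in this guide:

```shell
# write_font_properties OUTFILE FONT ITALIC BOLD FIXED SERIF FRAKTUR
# Hypothetical helper: writes a single font row to OUTFILE. The "\n" in
# the format string gives the file the line break it must end with.
write_font_properties() {
  printf '%s %s %s %s %s %s\n' "$2" "$3" "$4" "$5" "$6" "$7" > "$1"
}

# Example (not run here):
#   write_font_properties font_properties strangelabelmachinefont 0 0 1 0 0
```

For more than one font, you would append further rows to the same file instead of overwriting it.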

8. Clustering

Time to cluster all the features of the trained font.

Enter the following in the terminal:

shapeclustering -F font_properties -U unicharset [language].[fontname].exp0.tr [language].[fontname].exp1.tr…

9. Shapetable

Enter the following in the terminal:

mftraining -F font_properties -U unicharset [language].[fontname].exp0.tr [language].[fontname].exp1.tr…

10. Normproto

Enter the following in the terminal:

cntraining [language].[fontname].exp0.tr [language].[fontname].exp1.tr…
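Steps 8 to 10 all take the same list of .tr files, so a shell glob can save you from typing them out three times. The following is only a sketch: cluster_fonts is a hypothetical name, and it assumes you run it from the training folder, with font_properties, unicharset and all the .tr files in the current directory.

```shell
# cluster_fonts
# Hypothetical helper: runs the commands from steps 8 to 10 on every
# .tr file in the current directory, stopping at the first failure.
cluster_fonts() {
  set -- ./*.tr                      # collect the .tr files as arguments
  if [ ! -e "$1" ]; then
    echo "no .tr files in the current directory" >&2
    return 1
  fi
  shapeclustering -F font_properties -U unicharset "$@" || return 1
  mftraining -F font_properties -U unicharset "$@" || return 1
  cntraining "$@"
}

# Example (not run here):
#   cd /path/to/training/folder && cluster_fonts
```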

11. unicharambigs

This file is created manually, and is supposed to contain a list of commonly mistaken characters and what to substitute for them.

If you need it, please read the official manual. When training to scan a code, it is usually not necessary to put anything in it.

The file does need to be there, however, so go ahead and create it and name it

language.unicharambigs

Put “v1” on the first row if you are using an older version of Tesseract for some reason, and “v2” if you are running version 3.03 or higher.

Also put a blank line at the end of the file.

12. Wrapping everything up

Now you’re good to go ahead and create the final training file that will be used in your app unless you had ANY errors prior to this step. (You did read the warning in the beginning of the document, right?)

If you take a look in your training folder now, it will contain lots of new files. Rename each of the generated files that doesn’t have a language prefix (unicharset, shapetable, normproto, inttemp and pffmtable) so that it does:

language.filename

Then run the final command

combine_tessdata lang.

Do NOT leave out the dot at the end of the language name.

Now grab that smoking fresh baked traineddata file and put it in your project.

language.traineddata
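The renaming and the final command from this step can be sketched as one small function. finish_training is a hypothetical name, and the list of files to rename (unicharset, shapetable, normproto, inttemp, pffmtable) is my assumption about what the earlier steps leave behind without a prefix, so check your own folder before trusting it.

```shell
# finish_training LANG
# Hypothetical helper: gives the generated files in the current directory
# a "LANG." prefix and then runs combine_tessdata with the trailing dot.
finish_training() {
  lang=$1
  for f in unicharset shapetable normproto inttemp pffmtable; do
    if [ -f "$f" ]; then
      mv -- "$f" "$lang.$f"
    fi
  done
  combine_tessdata "$lang."
}

# Example (not run here):
#   finish_training eng
```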

Useful Links

Official guide:

https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

jTessboxEditor:

http://vietocr.sourceforge.net/training.html