Optical Character Recognition (OCR) for Low-Resource Languages with Tesseract 4.0

Isuri Anuradha
Oct 4, 2019

This blog discusses popular techniques and steps used in Optical Character Recognition. The first part gives a brief introduction to OCR and Tesseract, and the second part covers Sinhala font training. Optical character recognition (also called optical character reader) is the electronic or mechanical conversion of images of typed, printed or handwritten text into machine-encoded text. The following are the main types of OCR found in practice.

  • Optical Character Recognition: processes one character at a time and targets typewritten documents and text.
  • Optical Word Recognition: processes one word at a time and targets typewritten documents and text.
  • Intelligent Character Recognition: processes one character at a time and targets handwritten documents.
  • Intelligent Word Recognition: processes one word at a time and targets handwritten documents.

What is “Tesseract”?

Tesseract is an open-source optical character recognition engine, released under the Apache License 2.0, that reads text from documents (e.g. PDF, JPG or PNG images).

Evolution of Tesseract

Tesseract was initially developed by Hewlett-Packard in the 1980s, and Google took over the project in 2006. As an open-source project, the first version of Tesseract relied on connected component analysis and treated recognition as a two-pass process. In later versions, Google extended its OCR knowledge base to further languages, including low-resource ones, so that new characters could be recognized.

Evolution of Tesseract

Tesseract 4.0 brings the following improvements:

  • A deep learning model: a Long Short-Term Memory (LSTM) neural network.
  • A neural network subsystem configured as a text line recognizer.
  • Improved full layout analysis, table detection, equation detection, language model, segmentation search and handwriting models.

Tesseract system architecture and word recognizer

The following images describe the architecture of the Tesseract system and the word recognition process.

Tesseract is nominally a pipeline, but not strictly one, because later stages frequently revisit earlier decisions.

System architecture of Tesseract

The input to the system is a grayscale or colour image. Adaptive thresholding is applied first (it helps to clean up dirty images and coloured backgrounds) and produces a binary image. Connected component analysis then extracts the character outlines, from which the text lines and the words in the image are determined. The character outlines are then arranged and organized into words. As the last step, Tesseract recognizes each word and passes it on as the output of the system.
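The first two stages can be illustrated with a small, self-contained Python sketch. It is not Tesseract's implementation: the local-mean threshold, the `window` and `offset` parameters, the function names and the tiny sample image are all assumptions chosen for illustration. It binarizes a grayscale grid and then returns the bounding box of every connected blob of ink.

```python
from collections import deque


def adaptive_threshold(gray, window=5, offset=10):
    """Mark a pixel as ink (1) when it is darker than its local mean minus offset."""
    height, width = len(gray), len(gray[0])
    radius = window // 2
    binary = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            rows = range(max(0, y - radius), min(height, y + radius + 1))
            cols = range(max(0, x - radius), min(width, x + radius + 1))
            neighbourhood = [gray[j][i] for j in rows for i in cols]
            local_mean = sum(neighbourhood) / len(neighbourhood)
            if gray[y][x] < local_mean - offset:
                binary[y][x] = 1
    return binary


def connected_components(binary):
    """Return sorted bounding boxes (x0, y0, x1, y1) of 8-connected ink regions."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            queue = deque([(x, y)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        inside = 0 <= nx < width and 0 <= ny < height
                        if inside and binary[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return sorted(boxes)


# Hypothetical 4x7 grayscale image: light background (200), two dark blobs (20).
gray = [
    [200, 200, 200, 200, 200, 200, 200],
    [200, 20, 20, 200, 200, 20, 200],
    [200, 20, 20, 200, 200, 20, 200],
    [200, 200, 200, 200, 200, 200, 200],
]
binary = adaptive_threshold(gray)
boxes = connected_components(binary)
print(boxes)
```

Each box roughly corresponds to one character outline. Grouping the boxes into lines and words is the job of the later stages described above.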

Tesseract word recognizer

Challenges in OCR

  1. Font specificity: an engine understands only a limited number of fonts and page formats.
  2. Differences in the character bounding boxes.
  3. Extraction of unreliable features.

So let’s move on to the second part of the blog.

Training a Sinhala font using Tesseract 4.0

Prerequisites:

Install all additional libraries needed to run Tesseract 4.0. Refer to link [1] to install all the libraries.

Build all the training tools required for Tesseract 4.0 [2].

Download and install the required data files [3], [4] and [5].

Process of training:

Follow the steps below to train Tesseract 4.0 on a new font.

01. Generate training data

rm -rf train/*
tesstrain.sh --fonts_dir fonts \
  --fontlist 'Noto Sans Sinhala' \
  --lang sin \
  --linedata_only \
  --langdata_dir langdata_lstm \
  --tessdata_dir tesseract/tessdata \
  --save_box_tiff \
  --maxpages 10 \
  --output_dir train

fonts_dir (directory that contains the font files)
fontlist (name of the font that you are trying to train)
lang (language code of the training data; the language data for the font should be in the sin folder)
langdata_dir (location of the language data)
tessdata_dir (location of the tessdata provided by Google)
save_box_tiff (saves the tif and box files along with the lstmf files)
maxpages (limits the page count when the corpus is large)
output_dir (location where the output data is saved)
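Before moving on, it can help to check that the generated list file really points at files on disk. The sketch below assumes that train/sin.training_files.txt holds one lstmf path per line, resolved relative to the current directory. The helper name `missing_training_files` is made up for this example.

```python
from pathlib import Path


def missing_training_files(listfile):
    """Return the paths named in the list file that do not exist on disk."""
    lines = Path(listfile).read_text(encoding="utf-8").splitlines()
    # Skip blank lines; every other line is treated as one file path.
    paths = [Path(line.strip()) for line in lines if line.strip()]
    return [path for path in paths if not path.is_file()]


# Example call, to be run from the directory that contains train/:
# print(missing_training_files("train/sin.training_files.txt"))
```

An empty result suggests that every listed file was generated. Any path it returns is worth checking before starting the training.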

02. Extract the generated model

combine_tessdata -e tesseract/tessdata/sin.traineddata sin.lstm

03. Evaluate the extracted model on the ‘Noto Sans Sinhala’ training data

lstmeval --model sin.lstm \
  --traineddata tesseract/tessdata/sin.traineddata \
  --eval_listfile train/sin.training_files.txt
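To get a feel for what a character error rate means, here is a minimal sketch of the textbook definition: the edit distance between the reference text and the recognized text, divided by the reference length. This is an assumption used for illustration, so the numbers reported by lstmeval may be computed differently. The function names are made up for this example.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in code points."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def char_error_rate(reference, hypothesis):
    """Character error rate in percent, relative to the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


# One wrong letter in a short Sinhala word.
print(char_error_rate("සිංහල", "සිංහළ"))
```

Python iterates over Unicode code points, so a Sinhala letter and its vowel sign count as separate characters here.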

04. Decrease the error rate (by fine-tuning)

rm -rf output/*
OMP_THREAD_LIMIT=8 lstmtraining \
  --continue_from sin.lstm \
  --model_output output/pubg \
  --traineddata tesseract/tessdata/sin.traineddata \
  --train_listfile train/sin.training_files.txt \
  --max_iterations 400

model_output (location and base name under which the output is saved)
traineddata (location of the trained data provided by Google)
train_listfile (file that lists the training data files)
max_iterations (number of training iterations of the neural network)

More iterations may reduce accuracy because of overfitting, but they may also improve it. The best number of iterations has to be found by checking the results and adjusting.
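One simple way to pick the iteration count is to evaluate the model after several different values of max_iterations and keep the one with the lowest error. The sketch below assumes you have noted down the evaluation error for each run yourself. The numbers are made up for illustration and are not measurements from this post, and `best_checkpoint` is a hypothetical helper.

```python
def best_checkpoint(eval_errors):
    """Return (iterations, error) with the lowest error; ties go to fewer iterations."""
    if not eval_errors:
        raise ValueError("no evaluation results given")
    return min(eval_errors.items(), key=lambda item: (item[1], item[0]))


# Made-up evaluation errors (percent), keyed by the max_iterations value used.
eval_errors = {100: 4.1, 200: 2.7, 400: 1.9, 800: 2.4, 1600: 3.2}
iterations, error = best_checkpoint(eval_errors)
print(iterations, error)
```

In this made-up series the error falls and then rises again as the iteration count grows, which is the overfitting pattern described above.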

05. Combine the fine-tuned model with the trained model

lstmtraining --stop_training \
  --continue_from output/pubg_checkpoint \
  --traineddata tesseract/tessdata/sin.traineddata \
  --model_output output/pubg.traineddata

06. Re-evaluate the fine-tuned model on the ‘Noto Sans Sinhala’ training data

lstmeval --model output/pubg_checkpoint \
  --traineddata tesseract/tessdata/sin.traineddata \
  --eval_listfile train/sin.training_files.txt

07. Copy the trained model into the tesseract/tessdata folder and run the recognition.
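This last step can be sketched in Python as follows. The sketch assumes the model file produced in step 05 (output/pubg.traineddata), so the language name passed to the tesseract command is the file name without its extension. The image name page.png and both helper names are hypothetical.

```python
import shutil
from pathlib import Path


def install_model(model_path, tessdata_dir):
    """Copy a .traineddata file into the tessdata folder and return the new path."""
    target = Path(tessdata_dir) / Path(model_path).name
    shutil.copyfile(model_path, target)
    return target


def recognition_command(image, output_base, model_path, tessdata_dir):
    """Build the tesseract command line that uses the installed model."""
    lang = Path(model_path).stem  # "pubg" for pubg.traineddata
    return ["tesseract", str(image), str(output_base),
            "--tessdata-dir", str(tessdata_dir), "-l", lang]


# Building the command has no side effects, so it is safe to print and inspect.
command = recognition_command("page.png", "page",
                              "output/pubg.traineddata", "tesseract/tessdata")
print(" ".join(command))

# To actually run it (needs tesseract installed and the files in place):
# import subprocess
# install_model("output/pubg.traineddata", "tesseract/tessdata")
# subprocess.run(command, check=True)
```

With these arguments the recognized text would be written to page.txt.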

Results after training the model

Font-level accuracy of the base model and the fine-tuned model. The selected fonts are Noto Sans Sinhala and LKLUG.

Evaluation of Noto Sans Sinhala

Error rate before fine-tuning the model (Noto Sans Sinhala)
Error rate after fine-tuning the model (Noto Sans Sinhala)

Evaluation of LKLUG

Error rate before fine-tuning the model (LKLUG)
Error rate after fine-tuning the model (LKLUG)
