Introduction to Tesseract OCR

Published in The Zeals Tech Blog, Dec 5, 2020

Hi, this is Day 5 of the Zeals Advent Calendar 2020. Details of the Advent Calendar are written here: https://medium.com/zeals-tech-blog/zeals-engineers-are-going-to-host-an-advent-calendar-4c94ad46575d. Please check it out and find the other interesting stories there!

Hi everyone, my name is Bismo, and I work as a Backend Engineer at Zeals. At Zeals, I mostly take care of microservices. In this article, I want to share my experience having fun with an engine. What engine? Let’s jump into it!!

First of all, have you ever had to move text from your documents into an editable text format? Doing it manually takes a lot of time and effort. We need something to make the process efficient. This article tackles that problem by playing with OCR!!

Introduction

Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene-photo (for example the text on signs and billboards in a landscape photo) or from subtitle text superimposed on an image (for example: from a television broadcast). [source]

To better understand how OCR works, see the process diagram in the following picture. From the end user’s side, the OCR process is very simple: we give it an image and get editable text back.

Library

There are various OCR tools, not only paid services (Google, Amazon, Azure, etc.) but also open source libraries, and one of them is Tesseract. In this playground, we will run some experiments using the Tesseract engine to extract text from images in several cases. But wait, what is Tesseract? It sounds like the object from the Avengers movies :)

Tesseract is an optical character recognition engine for various operating systems. It is free software, released under the Apache License. Originally developed by Hewlett-Packard as proprietary software in the 1980s, it was released as open source in 2005 and development has been sponsored by Google since 2006. [source]

Tesseract has 37.4k stars and 6.9k forks on GitHub (as of 28 Nov 2020) and is still maintained. That’s a good reason to try this engine.

Installation

Follow the official Tesseract GitHub page to install the package on your system. Once the package is installed successfully, you will be able to run the tesseract command in your terminal (I’m using a Mac).

$ tesseract
Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]
OCR options:
  -l LANG[+LANG]        Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.
Single options:
  --help                Show this help message.
  --help-extra          Show extra help for advanced users.
  --version             Show version information.
  --list-langs          List available languages for tesseract engine.

In this article, my Tesseract version is 4.1.1. You can check the version by running the tesseract --version command.

$ tesseract --version
tesseract 4.1.1
 leptonica-1.80.0
  libgif 5.2.1 : libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found FMA
 Found SSE

Supported Files

Based on the information I found and my own testing of each file type, these are the image formats the Tesseract engine can read:

JPG
PNG
GIF
PNM
TIFF

Unfortunately, the Tesseract engine can’t read PDF files. I tested a PDF file, but it failed with the error messages below.

Error in pixReadStream: Pdf reading is not supported
Error in pixRead: pix not read

For PDF files, we need to convert them to one of the supported formats above before extracting text with Tesseract.
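One possible way to do that conversion is sketched below. It is not something I used in the experiments: it assumes the pdftoppm tool from Poppler is installed, and the file names document.pdf and page are hypothetical placeholders. The function only builds the command, so you can inspect it before running it.

```python
import shutil
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, out_prefix, dpi=300):
    """Build a pdftoppm command that renders each PDF page to a PNG file.

    Assumption: Poppler's pdftoppm is installed; it is not part of Tesseract.
    """
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(out_prefix)]


# Hypothetical file names, used only for illustration.
command = pdftoppm_command("document.pdf", "page")
print(" ".join(command))

# Only run the conversion when both the tool and the input file exist.
if shutil.which("pdftoppm") and Path("document.pdf").exists():
    subprocess.run(command, check=True)
```

Each generated PNG can then be passed to tesseract like any other supported image.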

Testing

Now that our Tesseract environment is ready, let’s test it with this simple Lorem ipsum text image. As we can see, the image quality is pretty good and clear, so extracting it to editable text should not cause any issues.

Extracting the text using Tesseract is quite simple. We just need to run tesseract IMAGE_PATH OUTPUT. Here is an explanation of the command.

tesseract : The main command.
IMAGE_PATH : The location of the input image.
OUTPUT : The output parameter. Use stdout to print the output in the terminal, or /path/to/txt to write a txt file.

The stdout output will look like this.

$ tesseract lorem.jpg stdout
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 376
Lorem ipsum dignis eium fugit aspelluptat eati odis
net exeri rectem lia venihil icipsapid qui dolupient quam
aceatemque repedi tem lantiae provid quia sitatia temqui-
buste voluptatquo comnit fugiat invenit fugia quo ditias ipi-
tatat erspel id utesecuptur solorio. Hari veria dis nis et millic
tota comnimet inctor sum aut laboressedis deribus illore non
et fugitiosam, soluptat in eatur? Endignatem el et ex endicim
re occullu ptatem laut audipita num fugit adis delenti cum
sus aut iduntur arumqui blam eos molorum quissimint, nis ut
aut adisci odic te as everspe ditati accum alit rempel iumque
nobis repudamus.......
Lupieniendis aut volorerio. Daectot aectestium latem
volenimus ut velis alibus ulliciis aceaquam derovid elitatet
que vel minctume iusam dolupti venis am fugit etus vellit re
viducid quiatquiant volestisti torehen tiorestem antis militas
nes del ilicaturibus et et est harum, ipsant, natem quos es
ipsa velit est, es re volupta temolorum este explant.
Pore vid est, audam facia voluptiae pos ut que nullo-
ria core nihilita istio tem quiscia volore nulliae corpos eatur

Generating a txt file prints messages similar to the stdout run in the terminal, and we will also find the txt file at the path we defined.

$ tesseract lorem.jpg /Users/momo/Desktop/lorem
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 376

Picture 4 Lorem Ipsum Extracting Txt Output
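If you drive this from a script, it helps to know where the text file ends up. The sketch below relies on the fact that Tesseract appends .txt to the output base for its default text output. The helper names are my own, not part of Tesseract.

```python
from pathlib import Path


def tesseract_to_file_command(image_path, output_base):
    """Build the command that writes OCR text to <output_base>.txt."""
    return ["tesseract", str(image_path), str(output_base)]


def text_output_path(output_base):
    """Return the file Tesseract is expected to create for this output base."""
    # Append instead of Path.with_suffix so a base like "scan.v2" stays intact.
    return Path(str(output_base) + ".txt")


print(tesseract_to_file_command("lorem.jpg", "/Users/momo/Desktop/lorem"))
print(text_output_path("/Users/momo/Desktop/lorem"))
```

After running the command, the text can be read with text_output_path(...).read_text(encoding="utf-8").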

As the output shows, the text extracted from the image is perfectly correct!! Then, what if the image is scanned from a real document? I have such a document in the following picture. There is some noise in the image. Let’s check whether the extracted text is still good.

$ tesseract toefl_score.jpg stdout
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 435
Centre for Language Development
Institute of Educational Development and Quality Assurance
Yogyakarta State University
ProTEFL Score Report
No. 1553.b/M/P2B-LPPMP.UN Y/ViJ/2015

Wow, out of 171 characters, only about 3 were wrong, so we got roughly 98% correct!! That is understandable because, in the picture above, the UNY/VII part isn’t very clear, which makes it ambiguous for the engine.
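The percentage is simple arithmetic: correct characters divided by total characters. Here is a minimal sketch using the counts from this experiment (the function name is just for illustration).

```python
def character_accuracy(total_chars, wrong_chars):
    """Fraction of characters that were recognized correctly."""
    return (total_chars - wrong_chars) / total_chars


# 171 characters in the document, about 3 recognized incorrectly.
accuracy = character_accuracy(171, 3)
print(f"{accuracy:.1%}")  # prints 98.2%
```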

Non-Latin Scripts

Another concern when extracting text is non-Latin scripts. Mostly, we use Latin script when writing text. But what about languages that rarely use Latin script, for example Japanese, Korean and Chinese?

According to its documentation, Tesseract supports extracting text with a language option. First, we need to check the list of languages we have with the command tesseract --list-langs.

$ tesseract --list-langs
List of available languages (3):
eng
osd
snum

If the language is not available in our list, we should add it. For example, we will try to extract Japanese text. Because I’m a Mac user, I will use the brew command to add more languages: brew install tesseract-lang. After the installation has finished, I should be able to see jpn in tesseract --list-langs.
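The same check can be scripted. The sketch below parses text in the format printed by tesseract --list-langs and tests whether a language code is present. The sample string is the output shown earlier, and the function name is hypothetical. I assume the first line is always a header starting with "List of available", which matches the output on my machine.

```python
def parse_list_langs(output):
    """Return the language codes found in `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Skip the header line, e.g. "List of available languages (3):".
    return [line for line in lines if not line.startswith("List of available")]


sample = """List of available languages (3):
eng
osd
snum
"""

langs = parse_list_langs(sample)
print(langs)
print("jpn" in langs)  # False until the Japanese language data is installed
```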

Let’s try the non-Latin script experiment!! Here are some simple Japanese characters, and we will try to extract them.

If we just use the standard command like in the previous experiment, we will get the wrong result.

$ tesseract arigatou.jpg stdout
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 556
Detected 9 diacritics
HYUPESCCWOES

How do we fix it? Actually the command is quite similar; we just need to put the language parameter at the end of the command: tesseract arigatou.jpg stdout -l jpn. And we successfully get the Japanese characters!!

$ tesseract arigatou.jpg stdout -l jpn
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 556
Detected 9 diacritics
ありがとうございます

Next case: what if the text mixes both Latin and non-Latin scripts? Here is an example.

Picture 7 Latin and Non Latin Characters

Sure, we can still do it by mixing the languages as well!! We just need to add the other language with +, like this: tesseract arigatou.jpg stdout -l jpn+eng

$ tesseract arigatou.jpg stdout -l jpn+eng
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 379
Detected 9 diacritics
ありがとうございます
arigatou gozaimasu: thank you
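To call the same command from a script, a small wrapper like the one below could work. It is a sketch, not an official API: it assumes the tesseract binary is on PATH and the requested language data is installed, and the function names are my own. It also relies on the warnings going to stderr while the recognized text goes to stdout, so only the text is returned.

```python
import subprocess


def ocr_command(image_path, langs=("eng",)):
    """Build a tesseract command that prints the recognized text to stdout."""
    # Multiple languages are joined with "+", e.g. ("jpn", "eng") -> "jpn+eng".
    return ["tesseract", str(image_path), "stdout", "-l", "+".join(langs)]


def ocr_text(image_path, langs=("eng",)):
    """Run tesseract and return only the recognized text.

    Assumption: `tesseract` is on PATH and the language data is installed.
    """
    result = subprocess.run(
        ocr_command(image_path, langs),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    # Diagnostics such as the DPI warning are on result.stderr.
    return result.stdout.strip()


print(ocr_command("arigatou.jpg", ("jpn", "eng")))
```

With Tesseract installed, ocr_text("arigatou.jpg", ("jpn", "eng")) would return the extracted text as a string.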

Result

We ran several experiments with images in various conditions to see how well the Tesseract engine works. In the end, we got this result matrix. Sorry, I can’t attach the images because they contain personal information.

From the result above, our tentative conclusion is that the result really depends on the condition of the scanned document, and also on the style of the document.

Speed

Then how fast is Tesseract’s execution time? We will try various cases to find out what makes it faster or slower. For these runs, I will use the trap command to show the time for each command.

$ trap 'echo -e "\nStarted at: $(date)\n"' DEBUG
$ pwd
Started at: Wed Dec  2 10:13:19 WIB 2020
/Users/momo

As we can see, a timestamp is printed before each command. We need to modify the command (it’s a little bit tricky) to know exactly how much time it needs.

$ trap 'echo -e "\nStarted at: $(date)\n"' DEBUG
$ tesseract arigatou.jpg stdout -l jpn && echo "finished"
Started at: Wed Dec  2 10:33:36 WIB 2020
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 556
Detected 9 diacritics
ありがとうございます
Started at: Wed Dec  2 10:33:36 WIB 2020
finished

From the result above, we can see that the execution time is under 1 second, because the first and second commands started within the same second. Finally, after several speed tests, we got this result.

These may not be 100% valid reasons for the execution time being fast or slow, but image size and character type did affect the time. For example, the 1st and 2nd tries use the same image at different sizes, and the bigger one takes more time than the smaller one.

In another case, with the same character count and almost equal size, Japanese characters (non-Latin script) take around 6x as long as Latin script.
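The trap approach only has one-second resolution. For finer measurements, a script can time the command directly. This is a sketch of an alternative, not how the numbers above were produced: it measures wall-clock time around a subprocess call, and the demo times the Python interpreter itself so it runs even without Tesseract installed.

```python
import subprocess
import sys
import time


def time_command(command):
    """Run a command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(
        command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=True,
    )
    return time.perf_counter() - start


# Demo with a command that is always available: the current interpreter.
elapsed = time_command([sys.executable, "-c", "pass"])
print(f"{elapsed:.3f} s")

# For OCR, assuming tesseract and the jpn data are installed:
# time_command(["tesseract", "arigatou.jpg", "stdout", "-l", "jpn"])
```

Running the same command several times and averaging the durations gives a more stable comparison between images.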

FYI, I’m using a MacBook with this specification for the tests.

MacBook Pro (13-inch, 2017, Two Thunderbolt 3 ports)
Processor 2,3 GHz Dual-Core Intel Core i5
Memory 8 GB 2133 MHz LPDDR3
macOS Catalina Version 10.15.7 (19H15)

Programming Language Package

How do we use the Tesseract engine from a specific programming language? Actually, there are a lot of libraries in multiple languages that work with Tesseract, for example.

We can pick one of the libraries above, based on the programming language we use, to simplify the implementation.

Conclusion

After having fun with Tesseract OCR, I can say that the engine is amazing!! Here is a list of Tesseract’s interesting points, in my opinion:

Open source.
Easy to use.
Good extraction results.
Supports multiple languages (Latin & non-Latin).

If you are facing an issue and think OCR could be your solution, Tesseract would be nice to try! I hope this article is useful for you, thank you!!