QUANTRIUM GUIDES

Installing and using Tesseract 4 on Ubuntu 18.04

Bharath Sivakumar
Published in Quantrium.ai
Aug 15, 2020

Today, extracting information from scanned documents such as letters, write-ups, and invoices has become an integral part of many business processes. To accomplish this, you need to set up OCR software that extracts the information from these scanned documents or PDFs.

Here we will take you through the process of building and installing Tesseract 4.x on your Ubuntu 18.04 machine. There are two ways to install Tesseract 4.x:

One is to install the Tesseract 4.0.0 beta version. It is easy to install and takes only a couple of commands.

Alternatively, you can install Tesseract 4.1.1, the latest stable release of Tesseract. In this post, we will show you how to install each of them on your Ubuntu 18.04 machine.

If you are not familiar with build tools and building from GitHub repositories, then installing Tesseract 4.0.0 beta is the better way for you. However, if you are experienced in building and installing applications from GitHub repositories, you can skip the next section and jump directly to the section Installing Tesseract 4.1.1.

Installing Tesseract 4.0.0 beta

Installing the Tesseract 4.0.0 beta version is quite simple and can be done using the following apt commands:

$ sudo apt install tesseract-ocr
$ sudo apt install libtesseract-dev

Once you have run these two commands, check whether you have successfully installed Tesseract by running the following command:

$ tesseract --version

After running this command, you should see something like this:

tesseract 4.0.0-beta.1 
leptonica-1.75.3

If your installation was successful, the output will be along these lines. If Tesseract is not installed properly, you will get errors; in that case, check your operating system and its version. These commands install Tesseract 4 only on Ubuntu 18.04 or higher.
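If you would rather check the version from a script than by eye, the following is a minimal sketch. The function name tesseract_major_version is our own, and it assumes the first line of the version output has the form "tesseract X.Y.Z" as shown above.

```shell
# Print the major version number found in the text of `tesseract --version`.
# The text is read from stdin, so the function can be tried out even on a
# machine where Tesseract is not installed.
tesseract_major_version() {
    # The first line looks like "tesseract 4.0.0-beta.1"; keep the digits
    # that come before the first dot.
    sed -n '1s/^tesseract \([0-9][0-9]*\)\..*/\1/p'
}

if command -v tesseract >/dev/null 2>&1; then
    # Merge stderr into stdout in case a build prints the version there.
    major=$(tesseract --version 2>&1 | tesseract_major_version)
    if [ "$major" = "4" ]; then
        echo "Tesseract 4 is installed"
    else
        echo "Found Tesseract major version: ${major:-unknown}"
    fi
else
    echo "tesseract is not on PATH"
fi
```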

Once your tesseract installation is successful, you can run the following command to check which languages are supported by your installed version of tesseract:

$ tesseract --list-langs

You can expect the following output:

List of available languages (2):
eng
osd

Here eng means that Tesseract can recognize English, and osd means that it can perform orientation and script detection.
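If a script depends on a particular language being available, it can test for it before running OCR. This is a small sketch; has_lang is a hypothetical helper name, and it assumes the language list has the one-header-line layout shown above.

```shell
# Succeed if the language code given as $1 appears in the output of
# `tesseract --list-langs`, which is supplied on stdin.
has_lang() {
    # Skip the "List of available languages (N):" header line, then look
    # for an exact, whole-line match of the language code.
    tail -n +2 | grep -qxF -- "$1"
}

if command -v tesseract >/dev/null 2>&1; then
    if tesseract --list-langs 2>&1 | has_lang eng; then
        echo "English data is available"
    else
        echo "English data is missing"
    fi
fi
```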

Congratulations! You have successfully installed Tesseract 4.0.0 beta on your system and it is ready to use.

Installing Tesseract 4.1.1 on Ubuntu 18.04

In this section, we take you through the steps to build and install Tesseract 4.1.1 from the source code in Tesseract’s GitHub repository (https://github.com/tesseract-ocr/tesseract).

Before you start building Tesseract 4.1.1 from source, you need to install a few dependencies. First, you have to install the Leptonica library. It is a pedagogically-oriented open source library containing software that is broadly useful for image processing and image analysis applications. To know more about Leptonica, refer to its website:

http://www.leptonica.org/

To install leptonica, use the following command:

$ sudo apt-get install -y libleptonica-dev

A further list of all the dependencies required by Tesseract can be found in the project's compilation instructions.

From this list, most likely you will not have the following dependencies:

automake 
pkg-config
pango-devel
cairo-devel
icu-devel

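Ubuntu 18.04's gcc supports C++11, so that requirement is covered once g++ is installed (the build-essential package provides it if it is missing). The autogen.sh script used later also needs libtool, so it is installed here along with automake. On Ubuntu, the Pango development files are in the libpango1.0-dev package. You can use the following commands to install the above dependencies:

$ sudo apt-get update -y
$ sudo apt-get install -y automake libtool
$ sudo apt-get install -y pkg-config
$ sudo apt-get install -y libpango1.0-dev
$ sudo apt-get install -y libicu-dev
$ sudo apt-get install -y libcairo2-dev
$ sudo apt-get install -y bc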

The last package, bc, is a command-line calculator rather than a library. It is an extra dependency that is required to get Tesseract 4 running on your machine.
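Before moving on, you may want to confirm that the packages really are installed. The sketch below assumes a Debian-based system where dpkg-query is available; missing_packages is a name made up for this example, and the package list in the call is only a sample that you should extend with whatever else you installed.

```shell
# Print every package from the argument list that dpkg does not report as
# installed. Prints nothing when all of them are present.
missing_packages() {
    for pkg in "$@"; do
        # ${Status} is "install ok installed" for an installed package.
        # If dpkg-query fails (unknown package, or dpkg-query itself is
        # absent), the status stays empty and the package counts as missing.
        status=$(dpkg-query -W -f='${Status}' "$pkg" 2>/dev/null)
        case "$status" in
            "install ok installed"*) ;;
            *) echo "$pkg" ;;
        esac
    done
}

missing_packages automake pkg-config libleptonica-dev libicu-dev libcairo2-dev bc
```

Any name printed by the last line still needs to be installed with apt-get.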

Now you have to get the Tesseract source code. But before cloning the repository, open it on GitHub and look at the file named VERSION. You will see 5.0.0-alpha written there, which means that building from the current state of the repository installs Tesseract 5.0.0-alpha. That is not a stable release; at the time of writing, the latest stable release is 4.1.1.

To find the download link for the latest stable release, look in the right sidebar of the repository page for the section titled “Releases”, where you will see the 4.1.1 release.

Tesseract GitHub Repository

Click on the 4.1.1 release link. In its Assets section you will find Source code (zip) and Source code (tar.gz). Copy the link of either one and download it using the following command:

$ wget https://github.com/tesseract-ocr/tesseract/archive/4.1.1.zip

You can download either the zip or the tar.gz file; here I have downloaded the zip file. Unzip it into your current directory using the unzip command:

$ unzip 4.1.1.zip

Once the unzip operation completes, a folder titled tesseract-4.1.1 will have been created. Move into this directory using the cd command:

$ cd tesseract-4.1.1
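Since the whole point of downloading the release archive is to avoid building 5.0.0-alpha by accident, it can be worth checking the VERSION file of the unpacked source. This is a minimal sketch; check_source_version is a hypothetical helper, and it assumes the VERSION file holds nothing but the version string.

```shell
# Compare the VERSION file inside source directory $1 with the expected
# version string $2.
check_source_version() {
    src_dir=$1
    expected=$2
    if [ ! -f "$src_dir/VERSION" ]; then
        echo "no VERSION file in $src_dir" >&2
        return 2
    fi
    # Drop any whitespace or line ending around the version string.
    found=$(tr -d ' \t\r\n' < "$src_dir/VERSION")
    if [ "$found" = "$expected" ]; then
        echo "source version is $found"
    else
        echo "expected $expected but found $found" >&2
        return 1
    fi
}

# Example, run from inside the unpacked directory:
#   check_source_version . 4.1.1
```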

If you list the files in this folder, you should see something like this:

abseil          CONTRIBUTING.md     java         tessdata
appveyor.yml    cppan.yml           LICENSE      tesseract.pc.cmake
AUTHORS         doc                 m4           tesseract.pc.in
autogen.sh      docker-compose.yml  Makefile.am  test
ChangeLog       Dockerfile          README.md    unittest
cmake           googletest          snap         VERSION
CMakeLists.txt  INSTALL             src
configure.ac    INSTALL.GIT.md      sw.cpp

Now you are ready to install Tesseract. The different installation methods for various operating systems are described at this link:
https://github.com/tesseract-ocr/tesseract/blob/master/INSTALL.GIT.md
We are going to use the autotools (LINUX/UNIX , msys…) method.

You need to run the following commands from the tesseract-4.1.1 directory to install Tesseract:

$ ./autogen.sh
$ ./configure
$ make
$ sudo make install
$ sudo ldconfig
$ make training
$ sudo make training-install
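When the commands are typed one at a time it is easy to miss a failing step and carry on regardless. The sketch below wraps the same seven commands in a function that stops at the first failure. The names run and build_tesseract, and the DRY_RUN switch, are our own inventions for this example and are not part of Tesseract's build system.

```shell
# Run a command, or only print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Run the build steps in order; the && chain stops at the first step
# that fails, and the function returns that failure status.
build_tesseract() {
    run ./autogen.sh &&
    run ./configure &&
    run make &&
    run sudo make install &&
    run sudo ldconfig &&
    run make training &&
    run sudo make training-install
}

# Preview the steps without running them:
#   DRY_RUN=1 build_tesseract
# Run the real build from inside the tesseract-4.1.1 directory:
#   build_tesseract
```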

To check that tesseract has been installed successfully, run the following command:

$ tesseract --version

You should see output something like this:

tesseract 4.1.1
leptonica-1.75.3
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE

If the output differs substantially from the above or you get an error, go back and check where you went wrong, or follow the steps again one by one.

The tessdata Folder

The tessdata folder in the tesseract directory is where Tesseract looks for the language data that it needs to perform OCR on the input document.

For Tesseract to work, you need at least one language. For English, you need a data file titled 'eng.traineddata'. You will also need another file titled 'osd.traineddata', which is used for orientation and script detection and must also be in the tessdata folder.

Unfortunately, these files are not placed in this folder when we run the make command; you need to download them separately into it. You can check the contents of the tessdata folder using the ls command:

$ cd tessdata
$ ls

You will see output similar to the following:

configs            eng.user-words  Makefile.am  pdf.ttf
eng.user-patterns  Makefile        Makefile.in  tessconfigs

As you can see, both eng.traineddata and osd.traineddata are missing. Download them from the tessdata repository (https://github.com/tesseract-ocr/tessdata).

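You can download them to your local system and then upload them to the tessdata folder, or you can download them directly using the wget command. Note that the URLs must point at the raw files; a /blob/ URL returns GitHub's web page for the file instead of the file itself. If the repository's default branch has been renamed since this was written, adjust master in the URLs accordingly.

$ wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
$ wget https://github.com/tesseract-ocr/tessdata/raw/master/osd.traineddata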
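A download from GitHub can silently save an HTML page instead of the binary data file when the URL points at a web view of the file, and Tesseract then fails to load the language. The following is a rough sanity check, not a real validation of the traineddata format; looks_like_html is a hypothetical helper that only inspects the first bytes of the file.

```shell
# Succeed if file $1 starts like an HTML page rather than binary model data.
looks_like_html() {
    # Look at the first 512 bytes only, lower-cased, for an HTML opening.
    head -c 512 "$1" | LC_ALL=C tr 'A-Z' 'a-z' |
        LC_ALL=C grep -q -e '<!doctype html' -e '<html'
}

# Check the two files in the current directory (run this inside tessdata).
for f in eng.traineddata osd.traineddata; do
    if [ ! -s "$f" ]; then
        echo "$f: missing or empty"
    elif looks_like_html "$f"; then
        echo "$f: looks like an HTML page, download the raw file again"
    else
        echo "$f: looks OK"
    fi
done
```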

Once you have successfully downloaded these files, set the TESSDATA_PREFIX environment variable to the location of your tessdata directory. The path below is where the source was unzipped on my machine; replace it with your own. Use the export command to set the variable:

$ export TESSDATA_PREFIX=/content/tesseract-4.1.1/tessdata
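A mistyped path in TESSDATA_PREFIX is a common reason for Tesseract failing to find its languages. As a quick check, you could confirm that the directory really holds the two files. This is only a sketch; check_tessdata_dir is a made-up helper name, and falling back to the current directory when the variable is unset is our own choice.

```shell
# Report whether directory $1 holds the two data files used in this guide.
# Returns 0 only when both are present and non-empty.
check_tessdata_dir() {
    dir=$1
    rc=0
    for f in eng.traineddata osd.traineddata; do
        if [ -s "$dir/$f" ]; then
            echo "found $f"
        else
            echo "missing $f"
            rc=1
        fi
    done
    return "$rc"
}

check_tessdata_dir "${TESSDATA_PREFIX:-$PWD}" ||
    echo "download the missing files before running tesseract"
```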

Now you can list the languages in your tesseract using the following command:

$ tesseract --list-langs

You should see the following output:

List of available languages (2):
eng
osd

If you want to use other languages, you can download them to the tessdata folder and start using them.

Using Tesseract from Terminal

Tesseract has various wrappers, for example a Python wrapper named pytesseract, which give you access to Tesseract from various programming languages. Here, we will use Tesseract through the command line.

To perform OCR on an image, run the following command in the terminal with the path of the image file:

$ tesseract <path_of_image> stdout

In the above command, path_of_image is the location of the image that you want to test Tesseract with, and stdout tells Tesseract to print the result to the terminal. You should see the recognized text right in the command line.

In my case the output was pardit, which was the text present in my image, so I was able to successfully use Tesseract to extract text from my image file.

Saving Tesseract Output to a File

If you want to save the output of Tesseract to a text file, pass an output name instead of stdout. Tesseract appends the .txt extension itself, so give the name without it:

$ tesseract <path_of_image> output

Here, the output will be stored in the file output.txt in your present working directory.

Running Tesseract on Multiple Files

Sometimes we want to extract text from multiple images or documents. To accomplish this, you can give Tesseract a text file as input that contains the absolute paths of the images you want to perform OCR on, one path per line.

For example, let’s say you have two photos called handwritten_photo_1.png and handwritten_photo_2.png, with some text in them, in the /usr/share/ directory. Let’s create a file named input.txt with the following content:

/usr/share/handwritten_photo_1.png
/usr/share/handwritten_photo_2.png
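For more than a handful of images, the list can be generated rather than typed by hand. The sketch below prints the absolute path of every .png file in a directory; list_images is a hypothetical helper, and restricting it to the .png extension is an assumption made to match the example above.

```shell
# Print the absolute path of every .png file directly inside directory $1,
# one per line, in the shell's sorted glob order.
list_images() {
    # Resolve the directory to an absolute path inside the command
    # substitution, so the caller's working directory is left untouched.
    abs_dir=$(cd -- "$1" >/dev/null && pwd -P) || return 1
    for img in "$abs_dir"/*.png; do
        # When nothing matches, the pattern stays literal; skip it.
        if [ -f "$img" ]; then
            printf '%s\n' "$img"
        fi
    done
}

# Example:
#   list_images /usr/share > input.txt
```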

Suppose you want to store the contents of these two handwritten photos in a text file named output.txt. As before, pass the output name without the extension, since Tesseract adds .txt itself:

$ tesseract input.txt output

output.txt will have the OCR contents of both handwritten_photo_1.png and handwritten_photo_2.png, in that order. When you open output.txt, you will see that the text extracted from each image is separated by a symbol (a form feed character, which Tesseract writes as its page separator) like this:

Tesseract output of an input text file with 5 lines of image locations

So in this case, Viral Calic is the prediction for the first image, CY am the king of the world is the prediction for the second image, Com and Serr is the prediction for the third image, and so on.
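The separator symbol makes it possible to split the combined file back into one block per image. The sketch below assumes the separator is a form feed character, which is Tesseract's default page separator; split_pages is a hypothetical helper name, and the header line it prints is our own formatting.

```shell
# Split combined Tesseract output (read from stdin) into numbered blocks,
# using the form feed character as the record separator.
split_pages() {
    awk 'BEGIN { RS = "\f" }
         # Only print records that contain something other than whitespace.
         /[^ \t\r\n]/ {
             sub(/^[ \t\r\n]+/, "")
             sub(/[ \t\r\n]+$/, "")
             # NR is the position of the record, so an image with no
             # recognized text is skipped without shifting the numbering.
             printf "--- image %d ---\n%s\n", NR, $0
         }'
}

# Example:
#   split_pages < output.txt
```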

You can explore the usage of Tesseract further on the following two links:

I hope you were able to follow the guide and were able to install and use Tesseract on your Ubuntu 18.04 machine.
