QUANTRIUM GUIDES
Installing and using Tesseract 4 on Ubuntu 18.04
Extracting information from scanned documents such as letters, write-ups, and invoices has become an integral part of many business processes. To accomplish this task, you need to set up OCR software that can extract the information from these scanned documents or PDFs.
Here we will take you through the process of building and installing Tesseract 4.x on your Ubuntu 18.04 machine. There are two ways to install Tesseract 4.x:
One is to install the Tesseract 4.0.0 beta version, which is easy and takes only a couple of commands.
Alternatively, you can install Tesseract 4.1.1, the latest stable release of Tesseract at the time of writing. In this post, we will guide you through installing each of them on your Ubuntu 18.04 machine.
If you are not familiar with build tools and building from GitHub repositories, installing Tesseract 4.0.0 beta is the better option for you. However, if you are experienced in building and installing applications from GitHub repositories, you can skip the next section and jump directly to the section on installing Tesseract 4.1.1.
Installing Tesseract 4.0.0 beta
Installing the Tesseract 4.0.0 beta version is quite simple and can be done using the following apt commands:
$ sudo apt install tesseract-ocr
$ sudo apt install libtesseract-dev
Once you have run these two commands, check whether tesseract was installed successfully by running the following command:
$ tesseract --version
After running this command, you should see something like this:
tesseract 4.0.0-beta.1
leptonica-1.75.3
You should see this, or something along those lines, if your installation was successful. If tesseract is not installed properly, you will get errors instead. In that case, check your operating system and its version: these commands work only on Ubuntu 18.04 or higher.
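If you want to script this check, the version line can be parsed. The sketch below assumes the first line of the output has the form "tesseract X.Y.Z" as shown above; the function name tesseract_major is only illustrative and is not part of Tesseract.

```shell
# Print the major version number from `tesseract --version` output read on stdin.
# Assumes the first line looks like "tesseract 4.0.0-beta.1".
tesseract_major() {
    sed -n '1s/^tesseract[[:space:]]*\([0-9][0-9]*\)\..*/\1/p'
}

# Example use (requires tesseract on PATH); 2>&1 captures both streams
# in case the version is printed on stderr:
#   tesseract --version 2>&1 | tesseract_major
```

If the function prints nothing, the first line did not match the expected form, which usually means tesseract is missing or printed an error.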
Once your tesseract installation is successful, you can run the following command to check which languages are supported by your installed version of tesseract:
$ tesseract --list-langs
You can expect the following output:
List of available languages (2):
eng
osd
Here eng means that it can recognize the English language, and osd means that it can detect orientation and script.
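A script can also test for one specific language before running OCR. This is a minimal sketch that assumes the output format shown above (a header line followed by one language code per line); has_lang is a hypothetical helper name.

```shell
# Succeed if the language code $1 appears in `tesseract --list-langs`
# output read on stdin. A whole-line, fixed-string match is used, so the
# header line "List of available languages (2):" never matches.
has_lang() {
    grep -qxF -- "$1"
}

# Example use (requires tesseract on PATH):
#   tesseract --list-langs 2>&1 | has_lang eng && echo "English is available"
```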
Congratulations! You have successfully installed Tesseract 4.0.0 beta on your system, and it is ready to use.
Installing Tesseract 4.1.1 on Ubuntu 18.04
In this section, we take you through the steps to build and install tesseract 4.1.1 from source, using tesseract's GitHub repository (tesseract-ocr/tesseract).
Before you start building tesseract 4.1.1 from source, you need to install a few dependencies. First, you have to install the leptonica library. It is a pedagogically-oriented open source library containing software that is broadly useful for image processing and image analysis applications. To learn more about leptonica, refer to Leptonica's website.
To install leptonica, use the following command:
$ sudo apt-get install -y libleptonica-dev
A further list of all the dependencies required by tesseract can be found here:
From this list, you will most likely not have the following dependencies:
automake
pkg-config
pango-devel
cairo-devel
icu-devel
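The gcc that Ubuntu provides offers C++11 support, so the compiler requirement is usually already met. If g++, autoconf, or libtool is not installed on your machine, install them with apt-get as well, since the build steps below need them. On Ubuntu, the pango, cairo, and icu development packages are named libpango1.0-dev, libcairo2-dev, and libicu-dev. You can use the following commands to install the above dependencies:
$ sudo apt-get update -y
$ sudo apt-get install -y automake
$ sudo apt-get install -y pkg-config
$ sudo apt-get install -y libpango1.0-dev
$ sudo apt-get install -y libicu-dev
$ sudo apt-get install -y libcairo2-dev
$ sudo apt-get install -y bc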
The last package, bc, is an extra dependency that is required to get tesseract 4 running on your machine.
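Before moving on, it can help to confirm that the build tools are actually on your PATH. The following is a minimal sketch; missing_commands is a hypothetical helper, and it checks only executables, not the development libraries.

```shell
# Print the name of every given command that is not found on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Example use:
#   missing_commands automake pkg-config bc
# The libraries can be checked separately through pkg-config, for example:
#   pkg-config --exists pango cairo icu-uc && echo "libraries found"
```

An empty result means every listed tool was found.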
Now you have to get the tesseract source code. But before you clone the repository, stop right there! First, go to the tesseract repository on GitHub.
Open the file named VERSION. You will see 5.0.0-alpha written there, which means that the tesseract version installed by using the makefile in this repository will be 5.0.0-alpha. This is not a stable release of tesseract; the stable release is 4.1.1 at the time of creation of this post.
To find the download link for the latest stable release of tesseract, look in the right sidebar for a section titled “Releases”; within it you will see the 4.1.1 release.
Click on the 4.1.1 release link. There you will find an Assets section with Source code (zip) and Source code (tar.gz). Copy the link and then download it using the following command:
$ wget https://github.com/tesseract-ocr/tesseract/archive/4.1.1.zip
You can download either the zip or the tar.gz file. Here I have downloaded the zip file. You can unzip the file to your current directory using the unzip command:
$ unzip 4.1.1.zip
Once the unzip operation completes, a folder titled tesseract-4.1.1 will have been created. Move into this directory using the cd command:
$ cd tesseract-4.1.1
If you list the files in this folder, you should see something like this:
abseil CONTRIBUTING.md java tessdata
appveyor.yml cppan.yml LICENSE tesseract.pc.cmake
AUTHORS doc m4 tesseract.pc.in
autogen.sh docker-compose.yml Makefile.am test
ChangeLog Dockerfile README.md unittest
cmake googletest snap VERSION
CMakeLists.txt INSTALL src
configure.ac INSTALL.GIT.md sw.cpp
Now you are ready to install tesseract. The different methods for doing so on various operating systems are described at this link:
https://github.com/tesseract-ocr/tesseract/blob/master/INSTALL.GIT.md
We are going to use autotools (Linux/UNIX, msys…) to do so.
You need to run the following commands from the tesseract-4.1.1 directory to install tesseract:
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install
$ sudo ldconfig
$ make training
$ sudo make training-install
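Since each step depends on the previous one, you may prefer to run them from a small script that stops at the first failure. This is only a sketch: build_tesseract and the DRY_RUN switch are hypothetical names introduced here, and the step list is copied from the commands above.

```shell
# Run the build steps in order, stopping at the first one that fails.
# DRY_RUN=1 (a hypothetical switch for this sketch) only prints the steps.
build_tesseract() {
    for step in "./autogen.sh" "./configure" "make" "sudo make install" \
                "sudo ldconfig" "make training" "sudo make training-install"; do
        echo "+ $step"
        [ "${DRY_RUN:-0}" = "1" ] && continue
        # Unquoted on purpose so that "sudo make install" splits into words.
        $step || { echo "failed: $step" >&2; return 1; }
    done
}

# Example use, from inside the tesseract-4.1.1 directory:
#   DRY_RUN=1 build_tesseract   # preview the steps
#   build_tesseract             # run them
```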
To check that tesseract has been installed successfully, run the following command:
$ tesseract --version
You should see output something like this:
tesseract 4.1.1
leptonica-1.75.3
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE
If the output differs substantially from the above, or you get an error, go back and check where you went wrong, or follow the steps again one by one.
The tessdata Folder
The tessdata folder in the tesseract directory is where tesseract looks for the language data that it needs to perform OCR on the input document.
For tesseract to work, you need at least one language. For the English language, you need a data file titled 'eng.traineddata'. You will also need another file titled 'osd.traineddata', which is used for orientation detection and is also required in the tessdata folder.
Unfortunately, these are not installed in this folder by default when we run the make command. You need to download them separately into this folder. You can check the content of the tessdata folder by using the ls command:
$ cd tessdata
$ ls
You will see output similar to the following:
configs eng.user-words Makefile.am pdf.ttf
eng.user-patterns Makefile Makefile.in tessconfigs
As you can see, both eng.traineddata and osd.traineddata are missing. Download eng.traineddata and osd.traineddata from the tesseract-ocr/tessdata repository on GitHub.
You can download them to your local system and then upload them to the tessdata folder, or you can download them directly using the wget command:
$ wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
$ wget https://github.com/tesseract-ocr/tessdata/raw/master/osd.traineddata
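A download from GitHub can silently save an HTML page instead of the model file, for instance when a /blob/ page URL is used rather than a /raw/ file URL. The sketch below is a rough sanity check for that situation; looks_like_html is a hypothetical helper, and it inspects only the first 512 bytes of the file.

```shell
# Succeed if the file starts like an HTML page rather than binary model data.
looks_like_html() {
    head -c 512 "$1" | grep -qiE '<!DOCTYPE html|<html'
}

# Example use, from inside the tessdata folder:
#   looks_like_html eng.traineddata && echo "eng.traineddata is a web page, download it again"
```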
Once you have successfully downloaded these files, you need to set your TESSDATA_PREFIX environment variable to the location of your tessdata directory. Use the export command to set the variable, adjusting the path to wherever you unzipped the source:
$ export TESSDATA_PREFIX=/content/tesseract-4.1.1/tessdata
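To confirm that the directory you exported really contains the two required files, a check along these lines may help. It is a sketch; missing_traineddata is a hypothetical helper, and the eng and osd file names come from the text above.

```shell
# Print each required .traineddata file that is missing from directory $1.
missing_traineddata() {
    for lang in eng osd; do
        [ -f "$1/$lang.traineddata" ] || echo "$lang.traineddata"
    done
}

# Example use:
#   missing_traineddata "$TESSDATA_PREFIX"
```

No output means both files are present.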
Now you can list the languages available to tesseract using the following command:
$ tesseract --list-langs
You should see the following output:
List of available languages (2):
eng
osd
If you want to use other languages, you can download their traineddata files to the tessdata folder and start using them.
Using Tesseract from Terminal
Tesseract has various wrappers, for example a Python wrapper named pytesseract. These wrappers give you access to tesseract from various programming languages. Here, we will be using tesseract through the command line.
To perform OCR on an image you can run the following command on the terminal with the path of image file on which you want to perform OCR:
$ tesseract <path_of_image> stdout
In the above command, path_of_image is the location of the image that you want to test tesseract with. Once you do so, you should see the recognized text printed right in the command line.
In my case, pardit was the text present in my image, so I was able to successfully use tesseract to extract text from my image file.
Saving Tesseract Output to a File
If you want to save the output of tesseract to a text file, you can use the following command. Note that the second argument is the output base name; tesseract appends the .txt extension itself:
$ tesseract <path_of_image> output
Here, the output will be stored in the output.txt file in your present working directory.
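Because tesseract adds the .txt extension to the output name on its own, a convenient output name for an image is its path without the extension. The sketch below shows one way to derive it; output_base is a hypothetical helper and assumes the file name has an extension.

```shell
# Derive the output base for an image by dropping its extension,
# e.g. scans/invoice.png -> scans/invoice (tesseract would then write
# scans/invoice.txt). Assumes the file name contains an extension.
output_base() {
    printf '%s\n' "${1%.*}"
}

# Example use (requires tesseract on PATH):
#   for img in ./*.png; do tesseract "$img" "$(output_base "$img")"; done
```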
Running Tesseract on Multiple Files
Sometimes we want to extract text from multiple images or documents. To accomplish this, you can give Tesseract a text file as input that contains the absolute paths of all the images that you want to perform OCR on, one file per line.
For example, let’s say you have two photos called handwritten_photo_1.png and handwritten_photo_2.png, with some text in them, in the /usr/share/ directory. Let’s create a file named input.txt with the following content:
/usr/share/handwritten_photo_1.png
/usr/share/handwritten_photo_2.png
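A list like this can also be generated instead of typed by hand. This is a minimal sketch that assumes all images are .png files sitting directly in one directory; list_images is a hypothetical helper.

```shell
# Print the absolute path of every .png file directly inside directory $1,
# one per line, in the shell's glob (name) order.
list_images() {
    dir=$(cd "$1" && pwd) || return 1
    for f in "$dir"/*.png; do
        # Skip the unexpanded pattern when the directory has no .png files.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# Example use:
#   list_images /usr/share > input.txt
```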
Suppose you want to store the contents of these two handwritten photos in a text file, say output.txt. Since tesseract appends .txt to the output name, you have to run the following command:
$ tesseract input.txt output
output.txt will have the OCR contents of both handwritten_photo_1.png and handwritten_photo_2.png, in that order. When you open and view the content of output.txt, you will see that the extracted text of each image is set off by a page separator symbol (a form feed character).
So in my output, Viral Calic is the prediction for the first image, CY am the king of the world the prediction for the second image, Com and Serr the prediction for the third image, and so on.
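To pull the text of a single image back out of the combined file, you can split on that symbol. The sketch below assumes the symbol is the form feed character that Tesseract 4 writes after each page by default (its page_separator parameter); print_page is a hypothetical helper.

```shell
# Print the text of page number $1 from tesseract text output file $2,
# treating the form feed character as the page separator.
print_page() {
    awk -v n="$1" 'BEGIN { RS = "\f" } NR == n { printf "%s", $0 }' "$2"
}

# Example use:
#   print_page 2 output.txt   # text recognized in the second listed image
```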
You can explore the usage of tesseract further at the following two links:
I hope you were able to follow the guide and were able to install and use Tesseract on your Ubuntu 18.04 machine.