Tesseract OCR With JNA Wrapper

Charangan Vasantharajan
Published in The Startup
9 min read · Dec 3, 2020

What is OCR?

Before studying Tesseract, we need to know what OCR (Optical Character Recognition) is. In simple terms, OCR is the process of extracting text, as plain text or hOCR, from a document such as an image (JPG, JPEG, TIFF) or a PDF. An OCR engine first locates the text in the file, then recognizes its characters in the subsequent steps and returns the result. This is how OCR lets machines read the text in a given file.

Introduction to Tesseract

Tesseract is an open-source text recognition (OCR) engine written in C/C++. It works on Windows, macOS, and Linux and is released under the Apache 2.0 License. Development began at Hewlett-Packard in 1985, and the engine was released as open source in 2005. From 2006, Google sponsored its development and maintenance.

It can be used directly from the command line or through an API to extract printed text from files, supports Unicode (UTF-8), and recognizes more than 100 languages, which you can refer to here. The Tesseract project does not have a built-in GUI application. If you need a GUI, choose one of the available third-party tools such as VietOCR, OCR2Text, dpScreenOCR, or NeOCR. You can read more about third-party projects here.

Developers can use the Tesseract API to build their own applications under the provided license. To use Tesseract from other programming languages, you need a Tesseract wrapper. I am going to bind Tesseract with Java in my example, so I use Tess4J as my JNA wrapper.

Latest Release — Tesseract 4.1.1

With the move from version 3 to 4.0, Tesseract added a new OCR engine based on LSTM neural networks for text recognition. It provides a better working experience on x86/Linux, improves accuracy and speed, and produces output in several formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, and TSV.

The latest version of Tesseract source code is available from the master branch on GitHub.

Tesseract Wrapper — Tess4J

Tess4J is a Java JNA wrapper for the Tesseract OCR API. It is released under the Apache 2.0 License and is available from SourceForge and the Maven Central Repository. The library provides OCR support for TIFF, JPEG, GIF, PNG, and BMP image formats, multi-page TIFF images, and PDF documents.

See the Tess4J documentation for v4.4, the version I am going to use in my example.

Tesseract Competitors

ABBYY FineReader - It was developed at ABBYY in 1993

Features:

  1. Supports more than 190 languages
  2. Outputs to MS Office, RTF, HTML, searchable PDF, and plain text
  3. Can integrate through Java SDK
  4. Custom training supported
  5. Accuracy 0.9 out of 10

Google Cloud Vision API - This API was launched in 2016 by Google

Features:

  1. Supports more than 50 languages
  2. Google Cloud Vision API renders their outputs as JSON
  3. Can integrate through REST API, and can integrate with Google Images and Google SafeSearch
  4. Custom training not supported
  5. Accuracy 0.8 out of 10

Setup a Linux Environment for Tesseract

There are two steps to set up the environment: installing the Tesseract OCR engine, and installing the trained data files.

Tesseract Installation

The Tesseract OCR engine package is generally called “tesseract-ocr”. On Ubuntu Focal 20.04, you can install Tesseract 4.1.1 and its developer tools from the terminal.

Install Tesseract OCR Engine

We also need to install the development package “libtesseract-dev”, which provides the library and headers for working with the Tesseract OCR engine.

Install Developer Tools for Tesseract

Then we need to install the prebuilt Tesseract binaries (supported languages and scripts), which are available for Linux distributions through snapd. If snapd is not installed yet, run the command below first.

Install Snapd

Then run

Install Tesseract Built Binaries

You can find other installation methods for various Operating systems here.
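Since the rest of this article works in Java, here is a minimal sketch for confirming from Java that the `tesseract` command is on the PATH after installation. It assumes the first line of `tesseract --version` has the form `tesseract <version>`; the class and method names are made up for this illustration. Note that Tess4J loads the Tesseract library through JNA and does not call this command, so the check only confirms that the package was installed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class TesseractVersionCheck {

    // Assumption: the version line looks like "tesseract 4.1.1".
    private static final Pattern VERSION =
            Pattern.compile("^tesseract\\s+(\\S+)", Pattern.MULTILINE);

    /** Extracts the version string from the text printed by "tesseract --version". */
    public static Optional<String> parseVersion(String output) {
        Matcher m = VERSION.matcher(output);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Runs "tesseract --version"; returns empty if the command is not available. */
    public static Optional<String> installedVersion() {
        try {
            Process p = new ProcessBuilder("tesseract", "--version")
                    .redirectErrorStream(true) // read stdout and stderr together
                    .start();
            String output;
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
                output = r.lines().collect(Collectors.joining("\n"));
            }
            p.waitFor();
            return parseVersion(output);
        } catch (IOException e) {
            return Optional.empty(); // command not found or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(installedVersion()
                .map(v -> "Found Tesseract " + v)
                .orElse("Tesseract not found on PATH"));
    }
}
```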

Traineddata Installation

Tesseract 4 contains a new neural network-based recognition engine that delivers significantly higher accuracy but requires significantly more training data. Training it ourselves would normally take a few days to a couple of weeks, so I am going to use an existing traineddata set that was trained on about 400,000 text lines spanning about 4,500 fonts.

There are three sets of traineddata files compatible with Tesseract 4.0 and later: tessdata, tessdata_best, and tessdata_fast. I chose tessdata for my installation because it is the only set that also supports the legacy recognizer, and it offers a balance between the speed of tessdata_fast and the accuracy of tessdata_best. Below I provide some language links to get their traineddata.

English: Download

Tamil: Download

Sinhala: Download

You can also get other languages’ traineddata sets from here.
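A missing or misplaced traineddata file is a common reason for OCR to fail at startup, so it can help to check the download before running anything. The sketch below assumes each language is stored as `<code>.traineddata` (for example `eng.traineddata`) in a single directory, and that several languages are combined with `+` (for example `eng+tam`), as Tesseract's language option accepts. The class name, method names, and default directory are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class TessdataCheck {

    /** Splits a language string such as "eng+tam" into its language codes. */
    public static List<String> parseLanguages(String languages) {
        List<String> codes = new ArrayList<>();
        for (String part : languages.split("\\+")) {
            String code = part.trim();
            if (!code.isEmpty()) {
                codes.add(code);
            }
        }
        return codes;
    }

    /** Returns the language codes that have no "<code>.traineddata" file in the directory. */
    public static List<String> missingLanguages(Path tessdataDir, String languages) {
        List<String> missing = new ArrayList<>();
        for (String code : parseLanguages(languages)) {
            if (!Files.isRegularFile(tessdataDir.resolve(code + ".traineddata"))) {
                missing.add(code);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical defaults; pass your own directory and languages as arguments.
        Path dir = Paths.get(args.length > 0 ? args[0] : "tessdata");
        String langs = args.length > 1 ? args[1] : "eng";
        List<String> missing = missingLanguages(dir, langs);
        System.out.println(missing.isEmpty()
                ? "All traineddata files found in " + dir
                : "Missing traineddata for: " + missing);
    }
}
```

Running it with the directory that holds your downloads and a string such as `eng+tam+sin` lists any language whose file is not there.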

We have just finished the installation part!

Let’s create a simple Java Project in IntelliJ IDEA

Create a new Maven-based project and add an empty class. After creating the class, my folder structure looks like this.

Creating a Project based on Maven
Folder Structure after creating the project

Add the Tess4J dependency below (groupId net.sourceforge.tess4j, artifactId tess4j) inside the dependencies element of your project’s pom.xml file so that Maven downloads the library.

Add Maven Dependency to pom.xml

After adding the above dependency, my pom.xml file will be like this.

Modified pom.xml

Once we have an empty class, we can start adding some code to it. Here is the import statement needed to instantiate the Tesseract object (the class net.sourceforge.tess4j.Tesseract).

Import Tess4J for our Project File

We also need to import TesseractException, from the same package, so that we can handle errors raised while recognizing the text.

Import Tess4J Exception

I am going to recognize files stored on my local computer, so I also need to import java.io.File to handle them.

Import Java File Lib

Now, my IntelliJ IDEA view is:

After all packages are imported

I will define a static method inside the class and, inside this method, create a new instance of Tesseract from the Maven library.

New Instance of Tesseract from the Maven Library

Next, I will tell this instance where the traineddata can be found and which language to use. As I mentioned earlier, I have downloaded the traineddata for English and stored it on my Desktop. So add two lines of code after the instance creation inside the method: one calling setDatapath with the traineddata directory and one calling setLanguage with the language code.

Note: The path will differ depending on where you stored your traineddata file.

Set Traineddata Path

Finally, return the instance. Now my screen will be:

Return Plain Text

The above method returns the recognized text as plain text. If you want the result as HTML, you need to tell Tesseract to produce its output in the hOCR format. hOCR is an open format that embeds the recognized text together with layout information, such as bounding boxes, in HTML markup.

We can switch to hOCR output by adding the statement below (in Tess4J 4.x, a call to setHocr(true)) above the return statement.

Add hOCR Output Method
Return hOCR Format
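Once the output is hOCR, the usual next step is to pull the words and their positions back out of the markup. The sketch below does this with a regular expression and assumes the shape Tesseract typically emits: one `<span class='ocrx_word' ... title='bbox x0 y0 x1 y1; ...'>` element per word, with `class` before `title` and `bbox` first inside the title. The `HocrWords` and `Word` names are hypothetical, and a real HTML parser would be more robust than a regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HocrWords {

    /** One recognized word and its bounding box in image pixels. */
    public static final class Word {
        public final String text;
        public final int left, top, right, bottom;

        Word(String text, int left, int top, int right, int bottom) {
            this.text = text;
            this.left = left;
            this.top = top;
            this.right = right;
            this.bottom = bottom;
        }

        @Override
        public String toString() {
            return text + " [" + left + "," + top + " - " + right + "," + bottom + "]";
        }
    }

    // Assumed layout: <span class='ocrx_word' ... title='bbox x0 y0 x1 y1; ...'>word</span>
    private static final Pattern WORD = Pattern.compile(
            "<span[^>]*class=['\"]ocrx_word['\"][^>]*title=['\"]bbox (\\d+) (\\d+) (\\d+) (\\d+)"
                    + "[^'\"]*['\"][^>]*>(.*?)</span>",
            Pattern.DOTALL);

    /** Returns every ocrx_word element found in the hOCR text, in document order. */
    public static List<Word> extract(String hocr) {
        List<Word> words = new ArrayList<>();
        Matcher m = WORD.matcher(hocr);
        while (m.find()) {
            // Drop nested tags such as <strong> or <em>, then decode basic entities.
            String text = unescape(m.group(5).replaceAll("<[^>]+>", "")).trim();
            words.add(new Word(text,
                    Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)),
                    Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4))));
        }
        return words;
    }

    private static String unescape(String s) {
        return s.replace("&lt;", "<").replace("&gt;", ">")
                .replace("&quot;", "\"").replace("&#39;", "'")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // Made-up sample fragment in the assumed hOCR shape.
        String sample = "<span class='ocrx_word' id='word_1_1' "
                + "title='bbox 36 92 96 116; x_wconf 96'>Google</span>";
        for (Word w : extract(sample)) {
            System.out.println(w);
        }
    }
}
```

Passing the hOCR string returned by the recognition call to `HocrWords.extract` gives a list of words with coordinates, which is the information plain-text output throws away.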

Finally, I put the main function below to make it usable and call the static method inside it.

Tesseract Main Method

Now we have to give Tesseract a file that it can parse and read text from. In this example, I am going to use an image stored on my Ubuntu Desktop.

So, with the input file path known, I create a new File instance for that path, pass the file to Tesseract for recognition with doOCR, and add a Java print statement to show the output. Finally, your code will look like this:

With the main function

Even though everything is set up correctly, IntelliJ IDEA may still show some errors in the files. This only means that we have to reload the project so that all dependency packages and sources are loaded.

Right-click on pom.xml file -> Maven -> Reload Project

Now, the problem is solved.

No errors after reloading the project

If we look closely, we did not have to write much code. That is the power of this wrapper around the Tesseract library. Now we are ready to run our sample OCR program.

Note: Confirm that your project is set to the latest version of Java like Java 11. If you get an error regarding the Java version, you can simply set it to the latest version.

Go to File -> Project Structure.

You will see the dialog box shown below. Make sure that the Java version under the Project and Modules sections is set to a recent version of Java, such as Java 11.

Project Setting -> Project

Project

Project Setting -> Modules

Module

Also check the Java Compiler setting by pressing Ctrl + Alt + S. A dialog box pops up; navigate to Java Compiler as shown below and check that the target bytecode version for your project is set higher than 1.5.

Target bytecode

Fine. Now it is time to take our sample application for a test drive, so I am going to use Google’s logo as my sample input.

Note: If you are working with images, prefer JPG/JPEG over PNG. In my experience, Tesseract reads PNG images poorly.
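If your input is only available as PNG, one option is to convert it to JPG before OCR. Here is a minimal sketch using only the JDK; the class and method names are hypothetical. Drawing the image onto a white background is my own choice here: JPEG has no alpha channel, so any transparency has to be flattened before writing.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class PngToJpg {

    /** Draws the image onto an opaque white canvas, dropping any alpha channel. */
    public static BufferedImage flattenOnWhite(BufferedImage src) {
        BufferedImage rgb = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = rgb.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, rgb.getWidth(), rgb.getHeight());
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return rgb;
    }

    /** Reads an image file (for example a PNG) and writes it as a JPEG file. */
    public static void convert(File input, File jpgOutput) throws IOException {
        BufferedImage src = ImageIO.read(input);
        if (src == null) {
            throw new IOException("Unsupported or unreadable image: " + input);
        }
        if (!ImageIO.write(flattenOnWhite(src), "jpg", jpgOutput)) {
            throw new IOException("No JPEG writer available");
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PngToJpg input.png output.jpg
        convert(new File(args[0]), new File(args[1]));
    }
}
```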

Sample input: Google logo
source: https://www.putnamlib.org/images/google.jpg/@@images/image.jpeg

Once I run the project, I get this simple output:

Output

For your reference, all the source code of the sample application is available on my GitHub.

Limitations of Tesseract

Tesseract works well only on clean foreground text. For instance, if an image has noise in the background, the OCR result will not be accurate. There are many reasons for poor-quality output from Tesseract. Images that meet the minimum requirements for size, contrast, and lighting are recognized better. Lower-quality images need preprocessing to improve the recognition results: scale the image appropriately, increase the contrast as much as possible, and align the text horizontally. Apart from the following limitations, Tesseract is a powerful OCR engine.

  • Lower-quality input images may produce poor-quality OCR results.
  • The results carry no information about the font family of the text.
  • It doesn’t work well with files that have artifacts such as partial occlusion, distorted perspective, or a complex background.
  • It is poor at recognizing handwritten text.
  • When the input file contains languages other than those passed as an argument to the set-language call, the results are of poor quality.
  • Two-column documents often fail in the OCR test. Tesseract does not always follow the natural reading order of the input, so the output can read like text joined across columns.
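The preprocessing steps mentioned above (scaling and raising contrast) can be sketched with the JDK's own imaging classes. This is only an illustration under simple assumptions: the class and method names are hypothetical, the scale factor and threshold are values you would tune per image, and a fixed global threshold is the crudest form of binarization. Adaptive methods usually cope better with uneven lighting.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class OcrPreprocess {

    /** Returns a copy of the image resized by the given factor (2.0 doubles both sides). */
    public static BufferedImage scale(BufferedImage src, double factor) {
        int w = Math.max(1, (int) Math.round(src.getWidth() * factor));
        int h = Math.max(1, (int) Math.round(src.getHeight() * factor));
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return out;
    }

    /**
     * Converts the image to pure black and white: pixels whose luminance is
     * at or above the threshold (0-255) become white, all others black.
     */
    public static BufferedImage binarize(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the common luma weights.
                int luminance = (299 * r + 587 * g + 114 * b) / 1000;
                out.setRGB(x, y, luminance >= threshold ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return out;
    }
}
```

A typical use would be `OcrPreprocess.binarize(OcrPreprocess.scale(image, 2.0), 128)`, with the result saved to a file or passed to Tess4J's `doOCR(BufferedImage)` overload.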

Conclusion

In this article, we have covered Tesseract, Tess4J, and some related topics, and we have built a very simple OCR application on top of the Tesseract engine that reads text from files in various formats, such as PDFs and images (preferably JPG/JPEG rather than PNG).

Beyond this, we have to pre-process the input file before recognition to overcome the limitations of Tesseract and get better results. So, I will write about Tesseract pre-processing in my next article.

Charangan Vasantharajan
MASc @ McMaster University | Former Research Intern @ NTU, Singapore