Tesseract OCR With JNA Wrapper

What is OCR?

Before studying Tesseract, we need to know what OCR (Optical Character Recognition) is. In simple terms, OCR is the process of extracting plain text or hOCR from a document such as an image (JPG, JPEG, TIFF) or a PDF. An OCR engine first locates the text in the file, then reads and recognizes its characters and returns them. This is how OCR helps machines identify the text in a given file.

Introduction to Tesseract

Tesseract is an open-source OCR engine written in C/C++ that runs on Windows, macOS, and Linux and is released under the Apache License 2.0. It was originally developed at Hewlett-Packard starting in 1985 and was released as open source in 2005. Google has sponsored its development and maintenance since 2006.

It can be used directly or through an API to extract printed text from files, supports Unicode (UTF-8), and recognizes more than 100 languages. The Tesseract project does not have a built-in GUI application; if you need a GUI, pick one of the several third-party tools available, such as VietOCR, OCR2Text, dpScreenOCR, or NeOCR.

Developers can use the Tesseract API to build their own applications under the terms of its license. To use Tesseract from other programming languages, you need a Tesseract wrapper. I am going to bind Tesseract with Java in my example, so I use Tess4J as my JNA wrapper.

Latest Release — Tesseract 4.1.1

With the move from version 3 to 4.0x+, Tesseract added a new OCR engine based on neural networks for text recognition. It works well on x86/Linux, improves accuracy and speed, and can produce output in several formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, and TSV.

The latest version of the Tesseract source code is available on GitHub.

Tesseract Wrapper — Tess4J

Tess4J is a Java JNA wrapper for the Tesseract OCR API. It is released under the Apache License 2.0 and is also available from SourceForge. The library provides OCR support for TIFF, JPEG, GIF, PNG, and BMP images, multi-page TIFF images, and PDF documents.

Tess4J Documentation — Link to Tess4J v4.4 which I am going to use in my example.

Tesseract Competitors

The first competitor comes from ABBYY, where it was developed in 1993.

Features:

  1. Supports more than 190 languages
  2. Outputs to MS Office, RTF, HTML, searchable PDF, and plain text
  3. Can integrate through Java SDK
  4. Custom training supported
  5. Accuracy 0.9 out of 10

The second competitor is the Google Cloud Vision API, launched by Google in 2016.

Features:

  1. Supports more than 50 languages
  2. Google Cloud Vision API returns its output as JSON
  3. Can integrate through REST API, and can integrate with Google Images and Google SafeSearch
  4. Custom training not supported
  5. Accuracy 0.8 out of 10

Set Up a Linux Environment for Tesseract

There are two steps to set up the environment: first install the Tesseract OCR engine, then install the trained data files.

Tesseract Installation

The Tesseract OCR engine is available as a distribution package, so on Ubuntu Focal 20.04 you can install Tesseract 4.1.1 and its developer tools from the terminal.

Install Tesseract OCR Engine

We also need to install the command-line program that works with the Tesseract OCR engine.

Install Command Line Program for Tesseract

Then we need to install the Tesseract built binaries (supported languages and scripts), which are available for Linux distributions through snap. If snapd is not installed yet, run the command below before installing the Tesseract built binaries.

Install Snapd

Then run

Install Tesseract Built Binaries

Other installation methods are available for other operating systems.

Traineddata Installation

Tesseract 4 includes a new neural-network-based recognition engine that delivers significantly higher accuracy but requires significantly more training data. Training it yourself normally takes from a few days to a couple of weeks, so I am going to use an existing traineddata set, trained on about 400,000 text lines spanning about 4,500 fonts.

There are three sets of traineddata files compatible with Tesseract 4.0x+. For my installation I chose the set that also supports the legacy recognizer. It is also fast and gives good accuracy. Below are links to the traineddata for a few languages.

English:

Tamil:

Sinhala:

Traineddata sets for other languages are available as well.

We have just finished the installation part!

Let’s create a simple Java Project in IntelliJ IDEA

Create a new Maven-based project and add an empty class. After creating the class, my folder structure looks like this.

[Image: Creating a project based on Maven]

[Image: Folder structure after creating the project]

Add the dependency below under the dependencies element of your project's pom.xml file so that Maven pulls in Tess4J.

Add Maven Dependency to pom.xml

After adding the above dependency, my pom.xml file looks like this.

[Image: Modified pom.xml]

Once we have an empty class, we can start adding some code to it. Here is the import statement needed to instantiate the Tesseract object (the class net.sourceforge.tess4j.Tesseract).

Import Tess4J for our Project File

We also need to import the exception class (net.sourceforge.tess4j.TesseractException) so that we can handle errors raised while recognizing the text.

Import Tess4J Exception

I am going to run recognition on files stored on my local computer, so I also need the import below (java.io.File) to handle files.

Import Java File Lib

Now, my IntelliJ IDEA view is

[Image: After all packages are imported]

I will define a static method inside the class and, inside this method, create a new instance of Tesseract from the Maven library.

New Instance of Tesseract from the Maven Library

Next, I will tell this instance where the training data for the library can be found. As I mentioned earlier, I have downloaded the traineddata for English and stored it on my Desktop. So add the two lines of code below after the instance creation inside the method.

Note: The path will change according to the directory of your traineddata file.

Set Traineddata Path
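
A wrong data path is an easy mistake to make at this step, so it can help to check the folder before handing it to Tess4J (whose Tesseract class takes the folder via setDatapath and the language code via setLanguage). The following is a minimal sketch using only the JDK. TessdataLocator and requireTraineddata are hypothetical names that are not part of Tess4J or of the original project.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: verifies a traineddata folder before it is handed to Tess4J. */
final class TessdataLocator {
    private TessdataLocator() {}

    /**
     * Returns the path of "<language>.traineddata" inside dataDir.
     * Throws IllegalArgumentException if the folder or the file is missing.
     */
    static Path requireTraineddata(Path dataDir, String language) {
        if (!Files.isDirectory(dataDir)) {
            throw new IllegalArgumentException("Not a directory: " + dataDir);
        }
        // Tesseract looks for one file per language code, e.g. eng.traineddata.
        Path file = dataDir.resolve(language + ".traineddata");
        if (!Files.isRegularFile(file)) {
            throw new IllegalArgumentException("Missing traineddata file: " + file);
        }
        return file;
    }
}
```

Calling requireTraineddata with your own folder and the code "eng" just before the two configuration lines turns a confusing native error into a readable message.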

Finally, return the instance. Now my screen will be:

[Image: Return plain text]

The above method returns the text from the resource as plain text. If you want the result as HTML, you need to tell Tesseract to produce its output in the hOCR (HTML) format. hOCR is a simple XML-based format.

We can switch to hOCR output by adding the statement below above the return statement.

Add hOCR Output Method

[Image: Return hOCR format]
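
Once the output is hOCR, each recognized word typically arrives as a span with class ocrx_word whose title attribute carries a bounding box. As an illustration of what can be done with that string, here is a rough sketch that extracts words and boxes with a regular expression. It assumes Tesseract's usual attribute order (class before title) and is not a full HTML parser. HocrWords and Word are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pulls words and bounding boxes out of an hOCR string. */
final class HocrWords {
    private HocrWords() {}

    /** One recognized word and its bounding box in image pixels. */
    static final class Word {
        final String text;
        final int left, top, right, bottom;

        Word(String text, int left, int top, int right, int bottom) {
            this.text = text;
            this.left = left;
            this.top = top;
            this.right = right;
            this.bottom = bottom;
        }
    }

    // Assumption: the class attribute appears before the title attribute.
    private static final Pattern WORD = Pattern.compile(
            "<span[^>]*class=['\"]ocrx_word['\"][^>]*title=['\"]bbox (\\d+) (\\d+) (\\d+) (\\d+)[^'\"]*['\"][^>]*>(.*?)</span>",
            Pattern.DOTALL);

    static List<Word> parse(String hocr) {
        List<Word> words = new ArrayList<>();
        Matcher m = WORD.matcher(hocr);
        while (m.find()) {
            // Drop nested markup such as <strong> or <em>; entities like &amp; stay as-is.
            String text = m.group(5).replaceAll("<[^>]+>", "").trim();
            words.add(new Word(text,
                    Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)),
                    Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4))));
        }
        return words;
    }
}
```

For anything beyond a quick experiment, a real HTML parser would be a sturdier choice than a regular expression.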

Finally, I put the main function below to make it usable and call the static method inside it.

Tesseract Main Method

Now we have to give Tesseract a file that it can parse and read text from. In this example, I am going to try an image that is stored on my Ubuntu Desktop.

So, once we know the input file path, we create a new File instance with that path, pass the file to Tesseract for recognition, and print the result with a Java print statement. Finally, your code view will be:

[Image: With the main function]
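
The flow just described (build a File, hand it to the engine, print the text) can be sketched without the library on the classpath by putting the single recognition call behind a small interface. This is only an illustration. OcrRunner and OcrEngine are hypothetical names, and with Tess4J present the engine could be supplied as a lambda that calls doOCR(File) on the Tesseract instance.

```java
import java.io.File;

/** Hypothetical wrapper around the "file in, text out" step of the example. */
final class OcrRunner {
    private OcrRunner() {}

    /** Stand-in for the one recognition call, so the sketch compiles with the JDK alone. */
    interface OcrEngine {
        String recognize(File image) throws Exception;
    }

    /** Validates the input file, runs recognition, and returns the trimmed text. */
    static String run(OcrEngine engine, File image) {
        if (!image.isFile() || !image.canRead()) {
            throw new IllegalArgumentException("Cannot read input file: " + image);
        }
        try {
            return engine.recognize(image).trim();
        } catch (Exception e) {
            // Wrap checked exceptions (for Tess4J, TesseractException) with the file name.
            throw new IllegalStateException("OCR failed for " + image, e);
        }
    }
}
```

Checking the file first keeps a mistyped path from surfacing as an OCR error.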

Even though I set everything up correctly, IntelliJ IDEA still shows some errors in my files. This is nothing serious: we just have to reload the project so that all dependency packages and sources are loaded.

Right-click on

Now, the problem is solved.

[Image: No errors after reloading the project]

If we look closely, we did not have to do much. That is the power of the wrapper we are using for the Tesseract library. Now we are ready to run our sample OCR program.

Note: Confirm that your project is set to a recent version of Java, such as Java 11. If you get an error regarding the Java version, you can simply set it to the latest version.

.

You will see the dialog box shown below. Make sure that the Java version under the Project and Modules sections is set to a recent version of Java, such as Java 11.

[Image: Project]

[Image: Module]

Also check the Java Compiler setting by pressing Ctrl + Alt + S. A dialog box will pop up; navigate to Java Compiler as shown below and check that the target bytecode version for your project is set higher than 1.5.

[Image: Target bytecode]
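
If you are unsure which Java version the IDE actually runs your program with, a tiny check like the one below can confirm it. This is a minimal sketch using only the standard java.specification.version system property. JavaVersionCheck and majorVersion are hypothetical names.

```java
/** Hypothetical helper: reports the major Java version of the running JVM. */
final class JavaVersionCheck {
    private JavaVersionCheck() {}

    /** Converts a specification version string to a major number: "1.8" gives 8, "11" gives 11. */
    static int majorVersion(String specVersion) {
        String[] parts = specVersion.split("\\.");
        int first = Integer.parseInt(parts[0]);
        // Versions before Java 9 were written as 1.x.
        return (first == 1 && parts.length > 1) ? Integer.parseInt(parts[1]) : first;
    }

    public static void main(String[] args) {
        int major = majorVersion(System.getProperty("java.specification.version"));
        System.out.println("Running on Java " + major);
    }
}
```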

Fine. Now it is time to take our sample application for a test drive, so I am going to use Google's logo as my sample input.

Note: If you run into poor results with PNG images, try JPG/JPEG images instead. PNG compression itself is lossless, so such problems may come from other properties of the file, such as transparency (an alpha channel), rather than from the compression.
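
If a PNG does give poor results, one thing worth ruling out is transparency. The sketch below flattens an image onto a white background using only the JDK, producing an opaque RGB image. It rests on the assumption that an alpha channel is what causes the trouble. AlphaFlattener and flattenOnWhite are hypothetical names.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

/** Hypothetical helper: removes transparency by painting the image over white. */
final class AlphaFlattener {
    private AlphaFlattener() {}

    static BufferedImage flattenOnWhite(BufferedImage src) {
        // TYPE_INT_RGB has no alpha channel, so the result is always opaque.
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, out.getWidth(), out.getHeight());
            // Transparent source pixels leave the white background visible.
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return out;
    }
}
```

The source image can be loaded with ImageIO.read, and the flattened result can be written back with ImageIO.write or passed on as a BufferedImage.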

[Image: Sample input, source: https://www.putnamlib.org/images/google.jpg/@@images/image.jpeg]

When I run the project, I get this simple output:

[Image: Output]

For your reference, all the source code of the sample application is available on my .

Limitations of Tesseract

Tesseract works well only on clean foreground text. For instance, if an image has noise in the background, the OCR result will not be accurate. There are many reasons for poor-quality output from Tesseract. Images that meet minimum requirements for size, contrast, and lighting are recognized better. Lower-quality images require preprocessing to improve the recognition results: scale the image appropriately, increase the contrast as much as possible, and align the text horizontally. Apart from the following limitations, Tesseract is a powerful OCR engine.

  • Lower-quality input images may produce poor-quality OCR results.
  • The results contain no information about the font family of the text.
  • It does not work well with files that have artifacts such as partial occlusion, distorted perspective, or a complex background.
  • It is poor at recognizing handwritten text.
  • When the input file contains languages other than the ones configured for recognition, the results are poor.
  • Two-column documents often fail. Tesseract is not always good at recognizing the natural reading order of the input, so the resulting text may read as if it were joined across columns.
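
As one possible illustration of the scaling and contrast steps mentioned above, the sketch below upscales an image to grayscale and then applies a fixed threshold, using only the JDK. The scale factor and threshold are arbitrary values that you pass in, not tuned recommendations, and OcrPreprocessor is a hypothetical name.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

/** Hypothetical helper: two simple preprocessing steps for OCR input. */
final class OcrPreprocessor {
    private OcrPreprocessor() {}

    /** Scales the image by the given factor and converts it to 8-bit grayscale. */
    static BufferedImage scaleToGray(BufferedImage src, double factor) {
        int w = Math.max(1, (int) Math.round(src.getWidth() * factor));
        int h = Math.max(1, (int) Math.round(src.getHeight() * factor));
        BufferedImage gray = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return gray;
    }

    /** Maps gray levels at or above the threshold to white and the rest to black. */
    static BufferedImage binarize(BufferedImage gray, int threshold) {
        if (gray.getType() != BufferedImage.TYPE_BYTE_GRAY) {
            throw new IllegalArgumentException("Expected a TYPE_BYTE_GRAY image");
        }
        int w = gray.getWidth();
        int h = gray.getHeight();
        // In the default TYPE_BYTE_BINARY palette, sample 0 is black and 1 is white.
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int level = gray.getRaster().getSample(x, y, 0);
                out.getRaster().setSample(x, y, 0, level >= threshold ? 1 : 0);
            }
        }
        return out;
    }
}
```

A fixed global threshold is the simplest option. Unevenly lit scans usually need an adaptive method instead.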

In this article, we covered what OCR is and how Tesseract and its Tess4J wrapper fit together, and we built a very simple OCR program that reads text from files in various formats, such as PDFs and image files (with the PNG caveat noted earlier).

In addition, we have to pre-process the input file before recognition to overcome these limitations and get better results. So, I will write about Tesseract pre-processing in my next article.
