What is OCR?
Before studying Tesseract, we need to know what OCR (Optical Character Recognition) is. In simple terms, OCR is the process of extracting text, as plain text or hOCR, from a document such as an image (JPG, JPEG, TIFF) or a PDF. An OCR engine first locates the text in the file, then recognizes the individual characters and returns them. This is how OCR helps machines identify the text in a given file.
Introduction to Tesseract
Tesseract is an open-source text recognition (OCR) engine written in C/C++. It works on Windows, macOS, and Linux and is released under the Apache 2.0 License. It was originally developed at Hewlett-Packard starting in 1985 and was released as open source in 2005. Google sponsored its development and maintenance from 2006.
It can be used directly from the command line or through an API to extract printed text from files, has Unicode (UTF-8) support, and recognizes more than 100 languages, which you can refer to here. The Tesseract project does not include a built-in GUI application; if you need a GUI, choose one of the available third-party tools such as VietOCR, OCR2Text, dpScreenOCR, or NeOCR. You can read more about the third-party tools here.
Developers can use the Tesseract API to build their own applications under the terms of its license. To use Tesseract from other programming languages, you need a Tesseract wrapper. In my example I bind Tesseract to Java, so I use Tess4J, a JNA-based wrapper.
Latest Release — Tesseract 4.1.1
With the move from version 3 to 4.0x+, Tesseract added a new OCR engine based on LSTM neural networks. It works better on x86/Linux, offers improved accuracy with higher speed, and can produce output in several formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, and TSV.
The latest version of the Tesseract source code is available from the master branch on GitHub.
Tesseract Wrapper — Tess4J
Tess4J is a Java JNA wrapper for the Tesseract OCR API. It is released under the Apache 2.0 License and is available from SourceForge and from the Maven Central Repository. The library provides OCR support for TIFF, JPEG, GIF, PNG, and BMP images, multi-page TIFF images, and PDF documents.
See the Tess4J documentation for Tess4J v4.4, which is the version I am going to use in my example.
ABBYY FineReader
- Developed at ABBYY in 1993
- Supports more than 190 languages
- Outputs to MS Office, RTF, HTML, searchable PDF, and plain text
- Can integrate through Java SDK
- Custom training supported
- Accuracy 0.9 out of 10
Google Cloud Vision API
- Launched in 2016 by Google
- Supports more than 50 languages
- Outputs results as JSON
- Can integrate through REST API, and can integrate with Google Images and Google SafeSearch
- Custom training not supported
- Accuracy 0.8 out of 10
Set Up a Linux Environment for Tesseract
There are two steps to set up the environment: installing the Tesseract OCR engine and installing the trained data files.
The Tesseract OCR engine package is generally called “tesseract-ocr”. On Ubuntu Focal 20.04, you can install Tesseract 4.1.1 and its developer tools from the terminal.
We also need to install the development package “libtesseract-dev”, which provides the library files needed to build programs against the Tesseract OCR engine.
Then we need to install the Tesseract pre-built binaries (supported languages and scripts), which are available for Linux distributions through snapd. If snapd is not installed yet, you have to install it first.
You can find other installation methods for various operating systems here.
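Since the rest of this article works from Java, one way to confirm the installation is to ask the tesseract binary for its version from a small Java helper. This is only a sketch: the class and method names are my own, and it assumes the binary is on the PATH under the name tesseract.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper (not part of Tesseract or Tess4J).
final class TesseractVersionCheck {

    /**
     * Runs the given command and returns the first line it prints,
     * or Optional.empty() if the program cannot be started or prints nothing.
     */
    static Optional<String> firstOutputLine(String... command) {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Merge stderr into stdout, because some versions print the banner to stderr.
        builder.redirectErrorStream(true);
        try {
            Process process = builder.start();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                String first = reader.readLine();
                // Drain the remaining output so the process can exit cleanly.
                while (reader.readLine() != null) {
                    // ignore
                }
                process.waitFor();
                return Optional.ofNullable(first);
            }
        } catch (IOException e) {
            // Typically means the binary was not found or is not executable.
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }
}
```

Calling TesseractVersionCheck.firstOutputLine("tesseract", "--version") should return a line naming the installed version if the installation succeeded, and an empty Optional if the binary cannot be found.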
Tesseract 4 contains a new neural-network-based recognition engine that delivers significantly higher accuracy but requires significantly more training data. Training it yourself normally takes a few days to a couple of weeks, so I am going to use existing traineddata that was trained on about 400,000 text lines spanning about 4,500 fonts.
There are three sets of traineddata files compatible with Tesseract 4.0x+: tessdata, tessdata_best, and tessdata_fast. I chose tessdata for my installation because it is the only set that also supports the legacy recognizer. Of the other two, tessdata_fast is tuned for speed and tessdata_best for accuracy. Below I provide some language links to get their traineddata.
You can also get traineddata for other languages from here.
We have just finished the installation part!
Let’s create a simple Java Project in IntelliJ IDEA
Create a new Maven-based project and add an empty class. After creating the class, my folder structure looks like this.
Modify your project’s pom.xml file to add the Tess4J dependency (groupId net.sourceforge.tess4j, artifactId tess4j) under the dependencies element, so that Maven can download the library.
After adding the dependency, my pom.xml file looks like this.
Once we have an empty class, we can start adding some code to it. The first import is the Tesseract class (net.sourceforge.tess4j.Tesseract), which we will instantiate.
We also need to import TesseractException (net.sourceforge.tess4j.TesseractException) to handle exceptions raised while recognizing the text.
I am going to recognize files that are stored on my local computer, so I also need to import java.io.File to handle files.
Now, my IntelliJ IDEA view is:
I will define a static method inside the class and, inside this method, create a new instance of Tesseract from the Tess4J library.
Next, I tell this instance where the traineddata can be found. As I mentioned earlier, I downloaded the traineddata for English and stored it on my Desktop. So I add two lines after the instance creation inside the method: a setDatapath call with the traineddata directory and a setLanguage call with the language code (eng for English).
Note: The path depends on the directory where your traineddata file is stored.
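A wrong data path is an easy mistake to make at this step, so it can help to confirm that the traineddata file for each requested language is really in the directory before configuring Tesseract. The sketch below uses only the Java standard library. The class and method names are hypothetical, and it assumes the usual file naming of one <language>.traineddata file per language, with several languages joined by "+" (for example eng+deu).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Tess4J).
final class TrainedDataCheck {

    /**
     * Returns the language codes from a spec such as "eng" or "eng+deu"
     * for which no <code>.traineddata file exists in dataDir.
     */
    static List<String> missingLanguages(Path dataDir, String languages) {
        List<String> missing = new ArrayList<>();
        for (String language : languages.split("\\+")) {
            Path file = dataDir.resolve(language + ".traineddata");
            if (!Files.isRegularFile(file)) {
                missing.add(language);
            }
        }
        return missing;
    }
}
```

An empty list means every requested language has its file in place; anything else names the languages whose traineddata still needs to be downloaded or moved.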
Finally, return the instance. Now my screen will be:
With this configuration, Tesseract returns the recognized text as plain text. If you want the result as HTML, you need to tell Tesseract to produce its output in the hOCR format. Basically, hOCR is a simple format that embeds the OCR results in HTML (XHTML) markup.
We can switch to hOCR output by adding a statement such as a setHocr(true) call on the instance above the return statement.
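Because hOCR output is markup, it can be read with an ordinary XML parser. The following is a minimal sketch, assuming the output is well-formed XHTML and that words are wrapped in span elements with the class ocrx_word, as Tesseract's hOCR output does. The class and method names are my own.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of Tess4J).
final class HocrWords {

    /** Returns the text of every span whose class attribute is "ocrx_word". */
    static List<String> extractWords(String hocr) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Real hOCR output usually declares an XHTML DOCTYPE; do not fetch its DTD.
        factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(
                new ByteArrayInputStream(hocr.getBytes(StandardCharsets.UTF_8)));

        List<String> words = new ArrayList<>();
        NodeList spans = document.getElementsByTagName("span");
        for (int i = 0; i < spans.getLength(); i++) {
            Element span = (Element) spans.item(i);
            if ("ocrx_word".equals(span.getAttribute("class"))) {
                words.add(span.getTextContent().trim());
            }
        }
        return words;
    }
}
```

Each word span also carries a title attribute with its bounding box and confidence, which is the main reason to prefer hOCR over plain text when you need to know where the text sits on the page.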
Finally, I add a main method below it and call the static method from there.
Now we have to give Tesseract a file that it can parse and read text from. In this example, I am going to try an image that is stored on my Ubuntu Desktop.
So, knowing the input file path, I create a new File instance for it, pass the file to Tesseract for recognition through the doOCR method, and print the result with a Java print statement. Finally, your code view will be:
Even though I have set everything up correctly, IntelliJ IDEA still shows some errors in my files. This only means that we have to reload the project so that all dependency packages and sources are loaded.
Right-click on pom.xml file -> Maven -> Reload Project
Now, the problem is solved.
If we look closely, we did not have to do much. That is the power of the Tess4J wrapper for the Tesseract library. Now we are ready to run our sample OCR program.
Note: Make sure that your project is set to a recent version of Java, such as Java 11. If you get an error regarding the Java version, you can change it as follows.
Go to File -> Project Structure.
In the dialog box that opens, make sure that the Java version under the Project and Modules sections is set to a recent version of Java, such as Java 11.
Project Setting -> Project
Project Setting -> Modules
Also check the Java Compiler setting by pressing Ctrl + Alt + S. In the dialog box that pops up, navigate to Java Compiler as shown below and check that the target bytecode version for your project is set higher than 1.5.
Fine. Now it is time to take our sample application for a test drive, so I am going to use Google’s logo as my sample input.
Note: I recommend using JPG/JPEG images rather than PNG images for this example. PNG compression itself is lossless, so problems with PNG input usually come from other properties of the file, such as a transparent background (alpha channel), rather than from the compression.
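If a PNG has a transparent background, one possible workaround is to flatten it onto a white background before recognition. The sketch below does this with the standard java.awt classes; the class and method names are my own, and whether it improves the result depends on the image.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical helper (not part of Tess4J).
final class AlphaFlattener {

    /** Returns an opaque RGB copy of the source, drawn over a white background. */
    static BufferedImage flattenOnWhite(BufferedImage source) {
        BufferedImage rgb = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D graphics = rgb.createGraphics();
        try {
            // Paint the background first so transparent pixels become white.
            graphics.setColor(Color.WHITE);
            graphics.fillRect(0, 0, rgb.getWidth(), rgb.getHeight());
            graphics.drawImage(source, 0, 0, null);
        } finally {
            graphics.dispose();
        }
        return rgb;
    }
}
```

The source image can be loaded with ImageIO.read, and the flattened copy can be written back with ImageIO.write using the "jpg" format name, which needs an image without an alpha channel anyway.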
Once I run the project, I get this simple output:
For your reference, all the source code of the sample application is available on my GitHub.
Limitations of Tesseract
Tesseract works well only on clean foreground text. For instance, if we take an image that has noise in the background, we will not get an accurate result from the OCR test. There are many reasons for not getting good-quality output from Tesseract. An image that meets the minimum requirements for size, contrast, and lighting is recognized better. Lower-quality images require preprocessing to improve the recognition results, such as scaling the image appropriately, increasing the contrast as much as possible, and aligning the text horizontally. Apart from the following limitations, Tesseract is a powerful OCR engine.
- Low-quality input images may produce poor-quality OCR results.
- The results do not have information about the text font family.
- It doesn’t work well with files that have artifacts such as partial occlusion, distorted perspective, or a complex background.
- It is poor at recognizing images that contain handwritten text.
- When the input file contains languages other than those passed as an argument to the language setter, the results are of poor quality.
- Two-column documents often fail in the OCR test. Tesseract is not always good at recognizing the natural reading flow of the input file, so the resulting text can look like lines joined across columns.
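To make the contrast step of the preprocessing mentioned above concrete, here is a minimal sketch that turns an image into pure black and white with a fixed threshold. The class name, the method name, and the choice of a fixed threshold are my own assumptions. Tesseract also binarizes images internally, so treat this as an illustration of the idea, not as a required step.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper (not part of Tess4J).
final class SimpleBinarizer {

    /**
     * Returns a black-and-white copy of the source: pixels whose luminance
     * is at or above the threshold (0-255) become white, the rest black.
     */
    static BufferedImage binarize(BufferedImage source, int threshold) {
        BufferedImage result = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < source.getHeight(); y++) {
            for (int x = 0; x < source.getWidth(); x++) {
                int rgb = source.getRGB(x, y);
                int red = (rgb >> 16) & 0xFF;
                int green = (rgb >> 8) & 0xFF;
                int blue = rgb & 0xFF;
                // Integer approximation of perceived brightness.
                int luminance = (299 * red + 587 * green + 114 * blue) / 1000;
                result.setRGB(x, y, luminance >= threshold ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return result;
    }
}
```

A threshold of 128 is a reasonable starting point for dark text on a light background, but unevenly lit scans usually need an adaptive method instead of one global value.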
In this article, we have learned about Tesseract and Tess4J, and we have built a very simple OCR application on top of the Tesseract engine that reads text from files in various formats, such as PDFs and image files (with the PNG caveat noted above).
In addition, to overcome Tesseract's limitations and get better results, we have to pre-process the input file before recognition. So, I will write about Tesseract pre-processing in my next article.