Creating an OCR microservice using Tesseract, PDFBox and Docker

Stefano Nassi
Published in gft-engineering
7 min read · Jan 9, 2020
Photo by Agence Olloweb on Unsplash

In this tutorial, we are going to build an OCR (Optical Character Recognition) microservice that extracts text from a PDF document. To achieve this goal, we are going to integrate some open source tools and at the end, we are going to build a Docker image ready for deployment.

This is what we are going to do:

  • A brief introduction to the technology stack used
  • Develop a Spring Boot microservice
  • Integrate the OCR logic
  • Create a Docker image of the microservice in a Dev environment

Introduction

When we talk about PDF documents, we need to distinguish between two different types:

  • Searchable PDF
  • Scanned PDF

A Searchable PDF is a document created by PDF printer software (e.g. Ghostscript). For this type of document, extracting text is easy because the document already contains text that machines can read.

A Scanned PDF, by contrast, is a document created by a physical scanner. This type of document contains images, one per page of the document, and the text is inside those images. Therefore, to extract text from this kind of document, we need OCR software.

Architecture Overview

The image below shows what we are going to build.

However, before we start coding, let’s take a look at the technology stack used:

Spring Boot

From the spring.io website:

“Spring Boot makes it easy to create stand-alone, production-grade Spring based applications that you can “just run”.”

That’s it… Spring Boot is an open source Java-based framework used to create microservices. The “magic” in the framework is that the generated jar embeds a server instance (Tomcat by default), so your application can run standalone, without being deployed to a separately installed application server.

PDFBox

Apache PDFBox is an open source Java library for working with PDF documents. It can create, render, print and split PDF files, among other things. In this tutorial we will use two of these features:

  • Extract text from a searchable PDF document
  • Extract images from a scanned PDF document

Apache PDFBox is published under the Apache License v2.0.

Tesseract

First of all, let’s talk about “Tesseract”. If you are a Marvel addict, your mind will go to the cube that houses the Space Stone, but today we would like to talk about an interesting open source project (under the terms of the Apache License 2.0) called Tesseract OCR.

Tesseract OCR is a component that can be used to extract text from images.

The Tesseract project was born in the Hewlett-Packard laboratories in the 1980s, and Google began sponsoring its development in 2006.

https://github.com/tesseract-ocr/tesseract

Tesseract is packaged in the repositories of almost all Linux distributions. On Debian or Ubuntu, for example, you can install it simply by running:

sudo apt install tesseract-ocr

If you want to try the latest development version, you can build it yourself from the sources on GitHub.

The Tesseract project supports about 130 languages out of the box.

Finally… let’s start coding…

Microservice Development

Let’s write our microservice using the Spring Boot framework. In this tutorial we use Eclipse as the IDE, so make sure that you have Spring Tools and Maven installed from the Eclipse Marketplace.

Now, we can start coding…

1. From the File menu, select New->other->wizard->Spring Starter Project

2. Fill the project attributes as illustrated below and click the Finish button

3. The wizard will create the project structure for you, and now we have a simple maven project with the skeleton of our Spring Boot application.

4. The most important generated file is SimpleOcrMicroserviceApplication, which is created by the SpringTools plugin:

This is the Spring Boot main class. Our Spring Boot Microservice application will load via this class.

(For more information about Spring Boot application structure, take a look at the official documentation https://spring.io/guides/gs/spring-boot/)

5. The next step is to create the Controller class. A controller is an annotated class that Spring Boot exposes as a REST API, so it can handle HTTP requests. Controller classes are easily identified by the @RestController annotation.

6. Before creating the controller class, we need to add some dependencies to our pom.xml file:

spring-boot-starter-web is the starter that brings into the project all the packages and classes needed to build a web application (such as a REST API). We will also use org.json to handle JSON content in our REST API.
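As a rough sketch, the two dependencies could be declared as follows. The org.json version shown is an assumption (any recent release should do), while spring-boot-starter-web normally inherits its version from the Spring Boot parent POM that the wizard generates.

```xml
<!-- Version managed by the Spring Boot parent POM -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
</dependency>

<!-- Assumed version; pick the release you prefer -->
<dependency>
    <groupId>org.json</groupId>
    <artifactId>json</artifactId>
    <version>20190722</version>
</dependency>
```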

7. We can now create a new class as below:

The class has a @RestController annotation that identifies it as a REST API controller.

The extractTextFromPDFFile method has a @PostMapping annotation. This annotation indicates that the API will be exposed as a POST method, and its endpoint will be:

http://localhost:8080/api/pdf/extractText

Notice also that the method argument has a @RequestParam annotation, which tells Spring the name of the request parameter to bind to that argument. For our purpose, we expect a multipart file upload (a MultipartFile in Spring).

Finally, the method creates a JSONObject and puts it into the response object.

This is the skeleton of our microservice. Now we need to add business logic to reach our goal.

Implement business logic

We want to build this flow:

In the previous paragraph, we completed the first step of the flow, receiving the PDF file from an API request.

The next step is to check whether the PDF is searchable. The PDFBox library will help us here, so add its dependency to the pom file.
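A possible declaration is sketched below. The version is an assumption: any 2.x release should match this tutorial, since the static PDDocument.load method used here belongs to the 2.x API (PDFBox 3.x moved loading to a separate Loader class).

```xml
<!-- Assumed version; a 2.x release is needed for PDDocument.load -->
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.17</version>
</dependency>
```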

And modify our method as below

We used the static PDDocument.load method to create a PDDocument instance from the raw PDF bytes, and then a PDFTextStripper instance to extract the text from it.

If the PDFTextStripper instance is able to extract text from the document, we can build the JSON object for the response and exit. Otherwise, if the extracted text is empty, it means that we don’t have a searchable PDF on hand… simple.

So, if we are working with a PDF document made of scanned images, we need to extract them from the file in order to process them with the OCR library. Again PDFBox helps us, so we start implementing the extractTextFromScannedDocument method.
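The decision logic described above can be sketched independently of the libraries. In the snippet below the class and method names are hypothetical, the stripped text is passed in as a plain String, and the OCR step is represented by a Supplier that is only invoked when the text layer is empty. In the real service that supplier would wrap the PDFBox rendering and the Tesseract call.

```java
import java.util.function.Supplier;

/**
 * Sketch of the extraction flow: use the embedded text when there is one,
 * and fall back to OCR only when the PDF has no usable text layer.
 * Names are hypothetical; no PDFBox or Tesseract call is made here.
 */
final class ExtractionFlow {

    private ExtractionFlow() {
    }

    /** A PDF is treated as searchable when the stripped text is not blank. */
    static boolean isSearchable(String strippedText) {
        return strippedText != null && !strippedText.trim().isEmpty();
    }

    /**
     * Returns the stripped text for a searchable PDF, otherwise the result
     * of the OCR fallback. The fallback runs only when it is needed.
     */
    static String extractText(String strippedText, Supplier<String> ocrFallback) {
        return isSearchable(strippedText) ? strippedText : ocrFallback.get();
    }
}
```

Keeping the fallback lazy matters because rendering pages and running OCR is by far the most expensive part of the request.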

The PDFRenderer class can be used to render each page of the PDF document as an image. The code is quite simple, but you need to pay attention to the renderImageWithDPI method.

The DPI parameter of this method sets the resolution, and therefore the quality, of the image rendered from the document. A higher value improves the accuracy of the OCR process, but generating higher quality images makes the process slower, so it is necessary to find a balance point.
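To get a feel for the trade-off, it helps to estimate how large the rendered image becomes. PDF page sizes are expressed in points (1/72 inch), so rendering at a given DPI scales each side by dpi / 72. The helper below is a hypothetical illustration of that arithmetic and does not call PDFBox.

```java
/**
 * Estimates the size of a page rendered at a given DPI.
 * PDF page sizes are expressed in points (1 point = 1/72 inch),
 * so the scale factor applied to each side is dpi / 72.
 */
final class RenderSizeEstimator {

    private static final double POINTS_PER_INCH = 72.0;

    private RenderSizeEstimator() {
    }

    /** Converts a length in points to pixels at the given DPI, rounded down. */
    static int toPixels(double points, int dpi) {
        return (int) Math.floor(points * dpi / POINTS_PER_INCH);
    }

    /** Number of pixels in a page of the given size rendered at the given DPI. */
    static long pixelCount(double widthPoints, double heightPoints, int dpi) {
        return (long) toPixels(widthPoints, dpi) * toPixels(heightPoints, dpi);
    }
}
```

For an A4 page of roughly 595 x 842 points, this gives about 2479 x 3508 pixels at 300 DPI, around 17 times as many pixels as at 72 DPI, which is why high DPI values slow down both rendering and OCR.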

Now we can add the OCR processing using the Tesseract library, so add this dependency to the pom file:
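A possible declaration, with the version given only as an assumption (check Maven Central for the release that matches the Tesseract version installed on your system):

```xml
<!-- Assumed version; align it with your installed Tesseract -->
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.4.1</version>
</dependency>
```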

Tess4j is a library that wraps the calls to the core Tesseract library.

Modify the method to integrate Tesseract:

We created a Tesseract object and set some parameters in order to tell it where the data files are (we will look at this path more closely in the next section) and, most importantly, the language that we want to use. Finally, the doOCR method does the magic.

Packaging and deploying

The last goal of this tutorial is packaging the microservice in a Docker container in order to deploy it wherever you want.

So create an empty Dockerfile and put it in the project root folder.

The Dockerfile should be like this:
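A minimal sketch of such a Dockerfile follows. The base image tag and the jar location under target/ are assumptions; the Tesseract-related lines are the ones discussed in the rest of this section.

```dockerfile
# Assumption: an Alpine-based OpenJDK 8 image; any Alpine JDK image should work
FROM openjdk:8-jdk-alpine

# Install Tesseract from the Alpine package repository
RUN apk add --no-cache tesseract-ocr

# Create the folder for language data and download the Italian model
RUN mkdir -p /usr/share/tessdata
ADD https://github.com/tesseract-ocr/tessdata/raw/master/ita.traineddata /usr/share/tessdata/ita.traineddata

# Assumption: Maven produced a single jar under target/
COPY target/*.jar app.jar
ENTRYPOINT ["java", "-jar", "/app.jar"]
```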

We started from an Alpine-based Docker image in order to obtain a lightweight image.

The next step is crucial because we will install the Tesseract application on the OS. The first command is a classic install from Linux package repository. You have to pay attention to the next step.

Tesseract OCR uses external data files to define the language of the OCR process. Many languages are supported; you can find a list here:

https://github.com/tesseract-ocr/tessdata_best

Since version 4.0.0, you can choose between two language models for each language:

  • Fast: Faster, but less accurate
  • Best: Very accurate, but slower

Go back to the Dockerfile and look at the following lines:

RUN mkdir -p /usr/share/tessdata
ADD https://github.com/tesseract-ocr/tessdata/raw/master/ita.traineddata /usr/share/tessdata/ita.traineddata

We create a folder to store language files and download one or more into it. In fact you can install multiple languages and choose programmatically which one to use.

In the Java class, we set the Tesseract object with the data path according to the path where Docker downloaded the file, and the language to use.

ITesseract _tesseract = new Tesseract();
_tesseract.setDatapath("/usr/share/tessdata/");
_tesseract.setLanguage("ita"); // choose your language
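Because several language files can sit side by side in the data folder, the service can check which ones are actually installed before calling setLanguage. The sketch below uses only the JDK; the class and method names are hypothetical and not part of Tess4J.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Lists the language codes for which a .traineddata file is installed. */
final class TessdataLanguages {

    private static final String SUFFIX = ".traineddata";

    private TessdataLanguages() {
    }

    /** Returns the sorted language codes found in the given tessdata folder. */
    static List<String> installed(Path tessdataDir) throws IOException {
        try (Stream<Path> files = Files.list(tessdataDir)) {
            return files
                    .map(p -> p.getFileName().toString())
                    .filter(name -> name.endsWith(SUFFIX))
                    .map(name -> name.substring(0, name.length() - SUFFIX.length()))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    /** Picks the preferred language if it is installed, otherwise the fallback. */
    static String choose(List<String> installed, String preferred, String fallback) {
        return installed.contains(preferred) ? preferred : fallback;
    }
}
```

With the image built in this tutorial, listing /usr/share/tessdata would be expected to report the ita code, and the chosen value can then be passed to setLanguage.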

We won’t analyse the rest of the Dockerfile because it contains just the standard commands to put a Spring Boot application in a container.

Finally we can build the Docker image:

docker build -t nassiesse/simple-java-ocr .

and run it:

docker run -t -i -p 8080:8080 nassiesse/simple-java-ocr

Test the microservice

You can test the microservice with a POST call to the following endpoint uploading a PDF file:

http://localhost:8080/api/pdf/extractText

You can use Postman, SoapUI or the Rest client you prefer.
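If you would rather script the call than use a GUI client, a small JDK-only client can build the multipart request by hand. This is a sketch under two assumptions: Java 11 or later (for java.net.http) and a form field named file, which must match the name given in the controller's @RequestParam annotation. The class name OcrClient is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Minimal multipart/form-data client for the extractText endpoint (Java 11+). */
final class OcrClient {

    // Assumption: the controller reads the upload from a form field named "file".
    static final String FIELD_NAME = "file";

    private OcrClient() {
    }

    /** Builds a multipart/form-data body containing a single PDF part. */
    static byte[] multipartBody(String boundary, String fileName, byte[] pdfBytes) throws IOException {
        String header = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + FIELD_NAME
                + "\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n";
        String footer = "\r\n--" + boundary + "--\r\n";
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        body.write(header.getBytes(StandardCharsets.UTF_8));
        body.write(pdfBytes);
        body.write(footer.getBytes(StandardCharsets.UTF_8));
        return body.toByteArray();
    }

    /** Posts the PDF to the endpoint and returns the raw response body. */
    static String extractText(URI endpoint, Path pdf) throws IOException, InterruptedException {
        String boundary = "----ocr-" + System.nanoTime();
        byte[] body = multipartBody(boundary, pdf.getFileName().toString(), Files.readAllBytes(pdf));
        HttpRequest request = HttpRequest.newBuilder(endpoint)
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

With the container running, a call such as OcrClient.extractText(URI.create("http://localhost:8080/api/pdf/extractText"), Path.of("scan.pdf")) would return the JSON produced by the controller, where scan.pdf stands for any local PDF file.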

That’s all folks!

Full code on GitHub

You can find the sample project on my Git:

Conclusion

In this tutorial we created a very simple OCR project. This is very far from being fully functional, but it might be a starting point for more complex projects.

References

https://github.com/tesseract-ocr/tesseract

https://pdfbox.apache.org/

https://www.docker.com/

https://spring.io/blog/2015/07/14/microservices-with-spring
