Printed and handwritten text extraction from images using Tesseract and Google Cloud Vision API

Text extraction from image files is a useful technique for document digitization. There are several well-developed OCR engines for printed text extraction, such as Tesseract and EasyOCR [1]. However, handwritten text extraction is more challenging because of the large variations in handwriting from person to person. Tesseract and EasyOCR can't achieve satisfactory results unless the text is hand-printed. In this post, I will describe how to use Tesseract to extract printed text, and how to use the Google Cloud Vision API to extract handwritten text.

The example text image file is from the IAM Handwriting Dataset [2]. It has a printed text section and a handwritten section with the same text content.

The following major tools are used:

OpenCV: for finding structural lines in the image so that we can automatically break it into printed and handwritten segments
Google Cloud Vision API: for extracting text from the handwritten segment
Tesseract and pytesseract: for extracting text from the printed segment

Step 1. Page Segmentation.

We will use OpenCV to find the lines between sections and use the coordinates of those lines to break the image into segments. OpenCV has a function called getStructuringElement(). We can define a rectangular structuring element (MORPH_RECT) with a width of 200 pixels and a height of 1 pixel, and apply it with a morphological operation so that only horizontal lines remain.

We can then use the coordinates of the horizontal lines to break the image into segments, as shown below.
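Here is a minimal sketch of this step, assuming OpenCV 4.x; the input and output file names are placeholders, and you may need to tune the kernel size and thresholds for your own images.

```python
import cv2

# Load the full page and binarize it so the separator lines stand out
image = cv2.imread("iam_form.png")  # placeholder file name
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]

# A wide, flat rectangular kernel (200 x 1) keeps only horizontal structures
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (200, 1))
detected_lines = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)

# Each remaining contour corresponds to one horizontal separator line
contours, _ = cv2.findContours(detected_lines, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
line_ys = sorted(cv2.boundingRect(c)[1] for c in contours)

# Cut the page into segments between consecutive separator lines
boundaries = [0] + line_ys + [image.shape[0]]
for i in range(len(boundaries) - 1):
    top, bottom = boundaries[i], boundaries[i + 1]
    if bottom - top > 10:  # skip slivers around the lines themselves
        cv2.imwrite(f"segment_{i}.png", image[top:bottom, :])
```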

From the above segmentation results, we can see that the segments containing the printed and handwritten text we are interested in are segments 2 and 3.

Step 2. Extract Printed Text

In this step, we will use the Tesseract OCR engine to extract printed text from an image segment. If you don't already have Tesseract installed on your machine, you can download the installation file from here.

You will also need to install the pytesseract library in order to call the Tesseract engine from Python.

Now we can use Tesseract OCR with Python to extract text from the image segments.
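A minimal sketch of this step is shown below, assuming pytesseract and OpenCV are installed; the segment file name is a placeholder for the printed segment saved in Step 1.

```python
import cv2
import pytesseract

# On Windows you may need to point pytesseract at the Tesseract executable, e.g.:
# pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load the printed-text segment produced in Step 1 (placeholder file name)
segment = cv2.imread("segment_2.png")

# image_to_string runs the Tesseract engine and returns the recognized text
printed_text = pytesseract.image_to_string(segment)
print(printed_text)
```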

Tesseract OCR doesn't work well on handwritten text. When we pass the handwritten segment into Tesseract, we get very poor results, as shown below.

For handwritten text, we will use Google Cloud Vision API to get better results.

Step 3. Extract Handwritten Text Using Google Cloud Vision API

In order to use the Google Cloud Vision API, you will need to log in to your Google account, create a project (or select an existing one), and then enable the Cloud Vision API. You will also need to create a service account key and save its JSON file to your local drive, following the instructions on Google Cloud.

Now we can specify the location of the JSON file that contains the service account key, and use the following Python script to feed the handwritten segment to the Google Cloud Vision API and extract text from it.
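Below is a minimal sketch of that script, assuming the google-cloud-vision Python client library (version 2.x or later); the key file path and the segment file name are placeholders.

```python
import io
import os

from google.cloud import vision

# Point the client at the service account key downloaded from Google Cloud
# (placeholder path; replace with the location of your own key file)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/service_account_key.json"

client = vision.ImageAnnotatorClient()

# Read the handwritten segment saved in Step 1 (placeholder file name)
with io.open("segment_3.png", "rb") as f:
    content = f.read()

image = vision.Image(content=content)

# document_text_detection is suited to dense and handwritten text
response = client.document_text_detection(image=image)

if response.error.message:
    raise RuntimeError(response.error.message)

print(response.full_text_annotation.text)
```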

From the above results, we can see that the Google Cloud Vision API does a much better job than Tesseract at extracting handwritten text from images.

Source code can be found on GitHub: https://github.com/DerrickFeiWang/HandwritingRecognition_GoogleCloudVision

References

  1. Chejui Liao, OCR Engine Comparison — Tesseract vs. EasyOCR. https://medium.com/swlh/ocr-engine-comparison-tesseract-vs-easyocr-729be893d3ae
  2. IAM Handwriting Dataset. http://www.fki.inf.unibe.ch/databases/iam-handwriting-database