Robust OCR on Custom Documents with TFOD and EasyOCR

Nnamaka
5 min read · May 22, 2022

The OCR engine we will implement is broken down into two major components — text detection and text recognition. The TFOD (TensorFlow Object Detection) API will be used for text detection, while EasyOCR will be used for text recognition.

What is OCR?

Optical character recognition is the conversion of images of typed, handwritten, or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo, or subtitle text superimposed on an image.

OCR converts text in an image to a machine-readable text format.

Use cases of OCR

  • Capturing invoices — Companies can quickly extract data from bills using a combination of OCR and other AI approaches.
  • Banking — With OCR-based check-depositing tools in mobile banking apps, your checks can be deposited digitally and processed in no time.
  • Healthcare — Data from X-ray reports, patient histories, treatments, diagnostics, tests, and general hospital records can be digitized using OCR technology.

Next, we will go over the implementation of an OCR engine that comprises an SSD MobileNet model for ROI (region of interest) text detection and the EasyOCR model for text recognition. The complete code for this post is hosted here on my GitHub account. Please enjoy!

Implementation

We will start by creating and annotating our dataset. A great tool to annotate our dataset is labelImg.

The goal of this project is to use OCR to extract the 'chapter' and 'title' text from a textbook. This moderately sized project will showcase the features and innovations in the frameworks and tools we are about to use.

The core software technologies used in this project are:

  • TensorFlow Object Detection (TFOD) API for text detection
  • EasyOCR for text recognition
  • labelImg for dataset annotation
  • Google Colab as the training environment

Now let's begin!

Step 1 — Collect and Annotate the Image dataset

I used my mobile device's camera to take pictures of the documents I wanted to perform OCR on, collecting 77 images in total at good quality and resolution. I then gathered those images into a folder and annotated them by drawing bounding boxes around the regions of interest.

I chose the Pascal VOC format for the dataset, which will later be converted to TFRecord format and fed into the TFOD training pipeline. Because I selected Pascal VOC, labelImg produced one XML file per image. The format of the XML file is shown below.
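As a rough sketch, one of those annotation files looks like this; the file name and box coordinates below are hypothetical, and 'chapter' is one of the labels annotated in this project:

<annotation>
    <folder>images</folder>
    <filename>page_001.jpg</filename>
    <size>
        <width>3024</width>
        <height>4032</height>
        <depth>3</depth>
    </size>
    <object>
        <name>chapter</name>
        <bndbox>
            <xmin>410</xmin>
            <ymin>220</ymin>
            <xmax>980</xmax>
            <ymax>360</ymax>
        </bndbox>
    </object>
</annotation>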

After that, I had a script split my dataset into 'train' and 'test' folders; the script is on my GitHub account. To use it, run the following command in a directory containing both the script and your dataset:

partition_dataset.py [-h] [-i IMAGEDIR] [-o OUTPUTDIR] [-r RATIO] [-x]
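For example, assuming -i points at the folder of images, -o at the output folder, -r is the fraction of images to place in the test set, and -x tells the script to copy the XML annotations along with the images, a typical invocation might look like this:

python partition_dataset.py -i images -o images -r 0.1 -x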

Step 2 — Prepare the Dataset and Train

I described in detail how to do this step in this blog post.

I trained my custom model for 2000 epochs.

I chose to train my dataset on the 'SSD MobileNet V2 FPNLite' model using transfer learning (reusing the features and weights of an already-trained model to learn the features of my dataset), repurposing the model for my use case.
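For reference, training with the TFOD API typically comes down to pointing its model_main_tf2.py script at a pipeline config. A minimal sketch; the paths below are hypothetical and depend on your folder layout:

python model_main_tf2.py \
    --model_dir=models/my_ssd_mobilenet_v2_fpnlite \
    --pipeline_config_path=models/my_ssd_mobilenet_v2_fpnlite/pipeline.config \
    --num_train_steps=2000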

Step 3 — Run detections on the trained model

Refer to my GitHub repo for the full code.

I trained on Google Colab.

Checkpoints were saved during training. These checkpoints are snapshots of the model's weights and training state, and they let us restore the model later without retraining.

We pass our image into the model so it can detect the region of interest and return the bounding-box coordinates around that region.

Let's restore the latest checkpoint of the trained model; mine is 'ckpt-3'.

# Restore the latest training checkpoint into the detection model
ckpt = tf.compat.v2.train.Checkpoint(model=detection_model)
ckpt.restore(os.path.join(paths['CHECKPOINT_PATH'], 'ckpt-3')).expect_partial()

Now we pass the document image to the model. Below is a utility function I created to run inference:

def detect_fn(image):
    # Resize and normalize the input, run the forward pass, then decode boxes
    image, shapes = detection_model.preprocess(image)
    prediction_dict = detection_model.predict(image, shapes)
    detections = detection_model.postprocess(prediction_dict, shapes)
    return detections

We then run inference by passing the document image to the model. The image is first converted to a tensor, a data type the model recognizes and works well with; a sketch of that conversion follows.
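A minimal sketch of the conversion, assuming the image is loaded with PIL (IMAGE_PATH is a placeholder for your document image):

import numpy as np
import tensorflow as tf
from PIL import Image

# Load the image as a NumPy array, add a batch dimension, and cast to float
image_np = np.array(Image.open(IMAGE_PATH))
input_tensor = tf.convert_to_tensor(np.expand_dims(image_np, 0), dtype=tf.float32)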

detections = detect_fn(input_tensor)

Step 4 — Pass the detections to our EasyOCR model

EasyOCR performs OCR in 80+ languages. Its implementation is based on PyTorch.

First, we install EasyOCR into our working environment.

!pip install easyocr

Instantiate a reader object.

reader = easyocr.Reader(['en'])
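The region we pass to the reader below is the detected ROI cropped out of the original image. A minimal sketch, assuming image_np and detections from the previous step and taking the top-scoring box (the model returns boxes in normalized [ymin, xmin, ymax, xmax] form):

# Take the highest-scoring detection and convert its normalized box to pixels
boxes = detections['detection_boxes'][0].numpy()
ymin, xmin, ymax, xmax = boxes[0]
h, w = image_np.shape[:2]
region = image_np[int(ymin * h):int(ymax * h), int(xmin * w):int(xmax * w)]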

And then we make inferences.

ocr_result = reader.readtext(region)
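readtext returns a list of (bounding box, text, confidence) tuples, which is where the detected text and confidence values reported below come from.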

Note that there are minor steps carried out between the major steps covered in this post; the code repo contains the full pipeline. Check it out!

The output accuracy from this OCR engine is great, even though we trained on only 77 images. The pipeline is a powerful combination for an OCR engine.

EasyOCR interpreted the results from the model's output as 'unit 6' and 'immunization', with confidences of 0.99 (99%) and 0.77 (77%) respectively.

Observations

The accuracy of the TFOD model can be increased dramatically just by adding more training images to the dataset. We can also increase the number of training epochs to give our model enough time to learn more features from the dataset.

EasyOCR is still actively maintained. We could squeeze more performance out of it by passing it high-quality, preprocessed images. An image-processing component can be added to this OCR engine between the text detection and text recognition components, as sketched below.
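For instance, a minimal preprocessing sketch using OpenCV; region and reader are from the earlier steps, and the exact operations are a design choice rather than part of the original pipeline:

import cv2

# Grayscale plus Otsu thresholding often sharpens printed text for OCR
gray = cv2.cvtColor(region, cv2.COLOR_RGB2GRAY)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
ocr_result = reader.readtext(binary)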

Conclusion

Optical character recognition (OCR) is a technology with enormous promise. As a result, it’s not surprising that it’s swiftly gaining traction in the industry!

OCR technology is at the core of a growing trend when it comes to workflow modernization.

Contact me

Twitter — https://twitter.com/DNnamaka

Github — https://github.com/Nnamaka

Email — nnamaka7@gmail.com
