Review of Best Open-Source OCR Tools

Published in Technovators · 5 min read · Jul 11, 2021

Tesseract is not the only open-source option for OCR💔

With the advent of deep learning, we now have several open-source OCR options that outperform Tesseract in particular use cases.

In this blog, we’ll review some of the best open-source OCR options and give guidance on choosing the best one for a particular use case.

Optical Character Recognition converts images or scanned documents (input) into editable and searchable machine-encoded text (output). While we control the output format (plain text, hOCR, XML, or editable PDF), the nature of the input varies with the use case and needs to be considered before building any OCR pipeline.

Scanned Documents: Printed / Handwritten text recognition, commonly considered the classic OCR problem
Digital Images: Typed / Handwritten text, considered Scene Text Recognition (STR) or OCR in the Wild

Both of these areas have their own challenges and require different components (for example, the pre-processing pipeline) at various stages of the OCR pipeline. We’ll evaluate the tools for both scenarios.

Any OCR output is only as good as the input document, so understanding the input and designing the pre-processing pipeline accordingly will improve results regardless of which OCR engine you use. If you’re interested in pre-processing options that improve OCR quality, see the survey on image preprocessing techniques listed in the References.

Tesseract

Tesseract is one of the most popular open-source OCR engines. It is developed in C++, has wrappers for Python, Java, Swift, Ruby, and other languages, and recognizes text in more than 100 languages.

One Size Doesn’t Fit All

Configuring Tesseract:

Tesseract exposes a large number of configuration options, and experimenting with them can improve results substantially.

Page Segmentation Modes

Tesseract uses Leptonica for pre-processing and text segmentation and offers several page segmentation modes, selected with the “--psm” option.
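
As a minimal sketch of choosing a page segmentation mode from Python, the snippet below assumes the pytesseract wrapper and Pillow are installed (neither is named in this post). The helper names psm_config and ocr_with_psm are illustrative, not part of any library.

```python
def psm_config(psm: int) -> str:
    """Build the Tesseract option string for one page segmentation mode."""
    # Tesseract 4 and 5 define page segmentation modes 0 through 13.
    if not 0 <= psm <= 13:
        raise ValueError(f"unsupported page segmentation mode: {psm}")
    return f"--psm {psm}"


def ocr_with_psm(image_path: str, psm: int = 3) -> str:
    """Run Tesseract on one image with the given segmentation mode."""
    # Imported here so psm_config stays usable without the OCR stack installed.
    # Assumption: pytesseract and Pillow are the chosen Python bindings.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, config=psm_config(psm))
```

Mode 3 (fully automatic segmentation) is the default. Mode 6 assumes a single uniform block of text, mode 7 a single text line, and mode 11 sparse text, so it is worth trying the mode that matches your layout, for example ocr_with_psm("page.png", psm=6) on a hypothetical page.png.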

Recognition Models

tessdata_fast: Tesseract is written in C++ and already optimized for performance, but if you need more speed, try the tessdata_fast models, which are 8-bit integer versions of the tessdata models. There is a speed-accuracy trade-off, but they are worth experimenting with.
tessdata_best: The most accurate trained models for Tesseract OCR; they also act as the base models for fine-tuning.

Multilingual Text Recognition

The “-l” option selects the languages Tesseract should use. Each language needs a corresponding .traineddata file.
Multiple languages are joined with the “+” sign, for example -l deu+eng
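
The two rules above (one .traineddata file per language, codes joined with “+”) can be sketched as a small helper. The function name language_option and the idea of checking a local tessdata directory before calling Tesseract are assumptions for illustration.

```python
from pathlib import Path
from typing import Sequence


def language_option(languages: Sequence[str], tessdata_dir: str) -> str:
    """Return the value for Tesseract's -l option, e.g. "deu+eng"."""
    if not languages:
        raise ValueError("at least one language code is required")
    # Every requested language must have a <code>.traineddata file.
    missing = [
        code for code in languages
        if not (Path(tessdata_dir) / f"{code}.traineddata").is_file()
    ]
    if missing:
        raise FileNotFoundError(
            f"no .traineddata file for: {', '.join(missing)}"
        )
    return "+".join(languages)
```

The returned string can be passed on the command line after -l, or to a wrapper's language argument (pytesseract, for instance, accepts it as lang=...).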

Engine Mode (OEM)

Tesseract 4 introduced LSTM models for text recognition, which often work best. You can still use the Tesseract 3 legacy engine, or combine the legacy and LSTM engines, through the OEM option (“--oem”):

  0    Legacy engine only.
  1    Neural nets LSTM engine only.
  2    Legacy + LSTM engines.
  3    Default, based on what is available.
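
A minimal sketch of passing the engine mode to the tesseract command-line tool from Python follows. It assumes the tesseract binary is on the PATH; the EngineMode enum and both function names are illustrative.

```python
import subprocess
from enum import IntEnum
from typing import List


class EngineMode(IntEnum):
    """The four values accepted by Tesseract's --oem option."""
    LEGACY = 0
    LSTM = 1
    LEGACY_AND_LSTM = 2
    DEFAULT = 3


def tesseract_command(image_path: str,
                      oem: EngineMode = EngineMode.DEFAULT,
                      lang: str = "eng") -> List[str]:
    """Build the argument list for one tesseract invocation."""
    # Using "stdout" as the output base prints the text instead of writing a file.
    return ["tesseract", image_path, "stdout",
            "-l", lang, "--oem", str(int(oem))]


def run_tesseract(image_path: str,
                  oem: EngineMode = EngineMode.DEFAULT,
                  lang: str = "eng") -> str:
    """Run tesseract and return the recognized text."""
    result = subprocess.run(tesseract_command(image_path, oem, lang),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Note that modes 0 and 2 only work with .traineddata files that contain the legacy models; the tessdata_fast and tessdata_best files contain LSTM models only.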

Whitelisting and Blacklisting Characters

Tesseract lets us restrict which characters can appear in the output text by setting configuration variables with the “-c” option.

Whitelisting: -c tessedit_char_whitelist=<characters to whitelist>
Blacklisting: -c tessedit_char_blacklist=<characters to blacklist>
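
The two options can be assembled into a config string as sketched below. The helper name char_filter_config is hypothetical, and the sketch assumes the character sets contain no whitespace, quotes, or backslashes, since wrappers typically split the config string like a shell command line.

```python
def char_filter_config(whitelist: str = "", blacklist: str = "") -> str:
    """Build -c options that whitelist and/or blacklist output characters."""
    for chars in (whitelist, blacklist):
        # Assumption: keep the string safe for shell-style splitting.
        if any(ch.isspace() or ch in "\"'\\" for ch in chars):
            raise ValueError("whitespace, quotes and backslashes are not supported")
    parts = []
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    if blacklist:
        parts.append(f"-c tessedit_char_blacklist={blacklist}")
    return " ".join(parts)
```

For example, char_filter_config(whitelist="0123456789") produces a config string suited to digit-only fields such as invoice numbers, which can then be passed to the command line or to a wrapper's config argument.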

PROS

Optimized for CPU, and has wrappers in multiple programming languages
Still a better open-source option for scanned documents
Multilingual and other configuration options
Has Docker support

CONS

Because it relies on classical page layout analysis, its text detection is less accurate than that of its deep learning peers, and it cannot be used directly for scene text detection. However, using CRAFT or EAST (deep learning models that support multilingual text detection) for text detection and then Tesseract for text recognition yields better results in STR.
Not optimized for GPU or batch processing
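
The detect-then-recognize combination mentioned above can be sketched as follows. This is an outline under stated assumptions, not a CRAFT or EAST implementation: detect stands for any hypothetical wrapper that returns axis-aligned boxes, recognize for any function that turns a cropped image into text, and the image only needs a Pillow-style crop(box) method.

```python
from typing import Any, Callable, List, Sequence, Tuple

# (left, top, right, bottom) in pixels, matching Pillow's Image.crop convention.
Box = Tuple[int, int, int, int]


def detect_then_recognize(
    image: Any,
    detect: Callable[[Any], Sequence[Box]],
    recognize: Callable[[Any], str],
) -> List[Tuple[Box, str]]:
    """Recognize text region by region instead of on the whole scene image."""
    results = []
    # Process boxes top-to-bottom, then left-to-right, as a rough reading order.
    for box in sorted(detect(image), key=lambda b: (b[1], b[0])):
        text = recognize(image.crop(box)).strip()
        if text:  # drop regions where the recognizer found nothing
            results.append((box, text))
    return results
```

With Tesseract as the recognizer, each crop usually holds a single word or line, so a single-line page segmentation mode (for example “--psm 7”) is a reasonable starting point for the recognize function.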

EasyOCR

Ready-to-use OCR with support for 80+ languages, and growing fast. It integrates several open-source research projects and codebases.

Pipeline

Text Data Generator to train OCR model
Uses CRAFT for text detection: Accurate scene text detection and supports multilingual text detection
CRNN for Text Recognition: End-to-end trainable model with ResNet for feature extraction, LSTM for sequence modeling, and CTC for decoding. Language-specific trained models are available and are selected when creating the reader object: reader = easyocr.Reader(['ch_sim', 'en'])
Word Segmentation and Beam Search Decoder
Rearrange text into paragraphs based on language mode “ltr” or “rtl”
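
A minimal usage sketch of this pipeline follows, assuming the easyocr package is installed. The function names and the 0.5 confidence threshold are illustrative choices, not values from EasyOCR.

```python
from typing import Any, List, Sequence, Tuple


def confident_texts(results: Sequence[Tuple[Any, str, float]],
                    min_confidence: float = 0.5) -> List[str]:
    """Keep only the text of detections at or above a confidence threshold."""
    return [text for _box, text, confidence in results
            if confidence >= min_confidence]


def read_scene_text(image_path: str,
                    languages: Sequence[str] = ("ch_sim", "en"),
                    use_gpu: bool = False) -> List[str]:
    """Detect and recognize text in one image with EasyOCR."""
    # Imported here so confident_texts works without easyocr installed.
    import easyocr

    # Creating the reader loads the detection and recognition models.
    reader = easyocr.Reader(list(languages), gpu=use_gpu)
    # With the default detail level, readtext returns
    # (bounding box, text, confidence) tuples.
    results = reader.readtext(image_path, decoder="beamsearch")
    return confident_texts(results)
```

Creating the reader is the expensive step, so in a batch job it is better to build it once and reuse it across images rather than per call as in this sketch.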

PROS

Well suited for Scene Text Recognition
PyTorch 😍 deep learning models for detection and recognition
The implementation roadmap lists configurable options for the detection, recognition, and decoding steps, as well as handwritten text recognition, as future work
GPU support and batch prediction; faster than Tesseract’s OpenCL version
Multilingual and vertical text support
Includes options for whitelisting, blacklisting, output_format, and image manipulations

CONS

Slower than Tesseract in CPU Mode
Not better than Tesseract for Scanned Documents

PaddleOCR

The latest version of PaddleOCR uses PGNet, an end-to-end trainable OCR model that shares CNN features with both detection and recognition models.

PROS

End-to-End OCR Models
Data Synthesis and Semi-Automated OCR annotation solutions
Multilingual OCR Support
Different Inference and Serving Options, Benchmarks

CONS

Depends on the PaddlePaddle framework
Converting Paddle models to PyTorch / ONNX requires external libraries

Other Open-source Options

MMOCR

Based on PyTorch and MMDetection 😍. I just love their model zoo and the other open-mmlab projects
It offers multiple detection and recognition models, as well as models for downstream tasks like NER and Key Information Extraction
Looking forward to seeing multilingual support 😃

KerasOCR

A packaged version of the Keras CRNN recognizer and the CRAFT detector

Conclusion

We have compared various OCR tools in terms of their flexibility, strength in detection and recognition for different use cases, and performance.

Here is a decision tree for selecting an OCR tool for different use cases.

Open-source OCR Decision Tree, Image by Author

For documents scanned with flatbed scanners, Tesseract is still the better option, provided you can scale the OCR instances horizontally on CPUs instead of using GPUs for batch processing.
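
On a single machine, CPU-side scaling can be sketched as a pool of workers that each hand one page to Tesseract. The function name ocr_many is hypothetical, and ocr_one stands for whatever per-image OCR function you use. Threads are enough here on the assumption that ocr_one launches the tesseract binary as a separate process (as pytesseract does), so the work runs outside the Python interpreter.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def ocr_many(paths: Iterable[str],
             ocr_one: Callable[[str], str],
             max_workers: Optional[int] = None) -> Dict[str, str]:
    """Run ocr_one over many images concurrently and map path -> text."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with paths.
        texts = list(pool.map(ocr_one, paths))
    return dict(zip(paths, texts))
```

Scaling across several machines follows the same pattern, with the thread pool replaced by whatever job distribution your infrastructure provides.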

Before selecting an option based on this decision tree, I recommend reviewing the latest features of these OCR tools and testing them on your own dataset.
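
One simple way to compare tools on your own data is the character error rate (CER): the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The sketch below is a plain-Python illustration of that metric; it is an added suggestion, not something measured in this review.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Averaging this value over a representative sample of your documents, per tool and per configuration, gives a like-for-like comparison; lower is better.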

References

Survey on Image Preprocessing Techniques to Improve OCR Accuracy (medium.com)

Scene Text Detection In Python With EAST and CRAFT (medium.com)

Happy Learning!!!😍

Written by Mageshwaran R