Review of Best Open-Source OCR Tools

Mageshwaran R
Jul 11 · 5 min read

Tesseract is not the only open-source option for OCR💔

With the advent of deep learning, we now have several open-source OCR options that outperform Tesseract in particular use cases.

In this blog, we’ll review some of the best open-source OCR options and offer guidance on choosing the right one for a particular use case.

Photo by Austin Chan on Unsplash

Optical Character Recognition (OCR) converts images or scanned documents (the input) into editable, searchable machine-encoded text (the output). While we control the output format (plain text, hOCR, XML, or editable PDF), the nature of the input varies by use case and needs to be considered before building any OCR pipeline.

  • Scanned documents: printed or handwritten text recognition, commonly treated as the classic OCR problem
  • Digital images: typed or handwritten text in photos, treated as Scene Text Recognition (STR), or OCR in the wild

Each of these areas has its own challenges and requires different components (for example, the pre-processing pipeline) at various stages of the OCR pipeline. We’ll evaluate the tools for both scenarios.

Any OCR output is only as good as the input document, so understanding the input and designing the pre-processing pipeline accordingly will improve accuracy irrespective of which OCR engine you use. Check out this blog if you’re interested in the various pre-processing options for improving OCR quality.

Tesseract

Tesseract is one of the most popular open-source OCR engines. It is developed in C++, has wrappers for Python, Java, Swift, Ruby, and other languages, and recognizes text in more than 100 languages.

One Size Doesn’t Fit All

Configuring Tesseract:

Tesseract has a large set of options, so experimenting with them can improve results substantially.

Page Segmentation Modes

Tesseract uses Leptonica for pre-processing and text segmentation and has various options for page segmentation.
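As a rough illustration, here is how a page segmentation mode could be selected from Python. This is a minimal sketch that assumes the pytesseract wrapper and Pillow are installed; the PageSegMode enum and the helper names are made up for this example, and only a handful of the available modes are listed.

```python
from enum import IntEnum


class PageSegMode(IntEnum):
    """A few of Tesseract's page segmentation modes (values passed to --psm)."""
    AUTO = 3           # fully automatic page segmentation, no OSD (the default)
    SINGLE_COLUMN = 4  # a single column of text of variable sizes
    SINGLE_BLOCK = 6   # a single uniform block of text
    SINGLE_LINE = 7    # a single text line
    SINGLE_WORD = 8    # a single word
    SPARSE_TEXT = 11   # as much text as possible, in no particular order


def psm_config(mode: PageSegMode) -> str:
    """Build the config string that selects a page segmentation mode."""
    return f"--psm {int(mode)}"


def ocr_with_psm(image_path: str, mode: PageSegMode = PageSegMode.AUTO) -> str:
    """Run Tesseract on one image with the chosen segmentation mode."""
    # Imported lazily so the helpers above work without the OCR stack installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path), config=psm_config(mode))
```

For a cropped line of text, something like ocr_with_psm("line.png", PageSegMode.SINGLE_LINE) is usually worth trying before the default mode.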

Recognition Models

  • tessdata_fast: Tesseract is written in C++ and optimized for performance, but if you need more speed, try the tessdata_fast models, which are 8-bit integer versions of the tessdata models. There is a speed-accuracy trade-off, but they are worth experimenting with.
  • tessdata_best: the most accurate trained models for Tesseract, which also serve as the base models for fine-tuning.

Multilingual Text Recognition

  • The “-l” option selects the languages Tesseract should use. Each language needs a matching .traineddata file.
  • Multiple languages are joined with the “+” sign, for example -l deu+eng
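The same language selection can be made from Python. The sketch below assumes the pytesseract wrapper, whose lang argument takes the same “+”-joined string as “-l”; lang_arg and ocr_multilingual are hypothetical helper names.

```python
def lang_arg(*codes: str) -> str:
    """Join language codes the way Tesseract's -l option expects, e.g. deu+eng."""
    if not codes:
        raise ValueError("at least one language code is required")
    return "+".join(codes)


def ocr_multilingual(image_path: str, *codes: str) -> str:
    """Recognize text using every listed language model."""
    # Lazy imports: the matching .traineddata files must be installed for this to run.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path), lang=lang_arg(*codes))
```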

Engine Mode (OEM)

Tesseract 4 introduced LSTM models for text recognition, which often work best. You can still use the Tesseract 3 legacy engine, or combine legacy + LSTM, through the OEM option:

  0    Legacy engine only.
  1    Neural nets LSTM engine only.
  2    Legacy + LSTM engines.
  3    Default, based on what is available.
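One way to see which engine suits your documents is to run the same image under more than one mode and compare the text. The sketch below assumes pytesseract and Pillow; compare_engine_modes is a hypothetical helper, and the legacy modes only work if the installed .traineddata file actually contains a legacy model.

```python
OEM_MODES = {
    0: "Legacy engine only",
    1: "Neural nets LSTM engine only",
    2: "Legacy + LSTM engines",
    3: "Default, based on what is available",
}


def oem_config(oem: int) -> str:
    """Build the config string that selects an OCR engine mode."""
    if oem not in OEM_MODES:
        raise ValueError(f"unknown engine mode: {oem}")
    return f"--oem {oem}"


def compare_engine_modes(image_path: str, modes=(0, 1)) -> dict:
    """Return {mode: recognized text} so the outputs can be compared side by side."""
    import pytesseract
    from PIL import Image

    image = Image.open(image_path)
    return {m: pytesseract.image_to_string(image, config=oem_config(m)) for m in modes}
```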

Whitelisting and Blacklisting Characters

Tesseract lets us restrict the characters that can appear in the output text by setting config variables with the “-c” option.

  • Whitelisting: -c tessedit_char_whitelist=<characters to whitelist>
  • Blacklisting: -c tessedit_char_blacklist=<characters to blacklist>
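For example, a digits-only field such as an invoice number could be read with a whitelist. This is a sketch assuming pytesseract; char_filter_config and read_digits are hypothetical names, and character sets containing spaces or quotes would need extra escaping that is not handled here.

```python
def char_filter_config(whitelist: str = "", blacklist: str = "") -> str:
    """Build -c settings that allow or forbid specific characters."""
    parts = []
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    if blacklist:
        parts.append(f"-c tessedit_char_blacklist={blacklist}")
    return " ".join(parts)


def read_digits(image_path: str) -> str:
    """Recognize a single line that should contain digits only."""
    import pytesseract
    from PIL import Image

    # --psm 7 treats the image as one text line; the whitelist limits output to digits.
    config = "--psm 7 " + char_filter_config(whitelist="0123456789")
    return pytesseract.image_to_string(Image.open(image_path), config=config).strip()
```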

PROS

  • Optimized for CPU, and has wrappers in multiple programming languages
  • Still the better open-source option for scanned documents
  • Multilingual and other configuration options
  • Has Docker support

CONS

  • Because it relies on classical page layout analysis, its text detection is less accurate than that of its deep learning peers, and it cannot be used directly for scene text detection. However, using CRAFT or EAST for text detection (deep learning models that support multilingual text detection) and then Tesseract for text recognition yields better results in STR.
  • Not optimized for GPU or batch processing
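The detect-then-recognize idea from the first point can be sketched as follows. The boxes are assumed to come from whatever detector you use (CRAFT, EAST, or anything else) as axis-aligned pixel rectangles; running the detector itself is left out, and pad_box, recognize_regions, and tesseract_line are hypothetical helper names. The recognizer assumes pytesseract and a Pillow image.

```python
from typing import Callable, Iterable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def pad_box(box: Box, pad: int, width: int, height: int) -> Box:
    """Grow a detected box by a small margin, clamped to the image bounds."""
    left, top, right, bottom = box
    return (
        max(0, left - pad),
        max(0, top - pad),
        min(width, right + pad),
        min(height, bottom + pad),
    )


def recognize_regions(image, boxes: Iterable[Box], recognize: Callable, pad: int = 4) -> List[str]:
    """Crop each detected region from a Pillow-style image and recognize it."""
    width, height = image.size
    return [recognize(image.crop(pad_box(box, pad, width, height))) for box in boxes]


def tesseract_line(crop) -> str:
    """Recognize one cropped region as a single text line with Tesseract."""
    import pytesseract

    return pytesseract.image_to_string(crop, config="--psm 7").strip()
```

A call would look like recognize_regions(image, detected_boxes, tesseract_line). A few pixels of padding tends to help because tight detector boxes can clip the edges of characters.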

EasyOCR

Ready-to-use OCR with support for 80+ languages, and growing fast. It integrates several pieces of open-source research and code.

EasyOCR pipeline

Pipeline

  • Text Data Generator to train OCR model
  • Uses CRAFT for text detection: accurate scene text detection with multilingual support
  • CRNN for text recognition: an end-to-end trainable model with ResNet for feature extraction, LSTM for sequence modeling, and CTC for decoding. Language-specific trained models are available and are selected when creating the reader object: reader = easyocr.Reader(['ch_sim', 'en'])
  • Word Segmentation and Beam Search Decoder
  • Rearrange text into paragraphs based on language mode “ltr” or “rtl”
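To make the CTC decoding step in the pipeline above more concrete, here is a minimal sketch of greedy (best-path) CTC decoding in plain Python. It is not EasyOCR’s own code; it assumes the recognizer has already picked the most likely label for each time step, and that label 0 is the CTC blank while labels 1..N index into the character set.

```python
from typing import List, Sequence


def ctc_greedy_decode(frame_labels: Sequence[int], charset: str) -> str:
    """Collapse repeated labels, then drop blanks (label 0)."""
    blank = 0
    chars: List[str] = []
    previous = blank
    for label in frame_labels:
        # Emit a character only when the label changes and is not the blank.
        if label != previous and label != blank:
            chars.append(charset[label - 1])  # labels are offset by one for the blank
        previous = label
    return "".join(chars)
```

The blank between the two runs of “l” in a word like “hello” is what lets CTC emit a doubled letter instead of collapsing it into one. A beam search decoder keeps several candidate label sequences instead of only the single best label per step.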

PROS

  • Well suited for Scene Text Recognition
  • PyTorch 😍 deep learning models for detection and recognition
  • The implementation roadmap lists configurable options for the detection, recognition, and decoding steps, as well as handwritten text recognition, as future work
  • GPU support and batch prediction; faster than Tesseract’s OpenCL version
  • Multilingual and vertical text support
  • Includes options for whitelisting, blacklisting, output_format, and image manipulations
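A short sketch of how these options look in practice, assuming the easyocr package is installed. read_with_allowlist and confident_text are hypothetical helpers, and the 0.5 confidence threshold is an arbitrary value chosen for illustration.

```python
def read_with_allowlist(image_path: str, languages=("en",), allowlist: str = "0123456789",
                        use_gpu: bool = False):
    """Run EasyOCR and restrict the recognized characters to an allowlist."""
    # Lazy import: creating a Reader loads (and may download) the model weights.
    import easyocr

    reader = easyocr.Reader(list(languages), gpu=use_gpu)
    # With the default detail level, each result is (bounding box, text, confidence).
    return reader.readtext(image_path, allowlist=allowlist)


def confident_text(results, min_confidence: float = 0.5):
    """Keep only the text of detections at or above a confidence threshold."""
    return [text for _bbox, text, confidence in results if confidence >= min_confidence]
```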

CONS

  • Slower than Tesseract in CPU mode
  • No better than Tesseract for scanned documents

PaddleOCR

Recent versions of PaddleOCR include PGNet, an end-to-end trainable OCR model in which the detection and recognition branches share CNN features.

PROS

CONS

  • Depends on PaddlePaddle
  • Needs external libraries for Paddle-to-PyTorch or ONNX conversion

Other Open-source Options

MMOCR

  • Based on PyTorch and MMDetection 😍. I just love their model zoo and the other OpenMMLab projects
  • Offers multiple detection and recognition models, plus models for downstream tasks like NER and Key Information Extraction
  • Looking forward to seeing Multilingual support😃

KerasOCR

  • Packaged version of the Keras CRNN and the CRAFT detector

Conclusion

We have compared various OCR tools in terms of their flexibility, strength in detection and recognition for different use cases, and performance.

Here is a decision tree on selecting the OCR tools for different use cases.

Open-source OCR Decision Tree, Image by Author

For documents scanned using flatbed scanners, Tesseract is still the better option if you can horizontally scale the OCR instances on CPUs instead of using GPUs for batch processing.
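On a single machine, the CPU fan-out can be sketched with a thread pool. This assumes pytesseract, which launches the tesseract binary as a separate process for each call, so threads are enough to keep several cores busy. ocr_many and tesseract_file are hypothetical helpers, and the same per-file function could equally be run by workers on separate instances.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def ocr_many(paths: Iterable[str], ocr_fn: Callable[[str], str], workers: int = 4) -> Dict[str, str]:
    """Run ocr_fn over many files concurrently and return {path: text}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = list(pool.map(ocr_fn, paths))  # map preserves input order
    return dict(zip(paths, texts))


def tesseract_file(path: str) -> str:
    """OCR a single scanned page with Tesseract's default settings."""
    import pytesseract
    from PIL import Image

    with Image.open(path) as image:
        return pytesseract.image_to_string(image)
```

A call would look like ocr_many(scanned_pages, tesseract_file, workers=8), with the worker count tuned to the number of available cores.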

Before choosing an option from this decision tree, I recommend reviewing the latest features of these OCR tools and testing them on your own dataset.

Happy Learning!!!😍