Review of Best Open-Source OCR Tools
Tesseract is not the only open-source option for OCR💔
With the advent of deep learning, we now have several open-source OCR options that outperform Tesseract in particular use cases.
In this blog, we’ll review some of the best open-source OCR options and offer guidance on choosing the right one for a particular use case.
Optical Character Recognition (OCR) converts images or scanned documents (the input) into editable, searchable, machine-encoded text (the output). While we control the output format (plain text, hOCR, XML, or editable PDF), the nature of the input varies with the use case and needs to be considered before building any OCR pipeline. Inputs broadly fall into two categories:
- Scanned Documents: Printed or handwritten text, commonly treated as the classic OCR problem
- Digital Images: Typed or handwritten text in photos, treated as Scene Text Recognition (STR) or OCR in the Wild
Both of these areas have their own challenges and require different components (for example, the pre-processing pipeline) at various stages of the OCR pipeline. We’ll evaluate the tools for both scenarios.
Any OCR output is only as good as the input document, so understanding the input and designing the pre-processing pipeline accordingly will boost performance irrespective of which OCR engine you use. Check out this blog if you’re interested in the various pre-processing options for improving OCR quality.
Tesseract is one of the most popular open-source OCR engines. It is developed in C++, has wrappers for Python, Java, Swift, Ruby, and other languages, and recognizes text in more than 100 languages.
One Size Doesn’t Fit All
Tesseract has a large number of configuration options, so experimenting with them can improve results substantially.
Page Segmentation Modes
- tessdata_fast: Tesseract is written in C++ and optimized for performance, but if you need more speed, try the tessdata_fast models, which are 8-bit integer versions of the tessdata models. There is a speed-accuracy trade-off, but they are worth experimenting with.
- tessdata_best: The most accurate trained models for Tesseract OCR; they also act as the base models for fine-tuning.
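As a rough sketch of how a model set and a page segmentation mode might be selected from Python, the helper below only builds the argument list for the `tesseract` command line. `--tessdata-dir` and `--psm` are standard Tesseract options; the directory paths and the `build_command` name are assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical locations of the downloaded model sets; adjust to your setup.
MODEL_DIRS = {
    "fast": Path("/opt/tessdata_fast"),
    "best": Path("/opt/tessdata_best"),
}


def build_command(image, model_set="fast", psm=3):
    """Return the argv list for a Tesseract run that prints text to stdout."""
    if model_set not in MODEL_DIRS:
        raise ValueError(f"unknown model set: {model_set!r}")
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    return [
        "tesseract", str(image), "stdout",
        "--tessdata-dir", str(MODEL_DIRS[model_set]),
        "--psm", str(psm),  # 3 = fully automatic page segmentation (default)
    ]


if __name__ == "__main__":
    # PSM 6 treats the page as a single uniform block of text.
    print(" ".join(build_command("invoice.png", model_set="best", psm=6)))
```

The list can be passed directly to `subprocess.run`, so switching between the fast and best models on the same image becomes a one-argument change.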
Multilingual Text Recognition
- The “-l” option selects the language(s) Tesseract uses. Each language needs a .traineddata file.
- Multiple languages can be combined with the “+” sign, for example -l deu+eng
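The two points above can be sketched in a few lines of Python: one helper joins language codes for the `-l` option, and another reports which codes have no `.traineddata` file in a given directory. Both function names and the example directory are hypothetical.

```python
from pathlib import Path


def language_arg(codes):
    """Join Tesseract language codes with '+' for the -l option."""
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def missing_traineddata(codes, tessdata_dir):
    """Return the codes that have no <code>.traineddata file in tessdata_dir."""
    directory = Path(tessdata_dir)
    return [c for c in codes if not (directory / f"{c}.traineddata").is_file()]


if __name__ == "__main__":
    langs = ["deu", "eng"]
    # The tessdata path below is only an example location.
    print("missing models:", missing_traineddata(langs, "/usr/share/tessdata"))
    print(["tesseract", "page.png", "stdout", "-l", language_arg(langs)])
```

Checking for the model files up front gives a clearer error than letting the OCR run fail partway through a batch.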
Engine Mode (OEM)
Tesseract 4 introduced LSTM models for text recognition, which often work best. You can still use the Tesseract 3 legacy engine, or combine legacy + LSTM, through the OEM option (--oem):
0 Legacy engine only.
1 Neural nets LSTM engine only.
2 Legacy + LSTM engines.
3 Default, based on what is available.
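To keep the four engine modes readable in code, one option is a small enumeration that maps names to the numeric values listed above. This is only a sketch: `EngineMode` and `oem_args` are illustrative names, while `--oem` is the actual Tesseract flag.

```python
from enum import IntEnum


class EngineMode(IntEnum):
    """Numeric values accepted by Tesseract's --oem option."""
    LEGACY_ONLY = 0
    LSTM_ONLY = 1
    LEGACY_AND_LSTM = 2
    DEFAULT = 3


def oem_args(mode=EngineMode.DEFAULT):
    """Return the command-line arguments that select an OCR engine mode."""
    # EngineMode(mode) raises ValueError for anything outside 0..3.
    return ["--oem", str(int(EngineMode(mode)))]


if __name__ == "__main__":
    print(["tesseract", "page.png", "stdout", *oem_args(EngineMode.LSTM_ONLY)])
```

Note that modes 0 and 2 need traineddata files that include the legacy models; the tessdata_fast and tessdata_best sets ship LSTM models only.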
Whitelisting and Blacklisting Characters
Tesseract allows us to restrict the characters that appear in the output text using the “-c” option, which sets configuration variables.
- Whitelisting: -c tessedit_char_whitelist=<characters to whitelist>
- Blacklisting: -c tessedit_char_blacklist=<characters to blacklist>
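A minimal sketch of building these `-c` arguments from Python follows. The helper name `char_filter_args` is hypothetical; the two variable names are the ones shown above. Passing the arguments as a list avoids shell quoting problems when the character set contains symbols.

```python
import string


def char_filter_args(whitelist=None, blacklist=None):
    """Return -c arguments restricting the characters Tesseract may output."""
    args = []
    if whitelist:
        args += ["-c", f"tessedit_char_whitelist={whitelist}"]
    if blacklist:
        args += ["-c", f"tessedit_char_blacklist={blacklist}"]
    return args


if __name__ == "__main__":
    # Example: a numeric field where only digits are valid output.
    command = ["tesseract", "amount.png", "stdout",
               *char_filter_args(whitelist=string.digits)]
    print(command)
```

A whitelist suits fields with a known alphabet (amounts, dates, IDs), while a blacklist is handy for suppressing a few characters that are repeatedly misrecognized.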
- Optimized for CPU, and has wrappers in multiple programming languages
- Still a better open-source option for scanned documents
- Multilingual and other configuration options
- Has Docker support
- Since it uses classical page layout analysis, its text detection is less accurate than that of its deep learning peers, and it cannot be used directly for scene text detection. However, using CRAFT / EAST for text detection (deep learning models that support multilingual text detection) and then Tesseract for text recognition yields better results in scene text recognition (STR).
- Not optimized for GPU and batch processing
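The detect-then-recognize idea mentioned above can be sketched without committing to a particular detector. The code below assumes a detector such as CRAFT or EAST has already produced axis-aligned `(x, y, width, height)` boxes, and that `recognize` is any callable that turns a cropped region into text (for example, a wrapper around Tesseract run on a single text line). All names are illustrative, and the reading-order heuristic is deliberately crude.

```python
def reading_order(boxes, line_tolerance=10):
    """Sort (x, y, w, h) boxes roughly top-to-bottom, then left-to-right."""
    # Boxes whose top edges fall in the same vertical band count as one line.
    return sorted(boxes, key=lambda b: (b[1] // line_tolerance, b[0]))


def crop(image, box):
    """Cut a box out of an image stored as a list of pixel rows."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]


def recognize_regions(image, boxes, recognize):
    """Run the recognizer on each detected region, in reading order."""
    return [recognize(crop(image, box)) for box in reading_order(boxes)]


if __name__ == "__main__":
    image = [[0] * 100 for _ in range(60)]             # blank 100x60 "image"
    boxes = [(50, 0, 30, 10), (0, 2, 40, 10), (0, 40, 80, 10)]
    fake_recognizer = lambda region: f"{len(region[0])}x{len(region)}"
    print(recognize_regions(image, boxes, fake_recognizer))
```

Keeping the recognizer as a parameter makes it easy to compare engines on exactly the same detected regions.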
EasyOCR is a ready-to-use OCR library with 80+ supported languages, and the list is growing fast. It integrates several open-source research projects and codebases.
- Text Data Generator to train OCR model
- Uses CRAFT for text detection: Accurate scene text detection and supports multilingual text detection
- CRNN for Text Recognition: End-to-end trainable model with ResNet for feature extraction, LSTM for sequence labeling, and CTC for decoding. Language-specific trained models are available and are selected when creating the reader object: reader = easyocr.Reader(['ch_sim', 'en'])
- Word Segmentation and Beam Search Decoder
- Rearrange text into paragraphs based on language mode “ltr” or “rtl”
- Well suited for Scene Text Recognition
- PyTorch 😍 deep learning models for detection and recognition
- Implementation roadmap shows configurable options for Detection, Recognition and Decoding steps in the future, and also Handwritten Recognition
- GPU support and batch prediction, faster compared to Tesseract’s OpenCL version
- Multilingual and Vertical Text Support
- Includes Options for Whitelisting, Blacklisting, output_format, Image manipulations
- Slower than Tesseract in CPU Mode
- Not better than Tesseract for Scanned Documents
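A short sketch of typical EasyOCR usage, based on the reader object shown above, is given below. `read_text` and `confident_text` are hypothetical helper names and the 0.5 threshold is arbitrary; `easyocr` is imported lazily so the filtering helper can be used and tested without the library installed.

```python
def read_text(image_path, languages=("ch_sim", "en"), use_gpu=False):
    """Detect and recognize text, returning (bbox, text, confidence) tuples."""
    import easyocr  # requires `pip install easyocr`; models download on first use

    reader = easyocr.Reader(list(languages), gpu=use_gpu)
    # Beam search decoding is slower than the default greedy decoder.
    return reader.readtext(image_path, decoder="beamsearch", beamWidth=5)


def confident_text(results, min_confidence=0.5):
    """Keep only the strings whose recognition confidence is high enough."""
    return [text for _bbox, text, conf in results if conf >= min_confidence]


if __name__ == "__main__":
    # Stand-in for read_text("street_sign.jpg"): two detections, one weak.
    sample = [
        ([[0, 0], [80, 0], [80, 20], [0, 20]], "EXIT", 0.97),
        ([[0, 30], [60, 30], [60, 50], [0, 50]], "3x1t", 0.21),
    ]
    print(confident_text(sample))
```

Filtering on confidence is a cheap way to drop spurious detections in scene images before any downstream processing.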
The latest version of PaddleOCR uses PGNet, an end-to-end trainable OCR model that shares CNN features with both detection and recognition models.
- End-to-End OCR Models
- Data Synthesis and Semi-Automated OCR annotation solutions
- Multilingual OCR Support
- Different Inference and Serving Options, Benchmarks
- Has dependency on
- Needs external libraries for Paddle to PyTorch / ONNX conversion
Other Open-source Options
- MMOCR: Based on PyTorch and MMDetection 😍. I just love their model zoo and the other open-mmlab projects
- They have multiple Detection and Recognition models and also models for Downstream tasks like NER and Key Information Extraction
- Looking forward to seeing Multilingual support😃
- keras-ocr: A packaged version of the Keras CRNN recognizer and the CRAFT text detector
We have compared various OCR tools in terms of their flexibility, strength in detection and recognition for different use cases, and performance.
Here is a decision tree on selecting the OCR tools for different use cases.
For Documents Scanned using Flatbed Scanners, Tesseract is still a better option if you can horizontally scale the OCR instances with CPUs instead of using GPUs for batch processing.
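On a single machine, the same idea can be approximated by running several `tesseract` processes side by side. The sketch below is a stand-in for true horizontal scaling across instances: it uses a thread pool only to launch and wait on subprocesses, and sets the standard OpenMP variable `OMP_THREAD_LIMIT` so each process stays on one core. The function names and the worker count are assumptions.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_tesseract(image_path):
    """OCR one image with the tesseract CLI and return the recognized text."""
    env = {**os.environ, "OMP_THREAD_LIMIT": "1"}  # one core per process
    completed = subprocess.run(
        ["tesseract", str(image_path), "stdout"],
        capture_output=True, text=True, check=True, env=env,
    )
    return completed.stdout


def ocr_many(image_paths, workers=4, ocr=run_tesseract):
    """Run `ocr` over many images concurrently and map each path to its text."""
    paths = list(image_paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs paths with their output.
        return dict(zip(paths, pool.map(ocr, paths)))


if __name__ == "__main__":
    # Dry run with a fake OCR function; drop `ocr=` to call tesseract itself.
    print(ocr_many(["a.png", "b.png"], workers=2, ocr=lambda p: f"text of {p}"))
```

Because `ocr` is injectable, the same scheduling code can be reused to benchmark other engines on the dataset.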
Before choosing an option based on this decision tree, I recommend reviewing the latest features of these OCR tools and testing them on your own dataset.
Survey on Image Preprocessing Techniques to Improve OCR Accuracy
Even the best OCR tool will fail to produce good results when the input image/document quality is too bad. Use these…
Scene Text Detection In Python With EAST and CRAFT
Text Detection in the Wild is very challenging. However, with the advent of deep learning, we are in the right direction…