Top Open-source OCR programs

Takoua Saadani
Published in UBIAI NLP
5 min read · Sep 5, 2022

OCR is still a relatively new technology for business process automation, which is why most industries continue to rely on traditional systems.

As businesses increasingly go digital, however, OCR is expected to become a requirement in a variety of industries, including communication, banking, insurance, legal, healthcare, tourism, and retail.

OCR, or Optical Character Recognition, is a technology that converts a physical paper document or an image into an electronic text-based version.

Modern OCR engines can also read handwritten text with much greater accuracy than earlier systems.

In this article, we present seven open-source OCR programs that you should know about if your business deals with data entry in any form, such as invoices or legal billing documentation.

1- CuneiForm

“CuneiForm,” also known as “Cognitive OpenOCR,” is a multi-language, open-source optical character recognition system created by Cognitive Technologies.

It can analyze layouts and recognize text formats.

Pros and cons:

CuneiForm for Linux lacks a graphical user interface, but:

  • It is cross-platform.
  • It keeps the structure and formatting of the document.
  • It recognizes tables of any structure and complexity, including those without grid lines.
  • It recognizes text in 23 languages.
  • It performs text format scanning, document identification, and layout analysis.

2- GOCR

GOCR (also known as JOCR) is a free optical character recognition program that converts scanned image files into text files.

Pros and cons:

GOCR is reported to have trouble with serif fonts, overlapping characters, handwritten text, heterogeneous fonts, noisy images, large angles of skew, and text in anything other than a Latin alphabet, but:

  • It can be used as a standalone command-line program or as a back-end for other programs.
  • It is compatible with the Linux, Windows, and OS/2 operating systems.
  • It includes a gocr.tcl graphic interface.
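
Because GOCR is a command-line program, another project can use it as a back-end by calling it through a subprocess. The following is a minimal sketch, assuming the `gocr` binary is installed and on the PATH, and that its `-i` (input image) and `-o` (output file) options behave as described in its manual. The helper names and file names are hypothetical.

```python
import shutil
import subprocess


def build_gocr_command(image_path, output_path):
    """Return the argument list for one GOCR run (no shell involved)."""
    # Assumption: -i names the input image and -o the output text file.
    return ["gocr", "-i", image_path, "-o", output_path]


def run_gocr(image_path, output_path):
    """Run GOCR on one image and return the recognized text.

    Raises RuntimeError if the gocr binary is not installed.
    """
    if shutil.which("gocr") is None:
        raise RuntimeError("gocr was not found on PATH")
    subprocess.run(build_gocr_command(image_path, output_path), check=True)
    with open(output_path, encoding="utf-8", errors="replace") as handle:
        return handle.read()
```

Passing the arguments as a list rather than a shell string avoids quoting problems when file names contain spaces.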

3- Tesseract

Tesseract is a free and open-source OCR engine originally developed at Hewlett-Packard; Google later sponsored its development.

It is regarded as one of the most accurate open-source OCR engines available, and its latest stable version supports up to 116 languages.

Pros and cons:

Tesseract has no graphical user interface of its own, so a separate front end is needed, yet:

  • It has an advanced image processing pipeline, and its neural-network (LSTM) recognizer can be trained on new data.
  • It supports a variety of image formats such as PNG, JPEG, and TIFF.
  • Its output formats include plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV, and ALTO.
  • It can be trained to recognize additional languages.
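
Of the output formats listed above, TSV is the easiest to post-process, because each recognized word arrives on its own row with a confidence score. The sketch below assumes the TSV text came from a command such as `tesseract page.png page tsv` and uses Tesseract's standard column names (`conf`, `text`). The function name, the threshold, and the sample rows are made up for illustration.

```python
import csv
import io


def words_from_tsv(tsv_text, min_conf=60.0):
    """Return (text, confidence) pairs for words at or above min_conf."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row.get("conf") or -1)
        # Rows describing pages, blocks, and lines carry conf -1 and no text.
        if text and conf >= min_conf:
            words.append((text, conf))
    return words


# Made-up sample rows in Tesseract's TSV layout.
SAMPLE = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t36\t92\t60\t20\t96.1\tInvoice\n"
    "5\t1\t1\t1\t1\t2\t104\t92\t40\t20\t41.5\tN0.\n"
    "5\t1\t1\t1\t1\t3\t150\t92\t50\t20\t91.0\t1042\n"
)

print(words_from_tsv(SAMPLE))  # [('Invoice', 96.1), ('1042', 91.0)]
```

Low-confidence words such as the misread "N0." can then be routed to a human reviewer instead of being entered automatically.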

4- A9T9

A9T9 is a straightforward free and open-source Windows application for optical character recognition, built on the Microsoft OCR library.

Pros and cons:

A9T9 can only process printed documents, a limitation it shares with many OCR tools, but:

  • It is easy to install and use.
  • It is completely free of adware and spyware, and its source code is available for customization and further development.
  • It can also read and OCR PDF files.
  • It has a modern graphical user interface (GUI) front end for the Microsoft OCR library.

5- Kraken

Kraken is a complete OCR system designed for historical and non-Latin script documents.

It was created as a fork of OCRopus, primarily to fix a number of its issues while keeping its other functionality intact.

Its recognition models are neural networks trained on sample data (using the CLSTM library in early releases), and it requires some external libraries to run.

Pros and cons:

Kraken runs only on Linux and macOS; Windows is not supported. However:

  • Its layout analysis and character recognition are fully trainable.
  • It supports multi-script recognition.
  • It supports compact model files.
  • It outputs character cuts and word bounding boxes.

6- EasyOCR

EasyOCR is a Python package that enables computer vision developers to perform optical character recognition with ease.

It is one of the easiest ways to implement optical character recognition.

Pros and cons:

It is a lightweight model that produces accurate results on organized text such as PDF files, receipts, and bills.

EasyOCR is reported to be around 95% accurate, so some errors remain, but:

  • It can be installed with a single pip command.
  • It supports over 80 different languages.
  • It covers popular writing scripts such as Latin, Chinese, Arabic, Devanagari, and Cyrillic.
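
To give an idea of how little code EasyOCR needs, here is a minimal sketch. It assumes the package was installed with `pip install easyocr` and that `readtext` returns a list of (bounding box, text, confidence) entries, with each box given as four [x, y] corners starting at the top-left. The helper names, the threshold, and the file name in the usage note are hypothetical.

```python
def read_image(image_path, languages=("en",)):
    """Run EasyOCR on one image (requires `pip install easyocr`)."""
    import easyocr  # imported lazily so the helper below works without it

    reader = easyocr.Reader(list(languages))
    return reader.readtext(image_path)


def confident_lines(results, min_conf=0.5):
    """Keep detections at or above min_conf, ordered top to bottom, then left to right."""
    kept = [entry for entry in results if entry[2] >= min_conf]
    # Assumption: the first corner of each box is its top-left [x, y] point.
    kept.sort(key=lambda entry: (entry[0][0][1], entry[0][0][0]))
    return [text for _box, text, _conf in kept]
```

A typical call would be `confident_lines(read_image("receipt.jpg"))`. The first run downloads the recognition models, so it takes noticeably longer than later runs.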

7- PaddleOCR

PaddleOCR is a multilingual practical OCR tool that allows users to apply and train different models in a few lines of code.

PaddleOCR provides a set of high-quality pre-trained models to make OCR as accurate and close to commercial products as possible.

Pros and cons:

Some users have reported that PaddleOCR has serious problems detecting spaces, even after adjusting its parameters, but:

  • It supports over 80 different languages.
  • It provides a variety of models, including the flagship PP-OCR and cutting-edge algorithms such as SRN, NRTR, and others.
  • It provides various models based on size.

CONCLUSION

OCR software still has limitations: tools are not always 100 percent accurate and may not recognize every letter or number in a document. Language support is another limitation, as there is no guarantee that all of your organization’s documents will be in English.
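
One common way to put a number on accuracy is the character error rate (CER): the edit distance between the OCR output and a hand-checked reference, divided by the length of the reference. The sketch below is a plain-Python illustration of that idea, not the metric any particular tool reports. The function names and the example strings are made up.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the number of characters in the reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# "I" misread as "l" and "0" misread as "O": 2 errors in 12 characters.
print(round(character_error_rate("Invoice 1042", "lnvoice 1O42"), 3))  # 0.167
```

A tool described as "95% accurate" at the character level would correspond to a CER of about 0.05, which still means roughly one wrong character in every twenty.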

However, such challenges can be overcome by preprocessing images, for example sharpening and smoothing them, before running them through an engine like Tesseract to improve accuracy, or by using multilingual tools like UBIAI, which supports annotation in over 20 languages.
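
The smoothing and sharpening mentioned above can be sketched in a few lines. This is an illustration in plain Python on a grayscale image stored as a list of rows of 0-255 values. A real pipeline would more likely use an image library, and the 3x3 mean filter and unsharp mask shown here are just one simple choice of technique. The function names are hypothetical.

```python
def box_blur(image):
    """Smooth a grayscale image (list of rows of 0-255 values) with a 3x3 mean filter."""
    height, width = len(image), len(image[0])
    blurred = []
    for y in range(height):
        row = []
        for x in range(width):
            # Average over the neighbours that fall inside the image.
            values = [image[ny][nx]
                      for ny in range(max(0, y - 1), min(height, y + 2))
                      for nx in range(max(0, x - 1), min(width, x + 2))]
            row.append(sum(values) / len(values))
        blurred.append(row)
    return blurred


def unsharp_mask(image, amount=1.0):
    """Sharpen by adding back the difference between the image and its blur."""
    blurred = box_blur(image)
    return [[min(255, max(0, round(pixel + amount * (pixel - soft))))
             for pixel, soft in zip(row, soft_row)]
            for row, soft_row in zip(image, blurred)]
```

Smoothing suppresses scanner noise, while the unsharp mask increases contrast at character edges. Flat regions are left unchanged by the sharpening step, because there the image and its blur are identical.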
