OCR in Mendix using Tesseract.js Part 1: Image-to-Text

Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text — whether from a scanned document, a photo of a document, or a scene photo.

Vignesh Rajan

Published in Mendix Community

4 min read · Jan 19, 2023

(Banner image: a futuristic city, seen through the window of a spaceship shaped like a tesseract.)

To better understand how OCR works, see the process diagram in the following picture. From the end-user side, the OCR process is simple: submit an image and receive editable text in return.

Tesseract.js is a JavaScript port of the well-known Tesseract OCR engine, an optical character recognition engine that runs on various operating systems. Tesseract is free software released under the Apache License, and its development has been sponsored by Google since 2006. [source]

How to use Tesseract.js in Mendix

Prerequisites:

Community Commons: https://marketplace.mendix.com/link/component/205247

Implementation:

In this blog, I’ll show you how to use Tesseract.js to add OCR to your Mendix application.

1) Use ‘SUB_GetBase64Image’, which calls a Base64 Java action, to convert the image to Base64, and pass the result to the ‘JS_ImageOCR’ JavaScript action:

2) Select Load Language and Initialize Language:

3) Result — Fetch text from an image as string type:

Image used in application

Code explanation

1) Import tesseract.js and buffer:

2) Initialize And Run Tesseract:

A Worker helps you to do the OCR-related tasks. It takes a few steps to set up a Worker before it is fully functional, though. The full flow is:

· FS functions // optional

· loadLanguage

· initialize

· setParameters // optional

· recognize or detect

· terminate

Each function is async, so using async/await or Promise is required. When it is resolved, you get a TesseractJob object.
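The flow above can be sketched roughly as follows. This is a minimal sketch, assuming the Tesseract.js API of the 2.x to 4.x releases, where loadLanguage and initialize are separate calls (later major versions changed this, and some older releases also expect a worker.load() call first). The function name recognizeImage is hypothetical, and createWorker is passed in as a parameter rather than imported so the sketch can run without the package installed.

```javascript
// Sketch of the Worker lifecycle, assuming the Tesseract.js 2.x-4.x API.
// In a real project the factory would come from the library, for example:
//   import { createWorker } from 'tesseract.js';
// Here it is injected, so the flow is readable without the dependency.
async function recognizeImage(createWorker, image, langs = 'eng') {
  // `await` works whether createWorker returns a worker or a Promise of one.
  const worker = await createWorker();
  try {
    await worker.loadLanguage(langs); // load trained data (cache or download)
    await worker.initialize(langs);   // get the Tesseract API ready for OCR
    const { data } = await worker.recognize(image);
    return data.text;                 // the recognized text as a string
  } finally {
    await worker.terminate();         // release the worker even if OCR fails
  }
}
```

Wrapping the calls in try/finally is a design choice of this sketch, not something the library requires; it simply makes sure terminate runs when an earlier step throws.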

a) Worker.loadLanguage(langs):

Worker.loadLanguage() loads trained data from the cache or downloads trained data remotely and puts trained data into the WebAssembly file system.

Arguments:

langs is a string that indicates which languages' trained data to download. Multiple languages are concatenated with +, for example: eng+chi_tra

b) Worker.initialize(langs):

Worker.initialize() initializes the Tesseract API and makes sure it is ready for doing OCR tasks.

Arguments:

langs is a string that indicates the languages to be loaded by the Tesseract API. It can be a subset of the language-trained data you loaded with Worker.loadLanguage.
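To make the langs format concrete, here is a small sketch of building the +-separated string and checking that the languages given to initialize are a subset of the ones loaded. Both helper names (joinLangs and isSubsetOfLoaded) are hypothetical and not part of Tesseract.js; they only illustrate the string convention described above.

```javascript
// Hypothetical helper: join language codes into the "+"-separated
// string format used by loadLanguage and initialize.
function joinLangs(codes) {
  return codes.join('+');
}

// Hypothetical helper: check that every language requested for
// initialize is among the languages that were loaded.
function isSubsetOfLoaded(initLangs, loadedLangs) {
  const loaded = new Set(loadedLangs.split('+'));
  return initLangs.split('+').every((code) => loaded.has(code));
}

const loadedLangs = joinLangs(['eng', 'chi_tra']);
console.log(loadedLangs);                          // eng+chi_tra
console.log(isSubsetOfLoaded('eng', loadedLangs)); // true
console.log(isSubsetOfLoaded('deu', loadedLangs)); // false
```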

c) Worker.recognize(image):

Worker.recognize() provides the core function of Tesseract.js: it executes OCR. It figures out which words are in the image, where those words are located, and so on.

Note: Images should be sufficiently high resolution. Often, the same image will get much better results if you upscale it before calling recognize.

Arguments:

Image — see Image Format for more details.
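Since the Mendix flow hands the image over as a Base64 string, it may help to see how such a string can be turned into an input that recognize accepts. The sketch below shows two options under stated assumptions: a data URL (the Tesseract.js image format notes describe Base64 images in the data:image/...;base64,... form) and a Buffer (documented for Node.js; in the browser, Buffer has to come from the buffer package). The helper names toDataUrl and toBuffer are hypothetical, and the default MIME type is an assumption.

```javascript
// Hypothetical helper: wrap a raw Base64 string in a data URL.
// The MIME type default is an assumption; pass the real one if known.
function toDataUrl(base64, mimeType = 'image/png') {
  // Some encoders insert line breaks; remove any whitespace first.
  const clean = base64.replace(/\s+/g, '');
  return `data:${mimeType};base64,${clean}`;
}

// Hypothetical helper: decode the Base64 string into raw bytes.
// Buffer is a global in Node.js; in the browser it would need to be
// imported from the 'buffer' package.
function toBuffer(base64) {
  return Buffer.from(base64, 'base64');
}
```

Either result could then be passed as the image argument, for example worker.recognize(toDataUrl(base64String)), where base64String stands for the value received from the microflow.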

Examples:

https://github.com/naptha/tesseract.js/blob/master/docs/examples.md

Supported File Types

These are the image types that the Tesseract engine can read:

1. JPG

2. PNG

3. PNM

4. TIFF
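Before sending a file to OCR, a quick check against the list above can catch unsupported uploads early. The following is only a sketch: the helper name hasSupportedExtension is hypothetical, the extension set is my mapping of the four types listed here (PNM covers the .pnm, .pbm, .pgm, and .ppm variants), and a file extension says nothing about the actual file contents. The set that Tesseract.js itself accepts may differ.

```javascript
// Extensions for the four image types listed above (an assumption of
// this sketch, not a list taken from the library).
const SUPPORTED_EXTENSIONS = new Set([
  'jpg', 'jpeg',
  'png',
  'pnm', 'pbm', 'pgm', 'ppm',
  'tif', 'tiff',
]);

// Hypothetical helper: true if the file name ends in a listed extension.
function hasSupportedExtension(fileName) {
  const dot = fileName.lastIndexOf('.');
  if (dot < 0) return false; // no extension at all
  const extension = fileName.slice(dot + 1).toLowerCase();
  return SUPPORTED_EXTENSIONS.has(extension);
}

console.log(hasSupportedExtension('invoice.PNG')); // true
console.log(hasSupportedExtension('notes.pdf'));   // false
```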

Features:

• It supports multiple languages; the Tesseract documentation has the complete list of supported languages.

• Accuracy is high with standard fonts and a clean background.

Limitations:

• Accuracy will be low with noisy backgrounds and with custom or script-style fonts.

• Tesseract doesn’t support all file formats by itself.

• The image must reach a minimum resolution, measured in dots per inch (DPI), for recognition to work well.
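As a worked example of the DPI limitation, the sketch below estimates the DPI of a scan and the factor by which it would need to be upscaled. The helper names are hypothetical, and the 300 DPI default is an assumption based on a commonly cited guideline for Tesseract, not a hard threshold stated in this article.

```javascript
// Hypothetical helper: estimate DPI from the image width in pixels and
// the physical width of the original document in inches.
function estimateDpi(pixelWidth, physicalWidthInches) {
  return pixelWidth / physicalWidthInches;
}

// Hypothetical helper: factor to upscale by to reach the target DPI.
// Never returns less than 1, so images are not downscaled.
function upscaleFactor(currentDpi, targetDpi = 300) {
  return Math.max(1, targetDpi / currentDpi);
}

// A letter-size page (8.5 inches wide) scanned at 1275 pixels wide.
const dpi = estimateDpi(1275, 8.5);
console.log(dpi);                // 150
console.log(upscaleFactor(dpi)); // 2
```

In this example the scan would need to be doubled in size before calling recognize; the resizing itself (for instance with a canvas) is outside the scope of this sketch.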

Conclusion:

After having fun with Tesseract OCR, I can say that the engine is amazing!

It brings the power of OCR to the browser and opens a door of opportunities for developers. Here is a list of the most interesting points on Tesseract in my opinion:

1. It's open source.

2. Very easy to use.

3. Good extraction results.

4. Supports multiple languages.

If you are facing a problem that OCR could solve, Tesseract is a great option! I hope this article is useful for you. Thank you!!

Read more

GitHub - tesseract-ocr/tesseract: Tesseract Open Source OCR Engine (main repository), github.com

This package contains an OCR engine - libtesseract and a command line program - tesseract. Tesseract 4 adds a new…

Tesseract documentation, tesseract-ocr.github.io

Tesseract User Manual, tesseract-ocr.github.io

This user manual is for Tesseract versions 5.x. For versions 4.x.x, 3.05.02 and older, see the documentation for old…

