Create an unstoppable text recognition machine with TesseractJS

Alexey Sutyagin
Published in Geek Culture
4 min read · Jan 20, 2023

JavaScript projects sometimes need to recognize text in images. For doing this on the server there is an excellent library, TesseractJS. Let’s implement batch image recognition and look at examples of working with the library.

Preparations

In a new folder, initialize an npm project and install TesseractJS.

npm init -y && npm install tesseract.js

Put the images you want to recognize into a folder named images_to_text. You can check the list of supported formats at https://github.com/naptha/tesseract.js/blob/master/docs/image-format.md.
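Before handing the folder to the recognizer, it can help to skip files that are not images at all, such as .DS_Store or stray text files. Here is a minimal sketch. The helper name filterSupportedImages and the extension list are assumptions made for illustration, so treat the linked format page as the authoritative list.

```javascript
// Hypothetical helper: keep only file names whose extension looks like an image.
// The extension list is an assumption; consult the TesseractJS image-format page.
const SUPPORTED_EXTENSIONS = ['.bmp', '.jpg', '.jpeg', '.png', '.pbm', '.webp'];

function filterSupportedImages(fileNames) {
  return fileNames.filter((name) => {
    const dot = name.lastIndexOf('.');
    if (dot <= 0) return false; // no extension, or a dotfile such as ".DS_Store"
    return SUPPORTED_EXTENSIONS.includes(name.slice(dot).toLowerCase());
  });
}

console.log(filterSupportedImages(['scan1.JPG', 'notes.txt', '.DS_Store', 'page.png']));
```

The comparison is case-insensitive, so files named by cameras or scanners (for example SCAN1.JPG) are kept.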

Now we can start writing the code.

Load desired languages

To recognize images, we need workers to do the job. Before the first run, TesseractJS tries to download the trained data for the requested language, so recognition will not work out of the box without an internet connection. If you need it to work offline, you can follow the example at https://github.com/jeromewu/tesseract.js-offline.

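Recognizing one image

The simplest code that recognizes one picture with English text, taken from the library’s examples, looks like this. It uses the loadLanguage and initialize calls from the library version that was current when this article was written; later major versions changed how workers are set up, so check the README of the version you install.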

import { createWorker } from 'tesseract.js';

// createWorker returns a promise; the logger prints progress messages
const worker = await createWorker({
  logger: m => console.log(m)
});
(async () => {
  await worker.loadLanguage('eng');
  await worker.initialize('eng');
  const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
  console.log(text);
  await worker.terminate();
})();

A list of languages that TesseractJS can work with can be found at https://tesseract-ocr.github.io/tessdoc/Data-Files#data-files-for-version-400-november-29-2016.
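If the images mix several languages, codes from that list can be combined. This is a small sketch, assuming the worker exposes loadLanguage and initialize as in the example above and that both accept codes joined with a plus sign, as the TesseractJS API documentation describes. The name prepareWorker is hypothetical, and the fake worker exists only so the snippet runs without downloading anything.

```javascript
// Hypothetical helper: load and initialize one worker for several languages.
// Assumes a worker with loadLanguage() and initialize(), as in the example above.
async function prepareWorker(worker, languageCodes) {
  const langs = languageCodes.join('+'); // e.g. ['eng', 'rus'] -> 'eng+rus'
  await worker.loadLanguage(langs);
  await worker.initialize(langs);
  return worker;
}

// Stand-in worker that only records the calls, so the sketch runs offline.
const recordedCalls = [];
const fakeLanguageWorker = {
  loadLanguage: async (langs) => { recordedCalls.push(['loadLanguage', langs]); },
  initialize: async (langs) => { recordedCalls.push(['initialize', langs]); },
};

prepareWorker(fakeLanguageWorker, ['eng', 'rus']).then(() => console.log(recordedCalls));
```

With a real worker, pass the object returned by createWorker instead of the stand-in.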

Recognizing several pictures

To recognize multiple images, we can use a loop and an array of links or paths to the pictures.

const filenamesArray = ["../images_to_text/image1.jpg", "../images_to_text/image2.jpg", "../images_to_text/image3.jpg"];
for (const filename of filenamesArray) {
  // recognize() resolves to a result object; the recognized text is in data.text
  const { data: { text } } = await worker.recognize(filename);
  console.log(text);
}
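In a longer batch, one unreadable file should not stop the whole loop. The loop above can be wrapped in a function that records either the text or the error for each file. This is only a sketch: recognizeAll is a hypothetical name, and the stand-in worker imitates the shape of the result returned by recognize so the snippet runs without TesseractJS installed.

```javascript
// Hypothetical helper: recognize files one by one and keep going when one fails.
// `worker` is assumed to be an initialized TesseractJS worker.
async function recognizeAll(worker, filenames) {
  const results = [];
  for (const filename of filenames) {
    try {
      const { data: { text } } = await worker.recognize(filename);
      results.push({ filename, text });
    } catch (error) {
      results.push({ filename, error: String(error) });
    }
  }
  return results;
}

// Stand-in worker: fails on one file and returns fake text for the rest.
const fakeRecognizeWorker = {
  recognize: async (filename) => {
    if (filename.endsWith('broken.jpg')) throw new Error('cannot read image');
    return { data: { text: `text of ${filename}` } };
  },
};

recognizeAll(fakeRecognizeWorker, ['image1.jpg', 'broken.jpg'])
  .then((results) => console.log(results));
```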

Scheduler + concurrency

Let’s assume that we have many images and want to speed up recognition by using multiple workers. For that, TesseractJS has a built-in scheduler: it lets us create several workers and distributes jobs between them. Let’s implement a recognizeImages function that takes all files from the specified directory and tries to recognize them.

const IMAGES_DIRECTORY = 'images_to_text';
const CONCURRENCY = 15;
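To make the idea of distributing jobs between workers concrete, here is a conceptual sketch of a shared queue drained by a fixed number of workers. It is not how TesseractJS implements its scheduler; runWithWorkers and handleJob are hypothetical names, and the job handler only pretends to do recognition.

```javascript
// Conceptual sketch of a scheduler: a shared queue drained by N worker loops.
// This is NOT the TesseractJS implementation; all names are hypothetical.
async function runWithWorkers(jobs, workerCount, handleJob) {
  const results = new Array(jobs.length);
  let next = 0; // index of the next job that nobody has taken yet

  async function workerLoop() {
    while (next < jobs.length) {
      const index = next++; // no race: this line runs synchronously
      results[index] = await handleJob(jobs[index]);
    }
  }

  // Start the loops and wait until the queue is empty.
  const loopCount = Math.min(workerCount, jobs.length);
  await Promise.all(Array.from({ length: loopCount }, () => workerLoop()));
  return results;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

runWithWorkers(['a.png', 'b.png', 'c.png'], 2, async (name) => {
  await sleep(10); // pretend to recognize the image
  return name.toUpperCase();
}).then((out) => console.log(out));
```

Each loop takes the next free job as soon as it finishes the previous one, so a slow image delays only the worker that is processing it.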

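Check for existence

So that we don’t have to redo all the work after an error, let’s save each result to a folder and check whether it already exists before recognizing an image. The function below is the scheduler-based version; the existence check itself appears in the complete listing at the end of the article.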

import { createWorker, createScheduler } from 'tesseract.js';
import { promises as fs } from 'fs';
import path from 'path';

const IMAGES_DIRECTORY = 'images_to_text';
const CONCURRENCY = 15;

async function recognizeImages() {
  const scheduler = createScheduler();

  try {
    const files = await fs.readdir(IMAGES_DIRECTORY);

    // Create the workers and register them with the scheduler.
    // As in the single-image example, createWorker is awaited and
    // resolves to a worker that is already loaded.
    for (let i = 0; i < CONCURRENCY; i++) {
      const worker = await createWorker({ cachePath: '.' });
      await worker.loadLanguage('eng');
      await worker.initialize('eng');
      scheduler.addWorker(worker);
    }

    // Queue every image; the scheduler hands each job to a free worker
    await Promise.all(
      files.map(async (imageName) => {
        const imagePath = path.join(IMAGES_DIRECTORY, imageName);

        await scheduler
          .addJob('recognize', imagePath)
          .then((result) => {
            // The recognized text is in result.data.text
            console.log(`Recognized text ${result.data.text}`);
          })
          .catch((err) => {
            console.log(
              '=======================ERROR======================='
            );
            console.log(`error parsing ${imageName}: \n\t\t${err}`);
            console.log(
              '=====================END_ERROR====================='
            );
          });
      })
    );
  } catch (err) {
    console.error(err);
  } finally {
    // Terminates the scheduler together with all of its workers
    await scheduler.terminate();
  }
}

(async () => {
  return await recognizeImages();
})();
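The check itself can be reduced to a comparison of two lists of names. The sketch below assumes that results are stored as the image name plus ".txt", which is the naming used in the full listing at the end; pendingImages is a hypothetical helper, and both lists would typically come from fs.readdir on the image folder and the results folder.

```javascript
// Hypothetical helper: decide which images still need recognition.
// Assumes results are stored as "<image name>.txt".
function pendingImages(imageNames, existingTextNames) {
  const done = new Set(existingTextNames);
  return imageNames.filter((imageName) => !done.has(imageName + '.txt'));
}

console.log(pendingImages(['a.png', 'b.png', 'c.jpg'], ['a.png.txt']));
```

Reading the results folder once and comparing names avoids a separate file system call per image, at the cost of not noticing files that appear while the batch is running.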

Progress bar

To have more fun watching the process, let’s add a progress bar that displays the current progress and the approximate time remaining in the console. It comes from the progress package (npm install progress). We create the bar before recognition starts and advance it with tick() each time a task is done.

const bar = new ProgressBar(
  ' recognizing [:bar] :rate/bps :percent :etas',
  {
    total: files.length,
    width: 200,
    complete: '#',
    incomplete: ' ',
  }
);
// same code as before
bar.tick(); // call once after each image is processed or skipped
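The remaining-time figure is a simple extrapolation from the work done so far. Here is a sketch of that arithmetic, under the assumption that every image takes roughly the same time; estimateRemainingMs is a hypothetical helper written for illustration, not a function of the progress package.

```javascript
// Hypothetical helper: estimate the remaining time from the progress so far.
function estimateRemainingMs(elapsedMs, completed, total) {
  if (completed <= 0) return Infinity; // nothing finished yet, no basis for an estimate
  if (completed >= total) return 0;
  const msPerItem = elapsedMs / completed;
  return msPerItem * (total - completed);
}

// 25 of 100 images done after 10 seconds: about 30 seconds left.
console.log(estimateRemainingMs(10000, 25, 100)); // 30000
```

With several workers running in parallel the estimate stays valid, because elapsed time and completed count already reflect the combined throughput.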

Conclusion

In this article, we got acquainted with the TesseractJS API and learned how to recognize many images efficiently with Node.js by spreading the jobs across several workers. Additionally, we learned how to add a progress bar to our console utilities, which does not speed up processing but makes it more interesting to watch.

Resources

https://tesseract.projectnaptha.com/

https://github.com/visionmedia/node-progress

Listing

import { createWorker, createScheduler } from 'tesseract.js';
import ProgressBar from 'progress';
import { promises as fs } from 'fs';
import path from 'path';

const RECOGNIZED_DIRECTORY = 'recognized_text';
const IMAGES_DIRECTORY = 'images_to_text';
const CONCURRENCY = 15;

async function recognizeImages() {
  const scheduler = createScheduler();
  try {
    // Make sure the output folder exists before writing results into it
    await fs.mkdir(RECOGNIZED_DIRECTORY, { recursive: true });
    const files = await fs.readdir(IMAGES_DIRECTORY);
    const bar = new ProgressBar(
      ' recognizing [:bar] :rate/bps :percent :etas',
      {
        total: files.length,
        width: 200,
        complete: '#',
        incomplete: ' ',
      }
    );
    // Create the workers and register them with the scheduler
    for (let i = 0; i < CONCURRENCY; i++) {
      const worker = await createWorker({});
      // 'rus' is Russian; use the language code that matches your images
      await worker.loadLanguage('rus');
      await worker.initialize('rus');
      scheduler.addWorker(worker);
    }
    await Promise.all(
      files.map(async (imageName) => {
        const imagePath = path.join(IMAGES_DIRECTORY, imageName);
        const textPath = path.join(RECOGNIZED_DIRECTORY, imageName + '.txt');
        try {
          // If the result file already exists, skip recognition
          await fs.access(textPath);
          bar.tick();
        } catch (error) {
          // No result yet: recognize the image and save the text
          await scheduler
            .addJob('recognize', imagePath)
            .then(async (result) => {
              await fs.writeFile(textPath, result.data.text);
              bar.tick();
            })
            .catch((err) => {
              bar.tick();
              console.log(
                '=======================ERROR======================='
              );
              console.log(`error parsing ${imageName}: \n\t\t${err}`);
              console.log(
                '=====================END_ERROR====================='
              );
            });
        }
      })
    );
  } catch (err) {
    console.error(err);
  } finally {
    // Terminates the scheduler together with all of its workers
    await scheduler.terminate();
  }
}

(async () => {
  return await recognizeImages();
})();
