Virtual File Cabinet — Part 2

Optical Character Recognition (OCR) with Node.js

Shawn Grover
The Startup


Photo by Amador Loureiro on Unsplash

Before we can classify our scanned documents, we must extract the text from them.

Update: Part 3 has been posted and ties it all together with a functioning application that uses Natural Language Processing to classify the documents and file them into the file cabinet automatically.

In part one of our small business office automation series, we created a virtual file cabinet and began populating it with scanned documents. We wrote code to download the scanned documents sent to an IMAP-based email account and copy those attachments into an _inbox directory for later review and classification. As far as our system knows, though, these are plain image files; it has no context for what each image represents. We cannot trust the filename either, because it is often generated by the scanning system and amounts to little more than “scanned document”. We need a method to generate some context for each file, so we will explore using OCR (Optical Character Recognition).
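To make the goal concrete, here is a minimal sketch of the step we are after: walk the _inbox directory, pick out the image files, and pair each one with whatever text an OCR engine returns for it. The function name extractInboxText, the list of image extensions, and the recognize callback are all assumptions made for illustration. They are not code from part one, and recognize stands in for whichever OCR library ends up doing the real work.

```javascript
const fs = require('fs/promises');
const path = require('path');

// Assumption: the extensions we treat as scanned images. Adjust to your scanner.
const IMAGE_EXTENSIONS = new Set(['.png', '.jpg', '.jpeg', '.tif', '.tiff', '.bmp']);

/**
 * Collect the text for every image file directly inside `inboxDir`.
 *
 * `recognize` is a hypothetical callback: it receives a file path and
 * returns a Promise that resolves to the extracted text. Any OCR engine
 * can be wrapped to fit that shape.
 */
async function extractInboxText(inboxDir, recognize) {
  const entries = await fs.readdir(inboxDir, { withFileTypes: true });
  const results = [];

  for (const entry of entries) {
    const ext = path.extname(entry.name).toLowerCase();
    // Skip sub-directories and anything that is not an image.
    if (!entry.isFile() || !IMAGE_EXTENSIONS.has(ext)) continue;

    const filePath = path.join(inboxDir, entry.name);
    const text = await recognize(filePath);
    results.push({ file: entry.name, text: text.trim() });
  }

  // readdir order is not guaranteed, so sort for predictable output.
  results.sort((a, b) => a.file.localeCompare(b.file));
  return results;
}
```

Passing the OCR step in as a callback keeps the directory handling separate from the choice of OCR library, so the engine can be swapped or stubbed out without touching this function. Files are processed one at a time here, which is slower than running them in parallel but keeps memory use predictable.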

We will use Tesseract.js to extract the text from our scanned documents. There are other options, but Tesseract comes up often when you Google “node js extract text from image”. We’ll explore it here and assess how effective it is after we have used it for a while. Tesseract.js…
