Improving OCR Recognition Through Human Review Via Amazon Augmented AI

Create a pipeline that extracts data from PDF documents via OCR and human review in 13 minutes

Hari Devanathan
CodeX


Photo by Madhu Shesharam on Unsplash

Optical character recognition (OCR) tools such as Amazon Textract do a fantastic job of extracting text from various file types: PDFs, PNGs, JPEGs. While these tools collect clean data (which in turn helps build better models), they are far from perfect. They still struggle with human signatures, as each person has a distinct cursive style. Furthermore, document and image scanners can sometimes produce unclear images that prevent Textract from picking up the correct characters. I’ve seen OCR tools mistake periods for commas, capital I for lowercase l, capital C for capital S, and the letter O for the number 0. Sometimes, the OCR tool would read an empty space instead of a character because a water or coffee spill covered it.
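As a rough sketch of how misreads like these might be surfaced for a human reviewer, the snippet below scans a Textract-style response for low-confidence words. Textract does return a `Confidence` score (0 to 100) on each block, but the function name `words_needing_review`, the 90.0 threshold, and the list of confusable characters are assumptions made for illustration, not part of any AWS API.

```python
from typing import Any

# Characters the article lists as commonly confused by OCR (illustrative only).
CONFUSABLE = set(".,IlCSO0")


def words_needing_review(
    response: dict[str, Any], min_confidence: float = 90.0
) -> list[dict[str, Any]]:
    """Return WORD blocks whose confidence falls below the threshold.

    `response` is expected to look like a Textract response: a dict with a
    "Blocks" list whose items carry "BlockType", "Text", and "Confidence".
    The 90.0 default is an assumed cutoff, not a recommended value.
    """
    flagged = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "WORD":
            continue  # skip PAGE, LINE, and other block types
        text = block.get("Text", "")
        confidence = block.get("Confidence", 0.0)
        if confidence < min_confidence:
            flagged.append(
                {
                    "text": text,
                    "confidence": confidence,
                    # which easily confused characters appear in this word
                    "confusable": sorted(set(text) & CONFUSABLE),
                }
            )
    return flagged


# Hand-written sample shaped like a Textract response (not real output).
sample = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "WORD", "Text": "Invoice", "Confidence": 99.1},
        {"BlockType": "WORD", "Text": "1O0", "Confidence": 62.5},
    ]
}
flagged = words_needing_review(sample)
```

Here only the second word falls below the assumed threshold, so it is the one a reviewer would be asked to check. In a real pipeline the cutoff would need tuning against your own documents.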

In a perfect world, every PDF/image form would have the same format (form text, table text, headers), every scanner would be clean enough to upload clear images, every PDF/image form would have no spills or stains, every human would have the same signature format, and every text would be in the same font. Considering that unicorns don’t exist, it’s…


Data Swiss Army Knife. AWS Certified in Machine Learning.