A (fair) clarification on our OCR test for handwritten documents

Filippo S. · Published in Version 1 · Feb 3, 2023

A few months ago I wrote a blog post introducing a report comparing the leading Natural Language Processing (NLP) technologies. In the report, available at this link, I detailed the results of some tests we performed in July 2020, in which we processed the same document sets with three Proofs of Value (PoVs), employing Open Source, AWS, and Azure NLP technologies respectively. The sample sets were as diverse as possible, containing documents scanned under different lighting conditions, with or without handwriting, with single or multiple columns, and so on. The idea was to identify the pros and cons of each NLP technology, determining accuracy, limits, costs, and processing times.

The report contained useful insights; however, given the pace at which cloud service providers create new services or update existing ones, some key information in the report is already obsolete.

In particular, this applies to the Optical Character Recognition (OCR) analysis for Textract, Amazon's OCR service. At the time of our tests, AWS explicitly stated that Textract was not trained on handwritten documents, so it could not be used to extract handwriting. This all changed in November 2020, when AWS enhanced Textract with handwriting recognition for English documents.
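To illustrate, handwriting now comes back through the same text-detection call as printed text. Below is a minimal sketch using boto3; the file name is a placeholder, and we assume AWS credentials are already configured for a region where Textract is available.

```python
# A minimal sketch of reading handwriting with Textract via boto3.
# Assumptions: AWS credentials are configured; "handwritten_scan.png"
# is a placeholder file name.
import boto3

textract = boto3.client("textract")

with open("handwritten_scan.png", "rb") as f:
    response = textract.detect_document_text(Document={"Bytes": f.read()})

# Since the handwriting update, WORD blocks carry a TextType field
# set to either "PRINTED" or "HANDWRITING".
for block in response["Blocks"]:
    if block["BlockType"] == "WORD" and block.get("TextType") == "HANDWRITING":
        print(block["Text"])
```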

For fairness' sake, we have since repeated the tests with a new set of 20 handwritten documents. The test is by no means conclusive; a much bigger sample would be required for an irrefutable comparison. Still, compared to the original tests, new insights can be derived from the new test results (reported in the third line of the table below).

As in the original test, Azure Read OCR performed with higher speed and slightly higher accuracy. However, in the original test Textract's accuracy scored only 56.5%, which made it unsuitable for use cases involving handwritten documents. The new score is 86% (only 2% lower than Azure OCR), which now makes Textract the recommended OCR for customers already employing AWS technologies.

Furthermore, AWS Textract offers a very interesting feature: it can process tables and, most importantly, forms, without the need for a different service. Azure also offers a service for extracting key-value pairs from forms, Microsoft Form Recognizer, but it is indeed a separate service from Read OCR: it requires separate APIs and comes with a different price tier. Textract is much easier to use: there is no need to call separate services or to train a model (as Form Recognizer requires); simply passing an extra parameter (forms, tables, or both) returns the corresponding output, as sketched below.
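Here is a minimal sketch of that call, again with boto3 and a placeholder file name; the FeatureTypes list is what selects form extraction, table extraction, or both:

```python
# A minimal sketch of Textract's form/table analysis via boto3.
# Assumptions as before: configured AWS credentials; "scan.png" is
# a placeholder file name.
import boto3

textract = boto3.client("textract")

with open("scan.png", "rb") as f:
    response = textract.analyze_document(
        Document={"Bytes": f.read()},
        # FeatureTypes selects the extra analysis: FORMS, TABLES, or both.
        FeatureTypes=["FORMS", "TABLES"],
    )

# KEY_VALUE_SET blocks hold the form key/value pairs;
# TABLE and CELL blocks hold the table structure.
for block in response["Blocks"]:
    if block["BlockType"] == "KEY_VALUE_SET":
        print(block.get("EntityTypes"), block["Id"])
```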

Things change so quickly in the cloud space that, to have up-to-date conclusions at any given moment, tests like those described in the report should probably be repeated over and over again, almost every couple of weeks! This is obviously not feasible, so it must be understood that the information provided by reports like the one described in my original post refers to the specific point in time when the tests were run.

However, as processing handwritten documents is becoming increasingly important for our customers, we thought that the enhancements AWS made to Textract were very good news and that repeating the test on handwritten documents was a fair thing to do!

About the Author:
Filippo Sassi is Head of the Innovation Labs here at Version 1
