A computer monitor with an eye in a browser.
OCR, a subset of computer vision. Source: Pixabay

CompareTheOCR.com*

*domain does not actually exist, it’s a tool (a bit like the author)

Marc Templeton
Kainos Applied Innovation
13 min read · Oct 31, 2019


OCR has many use-cases across a whole host of sectors, from healthcare to finance, and its applicability is only growing as access to these services becomes more affordable and more widely available. Many of the ‘tech-giants’ such as Microsoft, Google and Amazon provide access to their OCR services and in this blog, I dive head-first into these to answer one simple thing: are they any good?

The tool in this blog is open-source and available on Kainos’ GitHub.

Index

Challenges

Tool

Results

All Results

Conclusions

Further Considerations

Closing Remarks

Challenges

[Index]

OCR’s huuuuugeee range of uses across a plethora of business areas means that it comes with one colossal problem: 1 word, 7 letters… ‘variety’.

“What variety are you talking about?” I’m glad you w̶e̶r̶e̶ ̶f̶o̶r̶c̶i̶b̶l̶y̶ ̶m̶a̶d̶e̶ ̶t̶o̶ ̶a̶s̶k̶ asked. I’m talking about the variety in text, ranging from text stylistics to structure.

A collection of the letter, “L” in a wide variety of fonts.
Font variations of the letter, “L”. Source: Chars74k dataset

Text stylistics modify a text’s visual appearance through size, colour, font and shape. This can be taken one step further with basic augmentation such as italics, bold, underline, strikethrough (not forgetting double strikethrough) or super- and subscripting, all of which can be done using simple Microsoft Word! Throw some creative and artistic minds into the mix and simple text can be misshaped in every conceivable way. Textures and shadows can provide a sense of depth and feeling to the text, further distorting it from its original style.

Extracting raw text is fine in most cases; however, sometimes the relationship and structure of the text must be preserved to provide context and meaning. Layouts such as the columnar nature of published papers and tables require text to be extracted in blocks of related data items. This requires that the OCR software first identifies this structured nature and then has the means to extract the relevant text and present it appropriately. Invoices and forms provide more complex examples of text relationships.

Throughout all of these text sources, the use of symbols and icons poses a further problem of misinterpretation: they can be incorrectly identified as alphanumeric characters.

The problems defined above only consider digital text. Throw in handwriting, which is regarded by many to be unique to each writer (an idea supported by scientific studies, such as one conducted by the State University of New York, Buffalo, which tested 1,500 samples), and it clearly adds an obsceeeeene amount of variety.

A collection of logos which distort text by adding texture, depth and hidden meanings
Text with texture, depth and hidden meanings. Source: 1stWebDesigner

The method of attaining these pieces of text adds problems too. Scanning in large documents with creases or stains reduces the quality of the individual lettering: obscuring letters, changing their appearance or covering them up entirely. Whilst photographing text is not feasible for larger text pieces, it is well suited to less dense pieces such as receipts or movie posters. However, photography is just a minefield of flash spots, blurs and crop-outs waiting to happen, all of which impact the quality of the output image.

A mobile phone with a bank card being read using OCR.
OCR in-use with automatic bank card reading. Source: NeonCRM

In short, this textual diversity, the fluctuating quality of captured text excerpts and the need for structured understanding all lead to inaccurate translations. But what are the consequences of this? Well, in a nutshell, the consequence of an incorrect OCR translation depends entirely on the area in which it is applied. In banking, OCR is used with cheques: reading the recipient and receiving bank account details, the value of the cheque and verifying the signature, amongst other things. Clearly, any incorrect conversion here has a high impact on the company, as it can attempt to charge an incorrect bank account if the account number is misread, or it can halt the system entirely by attempting to transfer “1o0” (one-o-zero) rather than one hundred. OCR is also used to scan and store digital vouchers for use in-store via a scannable code. An incorrect conversion here would likely result in an invalid voucher code and the user would simply be prompted to scan it again; no biggie!

Both of these examples involve small pieces of text, but they differ in the consequences of misinterpretation; in either case, the effort required for a human to rectify an erroneous entry is minor.

Another prime example of OCR is with larger text documents: making digital backups of contracts or invoices, or restoring historical text documents. These exhibit a much higher chance of erroneous entries due to the sheer size and bulk of the text being read, which also significantly increases the effort required to address them.

So, by now you must be thinking,

“How can one model deal with such a wide variety of texts?”

Well, the answer is, “It can’t”.

Computer models, by nature, focus rigorously on one specific task and these text detection models are no different. In fact, to deal with the varied text pieces, suppliers of OCR services often use 2 models: one designed to handle larger blocks of text, which we’ll call ‘document’ models, and the other for use with natural scenes, which we’ll call ‘image’ models; the No Free Lunch Theorem applied to this area.

Tool

[Index]

OCR is a fantastic piece of technology on paper (pun definitely intended) with many benefits. The automation of data entry, and the subsequent increase in productivity and reduction in personnel overhead, reduces operating costs. Digitising documents allows backups to be made easily, with significantly less storage space required and heightened cyber-security applied. Making them digital also allows them to be edited and searched. By far the key benefit of OCR, though, is the increased accuracy of copying documents from hard-copies to digital-copies compared with manually transcribing them. It is this final point that motivated the development of this tool.

The fact that there isn’t a single OCR model that can transcribe all forms of text excerpts with the same level of accuracy means that it is imperative to know which model to use in certain situations to guarantee the greatest level of accuracy.

This tool provides that understanding of model performance in an automated manner, through an easy-to-interpret, comparable metric.

In addition to being fully automated, the service provides batch processing for multiple images and transcriptions, as well as the option to specify which media type’s models are used.

Services Supported

The services currently supported come from Microsoft, Google and Amazon; ¾ of MAGA. These were selected due to their dominance in the tech industry, making it likely that any user of OCR utilises one (if not more) of these service providers. They are also cost-effective and come with SDKs or API calls which do most of the heavy lifting.

The 4 companies that make up MAGA’s logos (Microsoft, Apple, Google and Amazon).
Microsoft, Apple, Google and Amazon (MAGA) companies. Source: Financial Times

Microsoft makes use of an API call to its Computer Vision gateway service; for documents, the Read API is used whilst its OCR API is used for natural scene images. These models can read documents and images with up to a 40° rotation applied; the OCR model supports 25 languages, which it auto-detects. It costs $0.0015 per transcription.
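
As a rough sketch of what a call to the synchronous OCR endpoint looks like (the endpoint version, environment variable names and response handling here are assumptions for illustration; the Read API for documents follows a similar but asynchronous submit-then-poll pattern):

```python
import os
import requests

# Illustrative environment variable names; not taken from the tool itself.
ENDPOINT = os.environ["AZURE_VISION_ENDPOINT"]  # e.g. https://<region>.api.cognitive.microsoft.com
KEY = os.environ["AZURE_VISION_KEY"]

def microsoft_image_ocr(image_path):
    """Send one image to the synchronous OCR endpoint and join the detected words."""
    with open(image_path, "rb") as f:
        image_bytes = f.read()

    response = requests.post(
        f"{ENDPOINT}/vision/v2.0/ocr",
        params={"language": "unk", "detectOrientation": "true"},
        headers={
            "Ocp-Apim-Subscription-Key": KEY,
            "Content-Type": "application/octet-stream",
        },
        data=image_bytes,
    )
    response.raise_for_status()
    analysis = response.json()

    # The OCR response nests text as regions -> lines -> words.
    lines = []
    for region in analysis.get("regions", []):
        for line in region.get("lines", []):
            lines.append(" ".join(word["text"] for word in line.get("words", [])))
    return "\n".join(lines)
```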

Google makes use of its Cloud Vision SDK, which offers 2 annotation features for deriving text from an image: one for dense document text and one for sparse text in natural scenes. In addition to supporting many languages, Google’s Vision can detect multiple languages within the same image. It costs $0.0015 per transcription.
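
A minimal sketch of those two annotation features using the google-cloud-vision client library (recent releases expose vision.Image directly; older ones used vision.types.Image, and the tool’s own wrapper may differ):

```python
from google.cloud import vision

# Authentication comes from GOOGLE_APPLICATION_CREDENTIALS pointing at a service-account key.
client = vision.ImageAnnotatorClient()

def google_ocr(image_path, document=False):
    """Run DOCUMENT_TEXT_DETECTION for dense text or TEXT_DETECTION for natural scenes."""
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())

    if document:
        response = client.document_text_detection(image=image)
        return response.full_text_annotation.text

    response = client.text_detection(image=image)
    annotations = response.text_annotations
    # The first annotation contains the full detected text; the rest are per-word boxes.
    return annotations[0].description if annotations else ""
```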

Amazon makes use of the Boto3 SDK, which provides access to Textract for document text and Rekognition for live scene images. Textract costs $0.0015 per transcription whilst Rekognition costs $0.0001 per transcription.
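
Both Amazon services can be called through boto3 with the image passed as raw bytes; a hedged sketch (region and credentials are assumed to come from the standard AWS configuration, and the tool’s own calls may be structured differently):

```python
import boto3

textract = boto3.client("textract")
rekognition = boto3.client("rekognition")

def amazon_document_ocr(image_path):
    """Textract: returns LINE blocks for document-style text."""
    with open(image_path, "rb") as f:
        result = textract.detect_document_text(Document={"Bytes": f.read()})
    return "\n".join(
        block["Text"] for block in result["Blocks"] if block["BlockType"] == "LINE"
    )

def amazon_image_ocr(image_path):
    """Rekognition: returns LINE detections for text in natural scenes."""
    with open(image_path, "rb") as f:
        result = rekognition.detect_text(Image={"Bytes": f.read()})
    return "\n".join(
        det["DetectedText"] for det in result["TextDetections"] if det["Type"] == "LINE"
    )
```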

Evaluation Technique

Evaluating the performance of translations is key to determining the accuracy of each service. There are many metrics available for measuring OCR performance, but the metric used in this tool is a hybrid: CharacTER utilises differences at both character and word level. The calculation also recognises words derived from the same stem as equivalent, e.g. “codes” and “coded”, which both come from the stem word “code”.

Simplified calculation to generate CharacTER metric.
CharacTER’s calculation. Source: GitHub

Dataset

In an attempt to represent real-life uses of OCR, a number of datasets were sourced. These ranged from website icons and scanned document text with different filters applied (such as stains and crumpled lines) to natural photos of house numbers and hand-drawn alphanumeric values.

These datasets were amalgamated to produce an overall sample of 280 images to run through the tool.

Transcribe

The original transcriptions were stored in a properties CSV file against each image’s name. The textual content of the images was then copied to each service’s original transcript file. One-by-one, each image was passed to the tool, which sent it off to each OCR service’s model for each media type. The results were returned and any non-English characters were substituted with “?”, because once parsed they were displayed as their Unicode representation, e.g. “é” was encoded as “u00E9”. The results were stored in the service’s results file, whose name included the service and media type as well as a timestamp. The timestamp was added to ensure uniqueness and to allow past results to be kept.
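
A rough sketch of the kind of post-processing described above, replacing non-ASCII characters and writing results to a timestamped file (the file naming and the exact substitution rule are assumptions for illustration, not lifted from the tool):

```python
import csv
import re
import time

def clean_transcription(text):
    """Replace any character outside the basic ASCII range with '?'."""
    return re.sub(r"[^\x00-\x7F]", "?", text)

def write_results(service, media_type, results):
    """Write {image_name: transcription} rows to a uniquely named results file."""
    timestamp = time.strftime("%Y%m%d-%H%M%S")
    filename = f"{service}_{media_type}_{timestamp}.csv"
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        for image_name, transcription in results.items():
            writer.writerow([image_name, clean_transcription(transcription)])
    return filename
```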

Metric

The original transcript file and the corresponding result file were passed into the CharacTER script, which generates a score for each image translation. It then calculates the average across all images transcribed.

The CharacTER score lies between 0.0 and 1.0, with zero being a perfect translation and one being an extremely poor, or simply failed, one. We can therefore think of it as an error rate for the transcribed text.
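
To make the “error rate” framing concrete, here is a simplified character-error-rate sketch: plain Levenshtein edit distance normalised by the reference length. The real CharacTER metric additionally performs word-level shift edits and stem matching, so treat this purely as an illustration of how per-image scores and their average behave:

```python
def edit_distance(a, b):
    """Standard Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def char_error_rate(reference, hypothesis):
    """0.0 is a perfect match; scores are capped at 1.0, like a failed transcription."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return min(1.0, edit_distance(reference, hypothesis) / len(reference))

# Example pairs borrowed from the results discussed later (original, transcribed).
pairs = [("RENAULT", "RENAIULT"), ("296", "D D")]
scores = [char_error_rate(ref, hyp) for ref, hyp in pairs]
average_error = sum(scores) / len(scores)
```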

Results

[Index]

The results below are split amongst the 3 services used and for both of their ‘document’ and ‘image’ model types.

The description of each transcription category is as follows (a rough sketch of how a score might be mapped onto these categories follows the list):

PERFECT TRANSCRIPTION — exact transcription match for an image with text

FAILED TRANSCRIPTION — service failing to detect text in an image with text

SUCCESSFUL NULL TRANSCRIPTION — exact transcription match for an image with no text

80% SUCCESSFUL TRANSCRIPTION — transcription that had a CharacTER score of 0.2 or below

20% SUCCESSFUL TRANSCRIPTION — transcription that had a CharacTER score of 0.8 or above

0% SUCCESSFUL TRANSCRIPTION — exact transcription mismatch for an image with text
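
As a rough illustration only (my reading of the categories above, not the tool’s actual logic), a per-image score plus a pair of “has text” flags could be bucketed like this:

```python
def categorise(score, image_has_text, text_detected):
    """Map one transcription outcome onto the categories above (illustrative thresholds)."""
    if not image_has_text:
        return "SUCCESSFUL NULL TRANSCRIPTION" if not text_detected else None
    if not text_detected:
        return "FAILED TRANSCRIPTION"
    if score == 0.0:
        return "PERFECT TRANSCRIPTION"
    if score >= 1.0:
        return "0% SUCCESSFUL TRANSCRIPTION"  # nothing in common with the reference
    if score <= 0.2:
        return "80% SUCCESSFUL TRANSCRIPTION"
    if score >= 0.8:
        return "20% SUCCESSFUL TRANSCRIPTION"
    return None  # falls between the reported thresholds
```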

The thumbnails present in the tables below were hand-picked to provide an overview of the model’s performance. The full results follow.

Amazon Document

Examples of perfect, failed and successfully null transcriptions for Amazon Textract.
Amazon document results
Examples of threshold transcriptions for Amazon Textract.
Amazon document threshold results

Amazon Image

Examples of perfect, failed and successfully null transcriptions for Amazon Rekognition.
Amazon image results
Examples of threshold transcriptions for Amazon Rekognition.
Amazon image threshold results

Google Document

Examples of perfect, failed and successfully null transcriptions for Google Cloud Vision.
Google document results
Examples of threshold transcriptions for Google Cloud Vision.
Google document threshold results

Google Image

Examples of perfect, failed and successfully null transcriptions for Google Cloud Vision.
Google image results
Examples of threshold transcriptions for Google Cloud Vision.
Google image threshold results

Microsoft Document

Examples of perfect, failed and successfully null transcriptions for Microsoft Vision.
Microsoft document results
Examples of threshold transcriptions for Microsoft Vision.
Microsoft document threshold results

Microsoft Image

Examples of perfect, failed and successfully null transcriptions for Microsoft Vision.
Microsoft image results
Examples of threshold transcriptions for Microsoft Vision.
Microsoft image threshold results

All Results

[Index]

Full-set of results

Conclusions

[Index]

Amazon document produced the worst CharacTER score across the 280 images fed into it, returning an 85.7% error rate. This was primarily down to its inability to accurately transcribe images with text, funnily enough. It only correctly predicted 9 images, all of which were null. When it did detect text, the result was invariably complete and utter nonsense, e.g. the house number “296” came back as “D D”. Not even close! In its defence, none of the datasets used contained invoices, forms or the other semi-structured documents this model specialises in.

Amazon image fared much better than its document counterpart (although that wasn’t difficult), producing accurate results for many of the house numbers and single text fonts. It struggled with the longer text passages, often transcribing much less text than was there; however, for some of the smaller text pieces it detected excess text, e.g. the car make “RENAULT” was returned as “RENAIULT”. Although it dealt with text fonts quite well, it was hit and miss when it came to handwritten pieces, often not identifying any text at all. Blur and noise were also difficult for this model to see through.

Google document produced good results for longer text excerpts; however, it struggled with single characters and house numbers, especially the former, often not detecting any text at all.

Google image produced the best average CharacTER score, with a 42.9% error rate. It produced the best all-round results across both single letters and digits and longer passages of text; although other services beat it on individual categories, Google image translated enough of each to produce moderate results across the board.

Microsoft document produced a very mixed bag. It succeeded in transcribing some of the long text passages moderately well but struggled to produce any results for those with noise applied.

Microsoft image produced similar results to Amazon document, with the service’s inability to detect text in images proving problematic; 79% of records were returned with no text and three quarters of these were incorrectly labelled as such, the most of any service. It performed better at successfully transcribing the text it did find, but takes the ‘Surprise of the Day’ prize for its inability to detect any text in some of the longer text passages, despite seeing (and doing a good job of translating) others of the same font, size and filter! It struggled with the highly stylised fonts present in some of the natural pictures and completely failed with house numbers and single letters.

The f-measure results below provide a summary of service performance. Services which could both detect text in an image and translate it with a 90%+ level of accuracy, and correctly recognise images with no text, scored higher than those which struggled in one or both areas. Google image received the highest f-measure of 0.640 whilst Amazon document scored a dismal 0.090. Each service provider’s average score is represented by the dotted line.

F-measure for services’ translations that were 90% accurate
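
The exact positive/negative definitions behind the chart aren’t spelled out here, but the f-measure itself is just the harmonic mean of precision and recall, roughly counting a 90%+ accurate transcription (or a correct null) as a hit and anything else as a miss. A generic sketch:

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Standard F1 score: the harmonic mean of precision and recall."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)
```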

All services struggled with letter cases, often flipping them. However, for many of these examples, the letter’s casing (uppercase or lowercase) can only be inferred from context. For example, lowercase “c” and uppercase “C” can only be differentiated by their size relative to the rest of the letters in a word, leaving the model with a 50:50 guess. All services unanimously agreed on the images which genuinely contained no text.

Limitations

The ground-truth text for some datasets was extracted programmatically. This occasionally made the extracted text unintuitive to read in the Western left-to-right, top-to-bottom manner; e.g. the photo below was taken from a dataset whose text content was stored in XML data tags and, when extracted, read “4B.522 4B.524 4B.526 4B528 ARE HERE YOU”. This penalises any model that returned the text in a more ‘human-read’ order, such as Google image, whose translation “4B.528 4B.526 YOU ARE HERE 4B.524 4B.522” scored a 42.5% error rate.

An image of a section of floor-plan with room numbers and “YOU ARE HERE” indicator.
A picture whose data content was extracted using a script. Source: Robust Reading Competition dataset

CharacTER’s in-built threshold reduces this penalty, but it is still an issue.

Upper and lowercase “c” and “s”.
Case variations for the letters, “C” and “S”. Source: Chars74k dataset

The datasets used to determine the original transcripts presented consistency issues for the original transcriptions’ text. The casing of the single letters described above was a problem for the services to differentiate, and extra digits appeared in some of the house number images. Some icons, such as remove (“-”) and add (“+”), were deemed to have text whilst others were not. Natural characters were sourced by cropping individual letters out of photos of posters, signs, etc. A small number of these crops left in background text which was not accounted for but was later picked up by the OCR services, crippling their scores. To rectify these issues, the dataset could be cleaned to address these discrepancies, whilst normalising the letter casing to a single case would remove those errors.

Further Considerations

[Index]

The dataset could be further expanded on to include more real-life uses. Structured text extracts such as receipts and invoices along with longer handwritten pieces could be included.

Further augmentation could be applied to the images to more accurately mimic real-life examples such as page rotation and creases.

The inclusion of more metrics would further support analysing each model for more tailored use cases, e.g. a metric that ignores letter casing would suit OCR for use in mail sorting, but would not be suitable for voucher code scanning, where casing matters.
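
For example, a case-insensitive variant of the earlier error-rate sketch would only need both strings folded before comparison (again illustrative, not part of the tool):

```python
def case_insensitive_error_rate(reference, hypothesis):
    """Re-uses char_error_rate from the earlier sketch, ignoring letter casing."""
    return char_error_rate(reference.casefold(), hypothesis.casefold())
```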

Closing Remarks

[Index]

The use of OCR in real-life applications, such as licence plate recognition or bank cheque extraction, is extreeeemely successful because the computational models used in these scenarios are so well suited to their specific tasks. These general-purpose services are never going to be as good as a purpose-built model; horses for courses. That being said, they are more than suitable for small-scale automated data entry uses that can cope with a few hits and misses here and there.

This tool is open-source and available on Kainos’ GitHub.

If you’re interested in this sort of technology — or any others for that matter — check out what the Applied Innovation team have been working on here!
