Google Vision OCR Detailed Analysis

Mehul Gupta
Data Science in your pocket
4 min read · Nov 21, 2019


Again, this is something I have been closely involved with at 1mg.

The Data Science team has been working on Prescription Digitization using Google Vision OCR for both Handwritten & Printed Rx.

Whenever a user uploads an Rx, it is fed to the GVision API, which returns a detailed OCR result.
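A minimal sketch of such a call, assuming the google-cloud-vision Python client with credentials already configured (the file name is just a placeholder, and exact constructor names differ slightly across client library versions):

    from google.cloud import vision

    client = vision.ImageAnnotatorClient()

    # Read the uploaded Rx image from disk (placeholder path)
    with open("rx_sample.jpg", "rb") as f:
        image = vision.Image(content=f.read())

    # One call for labels, one for document text:
    # label_response carries label_annotations, while text_response carries
    # text_annotations and full_text_annotation.
    label_response = client.label_detection(image=image)
    text_response = client.document_text_detection(image=image)

This result has 3 major parts: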

  1. Label Annotation:

This level has information about entities/attributes found in the image. For example, for Rx images, some common entities are paper, writing, text, etc.

Label Annotation has 4 major parts (see the sketch after this list):

mid: A unique id per label entity

description: The label

score: Confidence score

topicality: Labels can be considered as attributes of the image; topicality measures how relevant the label is to the image.
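A quick way to read those 4 fields, continuing with the label_response from the earlier sketch (field names as exposed by the Python client):

    # Each entry of label_annotations is one detected label
    for label in label_response.label_annotations:
        print(label.mid,          # unique id of the label entity
              label.description,  # the label itself, e.g. "paper", "writing"
              label.score,        # confidence score (0 to 1)
              label.topicality)   # how relevant the label is to the image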

2. TextAnnotation:

It provides us with the entire text found in the image. The result can be divided into 2 parts:

Locale: Language detected

The 2nd part has text mapped to coordinates on the image.

The first entry of this part takes all the text as one chunk & provides the bounding box coordinates for that big text chunk.

After this entry, the remaining entries provide box coordinates mapped to each word found in the image.

The coordinates start from the top-left corner of the word (& not of the image).
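Continuing the earlier sketch, this is roughly how TextAnnotation is read with the Python client: the first entry is the full text chunk, the rest are per-word entries with their box coordinates:

    annotations = text_response.text_annotations

    full_chunk = annotations[0]
    print(full_chunk.locale)       # detected language, e.g. "en"
    print(full_chunk.description)  # all detected text as one chunk

    # Remaining entries: one per word, each with 4 corner coordinates
    for word in annotations[1:]:
        vertices = [(v.x, v.y) for v in word.bounding_poly.vertices]
        print(word.description, vertices)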

3. FullTextAnnotation:

FullTextAnnotation can be taken as an expanded and more detailed version of TextAnnotation.

The hierarchy followed in FullTextAnnotation is:

Page: It is at the top of the hierarchy. Some of the information provided is:

  1. Height & width of the Image
  2. Languages detected with confidence score under ‘Property’ key
  3. Blocks

Blocks: A block can be taken as a collection of paragraphs. A page can have more than 1 block. The information provided per block is:

  1. 4 x,y coordinate pair starting from the top left of the block
  2. confidence score
  3. Paragraphs

Paragraphs: A paragraph can be taken as a collection of words. A block can have more than 1 paragraph. The information provided per paragraph is:

  1. 4 x,y coordinate pair starting from the top left of the paragraph
  2. confidence score
  3. words

Words: A word can be taken as a collection of symbols (letters). A paragraph can have more than 1 word. The information provided per word is:

  1. 4 x,y coordinate pair starting from the top left of the word
  2. confidence score
  3. Symbols

Symbols: Lowermost level in the hierarchy. The information provided is:

  1. Language for that symbol/letter
  2. Text of the symbol
  3. Confidence score
  4. 4 x,y coordinate pairs starting from the top left of the letter/symbol.

Note that the actual text in FullTextAnnotation is present only at the Symbol level. Hence, to get the text corresponding to any higher level (Block, Paragraph, Word), it has to be assembled from the Symbols, as in the sketch below.
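A minimal sketch of walking that hierarchy with the Python client and rebuilding each word's text from its symbols (continuing with text_response from the earlier sketches):

    document = text_response.full_text_annotation

    for page in document.pages:
        print(page.width, page.height)  # image dimensions
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    # The raw text lives only on the symbols
                    text = "".join(symbol.text for symbol in word.symbols)
                    box = [(v.x, v.y) for v in word.bounding_box.vertices]
                    print(text, word.confidence, box)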

NOTE:

  • Every score is on a scale of 0–1.
  • The coordinate system in which all coordinates for the various entities (words, blocks, etc.) are provided assumes (0,0) at the top-left corner of the image, irrespective of the orientation of the image.
  • The 4 x,y coordinate pairs given per entity start from the top-left corner of the entity (and not of the image), and the remaining 3 pairs follow in clockwise order. Hence the same image at a different orientation (rotated by some angle) will yield different coordinates for the same entities, as unpacked in the small sketch below.
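Unpacking a bounding box accordingly (word here is any word from the hierarchy walk above; the vertex ordering assumed is the one described in the note):

    # Vertex 0 is the entity's own top-left corner, the rest follow clockwise
    top_left, top_right, bottom_right, bottom_left = [
        (v.x, v.y) for v in word.bounding_box.vertices
    ]
    print(top_left, top_right, bottom_right, bottom_left)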

I hope this gives a fair bit of an idea to all those who might be thinking of using GVision for any sort of quality OCR.
