Google Vision OCR Detailed Analysis

Mehul Gupta
Data Science in your pocket
4 min read · Nov 21, 2019


Again, this is something I have been closely involved with at 1mg.

The Data Science team has been working on Prescription Digitization using Google Vision OCR for both Handwritten & Printed Rx.

Whenever a user uploads an Rx, it is fed to the GVision API, which returns a detailed OCR result.
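A minimal sketch of such a call, assuming the google-cloud-vision Python client with credentials already configured (the file name is just a placeholder, and exact constructor names differ slightly across client library versions):

    from google.cloud import vision

    client = vision.ImageAnnotatorClient()

    # Read the uploaded Rx image from disk (placeholder path)
    with open("rx_sample.jpg", "rb") as f:
        image = vision.Image(content=f.read())

    # One call for labels, one for document text:
    # label_response carries label_annotations, while text_response carries
    # text_annotations and full_text_annotation.
    label_response = client.label_detection(image=image)
    text_response = client.document_text_detection(image=image)

This result has 3 major parts: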

  1. Label Annotation:

This level has information about entities/attributes found in the image. For example, for Rx images, some common entities are paper, writing, text, etc.

Label Annotation has 4 major parts (see the sketch after this list):

mid: A unique id per label entity

description: The label

score: Confidence score

topicality: Labels can be considered as attributes of the image; topicality measures how relevant the label is to the image.
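A quick way to read those 4 fields, continuing with the label_response from the earlier sketch (field names as exposed by the Python client):

    # Each entry of label_annotations is one detected label
    for label in label_response.label_annotations:
        print(label.mid,          # unique id of the label entity
              label.description,  # the label itself, e.g. "paper", "writing"
              label.score,        # confidence score (0 to 1)
              label.topicality)   # how relevant the label is to the image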

2. TextAnnotation:

It provides us with the entire text found in the image. The result can be divided into 2 parts:

Locale: Language detected

The 2nd part has text mapped to coordinates on the image.

The first entry of this part takes all the text as one chunk & provides the bounding box coordinates for that big text chunk.

After this entry, the remaining entries provide box coordinates mapped to each word found in the image.

The coordinates start from the top-left corner of the word (& not of the image).
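Continuing the earlier sketch, this is roughly how TextAnnotation is read with the Python client: the first entry is the full text chunk, the rest are per-word entries with their box coordinates:

    annotations = text_response.text_annotations

    full_chunk = annotations[0]
    print(full_chunk.locale)       # detected language, e.g. "en"
    print(full_chunk.description)  # all detected text as one chunk

    # Remaining entries: one per word, each with 4 corner coordinates
    for word in annotations[1:]:
        vertices = [(v.x, v.y) for v in word.bounding_poly.vertices]
        print(word.description, vertices)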

3. FullTextAnnotation:

FullTextAnnotation can be taken as an expanded and more detailed version of TextAnnotation.

The hierarchy followed in FullTextAnnotation is:

Page: It is at the top of the hierarchy. Some of the information provided is:

  1. Height & width of the Image
  2. Languages detected with confidence score under ‘Property’ key
  3. Blocks

Blocks: A block can be taken as a collection of paragraphs. A page can have more than 1 block. The information provided per block is:

  1. 4 x,y coordinate pair starting from the top left of the block
  2. confidence score
  3. Paragraphs

Paragraphs: A paragraph can be taken as a collection of words. A block can have more than 1 paragraph. The information provided per paragraph is:

  1. 4 x,y coordinate pair starting from the top left of the paragraph
  2. confidence score
  3. words

Words: A word can be taken as a collection of symbols (letters). A paragraph can have more than 1 word. The information provided per word is:

  1. 4 x,y coordinate pair starting from the top left of the word
  2. confidence score
  3. Symbols

Symbols: Lowermost level in the hierarchy. The information provided is:

  1. Language for that symbol/letter
  2. Text of the symbol
  3. Confidence score
  4. 4 x,y coordinate pairs starting from the top left of the letter/symbol.

Note that the actual text in FullTextAnnotation is present only at the Symbol level. Hence, to get the text corresponding to any higher level (Block, Paragraph, Word), it has to be assembled from the Symbols, as in the sketch below.
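A minimal sketch of walking that hierarchy with the Python client and rebuilding each word's text from its symbols (continuing with text_response from the earlier sketches):

    document = text_response.full_text_annotation

    for page in document.pages:
        print(page.width, page.height)  # image dimensions
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    # The raw text lives only on the symbols
                    text = "".join(symbol.text for symbol in word.symbols)
                    box = [(v.x, v.y) for v in word.bounding_box.vertices]
                    print(text, word.confidence, box)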

NOTE:

  • Every score is on a scale of 0–1.
  • The coordinate system in which all coordinates for the various entities (words, blocks, etc.) are provided assumes (0,0) at the top-left corner of the image, irrespective of the orientation of the image.
  • The 4 x,y coordinate pairs given per entity start from the top-left corner of the entity (and not of the image), and the remaining 3 pairs follow in clockwise order. Hence the same image at a different orientation (rotated by some angle) will yield different coordinates for the same entities, as unpacked in the small sketch below.
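Unpacking a bounding box accordingly (word here is any word from the hierarchy walk above; the vertex ordering assumed is the one described in the note):

    # Vertex 0 is the entity's own top-left corner, the rest follow clockwise
    top_left, top_right, bottom_right, bottom_left = [
        (v.x, v.y) for v in word.bounding_box.vertices
    ]
    print(top_left, top_right, bottom_right, bottom_left)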

I hope this gives a fair bit of an idea to all those who might be thinking of using GVision for any sort of quality OCR.
