Hugging Face AI Models 🤗 — Model 1 — TrOCR (Text Extraction from Images)

Vaibhav Singal
Published in AI Trends
5 min read · Jun 29, 2022


Important Links

  1. TrOCR — Paper: TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
  2. TrOCR — Models Library: https://huggingface.co/models?filter=trocr
  3. TrOCR — Most Downloaded Model: https://huggingface.co/microsoft/trocr-base-printed
  4. TrOCR — Hugging Face Home Page: https://huggingface.co/docs/transformers/model_doc/trocr
  5. TrOCR — Notebooks:

https://github.com/NielsRogge/Transformers-Tutorials/tree/master/TrOCR

https://github.com/microsoft/unilm

https://github.com/rsommerfeld/trocr

  6. TrOCR — Dataset: https://paperswithcode.com/dataset/sroie
  7. TrOCR — Source Code: https://github.com/huggingface/transformers/tree/main/src/transformers/models/trocr
  8. TrOCR — Papers with Code: https://paperswithcode.com/paper/trocr-transformer-based-optical-character

Background

Optical Character Recognition (OCR) is the electronic conversion of images of typed, handwritten, or printed text into machine-encoded text. The source could be a scanned document, a photograph of a page, or text superimposed on an image.

Before we delve into transformer-based OCR, let’s first understand how a conventional OCR pipeline operates.

Most OCR pipelines consist of two modules:
1. Text detection module
2. Text recognition module

Text Detection Module

The text detection module, as its name implies, finds all instances of text in the source image.

It localises every text block within the image, either at the word level (individual words) or at the text-line level.

Since the text blocks are the objects of interest, this is essentially an object detection problem.

Popular object detection algorithms such as YOLOv4/v5, Detectron, and Mask R-CNN can be used for this task.
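Whatever detector is used, its output can be thought of as a list of scored boxes that are cropped out and handed to the recognition module. A minimal sketch of that interface, where `TextBox` and `detect_text` are hypothetical stand-ins (a real implementation would run one of the detectors above):

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """An axis-aligned text block localised by the detector."""
    x0: int
    y0: int
    x1: int
    y1: int
    score: float  # detector confidence


def detect_text(image) -> list[TextBox]:
    """Stand-in for a real detector (YOLOv4/v5, Detectron, Mask R-CNN...)."""
    # Hard-coded boxes for illustration only.
    return [TextBox(10, 10, 120, 40, 0.98), TextBox(10, 50, 200, 80, 0.95)]


boxes = detect_text(image=None)
# Keep confident detections; these crops feed the recognition module.
crops = [(b.x0, b.y0, b.x1, b.y1) for b in boxes if b.score > 0.5]
```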

Text Recognition Module

The goal of the text recognition module is to interpret the detected text block’s content and translate the visual cues into tokens of natural language.

A text recognition module typically contains two sub-modules.

  1. Image Understanding
  2. Word Piece Generation

The text recognition module works as follows.

Each localised text box is resized to a fixed resolution, say 224x224, and fed into the image understanding module, which is commonly a CNN (for example, a ResNet augmented with self-attention).

The image features extracted from a specific network depth are passed as input to the word piece generation module, an RNN-based network. The RNN’s output is the machine-encoded text for each localised text box.

The text recognition module is trained with an appropriate loss function until performance reaches an acceptable level.
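The shapes flowing through this CNN-to-RNN pipeline can be sketched as follows. The dimensions and the random projections standing in for the CNN, RNN, and classifier are illustrative assumptions, not any specific model's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A localised text box, resized to a fixed resolution.
crop = rng.random((224, 224, 3))

# Stand-in for CNN features from a certain depth: the feature map is read
# column-by-column as a sequence of 28 feature vectors of 256 channels.
features = rng.random((28, 256))

# Stand-in for the RNN + classifier head: project each timestep onto a
# (tiny, hypothetical) word-piece vocabulary.
vocab = ["<pad>", "h", "e", "l", "o"]
W = rng.random((256, len(vocab)))
logits = features @ W                  # (28, vocab_size)
token_ids = logits.argmax(axis=-1)     # greedy decode per timestep
decoded = [vocab[i] for i in token_ids]
```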

What distinguishes transformer-based OCR?

Transformer-based OCR is one of the first studies to jointly use pre-trained image and text transformers. It is an end-to-end transformer-based OCR model for text recognition.

Transformer-based OCR resembles the illustration below. The Vision Transformer encoder is on the left-hand side of the schematic, and the RoBERTa (text transformer) decoder is on the right-hand side.

Vision Transformer (Encoder):

An image is divided into NxN patches, each of which is treated like a token in a sentence. The image patches are flattened (2D → 1D) and then linearly projected, with positional embeddings added. The linear projections plus positional embeddings are propagated through the transformer encoder layers.

In OCR, the input consists of numerous localised text boxes. The image crops of the text boxes are resized to a fixed HxW to guarantee uniformity across localised text boxes. Each image is then divided into HW/(PxP) patches, where P is the patch size.

After that, the patches are flattened and linearly projected to D-dimensional vectors, the patch embeddings. The patch embeddings and two special tokens are given learnable 1D position embeddings according to their absolute positions. The input sequence is then passed through a stack of identical encoder layers.
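The patch embedding step above can be sketched numerically. The numbers H = W = 384 and P = 16 (giving HW/P² = 576 patches) are illustrative ViT-style values, and random matrices stand in for the learned projection and position embeddings:

```python
import numpy as np

H = W = 384
P = 16
D = 768
N = (H * W) // (P * P)  # number of patches: 384*384 / (16*16) = 576

rng = np.random.default_rng(0)
image = rng.random((H, W, 3))

# Split the image into N patches of P x P x 3 and flatten each to 1D.
patches = image.reshape(H // P, P, W // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(N, P * P * 3)

# Linear projection to D-dimensional patch embeddings.
W_proj = rng.random((P * P * 3, D))
embeddings = patches @ W_proj          # (576, 768)

# Prepend the two special tokens and add learnable 1D position embeddings.
special = rng.random((2, D))
seq = np.concatenate([special, embeddings])   # (578, 768)
pos = rng.random((N + 2, D))
encoder_input = seq + pos
```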

Each transformer layer has a multi-head self-attention module and a fully connected feed-forward network. Each of these two parts is followed by a residual connection and layer normalization.

Note: Residual connections ensure gradient flow during backpropagation.
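The "sublayer, residual add, layer norm" pattern can be written out in a few lines. This is a post-norm sketch with unlearned normalization parameters, using a random projection as a stand-in sublayer:

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each vector over its last axis (no learned scale/shift)."""
    mean = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def residual_block(x, sublayer):
    # The residual connection lets gradients flow around the sublayer.
    return layer_norm(x + sublayer(x))


rng = np.random.default_rng(0)
x = rng.random((578, 768))  # e.g. the encoder input sequence
out = residual_block(x, lambda h: h @ rng.random((768, 768)) * 0.01)
```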

RoBERTa (Decoder):

The output embeddings from a certain depth of the ViT encoder are extracted and passed as input to the decoder module.

The decoder module is also a transformer with a stack of identical layers, structured like the layers in the encoder, except that the decoder inserts an "encoder-decoder attention" module between the multi-head self-attention and the feed-forward network to distribute attention over the encoder's output. In the encoder-decoder attention module, the keys and values come from the encoder output, while the queries come from the decoder input.
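The key/value/query arrangement described above can be sketched with single-head scaled dot-product attention; the sequence lengths are illustrative:

```python
import numpy as np


def attention(Q, K, V):
    """Scaled dot-product attention (single head, no masking)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # softmax over encoder positions
    return weights @ V


rng = np.random.default_rng(0)
encoder_out = rng.random((578, 768))  # one vector per patch (+ special tokens)
decoder_in = rng.random((10, 768))    # 10 target positions generated so far

# Encoder-decoder attention: queries from the decoder, keys/values from
# the encoder output.
context = attention(Q=decoder_in, K=encoder_out, V=encoder_out)
```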

The embeddings from the decoder are projected from the model dimension (768) to the dimension of vocabulary size V (50265).

The softmax function computes the probabilities over the vocabulary, and beam search is used to obtain the final output.
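The final projection and softmax can be sketched as follows. The vocabulary size 50265 matches RoBERTa's, the projection matrix is random for illustration, and a greedy argmax stands in for the beam search used in practice:

```python
import numpy as np

D, V = 768, 50265
rng = np.random.default_rng(0)

decoder_out = rng.random((10, D))                     # one embedding per position
W_vocab = rng.random((D, V), dtype=np.float32) * 0.01  # stand-in projection

# Project from model dimension D to vocabulary size V.
logits = decoder_out @ W_vocab                         # (10, 50265)

# Softmax over the vocabulary (max-subtracted for numerical stability).
probs = np.exp(logits - logits.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)

# Greedy pick for the last position; beam search would instead keep the
# top-k partial sequences at each step.
next_token = int(probs[-1].argmax())
```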

Advantages:

  • TrOCR, an end-to-end Transformer-based OCR model for text recognition with pre-trained CV and NLP models, is the first work that jointly leverages pre-trained image and text Transformers for the text recognition task in OCR.
  • TrOCR achieves state-of-the-art accuracy with a standard transformer-based encoder-decoder model, which is convolution free and does not rely on any complex pre/post-processing step.

The TrOCR model was proposed in TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. TrOCR consists of an image Transformer encoder and an autoregressive text Transformer decoder to perform optical character recognition (OCR).
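Running the most-downloaded checkpoint linked above takes only a few lines with the `transformers` library. A blank image stands in here for a real cropped text line; the checkpoint is downloaded on first use:

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Load the pre-trained processor (image + tokenizer) and model.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

# Replace this blank image with a real crop of printed text.
image = Image.new("RGB", (384, 384), "white")

# Preprocess, generate token ids autoregressively, and decode to text.
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```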


Data Scientist @ Paytm — NLP | COMPUTER VISION | MLOPS