Scene Text Recognition Using ResNet and Transformer
We often come across irregular cropped images that contain text, and many sophisticated methods have been proposed to extract that text. Traditionally, optical character recognition (OCR) and RNN-based seq2seq attention methods have been used to extract sequential information from structured images, but researchers have found that they struggle with irregular images and that their training time makes them expensive. An RNN-based seq2seq attention method requires a sequence representation of the input, which varies from input to input, making it hard to train over millions of images. As a result, such models often fail to predict the text or characters when dealing with natural scene images.
If we look at recent models, one thing is common to almost all of them: self-attention. It enables a model to draw dependencies between different positions in a sequence through position-pair computation. However, self-attention works most naturally on word sequences, where the attention mechanism can look over every word in a sentence. For image-to-text translation, it is harder to make sense of the feature map and to create such dependencies. In this post, I will explain two models that address image text recognition with a strong yet simple approach: directly connecting two-dimensional CNN features to an attention-based sequence encoder and decoder, guided by a holistic representation, using ResNet and the Transformer.
Table of contents
1. Business Problem
2. Performance metric
3. Data source
4. Exploratory Data Analysis
5. Brief introduction ResNet architecture
6. Brief introduction Transformer architecture
7. MODEL: ONE
8. MODEL: TWO
9. Future work
10. References
1. Business Problem
In the real world, we encounter images in many different forms, both regular and irregular, with text embedded within them. Extracting the character string from such images is a challenging task. We are given a dataset of 5,000 irregular, natural scene images, and the business problem is to successfully predict the character string in each of them using state-of-the-art deep learning.
2. Performance metric:
We use a custom accuracy metric: the number of characters in the predicted string that match the ground-truth string, divided by the total number of characters in the ground truth.
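For concreteness, here is a minimal sketch of such a metric; the function name and the handling of length mismatches are my own assumptions, and the actual implementation may differ:

```python
# A minimal sketch of the character-level accuracy described above.
# The function name and length-mismatch handling are assumptions.
def char_accuracy(ground_truth: str, predicted: str) -> float:
    """Fraction of ground-truth positions whose character is matched by the prediction."""
    if not ground_truth:
        return 0.0
    matches = sum(1 for gt, pr in zip(ground_truth, predicted) if gt == pr)
    return matches / len(ground_truth)

# Example: 4 of the 5 ground-truth characters are matched.
print(char_accuracy("HOUSE", "HORSE"))  # 0.8
```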
3. Data source:
- IIIT 5K-word dataset: http://cvit.iiit.ac.in/research/projects/cvit-projects/the-iiit-5k-word-dataset
We have used this dataset for research purposes.
Details of Dataset citation:
@InProceedings{MishraBMVC12,
author = "Mishra, A. and Alahari, K. and Jawahar, C.~V.",
title = "Scene Text Recognition using Higher Order Language Priors",
booktitle = "BMVC",
year = "2012",
}
4. Exploratory Data Analysis
We are using the IIIT 5K-word dataset, which contains 5,000 text images together with corresponding annotation files in .mat format. We have to extract each image along with its character string.
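To read the annotations, a rough sketch using SciPy is shown below; the file name and the field names ('traindata', 'ImgName', 'GroundTruth') are assumptions and should be checked against the downloaded archive:

```python
# A rough sketch for reading the IIIT 5K-word .mat annotations with SciPy.
# The file name and field names are assumptions; verify them against the archive.
import scipy.io as sio

mat = sio.loadmat('IIIT5K/traindata.mat')
annotations = mat['traindata'][0]                      # structured array of annotation entries

# Collect (image path, ground-truth string) pairs.
samples = [(str(entry['ImgName'][0]), str(entry['GroundTruth'][0])) for entry in annotations]
print(len(samples), samples[0])                        # number of samples and the first pair
```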
Some random images with their ground-truth character strings are shown below:
# Displaying images with their ground-truth character strings
import matplotlib.pyplot as plt
import tensorflow as tf

for (batch, (inp, tar)) in enumerate(train_batches):
    if batch == 3:
        break
    plt.figure(figsize=(3, 3))
    plt.title('Image')
    plt.imshow(tf.keras.preprocessing.image.array_to_img(inp[0][0]))
    print(str(tar))
    plt.axis('off')
    plt.show()
5. Brief introduction of ResNet architecture
Deep learning models involve training a reasonably large number of hidden layers. Recent evidence shows that network depth is of high importance and that deeper networks give outstanding results on the ImageNet dataset. Training time grows with the number of hidden layers and the type of activations used, so training deeper neural networks is more difficult. In large networks we also frequently run into problems such as vanishing gradients during backpropagation.
Simply stacking layers does not reduce the training error and can cause overfitting. To help with this, we can add intermediate normalization layers between the hidden layers, which addresses convergence during backpropagation as well as overfitting.
The question then arises: why do we need ResNet if we can solve the gradient problem with intermediate normalization layers?
As we increase the number of hidden layers, training error starts to rise, which degrades model performance. Researchers have found that this degradation has nothing to do with overfitting; it is simply caused by adding more layers, which makes the model harder to optimize. To solve this problem, ResNet introduces identity mappings on top of the stacked layers, giving the gradient a clean path to backpropagate through.
F(x) denotes the output of a stack of two or more layers. The shortcut connection is added to this residual output before the ReLU activation. The operation adds neither extra parameters nor computational complexity, and it helps backpropagation with Stochastic Gradient Descent (SGD). With this mechanism we can train deeper neural networks without compromising training accuracy. Stacking 'n' such blocks, each with an identity mapping, creates the ResNet architecture.
The identity mapping with the residual block can be written as:
y = F(x, {W_i}) + x
Here F(x, {W_i}) is the residual mapping learned by the stacked layers, and x is the shortcut connection added to the residual, with the condition that both have the same dimension.
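As an illustration, a minimal Keras sketch of one residual block with an identity shortcut; the filter counts and kernel sizes are illustrative, not the exact ResNet configuration:

```python
# A minimal sketch of a two-convolution residual block with an identity shortcut.
# Filter counts and kernel sizes are illustrative, not the exact ResNet configuration.
import tensorflow as tf

def residual_block(x, filters=64):
    shortcut = x                                              # identity mapping (shortcut connection)
    y = tf.keras.layers.Conv2D(filters, 3, padding='same')(x)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.ReLU()(y)
    y = tf.keras.layers.Conv2D(filters, 3, padding='same')(y)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.Add()([y, shortcut])                  # y = F(x, {W_i}) + x
    return tf.keras.layers.ReLU()(y)                          # ReLU applied after the addition

inputs = tf.keras.Input(shape=(32, 100, 64))
outputs = residual_block(inputs)
model = tf.keras.Model(inputs, outputs)
```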
Another way of interpreting this concept is the 'highway network', a mechanism somewhat similar to the LSTM network. In a highway network we can control how much information is passed on to the next layer; its gates are data dependent and have parameters, which is not the case in ResNet. Performance-wise, however, ResNet has been found to be more adaptive and better at tackling the degradation problem.
Researchers have compared plain networks with residual networks that use identity mappings, and the ResNet models perform better even as extra layers are added. Comparing a plain and a residual network with the same number of parameters, depth, width, and computational cost, the result still comes out in favor of ResNet.
There are different variants of ResNet, for example ResNet-34, ResNet-50, and ResNet-101. The main difference between them is the number of layers within each stacked block and the number of stacked blocks placed on top of each other.
Now, why do we need the ResNet architecture instead of a pre-trained VGG for feature extraction?
Deep networks require high computational power, and as the network gets deeper there is a higher chance of the model overfitting and of the training error increasing. For the image-to-text task we need a network that can go deep without being computationally expensive, while still giving an accuracy gain. ResNet has won the ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation challenges against all other models.
The best part about ResNet, and what makes it unique, is that even with a much larger number of layers it still has lower complexity than VGG-16/19.
6. Brief introduction of Transformer architecture
Before the Transformer, sequential patterns were modeled with RNNs. But RNNs fail badly at memorizing information from long word sequences and therefore fail to predict the next word. To address this, Long Short-Term Memory (LSTM) was introduced, with an internal forget gate and an input (add) gate. The forget gate lets only a fraction of the information from previous time steps pass to the next time step, and the input gate controls how much information from the current time step is added to that fraction. Combined with an attention mechanism, this can capture long-term dependencies between words, yet it still fails for very long sentences, say 1,000 words. In addition, sentence lengths vary, so training time varies from sentence to sentence: while backpropagating gradients we have to unroll the LSTM for each input sentence and compute gradients at every time step, which leads to long training times.
To address all these problems, researchers came up with a strong yet simple network architecture, the Transformer, an attention-based mechanism with the same capabilities as recurrent models. Most importantly, training can be parallelized and completed in a feasible time.
The paper "Attention Is All You Need" introduces self-attention, which looks over the entire input sentence and creates word dependencies that work well even for long sentences. Attention mechanisms have been successful in tasks such as reading comprehension, machine translation, and question answering. The Transformer is built on an attention mechanism in the spirit of end-to-end memory networks; it needs no sequence-aligned RNNs or convolutions and still gives better results.
The full Transformer architecture is shown below:
Don’t be intimidated by the architecture above. I will dissect it into pieces and give a brief explanation of each internal part.
The whole architecture is divided into two parts i.e encoder and decoder. The left half is “the encoder” and the right half is “the decoder”.
The Encoder: It consists of N stacked identical layers, where N is a hyperparameter. Each layer has two sub-layers: a multi-head attention mechanism and a position-wise feed-forward network. In each stacked layer, the input vector coming from the positional encoding passes in parallel through the multi-head attention and a shortcut connection; the multi-head output is added to the shortcut connection and followed by layer normalization. The result then passes through the feed-forward network, which is applied to each position separately and identically. The residual connection around each sub-layer makes convergence easier during backpropagation.
The Decoder: It also consists of N stacked identical layers, where N is a hyperparameter. Each layer has three sub-layers: a masked multi-head attention mechanism, a second (encoder-decoder) multi-head attention mechanism, and a position-wise feed-forward network. In each stacked layer, the input vector coming from the positional encoding passes in parallel through the masked multi-head attention and a shortcut connection; the masked multi-head output is added to the shortcut connection and followed by layer normalization. The result then passes through the second multi-head attention, where the output from the encoder is also introduced, and finally through the feed-forward network, which is applied to each position separately and identically. As in the encoder, a residual connection around each sub-layer makes convergence easier during backpropagation.
Let's walk through the whole architecture in order. For simplicity, assume one encoder layer and one decoder layer.
Unlike an RNN, where we feed the input words one at a time, here we pass a whole sentence (or a batch of sentences) at once, followed by word embedding.
The word embedding assigns a d-dimensional vector to each word, which is learned during training. To keep track of word order, the output of the embedding layer is passed through positional encoding.
Positional encoding makes sure each word carries information about its position; it preserves the sequential pattern of the input sentence or batch of sentences.
In the figure, the x-axis is the word position and the y-axis is the 512-dimensional encoding for each word. If we zoom in, we can see that each position receives a distinct pattern. The output of the positional encoding is the input to the multi-head attention and the shortcut connection.
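For reference, a sketch of the sinusoidal positional encoding from the original paper, where each position receives a unique pattern of sines and cosines across the embedding dimensions:

```python
# A sketch of the sinusoidal positional encoding from "Attention Is All You Need".
import numpy as np
import tensorflow as tf

def positional_encoding(max_position, d_model):
    positions = np.arange(max_position)[:, np.newaxis]                     # (max_position, 1)
    dims = np.arange(d_model)[np.newaxis, :]                               # (1, d_model)
    angle_rates = 1 / np.power(10000, (2 * (dims // 2)) / np.float32(d_model))
    angles = positions * angle_rates
    angles[:, 0::2] = np.sin(angles[:, 0::2])                              # even indices: sine
    angles[:, 1::2] = np.cos(angles[:, 1::2])                              # odd indices: cosine
    return tf.cast(angles[np.newaxis, ...], tf.float32)                    # (1, max_position, d_model)

pe = positional_encoding(50, 512)
print(pe.shape)  # (1, 50, 512)
```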
Multi-head attention is an m-headed attention mechanism, where m is a hyperparameter. In the paper, 8 scaled dot-product attention heads are used; each head produces a 64-dimensional vector per word (512 / 8), the results from the heads are concatenated back into a 512-dimensional vector, and this is multiplied by a learned 512 × 512 projection matrix. These weights are learned through backpropagation.
Think of multi-head attention as a function that internally runs 8 scaled dot-product attentions and takes 3 vectors as arguments. The 3 vectors are simply the output of the previous layer, and in self-attention all three are the same; they are called queries, keys, and values. The output of scaled dot-product attention is
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
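A minimal sketch of scaled dot-product attention following this formula; the optional mask argument anticipates the look-ahead mask used later in the decoder:

```python
# A minimal sketch of Scaled Dot-Product Attention: softmax(Q.K^T / sqrt(d_k)).V
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    scores = tf.matmul(q, k, transpose_b=True)                 # (..., seq_q, seq_k)
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = scores / tf.math.sqrt(d_k)                        # scale by sqrt(d_k)
    if mask is not None:
        scores += (mask * -1e9)                                # masked positions -> ~0 after softmax
    weights = tf.nn.softmax(scores, axis=-1)                   # attention weights over the keys
    return tf.matmul(weights, v), weights                      # weighted sum of the values

q = k = v = tf.random.uniform((1, 8, 5, 64))                   # (batch, heads, seq_len, depth)
output, attn = scaled_dot_product_attention(q, k, v)
print(output.shape)  # (1, 8, 5, 64)
```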
You can refer to Jay Alammar's blog, The Illustrated Transformer, for a detailed explanation.
The output of the multi-head attention is added to the shortcut connection and followed by layer normalization. It then passes through the position-wise feed-forward network, again followed by layer normalization, and that is the final output of one encoder layer.
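Putting the encoder pieces together, here is a compact sketch of one encoder layer; dropout is omitted for brevity, and Keras' built-in MultiHeadAttention layer (TF 2.4+) is used instead of a hand-written one:

```python
# A compact sketch of one Transformer encoder layer (dropout omitted for brevity).
import tensorflow as tf

def encoder_layer(d_model=512, num_heads=8, dff=2048):
    inputs = tf.keras.Input(shape=(None, d_model))
    # Self-attention: queries, keys, and values all come from the same input.
    attn = tf.keras.layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=d_model // num_heads)(inputs, inputs)
    x = tf.keras.layers.LayerNormalization(epsilon=1e-6)(inputs + attn)    # residual + layer norm
    ffn = tf.keras.layers.Dense(dff, activation='relu')(x)                 # position-wise feed-forward
    ffn = tf.keras.layers.Dense(d_model)(ffn)
    outputs = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x + ffn)    # residual + layer norm
    return tf.keras.Model(inputs, outputs)

layer = encoder_layer()
print(layer(tf.random.uniform((2, 10, 512))).shape)  # (2, 10, 512)
```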
Now let us talk about the decoder.
Unlike an RNN, we feed the entire decoder input at once to the word embedding layer. This is the teacher forcing technique: the output of the softmax is not fed back into the decoder; instead, the model is assumed to have predicted the right sequence so far and is asked to predict the next word. This makes training faster and less computationally expensive.
The decoder layer has almost the same sub-layers, except for one extra multi-head attention. The first attention layer is masked multi-head attention, where 'masked' refers to the look-ahead mask: it prevents each position from attending to later positions, because those are exactly what we have to predict. The encoder output is fed to the second multi-head attention, and the remaining process stays the same. The decoder output passes through a final dense layer followed by a softmax of size equal to the vocabulary size.
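The look-ahead mask itself is simple to build; a sketch in the style of the TensorFlow Transformer tutorial, where 1 marks a position the decoder is not allowed to attend to:

```python
# A sketch of the look-ahead mask: position i may only attend to positions <= i.
import tensorflow as tf

def create_look_ahead_mask(size):
    # Ones strictly above the diagonal mark the (future) positions to be masked out.
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

print(create_look_ahead_mask(4).numpy())
# [[0. 1. 1. 1.]
#  [0. 0. 1. 1.]
#  [0. 0. 0. 1.]
#  [0. 0. 0. 0.]]
```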
I have used 2 models to accomplish successful extraction of the character string. I will be discussing both models in detail.
7. MODEL: ONE
A brief explanation of the architecture, which combines ResNet as the encoder and the Transformer as the decoder:
The whole architecture is sub-divided into 2 parts. The left half is the encoder and the right half is the decoder.
Let's first get into the detail of the encoder.
Encoder:
A ResNet-34 variant is used for feature mapping and feature extraction. The modified ResNet-34 outputs a 3-dimensional feature map. In my experiments I also tried a modified ResNet-50 for a deeper network, which can give better results than ResNet-34. The feature map is then passed through two branches simultaneously: a (1 × 1) convolution layer and a bottleneck. The output of the (1 × 1) convolution is fed to the decoder sub-layer, i.e. the second multi-head attention mechanism, where it is treated as the query and key vectors.
In the paper, the researchers use six stacked ResNet-34 bottleneck blocks with residual connections. The output of the last bottleneck is passed through average pooling followed by a fully connected dense layer of size 512. The dense layer output is a 2-dimensional tensor, which is treated as the word embedding of the input image.
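A rough sketch of how these two encoder branches can be wired is shown below; the layer sizes and the omission of the bottleneck blocks are my own simplifications, so the exact configuration in the paper and repository may differ:

```python
# A rough sketch of the two encoder branches described above.
# Layer sizes are assumptions; the bottleneck blocks are omitted for brevity.
import tensorflow as tf

feature_map = tf.keras.Input(shape=(8, 25, 512))                 # 3-D feature map from the ResNet backbone

# Branch 1: 1x1 convolution, flattened to a sequence for the decoder's 2-D attention (keys/queries).
keys = tf.keras.layers.Conv2D(512, kernel_size=1)(feature_map)
keys = tf.keras.layers.Reshape((-1, 512))(keys)                  # (batch, H*W, 512)

# Branch 2: bottleneck blocks (omitted), then average pooling and a dense layer
# producing the holistic "image word embedding".
pooled = tf.keras.layers.GlobalAveragePooling2D()(feature_map)
image_embedding = tf.keras.layers.Dense(512)(pooled)             # (batch, 512)

encoder = tf.keras.Model(feature_map, [keys, image_embedding])
```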
Decoder:
The input to the decoder embedding layer is the character string. The input string is character-tokenized, with an additional '<end>' token marking the end of the string. I have not used a '<start>' token, because the output of the encoder's last dense layer is introduced to the character embedding and acts as the start of the string. In the paper, the encoder's image word embedding is concatenated with the character embeddings after positional encoding; instead, I apply positional encoding after the concatenation, just to make sure the image word embedding comes first and serves as the '<start>' index.
The output of the previous layer is fed to the masked multi-head attention, followed by layer normalization with a residual connection. The mask here is the look-ahead mask. The output is then fed to the 2-dimensional attention layer together with the output of the feature-mapping branch, again followed by layer normalization with a residual connection. The result goes into the position-wise feed-forward network, followed by layer normalization with a residual connection, and finally passes through a dense layer with softmax activation.
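The way the image embedding stands in for the '<start>' token can be sketched as follows; shapes and names are illustrative assumptions:

```python
# A rough sketch of prepending the holistic image embedding as the '<start>' token
# before positional encoding (shapes and names are illustrative assumptions).
import tensorflow as tf

def prepend_image_embedding(char_embeddings, image_embedding):
    # char_embeddings: (batch, seq_len, 512), image_embedding: (batch, 512)
    start_token = tf.expand_dims(image_embedding, axis=1)        # (batch, 1, 512)
    return tf.concat([start_token, char_embeddings], axis=1)     # (batch, seq_len + 1, 512)

chars = tf.random.uniform((2, 20, 512))
img = tf.random.uniform((2, 512))
print(prepend_image_embedding(chars, img).shape)  # (2, 21, 512)
```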
Experiment:
I tried the above architecture with a modified ResNet-50 and a bottleneck built from standard ResNet-50 blocks. The output of the last bottleneck layer, after average pooling, is reshaped to 2 dimensions and passed to a dense layer of size 512. I used a custom learning rate schedule with 4,000 warmup steps together with the Adam optimizer, and I also tried beam search to get better predictions. After training this model for 232 epochs, it predicts with 87% accuracy, with the loss reduced to 0.0903.
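The custom schedule is the warmup schedule from "Attention Is All You Need", as popularized by the TensorFlow Transformer tutorial:

```python
# The warmup schedule from "Attention Is All You Need":
# lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
import tensorflow as tf

class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model=512, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step)                               # decay after warmup
        arg2 = step * (self.warmup_steps ** -1.5)                # linear increase during warmup
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

optimizer = tf.keras.optimizers.Adam(CustomSchedule(), beta_1=0.9, beta_2=0.98, epsilon=1e-9)
```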
Below is the code for the 1st model architecture.
The predicted sample has been shown below:
and the corresponding attention plot is shown below:
8. MODEL: TWO
A brief explanation of the architecture, where ResNet-101 features are the input to the Transformer encoder and the Transformer acts as the decoder:
In the 1st model, ResNet acts as the encoder and the Transformer as the decoder. The architecture of the 2nd model is quite different: here ResNet is used only for feature-map extraction, and the resulting image word embeddings are the input to the Transformer encoder. Apart from this, everything remains as discussed in the basics of the Transformer architecture.
The term 'partial ResNet-101' refers to the backbone truncated at the desired layer so that we obtain a 3-dimensional convolutional feature map. It is then reshaped into a 2-dimensional feature map and passed through a fully connected dense layer. The final output is treated as the word embedding of each image, which is the input to the encoder layers. We use 4 stacked encoder and decoder layers with 8-head multi-head attention.
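A rough sketch of this feature extractor is shown below; the cut point ('conv4_block23_out') and the dense size are assumptions, and the repository may use a different configuration:

```python
# A rough sketch of the "partial ResNet-101" feature extractor: truncate the backbone
# at an intermediate block, then flatten the 3-D feature map into a sequence of
# per-position embeddings for the Transformer encoder.
# The cut point and dense size are assumptions.
import tensorflow as tf

backbone = tf.keras.applications.ResNet101(include_top=False, weights='imagenet')
partial = tf.keras.Model(backbone.input, backbone.get_layer('conv4_block23_out').output)

images = tf.keras.Input(shape=(64, 200, 3))
fmap = partial(images)                                           # (batch, H, W, C) feature map
seq = tf.keras.layers.Reshape((-1, fmap.shape[-1]))(fmap)        # (batch, H*W, C)
embeddings = tf.keras.layers.Dense(512)(seq)                     # per-position 512-d "image word embeddings"
feature_extractor = tf.keras.Model(images, embeddings)
```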
Here too I used a custom learning rate schedule with 4,000 warmup steps together with the Adam optimizer, and I tried beam search for better predictions. After training this model for 500 epochs, it predicts with only 51% accuracy, with the loss reduced to 0.37, so it falls short of the 1st model.
Below is the code for the 2nd model architecture.
The predicted sample has been shown below:
and the corresponding attention plot is shown below:
I have only shown the results for the first two heads; for details, check out the GitHub repository.
9. Future work:
I would like to experiment with different ResNet architectures and train the models for more epochs. I have not done data augmentation, which I would also like to try.
10. References:
- Transformer model for language understanding
- Attention Is All You Need
- The Illustrated Transformer
- Deep Residual Learning for Image Recognition
- A Holistic Representation Guided Attention Network for Scene Text Recognition
- Transformer Model Used to Recognize Text In Image
You can check out all the details in my GitHub repository, linked below:
My LinkedIn: