[Paper Summary] SEE: Towards semi-supervised end-to-end scene text recognition

Rudra Saha
6 min read · Apr 12, 2018


Example of end-to-end scene text detection (from SEE [1])

Introduction:

Text detection and recognition in natural scene images or document images is a challenging task with a myriad of applications, ranging from information extraction and machine translation to autonomous navigation. Addressing this task, Bartz et al. [1] published SEE at AAAI 2018: an end-to-end approach that combines a Spatial Transformer Network [2] based text detection network with a ResNet based text recognition network.

Methodology:

The intuition behind their model is based on how humans read text: by attending to a word, reading it out, and then attending to the next word in a sequential manner. This post is meant to be an intuitive explanation of their approach.

Text detection and recognition network as shown in [1].

Text Detection:

They make use of the Spatial Transformer Network (STN) [2] in order to attend to the words. An STN is composed of three parts: a localization network, a grid generator and a differentiable image sampling layer. Let’s briefly look at how each of these is used in their model.

Localization Network:

In order to attend to the words in the input image, the image feature map is first transformed so that the focus is shifted to the text instead of staying at a global level. To obtain the transformation matrices for the input image I, it is first passed through a ResNet based CNN to obtain the global features c of the image. These are then passed through an LSTM to sequentially compute the N transformation matrices that focus on the N characters, words or lines of text in the input image.

h_n = LSTM(c, h_{n-1})        (Eq. 2)

A^n = g(h_n, c)               (Eq. 3)

Eq. 2 is the hidden state of the LSTM at step n; Eq. 3 outputs the transformation matrix at that step.

The hidden state h_n of the LSTM, along with the global features c, is passed through the feed-forward network g to obtain the transformation matrix. The matrix defines a two-dimensional affine transformation, allowing the network to apply translation, rotation, zoom and skew to parts of the input image. Thus the output of the localization network is one transformation matrix at each step of the LSTM.

A^n = [ θ^n_1  θ^n_2  θ^n_3 ]
      [ θ^n_4  θ^n_5  θ^n_6 ]

Transformation matrix from the localization network.
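
To make the localization step concrete, here is a minimal PyTorch-style sketch of the idea (not the authors' implementation; the layer sizes, module names and number of steps N are illustrative assumptions): a small CNN stands in for the ResNet and produces the global features c, an LSTM cell is unrolled N times, and a feed-forward head g maps each hidden state together with c to a 2×3 affine matrix.

```python
import torch
import torch.nn as nn

class Localizer(nn.Module):
    """Sketch of a SEE-style localization network (illustrative, not the paper's exact architecture)."""
    def __init__(self, feat_dim=256, hidden_dim=256, num_steps=3):
        super().__init__()
        self.num_steps = num_steps
        # Stand-in for the ResNet feature extractor producing global features c.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.lstm = nn.LSTMCell(feat_dim, hidden_dim)
        # Feed-forward head g: hidden state + global features -> 6 affine parameters.
        self.g = nn.Linear(hidden_dim + feat_dim, 6)
        # Initialize g to output the identity transform so training starts by attending to the whole image.
        nn.init.zeros_(self.g.weight)
        self.g.bias.data = torch.tensor([1., 0., 0., 0., 1., 0.])

    def forward(self, images):
        c = self.cnn(images)                              # (B, feat_dim) global features
        h = torch.zeros(images.size(0), self.lstm.hidden_size, device=images.device)
        s = torch.zeros_like(h)
        thetas = []
        for _ in range(self.num_steps):                   # one affine matrix per attended text region
            h, s = self.lstm(c, (h, s))
            theta = self.g(torch.cat([h, c], dim=1))      # (B, 6)
            thetas.append(theta.view(-1, 2, 3))
        return torch.stack(thetas, dim=1)                 # (B, N, 2, 3)
```

For example, `Localizer()(torch.randn(2, 3, 64, 200))` returns a (2, 3, 2, 3) tensor, i.e. one affine matrix per attended region per image. Initializing g to the identity transform is a common trick in STN-style models so that early in training each grid covers the full image.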

Grid Generator:

SEE utilizes the grid generator of the STN to obtain the bounding boxes around the text localized by the localization network. Together with the affine transformation matrix, the grid generator produces N regular grids, one for each of the N characters, words or lines of text that are to be recognized. The coordinates of the grid can be obtained using the following expression:

(u^n_i, v^n_j)^T = A^n (x_i, y_j, 1)^T

x and y are the coordinates of the grid G with height H and width W; u and v are the corresponding coordinates in the input image feature map.

SEE does not use ground-truth bounding boxes for any of its losses; it uses only the output text to calculate the loss for the model.
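
In a PyTorch-style sketch, the same grid generation can be expressed with `torch.nn.functional.affine_grid`. This is an illustrative mapping of the idea, not the paper's code; the output size and the helper name `make_grids` are assumptions.

```python
import torch
import torch.nn.functional as F

def make_grids(thetas, out_h=32, out_w=100):
    """thetas: (B, N, 2, 3) affine matrices predicted by the localization network.
    Returns (B, N, out_h, out_w, 2) sampling grids in normalized [-1, 1] coordinates."""
    b, n = thetas.shape[:2]
    flat = thetas.reshape(b * n, 2, 3)
    # affine_grid applies each 2x3 matrix to a regular grid of the requested output size.
    grids = F.affine_grid(flat, size=(b * n, 1, out_h, out_w), align_corners=False)
    return grids.reshape(b, n, out_h, out_w, 2)
```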

Image Sampling Layer:

They use bilinear sampling to sample values of the input feature map I at the locations u and v for each n ∈ N, using the sampling grids G produced by the grid generator. This bilinear sampling is (sub-)differentiable, hence it is possible to propagate error gradients back to the localization network using standard backpropagation. Each of the n ∈ N output feature maps O at a given location i, j, for i ∈ H and j ∈ W, can be written as:

O^n_{ij} = Σ_h Σ_w I_{hw} · max(0, 1 − |u^n_i − h|) · max(0, 1 − |v^n_j − w|)

After applying the transformation to each of the n ∈ N regions obtained by the grid generator, the regions are processed independently of each other in the text recognition stage.
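
The bilinear sampling step corresponds to `torch.nn.functional.grid_sample`. Continuing the hypothetical `make_grids` helper above, here is a minimal sketch of cropping the N regions that the recognition stage then processes independently:

```python
import torch
import torch.nn.functional as F

def sample_regions(feature_map, grids):
    """feature_map: (B, C, H, W) input image or feature map I.
    grids: (B, N, h, w, 2) sampling grids from the grid generator.
    Returns (B, N, C, h, w): one bilinearly sampled crop per predicted text region."""
    b, n, h, w, _ = grids.shape
    c = feature_map.size(1)
    # Repeat the feature map once per region so every grid samples from its own copy.
    feats = (feature_map.unsqueeze(1)
             .expand(b, n, c, *feature_map.shape[2:])
             .reshape(b * n, c, *feature_map.shape[2:]))
    crops = F.grid_sample(feats, grids.reshape(b * n, h, w, 2),
                          mode='bilinear', align_corners=False)
    return crops.reshape(b, n, c, h, w)
```

Because `grid_sample` is differentiable with respect to both its input and the grid, gradients from the recognition loss flow back into the localization network, which is what allows the whole pipeline to be trained end to end with only text supervision.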

Text Recognition:

SEE uses a ResNet based CNN for its text recognition stage. The global features obtained for each region are passed through a linear layer and subsequently through a BiLSTM for T time steps to obtain the characters of the word. The choice of T is based on the number of characters in the longest word in the dataset. The output of the recognition stage is a probability distribution over the label space at each time step.
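
A hedged sketch of one plausible arrangement of the recognition stage, under the same assumptions as before (a small CNN stands in for the ResNet; the class count, feature sizes and T are placeholders): each sampled region is encoded, projected by a linear layer, and a BiLSTM unrolled for T steps emits a distribution over the label space at every step.

```python
import torch
import torch.nn as nn

class Recognizer(nn.Module):
    """Sketch of the recognition stage: per-region CNN -> linear -> BiLSTM -> per-step class scores."""
    def __init__(self, num_classes=37, feat_dim=256, hidden_dim=256, max_chars=23):
        super().__init__()
        self.max_chars = max_chars  # T: chosen from the longest word in the dataset
        self.cnn = nn.Sequential(   # stand-in for the ResNet feature extractor
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(128, feat_dim)
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, regions):
        # regions: (B, N, 3, h, w) crops from the sampling layer; fold the N regions into the batch.
        b, n = regions.shape[:2]
        feats = self.proj(self.cnn(regions.reshape(b * n, *regions.shape[2:])))  # (B*N, feat_dim)
        # Feed the region features at every time step and let the BiLSTM decode T characters.
        steps = feats.unsqueeze(1).repeat(1, self.max_chars, 1)                  # (B*N, T, feat_dim)
        out, _ = self.rnn(steps)
        logits = self.classifier(out)                                            # (B*N, T, num_classes)
        return logits.reshape(b, n, self.max_chars, -1)
```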

Loss Formulation:

The loss function consists of two sets of losses. The first is the cross-entropy loss between the ground-truth characters and the output at each time step of the BiLSTM. The second consists of various regularization losses that they added to their model. They can be summarized as follows:

Rotation Dropout as Regularizer: During their experiments, they found that the network predicted transformation parameters that led to excessive rotation of the sampled input feature map region. To deal with this, they randomly drop the parameters of the affine transformation matrix that are responsible for rotation.
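
A minimal sketch of one plausible reading of rotation dropout, assuming the rotation/skew terms live in the off-diagonal entries of each 2×3 affine matrix; during training those entries are randomly zeroed so the network cannot rely on extreme rotations (the function name and drop probability are assumptions):

```python
import torch

def rotation_dropout(thetas, p=0.5, training=True):
    """thetas: (B, N, 2, 3) affine matrices from the localization network.
    With probability p per grid, zero theta[0, 1] and theta[1, 0], the entries
    that induce rotation/skew of the sampled region."""
    if not training:
        return thetas
    drop = (torch.rand(thetas.shape[:2], device=thetas.device) < p).float()  # (B, N)
    thetas = thetas.clone()
    thetas[..., 0, 1] = thetas[..., 0, 1] * (1 - drop)
    thetas[..., 1, 0] = thetas[..., 1, 0] * (1 - drop)
    return thetas
```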

Localization Specific Regularizers: They penalized the model for producing large grids, based on their area; encouraged it to produce grids that are wider than they are tall, based on how text is usually written; and penalized grids that are mirrored across any axis, since the datasets they used do not contain mirrored text. Adding these regularizers resulted in faster convergence. The network performs comparably without them but takes longer to converge and might require several restarts.
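
A hedged sketch of how such grid penalties could be computed directly from the affine parameters: the area term uses the absolute determinant of the 2×2 linear part, the aspect term penalizes grids that are taller than they are wide, and the mirror term penalizes a negative determinant (a flip). The exact formulations and weights in the paper may differ.

```python
import torch

def grid_regularizers(thetas, w_area=1.0, w_aspect=1.0, w_mirror=1.0):
    """thetas: (B, N, 2, 3). Returns a scalar regularization loss over all predicted grids."""
    a = thetas[..., :2]                                  # (B, N, 2, 2) linear part of each transform
    det = a[..., 0, 0] * a[..., 1, 1] - a[..., 0, 1] * a[..., 1, 0]
    width = a[..., 0, :].norm(dim=-1)                    # horizontal extent of the grid
    height = a[..., 1, :].norm(dim=-1)                   # vertical extent of the grid
    area_loss = det.abs()                                # discourage overly large grids
    aspect_loss = torch.relu(height - width)             # prefer grids wider than they are tall
    mirror_loss = torch.relu(-det)                       # penalize mirrored (flipped) grids
    return (w_area * area_loss + w_aspect * aspect_loss + w_mirror * mirror_loss).mean()
```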

The overall loss function can thus be written as:

L_reg(G^n) = L_area(G^n) + L_ratio(G^n) + L_dir(G^n)        (Eq. 1)

L = L_CE + Σ_{n=1}^{N} L_reg(G^n)                           (Eq. 2)

Eq. 1 is the regularization loss for each grid, which includes the terms for the area, aspect ratio and direction of the grid. Eq. 2 is the total loss of the model.
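
Putting the pieces together, a minimal sketch of the overall loss under the assumptions above: per-time-step cross-entropy against the ground-truth character labels (the only supervision used) plus a grid regularization term, such as the hypothetical `grid_regularizers` sketched earlier.

```python
import torch.nn.functional as F

def see_loss(logits, targets, reg_loss, reg_weight=1.0):
    """logits:   (B, N, T, num_classes) from the recognition stage.
    targets:  (B, N, T) integer character labels, padded to length T.
    reg_loss: scalar grid regularization term (e.g. from grid_regularizers above)."""
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return ce + reg_weight * reg_loss
```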

Experiments:

They performed their experiments on two datasets, SVHN [3] and FSNS [4]. On the SVHN dataset, they reached accuracies competitive with the state of the art [2, 5]. As a proof of concept that their model can identify multiple lines of text, they also generated their own version of the dataset, with multiple house numbers placed at random, non-overlapping locations in an image.

Their model works better when it tries to find individual words rather than whole text lines directly. They used this approach to reach competitive results on the FSNS dataset. They also use a curriculum learning strategy, first training on easier samples and gradually increasing the complexity of the training images.

Conclusion and Self Notes:

Their model provides an elegant way of performing end-to-end scene text recognition, but it suffers from a few problems (which they explicitly mention in their own conclusion section):

  1. Even though the network is relatively simple and its pipeline is similar to that of the Deep Text Spotter [6], it is difficult to train and requires careful curriculum design as well as multiple restarts of the network to converge.
  2. It is also constrained on the maximum number of words it can identify in one forward pass.
  3. It is not fully capable of detecting text at arbitrary locations.
  4. It works well for sparse scene text detection and recognition but fails for dense text; for dense text it is also difficult to make the model converge.

All in all, their model provides a nice baseline for how an end-to-end setup can be designed for scene text recognition, and the pipeline can be built upon to improve the results.


Rudra Saha

Graduate Student, Active Perception Group, Arizona State University