Deep Learning-based Text Detection and Recognition in a Research Lab

Priyank Jain · Published in TechBlog · Aug 19, 2019


Sudhir Sornapudi, Priyank Jain

I. INTRODUCTION

Scientists in chemical laboratories spend a considerable amount of time managing inventory while conducting research, usually with paper-based or digital tools. Digital tools are more advantageous for updating and tracking experiments and for managing inventory, but they come with an initial overhead during onboarding. Our vision is a tool that can update the off-the-shelf inventory simply by scanning images of chemical reagents. We propose a model that detects and recognizes the text in such images using a deep learning framework.

Extracting text in a machine-readable format from real-world images is one of the more challenging tasks in the computer vision community. Reading text in natural images has gained a lot of attention due to its practical applications in updating inventory, analyzing documents, scene understanding, robot navigation, and image retrieval. Although there has been significant progress in both text detection and text recognition, the problem remains challenging because of the complexity of natural scene images.

Unlike optical character recognition (OCR) on scanned documents, reading text from real-world natural images is far more difficult. OCR engines such as Tesseract [1] work very well on images of scanned documents, as shown in Figure 1: the images have a clean background, regular fonts, a plain layout, and a single uniform color. Scene text images, in contrast, have complicated backgrounds in which some patterns are visually indistinguishable from the true text. The text appears in different colors, sizes, orientations, fonts, and languages, and is sometimes curved. The images also suffer from interference such as low resolution, poor exposure, noise, motion blur, defocus, and varying illumination, as shown in Figure 2.

Figure 1. Samples of scanned documents

Figure 2. Samples of natural scene images
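For clean scanned documents like those in Figure 1, an off-the-shelf OCR engine is usually sufficient. As a point of reference, here is a minimal sketch using the pytesseract wrapper around Tesseract; the package installation and the file name are assumptions, not part of the original setup.

```python
# Minimal OCR on a clean scanned document with Tesseract (via pytesseract).
# Assumes the Tesseract binary plus the pytesseract and Pillow packages are
# installed; "scanned_page.png" is an illustrative file name.
from PIL import Image
import pytesseract

image = Image.open("scanned_page.png")
text = pytesseract.image_to_string(image)  # returns the recognized text as a string
print(text)
```

On natural scene images, however, this kind of pipeline degrades quickly, which motivates the two-step approach below.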

In view of the aforementioned challenges, the problem is split into a two-step process: 1) text detection and 2) text recognition. Text detection predicts and localizes the text instances in an image, as shown in Figure 3. Text recognition decodes the detected text regions into a machine-readable format, as shown in Figure 4.

Figure 3. Text detection, localizing text by drawing green bounding boxes around the text.

Figure 4. Text recognition

The two-step process can be clearly understood from Figure 5. We first perform detection to localize the regions where text is present, and then each detected region is cropped individually and its text is recognized into a machine-readable format.

Figure 5. Text detection and recognition.
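A minimal Python sketch of this two-step pipeline is shown below; detect_text_boxes and recognize_text are hypothetical stand-ins for the detection and recognition models described in Section II, and OpenCV is assumed only for image handling.

```python
# Two-step text reading pipeline: detect boxes, crop each region, then recognize.
# detect_text_boxes() and recognize_text() are placeholders for the models
# described in Section II; their exact interfaces are assumptions.
import cv2

def read_text(image_path):
    image = cv2.imread(image_path)
    boxes = detect_text_boxes(image)          # step 1: list of (x, y, w, h) text regions
    results = []
    for (x, y, w, h) in boxes:
        crop = image[y:y + h, x:x + w]        # cut out the detected region
        word = recognize_text(crop)           # step 2: decode the crop into text
        results.append(((x, y, w, h), word))
    return results
```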

II. METHODOLOGY

The ultimate goal is to design an end-to-end framework, which internally has two steps: text detection and recognition.

A. Text Detection

A text detector module (Figure 6) based on a fully convolutional neural network [2][3] was adopted to localize the text regions. Because natural scene images contain many small text boxes, we upscale the feature maps from 1/32 to 1/4 of the original input image size in the shared convolutions. After extracting the shared features, a convolution is applied to produce dense per-pixel predictions of text presence. The first channel computes the probability of each pixel being a positive sample. Similar to [2], pixels in a shrunk version of the original text region are considered positive. For each positive sample, the following four channels predict its distances to the top, bottom, left, and right sides of the bounding box that contains the pixel, and the last channel predicts the orientation of the related bounding box. Final detection results are produced by applying thresholding and non-maximum suppression (NMS) to these positive samples.
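As a rough illustration of how these per-pixel outputs could be turned into boxes, here is a hedged NumPy sketch of the decoding step. The 0.8 score threshold, the axis-aligned simplification (the orientation channel is ignored), and the assumption that the distance maps are in original-image pixels are ours, not taken from [2][3]; NMS is also omitted.

```python
import numpy as np

def decode_detections(score_map, geo_map, score_thresh=0.8, scale=4):
    """Turn the per-pixel score and distance maps into candidate boxes.

    score_map: (H, W) probability of text at each pixel (1/4 of input resolution).
    geo_map:   (H, W, 4) distances to the top, bottom, left, and right box edges,
               assumed here to be expressed in original-image pixels.
    The threshold and the axis-aligned simplification are assumptions; the
    orientation channel and NMS are omitted for brevity.
    """
    ys, xs = np.where(score_map > score_thresh)        # positive pixels
    boxes = []
    for y, x in zip(ys, xs):
        top, bottom, left, right = geo_map[y, x]
        cx, cy = x * scale, y * scale                   # pixel position in the input image
        boxes.append((cx - left, cy - top, cx + right, cy + bottom, score_map[y, x]))
    return boxes                                        # would normally be filtered with NMS
```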

The detection branch loss function is composed of a text classification loss and a bounding box regression loss. The text classification loss is a pixel-wise cross-entropy loss over the down-sampled score map, where only the shrunk version of the original text region is considered the positive area. The bounding box regression loss combines an intersection-over-union (IoU) term between the predicted and ground-truth bounding boxes with a one-minus-cosine term that accounts for the orientation. Following [3], the loss functions can be formulated as shown below.
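The formulas appear as images in the original post; the following is a reconstruction from the description above, in the spirit of [2][3]. The symbol names, the balancing weight λ, and the exact form of the IoU term (some implementations use −log IoU, others 1 − IoU) should be treated as assumptions.

```latex
% Pixel-wise cross-entropy over the down-sampled score map (classification),
% plus an IoU term and an orientation term (regression), following [2][3].
\begin{align}
L_{cls} &= \frac{1}{|\Omega|} \sum_{x \in \Omega}
           \Big( -p^{*}_{x} \log p_{x} - (1 - p^{*}_{x}) \log (1 - p_{x}) \Big) \\
L_{reg} &= \frac{1}{|\Omega|} \sum_{x \in \Omega}
           \Big( -\log \mathrm{IoU}(R_{x}, R^{*}_{x})
                 + \lambda \big(1 - \cos(\theta_{x} - \theta^{*}_{x})\big) \Big)
\end{align}
```

Here p_x and p*_x are the predicted and ground-truth scores at pixel x, R_x and R*_x the predicted and ground-truth boxes, θ_x and θ*_x their orientations, and Ω the set of positive (shrunk text region) pixels.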

The model was trained on the ICDAR 2015 dataset, which has 1,000 images (4,500 readable words), and validated on the ICDAR 2013 dataset, consisting of 229 images (848 words). The weights were updated with the Adam optimizer, using a learning rate of 0.001 and a batch size of 64. The model was trained for up to 500 epochs with early stopping.
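A hedged PyTorch-style sketch of this training setup follows; only the optimizer, learning rate, batch size, epoch budget, and early stopping come from the text, while the model, data loaders, loss function, and the patience value of 10 are placeholders.

```python
import torch

# Adam, lr = 0.001, batch size 64, up to 500 epochs with early stopping on
# validation loss. `detector`, `train_loader`, `val_loader`, and `detection_loss`
# are placeholders; the patience of 10 epochs is an assumption.
optimizer = torch.optim.Adam(detector.parameters(), lr=1e-3)
best_val, patience, bad_epochs = float("inf"), 10, 0

for epoch in range(500):
    detector.train()
    for images, targets in train_loader:              # batches of 64 images
        optimizer.zero_grad()
        loss = detection_loss(detector(images), targets)
        loss.backward()
        optimizer.step()

    detector.eval()
    with torch.no_grad():
        val_loss = sum(detection_loss(detector(x), y).item()
                       for x, y in val_loader) / len(val_loader)

    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        torch.save(detector.state_dict(), "best_detector.pth")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                    # early stopping
            break
```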

Figure 6. Text detection module.

The model performed well, as can be observed from the training loss and validation loss versus epoch curves shown in Figure 7. The decreasing curves converge, which is a clear indication that the model is learning to localize the text regions.

Figure 7. Train loss, validation loss vs epoch

B. Text Recognition

A word is a sequence of characters. If we assume each character appears at a defined time step, a word can be treated as a time series, and recurrent neural networks (RNNs) are particularly good at recognizing time series. We therefore model recognition as a sequence recognition problem, for which an attention-based sequence recognition network [4][5][6] is well suited. The model has two components, an encoder and a decoder, as shown in Figure 8. The encoder extracts a sequence representation from the input image, and the decoder recurrently generates a character sequence conditioned on the encoder's output sequence.

Figure 8. Text recognition module.

The encoder uses a convolutional recurrent neural network (CRNN) to extract features from the input image and map them to a sequence. As illustrated in Figure 8, at the bottom of the encoder are several convolutional layers. They produce feature maps that are robust, high-level descriptions of the input image. Suppose the feature maps have size D × H × W, where D, H, and W are the depth, height, and width respectively. The next operation converts the maps into a sequence of W vectors, each with D × H dimensions. Specifically, the “map-to-sequence” operation takes out the columns of the maps in left-to-right order and flattens each into a vector.
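A small PyTorch-style sketch of the map-to-sequence operation, assuming a batch of feature maps of shape (N, D, H, W):

```python
import torch

def map_to_sequence(feature_maps):
    """Convert CNN feature maps into a left-to-right sequence of column vectors.

    feature_maps: tensor of shape (N, D, H, W).
    Returns a sequence of W vectors per image, each of dimension D * H.
    """
    n, d, h, w = feature_maps.shape
    seq = feature_maps.permute(0, 3, 1, 2)   # (N, W, D, H): one entry per column
    seq = seq.reshape(n, w, d * h)           # flatten each column into a D*H vector
    return seq
```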

A two-layer Bidirectional Long Short-Term Memory (BLSTM) [7] network is then applied to the sequence to model its long-term dependencies. The BLSTM is a recurrent network that can analyze the dependencies within a sequence in both directions.
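Continuing the sketch, the two-layer BLSTM over that column sequence might look as follows; the feature sizes and the hidden size of 256 are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Two-layer bidirectional LSTM over the column sequence from map_to_sequence().
# Feature depth/height (512, 8), sequence length 25, and hidden size 256 are
# assumptions for illustration; input_size must equal D * H.
d, h, w = 512, 8, 25
blstm = nn.LSTM(input_size=d * h, hidden_size=256, num_layers=2,
                bidirectional=True, batch_first=True)

seq = torch.randn(1, w, d * h)    # stand-in for the map-to-sequence output
encoded, _ = blstm(seq)           # (1, W, 512): forward and backward states concatenated
```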

The decoder recurrently generates a sequence of characters, conditioned on the sequence produced by the encoder. It is a recurrent neural network with an attention structure [8][9], using an LSTM as the recurrent cell that generates the character sequence.

The generation is a T-step process. At step t, the decoder computes a vector of attention weights α(t) via the attention process described in [9][4], where s(t−1) is the state variable of the LSTM cell at the previous step. The states are updated via the recurrent process of the LSTM. Finally, the probability over the character labels is calculated with the softmax function, and the character with the highest probability is taken as the prediction. The loss is evaluated as the negative log-likelihood over the dataset.
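The attention equations, shown as images in the original post, follow the standard additive attention of [8][9] as used in [4]; the reconstruction below uses illustrative symbol names.

```latex
% Additive (Bahdanau-style) attention over the encoder outputs h_1..h_W,
% following [8][9][4]; W_s, W_h, W_o, w, b, b_o are learned parameters and
% the symbol names are illustrative.
\begin{align}
e_{t,i}      &= w^{\top} \tanh\!\big(W_{s}\, s_{t-1} + W_{h}\, h_{i} + b\big) \\
\alpha_{t,i} &= \frac{\exp(e_{t,i})}{\sum_{j=1}^{W} \exp(e_{t,j})} \\
g_{t}        &= \sum_{i=1}^{W} \alpha_{t,i}\, h_{i} \\
s_{t}        &= \mathrm{LSTM}\big(s_{t-1},\, [\,g_{t};\ \mathrm{emb}(y_{t-1})\,]\big) \\
p(y_{t})     &= \mathrm{softmax}(W_{o}\, s_{t} + b_{o})
\end{align}
```

Here h_i are the encoder (BLSTM) outputs, g_t is the glimpse (context) vector, s_t the decoder LSTM state, and y(t−1) the previously emitted character.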

The model was trained on the VGG synthetic text dataset, which has 7,224,612 images, and validated on a validation set of 6,400 images. The weights were updated with the Adam optimizer, using a learning rate of 0.001 and a batch size of 64. The model was trained for 10 epochs, and the best model was picked based on the minimal validation loss.

III. RESULTS

The trained models were tested on real-world chemical reagent images. The detection results are shown in Figure 9 and the recognition results in Figure 10.

Figure 9. Results from text detection.

Figure 10. Results from text recognition.

It can be observed that there are still missing boxes in the detection, and false recognitions occur when the bounding box falls just outside the text region. These models need further revision and improvement to give better performance.

Finally, the two models were combined to create an end-to-end model, as illustrated in Figure 11.

Figure 11. End-to-end model.

IV. CONCLUSION

The models successfully detect and recognize the text, but there is still room for improvement. The tool can help scientists easily upload off-the-shelf chemical reagents to the digital lab inventory, making onboarding seamless and a matter of seconds. The main limitation lies in capturing all the chemical reagents: the reagents must always be arranged in line of sight, facing the camera.

As future work, the models need to be trained on more suitable datasets, and the architectures require innovative modifications. Creating synthetic chemical reagent image data to train the models would help improve performance. The model can also be extended to other use cases, such as recognizing text from papers and old lab notebooks.

References

[1] Tesseract OCR. Available at: https://opensource.google.com/projects/tesseract

[2] X. Zhou et al., “EAST: An Efficient and Accurate Scene Text Detector,” Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 2642–2651, 2017.

[3] X. Liu, D. Liang, S. Yan, D. Chen, Y. Qiao, and J. Yan, “FOTS: Fast Oriented Text Spotting with a Unified Network,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 5676–5685, 2018.

[4] B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai, “Robust Scene Text Recognition with Automatic Rectification,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 2016-December, pp. 4168–4176, 2016.

[5] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao and X. Bai, “ASTER: An Attentional Scene Text Recognizer with Flexible Rectification,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 9, pp. 2035–2048, 1 Sept. 2019. doi: 10.1109/TPAMI.2018.2848939

[6] C. Luo, L. Jin, and Z. Sun, “A Multi-Object Rectified Attention Network for Scene Text Recognition,” 2019.

[7] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[8] D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” CoRR, abs/1409.0473, 2014.

[9] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-Based Models for Speech Recognition,” CoRR, abs/1506.07503, 2015.
