Review Paper: “PP-OCR: A Practical Ultra Lightweight OCR System” — Part I

In this post, I’ll review PP-OCR, a practical ultra-lightweight OCR system that can be easily deployed on edge devices such as cameras and mobile phones.

Anh Tuan
5 min read · Apr 18, 2022

Original paper: https://arxiv.org/pdf/2009.09941.pdf


In this paper, the authors propose a practical ultra-lightweight OCR system: PP-OCR. The overall model size of PP-OCR is only 3.5M when recognizing 6622 Chinese characters and 2.8M when recognizing 63 alphanumeric symbols. Several pre-trained models for Chinese and English recognition are released, including a text detector (trained on 97K images), a direction classifier (600K images), and a text recognizer (17.9M images). Besides, the proposed PP-OCR system is also verified on several other language recognition tasks, including French, Korean, Japanese, and German.

Fig. 1: Some image results of the proposed PP-OCR system. ([1])

My paper summary will be in 2 parts:

  • Part I: Review of the overall architecture and the text detector
  • Part II: Review of the direction classifier and the text recognizer

Besides, I’ll write a post about how to set up, train, and test the model.

1. Overall Architecture

Fig. 2: The framework of the proposed PP-OCR ([2])

Figure 2 shows the overall architecture of PP-OCR. It has three parts: text detection, detected boxes rectification and text recognition.
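As a rough sketch of how these three stages chain together (the function arguments below are illustrative stand-ins, not the real PaddleOCR API):

```python
def ocr_pipeline(image, detect_boxes, rectify, classify_direction, recognize):
    """Sketch of the PP-OCR flow: detection -> rectification -> recognition."""
    texts = []
    for box in detect_boxes(image):          # 1. text detection (DB)
        crop = rectify(image, box)           # warp the quad into a horizontal rect
        if classify_direction(crop) == 180:  # 2. direction classifier
            crop = crop[::-1]                # rotate 180 degrees (sketch)
        texts.append(recognize(crop))        # 3. text recognition (CRNN)
    return texts
```

Each stage is an independent model, which is why the paper can slim each one separately.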

Text detection: The authors use Differentiable Binarization (DB) as text detector. They propose six strategies to improve effectiveness and efficiency (the model size of the text detector is reduced to 1.4M):

  • Light backbone
  • Light head
  • Remove SE module
  • Cosine Learning Rate Decay
  • Learning rate warm-up
  • FPGM Pruner

Detected Boxes Rectification: After a text box is detected, it is transformed into a horizontal rectangle for the text recognition task. The authors use a classification model to determine the text direction. They use four strategies to enhance the model’s ability and reduce its size (the model size of the text direction classifier is only 500KB):

  • Light backbone
  • Data augmentation
  • Input resolution
  • PACT quantization

Text Recognition: The authors use CRNN as the text recognizer. CRNN integrates feature extraction and sequence modeling. They propose nine strategies to improve the model (the model size of the text recognizer is only 1.6M):

  • Light backbone
  • Data augmentation
  • Cosine learning rate decay
  • Feature map resolution
  • Regularization parameters
  • Learning rate warm-up
  • Light head
  • Pre-trained model
  • PACT Quantization

2. Enhancement or Slimming Strategies

2.1 Text Detection

Fig. 3: Architecture of the text detector DB. ([2])

The authors use DB as the text detector. Because DB is based on a segmentation network, it can accurately describe scene text of various shapes, such as curved text. The details of the DB model are in [2]; I’ll write a separate review about it. The official PyTorch code of DB is available on GitHub.

Light Backbone: To reduce the DB model’s size, the authors test and compare models that are often used as light backbones in classification tasks, such as MobileNetV1, MobileNetV2, MobileNetV3, and ShuffleNetV2.

Fig. 4: The performance of some backbones on the ImageNet classification ([1])

Figure 4 shows the performance of some backbones on the ImageNet classification task. To balance accuracy and efficiency, the authors adopt MobileNetV3_large_x0.5. Note that in the official code, pre-trained weights and evaluation metrics have been released for 122 models: ResNet, DenseNet, …

Light Head: Similar to object detection models, the head of the text detector uses feature maps of different scales to improve detection of small text regions. The authors reduce the number of inner channels of the head’s 1x1 convolutions from 256 to 96. As a result, the model size is reduced from 7M to 4.1M, while the accuracy declines only slightly.
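To see why shrinking the inner channels helps, here is a back-of-the-envelope parameter count for a 1x1 convolution (the 480-channel input is an assumption for illustration; only the 256 → 96 reduction comes from the paper):

```python
def conv1x1_params(c_in, c_out):
    """Parameters of a 1x1 convolution: in_channels * out_channels (bias ignored)."""
    return c_in * c_out

for inner in (256, 96):
    # a lateral 1x1 conv mapping a hypothetical 480-channel backbone stage
    # into the head's inner channels
    print(f"inner={inner}: {conv1x1_params(480, inner)} params")
```

Every 1x1 conv in the head scales linearly with the inner channel count, so 256 → 96 cuts those layers to under 40% of their original size.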

Fig. 5: Architecture of the SE block ([1])

Remove SE: The SE (squeeze-and-excitation) block [3] adaptively recalibrates channel-wise feature responses by explicitly modeling interdependencies between channels. The authors find that when the input resolution is large (640x640), it is hard to estimate the channel-wise feature responses with the SE block. So they remove the SE blocks from the backbone; the model size is reduced from 4.1M to 2.5M, and the accuracy is unaffected.
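For reference, a minimal numpy sketch of what an SE block computes on a single (C, H, W) feature map (the weight shapes are illustrative; real implementations use learned 1x1 convolutions):

```python
import numpy as np

def se_block(x, w1, w2):
    """Squeeze-and-Excitation on a (C, H, W) feature map (sketch).

    w1: (C, C//r) squeeze FC weights, w2: (C//r, C) excitation FC weights.
    """
    z = x.mean(axis=(1, 2))              # squeeze: global average pool -> (C,)
    s = np.maximum(z @ w1, 0)            # FC + ReLU -> (C//r,)
    s = 1.0 / (1.0 + np.exp(-(s @ w2)))  # FC + sigmoid -> (C,) channel gates
    return x * s[:, None, None]          # recalibrate each channel
```

The squeeze step collapses each channel to one scalar, which is exactly the statistic that becomes hard to estimate reliably at large input resolutions.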

Cosine Learning Rate Decay: The authors use cosine learning rate decay during training. In the early stage of training, we can use a relatively large learning rate for faster convergence. In the late stage of training, the weights are close to the optimal values, so a relatively smaller learning rate should be used.

Learning rate warm-up: The authors use learning rate warm-up from [4] in the training process. At the beginning of training, all parameters are typically random values and therefore far from the final solution. Using too large a learning rate may result in numerical instability. With the warm-up heuristic, they use a small learning rate at the beginning and then switch back to the initial learning rate once training is stable.
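These two scheduling heuristics, linear warm-up followed by cosine decay, can be sketched as a single function (the step counts and base rate below are illustrative, not the paper’s hyperparameters):

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_steps):
    """Linear warm-up to base_lr, then cosine decay toward 0."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps       # ramp up from near 0
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * t))   # cosine decay to ~0
```

The schedule peaks exactly at the end of warm-up and falls smoothly to near zero by the last step.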

FPGM Pruner: FPGM [5] is a method that compresses CNN models by pruning filters with redundancy, rather than those with “relatively less” importance. With FPGM pruning, the model size of the text detector is reduced from 2.5M to its final 1.4M.
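A simplified numpy sketch of the FPGM idea follows: filters whose total distance to all other filters is smallest sit near the geometric median of the filter set, so they are the most replaceable and are pruned first. (Summing pairwise distances here is a proxy for closeness to the geometric median; the real pruner operates on trained conv weights layer by layer.)

```python
import numpy as np

def fpgm_prune_indices(filters, prune_ratio=0.25):
    """Return indices of the most redundant filters (FPGM idea, simplified).

    filters: (N, ...) array, one conv filter per row.
    """
    f = filters.reshape(len(filters), -1)
    # pairwise Euclidean distances between flattened filters
    d = np.linalg.norm(f[:, None, :] - f[None, :, :], axis=-1)
    scores = d.sum(axis=1)               # low score -> close to the others -> redundant
    n_prune = int(len(filters) * prune_ratio)
    return np.argsort(scores)[:n_prune]  # indices to prune
```

Note that a clustered (redundant) filter is pruned even if its weight norm is large, which is exactly how FPGM differs from norm-based “small importance” pruning.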

3. Conclusion

In this article, I reviewed the text detector in PP-OCR, a lightweight OCR system. Through the strategies above, the text detector shrinks to only 1.4M while keeping good accuracy. In Part II, I will review the direction classifier and the text recognizer in PP-OCR.

You can find the official source code in the PaddleOCR repository on GitHub.

If you have any questions, please comment below or contact me via LinkedIn or GitHub.

If you enjoyed this, please consider supporting me.

Resources:

[1] PP-OCR: https://arxiv.org/pdf/2009.09941.pdf

[2] Differentiable Binarization: https://arxiv.org/pdf/1911.08947.pdf

[3] Squeeze-and-Excitation Networks: https://arxiv.org/pdf/1709.01507.pdf

[4] Learning rate warm-up: https://arxiv.org/pdf/1812.01187.pdf

[5] FPGM Pruner: https://arxiv.org/pdf/1811.00250.pdf
