Advancement of Paddle-OCR over time: An excellent Ablation

My thoughts on building an OCR system


In the past 6 years, I have had the opportunity to work on more than 4 OCR projects. I have worked on OCR modules continuously, from the early days of OCR in the industry to today, when OCR systems are far more advanced and capable. The data I have worked with includes street-level images, handwritten bank cheques, ATM slips, invoices and food nutrition labels. I have practical experience with contour-based algorithms for text detection, traffic sign classification using SVM and GrabCut, Tesseract, Calamari, EAST (An Efficient and Accurate Scene Text Detector), and Paddle-OCR.

After working on many OCR projects, I can say that building a successful OCR system is like creating art by sewing together modules that work in harmony to perform text detection and recognition on your custom dataset. There is no single GitHub repository you can pull and run as your production model. A successful OCR system comprises precise object segmentation, a robust image super-resolution deep learning model if required, a high-recall text detection deep learning model, a well-written pre-processing script, a high-precision text recognition module, a well-written post-processing script, and a data scientist who has viewed and worked on as much varied test data as possible. Every OCR system tends to face a trade-off between speed and accuracy.

A broad classification of OCR datasets is scene text and document text, which can be digital or handwritten.

Let's look at a few dynamic factors influencing scene text data: perspective, scaling, bending, noise, fonts, multiple languages, motion and radial blur, and illumination.

Dynamic factors influencing Scene text data

Dynamic factors influencing document text data: high density of text on documents, long text, and the need to structure the results.

The scope of this blog is to quickly understand the evolution of Paddle-OCR from v1 to v3 and pick the one that works best for you.

If you are also excited about this topic, join me on this journey by following my page so that you stay updated with my blog progress. If you wish to work with me, ping me on LinkedIn and we can discuss this further.

What is Paddle-OCR?

Multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, supports 80+ languages recognition, provides data annotation and synthesis tools, supports training and deployment among server, mobile, embedded and IoT devices).

Overview of the excellent capabilities of Paddle-OCR

Advancement of Paddle-OCR over time

1. PP-OCR: A Practical Ultra Lightweight OCR System (15 Oct 2020)

The authors introduce a bag of strategies to either enhance the model's ability or make the model lighter.

  • Language support — English, Chinese, French, Korean
  • Consists of three parts: text detection, detected box rectification and text recognition
  • Text detection: Real-time Scene Text Detection with Differentiable Binarization (Paper). To further improve its effectiveness and efficiency, the following six strategies are used: light backbone, light head, removal of the squeeze-and-excitation (SE) module from MobileNetV3, cosine learning rate decay, learning rate warm-up, and the FPGM pruner (by gradually pruning unimportant filters, an excessive drop in model accuracy is avoided). The model size of the text detector is reduced to 1.4M. The head of the text detector is similar to the FPN.
With the development of image classification, several light backbones have become available; the authors compare the performance of the MobileNetV1, MobileNetV2, MobileNetV3 and ShuffleNetV2 series on ImageNet 1000-class classification. The inference time is tested on a Snapdragon 855 (SD855) with the batch size set to 1. As for the choice of scale, the authors adopt MobileNetV3 large x0.5 to balance accuracy and efficiency empirically.
  • Detection box rectification: Text direction classification is performed after the text box transform, as part of text box rectification. Four strategies are used to enhance the model's ability and reduce the model size: light backbone, data augmentation, input resolution and PACT quantization. The final model size of the text direction classifier is 500KB. Because the text direction classification task is relatively simple, the authors use MobileNetV3 small x0.35 to balance accuracy and efficiency empirically.
  • Text recognition: PP-OCR uses CRNN (Shi, Bai, and Yao 2016) as the text recognizer. The Convolutional Recurrent Neural Network (CRNN) combines a CNN, an RNN and the Connectionist Temporal Classification (CTC) loss for image-based sequence recognition tasks, such as scene text recognition and OCR. To enhance the model's ability and reduce the model size of the text recognizer, the following nine strategies are used: light backbone, data augmentation, cosine learning rate decay, feature map resolution, regularization parameters, learning rate warm-up, light head, pre-trained model and PArameterized Clipping acTivation (PACT) quantization. (As a side note, KL divergence measures how much information is lost when one distribution is used to approximate another.) The final model size of the text recognizer is only 1.6M for Chinese and English recognition and 900KB for alphanumeric symbol recognition. MobileNetV3 small x0.5 is selected to balance accuracy and efficiency empirically. If you're not that sensitive to the model size, MobileNetV3 small x1.0 is also a good choice: the model size increases by just 2M, and the accuracy improves noticeably.
PP-OCR — A lot of ablation experiments to show the effects of different strategies
The architecture of the text recognizer CRNN. This figure comes from the paper (Shi, Bai, and Yao 2016). The red and grey rectangles show the backbone and head of the text recognizer, respectively.
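To make the CTC part of CRNN a little more concrete, here is a minimal sketch of greedy (best-path) CTC decoding: repeated labels are collapsed first, then blanks are dropped. The alphabet, the blank index and the per-frame predictions are made-up values for illustration, not something taken from PP-OCR.

```python
from itertools import groupby

def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse consecutive repeated labels, then remove the blank label."""
    collapsed = [label for label, _ in groupby(frame_ids)]
    return [label for label in collapsed if label != blank]

# Hypothetical alphabet: index 0 is assumed to be the CTC blank.
alphabet = ["-", "c", "a", "t"]
# Hypothetical per-frame argmax of the recognizer output.
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 3]

decoded = "".join(alphabet[i] for i in ctc_greedy_decode(frames))
print(decoded)  # prints "cat"
```

Note that a blank between two identical labels keeps them apart, which is how CTC can still output doubled characters.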

I highly recommend going through the detailed tables of experiments in which the authors improve the results by changing the combination of training strategies, a practice known as ablation in artificial intelligence.
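Two of the training strategies listed for PP-OCR, learning rate warm-up and cosine learning rate decay, are easy to picture with a few lines of code. The sketch below assumes a linear warm-up followed by a cosine decay to zero; the function name, base learning rate and step counts are hypothetical and are not the values used by the authors.

```python
import math

def lr_at(step, total_steps, base_lr=0.001, warmup_steps=100):
    """Learning rate at a given step: linear warm-up, then cosine decay to 0."""
    if step < warmup_steps:
        # Ramp up linearly so that training starts with small updates.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warm-up schedule that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, total_steps=1000))
```

The learning rate peaks at the end of the warm-up and then falls smoothly, slowly at first and towards the end, faster in the middle.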

2. PP-OCRv2: Bag of Tricks for Ultra Lightweight OCR System (12 Oct 2021)

PP-OCRv2 created a bag of tricks to train a better text detector and a better text recognizer, which include Collaborative Mutual Learning (CML), CopyPaste, Lightweight CPU Network (PP-LCNet), Unified-Deep Mutual Learning (U-DML) and Enhanced CTCLoss.

Experiments on real data show that the precision of PP-OCRv2 is 7% higher than PP-OCR under the same inference cost. It is also comparable to the PP-OCR server models, which use the ResNet series as backbones.

The framework of the proposed PP-OCRv2. The strategies in the green boxes are the same as PP-OCR. The strategies in the orange boxes are the newly added ones in the PP-OCRv2. The strategies in the grey boxes were adopted by the PP-OCRv2-tiny.

What is new in PP-OCRv2

Text Detection

1. Collaborative Mutual Learning (CML) is used for text detection distillation. General distillation has two problems: first, if the accuracy of the teacher model is close to that of the student model, the improvement brought by the general distillation method is limited; second, if the structures of the teacher model and the student model are quite different, the improvement is also very limited. The CML framework is a super network composed of multiple models, named student models and teacher models respectively. With CML, the accuracy of the student after distillation can exceed the accuracy of the teacher model in text detection.
2. CopyPaste augmentation

Text Recognition

Lightweight CPU Network (PP-LCNet): To get a better accuracy-speed trade-off on Intel CPUs, the authors designed a lightweight backbone for Intel CPUs, which provides a faster and more accurate OCR recognition algorithm with MKLDNN (Math Kernel Library for Deep Neural Networks) enabled.

PP-LCNet network structure. The dotted box represents optional modules. The stem part uses standard convolution. DepthSepConv means depthwise separable convolutions, DW means depthwise convolutions, PW means pointwise convolutions, and GAP means global average pooling.
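To see why depthwise separable convolutions are attractive for a lightweight backbone, a quick parameter count helps. The sketch below compares a standard convolution with a depthwise (DW) plus pointwise (PW) pair; biases are ignored and the channel sizes are arbitrary example values, not the PP-LCNet configuration.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthsep_conv_params(k, c_in, c_out):
    """Weights of a depthwise k x k conv followed by a pointwise 1x1 conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 conv that mixes the channels
    return depthwise + pointwise

# Example sizes chosen only for illustration.
print(standard_conv_params(3, 64, 128))  # prints 73728
print(depthsep_conv_params(3, 64, 128))  # prints 8768
```

For these sizes the separable version needs roughly 8 times fewer weights.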

Better activation function: To increase the fitting ability of MobileNetV1, the authors replaced the ReLU activation function in the network with H-Swish, which brings a significant improvement in accuracy with only a slight increase in inference time.
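For reference, H-Swish is commonly defined as x * ReLU6(x + 3) / 6. Here is a small scalar sketch next to ReLU; the function names are my own.

```python
def relu(x):
    return max(0.0, x)

def relu6(x):
    """ReLU clipped at 6."""
    return min(max(0.0, x), 6.0)

def h_swish(x):
    """Hard swish: x * ReLU6(x + 3) / 6, a piecewise approximation of swish."""
    return x * relu6(x + 3.0) / 6.0

for value in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(value, relu(value), h_swish(value))
```

Unlike ReLU, H-Swish lets small negative inputs through with a small negative output, and it needs no exponential, which keeps it cheap at inference time.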

SE modules at appropriate positions: The SE module increases the inference time, so it cannot be used throughout the whole network. Through extensive experiments, the authors found that the closer to the tail of the network, the more effective the SE module is. So the SE module is added only to the blocks near the tail of the network.

Larger convolution kernels: The size of the convolution kernel often affects the final performance of the network. In MixNet (Tan and Le 2019), the authors analyzed the effect of differently sized convolution kernels on the performance of the network, and finally mixed differently sized kernels in the same layer of the network. However, such a mixture slows down the inference speed of the model, so the PP-LCNet authors tried to increase the size of the convolution kernels with as little increase in inference time as possible.

Larger dimensional 1x1 conv layer after GAP: In PP-LCNet, the output dimension of the network after GAP is small, and directly connecting the final classification layer would lose the combination of features. To give the network a stronger fitting ability, the authors connected a 1280-dimensional 1x1 conv to the final GAP layer, which increases the model size with little increase in inference time.

Unified-Deep Mutual Learning (U-DML): Deep mutual learning (Zhang et al. 2017) is a method in which two student networks learn from each other, so a larger teacher network with pre-trained weights is not required for knowledge distillation.
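The "learning from each other" part of deep mutual learning is usually expressed as a KL divergence between the predictions of the two students. The sketch below computes a symmetric version of that term for a single prediction; the symmetric averaging, the function names and the logits are my assumptions for illustration, and the supervised loss against the ground truth that each student also receives is left out.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of the same length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mutual_loss(logits_a, logits_b):
    """Symmetric KL term that pulls the two students' predictions together."""
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    return 0.5 * (kl_divergence(p_a, p_b) + kl_divergence(p_b, p_a))

# Hypothetical logits from two student networks for the same input.
print(mutual_loss([2.0, 0.0, -1.0], [0.0, 1.0, 0.5]))
print(mutual_loss([2.0, 0.0, -1.0], [2.0, 0.0, -1.0]))
```

The term is zero when both students agree and grows as their predicted distributions drift apart.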

Enhanced CTCLoss: There are many similar characters in Chinese recognition tasks. Their differences in appearance are very small, so they are often misrecognized. In PP-OCRv2, the authors designed an enhanced CTCLoss, which combines the original CTCLoss with the idea of CenterLoss (Wen et al. 2016) from metric learning. Some improvements are made to make it suitable for sequence recognition tasks.
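As a rough picture of how a center loss can sit on top of CTC, the sketch below pulls each feature vector towards the center of its class and adds that penalty to an already computed CTC loss value. This is a simplified illustration, not the authors' implementation: the averaging over samples, the `weight` value and all names are assumptions, and the sequence-specific adaptations mentioned above are not modelled.

```python
def center_loss(features, labels, centers):
    """Mean of 0.5 * squared distance between each feature and its class center."""
    total = 0.0
    for feat, label in zip(features, labels):
        center = centers[label]
        total += 0.5 * sum((f - c) ** 2 for f, c in zip(feat, center))
    return total / len(features)

def enhanced_loss(ctc_loss_value, features, labels, centers, weight=0.05):
    """CTC loss plus a weighted center loss (weight is a hypothetical value)."""
    return ctc_loss_value + weight * center_loss(features, labels, centers)

# Hypothetical 2-D features for two characters and their class centers.
features = [[1.0, 0.0], [0.0, 2.0]]
labels = [0, 1]
centers = {0: [1.0, 0.0], 1: [0.0, 0.0]}

print(center_loss(features, labels, centers))  # prints 1.0
print(enhanced_loss(2.0, features, labels, centers, weight=0.1))
```

Features of look-alike characters are pushed towards different centers, which is the intuition behind using it for similar Chinese characters.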

On ImageNet-1k, PP-LCNet-1.0x reaches an accuracy of 71.32%, the highest in the comparison, with an inference time of 3.16 ms. The CPU used in the test is an Intel(R) Xeon(R) Gold 6148, the image resolution is 224x224, and the batch size is 1.

3. PP-OCRv3: More Attempts for the Improvement of Ultra Lightweight OCR System

A more robust OCR system PP-OCRv3 is proposed in this paper. PP-OCRv3 upgrades the text detection model and text recognition model in 9 aspects based on PP-OCRv2.

For the text detector, the authors introduce a PAN module with a large receptive field named LK-PAN, an FPN module with a residual attention mechanism named RSE-FPN, and a DML distillation strategy.

For text recognition, the authors introduce a lightweight text recognition network SVTR-LCNet, guided training of CTC by attention (GTC), the data augmentation strategy TextConAug, a better pre-trained model from the self-supervised TextRotNet, U-DML, and UIM to accelerate the model and improve its effectiveness.

Experiments show that the Hmean of PP-OCRv3 outperforms PP-OCRv2 by 5% with comparable inference speed.
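Hmean here is the harmonic mean of precision and recall (the F1 score). A tiny sketch with made-up numbers:

```python
def hmean(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical detector: 80% precision, 60% recall.
print(hmean(0.8, 0.6))
```

Because it is a harmonic mean, Hmean stays low unless both precision and recall are high, which is why it is a convenient single number for comparing detectors.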

The framework of the proposed PP-OCRv3. Strategies in the green boxes are the same as PP-OCRv2. Strategies in the pink boxes are the newly added ones in the PP-OCRv3. Strategies in the grey boxes are adopted by PP-OCRv3 tiny models.

The overall framework of PP-OCRv3 is the same as that of PP-OCRv2, which consists of three parts: text detection, detected box rectification and text recognition. In PP-OCRv3, the text detection model and the text recognition model are each further optimized.

Text Detection

The training framework of the PP-OCRv3 detection model is still CML (Collaborative Mutual Learning) distillation, which was proposed in PP-OCRv2. The main idea of CML is to combine the traditional distillation strategy of a Teacher guiding Students with DML (Deep Mutual Learning), which allows the Student networks to learn from each other.

CML distillation framework of PP-OCRv3 detection model

For the teacher model, a PAN module with a large receptive field named Large Kernel PAN (LK-PAN) is proposed and the DML distillation strategy is adopted; for the student model, an FPN module with a residual attention mechanism named RSE-FPN is proposed.

Pixel Aggregation Network (PAN)

An efficient and accurate arbitrary-shaped text detector, termed Pixel Aggregation Network (PAN), is equipped with a low computational-cost segmentation head and learnable post-processing. More specifically, the segmentation head is made up of a Feature Pyramid Enhancement Module (FPEM) and a Feature Fusion Module (FFM). FPEM is a cascadable U-shaped module, which can introduce multi-level information to guide better segmentation. FFM can gather the features given by the FPEMs of different depths into a final feature for segmentation. The learnable post-processing is implemented by Pixel Aggregation (PA), which can precisely aggregate text pixels by predicted similarity vectors. Experiments on several standard benchmarks validate the superiority of the proposed PAN. It is worth noting that the PAN method can achieve a competitive F-measure of 79.9% at 84.2 FPS on CTW1500. Note that the acronym PAN is also used for the Path Aggregation Network, and the "path augmentation" that LK-PAN modifies is a term from that design.

Large Kernel PAN (LK-PAN): The main idea is to increase the convolution kernel size in the path augmentation of the PAN module from 3x3 to 9x9, which can improve the receptive field of each pixel of the feature map, making it easier to detect text in large fonts and text with extreme aspect ratios.

Schematic diagram of LK-PAN
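The effect of a larger kernel on the receptive field can be checked with simple arithmetic. The sketch below uses the standard recurrence for a stack of convolutions; the layer stacks are arbitrary examples, not the actual LK-PAN layout.

```python
def receptive_field(layers):
    """Receptive field of stacked convs, given (kernel_size, stride) per layer."""
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump  # each layer widens the field by (k - 1) * jump
        jump *= stride                # stride spreads later layers further apart
    return field

print(receptive_field([(3, 1)]))          # prints 3
print(receptive_field([(9, 1)]))          # prints 9
print(receptive_field([(3, 1), (3, 1)]))  # prints 5
```

A single 9x9 layer sees as far as four stacked 3x3 layers, at the cost of 81 instead of 9 weights per input-output channel pair.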

Residual Squeeze-and-Excitation FPN (RSE-FPN): RSE-FPN introduces a residual attention mechanism by replacing the convolution layers in FPN with RSEConv, to improve the representation ability of the feature map.

Schematic diagram of RSE-FPN
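To give a feel for what "residual attention" means here, the sketch below applies a squeeze-and-excitation gate to a feature map and adds the result back to the input. It is a toy version under several assumptions: the convolution that would come first in RSEConv is omitted, a plain sigmoid is used for the gate, and the weights and names are hypothetical. Each channel is stored as a flat list of values.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(weights, vec):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def squeeze_excite(channels, w_reduce, w_expand):
    """Rescale each channel by a gate computed from global channel statistics."""
    squeezed = [sum(ch) / len(ch) for ch in channels]           # global average pooling
    hidden = [max(0.0, h) for h in matvec(w_reduce, squeezed)]  # reduce + ReLU
    gates = [sigmoid(g) for g in matvec(w_expand, hidden)]      # expand + gate in (0, 1)
    return [[gate * v for v in ch] for gate, ch in zip(gates, channels)]

def residual_se(channels, w_reduce, w_expand):
    """Residual connection around the SE block: output = x + SE(x)."""
    attended = squeeze_excite(channels, w_reduce, w_expand)
    return [[x + a for x, a in zip(ch, att)] for ch, att in zip(channels, attended)]

# Two channels with two spatial positions each; zero weights give gates of 0.5.
feature_map = [[1.0, 3.0], [2.0, 4.0]]
print(residual_se(feature_map, w_reduce=[[0.0, 0.0]], w_expand=[[0.0], [0.0]]))
```

The shortcut means that even when a gate is close to zero the original feature is not suppressed, which is a plausible reason to prefer it in an FPN with few channels.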

DML (Deep Mutual Learning for the teacher model): DML can effectively improve the accuracy of the text detection model by having two models with the same structure learn from each other. The DML strategy is adopted in the teacher model training to improve the Hmean of the teacher model as much as possible.

Text Recognition

The recognition model of PP-OCRv3 is optimized based on the text recognition algorithm SVTR (Scene Text Recognition with a Single Visual Model). SVTR no longer involves an RNN (Recurrent Neural Network); instead it introduces a transformer structure, which can mine the context information of text line images more effectively. To make SVTR more practical, the authors adopt six strategies to optimize and accelerate the model.

SVTR-LCNet (Lightweight Text Recognition Network): SVTR-LCNet is a lightweight text recognition network that fuses the Transformer-based network SVTR (Du et al. 2022) and the lightweight CNN-based network PP-LCNet (Cui et al. 2021).

GTC (Guided Training of CTC by Attention): Connectionist Temporal Classification (CTC) and the attention mechanism are the two main approaches used in recent scene text recognition works. Compared with attention-based methods, the CTC decoder can achieve a much faster prediction speed, but lower accuracy. To obtain an efficient and effective model, the authors use an attention module to guide the training of CTC to fuse multiple features, following the GTC (Hu et al. 2020) method, which is effective for improving accuracy. As the attention module is completely removed during prediction, no extra time cost is added to the inference process.

TextConAug (Data Augmentation for Mining Text Context Information): TextConAug is a data augmentation strategy for mining textual context information. The main idea comes from the paper ConCLR (Zhang et al. 2022), in which the authors proposed the data augmentation strategy ConAug, which concatenates 2 different images in a batch to form new images and performs self-supervised contrastive learning.
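The concatenation step behind this kind of augmentation is simple to sketch. Below, two text-line images (stored as lists of pixel rows) are joined side by side and their labels are joined the same way; the function name is hypothetical, and resizing to a common height as well as the contrastive learning part are not covered.

```python
def concat_text_lines(img_a, label_a, img_b, label_b):
    """Join two text-line images horizontally and concatenate their labels."""
    if len(img_a) != len(img_b):
        raise ValueError("both images must share the same height")
    image = [row_a + row_b for row_a, row_b in zip(img_a, img_b)]
    return image, label_a + label_b

# Tiny made-up grayscale images: 2 rows each, widths 2 and 1.
img_a = [[1, 2], [3, 4]]
img_b = [[5], [6]]
image, label = concat_text_lines(img_a, "ab", img_b, "c")
print(image)  # prints [[1, 2, 5], [3, 4, 6]]
print(label)  # prints abc
```

The new sample shows the same characters in a different textual context, which is the point of the augmentation.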

TextRotNet (Self-Supervised Pre-trained Model): TextRotNet is a model pre-trained on a large amount of unlabeled text line data in a self-supervised manner, following the previous work STR-Fewer-Labels.

U-DML (Unified-Deep Mutual Learning): U-DML is a strategy proposed in PP-OCRv2 that is very effective at improving accuracy without increasing the model size. In PP-OCRv3, for the two different structures, SVTR-LCNet and the attention module, the feature map of PP-LCNet, the output of the SVTR module and the output of the attention module are supervised and trained simultaneously.

UIM (Unlabeled Images Mining): UIM is a simple unlabeled data mining strategy. The main idea is to use a high-precision text recognition model to predict unlabeled images to obtain pseudo-labels, and to select samples with high prediction confidence as training data for lightweight models.
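The selection step of this strategy boils down to a confidence filter. In the sketch below, the predictions are assumed to come from a large recognizer as (image path, text, confidence) tuples; the file names, the texts and the 0.95 threshold are made up for illustration.

```python
def select_pseudo_labels(predictions, min_confidence=0.95):
    """Keep (image_path, text) pairs whose prediction confidence is high enough."""
    return [
        (path, text)
        for path, text, confidence in predictions
        if confidence >= min_confidence
    ]

# Hypothetical outputs of a high-precision recognizer on unlabeled images.
predictions = [
    ("img_001.jpg", "INVOICE", 0.99),
    ("img_002.jpg", "T0TAL", 0.62),
    ("img_003.jpg", "Amount", 0.97),
]
print(select_pseudo_labels(predictions))
```

Only the first and third samples survive the filter, and those pairs could then be mixed into the training set of the lightweight model.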

Difference between tesseract-ocr and PaddleOCR


Experiments show that the precision of PP-OCRv2 is 7% higher than PP-OCR under the same inference cost.

The authors propose a more robust OCR system, PP-OCRv3, which involves 9 improvements, 3 of which are for the detector and 6 for the recognizer. Experiments demonstrate that the Hmean of PP-OCRv3 outperforms PP-OCRv2 by 5% with the same prediction cost.



