Recognizing digits of loyalty cards using CNN and Tesseract 4

Rimac Valdez Quispe
Published in Stocard · Dec 8, 2017 · 9 min read
Example digits you can find on loyalty cards

During the last six months, I worked on an OCR project at Stocard. With this blog post I want to share some of my results and insights with you!

The goal of my research

With Stocard you can digitalize all your physical loyalty cards and carry them around inside your smartphone. This means that the dark era of carrying around a bazillion plastic cards inside your wallet has finally come to an end. :-)

Jokes aside, one of our main goals at Stocard is to make the app as user-friendly as possible. A lot of cards have a barcode, and those are quite easy to digitalize with a barcode reader. But there are also cards which do not have a barcode, and until now users had to type in the number manually.

Loyalty cards in Stocard

We want to make this easy, too. This is why we research computer vision approaches to recognize the card numbers. I looked at the problem of detecting individual digits, as shown in the image on top of this article.

This approach can later be used in a larger framework to automatically read the numbers on a loyalty card when the user holds it in front of the camera, like magic.

I trained Convolutional Neural Network (CNN) classifiers and compared the results with the OCR engine Tesseract 4. Which classifier will solve the most images correctly?

Data set

Before talking about the approaches in detail, I want to give you some insights about the data I used. I split the images into two data sets:

  • The training set consists of 30685 images of digits from loyalty cards, which were augmented with computationally generated variants. The classifiers were trained with those images. This set is balanced, which means there is the same number of images for each digit of each loyalty card.
  • The test set contains 1246 new images the classifiers have never seen before, and was used at the end of the training phase to evaluate the result and compare the different approaches. This set is not balanced, which means each card type and each digit appears approximately as often as we would expect in real-world data.
Computationally generated variants of the same original image

Evaluation

Let me briefly explain the common evaluation metrics for classifiers. If you already know what recall, precision and F1 score mean, feel free to skip this section. If you are new to this, this Wikipedia page has a more detailed description.

The recall shows which percentage of images of a certain class were correctly classified as belonging to this class. It is calculated by dividing the number of samples that were correctly classified (“true positives”) by the number of true positives plus the number of images of this class that were incorrectly classified as belonging to another class (“false negatives”).

Formulas for recall, precision and F1 score

The precision shows which percent of images that were classified as a certain class actually are images of this class. Therefore, the true positives are divided by the number of true positives plus the number of images of other classes incorrectly classified as belonging to this class (“false positives”).

To compare classifiers, it is convenient to have a single metric instead of two. This is often done by using the F1 score, which is calculated as the harmonic mean of recall and precision.
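Written out, with TP, FP and FN denoting true positives, false positives and false negatives, these formulas are:

```latex
\mathrm{recall} = \frac{TP}{TP + FN}
\qquad
\mathrm{precision} = \frac{TP}{TP + FP}
\qquad
F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
```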

Approaches and results

In the following, I will go through the approaches I tested, and the different variants and improvements I tried. I’ll point out which actually improved the results, which had no effect for me, and finally show you the best classifier and arrangement I could find.

I hope you’ll enjoy the journey!

Tesseract modes

The first idea I pursued was to use Tesseract 4. As of October 2017 this version was still in beta development, but you could already check out the GitHub repository and build it yourself. The cool thing is that it is the first version with the new LSTM network.

This means that there is a new classification algorithm in Tesseract, and you have the option to choose between the original mode (the engine in Tesseract 3), the LSTM mode, and a combination of the old and the new mode.
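If you call Tesseract through a wrapper such as pytesseract, the engine mode is selected with the --oem flag. A minimal sketch (the image path is just a placeholder):

```python
import pytesseract
from PIL import Image

img = Image.open("digit.png")  # placeholder path for a single digit crop

# --oem 0: original engine, --oem 1: LSTM only, --oem 2: original + LSTM
# --psm 10 tells Tesseract to treat the image as a single character
for oem in (0, 1, 2):
    text = pytesseract.image_to_string(img, config=f"--oem {oem} --psm 10")
    print(oem, repr(text))
```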

So, how do the three modes compare?

Results for the different Tesseract 4 modes

If you take a look at the F1 score, you will see that the original mode beats the new LSTM mode. This is unexpected, since results published by the developers of Tesseract show that the LSTM mode is better than the original mode. Is there a way to explain these results?

Yes, there is. The reason for this result is that the LSTM mode cannot be configured to use a limited result alphabet. This is a massive disadvantage for the LSTM mode on this dataset, because it consists of digits only. A lot of the false predictions are cases where a digit gets classified as a letter, for example “0” as “a”. This explains the low recall value. If we look at the precision value, we find that we can trust the LSTM mode when it actually identifies a digit. Then we can be sure that what Tesseract says is a “0” really is a “0”.

The conclusion we can draw from this is that the LSTM mode is definitely interesting, but the lack of a character whitelist limits its usefulness for our dataset.
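For reference, this is how the digit whitelist can be set for the original engine; in the Tesseract 4 beta used here, the LSTM engine simply ignored the option (image path again a placeholder):

```python
import pytesseract
from PIL import Image

img = Image.open("digit.png")  # placeholder path for a single digit crop

# Restrict the result alphabet to digits; only the original engine (--oem 0)
# respects tessedit_char_whitelist in this Tesseract 4 beta
config = "--oem 0 --psm 10 -c tessedit_char_whitelist=0123456789"
print(pytesseract.image_to_string(img, config=config))
```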

Color conversion

A second thing I tested was converting the color image into a binary black-and-white image before running Tesseract. The original mode of Tesseract was created for printed text documents, and is known to sometimes work better when the image is converted beforehand. This is why I tested the Otsu binarization algorithm with Tesseract.
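With OpenCV, the binarization is a one-liner on the grayscale image; a minimal sketch (file names are placeholders):

```python
import cv2

img = cv2.imread("digit.png")                    # placeholder input crop
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)     # Otsu works on grayscale images
# Otsu's method picks the threshold automatically from the image histogram
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite("digit_otsu.png", binary)            # feed this image to Tesseract
```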

Results for color conversion with the original mode
Results for color conversion with the LSTM mode

The tests show that the F1 score indeed increases when using Otsu binarization. As we can see, this is true for the new LSTM mode as well.

CNN

My second approach was to train a CNN to classify digits. I used a net architecture which was inspired by the famous LeNet architecture.

Used CNN architecture, inspired by the LeNet architecture

The net takes RGB color images as input. These images must have a width of 15px and a height of 30px.
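Purely for illustration (the actual training used Caffe, as described next), here is a rough sketch of a comparable LeNet-style net in Keras. The 20 and 50 feature maps match the values mentioned in the “Weights” section below; everything else about the layer configuration is an assumption:

```python
# Illustrative LeNet-style net for 15x30 RGB digit crops (not the original Caffe net)
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(30, 15, 3)),            # height x width x RGB channels
    layers.Conv2D(20, kernel_size=5, padding="same", activation="relu"),  # c1
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(50, kernel_size=5, padding="same", activation="relu"),  # c2
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(500, activation="relu"),
    layers.Dense(10, activation="softmax"),    # one output per digit 0-9
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```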

To train nets I used the Caffe framework. Caffe was originally developed for creating CNNs and offers a C++ API. Furthermore, it is possible to load Caffe networks with the OpenCV library. For me it therefore seemed to be the best framework for a later integration into our iOS app.
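As a minimal sketch of that integration path, OpenCV's dnn module can load a trained Caffe net and run a forward pass (file and image names are placeholders):

```python
import cv2
import numpy as np

# Load the trained Caffe net via OpenCV's dnn module
net = cv2.dnn.readNetFromCaffe("digit_net.prototxt", "digit_net.caffemodel")

img = cv2.imread("digit.png")                     # a single digit crop
blob = cv2.dnn.blobFromImage(img, size=(15, 30))  # resize to the net's 15x30 input
net.setInput(blob)
probs = net.forward()                             # class scores for the digits 0-9
print("predicted digit:", int(np.argmax(probs)))
```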

During my project Caffe2 was published. It was especially developed to support app integration and is able to load nets created with Caffe. It came a bit too late for me, but is definitely something one should consider for future developments.

But for now, let us proceed with the findings I made!

Single CNN vs. CNN per card

The training set consists of digit images from 14 different card types. The first idea I had was to train a single CNN using all the training data. This is convenient, since we can use a single net for all card types.

Different CNN setups we tested

However, the user selects their card type in the Stocard UI before the scanner opens, so we already know the card type when the OCR scanner would open. I wondered whether the F1 score could be improved if every card type gets its own CNN. Therefore, I trained a CNN for every card type. The combined classifier then always chooses the correct CNN for an input image, based on the card type that is passed alongside it.
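A minimal sketch of this dispatch, assuming one trained net per card type on disk (all names are illustrative, not the actual Stocard code):

```python
import cv2
import numpy as np

CARD_TYPES = ["card_a", "card_b"]  # in the project there were 14 card types

# One CNN per card type, loaded once at startup
nets = {
    card: cv2.dnn.readNetFromCaffe(f"{card}.prototxt", f"{card}.caffemodel")
    for card in CARD_TYPES
}

def classify_digit(digit_crop, card_type):
    net = nets[card_type]                             # the CNN trained for this card
    net.setInput(cv2.dnn.blobFromImage(digit_crop, size=(15, 30)))
    return int(np.argmax(net.forward()))
```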

Results for different CNN setups

It turned out that a CNN for each card type indeed gives the better results. From this point on I always trained CNNs for each card type, and the following tests use this model as their baseline.

Weights

I tried to improve the performance by using more feature maps in the convolutional layers. I went from c1=20 and c2=50 feature maps up to c1=120 and c2=150, adding 20 maps per layer in each test run. It did not really improve the F1 score, but it greatly extended the training time. Therefore, I considered it a fail.

Influence of additional weights for layer 1 and layer 2 on F1 score and training time

Number of samples for training

The next thing I wanted to know was how many images I need per digit for a single type of customer card. The plot shows that at around 1020 images the curve starts to flatten. Therefore, I consider 1020 images the minimum number of images per digit one should use when training a CNN.

Influence of the number of samples per digit on the F1 score

Number of original images used for augmentation

Since the number of total images influences the result, I wondered how the number of original images used for augmentation influences the results. Is it ok to use just one image and create 519 artificial copies, or would it be better to use 10 original images and create 51 artificial copies of each?

Influence of the number of original images, artificially upsampled to 520 images, on the F1 score

The answer is no surprise: more original data is better. Augmentation is not a valid shortcut for collecting original data.

Augmentation strategies

I tested which augmentation strategies resulted in the best F1 score. To do this, I tried all combinations: first the single strategies, then two strategies paired together, then three together, and finally all strategies combined.

Computationally generated variants of the same original image

This resulted in 15 CNNs, each trained on slightly different data. The results show that brightness and shifting have the largest effect on the results.
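For illustration, here is a hedged sketch of those two strategies, brightness changes and shifting, using OpenCV and NumPy; the parameter ranges are made up, not the values used in the experiments:

```python
import cv2
import numpy as np

def augment(img, rng):
    h, w = img.shape[:2]

    # Brightness: add a random offset to all pixels
    offset = int(rng.integers(-40, 41))
    bright = np.clip(img.astype(np.int16) + offset, 0, 255).astype(np.uint8)

    # Shift: translate the image by a few pixels in x and y
    dx, dy = rng.integers(-2, 3, size=2)
    m = np.float32([[1, 0, dx], [0, 1, dy]])
    return cv2.warpAffine(bright, m, (w, h), borderMode=cv2.BORDER_REPLICATE)

rng = np.random.default_rng(0)
img = cv2.imread("digit.png")                     # placeholder original digit crop
variants = [augment(img, rng) for _ in range(51)]
```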

Ensemble of multiple CNN

Now, let’s get crazy! What would happen if I not just train a CNN specifically for each provider, but train multiple CNNs for each provider and let them decide by majority vote?

Using differently trained CNNs as an ensemble and deciding by majority votes

I thought it could be possible that the F1 score improves when I use slightly differently trained CNNs and let them vote. I used the 15 nets I trained for each card type when evaluating the augmentation strategies, and combined them in an ensemble. Each of the 15 classifiers predicts a class for the input image, and then we decide by majority vote over the results of all classifiers.
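The vote itself is simple. A minimal sketch, where predictions holds the digit predicted by each of the 15 nets for one input image:

```python
from collections import Counter

def majority_vote(predictions):
    # Return the class predicted by the most classifiers
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote([3, 3, 8, 3, 9, 3]))  # -> 3
```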

Results for a CNN per card and an ensemble of multiple CNN per card

The ensemble of multiple CNNs is slightly better than a single CNN, and gives the best score I measured overall. The final system reaches a recall of 0.906 on the individual digits.

Summary

In conclusion, here is what I found out during this project:

  • The Tesseract LSTM implementation is promising, but currently lacks an easy way to limit the result alphabet
  • Individually trained CNNs for each card provider beat a one-net-fits-all approach
  • Ensemble learning with multiple CNNs per card provider gives even better results
  • Augmentation is helpful, especially brightness and shifting variations
  • More training data improves the results, and the quality of the data (original images) matters more than sheer quantity (augmented copies)

Outlook

While the results clearly show that CNNs are useful for digit classification, there is still a lot left to do. The digit classifier needs to be integrated into a general card number recognition pipeline. This pipeline needs to solve problems like finding the customer number region inside the card image and segmenting the individual digits. These segmented digits can then be passed to the digit classifier described in this article.

After recognizing a number, it is also reasonable to implement a post-processing step which verifies that the card number is correct.
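Whether such a check is possible depends on the provider's numbering scheme. As a purely hypothetical example, card numbers that end with a Luhn check digit could be verified like this:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(c) for c in number]
    total = 0
    # Double every second digit from the right; subtract 9 if the result exceeds 9
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("79927398713"))  # True, a common Luhn test number
```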

As you can see there is still room for future work. If you enjoyed this blogpost, I recommend taking a look at the Stocard jobs page. Who knows, maybe you will be the next one who publishes a blog post about your work at Stocard!

Thank you for reading :-)

Working in the Stocard office — this coffee could be yours :)
