An inside look into state-of-the-art Image and Optical Character Recognition

Anjali Thomas
SFU Professional Computer Science
10 min read · Feb 12, 2021

This blog is written and maintained by students in the Professional Master’s Program in the School of Computing Science at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit sfu.ca/computing/pmp.

Imagine a situation where a document you scanned remained a digital photograph and you couldn’t grab its text. What if Google Books weren’t searchable and you had to find everything in it manually? If text weren’t recognized as ‘text’ in digital documents and images, our lives would be far more complicated. Optical Character Recognition, or OCR for short, is the technology that solves all these problems!

As the name says, Optical Character Recognition is used to recognize characters in printed or handwritten documents and convert them into machine-readable text. It has many applications in the real world, such as scanning documents into editable form, extracting data from everything from driver’s licenses to invoices, and storing historical and legal records in a digital, searchable format. Apart from making our daily tasks easier, OCR also serves as an aid for the visually impaired: it extracts text from documents so that text-to-speech software can read it aloud.

License Plate Reader (https://en.wikipedia.org/wiki/Automatic_number-plate_recognition)
Handwriting Recognition (https://vidado.ai/handwriting-ocr/)

So how does the technology actually recognize all these characters? OCR uses machine learning methods for this, and state-of-the-art Optical Character Recognition makes use of Convolutional Neural Networks (CNNs). Here we go a little beyond plain OCR and discuss a combination of a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network to create a description (or label) for the input image. The pipeline for such a Long-term Recurrent Convolutional Network looks like the image below, and we are going to explain each step of the process.

1. Image Pre-Processing

Images often come in a format that is not ideal as input for our machine learning model. To get the most accurate output from the model, we need to prepare the image in advance. This is a vital step in OCR.

The most common problems are that images come in skewed, in full color rather than black and white, at the wrong size, or too noisy. There are several ways to address these issues, and the Python library OpenCV comes in handy as a solution:

De-skewing

Images can come in at odd rotations. To detect these, we can use deep learning methods, or we can extract the text blob and check its rotation. (1) A sketch of the text-blob approach appears after the figure below.

Original Image vs. De-skewed Image
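Here is a minimal OpenCV sketch of the text-blob approach just described; the filename is an illustrative assumption, and minAreaRect’s angle convention varies across OpenCV versions, so the correction step may need adjusting:

import cv2
import numpy as np

img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)
# Invert so the ink becomes foreground, then collect all text pixels.
thresh = cv2.threshold(img, 0, 255,
                       cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]
coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)
# The tightest rotated rectangle around the text blob gives the skew angle.
angle = cv2.minAreaRect(coords)[-1]
if angle > 45:
    angle -= 90
# Rotate the page back by the detected angle.
h, w = img.shape
M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
deskewed = cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)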

Binarization

This is the process of converting a multi-colored image into a black-and-white one so that the objects stand out from the background. Choosing the right threshold is crucial to getting this separation right.

Image Binarization

Noise Removal

There are often pixels in the image that are too sharp or too intense and that interfere with machine learning models. We want to smoothen the image, and OpenCV can address this problem too. (2) Two common one-line options appear after the figure below.

Denoising of Document Image
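A minimal sketch of two common OpenCV smoothing options; the filename and parameter values are illustrative assumptions:

import cv2

img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)
blurred = cv2.GaussianBlur(img, (5, 5), 0)          # simple Gaussian smoothing
denoised = cv2.fastNlMeansDenoising(img, None, 10)  # non-local means denoising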

Rescaling

If the image is too small (below 300 DPI), it will not give the best output, while if the image is too big, the extra resolution contributes diminishing returns to accuracy and slows down processing. (3) Rescaling itself is a one-liner in OpenCV, as sketched below.
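A minimal rescaling sketch; the 2x factor and the filename are illustrative assumptions:

import cv2

img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)
# Upscale with cubic interpolation; pick the factor that brings the
# text to roughly 300 DPI.
rescaled = cv2.resize(img, None, fx=2.0, fy=2.0,
                      interpolation=cv2.INTER_CUBIC)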

Implementation of Binarization

We are now going to take a deep dive into the binarization process. The goal is to distinguish the parts we want to recognize from the background. To do this, we need to find a threshold: if a pixel value is greater than the threshold, convert it to a white pixel (value = 255); if it is less, convert it to a black pixel (value = 0).

The binarized image thus carries only two kinds of information, black and white, which is exactly what later stages need to decide which parts are text and which are background.

We cannot implement every part in code here because of space limitations, so we picked out a few binarization methods to discuss; a short OpenCV sketch covering all three follows the list.

  • Global thresholding

This is the simplest thresholding method: traverse all pixels and re-assign each one according to a fixed threshold value. Here, the threshold value is 127 because it is exactly half of the pixel range 0–255.

Image before and after thresholding
  • Otsu’s Binarization

In global thresholding we generally choose 127 as the threshold value, but we don’t know whether this works well for a given image, and finding an ideal value by trial and error would consume a lot of time and effort. Instead, we can use Otsu’s algorithm, an adaptive threshold-determination method. The basic idea is that, for an image with a bimodal histogram, the best threshold lies in the valley between the two peaks.

  • Adaptive Thresholding

The above two methods are global threshold methods, so they do not work well on images with uneven lighting; the processed image is not the result we want. In this case, we need an adaptive threshold. Rather than computing one global threshold, this method divides the picture into regions by brightness and computes a threshold for each region separately. There are many ways to determine the local threshold; common choices are the mean or a Gaussian-weighted average of the neighborhood.
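As promised above, here is a minimal OpenCV sketch of all three methods; the filename and the 11x11 neighborhood used for adaptive thresholding are illustrative assumptions:

import cv2

img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)

# 1. Global thresholding with the fixed value 127.
_, global_bw = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)

# 2. Otsu's method: the threshold argument (0) is ignored, and the
#    threshold is chosen automatically from the image histogram.
otsu_t, otsu_bw = cv2.threshold(img, 0, 255,
                                cv2.THRESH_BINARY | cv2.THRESH_OTSU)

# 3. Adaptive thresholding: each pixel is compared against a Gaussian-
#    weighted average of its 11x11 neighborhood, minus a constant of 2.
adaptive_bw = cv2.adaptiveThreshold(img, 255,
                                    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                    cv2.THRESH_BINARY, 11, 2)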

2. Convolutional Neural Network and Long Short-term Memory Network

An image is a spatial structure, which few machine learning methods are equipped to deal with directly. The CNN is used for ‘feature extraction’ from the text in the image. Its output is fed to the LSTM for more advanced purposes such as image labeling, or for increasing the accuracy of character recognition through sequential learning. We explain this process below:

Convolutional Neural Network

Convolutional Neural Networks are similar to regular Neural Networks. They have weights that are trained by gradient descent on a loss function through various layers. Convolutional Neural Networks, however, can ingest images as tensors. Tensors are generalizations of matrices to additional dimensions. The image below can help conceptualize what tensors are, and as you can imagine, a tensor can have arbitrarily many dimensions. (4)

TensorFlow tensor

There are 3 dimensions to an image in a Convolutional Neural Network: width, height, and depth (which contains the RGB color channels).

CNNs have hidden layers called convolutional layers. Each layer runs ‘filters’ over patches of the image. Filters are small matrices capable of identifying patterns (such as edges, corners, and shapes) in these patches. As we go deeper into the network, the filters can detect increasingly sophisticated image features. For each convolutional layer, we have to specify the number of filters the layer should have. Each filter is multiplied element-wise with a patch of the image; when the filter’s large values line up with large values in the patch, the dot product is high, which means the patch contains the same pattern as the filter. (5) A tiny example of this idea is sketched below.
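To make this concrete, here is a minimal sketch (the filename is an illustrative assumption) that convolves an image with a hand-crafted 3x3 kernel; this Sobel-style kernel produces large responses wherever a patch contains a vertical edge:

import cv2
import numpy as np

img = cv2.imread("letter.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)
# A vertical-edge kernel: the dot product is large where intensity
# changes from left to right within a patch.
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=np.float32)
edges = cv2.filter2D(img, -1, kernel)

In a CNN, the difference is that the kernel values are not hand-crafted but learned during training.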

Each dot product is passed through an activation function, and together the results form an activation map. Depending on the activation function we choose, the values are limited to a range such as -1 to 1 (tanh) or 0 to 1 (sigmoid). (6)

This activation map is then put into a max-pooling layer, where only the largest value in each pooling window is kept and the rest are discarded.

The last layer is called the fully-connected layer; it produces the network’s output as a vector. Three hyperparameters control the size of a convolutional layer’s output volume: the number of filters, the stride, and the zero padding. (7)
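As a concrete example of how these hyperparameters interact (following the CS231n notes cited in (5) and (7)): for an input of width W, filter size F, stride S, and zero padding P, the output width of a convolutional layer is (W − F + 2P)/S + 1, while the number of filters sets the output depth. A 32-pixel-wide input with a 3x3 filter, stride 1, and padding 1 therefore keeps its width: (32 − 3 + 2·1)/1 + 1 = 32.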

The output is then passed on to the Long Short-Term Memory network, which we discuss briefly below. First, here is a demonstration of a CNN for character recognition:

Implementation of a simple CNN model

We used two databases here: the letter ‘A’ comes from data.gov (8), and the letter ‘B’ comes from Kaggle (9). We selected 3,000 images from the two databases as the training set and 600 images as the test set.

Here we first convert the images into a CSV file. If you are interested in how this is done, you can check it out on GitHub via the link below.

We built a sequential model with one convolutional layer, followed by a max-pooling layer and a flattening layer. The max-pooling step reduces the size of the original image; it does lose some pixels, but it lets us extract the image’s critical information, so max-pooling helps avoid overfitting and speeds up model training. Flattening converts the data into a 1-dimensional vector for the final layers. Finally, we added a dense layer with a single sigmoid output, so the model predicts a value between 0 and 1 corresponding to the two letters A and B. A sketch of this architecture follows.
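Here is a minimal Keras sketch of the architecture just described; the input size and filter count are illustrative assumptions, not the exact values from our experiment:

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    # One convolutional layer extracting 32 feature maps from a
    # 28x28 grayscale input (sizes are illustrative).
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),            # downsample, keep key features
    layers.Flatten(),                       # to a 1-D vector
    layers.Dense(1, activation="sigmoid"),  # output near 0 = A, near 1 = B
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()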

Model Prediction Result

Long Short-Term Memory Network

To improve the accuracy of the predictions, we can use an LSTM on top of the CNN for deep learning on sequences. LSTMs are essentially Recurrent Neural Networks (RNNs) that overcome some of the RNN’s limitations. As opposed to a feed-forward neural network like the CNN, RNNs are slightly more complex: the output of the network is fed back in as its input. However, because of the way they work, plain RNNs have only short-term memory, and they take a long time to process large amounts of data.

A simple LSTM architecture

One big advantage of the Long Short-Term Memory network is that it deals with the exploding and vanishing gradient problems encountered in recurrent neural networks. These problems arise as the weights get updated via backpropagation: small weights are repeatedly multiplied and become vanishingly small, while big weights grow ever larger, and either extreme cripples training.

The LSTM addresses this problem with a system of “gates”. These gates are sigmoid functions whose outputs lie between 0 and 1, based on their parameters and input: values close to 0 block information from passing, while values close to 1 let it through. The network thereby learns what to keep and what to discard. (10)
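For reference, these are the standard LSTM gate equations (as presented in (10)); σ is the sigmoid, ⊙ denotes element-wise multiplication, and [h_{t-1}, x_t] is the previous hidden state concatenated with the current input:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)      (forget gate)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)      (input gate)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)   (candidate cell state)
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t          (cell state update)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)      (output gate)
h_t = o_t ⊙ tanh(C_t)                    (new hidden state)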

The purpose of the LSTM is to predict each word of the sentence based on the input it received from the CNN as well as the words preceding it in the sequence. Among the predictions, only the words with maximum probability are chosen as the final output. We achieve this by continuously minimizing the loss function and updating the weights. (11)

For image processing, the LSTM needs the CNN because the LSTM is very particular about its input and cannot extract spatial structure from raw images on its own. The LSTM can then be trained on the output of the CNN to classify the image and produce a description in a sequence that makes sense, as it can accumulate information from all previous cells. A sketch of such a combined model follows.
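Here is a minimal Keras sketch of such a combined model, in the spirit of the image-captioning architecture of (11); the vocabulary size, caption length, and layer sizes are illustrative assumptions:

import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_size = 5000   # illustrative vocabulary size
max_len = 20        # illustrative maximum description length

# CNN encoder: turns the image into a fixed-length feature vector.
image_in = layers.Input(shape=(128, 128, 1))
x = layers.Conv2D(32, (3, 3), activation="relu")(image_in)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Conv2D(64, (3, 3), activation="relu")(x)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Flatten()(x)
image_features = layers.Dense(256, activation="relu")(x)

# LSTM decoder: predicts the next word from the image features plus
# the words generated so far.
caption_in = layers.Input(shape=(max_len,))
embedded = layers.Embedding(vocab_size, 256)(caption_in)
merged = layers.add([layers.RepeatVector(max_len)(image_features), embedded])
h = layers.LSTM(256)(merged)
next_word = layers.Dense(vocab_size, activation="softmax")(h)

model = Model([image_in, caption_in], next_word)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

At inference time the decoder runs word by word: feed in the words predicted so far, take the highest-probability next word, and repeat until an end token is produced.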

3. Post Processing

In this final stage, the output of the OCR system is analyzed for any context-based or grammatical ‘errors’ to improve the accuracy of the system. So how is this done? There are a couple of strategies adopted for this.

✦ Proof Reading

This is the most obvious way of correcting the document and is done manually by humans. Although it can be precise, this method is also prone to human error and quickly becomes exhausting work.

✦ Lexicon Error Correction

This approach corrects spellings in the OCR output by comparing each word against a lexicon, i.e., a dictionary of valid words, usually with the help of an n-gram model. The process depends entirely on the lexicon and ignores the context in which a word is used.

✦ Grammar/Semantic Error Correction

Context-based corrections are usually performed with statistical language models; sometimes combinations of several algorithms are used to achieve this. The Levenshtein distance is also commonly used in post-processing, as sketched below.
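As an illustration, here is a minimal sketch of lexicon-based correction using the Levenshtein distance; the lexicon and the edit-distance cutoff are illustrative assumptions:

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between strings a and b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def correct(token, lexicon, max_dist=2):
    # Replace the token with the closest lexicon word, if close enough.
    best = min(lexicon, key=lambda w: levenshtein(token, w))
    return best if levenshtein(token, best) <= max_dist else token

lexicon = ["recognition", "character", "optical"]
print(correct("charactor", lexicon))   # -> "character"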

✦ Google Spelling Suggestion

To correct spelling, Google’s online spelling suggestion can also be used; it gives results based on Google Search over the words recognized by the OCR. This tends to show better results than lexicon-based correction with a fixed dictionary.

Summary

In this post, we learned how image recognition and Optical Character Recognition work. In the future, we would like to delve further into the statistics and calculus behind the neural networks that power this technology.

References

(1) https://theailearner.com/2019/05/29/optical-character-recognition-pipeline-image-preprocessing/

(2) https://towardsdatascience.com/image-filters-in-python-26ee938e57d2

(3) https://docparser.com/blog/improve-ocr-accuracy/

(4) https://wiki.pathmind.com/convolutional-network

(5) https://cs231n.github.io/convolutional-networks/

(6) https://medium.com/@ksusorokina/image-classification-with-convolutional-neural-networks-496815db12a8

(7) https://www.analyticsvidhya.com/blog/2017/06/architecture-of-convolutional-neural-networks-simplified-demystified/

(8) https://catalog.data.gov/dataset/nist-handprinted-forms-and-characters-nist-special-database-19

(9) https://www.kaggle.com/vaibhao/handwritten-characters

(10) https://medium.com/@divyanshu132/lstm-and-its-equations-5ee9246d04af

(11) https://arxiv.org/pdf/1411.4555.pdf and https://towardsdatascience.com/rlsd-an-end-to-end-cnn-lstm-model-for-multi-label-image-classification-47dfdf8e4bd9
