Survey on Image Preprocessing Techniques to Improve OCR Accuracy

Even the best OCR tool will fail to produce good results when the quality of the input image or document is too poor. Understanding the nuances of source image quality, and the techniques to improve it, will go a long way toward improving your OCR accuracy.

Mageshwaran R
Technovators
10 min read · Jan 6, 2021


Photo by Greg Rosenke on Unsplash

Most OCR engines come with built-in image processing techniques to automatically improve image quality, but the problem is that you may not be able to tweak their parameters for your use case. Understanding and building your own image processing pipeline is helpful when you have a good idea of the source documents and the noise and distortions expected in them. Let us explore this in detail in the upcoming sections.

Topics Covered

  • Measuring Image Quality
  • Deep dive into OCR pre-processing
  • Recommended Tools for OCR image pre-processing

Measuring Image Quality

We know that the better the quality of the source image, the higher the accuracy of OCR will be.

But how can we measure the quality of the source image, and what properties should a good source image have?

  • Characters should be distinguishable from the background: Sharp character borders, High Contrast
  • Characters / Words Alignment: Good alignment ensures proper character, word, and line segmentation
  • Good image resolution and alignment
  • Less Noise: We will explore more in the next section

The features mentioned above make a document better suited for OCR. Let's now dive deep into the possible image quality issues and the ways to tackle them.

Deep dive into OCR pre-processing

As stated above, source image quality for OCR depends on various factors, broadly: the presence or absence of noise and distortions, proper image and text alignment, image resolution, and local contrast.

Let's now address them one by one.

Right Resolution

The standard recommended resolution for OCR is 300 DPI (dots per inch). However, depending on the font size used, some OCR engines internally scale the original image (a minimal rescaling sketch follows the DPI comparison below).

  • For regular text (font size > 8), it is recommended to go with 300 DPI itself
  • For smaller text (font size < 8), it is recommended to have 400–600 DPI
  • Anything above 600 DPI will only increase the image size, and therefore the processing time, without improving OCR accuracy.
DPI Comparison, source
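
Here is a minimal sketch of upscaling a low-resolution scan toward 300 DPI using Pillow. The file names and the fallback source DPI are purely illustrative assumptions.

```python
from PIL import Image

TARGET_DPI = 300

def rescale_to_target_dpi(path, fallback_dpi=150, target_dpi=TARGET_DPI):
    """Upscale a low-resolution scan so its effective DPI matches the target."""
    img = Image.open(path)
    # Prefer the DPI recorded in the file metadata when it is available.
    source_dpi = img.info.get("dpi", (fallback_dpi, fallback_dpi))[0]
    scale = target_dpi / source_dpi
    if scale <= 1:
        return img  # already at or above the target resolution
    new_size = (int(img.width * scale), int(img.height * scale))
    # Lanczos resampling keeps character edges reasonably sharp when upscaling.
    return img.resize(new_size, Image.LANCZOS)

upscaled = rescale_to_target_dpi("scan.png")
upscaled.save("scan_300dpi.png", dpi=(TARGET_DPI, TARGET_DPI))
```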

Image Binarization

Binarization is the process of converting a colored (RGB) image into a black-and-white image. Most OCR engines do this internally. Adaptive binarization is the most popular technique for this conversion: it computes the threshold from the features of neighboring pixels, i.e., a local window (a short thresholding sketch follows the comparison below).

Comparison of Global vs Adaptive thresholding, source
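
A minimal OpenCV sketch comparing a global (Otsu) threshold with adaptive thresholding, assuming a grayscale scan; the 31×31 window and offset constant are starting values you would tune for your documents.

```python
import cv2

# Read the scan as grayscale; binarization operates on a single channel.
gray = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)

# Global (Otsu) threshold for comparison: one threshold for the whole page.
_, global_bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Adaptive threshold: the threshold is computed per local window (here 31x31),
# which copes much better with uneven lighting and shadows.
adaptive_bw = cv2.adaptiveThreshold(
    gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 15
)

cv2.imwrite("scan_binarized.png", adaptive_bw)
```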

Image Contrast and Sharpness

By increasing the local contrast, i.e., the contrast between text and background, each character becomes more distinguishable from the background and easier to recognize. Similarly, sharp borders between characters help with character segmentation and recognition.

In most cases, global contrast enhancement is not a good option because different parts of the image may have different contrast. Hence, Contrast Limited Adaptive Histogram Equalization (CLAHE) is a very effective pre-processing step to improve text-to-background contrast.
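
A minimal CLAHE sketch with OpenCV, assuming a grayscale scan; the clip limit and tile size are common starting values, not recommendations from any particular OCR engine.

```python
import cv2

gray = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)

# CLAHE boosts contrast within small tiles (8x8 here) and clips the histogram
# to avoid over-amplifying noise in flat background regions.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(gray)

cv2.imwrite("scan_clahe.png", enhanced)
```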

Image Geometric Transformations

Depending on the image-capturing technology used, source images can suffer from different types of misalignment. We may need to perform anything from simple to advanced geometric transformations to correct this misalignment and give the OCR engine an ideal source image.

Page Rotation: This step is built into most OCR engines. The engine first detects the orientation of each page and then corrects it prior to text recognition.
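
If your engine does not expose this step, Tesseract's orientation and script detection (OSD) can be called through pytesseract, as sketched below. This assumes the "Rotate" value in the OSD output is the clockwise rotation needed to make the text upright, and that the page contains enough text for detection to succeed; the file name is illustrative.

```python
import re
import cv2
import pytesseract

image = cv2.imread("scan.png")

# OSD reports the rotation (in degrees) needed to make the page upright.
osd = pytesseract.image_to_osd(image)
rotation = int(re.search(r"Rotate: (\d+)", osd).group(1))

if rotation:
    # Apply the reported rotation as a clockwise turn of the image.
    codes = {90: cv2.ROTATE_90_CLOCKWISE,
             180: cv2.ROTATE_180,
             270: cv2.ROTATE_90_COUNTERCLOCKWISE}
    image = cv2.rotate(image, codes[rotation])
```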

Deskew or Skew Correction: Most images captured by flatbed scanners or photographed with digital cameras are slightly skewed. Skew correction detects the skew angle and rotates the page so that the text appears horizontal rather than tilted (a minimal deskewing sketch follows the example below).

Skew Correction Example, source
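
A minimal deskewing sketch with OpenCV: it estimates the skew from near-horizontal line segments found by a probabilistic Hough transform and rotates the page by the median angle. The Canny and Hough parameters are illustrative and assume clean, text-dominated scans.

```python
import cv2
import numpy as np

gray = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(gray, 50, 150)

# Detect long, roughly horizontal segments (text baselines, table rules, etc.).
lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                        minLineLength=gray.shape[1] // 3, maxLineGap=20)

angles = []
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        angle = np.degrees(np.arctan2(y2 - y1, x2 - x1))
        if abs(angle) < 45:          # keep only near-horizontal segments
            angles.append(angle)

skew = float(np.median(angles)) if angles else 0.0  # robust skew estimate

# Rotate the page around its centre to undo the measured skew.
h, w = gray.shape
M = cv2.getRotationMatrix2D((w / 2, h / 2), skew, 1.0)
deskewed = cv2.warpAffine(gray, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
cv2.imwrite("scan_deskewed.png", deskewed)
```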

Keystone Effect or Trapezoidal Distortion: When the scanned document is not parallel to the scanner (or camera), the captured image will have a keystone effect, i.e., the source document will look like a trapezoid instead of a rectangle. This issue typically occurs when capturing or scanning images with mobile devices or digital cameras.

Keystone Effect(left), Corrected Image(right), source

For keystone correction, the system should first detect the trapezoid representing the scanned document, apply a perspective (homography) transformation to map the trapezoid back to a rectangle (an affine transformation preserves parallelism and cannot do this), and then remove edges that do not contain any useful data.
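
Here is a minimal perspective-correction sketch with OpenCV. The four corner points of the document quad are hard-coded purely for illustration; in practice they would come from a contour-detection step, and the output size would follow the document's real aspect ratio.

```python
import cv2
import numpy as np

image = cv2.imread("photo_of_document.jpg")

# Corners of the detected document quad: top-left, top-right, bottom-right,
# bottom-left. Hard-coded here only to keep the sketch short.
src = np.float32([[120, 80], [980, 60], [1040, 1350], [60, 1380]])

# Choose an output size roughly matching the document's aspect ratio.
width, height = 1000, 1400
dst = np.float32([[0, 0], [width, 0], [width, height], [0, height]])

# A perspective (homography) warp maps the trapezoid back to a rectangle;
# an affine transform cannot, because it preserves parallelism.
M = cv2.getPerspectiveTransform(src, dst)
warped = cv2.warpPerspective(image, M, (width, height))
cv2.imwrite("document_rectified.jpg", warped)
```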

Correction of 3D perspective distortions: This issue is again specific to images captured with mobile devices or digital cameras and is a more complex version of the previous effect. Due to 3D perspective distortion, the font size varies from the top of the document to the bottom, and text at the top of the page may not be clear. Once the transformation is applied, the font size looks almost uniform across the document and yields better OCR results.

Image by Author, Original Image (left) and Transformed Image(right)

If you look at the above image (left: original image), the text in the top region is not even human-readable, but after the 3D perspective transformation the same image looks like a rectangle, the font size is consistent across the document, and the text is human-readable. However, the lines are still not straight in the processed image, which may lead to issues in line segmentation and OCR. We will look at an approach to handle this next.

Lines Straightening: When the lines are curved, as in the image above, they can cause problems with line segmentation and text re-arrangement. Hence, detecting the curved lines and straightening them will improve OCR results.

Lines Straightening, source

Other Image Transformations:

  • Image Cropping and Scanned image border removal
  • Image Mirroring
  • Color Inversion when different regions have different Foreground and Background colors
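
These operations are one-liners in OpenCV; the crop margins, flip direction, and file names below are illustrative.

```python
import cv2

gray = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)

# Crop away a fixed scanner border (margins are illustrative).
cropped = gray[50:-50, 50:-50]

# Mirror a horizontally flipped scan back to normal reading order.
mirrored = cv2.flip(cropped, 1)

# Invert light-text-on-dark regions so the page ends up dark text on light.
inverted = cv2.bitwise_not(mirrored)

cv2.imwrite("scan_transformed.png", inverted)
```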

Noise Removal

Noise is a random variation of brightness or color information in an image that can make the text more difficult to read. Most common noise is handled by the image binarization, contrast, and sharpness adjustment steps.

But depending on the nature of the source image, different types of noise may be present that need to be handled in specific ways. Let us explore them now.

Blurring or Smoothing of an image removes “outlier” pixels that may be noise in the image. There are various filters that can be used to blur images and each has its own advantages and disadvantages.

  1. Gaussian Blur: Uses a Gaussian kernel for convolution and is good at removing Gaussian noise from the image. It is much faster than the other blurring techniques but fails to preserve edges, which may affect OCR output.
  2. Median Blur: Replaces the central element of the kernel area with the median value of the pixels under the kernel. It is good at removing salt-and-pepper noise from scanned documents.
  3. Bilateral Filtering: Highly effective at noise removal while keeping edges sharp. Along with a Gaussian filter in space, it applies another Gaussian filter that is a function of pixel intensity difference. The spatial Gaussian ensures that only nearby pixels are considered for blurring, while the intensity Gaussian ensures that only pixels with intensities similar to the central pixel are blurred. It therefore preserves edges, since pixels at edges have large intensity variation.

Both the bilateral and median filters are good at preserving edges, but the former is very slow due to its computational complexity.
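
Here is a short side-by-side sketch of the three filters in OpenCV; the kernel sizes and sigma values are common starting points and would need tuning for your documents.

```python
import cv2

gray = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)

# Gaussian blur: fast, but softens character edges along with the noise.
gaussian = cv2.GaussianBlur(gray, (5, 5), 0)

# Median blur: replaces each pixel with the neighbourhood median; effective
# against salt-and-pepper speckle on scans.
median = cv2.medianBlur(gray, 5)

# Bilateral filter: smooths flat regions while keeping strong edges, at a
# noticeably higher computational cost.
bilateral = cv2.bilateralFilter(gray, d=9, sigmaColor=75, sigmaSpace=75)
```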

Image Despeckling is a common technique in the OCR noise-removal step and is essentially an adaptive bilateral filtering technique. It removes noise from the scanned image while protecting edges and other complex areas from blurring, and it is very useful for removing granular marks from scanned images. When applied too aggressively, it may remove commas and apostrophes from the image by treating them as noise.

Original Image(left), Despeckled Image(right), source

If you look at the despeckled image, most of the noise is removed, but in complex areas where the text is highlighted and the noise sits very close to the text, the filter leaves the noise untouched, which prevents the pre-processing step from eroding character edges.
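
OCR engines implement despeckling internally, but a rough approximation is to remove tiny connected components from a binarized page, as sketched below. The minimum-area threshold and file names are assumptions; the threshold must stay well below the size of punctuation marks so commas and apostrophes survive.

```python
import cv2

# Work on a binarized page where text is black on white.
bw = cv2.imread("scan_binarized.png", cv2.IMREAD_GRAYSCALE)

# Label connected components of the inverted foreground (text and speckles).
inv = cv2.bitwise_not(bw)
num, labels, stats, _ = cv2.connectedComponentsWithStats(inv, connectivity=8)

# Drop components smaller than a few pixels; keep the threshold well below
# the size of dots, commas, and apostrophes so they are not erased.
MIN_AREA = 6
cleaned = inv.copy()
for i in range(1, num):
    if stats[i, cv2.CC_STAT_AREA] < MIN_AREA:
        cleaned[labels == i] = 0

cv2.imwrite("scan_despeckled.png", cv2.bitwise_not(cleaned))
```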

Handling Blurred Images: While we saw blurring techniques used to remove noise, there is also a chance that the source image itself is blurred. This occurs mainly when the camera is not held steady while capturing the image. In this case, the text may still be human-readable but cause OCR issues. Techniques like sharpening or edge enhancement help here (a small sketch follows the comparison below).

Comparison of Blurred image vs pre-processed image results, Image by Author
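
A minimal unsharp-masking sketch with OpenCV: subtract a Gaussian-blurred copy from the original to emphasize edges. The blur sigma, weights, and file names are illustrative.

```python
import cv2

blurry = cv2.imread("blurry_scan.png", cv2.IMREAD_GRAYSCALE)

# Unsharp masking: original * 1.5 - blurred * 0.5 boosts edge contrast.
smoothed = cv2.GaussianBlur(blurry, (0, 0), sigmaX=3)
sharpened = cv2.addWeighted(blurry, 1.5, smoothed, -0.5, 0)

cv2.imwrite("scan_sharpened.png", sharpened)
```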

ISO Noise Correction: The ISO level is the sensitivity of the camera's image sensor to light. ISO gain, an amplifier that improves image quality in low-light conditions, also amplifies noise, which can affect the binarization step and therefore reduce OCR quality. By smoothing the image background, we can reduce ISO noise and get better OCR results (see the denoising sketch after the comparison below).

Comparison of ISO Noise Image vs pre-processed image
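
One way to smooth a grainy, low-light photo before binarization is non-local means denoising, sketched below; the filter strength h is an assumption to trade off against how much stroke detail you can afford to lose.

```python
import cv2

noisy = cv2.imread("low_light_photo.png", cv2.IMREAD_GRAYSCALE)

# Non-local means denoising smooths the grainy background while preserving
# most of the stroke detail; h controls the filtering strength.
denoised = cv2.fastNlMeansDenoising(noisy, None, h=15,
                                    templateWindowSize=7, searchWindowSize=21)

cv2.imwrite("photo_denoised.png", denoised)
```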

Image Denoising using Autoencoders: With the evolution of deep learning in computer vision, there has been a lot of research into image enhancement with deep neural networks, such as removing noise from images, image super-resolution, etc. Autoencoders are composed of an encoder and a decoder: the encoder compresses the input into a lower-dimensional representation, and the decoder reconstructs that representation into an output that mimics the input as closely as possible. In doing so, the autoencoder learns the most salient features of the input data. For denoising, the network is trained on pairs of noisy inputs and clean targets so that it learns to reconstruct the clean version.
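
Below is a minimal PyTorch sketch of a convolutional denoising autoencoder, assuming grayscale 128×128 patches and noisy/clean training pairs (for example from the "Denoising Dirty Documents" Kaggle challenge listed in the references). The random tensors stand in for a real data loader.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Minimal convolutional autoencoder for grayscale document patches."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),   # H/2
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # H/4
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 3, stride=2, padding=1, output_padding=1),
            nn.Sigmoid(),  # pixel values in [0, 1]
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = DenoisingAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Placeholder batch: noisy patches and their clean counterparts.
noisy = torch.rand(8, 1, 128, 128)
clean = torch.rand(8, 1, 128, 128)

# One training step: reconstruct the clean patch from the noisy input.
optimizer.zero_grad()
loss = loss_fn(model(noisy), clean)
loss.backward()
optimizer.step()
```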

Recommended Tools for OCR image pre-processing

We have seen a wide range of pre-processing techniques to improve source image quality. While some of these techniques may be built into your OCR engine, or the engine may rely on open-source tools for pre-processing, it is highly recommended to explore and configure the pre-processing techniques your OCR engine uses and to create an additional pipeline with steps that are specific to your use case.

Here are some of the best open-source image pre-processing tools, optimized for performance:

  • Leptonica: General purpose image processing and image analysis library. It is used by Tesseract for Image binarization and Text Segmentation
  • ImageMagick: General purpose image processing library with a long list of command-line options available for any kind of image processing job
  • OpenCV: An open-source image processing library with bindings for C++, C, Python, and Java. OpenCV was designed for computational efficiency and with a strong focus on real-time applications
  • Unpaper: It is a post-processing tool for scanned sheets of paper and can be used to enhance the quality of scanned pages before performing OCR.

Summary

While we have explored various aspects of improving source image quality, not all of them are required for your particular use case. I would recommend analyzing the nature of the documents coming into your OCR engine and designing a pre-processing pipeline based on that. Here I will shed some light on which pre-processing techniques can be included or excluded based on your image-capturing method.

  • Flatbed Scanners: When your OCR engine only takes documents scanned by flatbed scanners, you are unlikely to have issues with trapezoidal distortion, 3D perspective distortion, or ISO noise, but you are more likely to have issues with skew and orientation. Example: fax documents.
  • Mobile or Digital Camera: Most often the document will not be parallel to the camera, which results in perspective distortion and skew. Example: KYC processes. In this case, it is better to assess source image quality before OCR and ask the user to rescan when the image quality is poor due to distortion or ISO noise.
  • Native or Created documents (as opposed to scanned documents): In this case, there will not be any issues related to image geometry or noise. Users creating documents in software tend to make them readable, which makes it easier to parse the documents using code/metadata instead of resorting to OCR. Example: contract documents. However, in some cases users introduce complex layouts and fonts, which increases complexity and requires the OCR engine to retrieve the text. Example: resumes and newsletters; advanced layout analysis techniques are helpful here.

I would recommend doing some research on your document/image sources and designing a pre-processing pipeline specific to them. It is also highly recommended to use compute-optimized packages for preprocessing instead of building everything from scratch.

References

  1. OCR Image pre-processing techniques from Abbyy
  2. OCR Image pre-processing documentation from Tesseract
  3. Improve OCR accuracy from Doc-parser
  4. Denoising Dirty Documents — Kaggle challenge


Happy Learning!!! 😍

