Modern-day challenges with document image analysis

OCRs are failing at extracting all the useful data intended for analysis from user-generated images of documents

Kumar Tanmay
Jul 26, 2018


Mobile technology plays an increasingly pervasive role both at home and at the office. Companies now let users photograph documents with a smartphone and send them quickly and safely to the intended recipients, who then extract the useful information.

In the early 1960s, optical character recognition (OCR) was one of the first clear applications of pattern recognition. Today, for some simple tasks with clean, well-formed data, document analysis is perceived as a solved problem. Unfortunately, these simple tasks do not represent the most common needs of the people who rely on document image analysis. The challenges of complex content and layout, noisy data, and variations in font and style keep pattern recognition research in document image analysis active.


Document image analysis is carving a niche out of the more general problem of computer vision because of the pseudo-binary nature of documents (dark text on a light background) and the regularity of the patterns used as a “visual” representation of language.

Portable document scanners in our hands

The increasing availability of hand-held cameras, usually cheap sensors built into mobile phones, has created an opportunity to supplement traditional scanning for document image acquisition and analysis. These cameras can capture thick books, multi-page scripts, road signs and text in natural scenes, which makes them much more versatile than desktop scanners. In fact, the phone camera has become one of the key ways to capture documents, speeding up know-your-customer (KYC) checks by making them paperless and people-less.

The industry has sensed this direction and is shifting some scanner-based OCR applications to new use cases. For example, CamScanner lets a user convert a photo into a high-quality document before saving it or sharing it with the intended recipients. Google Translate is fast approaching the idea of a universal translation device that handles both live conversation and foreign text. Touchscreen phones now let a user select or focus on an area of a document with a fingertip and recognise the selected printed or handwritten symbols. Intelligent digital cameras can also identify and translate signs written in foreign languages.

The Word Lens feature in Google Translate lets users instantly translate street signs written in a foreign language.

So what’s the problem?

Scanners combine high-quality image sensors with ideal conditions: a planar surface, no distortion, uniform lighting and no background noise. When that high-quality device is replaced by one meant for everyday use, the less controlled capture conditions introduce new processing demands, especially for text-heavy images such as financial documents. Many of the common problems in poor-quality text images stem from the limited sensitivity of low-cost cameras.

Not every company has the technology to convert a user-generated image into a high-quality, machine-readable document.

The composition of an image defines how hard it is to extract its text content. Most of the work on camera-captured data has focused on image and video text from broadcast video or still images with large font sizes, rather than on images of structured documents. Text-heavy document images taken with mobile phones pose a number of challenges compared with images of the same documents from desktop scanners. Segmenting text from degraded document images is very challenging because of the high variation between the background and the foreground text, both within a single image and across different document images. Some of the most common challenges we face:

  • Uneven lighting — Uneven lighting is common because of the physical environment (shadows, reflections, fluorescent lights) and the uneven response of the device. With a flash, the centre of the view is brightest and the illumination falls off radially.
  • Perspective distortion — This occurs when the text plane is not parallel to the image plane. Text that is farther away looks smaller and distorted, and lines that are parallel on the page are no longer parallel in the image.
Perspective distortion in an image captured by a mobile camera
  • Non-planar surfaces — Text can appear on any surface, not necessarily a flat one. The pages of a document often curl as it ages, which is called the warping effect. Like perspective distortion, warping can cause most OCR engines to fail.
Non-planar images of bank statements captured by mobile phones
  • Wide-angle-lens distortion — As an imaged object gets closer to the image plane, lighting, focus and layout distortions often occur at the periphery. Since many focus-free camera phones come with a cheap wide-angle lens, distortion is often a problem in document analysis.
  • Complex background — Users often do not realise that the intended text or document is being captured together with unnecessary background. A non-uniform background makes segmenting the document extremely difficult.
An image with a complex background
  • Zooming and focus — Since many camera phones are designed to operate over a variety of distances, focus becomes a significant factor. A sharp edge response is required for the best character segmentation and recognition. At short distances and large apertures, even slight perspective changes can cause uneven focus.
  • Intensity & colour quantisation — In an ideal imaging device, each pixel in a photon sensor should output the luminance of the inbound light and colour components corresponding to the frequency of that light. Current camera phones can easily under- or over-expose because of their small photon sensor size.
  • Sensor noise — Dark noise and read-out noise are the two significant sources of noise at the sensor stage in camera phones. The higher the shutter speed, the smaller the aperture, the darker the scene and the higher the temperature, the greater the noise.
A low-resolution bank statement with a lot of noise due to a cheap sensor
  • Compression & low resolution — Most images captured by phones are compressed either at source or during transfer. For example, images sent over messaging platforms such as WhatsApp are compressed to speed up file transfer. While OCR engines are tuned to read resolutions between 150 and 400 dpi, the same text in a compressed image may be below 150 dpi.
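
To make the resolution point concrete, here is a minimal Python sketch that estimates the effective dpi of a photographed page. It assumes the page fills the frame, so the image width in pixels maps onto the full width of an A4 sheet, and it assumes that the compression step also downscales the image. The function names and the pixel widths (3000 and 1000) are hypothetical values chosen for illustration, not measurements from any particular phone or messaging app.

```python
# Sketch: estimate the effective resolution of a photographed page.
# Assumption: the page fills the frame, so the image width in pixels
# covers the full physical width of the paper.

A4_WIDTH_INCHES = 8.27  # 210 mm divided by 25.4 mm per inch


def effective_dpi(pixel_width: int, page_width_inches: float = A4_WIDTH_INCHES) -> float:
    """Return the number of pixels available per inch of paper."""
    if pixel_width <= 0 or page_width_inches <= 0:
        raise ValueError("dimensions must be positive")
    return pixel_width / page_width_inches


def within_ocr_range(dpi: float, low: float = 150.0, high: float = 400.0) -> bool:
    """Return True when dpi falls inside the 150 to 400 dpi range quoted above."""
    return low <= dpi <= high


if __name__ == "__main__":
    # Hypothetical widths: an original photo and a downscaled copy of it.
    for label, width in [("original photo", 3000), ("downscaled copy", 1000)]:
        dpi = effective_dpi(width)
        print(f"{label}: {dpi:.0f} dpi, in OCR range: {within_ocr_range(dpi)}")
```

Under these assumptions, a 3000-pixel-wide photo of an A4 page sits comfortably inside the range, while a copy downscaled to 1000 pixels drops to roughly 121 dpi. In practice the page rarely fills the whole frame, so the real figure is usually lower than this estimate.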

Cellphone cameras have become increasingly popular and are now a potential alternative document imaging device. Although they cannot replace scanners, they are small, light, easily integrated with various networks and apps, and better suited to many document-capture tasks in less constrained environments. These advantages are leading to a natural extension of the document processing community, in which cameras are used to image hardcopy documents or natural scenes containing text.

As the number of large-scale digitization projects involving heterogeneous content continues to grow, there is a compelling need for reliable and scalable triage methods for enhancement, segmentation, classification and categorisation of document images.