Why Is Document Capture So Challenging?

Document Capture, Document Processing, Intelligent Document Extraction, Neutron, V2Solutions, Document Extraction

The global document capture software market is expected to reach $9.21 billion by 2026, growing at a CAGR of 13.8% from 2018 to 2026.

Increasing demand for big data and analytics in organizations is a key factor that’s driving the market growth.

Document capture technologies are not new. Data capture from structured documents such as medical claims and tax forms is a common use case that has been around for close to 20 years. But with advances in imaging and scanning technologies, organizations large and small have more options for automating labor-intensive data-capture tasks.

What Does Digitization of Documents Do?

- Makes physical storage space redundant.

- Simplifies record-keeping.

- Facilitates document indexing.

- Improves search capability.

- Prevents lost records.

End-user implementations of intelligent document capture from unstructured documents, such as contracts, are relatively new and gaining momentum. The key drivers for using data-capture technology are improved search capability and digitized information management across the enterprise.

Document capture can have a radical impact on an organization’s content management strategy. Then there are the other tangible benefits: higher productivity, cost savings, and improved service quality.

Ideally, in what is called “early archiving”, data capture is done very close to the source of the data, and the metadata for the content is curated. Document availability can be improved with standardized tagging, categorizing, and indexing. If possible, auto-tagging should be implemented. None of these processes is easy, given the plethora of document types and content. Yet overlooking them inevitably leads to rework. In fact, it has been estimated that 40% of all document imaging and scanning work is rework. That is why the aim should be to avoid the major pitfalls in the first place.

Issues in Document Imaging and Data-Capture

· Warping casts shade over the text block and distorts text lines, which reduces OCR accuracy.

· Perspective distortion makes parts of the image appear farther away, so they are not easily readable by OCR systems.

· Different capture hardware uses different spatial, intensity, and color quantization mechanisms.

· Focus and zoom affect character recognition.

· Non-planar page surfaces cause warping: text, lines, and shapes are deformed and cannot be read by OCR systems.

· Images captured by low-resolution digital cameras are not readable by OCR systems.

· Complex backgrounds make OCR more difficult.

· Uneven lighting affects OCR accuracy.

· Wide-angle lens distortion: focus-free and digital cameras equipped with a cheap wide-angle lens introduce focus, lighting, and layout problems.

· Sensor noise in digital cameras, exacerbated by amplifiers, high shutter speeds, and so on, distorts the captured image and affects OCR.

· Document images may have a noisy black border or contain noisy text regions.

Many novel approaches have been introduced over the years for performing OCR on scanned documents to offset these distortions, degradation, and noise.

Document Processing

OCR systems create electronic representations of printed documents. The document images are scanned into the OCR system, and the data is then analyzed, extracted, and, most often, converted into machine-readable formats.

Thus, a direct application of OCR helps users eliminate the re-keying of content through automation, without severely compromising accuracy. Intelligent character recognition (ICR) systems are OCR systems enhanced to create electronic representations of handwritten data, which can also be converted into machine-readable formats.

Issues That Plague Document Extraction

· Poor-quality images make the extracted information less accurate.

· In unstructured documents, the desired text can be anywhere within the document, which makes the information hard to find.

· The machine’s limited ability to discern incorrect data.

· Limitations in the ability to recognize printed text, which leave some text unrecognized.

· Language attributes: There is a variety of languages, some with a reverse reading order. Fonts and symbols may not be recognized easily. It is difficult to build an OCR system that performs well across these varying attributes.

The document analysis techniques used by data-capture tools are usually based on applying complex pattern recognition methods to RGB images. Overall, processing document images or images of text requires text detection, localization, extraction, geometric normalization, enhancement/binarization, and recognition.
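To make the binarization step a little more concrete, here is a minimal sketch in plain Python. The article does not say which method data-capture tools use; Otsu's global threshold is swapped in here as one common choice. The sketch assumes the image has already been converted from RGB to 8-bit grayscale and that text is darker than the background. The function names and the tiny sample page are made up for illustration.

```python
# Sketch only: Otsu's global threshold on an 8-bit grayscale image.
# Assumptions (not from the article): the image is a list of rows of
# integers in 0..255, and text is darker than the background.

def otsu_threshold(pixels):
    """Return the threshold t (0..255) that maximizes between-class variance.

    Pixels with value <= t form the dark class, the rest the light class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0:
            continue  # no dark pixels yet, nothing to compare
        if w_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between_var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(image):
    """Map dark (text) pixels to 1 and light (background) pixels to 0."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]


# Hypothetical 3x4 page: a few dark "ink" pixels on a light background.
page = [
    [200, 210, 220, 205],
    [20, 25, 210, 30],
    [205, 22, 200, 215],
]
print(binarize(page))
```

A single global threshold like this tends to struggle with the uneven lighting described earlier, which is one reason real pipelines often turn to local or adaptive thresholds instead.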

The pre-processing task involves the detection and removal of dead pixels. The other two significant pre-processing operations are dilation and erosion. Dilation expands objects: it can fill in small holes and connect disjoint text objects. Erosion shrinks text objects by etching away their boundaries.
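The two operations can be sketched on a tiny binary image in plain Python. This is an illustration under stated assumptions, not how any particular product implements them: the structuring element is a 3x3 square, pixels outside the image count as background, and the helper names and sample glyph are hypothetical.

```python
# Sketch only: binary dilation and erosion on a 0/1 image.
# Assumptions (not from the article): 3x3 square structuring element,
# pixels outside the image are treated as background (0).

def _window(img, r, c):
    """Return the nine values in the 3x3 window centred on (r, c)."""
    rows, cols = len(img), len(img[0])
    values = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            inside = 0 <= rr < rows and 0 <= cc < cols
            values.append(img[rr][cc] if inside else 0)
    return values


def dilate(img):
    """A pixel becomes 1 if ANY pixel in its 3x3 window is 1."""
    return [[1 if any(_window(img, r, c)) else 0
             for c in range(len(img[0]))]
            for r in range(len(img))]


def erode(img):
    """A pixel stays 1 only if ALL pixels in its 3x3 window are 1."""
    return [[1 if all(_window(img, r, c)) else 0
             for c in range(len(img[0]))]
            for r in range(len(img))]


# Hypothetical glyph: a ring of ink with a one-pixel hole in the middle.
glyph = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

# Dilation followed by erosion fills the hole but keeps the outline size.
closed = erode(dilate(glyph))
for row in closed:
    print(row)
```

Applying erosion alone to this glyph removes it entirely, because no pixel has a fully inked 3x3 neighborhood. That is the "etching away" effect: useful for removing specks, risky for thin strokes.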

Being Realistic with OCR in Document Processing

OCR, as of today, doesn’t have a standardized solution. Documents scanned under adverse conditions exhibit uneven illumination, low contrast, speckle noise, degraded character glyphs, rotational skew, and page warp. OCR performs poorly on such documents.

Manual pre-processing of such documents, which may run into large volumes, is prohibitively expensive. OCR-based document extraction can be 85% to 95% accurate, and these accuracy rates are sometimes even higher than what data-entry operators typically achieve. Often, such automated extraction systems are scrapped just because of the expectation that a machine-based system should be 100% accurate.
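The article does not say how the 85% to 95% figure is measured. One common definition is character-level accuracy: one minus the edit distance between the OCR output and a hand-verified reference, divided by the reference length. The sketch below assumes that definition; the function names and the sample strings are invented for illustration.

```python
# Sketch only: character-level OCR accuracy via Levenshtein edit distance.
# Assumption (not from the article): accuracy = 1 - edits / len(reference).

def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


def character_accuracy(reference, ocr_output):
    """Fraction of the reference recovered correctly, clamped to 0..1."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


# Hypothetical field: the OCR engine confuses 'I' with 'l' and '0' with 'O'.
reference = "Invoice No. 10482"
ocr_output = "lnvoice No. 1O482"
print(f"{character_accuracy(reference, ocr_output):.1%}")
```

Two wrong characters out of 17 already pull this field below 90%, and either one could make the invoice number unusable. Character accuracy and field-level usefulness are not the same thing, which is worth keeping in mind when reading headline accuracy rates.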

With today’s technology, 100% accurate OCR for a corpus of documents is nearly impossible. This is because OCR performs poorly at detecting and correcting errors in the document analysis phase. Nearly half of all conversion errors go unrecognized by the OCR program.

Such errors have to be corrected manually, and this is often a painstaking task that compromises the viability of data-capture systems. In this scenario, any single commercial off-the-shelf (COTS) application may not be able to meet an organization's document automation objectives.

Other than selecting one product over another by comparing features, there is little scope to influence document processing. In-house or custom-built solutions are not feasible for many organizations. As a result, operations and process owners may have to adjust internal policies, procedures, and document artifacts to get the solution to work. The lack of flexibility and adaptability of data-capture systems remains an area of concern.

Discover how we can help you with our Document Processing Capabilities by visiting us at Neutron – Intelligent Document Processing Platform.


