Document Analysis Is More Than Processing Text

Xiangqian Hu
Oct 16 · 5 min read
Photo by Wesley Tingey on Unsplash

It’s not hard to understand why businesses want technology to help handle their documents. Given the massive and growing volume of documents to process, machine help is inevitable. And machine analysis has already delivered efficiency gains in everything from processing medical records and insurance claims to detecting fraud in emails.

The success of any given document processing project, however, is far from preordained. Those who think of their documents simply as text may be caught off guard by a project’s difficulty and complexity.

For clarity, let’s define document analysis as analyzing and extracting information from digital documents that contain rich components such as text and graphs. The daunting challenge of building machines for this task spans many disciplines, including database systems, image processing, natural language processing (NLP), pattern recognition, and machine learning (ML).

Why Is Document Analysis So Hard?

Big Amount: Supervised Learning Requires Human Labeling

Big Diversity: Extreme Data Diversity Goes Way Beyond Text

Overall, documents are heterogeneous and unstructured. While this variety of content and format usually suits human comprehension and analysis, it can make documents difficult for machines to organize, analyze, and extract information from.

Take PDF text extraction as one example. A PDF file could contain digital or scanned text, characters that are tiny or run off the page, and unusual fonts. A single PDF file could be extremely long, with different layouts and languages. Moreover, a PDF file often contains more than just text.
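To make the digital-versus-scanned distinction concrete, here is a minimal Python sketch of one possible check: given the text that some upstream extractor recovered from each page, guess which pages probably have no usable text layer and would need OCR. The PageText type, the function names, and both thresholds are assumptions made for illustration, not part of any particular PDF library or of Infinia ML's pipeline.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class PageText:
    """Text recovered from one PDF page by an upstream extractor (hypothetical)."""
    page_number: int
    text: str


# Punctuation we treat as "normal" when judging whether text looks readable.
_COMMON_PUNCTUATION = set(".,;:!?()-'\"/%$&@#")


def needs_ocr(page: PageText, min_chars: int = 20,
              min_readable_ratio: float = 0.8) -> bool:
    """Guess whether a page lacks a usable text layer.

    Both thresholds are illustrative assumptions and would need tuning
    on real documents.
    """
    # Ignore whitespace so that layout padding does not count as content.
    stripped = "".join(page.text.split())
    if len(stripped) < min_chars:
        return True  # Almost no text: likely a scanned image.
    readable = sum(
        1 for ch in stripped if ch.isalnum() or ch in _COMMON_PUNCTUATION
    )
    # Mostly unreadable symbols can indicate a font-encoding problem.
    return readable / len(stripped) < min_readable_ratio


def pages_needing_ocr(pages: Iterable[PageText]) -> List[int]:
    """Return the page numbers that should be routed to OCR."""
    return [page.page_number for page in pages if needs_ocr(page)]
```

A heuristic like this is only a first filter. Because str.isalnum accepts letters from any script, it does not penalize non-English pages, but it cannot tell whether correctly encoded text is actually the text a human would see on the page.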

Big Complexity: Words, Formats, and Models

Document format diversity makes the analysis pipeline even more complicated. For instance, optical character recognition (OCR), a computer vision task, is needed to convert scanned documents into digital text, which is then passed to NLP models. As a result, the pipeline often requires multiple machine learning models to analyze a single document.
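The OCR-then-NLP flow described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the Document type and run_pipeline function are hypothetical, and the two models are passed in as plain callables so that any real OCR or NLP component could be substituted.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Document:
    """A document that is either scanned (image only) or digital (has text)."""
    doc_id: str
    is_scanned: bool
    image_bytes: Optional[bytes] = None
    text: Optional[str] = None


def run_pipeline(doc: Document,
                 ocr_model: Callable[[bytes], str],
                 nlp_model: Callable[[str], Any]) -> Any:
    """Route a document through OCR if needed, then through NLP.

    ocr_model and nlp_model stand in for real models; their signatures
    are assumptions made for this sketch.
    """
    text = doc.text
    if doc.is_scanned:
        if doc.image_bytes is None:
            raise ValueError(f"scanned document {doc.doc_id} has no image")
        # Stage 1: computer vision turns pixels into text.
        text = ocr_model(doc.image_bytes)
    if text is None:
        raise ValueError(f"document {doc.doc_id} has no text to analyze")
    # Stage 2: NLP consumes the text, whichever way it was obtained.
    return nlp_model(text)


# Stand-in models, for demonstration only.
def fake_ocr(image: bytes) -> str:
    return image.decode("utf-8")


def fake_nlp(text: str) -> dict:
    return {"n_tokens": len(text.split())}
```

Even in this toy form, the NLP stage inherits whatever mistakes the OCR stage makes, which is one reason a multi-model pipeline is harder to manage than a single model.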

This pipeline complexity further complicates data preprocessing and labeling as well as model development and management. High-quality processing and labeling requires people who can read and understand the language a document is written in. Data bias could be unintentionally introduced during this process, and this bias can be amplified further when multiple models are developed. These models are typically of different types, spanning different ML disciplines. Therefore, model auditing becomes another necessary component of a mature document analysis pipeline.
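One simple form of model auditing is to compare accuracy across slices of the data, such as document language, and flag any slice that lags well behind the overall number. The Python sketch below is an illustration only: the record layout, the function names, and the 0.1 gap threshold are assumptions, not a description of a specific auditing product.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One audit record: (group, predicted label, expected label).
Record = Tuple[str, str, str]


def accuracy_by_group(records: Iterable[Record]) -> Dict[str, float]:
    """Compute the accuracy separately for each group."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for group, predicted, expected in records:
        totals[group] += 1
        correct[group] += int(predicted == expected)
    return {group: correct[group] / totals[group] for group in totals}


def flag_underperforming(records: Iterable[Record],
                         max_gap: float = 0.1) -> List[str]:
    """Return groups whose accuracy trails overall accuracy by more than max_gap.

    The default gap is an arbitrary illustrative value.
    """
    records = list(records)
    if not records:
        return []
    overall = sum(pred == exp for _, pred, exp in records) / len(records)
    per_group = accuracy_by_group(records)
    return sorted(
        group for group, acc in per_group.items() if overall - acc > max_gap
    )
```

Running a check like this after every stage, rather than only on the final output, helps show where in the pipeline a gap first appears.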

How We Do Document Analysis At Infinia ML

Figure 1. Infinia ML Cloud Layer. Learn more about Infinia’s Approach to Machine Learning [VIDEO].

The middle layer, the Infinia ML Cloud Layer, contains our core technologies in four building blocks, which are designed to be cloud-native and cloud-aware. The blocks are interconnected and can be readily customized to meet different customers’ needs.

  1. The Cloud Infrastructure block handles data input/output, software development, deployment, system maintenance, security, and scalability. This powers our entire development cycle for coding, modeling, UI, and middle-tier business logic.
  2. The Library block is a mixture of open-source packages (such as scikit-learn and PyTorch) and our in-house ML technologies. We have built our data science experience and new ML ideas and methods into this reusable package, which speeds up our model development for customers.
  3. The Auditor block exists because AI/ML systems without auditing cannot be trusted. We have built our Auditor with user-friendly UIs to monitor model performance and audit machine learning pipelines.
  4. The Document Analysis block is our specialized ML application with UIs, which is designed to analyze documents, extract data, and display and review document information.

Analysis results are domain-specific and depend on customer needs. They could contain the retrieved documents from a search query, or they could be the extracted information from scanned documents such as addresses, phone numbers, company names, invoice amounts, and so on.
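As a deliberately simple illustration of what extracted information can look like, the Python sketch below pulls phone numbers and dollar amounts out of text with regular expressions. The patterns are assumptions that only cover US-style phone numbers and dollar amounts, and the sketch is not how a production system would necessarily do it: real pipelines typically rely on trained models because real documents are far messier.

```python
import re
from decimal import Decimal
from typing import Dict, List

# Illustrative patterns only: US-style phone numbers such as 919-555-0100
# and dollar amounts such as $1,250.00.
PHONE_RE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")
AMOUNT_RE = re.compile(r"\$\s?(\d[\d,]*(?:\.\d{2})?)")


def extract_fields(text: str) -> Dict[str, List]:
    """Extract phone numbers and dollar amounts from already-digitized text."""
    phones = PHONE_RE.findall(text)
    # Strip thousands separators before converting to an exact decimal.
    amounts = [Decimal(raw.replace(",", "")) for raw in AMOUNT_RE.findall(text)]
    return {"phones": phones, "amounts": amounts}


sample = (
    "Invoice from Acme Corp. Total due: $1,250.00. "
    "Questions? Call 919-555-0100."
)
result = extract_fields(sample)
```

Here the sample text and the company name in it are made up for the example. Patterns like these break quickly on OCR noise or international formats, which is part of why the results depend so heavily on the domain and the customer.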

We are also strong believers in keeping humans in the loop. The entire ML pipeline needs to be supervised by our domain experts, and their feedback can be fed back into our document analysis process.
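One common way to put humans in the loop is to accept confident predictions automatically and send the rest to an expert, whose corrections then become new labeled examples. The Python sketch below illustrates that idea under stated assumptions: the Prediction type, the function names, and the 0.9 threshold are hypothetical choices for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class Prediction:
    """A model output for one document (hypothetical structure)."""
    doc_id: str
    label: str
    confidence: float


def route(predictions: Iterable[Prediction],
          threshold: float = 0.9
          ) -> Tuple[List[Prediction], List[Prediction]]:
    """Split predictions into auto-accepted ones and ones needing review.

    The threshold is an illustrative assumption, not a recommended value.
    """
    accepted: List[Prediction] = []
    needs_review: List[Prediction] = []
    for prediction in predictions:
        if prediction.confidence >= threshold:
            accepted.append(prediction)
        else:
            needs_review.append(prediction)
    return accepted, needs_review


def apply_feedback(needs_review: Iterable[Prediction],
                   corrections: Dict[str, str]) -> List[Tuple[str, str]]:
    """Turn expert corrections into (doc_id, label) training examples.

    corrections maps a doc_id to the label the expert chose; reviewed
    documents without a correction are skipped.
    """
    return [
        (prediction.doc_id, corrections[prediction.doc_id])
        for prediction in needs_review
        if prediction.doc_id in corrections
    ]
```

The examples returned by apply_feedback could be added to the training set for the next model version, which is one way expert feedback makes its way back into the process.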

In conclusion, machine-driven document analysis is not easy in practice, and text analysis alone is not sufficient for machines to analyze documents. We hope that sharing our experience helps inspire new ideas and speeds up your document analysis processes. After all, we might never know when machines will learn by themselves, but we do know people always learn.

The author would like to thank James Kotecki for his valuable feedback on this blog post.

Xiangqian Hu

Written by

Director of Infinia ML Engineering. Machine Learning Lover.

Machine Learning in Practice

Practical insights for executives, managers, and project managers eager to deploy machine learning inside their company.
