Multi-Page Document Classification | Part-1
This article describes a novel approach to multi-page document classification, which leverages advanced machine learning and text analytics to solve one of the major challenges in the mortgage industry. This is Part 1 of my series. You can find links to the different parts of the series below.
- Part 1: Abstract, Introduction (Background and Problem Statement, Objective), Characteristics of Documents.
- Part 2: Solution Methodology (ML Classes, ML Engine, Post-Processing)
- Part 3: Solution Details (Data Preparation, Data Transformation), Training Pipeline (Text Vectorizer Doc2Vec, Machine Learning Classifier, Training Procedure)
- Part 4: Testing & Evaluation Pipeline, Solution Features, Conclusion
Abstract
Even in today's technological era, most business is still conducted through documents, and the amount of paperwork involved varies from industry to industry. Many of these industries need to go through scanned document images (which usually contain non-selectable text) to get the information for the key index fields they need to run their daily tasks.
To achieve this, the first major task is to index the different types of documents, which later helps in extracting information and metadata from a variety of complex documents. This blog post shows how advanced machine learning and NLP techniques can be leveraged to solve this major part of the puzzle, formally called Document Classification.
Introduction
In the mortgage industry, different companies perform mortgage loan audits of thousands of people.
Each individual audit is performed on an assortment of documents submitted as a bundle, which is called a Loan Package. A package is a collection of scanned pages, typically somewhere between 100 and 400 pages. Within the package there are multiple sub-components, each of which may consist of roughly 1 to 30 pages. These sub-components are called Documents or document classes. The following table represents this visually.
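The package structure described above can also be sketched as a small data model. This is only an illustration and not the data model used in the solution: the class names `Page`, `Document` and `LoanPackage`, the field names, and the example labels "Doc A" and "Doc B" are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    """One scanned page image plus whatever text OCR extracted from it."""
    page_number: int      # position of the page inside the package
    ocr_text: str = ""    # may be empty or noisy for poor scans


@dataclass
class Document:
    """A group of pages that belong to one document class."""
    doc_class: str        # hypothetical label, e.g. "Doc A"
    pages: List[Page] = field(default_factory=list)


@dataclass
class LoanPackage:
    """A bundle of documents submitted for one loan audit."""
    package_id: str       # hypothetical identifier
    documents: List[Document] = field(default_factory=list)

    def page_count(self) -> int:
        # Total number of pages across all documents in the package.
        return sum(len(doc.pages) for doc in self.documents)


def build_example_package() -> LoanPackage:
    # A toy package; real packages hold hundreds of pages.
    doc_a = Document("Doc A", [Page(1), Page(2), Page(3)])
    doc_b = Document("Doc B", [Page(4), Page(5)])
    return LoanPackage("pkg-001", [doc_a, doc_b])
```

In practice the grouping of pages into documents is not known in advance, since the package arrives as one unsorted stack of scanned pages. Recovering that grouping is exactly the classification problem this series addresses.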
Background and Problem Statement
Traditionally, Document Classification has been one of the major manual efforts in evaluating loan audits. Mortgage companies mostly outsource this work to third-party Business Process Outsourcing (BPO) companies, which carry it out using manual or partially automated classification techniques such as rule engines and template matching. The underlying problem with current implementations is that the BPO staff has to manually find and sort the documents present in the packages.
Some degree of automation has been achieved by a few third-party companies using keyword searches, regular expressions and similar techniques, but the accuracy and robustness of such solutions are questionable and the reduction in manual workload is still not satisfactory. Relying on keyword searches and regular expressions means that these solutions must account for every new document or document variation that appears, and new rules have to be added each time. This in itself becomes a manual effort, so only partial automation is achieved. There also remains a chance that the system identifies a document as "Doc A" when it is in fact "Doc B", because a common rule is present in both. Additionally, the system gives no degree of certainty for an identification. More often than not, manual verification is still required.
There are several hundred document types, so in order to classify documents the BPO staff needs a knowledge base of how a certain document looks and what the different variations of that document are. On top of that, as the manual workload grows, human error tends to increase.
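The ambiguity described above can be illustrated with a minimal sketch of a keyword and regular expression classifier. Everything in it is hypothetical: the rule table, the patterns, and the class names "Doc A" and "Doc B" are invented for this example and do not come from any real rule engine.

```python
import re
from typing import Dict, List

# Hypothetical rules: every pattern and class name is made up for illustration.
# Note that both classes share the "borrower" rule.
RULES: Dict[str, List[str]] = {
    "Doc A": [r"\bborrower\b", r"\bpromissory note\b"],
    "Doc B": [r"\bborrower\b", r"\bappraisal\b"],
}


def match_classes(page_text: str) -> List[str]:
    """Return every class that has at least one rule firing on the page text."""
    text = page_text.lower()
    return [
        doc_class
        for doc_class, patterns in RULES.items()
        if any(re.search(pattern, text) for pattern in patterns)
    ]


def classify_first_match(page_text: str) -> str:
    """Pick the first matching class, as a naive rule engine might."""
    matches = match_classes(page_text)
    return matches[0] if matches else "Unknown"


page = "The Borrower agrees to the appraisal value."
print(match_classes(page))         # both classes fire on the shared rule
print(classify_first_match(page))  # the first rule wins, regardless of the truth
```

Here the page mentions an appraisal, yet the shared "borrower" rule makes both classes match and the first one is returned. The output is also a bare label with no measure of certainty, and every new document variation would require another hand-written pattern in the rule table.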
Objective
The document classification solution should significantly reduce the manual human effort. It should achieve a higher level of accuracy and automation with minimal human intervention.
The solution approach discussed in this series of blogs is not limited to the mortgage industry. It can be applied wherever there are scanned document images that need to be sorted. A few of the possible industries are financial organizations, academia, research institutes and retail stores.
Characteristics of the documents
To build a solution pipeline, the first step is to understand the data and its characteristics. Since we have been working in the mortgage domain, we will describe the characteristics of the data we process in the mortgage industry.
Within a package there are many types of pages, but generally these can be categorized into three types:
Structured | Consistent forms and templates
Unstructured | Free text, with no formatting or tables
Semi-Structured | Hybrid of the above two, may have partial structure
In terms of documents, the following characteristics are observed in the data.
- The documents present in the packages are not in a consistent order. For example, in one package document "A" might come after document "B", and in another package it is the other way around.
- There are many variations of the same document class. One document class can have different-looking variations; for example, the page template or format of a document class "A" might change across different US states. Within the mortgage domain, these variations represent the same information but differ in formatting and content. In other words, if "cat" is a document, different breeds of cats would be the "variations".
- The documents have different kinds of scanning deformities, such as noise, 2D and 3D rotations, bad scan quality and wrong page orientation, which degrade OCR results for those documents.
In this blog, we have briefly discussed the different aspects of the problem. In the next blog, we will talk about the methodology used to implement this solution. The link is below.