Multi-Page Document Classification | Part-2

Qaisar Tanvir | Lead Data Scientist | Consultant

5 min read · Aug 2, 2021

This article describes a novel approach to multi-page document classification, which leverages machine learning and text analytics to solve one of the major challenges in the mortgage industry. This is Part 2 of my series. You can find links to the other parts of the series below.

Part 1: Abstract, Introduction (Background and Problem Statement, Objective), Characteristics of Documents.
Part 2: Solution Methodology (ML Classes, ML Engine, Post-Processing)
Part 3: Solution Details (Data Preparation, Data Transformation), Training Pipeline (Text Vectorizer Doc2Vec, Machine Learning Classifier, Training Procedure)
Part 4: Testing & Evaluation Pipeline, Solution Features, Conclusion

Solution Methodology

In this section, we explain at a high level how our solution pipeline works and how the components or modules fit together into an end-to-end pipeline. The following flow diagram shows the solution.

Since the goal is to identify the documents within a package, we first had to determine which characteristics make one document different from another. In our case, we decided that the text present in the document is the key, because intuitively humans tell documents apart the same way. The next challenge was to determine the location of each document within the package. For multi-page documents, the boundary pages (start and end) are the most significant, because the page range of a document can be identified from them.

Machine Learning Classes

In machine learning terms, we treated this as a classification problem in which we identify the first and last pages of each document. We divided our machine learning classes (ML classes) into three types:

First Page Classes: These classes contain the first pages of each document class and are responsible for identifying the start of a document.
Last Page Classes: These classes contain the last pages of each document class and are responsible for identifying the end of a document. They are created only for document classes that have samples with more than one page.
Other Class: This is a single class that contains the middle pages of all document classes combined. It helps the pipeline in the later stages by reducing the cases where a middle page of a document is classified as the first or last page of the same document. Such confusion is plausible because all pages of a document can share headers, footers, and templates. Having this class allows the model to learn more robust features.

The following diagram shows how these types of ML classes map onto a package and its documents.

Here A and B are the first-page classes of documents A and B, and A-last and B-last are the last-page classes of the same documents. All middle pages of any document class belong to the Other class.
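The labeling scheme above can be sketched in a few lines of Python. This is only an illustration, not the code used in the project: the function names `label_pages` and `label_package` are hypothetical, and labeling the only page of a single-page document with the first-page class is an assumption based on the rule that last-page classes exist only for multi-page documents.

```python
from typing import List, Tuple

OTHER = "Other"  # single class shared by the middle pages of every document


def label_pages(doc_class: str, num_pages: int) -> List[str]:
    """Return one ML class label per page of a single document."""
    if num_pages < 1:
        raise ValueError("a document needs at least one page")
    if num_pages == 1:
        # Assumption: a single-page document only yields a first-page sample.
        return [doc_class]
    middle = [OTHER] * (num_pages - 2)
    return [doc_class] + middle + [f"{doc_class}-last"]


def label_package(documents: List[Tuple[str, int]]) -> List[str]:
    """Label every page of a package given (document class, page count) pairs."""
    labels: List[str] = []
    for doc_class, num_pages in documents:
        labels.extend(label_pages(doc_class, num_pages))
    return labels


# Hypothetical package: a 4-page document A followed by a 3-page document B.
print(label_package([("A", 4), ("B", 3)]))
```

Each label in the printed list corresponds to one page of the package, in page order.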

Machine Learning Engine

Once the ML classes are defined, the next step is to prepare the dataset for training the Machine Learning Engine (data preparation is discussed in detail in later sections). The following diagram explains the inner workings of the Machine Learning Engine and gives a more technical view of the solution pipeline.

Let's describe the phases of the solution step by step.

Step 1

The package (in PDF format) is split into individual pages (images).

Step 2

The individual pages are processed by an OCR (Optical Character Recognition) engine, which extracts the text from each image and generates a text file. We used a state-of-the-art OCR engine to produce the text in our case. There are also many free OCR offerings that can be used in this step.

Step 3

The text corresponding to each page is then passed to the Machine Learning Engine, where the text vectorizer (Doc2Vec) generates its feature vector representation, which is essentially a list of floats.

Step 4

The feature vectors are then passed to the classifier (Logistic Regression), which predicts a class for each feature vector. The predicted class is one of the ML classes discussed previously (first page, last page, or Other).

Additionally, the classifier returns the confidence scores for all the ML classes (the rightmost section of the diagram). For example, let (D1, D2, ...) be the ML classes; then the results for a single page may look like the following.
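As a rough illustration of how per-class confidence scores relate to the predicted class, the sketch below assumes a multinomial (softmax) formulation of logistic regression. The raw scores are made-up values for a single hypothetical page, and the names `softmax` and `predict` are not taken from the original project.

```python
import math
from typing import Dict, Tuple


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Turn raw per-class scores into confidences that sum to 1."""
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {name: math.exp(value - peak) for name, value in scores.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


def predict(scores: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Return the most confident class together with all confidences."""
    confidences = softmax(scores)
    best = max(confidences, key=confidences.get)
    return best, confidences


# Hypothetical raw scores for one page; D1, D2 are first-page classes.
raw_scores = {"D1": 2.1, "D1-last": 0.3, "D2": -0.5, "D2-last": -1.2, "Other": 0.9}
predicted, confidences = predict(raw_scores)

print("predicted class:", predicted)
for name, confidence in confidences.items():
    print(f"{name}: {confidence:.3f}")
```

The predicted class is simply the class with the highest confidence; the full set of confidences is kept because the post-processing stage uses it as well.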

Post Processing

Once the whole package is processed, we use the predictions to identify the boundaries of the documents. The results contain the predicted class and the confidence score of the prediction for every page of the package. See the following table.

The following simple algorithm identifies the document boundaries from the output of the Machine Learning Engine.

Document Range Identification | Post-Processing Algorithm
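The exact steps of the algorithm are given in the figure. The sketch below is only one plausible way to turn per-page predictions into document ranges, written under several assumptions that do not come from the original: low-confidence boundary predictions are treated as Other, an unmatched last page is ignored, a document without a detected last page ends just before the next first page, and the names `find_document_ranges`, `min_confidence`, and `single_page_classes` are hypothetical.

```python
from typing import FrozenSet, List, NamedTuple, Optional, Tuple

OTHER = "Other"
LAST_SUFFIX = "-last"


class DocumentRange(NamedTuple):
    doc_class: str
    start: int  # 1-based page number, inclusive
    end: int    # 1-based page number, inclusive


def find_document_ranges(
    predictions: List[Tuple[str, float]],
    min_confidence: float = 0.5,
    single_page_classes: FrozenSet[str] = frozenset(),
) -> List[DocumentRange]:
    """Scan (predicted class, confidence) pairs in page order and emit ranges."""
    ranges: List[DocumentRange] = []
    open_class: Optional[str] = None
    open_start = 0

    for page, (label, confidence) in enumerate(predictions, start=1):
        if confidence < min_confidence:
            label = OTHER  # assumption: distrust weak boundary predictions
        if label == OTHER:
            continue
        if label.endswith(LAST_SUFFIX):
            # Close the open document only if this last page matches its class.
            if open_class is not None and label == open_class + LAST_SUFFIX:
                ranges.append(DocumentRange(open_class, open_start, page))
                open_class = None
            continue
        # A first-page class: close any document that is still open.
        if open_class is not None:
            ranges.append(DocumentRange(open_class, open_start, page - 1))
            open_class = None
        if label in single_page_classes:
            ranges.append(DocumentRange(label, page, page))
        else:
            open_class, open_start = label, page

    if open_class is not None:
        # Package ended before a last page was seen.
        ranges.append(DocumentRange(open_class, open_start, len(predictions)))
    return ranges


# Hypothetical predictions for an 8-page package.
pages = [("A", 0.97), ("Other", 0.88), ("A-last", 0.91), ("B", 0.95),
         ("Other", 0.80), ("Other", 0.42), ("B-last", 0.90), ("C", 0.93)]
for found in find_document_ranges(pages, single_page_classes=frozenset({"C"})):
    print(found)
```

Closing an unfinished document at the page before the next first page is a design choice of this sketch; a stricter variant could instead flag such ranges for manual review.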

Using the post-processing algorithm, multi-page documents can be identified within packages. In this blog, we briefly discussed the steps of our solution pipeline. We described the methodology in an abstract yet technical manner, so that it is easy to understand how the solution is laid out. In the next blog, we will go deeper into the components of the solution, including the data preparation strategy and a closer look at the machine learning models. The link is below.

Next Blog

Part 3: Solution Details (Data Preparation, Data Transformation), Training Pipeline (Text Vectorizer Doc2Vec, Machine Learning Classifier, Training Procedure)