Multi-Page Document Classification | Part-4

--

This article describes a novel Multi-Page Document Classification solution approach, which leverages advanced machine learning and textual analytics to solve one of the major challenges in the mortgage industry. This is part 4 of our blog series; you can find links to the other parts below.

In this blog, we will discuss the testing and evaluation techniques, along with the key factors to understand when adopting a machine-learning text classification pipeline. Please refer to the previous parts of this series if anything is unclear.

Testing & Evaluation Pipeline

Once the pipeline is trained (which includes both the Doc2Vec model and the classifier), it is used to predict the document classes for the testing data split. The following flow diagram shows this process.

The transformed testing data is passed through the trained Doc2Vec model, which infers vector representations for all the pages present in the testing data. These vectors are then passed to the classifier, which returns the predicted class and a confidence score for every ML class.
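As a concrete illustration, here is a minimal sketch of this inference step, assuming the Doc2Vec model was trained with gensim and the classifier with scikit-learn (the file paths and the predict_page helper are hypothetical):

```python
from gensim.models.doc2vec import Doc2Vec
import joblib

doc2vec_model = Doc2Vec.load("doc2vec.model")   # hypothetical path
classifier = joblib.load("classifier.joblib")   # hypothetical path

def predict_page(page_tokens):
    """Infer a vector for one page, then classify it."""
    vector = doc2vec_model.infer_vector(page_tokens)       # expects a token list
    probabilities = classifier.predict_proba([vector])[0]  # one score per ML class
    predicted_class = classifier.classes_[probabilities.argmax()]
    return predicted_class, probabilities

pred, scores = predict_page("this deed of trust is made between ...".split())
```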

For the detailed evaluation of the Machine Learning Engine, we generate an Excel file from the results. The following table shows the columns and the information generated in the testing phase.

Evaluation Excel file generated in the testing phase.

Page Text, File Name, Page Number: These are the same columns we had in the data preparation stage; they are carried over as-is from the source dataset.

ground, pred: ground shows the actual ML class of the page, while pred shows the ML class predicted by the ML engine.

Trained classes columns: Columns in this section represent the ML classes the model was trained on, along with the confidence scores for those classes.

MaxProb, Range: MaxProb shows the maximum confidence score achieved across the trained-classes columns (see the red colored text), while Range shows the bucket that the MaxProb value falls into.
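To make the structure concrete, here is a sketch of how such a sheet could be assembled with pandas (column names follow the table above; the bucket boundaries and the build_evaluation_sheet helper are assumptions):

```python
import pandas as pd

def build_evaluation_sheet(rows, class_names, out_path="evaluation.xlsx"):
    """rows: list of dicts with Page Text, File Name, Page Number,
    ground, pred, and one confidence score per trained class."""
    df = pd.DataFrame(rows)
    score_cols = list(class_names)               # the trained-classes section
    df["MaxProb"] = df[score_cols].max(axis=1)   # best confidence score per page
    # Bucket MaxProb into ranges such as (0.8, 0.9], (0.9, 1.0]
    df["Range"] = pd.cut(df["MaxProb"],
                         bins=[0.0, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]).astype(str)
    df.to_excel(out_path, index=False)
    return df
```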

Currently, there are three levels of results evaluation:

  1. Cumulative Error Evaluation Metric
  2. Confusion Matrix
  3. Class level confidence scores analysis

Cumulative Error Evaluation Metric

This evaluation calculates two metrics, Accuracy and F1-Score. For more details check this blog. These provide an abstract insight into the overall quality of the pipeline. The scores range from 0 to 100, where a higher number indicates better document classification. In our experiments, we obtained the accuracy and F1-score shown below.
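For reference, a minimal sketch of how these two metrics can be computed with scikit-learn, using toy labels in place of the real ground and pred columns (the weighted F1 average is an assumption to handle class imbalance):

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = ["1330", "1330", "1008"]   # ground column (toy example)
y_pred = ["1330", "1008", "1008"]   # pred column (toy example)

accuracy = accuracy_score(y_true, y_pred) * 100
# Weighted average accounts for unequal class sizes across document classes
f1 = f1_score(y_true, y_pred, average="weighted") * 100
print(f"Accuracy: {accuracy:.2f}  F1-Score: {f1:.2f}")
```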

Confusion Matrix

A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known.

Essentially, it makes it easier to understand:

  • Which classes are not performing well?
  • What is the accuracy score of an individual class?
  • Which classes are confused with each other?

The plot below shows the confusion matrix we generated after testing (it is an embedded link, so click it to view the interactive version).

Confusion Matrix Plot

Values on both the X-axis (true labels) and the Y-axis (predicted labels) represent the document classes we trained on. The numbers within the cells show the percentage of the testing dataset belonging to the class indicated on the left and bottom.

The values on the diagonal represent the percentage of data where the predicted class was correct; higher is better. For example, 0.99 means 99% of the testing data for that particular class was predicted correctly. All other cells show wrong predictions, and their values show how often one class was confused with another.

As can be seen, the model is able to correctly classify most of the ML classes with more than 90% accuracy.
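For reproducibility, here is a sketch of how such a row-normalized confusion matrix can be produced with scikit-learn, using toy labels (note: scikit-learn's default layout places true labels on the Y-axis, the transpose of the plot above):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

y_true = ["1330", "1330", "1008", "1008"]   # ground labels (toy data)
y_pred = ["1330", "1008", "1008", "1008"]   # predicted labels (toy data)

# normalize="true" turns each row into per-class fractions, so the
# diagonal reads as "share of this class predicted correctly"
disp = ConfusionMatrixDisplay.from_predictions(
    y_true, y_pred, normalize="true", values_format=".2f"
)
disp.figure_.savefig("confusion_matrix.png")
```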

Class level confidence scores analysis

Although the confusion matrix gives details about class confusions, it does not represent the confidence scores of the predictions. In other words, it does not answer:

  • “How confident is the model when making a prediction about a document class?”

What is the need?

In the ideal situation, the model should have high confidence when predicting the correct ML class, and low confidence when predicting a wrong one. But this behavior is not guaranteed; it depends on many factors, e.g. the performance of a particular class, actual domain similarities between document classes, etc. To evaluate whether this behavior exists, and whether confidence scores can be a useful indicator of true predictions, we devised an additional evaluation approach.

Approach

Since the task is to reduce manual work, it was decided that only predictions with high confidence will be accepted. This way, wrong predictions are filtered out, because they rarely carry high confidence. The rest of the documents and pages will be verified manually by the BPO.
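In code, this routing rule is simple; the sketch below is illustrative only (the 0.90 threshold and the route_prediction helper are assumptions, with the actual threshold chosen by the analysis that follows):

```python
THRESHOLD = 0.90  # assumption; the optimal value is found in the next step

def route_prediction(predicted_class, max_prob, threshold=THRESHOLD):
    """Return the accepted class, or None to flag the page for manual review."""
    if max_prob >= threshold:
        return predicted_class   # confident enough: accept automatically
    return None                  # below threshold: send to the BPO for review

assert route_prediction("1330", 0.97) == "1330"   # accepted
assert route_prediction("1330", 0.62) is None     # manual review
```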

Threshold

In this step, the confidence scores of the classes are calculated and a threshold is defined. The threshold is a percentage (e.g. 80% or 75%) decided based on the following condition:

  • What is the confidence score value at which wrong predictions become insignificant in number while true predictions remain high? In other words, it is about finding the sweet spot.

The following line plot shows the true positives (blue line) and false positives (red line).

The X-axis shows the ML classes, and the Y-axis shows the percentage of the testing data for a particular class that is covered by true positives or false positives.

For example, in the case of ML class 1330, true predictions cover almost 70% of the whole testing dataset for that class. This means the ML engine predicted 70% of the data correctly with a confidence score greater than 90%. Moreover, the false positives covered only 1% of the testing dataset, meaning only 1% of the test data was predicted wrongly with a confidence score higher than 90%.

Because of the threshold, we sometimes lose true positives (when the confidence score falls below the threshold), but that is not as bad as false positives with high confidence. Such pages/documents are simply verified manually.

The previous plot was made with a threshold of 90% and above. In the following plot, the threshold is 80% and above. Notice that even when the threshold is dropped to 80%, the false positives do not increase, while the true positives increase significantly. This means that, between the 90% and 80% thresholds, 80% is optimal.

While doing this analysis, all the levels are checked, i.e. 50%, 60%, 70%, and so on, and the most optimal threshold is chosen using this evaluation metric.
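A sketch of such a sweep is shown below, assuming the evaluation sheet is loaded as a pandas DataFrame; the exact true/false-positive coverage definitions are my reading of the plots above:

```python
import pandas as pd

def coverage_by_threshold(df, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """df needs the ground, pred and MaxProb columns of the evaluation sheet."""
    records = []
    for t in thresholds:
        confident = df[df["MaxProb"] >= t]           # predictions above threshold
        for cls in df["ground"].unique():
            n_cls = (df["ground"] == cls).sum()      # test pages of this class
            pred_cls = confident[confident["pred"] == cls]
            tp = (pred_cls["ground"] == cls).sum() / n_cls   # confident & correct
            fp = (pred_cls["ground"] != cls).sum() / n_cls   # confident & wrong
            records.append({"threshold": t, "class": cls, "tp": tp, "fp": fp})
    return pd.DataFrame(records)
```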

Solution Features

  • Fast Predictions | The classification time for one page is under ~300 ms. Even including OCR time, one page can be classified well under 1 second. Moreover, if multi-processing is adopted, this time can be reduced even further.
  • High Accuracy | The current solution pipeline is able to identify and classify documents with high accuracy and high confidence. In most of the classes we get more than 95% accuracy.
  • Labeled Data Requirements | Within our experiments, we have observed that the pipeline can work well with at most 300 samples per document class (as in the experiment we discussed in these blogs). However, this depends on the variation and type of the document class. Moreover, we see accuracy and confidence scores increase with larger sample counts.
  • Confidence Score Threshold | The pipeline provides prediction confidence scores, which enables threshold tuning and allows trading off true positives against false positives.
  • Multi-Processing | The Doc2Vec implementation allows for multi-processing, and our data transformation scripts are highly parallelized; see the sketch after this list.
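A minimal sketch of both hooks, assuming gensim's Doc2Vec (whose workers parameter controls training threads) and the standard library's multiprocessing for page transforms; transform_page and the sample pages are illustrative stand-ins:

```python
from multiprocessing import Pool, cpu_count
from gensim.models.doc2vec import Doc2Vec

def transform_page(page_text):
    """Stand-in for the real page cleaning/tokenization steps."""
    return page_text.lower().split()

if __name__ == "__main__":
    # gensim's Doc2Vec trains on several worker threads in parallel
    model = Doc2Vec(vector_size=300, workers=cpu_count())
    pages = ["DEED OF TRUST made this day ...", "PROMISSORY NOTE ..."]
    with Pool(processes=cpu_count()) as pool:
        tokenized_pages = pool.map(transform_page, pages)
```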

Conclusion

In this series of blogs, we briefly discussed the core challenges faced by businesses and industries that work with scanned documents, especially in the mortgage domain. We talked about the data collection, preparation, and transformation steps. We adopted a novel approach: by leveraging advanced machine learning and neural network algorithms, we used the textual information present in documents to learn distinguishing patterns and aspects of each document. We discussed the different components of the solution in detail, and how these components combine to build a solution pipeline.

Machine learning and natural language processing have been doing wonders in many fields, and we have seen first-hand how they helped reduce manual effort and automate the task of document classification. The solution is not only fast, but also very accurate.

Because of the sensitive nature of the data used in this process, the code base is not available. I will rework the codebase on some dummy data, which will allow me to upload it to my GitHub. Please follow me on GitHub for further updates, and check out some of my other projects ;)
