How to Classify Mortgage Documents Using OCR & AI

And how to apply it in mortgage operations

Published in OpsFlow · 8 min read · Dec 17, 2023

Advances in AI & OCR have made it possible to automate mortgage document classification with high accuracy. Let’s see what automated mortgage document classification means for your mortgage operations and how to implement it.

In this post, we’ll cover:

What Automated Mortgage Document Classification Means
How You Can Use Automated Document Classification in Mortgage Operations
How to Automate Mortgage Document Classification
How to Implement Automated Document Classification Workflow
What AI & OCR Providers are Available for Mortgage Document Classification

What Automated Mortgage Document Classification Means

Mortgage document classification is the process of identifying the type of each mortgage document and its boundaries within a single file.

Given a PDF file as input, an automated mortgage document classifier will provide you with:

List of the documents contained in the PDF file (e.g. W-2, URLA, etc.)
Specific location of each document within the file (e.g. URLA: pages 4–14)
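
To make that output concrete, here is a minimal sketch of how it could be represented in code. The class name `ClassifiedDocument` and its fields are my own assumptions for illustration, not the output format of any particular provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassifiedDocument:
    """One document found inside a larger PDF file (hypothetical structure)."""
    doc_type: str    # e.g. "W-2" or "URLA"
    start_page: int  # first page of the document, 1-based, inclusive
    end_page: int    # last page of the document, 1-based, inclusive

    @property
    def page_count(self) -> int:
        return self.end_page - self.start_page + 1


# Hypothetical result for a 14-page file: a W-2 followed by a URLA.
result = [
    ClassifiedDocument("W-2", 1, 3),
    ClassifiedDocument("URLA", 4, 14),
]

for doc in result:
    print(f"{doc.doc_type}: pages {doc.start_page}-{doc.end_page} ({doc.page_count} pages)")
```

Real classifiers usually return extra fields as well (for example a confidence score), but type plus page range is the core of it.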

An example of document classification is the loan processor reviewing files, splitting them into individual documents, and organising them in the loan origination system (LOS).

In this context, the loan processor acts as a human document classifier.

Automated document classification is the same process but done by software instead of humans.

👉 Side note: Document Indexing vs. Document Classification

Sometimes, these terms may be used interchangeably, but they differ in what they mean.

Classification is the process of identifying the type of each mortgage document and determining its boundaries within a single file. The goal is to understand what documents are in a file and where they are located.

Indexing is the process of organising these documents within a storage system for easy retrieval. This process typically includes splitting files into individual documents, appropriately renaming them, and storing them in the correct folders.
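
As a rough illustration of the difference, indexing takes the classification result and turns it into a name and a location. The sketch below assumes a made-up folder layout (`loans/<loan id>/<document type>/`) and a hypothetical helper called `index_path`; your LOS or file storage will have its own conventions.

```python
from pathlib import PurePosixPath


def index_path(loan_id: str, doc_type: str, start_page: int, end_page: int) -> PurePosixPath:
    """Build a storage path for one classified document (hypothetical layout)."""
    safe_type = doc_type.lower().replace(" ", "_")
    file_name = f"{loan_id}_{safe_type}_p{start_page}-{end_page}.pdf"
    return PurePosixPath("loans") / loan_id / safe_type / file_name


print(index_path("LN-1001", "URLA", 4, 14))
# loans/LN-1001/urla/LN-1001_urla_p4-14.pdf
```

Classification answers "what is it and where is it"; a function like this answers "what do we call it and where do we put it".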

How You Can Use Automated Document Classification in Mortgage Operations

Humans and software systems process mortgage documents differently depending on the document type.

For example, when a loan officer (LO) reviews bank statements, they look for one kind of information. When the same LO reviews the credit report, they look for another.

To process mortgage documents effectively, humans and software systems must:

Know what documents are available in the borrower file
Have documents organised for easy retrieval
Know which document they are currently processing

But the problem is that it’s common for a single PDF file to contain multiple documents.

For instance, a correspondent loan package can easily have 15+ different documents in a single 100+ page PDF file, which makes it difficult to process effectively.

An AI mortgage document classifier can provide information about the documents available and their boundaries within unclassified files like correspondent loan packages.

This enables software systems to process these documents effectively and automate:

Splitting files into documents
Indexing documents in your LOS or other file storage
Identifying missing documents on file
Extracting structured data from the documents
Detecting fake documents

The role of Automated Document Classification in Mortgage Operations is to enable effective document processing by providing information about documents available and their boundaries within unclassified files.

How to Automate Mortgage Document Classification

1. Pull Files from Upstream Integration

Your document classification system must first receive the files to classify them.

Thus, the process begins with your system pulling files for classification from various sources.

Common sources include:

Loan Origination Systems (LOS)
Emails
FTP Servers
Dropbox Folders
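
Each source needs its own connector, but they all end up doing the same thing: handing your system a list of files to classify. As a minimal sketch, here is the simplest possible case, a local inbox folder standing in for any of the sources above. The function name `pull_pdf_files` is hypothetical.

```python
from pathlib import Path


def pull_pdf_files(inbox: Path) -> list[Path]:
    """Return the PDF files waiting in a local inbox folder, sorted by path."""
    return sorted(
        path
        for path in inbox.iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Calling `pull_pdf_files(Path("inbox"))` would return every PDF in an `inbox` folder and ignore everything else. A connector for an LOS, mailbox or FTP server would replace the folder listing with calls to that system.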

2. Classify Documents In Each File with OCR & AI

Once your system has files, run each file through the document classifier.

As an output, you should have the following for each file:

List of the documents within the PDF file (e.g. W-2, URLA, etc.)
Location of each document within the file (e.g. URLA: pages 4–14)
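
One common way to get from pages to documents, assuming your classifier labels each page individually, is to merge consecutive pages that share a label. The sketch below shows that idea only; `pages_to_documents` is a hypothetical helper, and the per-page labels would come from whatever OCR & AI model you use.

```python
from itertools import groupby


def pages_to_documents(page_labels: list[str]) -> list[tuple[str, int, int]]:
    """Merge consecutive pages with the same label into (doc_type, first_page, last_page)."""
    documents = []
    page = 1  # pages are numbered from 1
    for doc_type, run in groupby(page_labels):
        length = len(list(run))
        documents.append((doc_type, page, page + length - 1))
        page += length
    return documents


labels = ["W-2"] * 3 + ["URLA"] * 11
print(pages_to_documents(labels))
# [('W-2', 1, 3), ('URLA', 4, 14)]
```

Note the limitation: two documents of the same type that sit back to back (say, two bank statements) would be merged into one. Handling that needs a model that also predicts where a document starts, which this sketch does not attempt.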

3. Review and Correct Classification

Sometimes, ML can’t accurately classify documents or identify their boundaries.

In this case, we need to loop in humans to review the classification and correct it if it is wrong.

Usually, AI document processing products offer out-of-the-box Human-In-The-Loop (HITL) interfaces to handle this workflow.
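
Deciding what goes to a human usually comes down to a confidence threshold. Here is a minimal sketch, assuming each prediction carries a `confidence` score between 0 and 1. The 0.90 cut-off and the dictionary keys are assumptions for illustration; you would tune the threshold on your own data.

```python
REVIEW_THRESHOLD = 0.90  # hypothetical cut-off, to be tuned on your own data


def route_predictions(predictions: list[dict], threshold: float = REVIEW_THRESHOLD):
    """Split predictions into (auto_accepted, needs_review) by confidence."""
    auto_accepted, needs_review = [], []
    for prediction in predictions:
        if prediction["confidence"] >= threshold:
            auto_accepted.append(prediction)
        else:
            needs_review.append(prediction)
    return auto_accepted, needs_review


predictions = [
    {"doc_type": "W-2", "pages": (1, 3), "confidence": 0.98},
    {"doc_type": "URLA", "pages": (4, 14), "confidence": 0.62},
]
accepted, review_queue = route_predictions(predictions)
print([p["doc_type"] for p in review_queue])  # ['URLA']
```

A lower threshold sends fewer documents to review but lets more mistakes through, so it is a trade-off between reviewer time and accuracy.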

4. Split Files into Documents

After review, you have accurate data about the documents in each file and where they’re located.

Most downstream integrations consume single-document files, but your documents are still in the original files at this step.

So, the next step is to split the original files into documents.

Side note: A more accurate name for files containing a single document would be single-document files. But for the sake of simplicity, I refer to them as documents.
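
Splitting itself is mostly page arithmetic. The sketch below uses the open-source `pypdf` library as one possible tool (an assumption on my part; any PDF library would do), and takes documents as hypothetical `(doc_type, start_page, end_page)` tuples with 1-based, inclusive page numbers.

```python
from pathlib import Path


def zero_based_pages(start_page: int, end_page: int) -> range:
    """Convert a 1-based inclusive page span into 0-based page indices."""
    if start_page < 1 or end_page < start_page:
        raise ValueError(f"invalid page span: {start_page}-{end_page}")
    return range(start_page - 1, end_page)


def split_pdf(source_path: str, documents: list[tuple[str, int, int]], out_dir: str) -> list[str]:
    """Write each (doc_type, start_page, end_page) span to its own PDF file."""
    # Imported here so the page arithmetic above works without pypdf installed.
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    reader = PdfReader(source_path)
    written = []
    for doc_type, start_page, end_page in documents:
        writer = PdfWriter()
        for index in zero_based_pages(start_page, end_page):
            writer.add_page(reader.pages[index])
        target = Path(out_dir) / f"{doc_type}_p{start_page}-{end_page}.pdf"
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(str(target))
    return written
```

For the earlier example, `split_pdf("package.pdf", [("W-2", 1, 3), ("URLA", 4, 14)], "out")` would produce two files in an existing `out` folder. The file naming here is arbitrary.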

5. Push Data into Downstream Integrations

Once you have a list of single-document files, you can feed this data into other systems to automate your mortgage operations.

Here are some common destinations & automations:

Document Indexing System → sort, rename and save docs in the right places in the LOS
Automated Underwriting System → identify missing documents on file
Data Extraction System → extract structured data from the documents
Document Fraud Detection System → identify fake documents

After document classification, downstream integrations have the data to process each document effectively.

Some systems (e.g. Data Extraction, Fraud Detection) will process each document separately, while others (Indexing, Underwriting) will handle them in bulk.
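
As an example of a bulk check, identifying missing documents is a simple set comparison once you know what is in the file. The required list below is a made-up checklist for illustration, not an actual underwriting requirement.

```python
# Hypothetical checklist; the real one depends on the loan program.
REQUIRED_DOC_TYPES = frozenset({"URLA", "W-2", "Bank Statement", "Credit Report"})


def missing_documents(found_doc_types: list[str], required=REQUIRED_DOC_TYPES) -> list[str]:
    """Return the required document types that were not found, in alphabetical order."""
    return sorted(set(required) - set(found_doc_types))


found = ["URLA", "W-2", "W-2"]
print(missing_documents(found))  # ['Bank Statement', 'Credit Report']
```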

How to Implement Automated Document Classification Workflow

Below is one way to approach building the automated document classification workflow outlined above.

1. Understand What Document Types and Integrations You Work With

Start by defining where you need classified documents and why.

Then, make a list of the document types you need to classify.

Once you have a list, determine where the unclassified files will come from.

You should have:

List of document types to classify
List of upstream integrations
List of downstream integrations

2. Get Document Classification Model

The next step is to get a model that will be able to classify the document types you defined in the previous step.

To get this model, you have 2 options:

Train your own model (for example, with Google Document AI, Amazon Textract, Azure Form…)
Rent a pre-trained model (for example, DocSumo, Super.ai, Ocrolus, etc.)

You can find more details about the differences between these options in the section below.

By the end of this step, you should have an ML model that can classify the document types you have.

3. Piece it all Together

Once you have a model, the next step is implementing the document classification workflow.

Connect to upstream integration to get the files
Feed unclassified files into the document classification model
Split files into documents (single-document files)
Push document list and documents into the downstream integration
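
The wiring can be sketched as a small function that takes each step as a plug-in. Everything here is hypothetical: `run_workflow` and its four parameters are names I made up, and each callable would wrap your real connector, model or integration.

```python
from typing import Callable, Iterable

Document = tuple[str, int, int]  # (doc_type, start_page, end_page)


def run_workflow(
    pull: Callable[[], Iterable[str]],
    classify: Callable[[str], list[Document]],
    split: Callable[[str, list[Document]], list[str]],
    push: Callable[[str, list[Document], list[str]], None],
) -> int:
    """Run pull -> classify -> split -> push for every file; return files processed."""
    processed = 0
    for file_path in pull():
        documents = classify(file_path)
        document_files = split(file_path, documents)
        push(file_path, documents, document_files)
        processed += 1
    return processed


# Stub steps standing in for real integrations.
pushed = []
count = run_workflow(
    pull=lambda: ["package.pdf"],
    classify=lambda path: [("W-2", 1, 3), ("URLA", 4, 14)],
    split=lambda path, docs: [f"{t}_p{s}-{e}.pdf" for t, s, e in docs],
    push=lambda path, docs, files: pushed.append((path, files)),
)
print(count, pushed)
# 1 [('package.pdf', ['W-2_p1-3.pdf', 'URLA_p4-14.pdf'])]
```

Keeping the steps as separate callables makes it easy to start with stubs, swap in a real classifier later, and add a review step between classify and split.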

By the end of this step, you should have an end-to-end document classification workflow, from getting raw files to pushing classified documents into downstream integrations.

4. Review, Correct and Up-train

The last step is to fine-tune and up-train your models to improve accuracy.

That’s especially true for classifying mortgage documents, as few providers have pre-trained models for the mortgage industry.

So unless you find a provider that already has pre-trained models for every document type you need to support, there will be a period where you’ll need to invest more time into up-training.

The process will involve reviewing and correcting document classifications that have low confidence.

You can use your own workforce or self-managed labelling services from providers like Ocrolus and Super.ai.

By the end of this step, you should have a document classification system that processes most files with high accuracy. Only in rare cases should humans need to step in and correct classifications that have low confidence.
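
To know where up-training is worth the effort, it helps to measure accuracy per document type from the corrections your reviewers make. A minimal sketch, assuming you log each review as a `(predicted_type, corrected_type)` pair (a hypothetical format):

```python
from collections import defaultdict


def accuracy_by_type(reviewed: list[tuple[str, str]]) -> dict[str, float]:
    """For each true document type, the share of documents the model labelled correctly."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for predicted, corrected in reviewed:
        totals[corrected] += 1
        if predicted == corrected:
            correct[corrected] += 1
    return {doc_type: correct[doc_type] / totals[doc_type] for doc_type in totals}


# Made-up review log: one W-2 was mislabelled as a URLA.
reviewed = [("W-2", "W-2"), ("W-2", "W-2"), ("URLA", "W-2"), ("URLA", "URLA")]
scores = accuracy_by_type(reviewed)
worst_first = sorted(scores, key=scores.get)
print(worst_first)  # ['W-2', 'URLA']
```

The document types at the front of that list are the ones that need more labelled examples first.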

What AI & OCR Providers are Available for Mortgage Document Classification

Quite a few AI document-classification products & tools are available on the market.

Their main difference is the degree to which they work for mortgage documents out of the box.

And it comes down to how many steps of the 5-step process they cover:

Do they have upstream integrations with your mortgage software?
Do they have downstream integrations with your mortgage software?
Do they have a pre-trained model to classify mortgage documents?
Do they automatically split files into documents?
Do they have human-in-the-loop services included?

Some of the solutions cover all 5 steps. Other solutions cover none.

The less customisation you need, the higher the cost per document you can expect.

The more you invest to get it working, the lower the cost per document.
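
This trade-off can be put into a simple break-even calculation: total cost is the one-off setup cost plus the per-document cost times volume. The figures below are purely illustrative and are not real provider pricing.

```python
def total_cost(setup_cost: float, cost_per_document: float, documents: int) -> float:
    """Total spend = one-off setup cost + per-document cost x volume."""
    return setup_cost + cost_per_document * documents


def break_even_volume(setup_a: float, per_doc_a: float, setup_b: float, per_doc_b: float) -> float:
    """Volume at which option B (higher setup, lower per-document cost) matches option A."""
    if per_doc_a <= per_doc_b or setup_b <= setup_a:
        raise ValueError("option B must have a higher setup cost and a lower per-document cost")
    return (setup_b - setup_a) / (per_doc_a - per_doc_b)


# Illustrative numbers only: A is ready-made, B needs engineering work up front.
volume = break_even_volume(setup_a=0, per_doc_a=0.50, setup_b=10_000, per_doc_b=0.25)
print(volume)  # 40000.0
```

Below the break-even volume the ready-made option is cheaper overall; above it, the up-front investment pays off.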

Featured Providers

Here is a list of providers that you can use to automate mortgage document classification. It is not an exhaustive list; these are the ones that, in my opinion, are the most relevant for mortgage documents.

Low-level:

💡 Low-level solutions are the ones that need the most engineering involvement to make them work for mortgage documents. But they tend to have the lowest cost per document.

Mid-level:

💡 Mid-level solutions are usually built on top of one or multiple low-level solutions and remove some complexity in implementation. Most come with pre-trained models relevant to the mortgage industry and have upstream and downstream integrations with popular mortgage software.

Specialised:

💡 Specialised solutions are usually built on top of mid-level solutions. They take them further by providing out-of-the-box automation using the data they extract.

What’s next?

I hope this post helped you get an insight into how to use OCR & AI to automate mortgage document classification.

If you’d like to stay on top of the latest mortgage tech and how it can be applied to mortgage operations, consider joining our mortgage technology newsletter.