Stories by Jon Tang on Medium

Convolutional Neural Nets for Rapid Recognition of Mortgage Docs

Jon Tang — Fri, 21 Sep 2018 18:54:00 GMT

At Snapdocs, we help mortgage professionals close over 50,000 mortgages a month all across the U.S. The vast majority of these mortgage closings go smoothly, but sometimes problems come up at the signing table. From our analysis, we found that a large percentage of signing-table disputes happen because of confusion over the loan terms. For example, the cash-to-close or interest rates are not what the borrower expects. At Snapdocs, we want every closing to be perfect and error-free, so we started looking into ways to prevent this problem.

In every closing package (the full set of mortgage documents), one of the most important pieces of paperwork is the Closing Disclosure, a five-page form that provides the final details about the mortgage loan. A sample of the first page of a Closing Disclosure is shown below.

A sample page 1 of the 5-page Closing Disclosure document. Source: https://www.consumerfinance.gov/owning-a-home/closing-disclosure/ .

As you can see from this sample page, the Closing Disclosure summarizes the key terms of the mortgage. However, this form is buried somewhere within the closing package, which is often a 200+ page PDF document. We reasoned that quickly identifying the Closing Disclosure pages and displaying this information could resolve some of these signing-table disputes.

Challenge 1: Data Pre-processing Speed

We first attempted to build a simple text-based machine learning model to identify the first, second, and fifth pages of the Closing Disclosures, which contain the most relevant data. A text-based classifier seemed like an obvious choice because these pages of the Closing Disclosure contain very distinct keywords.

We quickly realized, however, that the pre-processing required to generate this data in production took too long. You first have to convert every page of a closing package from PDF into a high-resolution image. Then the images containing text need to be converted into machine-encoded text using Optical Character Recognition (OCR). The entire process could take several minutes for a 200+ page document, which is too slow for some of our time-sensitive features. Additionally, several minutes per package wouldn’t scale to our current volume.

Another downside is that many of our closing documents are scanned rather than electronically generated, and are therefore often marred by ink spots and visual noise. Performing OCR on these scans often yields text that is too unreliable for accurate classification.

We stumbled upon an interesting idea after skimming through preview images of closing package pages. As you can see below, Closing Disclosure pages use distinctly formatted tables that make them visually distinguishable from non-Closing Disclosure pages. Very small, low-resolution images are sufficient for a human to categorize them at a glance.

Examples of low resolution images of first, second, and fifth pages of the Closing Disclosure, and other non-Closing Disclosure pages.

Our Approach and Results

This motivated us to try using an image-based approach. One advantage is that the only required pre-processing step is the conversion of the PDF into JPEGs. Additionally, we don’t need to generate the images at the high level of resolution (300 dpi recommended) required for accurate OCR. We found that 70 dpi resolution images were good enough, reducing the total pre-processing time from 2 seconds to 20 milliseconds per page.

That’s a drop in total pre-processing time of roughly 100-fold!

A comparison of the data pre-processing steps for using a text-based classifier versus an image-based classifier.
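To put the per-page numbers in context, here is a back-of-the-envelope calculation for a whole closing package. It is only a sketch: it uses the rounded per-page figures quoted above, and it assumes a 200-page package processed one page at a time (the constant and function names are made up for illustration).

```python
# Rounded per-page pre-processing times quoted in this post (seconds).
TEXT_PIPELINE_S_PER_PAGE = 2.0     # 300 dpi render + OCR
IMAGE_PIPELINE_S_PER_PAGE = 0.020  # 70 dpi render only


def package_time_s(pages: int, seconds_per_page: float) -> float:
    """Total pre-processing time for one package, assuming pages run sequentially."""
    return pages * seconds_per_page


pages = 200  # assumption: the post describes packages as "200+ pages"
text_total = package_time_s(pages, TEXT_PIPELINE_S_PER_PAGE)
image_total = package_time_s(pages, IMAGE_PIPELINE_S_PER_PAGE)
speedup = TEXT_PIPELINE_S_PER_PAGE / IMAGE_PIPELINE_S_PER_PAGE

print(f"text-based:  {text_total:.0f} s (~{text_total / 60:.1f} min)")
print(f"image-based: {image_total:.0f} s")
print(f"speedup:     {speedup:.0f}x")
```

Under these assumptions the text-based pipeline lands in the "several minutes" range per package, while the image-based one finishes in a few seconds.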

Challenge 2: Lack of Labeled Data

When we started this project, we had identified fewer than 300 Closing Disclosures from our entire database of closing packages. We knew that creating any reliable classifier from such a small dataset would be extremely challenging.

Our Approach and Results

Rather than go through our collection of closing packages and manually label more Closing Disclosure pages, we thought this problem could benefit from a technique called data augmentation. Data augmentation works by taking your existing dataset and making minor alterations to create additional labeled data. In our case, as shown below, we used each page in our original dataset as a template to create additional example images that contained the slight rotations and translations we often find in our scanned images of closing pages. Because pages are often scanned upside down, we also included examples with full 180-degree rotations.

We augmented our data by taking each page in our original dataset (in yellow) to create additional example images that contained slight rotations and translations (in blue) that we would often find in some of our scanned images of closing pages.
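The idea can be sketched in a few lines of plain Python. This is not our production code: the function names, the nearest-neighbour rotation, and the default ranges (up to 3 degrees of rotation, up to 2 pixels of shift, a 50% chance of an upside-down flip) are all assumptions chosen for illustration. A page is represented as a grid of grayscale pixel values.

```python
import math
import random


def translate(img, dx, dy, fill=255):
    """Shift a page right by dx and down by dy pixels, padding with white."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx, sy = x - dx, y - dy
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = img[sy][sx]
    return out


def rotate(img, degrees, fill=255):
    """Rotate a page about its center using nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    cx, cy = (w - 1) / 2, (h - 1) / 2
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: find the source pixel for each output pixel.
            rx, ry = x - cx, y - cy
            sx = round(cos_t * rx + sin_t * ry + cx)
            sy = round(-sin_t * rx + cos_t * ry + cy)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = img[sy][sx]
    return out


def augment(img, n, rng, max_angle=3.0, max_shift=2):
    """Create n labeled variants of one template page."""
    variants = []
    for _ in range(n):
        angle = rng.uniform(-max_angle, max_angle)
        if rng.random() < 0.5:
            angle += 180.0  # simulate a page scanned upside down
        dx = rng.randint(-max_shift, max_shift)
        dy = rng.randint(-max_shift, max_shift)
        variants.append(translate(rotate(img, angle), dx, dy))
    return variants


page = [[0, 1, 2], [3, 4, 5]]  # toy 3x2 "page"
examples = augment(page, n=5, rng=random.Random(0))
```

Every variant inherits the label of its template page, which is what lets a few hundred labeled pages grow into a much larger training set.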

Using this technique, we were able to create a dataset consisting of over 250,000 images of closing package pages, with more than 10,000 of them being the first, second, or fifth pages of Closing Disclosures. Additionally, this gave our dataset broader coverage of the types of scanned mortgage documents we see in the real world.

Machine Learning Model

We tested several different machine learning models and hyperparameter sets in TensorFlow until we settled on the following convolutional neural network architecture. In production, the model has a micro-average F1 score of 99% and takes a couple of seconds on average to identify the Closing Disclosure pages in an entire closing document.

Our final convolutional neural network model architecture after testing other machine learning algorithms and hyperparameter tuning.
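For readers unfamiliar with the metric, a micro-average F1 score pools true positives, false positives, and false negatives across classes before computing F1. The sketch below shows the calculation on made-up predictions; the label names (`cd1`, `cd2`, `cd5`, `other`) and the choice of which labels to pool are assumptions for illustration, not a description of our evaluation code.

```python
def micro_f1(y_true, y_pred, labels):
    """Micro-averaged F1: pool TP/FP/FN over the given labels, then compute F1."""
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        for label in labels:
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical page labels: pages 1, 2, 5 of a Closing Disclosure, or anything else.
y_true = ["cd1", "cd2", "cd5", "other", "other", "other"]
y_pred = ["cd1", "cd2", "other", "other", "other", "cd5"]

score = micro_f1(y_true, y_pred, labels=["cd1", "cd2", "cd5", "other"])
print(f"micro-average F1: {score:.3f}")
```

When every page gets exactly one label and all labels are pooled, as here, the micro-average F1 equals plain accuracy.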

Summary

While a text-based classifier seems like the most obvious choice for a document classification problem, our unique constraints challenged us to find a more radical approach. Because we required near-instantaneous document classification, we chose to use an image-based classifier: its data pre-processing steps were roughly 100x quicker in production.

Machine learning, computer vision, and automation are continuing to play a large role at Snapdocs as we strive towards our company goal of bringing greater efficiency, accuracy, and joy to mortgage closings. Aside from this Closing Disclosure classification model, we have developed other models to help support fast hybrid signings. For example, we use machine learning models to quickly and accurately identify pages from mortgage documents for e-signing, wet-signing, or previewing. We use computer vision to automate the task of identifying signature lines on pages that require digital signatures from the consumer. In the future, we also anticipate features such as creating a ‘Table of Contents’ for consumers to see each of their docs page-by-page. This will also allow lenders and settlement agents to swap out individual documents, instead of entire packages when there’s an error.

Come Join Our Team. We’re Hiring!

Find our work interesting? Want to help us tackle the massive mortgage closing space? We’re hiring. Come join our team!

Convolutional Neural Nets for Rapid Recognition of Mortgage Docs was originally published in Snapdocs Product & Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Understanding Our User Pain Points with Text Data Mining

Jon Tang — Sat, 07 Oct 2017 03:45:39 GMT

We used natural language processing (NLP) and analytics to learn the most common reasons signing agents fail to meet expectations during signings.

At Snapdocs, we’re trying to simplify and bring transparency to mortgage closings. Check out our previous post on how our online platform provides a centralized gathering point for all parties involved in the closing.

The Frustrations of Mortgage Closings

The final signing on a property is one of the last steps in purchasing real estate and should be an exciting event for a home buyer. However, there are numerous opportunities for complication that can throw the final signing off track. From deal-breaking issues like failing to sign critical disclosure forms to minor irritations such as having a signing agent (a notary) show up late to a signing, many things can lead to bad experiences and frustrations for Escrow/Title Officers and their customers, the home buyers.

At Snapdocs, we strive to make mortgage closings run as smoothly as possible. One way that we do this is by surfacing the best qualified notaries for a signing. We are constantly collecting feedback on notary performance. Just like Uber wants to know if your driver had a clean car or drove aggressively, we want to know if your agent dressed professionally or was late to the signing.

Screenshot of Uber’s interface for collecting feedback on their drivers.

As data scientists at Snapdocs, we wanted to leverage this feedback data to gain insight into how well our notaries have met the expectations of Escrow/Title Officers. In this blog post, I’ll first show you how we analyzed our data to determine how often mortgage professionals have unsatisfactory experiences with notaries during signings. I’ll then show you an approach we took using natural language processing (NLP) and analytics to learn the most common reasons notaries failed to meet expectations. Finally, I’ll conclude with how we’re incorporating this information back into our product development lifecycle to improve the overall user experience.

Gathering Data on Notary Performance

On the Snapdocs platform, we enable companies to easily search a database of 65,000+ notaries, see statistics on a particular notary’s past performance, and choose the top ranked, available notary for a signing. After a signing has completed, an Escrow/Title Officer can rate a notary’s performance as positive, neutral, or negative and then include an optional detailed comment.

Screenshot of the Snapdocs interface for collecting feedback on notaries after a signing.

I looked at over 220K orders from the period of March 2017 to mid-August 2017 and found that an overwhelming 99% of signings received positive feedback ratings, 0.2% received neutral feedback ratings, and only 0.8% received negative feedback ratings! Our notaries are doing exceptionally well at meeting expectations.*

*We also found that about 40% of the signing events that received a negative rating were carried out by notaries that received a negative rating on at least one other signing. Perhaps these notaries are repeating their mistakes. In any case, our hope is that we may be able to prevent bad experiences like these in the future.

Topic Modeling of Negative Feedback Text

While negative feedback on the Snapdocs platform is rare, I decided to delve more deeply into the reasons why our customers were having these unsatisfactory experiences. Looking specifically at the text data from negative feedbacks, I sought to identify common themes using NLP techniques. I used one of the simplest approaches for topic modeling, K-means analysis, to group feedback into clusters based on the similarity of words used.

To do this, I took a standard approach to pre-processing text data. I first used the NLTK package to remove non-alphanumeric characters and tokenize the feedback text, breaking it down into its constituent words. I then filtered out stop words, which are commonly used words with presumably little meaning (e.g. ‘the’, ‘is’, ‘are’). The remaining words were then stemmed using the Porter Stemmer to reduce them to their basic form (e.g. “update”, “updates”, “updated”, and “updating” all reduce to “updat”) to make overlapping words easier to identify. The collection of processed words was then converted to a TF-IDF vector of weighted term frequencies so that each feedback text could be represented as a mutually comparable vector for the k-means algorithm.

The steps we took for data pre-processing and conversion of feedback text data into TF-IDF vectors for K-means analysis.
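The pipeline above can be sketched with the standard library alone. To keep the example dependency-free, it swaps in a tiny stop-word list and a crude suffix stripper in place of NLTK's stop words and the Porter Stemmer, so its stems will not always match Porter's. The TF-IDF weighting (smoothed IDF, L2-normalized) is one common variant and is an assumption, as are all function names.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list, not NLTK's.
STOP_WORDS = {"the", "is", "are", "was", "were", "to", "a", "and", "of", "for"}


def crude_stem(word):
    """Strip a few common suffixes. A simplified stand-in for the Porter Stemmer."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, keep alphanumeric tokens, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(documents):
    """Return one L2-normalized {term: weight} vector per document."""
    tokenized = [preprocess(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


# Made-up feedback comments for illustration.
feedback = [
    "The notary was late to the signing.",
    "Notary arrived late!",
    "Documents were missing pages.",
]
vectors = tfidf_vectors(feedback)
```

Terms that appear in many comments (here the stem of "late") receive a lower weight than terms unique to one comment, which is what makes the vectors useful for separating topics.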

For k-means analysis, a number of clusters or topics, k, must be specified. Choosing the best value of k can be more of an art than science, often with heuristics being used. I decided to begin with a large and diverse set of topics to work with that I could then later prune, so I chose an initial value of 35 clusters. After analyzing the results and unifying smaller similar clusters together, I settled on 30 final clusters. Each cluster was then manually labeled after taking into account the most common words in the cluster and reading example feedback content from the cluster. For example, from the word cloud (showing the most common words) and the example feedbacks for the first cluster shown below, we concluded that this cluster of feedbacks is about the notary being late to the signing.

Word cloud showing the most common words from the first cluster identified in our analysis.

Example feedback text from the first cluster identified in our analysis.
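For reference, the k-means loop itself is short. The sketch below is a plain-Python version of the standard algorithm (Lloyd's iterations) run on toy 2-D points instead of TF-IDF vectors. The random initialization from the data points, the fixed iteration count, and the function names are assumptions, not the exact setup used in this analysis.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Cluster points into k groups; returns (assignments, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Toy data: two well-separated groups standing in for two feedback topics.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignments, centroids = kmeans(points, k=2)
print(assignments)
```

In the analysis described above, the inputs would be the TF-IDF vectors and k would start at 35; the manual labeling step happens after the clusters come back.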

Not All Negative Feedbacks Are Equal

On our platform, if a negative rating is given to a notary after a signing, the Escrow/Title Officer also has the option to deactivate the notary so that the notary no longer appears in that officer’s future notary searches. To understand which clusters of negative feedbacks were merely inconvenient versus those that were deal-destroying, we looked at which feedbacks led to notary deactivations.

We identified 3 major collections of topics: (1) Errors, (2) Quality and promptness of documents, faxes, and scans sent back to the Escrow/Title Officer after the signing event, and (3) Notary personal and customer facing attributes. The number of negative feedbacks that did and did not result in deactivation of the notary are shown for each of the 30 final topics identified in our analysis.

As you can see from the first collection of topics (in pink) in the chart above, making a mistake during the signing (an “Error”) is usually not enough to deactivate a notary. While minor errors make up the biggest group of negative feedbacks, only 18% result in deactivation of the notary. Talking to some of our customers in user interviews, we learned that general errors are usually correctable: you can eSign an addendum or even have the notary go back with the single page needing a correction.

We found a second collection of topics (in orange) that related to the quality and promptness of documents, faxes, and scans being sent back to the Escrow/Title Officer after the signing event. These were generally rarer, but had relatively higher deactivation rates (up to 39% for having documents missing, “Doc/Fax/Scan — Missing”). Loan closings have a very tight timeline, so even a one-day delay by a slow notary can put a deal at risk.

Most interestingly, the last collection of topics (in green) has deactivation rates as high as 71% and appears to include feedbacks from the worst signing experiences. Some topics in this group involve notaries who either presented themselves poorly (“Rude”, “Unprofessional”) or put the signing in jeopardy (“No Show”, “Changed Time”, “Late Cancellation”).
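The deactivation rates quoted above are simple per-topic ratios. Here is a minimal sketch of the calculation, assuming each negative feedback has already been assigned a topic label and a deactivation flag. The records are invented for illustration and do not reproduce our numbers.

```python
from collections import defaultdict


def deactivation_rates(records):
    """Map each topic to the share of its negative feedbacks that led to deactivation."""
    counts = defaultdict(lambda: [0, 0])  # topic -> [deactivated, total]
    for topic, deactivated in records:
        counts[topic][0] += int(deactivated)
        counts[topic][1] += 1
    return {topic: d / n for topic, (d, n) in counts.items()}


# Hypothetical (topic, was_notary_deactivated) pairs.
records = [
    ("Minor Error", False),
    ("Minor Error", False),
    ("Minor Error", False),
    ("Minor Error", False),
    ("Minor Error", True),
    ("No Show", True),
    ("No Show", True),
    ("No Show", False),
]

rates = deactivation_rates(records)
for topic, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{topic}: {rate:.0%}")
```

Sorting topics by this rate is what separates the forgivable categories from the deal-destroying ones.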

Our main takeaway from this analysis is that mistakes happen and are forgivable, but people seem to have very low tolerance for jerky behavior.

Impacting Product Development

This sort of analysis helps Snapdocs in two ways. First, we can share these learnings and specific feedback with notaries directly. We’re evaluating the best place to incorporate these results in notary training, early warnings to notaries via text/email, and category-specific scoring on our platform. We want notaries to understand what their customers care about and how they can best be successful. We also want notaries to know when they are underperforming, so that we can guide them to take the necessary steps to continue doing business on the Snapdocs platform.

Second, we now have real data about problems to solve and ideas for features we could build. We’ve started building a feature we’re calling the “Signing Checklist”, which should improve the likelihood that notaries walk away from the signing event confident they did a good job. We’ll also be building better scheduling and communication tools so that everyone (notary, escrow agent, buyer) is on the same page and there are no surprises.

While the engineers embark on the development process to incorporate our learnings from this analysis of negative feedbacks, we’re going to tackle the next data challenge: positive feedbacks! What sort of things do notaries get praised for? Which categories are highly associated with becoming a “favorite” of a company? What does “above and beyond” mean with respect to notary performance at the signing table?

Come Join Our Team. We’re Hiring!

Find our work interesting? Think you might have an alternative approach? We’re hiring. Come join our team!

Understanding Our User Pain Points with Text Data Mining was originally published in Snapdocs Product & Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.