Building a Financial Document Understanding Platform

Joy Rimchala
Published in Intuit Engineering · 11 min read · Nov 15, 2019

How Intuit is deploying AI to eliminate data entry from financial documents

This blog post is co-authored by Amar Mattey, Joy Rimchala, TJ Torres and Xiao Xiao from the Document Understanding Platform.

Why Document Understanding?

Technology for accessing the data embedded in documents has become one of the most sought-after capabilities in sectors such as financial services, real estate, insurance, government agencies, and healthcare institutions. These industries all share a central challenge: how can we automate document processing to extract the structured information documents contain?

The IRS alone deals with approximately 40 million paper forms per filing season¹, and every year Intuit’s customers upload millions of financial documents to our global financial platform using TurboTax, QuickBooks, and Turbo for the purposes of tax filing, payroll processing, and tracking business expenses. These documents include tax forms (1040, W-2, 1098, etc.), statements, bills, receipts, and payroll records. In each case, customers are faced with the painful task of manually entering the data from these documents into structured forms for better digital processing, which adds thousands of hours of collective work.

To solve this problem, at Intuit we have built a suite of technologies we call the Document Understanding Platform (DUP), geared toward information extraction across the wide range of documents uploaded to the Intuit ecosystem. The platform encompasses capabilities to classify documents, extract key entity information, dynamically query users for corrections, and ingest those corrections to enable continual learning. In this blog post we’ll cover the different capabilities of the platform in detail and how we are leveraging them collectively to help eliminate the tedium of data entry while improving data accuracy.

Document Understanding Platform in a Nutshell

Key steps in a document understanding workflow are:

1) Identifying the broad types of documents that need to be digitized

2) Understanding what types of information the documents contain

3) Accurately extracting and enriching important pieces of document information

4) Creating a searchable document organization structure to make the data accessible, secure, and easy to use

To achieve this, we’ve combined the strength of general document understanding AI/ML techniques (computer vision, natural language understanding, and machine learning) with financial domain knowledge.

Document Understanding Platform Components

1. Document Classification

Classification is the first step in the pipeline towards document understanding. Because different financial documents and tax forms often contain different information, being able to identify the category of a document can facilitate downstream extraction services by shedding light on what information to look for. This capability also benefits individual tax filers, by informing them which tax forms are available, as well as accountants and financial professionals, who can use it to identify pages with useful information among large volumes of documents.

Our goal is to create a user experience powered by machine learning where each page of users’ financial documents can be classified in real time with high accuracy. In machine learning this is a multi-class classification problem, with the classes being the document types we want to target. The majority of Intuit’s financial documents can be categorized into a well-defined set of dominant classes, while the rarer document types form a much smaller but longer tail. We lump these remaining documents into a single “other” class; they often don’t contain useful information for tax filing or bookkeeping purposes.

Figure 1: The sequential transfer learning process for state-of-the-art CNN models (AlexNet, VGG-16, ResNeXt) pretrained on ImageNet. The networks are first fine-tuned on a large publicly available dataset (RVL-CDIP), then fine-tuned again on Intuit’s documents.

Our modeling approach leverages the power of convolutional neural networks (CNNs). While these models were originally created for and mainly applied to natural images, recent studies have successfully demonstrated their potential for document classification² ³. Specifically, classification of documents greatly benefits from the technique of transfer learning, where CNN models are pre-trained on one or more large datasets before being fine-tuned on a target dataset. Models with pre-training achieve higher performance on the end task than their counterparts trained directly on the target, especially when the dataset used for pre-training is similar to the target (e.g., using a larger dataset of documents for pre-training if the end task is to classify documents in a smaller dataset)².

We implemented learnings from the literature² and pre-trained our CNN models twice, first using ImageNet, then using RVL-CDIP (a dataset of 400,000 single-page grey-scale documents evenly divided into 16 classes). The models were then trained again to classify Intuit’s financial documents, which were separated by page (from multi-page PDF documents), resized, grey-scaled, and normalized prior to training.
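To make the staged fine-tuning concrete, here is a minimal sketch in PyTorch/torchvision of the final stage, starting from an ImageNet-pretrained ResNeXt backbone. The checkpoint path, class count, and hyperparameters are illustrative assumptions, not our production setup.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_DOC_CLASSES = 12  # hypothetical: W-2, 1098, 1099-R, ..., plus an "other" class

# Stage 1: start from an ImageNet-pretrained backbone.
model = models.resnext50_32x4d(pretrained=True)

# Stage 2: in practice the model would first be fine-tuned on RVL-CDIP
# (16 classes) and those weights loaded here, e.g.:
# model.load_state_dict(torch.load("resnext50_rvl_cdip.pt"))

# Stage 3: replace the classification head for the financial-document classes.
model.fc = nn.Linear(model.fc.in_features, NUM_DOC_CLASSES)

# Optionally freeze the early layers so only higher-level features adapt.
for name, param in model.named_parameters():
    if not name.startswith(("layer4", "fc")):
        param.requires_grad = False

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One fine-tuning step on grey-scaled, resized, normalized page images."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```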

Despite the visual similarity between some of the financial documents (e.g., Form 1098 and Form 1099-R), the CNN models did a great job separating the classes. The validation accuracy across more than ten different classes is >90%, while end-to-end processing time is just over one second for the vast majority of the documents.

2. Layout Analysis

Information about documents’ layout structure not only provides more enriched metadata that can be used for organizing and indexing the information, but can also be used to enhance downstream ML-based information extraction tasks. The majority of financial documents contain a few broad types of layout structures: tables, forms, and free text.

The goal of layout analysis is to detect and annotate the presence of these structures on each page of the document using table detection, form detection, and free-text clustering algorithms. Once the documents have been annotated with these structures, the information can be extracted using structure-specific information extraction methods.

The structure detection can be done using visual inputs (image pixels) or textual inputs (strings of optical character recognition (OCR) characters with bounding boxes). For the former, we feed the visual inputs to a model that returns annotations of tables and cells. Within tabular structures, information is organized by rows and columns, and the key understanding can be derived by inferring the relationship between table cells and their corresponding header rows and columns.

Figure 2.1: The key task for layout analysis in a tabular structure is to associate the content in the table cells with the corresponding column and row headers, which specify the type of information contained within. For documents with simple tabular structures, this task can be achieved by inferring row and column orientation by table lines (if present) and text coordinates, font type, size, and styles.
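For documents where table lines are absent or unreliable, the same association can be approximated with bounding-box geometry alone. The following is a simplified illustration, with a hypothetical Cell structure and overlap heuristic, not the platform’s actual table parser.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Cell:
    text: str
    x0: float  # left
    y0: float  # top
    x1: float  # right
    y1: float  # bottom

def horizontal_overlap(a: Cell, b: Cell) -> float:
    """Width of the horizontal overlap between two cell bounding boxes."""
    return max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))

def column_header_for(cell: Cell, header_cells: List[Cell]) -> Optional[Cell]:
    """Pick the header cell above `cell` with the largest horizontal overlap."""
    above = [h for h in header_cells if h.y1 <= cell.y0]
    if not above:
        return None
    return max(above, key=lambda h: horizontal_overlap(cell, h))

# Repeating the same idea with vertical overlap against the leftmost column
# yields the row header, so each body cell maps to a (row, column) pair.
```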

For forms, the information is organized as key-value pairs. The key task in layout analysis is to identify the form fields and the unique key-value pairs associated with them.

Figure 2.2: The key layout analysis task for form structure is to locate important entities in the form. The majority of forms contain form fields that follow a key-value pair structure with a one-to-one correspondence between field title text (key) and field content text (value). The goal of key-value association is to aggregate the string of OCRed characters into key and value parts, and then pair the keys and the values together.
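As a rough illustration of key-value association, one can greedily pair each detected value with the nearest candidate key above it or to its left. The TextLine structure and distance heuristic below are assumptions for the sketch, not the production algorithm.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TextLine:
    text: str
    x: float  # left edge of the bounding box
    y: float  # top edge of the bounding box

def pair_keys_and_values(keys: List[TextLine], values: List[TextLine]) -> Dict[str, str]:
    """Pair each value with the closest key located above it or to its left."""
    pairs: Dict[str, str] = {}
    for value in values:
        def score(key: TextLine) -> float:
            dx, dy = value.x - key.x, value.y - key.y
            # Heavily penalize keys that sit below or to the right of the value.
            penalty = 1e6 if (dx < 0 or dy < 0) else 0.0
            return dx + 3.0 * dy + penalty  # weight vertical distance more
        best_key = min(keys, key=score)
        pairs[best_key.text] = value.text
    return pairs

# Example:
# pair_keys_and_values(
#     keys=[TextLine("Employer ID (EIN)", 40, 120), TextLine("Wages", 40, 160)],
#     values=[TextLine("12-3456789", 220, 120), TextLine("52,000.00", 220, 160)])
# -> {"Employer ID (EIN)": "12-3456789", "Wages": "52,000.00"}
```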

For free-form text, the underlying assumption is that there’s an inherent structure, even when the information is not in a table or form. Using character coordinates, font size, type, and styles, the OCRed outputs of free-form text can be clustered into blocks, paragraphs, lines, words, etc.
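A minimal sketch of the first step of this clustering, grouping OCR words into text lines by vertical proximity, might look like the following; the Word tuple layout and tolerance value are illustrative assumptions.

```python
from typing import List, Tuple

# (text, x0, y0, x1, y1) for each OCR word, as assumed for this sketch
Word = Tuple[str, float, float, float, float]

def group_into_lines(words: List[Word], y_tolerance: float = 5.0) -> List[str]:
    """Cluster words whose vertical centers are within y_tolerance into lines."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: ((w[2] + w[4]) / 2, w[1])):
        center = (word[2] + word[4]) / 2
        if lines:
            anchor = lines[-1][0]
            anchor_center = (anchor[2] + anchor[4]) / 2
            if abs(center - anchor_center) <= y_tolerance:
                lines[-1].append(word)
                continue
        lines.append([word])
    # Order words left to right within each line and join them into strings.
    return [" ".join(w[0] for w in sorted(line, key=lambda w: w[1])) for line in lines]
```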

3. Information Extraction

To extract specific pieces of information from documents, we first use OCR to turn raw documents into an array of characters with bounding boxes. Combining the information from the layout analysis step with the OCR raw output, we can rearrange and organize the string of characters into words, lines of text, paragraphs, and blocks.

Figure 3.1: For a receipt, the key pieces of information are vendor, amount, date, and the last four digits of the credit card number. These pieces of information are essential for transaction matching and expense reporting in QuickBooks.

The next key task is to locate and accurately extract key pieces of information from the documents. This task is similar to a well-known subtask in natural language understanding (NLU): named entity recognition (NER). However, our financial information extraction is distinct from traditional NER in two main aspects: 1) financial document understanding seeks to extract unique named entities, which are determined by how they will be used in downstream document management and are therefore document-specific (Figures 3.1 and 3.2), and 2) each document type has a specific, fixed set of entities that can appear in it.

Figure 3.2: Key pieces of information for tax forms are all of the entities in the form fields that need to be used as part of a tax preparation for filing, including tax amount, employer identification number (EIN), taxpayer identification number (TIN), and various tax detection codes and amounts.

Casting financial document information extraction as a NER problem opens the door to applying NLP approaches. Our financial entity extraction approach leverages parameter-efficient transfer learning⁸ on top of a state-of-the-art NLP model called BERT (Bidirectional Encoder Representations from Transformers)⁹. To adapt the BERT base model, we used the pre-trained SentencePiece tokenizer¹⁰ and built a custom output token classification head on top of the pre-trained BERT base. The outputs of BERT are token (subword)-level predictions and confidence scores, which are aggregated to achieve entity-level extraction. For tax information extraction, our fine-tuned BERT model achieves ~93% overall accuracy across all token classes.

Figure 3.3: State-of-the-art NLP models such as BERT can be applied to information extraction in a document understanding pipeline. In this example, the stream of OCRed text is broken down into subword tokens using the pretrained SentencePiece tokenizer. Each of these tokens is associated with a corresponding latent token class. During training, the sequence of token-label pairs is fed into the BERT model, which learns to associate the contextual embeddings of tokens with the labels.
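A minimal sketch of this token classification setup, using the open-source Hugging Face transformers library as a stand-in for our custom classification head, is shown below. The label set, base checkpoint, and aggregation logic are illustrative assumptions.

```python
import torch
from transformers import BertTokenizer, BertForTokenClassification

LABELS = ["O", "B-EIN", "I-EIN", "B-WAGES", "I-WAGES"]  # hypothetical tag set

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)

def extract_entities(ocr_text: str):
    """Predict a label per subword token, then merge contiguous tagged spans."""
    encoded = tokenizer(ocr_text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**encoded).logits            # (1, seq_len, num_labels)
    label_ids = logits.argmax(dim=-1)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])

    entities, span_label, span_tokens = [], None, []
    for token, label_id in zip(tokens, label_ids):
        label = LABELS[label_id]
        if label.startswith("I-") and span_label == label[2:]:
            span_tokens.append(token)               # continue the current span
            continue
        if span_tokens:                             # close the previous span
            entities.append((span_label, tokenizer.convert_tokens_to_string(span_tokens)))
            span_label, span_tokens = None, []
        if label.startswith("B-"):                  # open a new span
            span_label, span_tokens = label[2:], [token]
    if span_tokens:
        entities.append((span_label, tokenizer.convert_tokens_to_string(span_tokens)))
    return entities                                 # e.g. [("EIN", "12-3456789")]
```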

Unique entities in financial documents have relationships to real-world knowledge, financial regulations, or tax laws. For example: 1) date data can contain only 12 possible months and 28–31 possible days, 2) out of all 100,000 possible 5-digit numeric strings, only 41,702 of those are valid US zip codes, 3) out of all 676 possible two-letter strings, only 69 are valid state abbreviation codes. We apply this kind of domain knowledge to constrain the information extraction problem, and post-process the extraction results.
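As a simplified illustration of such post-processing, extracted values can be checked against reference lists and calendar rules before being accepted. The entity names and validators below are assumptions for the sketch, and the reference sets would be loaded from authoritative sources rather than hard-coded.

```python
import re
from datetime import datetime

# Reference sets, shown truncated here; in practice loaded from authoritative data.
VALID_STATE_CODES = {"AL", "AK", "AZ", "AR", "CA", "CO", "CT"}  # ... 69 codes in total
VALID_ZIP_CODES = {"94043", "10001"}                            # ... 41,702 codes in total

def validate_state(value: str) -> bool:
    return value.strip().upper() in VALID_STATE_CODES

def validate_zip(value: str) -> bool:
    digits = value.strip()
    return bool(re.fullmatch(r"\d{5}", digits)) and digits in VALID_ZIP_CODES

def validate_date(value: str) -> bool:
    """Accept only dates that exist on a real calendar (valid month and day)."""
    for fmt in ("%m/%d/%Y", "%m-%d-%Y", "%Y-%m-%d"):
        try:
            datetime.strptime(value.strip(), fmt)
            return True
        except ValueError:
            continue
    return False

VALIDATORS = {"state": validate_state, "zip": validate_zip, "date": validate_date}

def post_process(entity_type: str, value: str):
    """Reject extracted values that violate domain constraints."""
    check = VALIDATORS.get(entity_type)
    if check is not None and not check(value):
        return None  # rejected; could instead be routed to the user for confirmation
    return value
```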

4. Feedback Loop from User Interaction

Machine learning systems are never 100% accurate, especially under evolving learning task conditions. User correction is an important source of external information that helps reveal situations where document processing components are inaccurate, as well as opportunities to improve the systems.

User interaction session data are captured as event streams. For Intuit’s products, document-related events constitute a small fraction of all the events in the user interaction data. Searching for such events in the user interaction event stream is like looking for a needle in a haystack, making data gathering for ML difficult. Making event data useful for document understanding tasks requires converting it to a document-oriented format. The key challenge in using the user interaction data to improve ML is mapping the session-based user interaction events to the document-oriented data store (which we call the Financial Document Platform, or FDP).

Figure 4: A simplified schematic of a feedback processing pipeline for capturing user feedback from a user interaction event stream into the document-oriented data store, and computing the document-level comparison. The document-level comparison result is published to a downstream event processing pipeline and stored in a relational database.

The form-generic feedback loop code base is modular by design, comprising three main components:

  • A pluggable, expressive filtering module that captures only document-processing-related events, such as classification or extraction
  • An event data transformation module, which turns raw event data into a common key-value format for comparison
  • A comparison module, which computes the pairwise field-level difference between values in a pair of documents in the common key-value format and summarizes the differences for evaluating aggregate ML model performance (a simplified sketch follows below)
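The following is a simplified sketch of that comparison step: given the model’s extracted fields and the user-corrected fields in a common key-value format, it emits one record per changed field. The field names and output schema are illustrative assumptions.

```python
from typing import Dict, List

def field_level_diff(extracted: Dict[str, str], corrected: Dict[str, str]) -> List[Dict[str, str]]:
    """Return one record per field whose value was added, removed, or modified."""
    diffs = []
    for field in sorted(set(extracted) | set(corrected)):
        before, after = extracted.get(field), corrected.get(field)
        if before != after:
            diffs.append({
                "field": field,
                "extracted_value": before,
                "corrected_value": after,
                "change_type": "added" if before is None
                               else "removed" if after is None
                               else "modified",
            })
    return diffs

# Example: a user corrects the wages box on a W-2.
# field_level_diff({"ein": "12-3456789", "wages": "52000.00"},
#                  {"ein": "12-3456789", "wages": "52,000.00"})
# -> [{"field": "wages", "extracted_value": "52000.00",
#      "corrected_value": "52,000.00", "change_type": "modified"}]
```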

The benefits of implementing the feedback loop are two-fold. First, user feedback data is captured and organized in a document-oriented format upon the arrival of the feedback event, bypassing the need for historical data processing. Second, the relational format of the feedback data opens the door to near real-time model evaluation, as well as to seamlessly gathering specific data that addresses existing model performance gaps.

Closing thoughts

An important step toward powering prosperity for Intuit’s customers is “Never enter data.” In financial services, many important pieces of information are captured in financial documents. In this blog post, we describe the technologies inside Intuit’s Document Understanding Platform to unlock the information inside financial documents and make them available, useful, and searchable for our customers in a secure manner. We achieve this by using state-of-the-art machine learning models, automation, and unique financial domain knowledge to classify and extract pieces of information, making them available for downstream tasks, including bookkeeping, expense reporting, and tax preparation. We also implement a feedback loop to capture document-related user interaction data, which is used to continually retrain our classification and extraction models. All of these technologies are inside Intuit’s products today!

About the Authors:

Amar Mattey: Amar is a principal software engineer at Intuit and a leading member of Intuit’s Document Understanding Platform (DUP).

Joy Rimchala: Joy is a data scientist in Intuit’s Machine Learning Futures Group working on machine learning problems in limited-label data settings to help groups around Intuit get good performance out of ML systems more quickly and in a data-efficient manner. Joy holds a PhD from MIT, where she spent five years doing biological object tracking experiments and modeling them using Markov decision processes.

TJ Torres: TJ is a data scientist at Intuit, working on the Machine Learning Futures team tackling research problems in the areas of computer vision (CV) and natural language processing (NLP) in order to improve the customer experience within Intuit’s core products. After receiving his PhD in physics, he transitioned to data science. He has previously worked at Stitch Fix, building fashion recommendation models that use computer vision to help understand visual style, as well as at Netflix, building predictive models to help automatically surface issues with sign-up conversion.

Xiao Xiao: Xiao is a data scientist on Intuit’s Artificial Intelligence (AI) team working on using ML to enhance the customer experience. Xiao holds a PhD in ecology and an MS in statistics, and has applied statistical analysis to study ecological patterns at broad spatial and temporal scales.

References:

[1] IRS Publication 6961, Calendar Year Projections of Information and Withholding Documents for the United States and IRS Campuses, 2019 Update.

[2] M. Z. Afzal, A. Kölsch, S. Ahmed, and M. Liwicki. Cutting the error by half: Investigation of very deep CNN and advanced training strategies for document image classification. 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR).

[3] A. Kölsch, M. Z. Afzal, M. Ebbecke, and M. Liwicki. Real-time document image classification using deep CNN and extreme learning machines. 2017 14th IAPR ICDAR.

[4] C. Tensmeyer, and T. Martinez. Analysis of convolutional neural networks for document image classification. 2017 14th IAPR ICDAR.

[5] A. Das, S. Roy, U. Bhattacharya, and S. K. Parui. Document image classification with intra-domain transfer learning and stacked generalization of deep convolutional neural networks. 2018 24th International Conference on Pattern Recognition (ICPR).

[6] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, L. Fei-Fei. ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7] A. W. Harley, A. Ufkes, and K. G. Derpanis. Evaluation of deep convolutional nets for document image classification and retrieval. 2015 13th IAPR ICDAR.

[8] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, S. Gelly. Parameter-Efficient Transfer Learning for NLP. 2019 International Conference on Machine Learning (ICML): 2790–2799.

[9] J. Devlin, M. Chang, K. Lee, K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT): 4171–4186.

[10] T. Kudo and J. Richardson. SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP): 66–71.
