WACV 2024: Intuit AI Research Develops End-to-End Method for Document Enhancement using Diffusion Models

Jiaxin Zhang
Intuit Engineering
Jan 4, 2024

This blog is co-authored by Jiaxin Zhang, staff research scientist; Joy Rimchala, principal data scientist; Lalla Mouatadid, staff research scientist; Kamalika Das, manager, AI Research Program; and Kumar Sricharan, VP and chief architect for AI at Intuit.

Consumers and small businesses grapple daily with a large number of physical documents, such as receipts, invoices, and tax forms, most of which need to be converted to digital files before they can be processed for tax and accounting purposes. However, these critical documents are often degraded or damaged in various ways, which can significantly impair the performance of optical character recognition (OCR) systems.

OCR systems convert images of typed, handwritten, or printed text into machine-encoded text, and their accuracy depends heavily on document image quality. This makes document enhancement, which improves document quality using advanced image processing techniques such as denoising, restoration, and deblurring, crucial for automatic document processing and document intelligence.

Document enhancement tasks include denoising, shadow removal, binarization, watermark removal, deblurring, and defading.

However, applying these techniques directly to document enhancement may not be effective, due to the unique challenges posed by text documents. Unlike typical image restoration tasks, where the degradation function is known and the image recovery task can be solved by inpainting, deblurring/super-resolution, and colorization, real-world document enhancement is a blind denoising process with an unknown degradation function, which makes it even more challenging. Many state-of-the-art methods rely on assumptions and prior information, but there is still a need for more effective techniques that can handle unknown degradation functions.

Furthermore, the task of document enhancement presents several unique challenges, including:

  • High resolution: since many documents are up to 2048x2048 pixels, previous methods struggle with scalability, which can lead to performance degradation and a significant increase in training costs.
  • Lack of large benchmark datasets: it isn’t feasible to use large pre-trained models for document understanding. While the success of large generative models such as Stable Diffusion, DALL·E, and Imagen is largely attributed to large datasets such as LAION-5B, there is currently no commercially available or open source large pre-trained model for document-level tasks.
  • Character feature damage: unlike image translation at the pixel level, document-level image translation must preserve original content, such as characters and words, while accounting for style differences. Current methods focus only on pixel-level information and ignore critical character features such as glyphs.

To date, most existing document enhancement methods require supervised data pairs, which raises concerns about data separation and privacy protection, and makes it challenging to adapt these methods to new domain pairs.

With these challenges in mind, our AI Research Program team here at Intuit has developed an unsupervised end-to-end document-level image translation method for document enhancement, which we’re presenting this week at the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2024. For a deep dive into the results of our research, see our paper, “Document Enhancement using Cycle-Consistent Diffusion Models.”

Following is a high-level synopsis for reference:

First, we needed to solve three core problems: unpaired supervision, data privacy protection, and cycle consistency enforcement. Next, we introduced data augmentation strategies for overcoming a lack of large benchmark datasets, while improving character and word feature preservation.
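To make the augmentation idea concrete, here is an illustrative sketch (not the paper's exact recipe) of one common pattern: synthesizing degraded copies of clean document images so a model can be trained without scarce real degraded data. The function name and parameters below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def degrade(doc, noise_std=0.1, shadow_strength=0.3):
    """Synthesize a degraded copy of a clean document image (values in [0, 1]):
    additive Gaussian noise plus a smooth horizontal shadow gradient."""
    h, w = doc.shape
    # Shadow darkens the page progressively from left to right.
    shadow = 1.0 - shadow_strength * np.linspace(0.0, 1.0, w)[None, :]
    noisy = doc * shadow + rng.normal(0.0, noise_std, doc.shape)
    return np.clip(noisy, 0.0, 1.0)
```

Pairing each clean page with such synthetic degradations yields unlimited training examples for the noisy-domain model.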

Inspired by recent advances in diffusion models, we use an approach that trains the source (noisy) and target (clean) models independently, decoupling paired training and enabling the domain-specific diffusion models to remain applicable to other pairs. Our document enhancement using cycle-consistent diffusion models (DECDM) research builds on denoising diffusion implicit models (DDIMs) to create a deterministic and reversible mapping between images and their latent representations using an ordinary differential equation (ODE). Translation with DECDM on a source-target pair requires two different ODEs: the source ODE encodes input images to the latent space, and the target ODE decodes images in the target domain.
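The encode/decode idea can be sketched as follows. This is a toy, self-contained illustration of deterministic DDIM stepping with two independently trained noise-prediction models; `eps_src`, `eps_tgt`, and the noise schedule are stand-ins, not the paper's actual implementation:

```python
import numpy as np

def ddim_step(x, eps_model, t_from, t_to, alpha_bar):
    """One deterministic DDIM update from timestep t_from to t_to (either direction)."""
    a_from, a_to = alpha_bar[t_from], alpha_bar[t_to]
    eps = eps_model(x, t_from)
    # Predict the clean signal, then re-noise it to the destination timestep.
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

def translate(x_src, eps_src, eps_tgt, alpha_bar):
    """DECDM-style translation: encode with the source ODE, decode with the target ODE."""
    T = len(alpha_bar) - 1
    # Encode: integrate the source probability-flow ODE forward (image -> latent).
    z = x_src
    for t in range(T):
        z = ddim_step(z, eps_src, t, t + 1, alpha_bar)
    # Decode: integrate the target ODE backward (latent -> target-domain image).
    x = z
    for t in range(T, 0, -1):
        x = ddim_step(x, eps_tgt, t, t - 1, alpha_bar)
    return x
```

Because each DDIM step is deterministic and invertible, swapping `eps_tgt` for a model trained on clean documents maps a noisy page's latent into the clean domain without ever training on paired data.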

Since trained diffusion models are specific to individual domains and rely on no domain-pair information, DECDM makes it possible to save a trained model of a given domain for future use when it arises as the source or target in a new pair. Pairwise translation with DECDM requires only a linear number of diffusion models, which can be further reduced with conditional models. Additionally, the training process focuses on one dataset at a time and does not require scanning both datasets concurrently, preserving the data privacy of the source and target domains. We also introduce simple data augmentation strategies to improve character-glyph conservation during translation. Notably, our method is also scalable, handling high-resolution images via a sub-sampling strategy and diffusion models.
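The sub-sampling idea for high-resolution pages can be sketched as simple tiling; the paper's actual strategy may differ (e.g. overlapping patches with blending), and `translate_fn` is a placeholder for a trained translation model:

```python
import numpy as np

def tile_translate(image, translate_fn, tile=256):
    """Translate fixed-size tiles independently and stitch them back together,
    so the diffusion model never has to process the full-resolution page at once."""
    h, w = image.shape
    out = np.zeros_like(image)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            out[y:y + tile, x:x + tile] = translate_fn(image[y:y + tile, x:x + tile])
    return out
```

Keeping each tile at the model's native resolution sidesteps the scalability problem noted earlier for 2048x2048 documents.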

When compared with state-of-the-art methods on multiple synthetic and benchmark datasets for tasks such as document denoising and shadow removal, DECDM demonstrates superior performance, both quantitatively and qualitatively. It provides an unsupervised, end-to-end solution for document image enhancement that offers several advantages over today’s advanced methods, including adaptability to new domain pairs and data privacy protection. These unique capabilities make DECDM a more robust, safe, and scalable solution for improving OCR performance in document automation and intelligence.

In future work, we aim to build a large pre-trained diffusion model that incorporates large document datasets to address the current limitations caused by data sparsity, augmentation, and character/word context recognition. This will allow us to further improve the performance and efficiency of DECDM and advance the field of document image enhancement.

_________________________________________________________________

Intuit’s AI Research Program is an intrapreneurial function within the company that pushes the boundaries of AI. We develop and incubate AI-driven technology breakthroughs to solve our customers’ most important financial problems.

We’re a diverse team of research scientists, data scientists, and engineers with extensive expertise in AI, including natural language processing, generative AI, robust and explainable AI, symbolic AI, machine learning, and optimization.

To connect with us about open roles, partnerships, or collaborations, contact ai-research@intuit.com
