How We Are Making Your Rich Documents Talk

Imagine a bustling hospital where medical records are instantly processed, allowing doctors to make quicker and more accurate diagnoses, or a manufacturing company that automatically analyzes supplier contracts to optimize costs. This is the transformative power of harnessing scanned document data in the enterprise.

While our enterprise-grade product Document AI is in private preview, we at Snowflake are pushing further, delving into the fundamental questions that drive the field of document understanding:

  1. How can models not only extract and interpret textual content but also exploit other visual cues, such as layout (page structure, forms, tables), non-textual elements (marks, checkboxes, figures), and style (e.g., typography, highlighting)?
  2. How far can we push domain generalization? Can a model trained on medical records adapt well to manufacturing diagrams or tabular information?
  3. How can we handle the high variability of content and layout in real-world documents, which often leads to highly imbalanced samples within document types? Such long-tailed distributions are particularly challenging when only limited samples are available for model training.

The input in Document Visual Question Answering consists of a question and a file (typically a document image). The system is expected to provide answers in natural language.
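As a concrete, simplified illustration of that interface, here is how an off-the-shelf document question answering model can be queried via the Hugging Face transformers pipeline; the checkpoint name is just an example and is unrelated to DUDE.

```python
# A minimal sketch of the Document VQA interface: a document image plus a
# natural-language question go in, a natural-language answer comes out.
# Assumes transformers, pillow, and pytesseract are installed; the checkpoint
# is an illustrative example, not a model trained on DUDE.
from transformers import pipeline

docvqa = pipeline(
    "document-question-answering",
    model="impira/layoutlm-document-qa",
)

predictions = docvqa(
    image="invoice_page_1.png",            # a scanned page or a rendered PDF page
    question="What is the invoice total?",
)
print(predictions[0]["answer"], predictions[0]["score"])
```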

We’re excited to share some of the research we’ve recently presented at ICCV, one of the leading venues for computer vision. In particular, we present a new dataset specifically designed to answer the big questions we’ve posed, as well as to further the state of the art in document understanding and the emerging area of document visual question answering.

Datasets like ours have proved to be one of the key engines of progress in the age of AI: they focus corporate, government, and academic research on standardized measures of progress while pushing AI technology toward handling deep real-world nuances.

Here comes the DUDE

The Document Understanding Dataset and Evaluation (DUDE) we introduced involves novel kinds of questions, answers, and document layouts based on various document types, sources, and dates across multiple domains (including medical, legal, technical, and financial).

For example, we introduce complex questions requiring comprehension beyond the document content, such as ‘How many text columns are there?’ or ‘Which page contains the largest table?’ These layout-navigating questions bridge the gap between Document Layout Analysis and Document Question Answering paradigms.

Suppose you ask, ‘How many pages have a signature?’ about some real-world document. Answering requires visual comprehension (recognizing a signature), knowledge of layout conventions (what constitutes a page), and the ability to count, which modern models still struggle with.

Moreover, we provide questions demanding arithmetic and comparison operations, and we feature multi-hop questions that probe a model’s robustness in sequential reasoning.

The question ‘What is the difference between how much Operator II and Operator III make per hour?’ requires table comprehension, determining the relevant values, and subtracting the extracted numbers (see the sketch below).

To answer ‘Which states don’t have any marijuana laws?’ the model has to visually comprehend the map and link knowledge from its legend with the depicted regions.
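To make this kind of multi-hop reasoning concrete, here is a toy decomposition of the hourly-rate question above: first locate the relevant table cells, then apply the arithmetic step. The table values and helper name are purely illustrative and not taken from the dataset.

```python
# Illustrative decomposition of a multi-hop arithmetic question.
# The extracted table and the helper below are hypothetical; in a real system
# both hops would be performed by a document-understanding model.
extracted_table = {
    "Operator II": {"hourly_rate": 18.50},
    "Operator III": {"hourly_rate": 21.75},
}

def extract_hourly_rate(table: dict, role: str) -> float:
    """Hops 1 and 2: locate the row for the given role and read its rate."""
    return table[role]["hourly_rate"]

rate_ii = extract_hourly_rate(extracted_table, "Operator II")
rate_iii = extract_hourly_rate(extracted_table, "Operator III")

# Final hop: the question asks for a difference, so the values are subtracted.
answer = abs(rate_iii - rate_ii)
print(f"${answer:.2f} per hour")
```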

Overall, we provide over 40k human-made annotations prepared for 4k multi-page PDFs (both born digital and scanned). Our dataset contains abstractive and extractive answers, including yes/no answers, lists, and unanswerable questions (i.e., we demand that the model correctly identify when the answer cannot be provided). Importantly, more than 90% of our questions are unique, since our target scope is more diverse than that of previous works.
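For intuition only, a single QA record might look roughly like the dictionary below; the field names are our illustration here, not the dataset’s published schema, so please consult the paper and the release files for the authoritative format.

```python
# A hypothetical DUDE-style annotation record (field names are illustrative only).
example_annotation = {
    "doc_id": "a1b2c3",              # multi-page PDF, born digital or scanned
    "question": "How many pages have a signature?",
    "answer_type": "abstractive",    # extractive | abstractive | list | not-answerable
    "answers": ["3"],                # empty when the question is unanswerable
    "page_count": 12,
}
```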

Additionally, we gather diagnostic metadata for the documents and QA pairs in the test set — these are intended to enable a fine-grained analysis of the models’ performance.
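As a sketch of what this metadata enables, per-question scores can be sliced by diagnostic category, e.g., with pandas; the column names below are assumptions made for illustration, not the official schema.

```python
# Sketch: aggregating per-question scores by diagnostic category.
# The column names ("category", "anls") are illustrative assumptions.
import pandas as pd

results = pd.DataFrame(
    {
        "question_id": ["q1", "q2", "q3", "q4"],
        "category": ["arithmetic", "arithmetic", "layout", "unanswerable"],
        "anls": [0.42, 0.00, 0.81, 1.00],
    }
)

print(results.groupby("category")["anls"].mean())
```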

Full details can be found in the paper.

But LLMs!

The community has already recognized the dataset and the proposed evaluation procedure. In particular, DUDE was one of the shared tasks featured at the ICDAR conference earlier this year. Participating researchers showcased intriguing model extensions, such as combining models that learn strong document representations with the strengths of recent large language or vision-language models.

Results per diagnostic category. The average ANLS metric for humans and the best-performing models evaluated in the paper.

Nevertheless, the performance of current state-of-the-art models, including LLMs, lags far behind human baselines, which highlights that we have proposed a long-standing, challenging benchmark that requires more holistic and efficient modeling of language, vision, and richly structured layouts.
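For reference, ANLS (Average Normalized Levenshtein Similarity) is the metric reported in the figure above. A minimal, self-contained sketch of the standard formulation (per question, the best match over the ground-truth answers is scored as one minus the normalized edit distance, zeroed out beyond a 0.5 threshold) looks roughly like this; the official evaluation additionally accounts for list answers and unanswerable questions, so treat this as an approximation.

```python
# Minimal ANLS (Average Normalized Levenshtein Similarity) sketch.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(predictions: list[str], gold: list[list[str]], tau: float = 0.5) -> float:
    """Average over questions of the best per-answer similarity, thresholded at tau."""
    scores = []
    for pred, answers in zip(predictions, gold):
        best = 0.0
        for ans in answers:
            p, a = pred.strip().lower(), ans.strip().lower()
            nl = levenshtein(p, a) / max(len(p), len(a), 1)
            best = max(best, 1.0 - nl)
        scores.append(best if best > 1.0 - tau else 0.0)
    return sum(scores) / max(len(scores), 1)

print(anls(["Operator III"], [["operator iii", "Op. III"]]))  # -> 1.0
```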

The ‘Beyond Plain-text World’ awaits.

Lukasz Borchmann
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Machine Learning Researcher with a primary focus on Natural Language Processing and Document Understanding.