DocPath: Revolutionizing Document Information Extraction with Fine-Tuned Large Language Models
In today’s data-driven world, enterprises handle large volumes of documents every day — ranging from legal contracts and invoices to financial reports and healthcare records. Traditionally, processing these documents has been a manual and time-consuming task. However, with advancements in AI and natural language processing (NLP), organizations are turning to automation to streamline document handling and extract meaningful data efficiently. One of the emerging tools in this domain is DocPath — a fine-tuned large language model designed specifically for information extraction from documents.
The objective of the UiPath Research team is to build the best AI models for enterprise automation, with a focus on production readiness and continuous improvement of model capabilities. This article will introduce UiPath DocPath, a new large language model (LLM) designed specifically for extracting information from documents.
What Is DocPath?
Documents play a crucial role in enterprise processes, enabling efficient information transfer between people and systems while supporting workflows and record-keeping. However, businesses face challenges in processing vast quantities of documents in various formats (structured, semi-structured, and unstructured) at speed and scale.
To address this challenge, the UiPath Research team developed DocPath, revealed at the UiPath AI Summit. DocPath is the new foundational model for UiPath Document Understanding, a platform capability for intelligent document processing. It allows businesses to efficiently process documents like tax forms, invoices, purchase orders, and financial statements out of the box.
DocPath is designed for the specific task of information extraction from documents. Unlike general-purpose AI models like OpenAI’s GPT, DocPath is tailored to enterprise needs. During development, several architectural decisions were considered, including choosing between decoder-only or encoder-decoder models, both of which come with trade-offs in terms of efficiency and performance.
Experiments were conducted with both architectures using a dataset of over 100,000 high-quality semi-structured documents, including invoices, receipts, and purchase orders. After fine-tuning models like Mistral 7B and Llama-2-7B, the team selected Google's FLAN-T5 XL encoder-decoder model for DocPath. The decision was based on key factors:
- Encoder-decoder models demonstrated superior performance on fact-based tasks with limited solution spaces, such as information extraction.
- T5 offered pre-trained models in smaller parameter sizes, enabling easier experimentation.
- The instruction-tuned datasets from FLAN-T5 were publicly accessible, providing useful resources for further tuning and training.
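UiPath has not published its training pipeline, but for readers who want to experiment with the same model family, a minimal, purely illustrative sketch of prompt/completion fine-tuning with the Hugging Face transformers library might look like the following. The checkpoint size, the added tokens, and the single training example are assumptions for illustration, not UiPath's actual setup.

```python
# Minimal sketch (not UiPath's training code): prompt/completion fine-tuning of
# FLAN-T5 with Hugging Face transformers.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-xl"  # smaller checkpoints (base/large) make experimentation cheaper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Positional tokens such as <CL1>/<CX23> are not in the base vocabulary, so they
# would need to be registered as new tokens before training (only the tokens used
# in this toy example are added here).
tokenizer.add_tokens(["<CL1>", "<CX23>", "<CY25>", "<CX25>"])
model.resize_token_embeddings(len(tokenizer))

# One prompt/completion pair: the prompt carries the task description plus the OCR
# text with positional tokens; the target is the structured JSON completion.
prompt = (
    "Given the following text on a semi-structured document along with coordinates, "
    "extract the following fields: invoice-id.\n"
    "<CL1> <CX23> <CY25> Invoice <CX25> <CY25> 235266"
)
target = '{"invoice-id": "<CL1> <CX25> 235266"}'

inputs = tokenizer(prompt, return_tensors="pt")
labels = tokenizer(text_target=target, return_tensors="pt").input_ids

# A forward pass with labels yields the cross-entropy loss; a real run would wrap
# this in a Trainer/optimizer loop over the full document dataset.
loss = model(**inputs, labels=labels).loss
loss.backward()
```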
Prompt Design: Enhancing Accuracy and Efficiency
In contrast to earlier Document Understanding models that relied on token classification using encoder-only transformers, DocPath uses a prompt and completion approach, outputting structured JSON for extracted fields.
Positional tokens, generated from each OCR box, are embedded into the prompt to provide both the content and the position of the text. For example:
Prompt: “Given the following text on a semi-structured document along with coordinates, extract the following fields: invoice-id, invoice-date, total, net-amount.”
Text: <CL1> <CX23> <CY25> Invoice <CX25> <CY25> 235266 <CL2> <CX24> <CY30> Date <CX34> <CY32> 24/1/2023
Target: {"invoice-id" : <CL1> <CX25> 235266 , "invoice-date" : <CL2> <CX34> 24/1/2023}
CL tokens represent line numbers, while CX and CY denote the x and y coordinates of each word. This positional grounding ensures that DocPath can attribute extracted fields back to their original locations in the document accurately.
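The article does not spell out how the coordinate buckets are computed. The sketch below shows one plausible scheme, assuming word boxes are normalized to the page size and quantized onto a 0-99 grid, with one CL token emitted per OCR line; the OcrBox structure and the quantization rule are illustrative assumptions, not the documented DocPath encoding.

```python
# Illustrative sketch only: serializing OCR boxes into the '<CLn> <CXx> <CYy> word'
# positional format shown above. Grid size and box fields are assumptions.
from dataclasses import dataclass

@dataclass
class OcrBox:
    text: str
    x: float   # left edge of the word box, in pixels
    y: float   # top edge of the word box, in pixels
    line: int  # 1-based line number assigned by the OCR engine

def to_prompt_text(boxes: list[OcrBox], page_width: float, page_height: float) -> str:
    """Emit a CL token whenever the line changes, then CX/CY tokens and the word."""
    parts, current_line = [], None
    for box in boxes:
        if box.line != current_line:
            parts.append(f"<CL{box.line}>")
            current_line = box.line
        cx = int(100 * box.x / page_width)    # quantize x to a 0-99 bucket
        cy = int(100 * box.y / page_height)   # quantize y to a 0-99 bucket
        parts.append(f"<CX{cx}> <CY{cy}> {box.text}")
    return " ".join(parts)

boxes = [
    OcrBox("Invoice", 230, 250, line=1),
    OcrBox("235266", 250, 250, line=1),
    OcrBox("Date", 240, 300, line=2),
    OcrBox("24/1/2023", 340, 320, line=2),
]
print(to_prompt_text(boxes, page_width=1000, page_height=1000))
# <CL1> <CX23> <CY25> Invoice <CX25> <CY25> 235266 <CL2> <CX24> <CY30> Date <CX34> <CY32> 24/1/2023
```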
Enhancing Table Extraction
DocPath is designed to handle structured data, including tables. Prompts are designed to extract columns and rows in a single request. For example:
Target: {"line-amount" : {"0" : "<CX27><CY34> 20", "1": "<CX29><CY38> 25"} , "description" : {"0": "<CX16><CY32> Item1", "2": "<CX13><CY32> Item2"}}
This approach ensures accurate extraction of tabular data, such as line items in invoices, improving efficiency in handling structured documents.
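To make the column-oriented format concrete, here is a small illustrative sketch of how such a completion could be pivoted back into table rows. The helper name and the assumption that row indices align across columns are hypothetical, not part of the documented DocPath output contract.

```python
# Illustrative sketch: converting a column-oriented JSON completion (as above)
# into row records, stripping positional tokens from cell values.
import json
import re

completion = (
    '{"line-amount": {"0": "<CX27><CY34> 20", "1": "<CX29><CY38> 25"}, '
    '"description": {"0": "<CX16><CY32> Item1", "1": "<CX13><CY32> Item2"}}'
)

POSITION_TOKEN = re.compile(r"<C[LXY]\d+>\s*")

def rows_from_completion(raw: str) -> list[dict]:
    """Pivot {column: {row_index: value}} into a list of per-row dicts."""
    columns = json.loads(raw)
    row_ids = sorted({idx for cells in columns.values() for idx in cells}, key=int)
    return [
        {col: POSITION_TOKEN.sub("", cells.get(idx, "")).strip() for col, cells in columns.items()}
        for idx in row_ids
    ]

print(rows_from_completion(completion))
# [{'line-amount': '20', 'description': 'Item1'}, {'line-amount': '25', 'description': 'Item2'}]
```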
Inference Optimization
Optimizing inference for real-time document processing is critical, particularly when dealing with large documents containing multiple fields. To improve the speed and accuracy of inference, fields are divided into buckets, and multiple prompts are run in parallel. After testing various inference engines, CTranslate2 was selected for its efficiency and seamless integration with the existing codebase. This solution improves decoding throughput, while confidence scores are assigned to the extracted fields to increase reliability.
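The bucketing heuristic and serving stack are not described beyond the points above, but the general idea can be sketched as follows. The bucket size, the run_completion placeholder, and the log-probability-based confidence proxy are all assumptions for illustration rather than the production implementation.

```python
# Illustrative sketch of the bucketing idea: split the requested fields into groups,
# build one prompt per group, and dispatch the prompts in parallel. run_completion is
# a placeholder for the actual generation step (e.g. a CTranslate2-backed model call).
import json
import math
from concurrent.futures import ThreadPoolExecutor

FIELDS = ["invoice-id", "invoice-date", "total", "net-amount", "vendor-name", "currency"]
BUCKET_SIZE = 3  # assumed; the real grouping heuristic is not documented

def build_prompt(fields: list[str], document_text: str) -> str:
    return (
        "Given the following text on a semi-structured document along with coordinates, "
        f"extract the following fields: {', '.join(fields)}.\n{document_text}"
    )

def run_completion(prompt: str) -> tuple[str, float]:
    """Placeholder for the inference engine: returns (json_completion, avg_token_log_prob)."""
    raise NotImplementedError

def extract(document_text: str) -> dict:
    buckets = [FIELDS[i:i + BUCKET_SIZE] for i in range(0, len(FIELDS), BUCKET_SIZE)]
    prompts = [build_prompt(bucket, document_text) for bucket in buckets]
    extracted = {}
    # One prompt per bucket, dispatched in parallel.
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        for completion, avg_log_prob in pool.map(run_completion, prompts):
            confidence = math.exp(avg_log_prob)  # crude sequence-level confidence proxy
            for field, value in json.loads(completion).items():
                extracted[field] = {"value": value, "confidence": confidence}
    return extracted
```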
Developing DocPath: Continuous Improvement
DocPath development is an ongoing process, with continuous experimentation to enhance performance. The current architecture, which includes positional tokens for grounding, has proven effective. Additional experiments are underway to incorporate image inputs using patch embeddings and layout information from models like LayoutLMv3, with the goal of improving accuracy by integrating document image pixels into the model.
Larger versions of FLAN-T5 are also being explored, alongside potential decoder-only approaches, to determine their effectiveness in information extraction tasks.
Conclusion
DocPath is a fine-tuned large language model that provides powerful capabilities for extracting information from documents. By using prompt-based completions and novel architectural designs, DocPath offers businesses a solution for automating document workflows with increased speed and accuracy.
DocPath is part of the first generation of fine-tuned LLMs developed by UiPath Research, alongside CommPath, a model focused on processing business communications.
(This article is based on insights shared by UiPath. Read more about UiPath DocPath here.)
