Chargrid: Towards Understanding 2D Documents
Anoop Katti, Christian Reisswig, Johannes Höhne, Cordula Guder, Sebastian Brarda, Steffen Bickel, and Jean Baptiste Faddoul (Deep Learning Center of Excellence)
Content in the digital age comes in a variety of layouts, formats, and rich multimedia assets, all of which are essential to understanding the textual information. Humans are capable of perceiving content and processing information in different ways. We can read a text closely and linearly to fully grasp its content, but we can also skim or scan documents to extract specific information. When it comes to digital content where format matters and may even dictate how we read a document, our brains adapt and cultivate new cognitive skills, allowing us to read in a tabular or non-linear fashion. For instance, we can quickly read through 2D documents such as presentations, multimedia blogs, or websites and recognize the structure of the text and the logical flow of ideas amidst all the different assets within a non-linear layout. Can machine learning models also adequately understand and extract information from 2D documents?
Current State-of-the-art Approaches: Text vs. Pixels as Input
With the continuous progress in natural language processing (NLP) methods, machine learning models are now capable of understanding and extracting information from unstructured text such as books, news articles, or short text snippets with almost human-level accuracy. However, when it comes to understanding structured or formatted text, i.e., 2D documents where the layout structure is crucial to the semantics, machines are still not as versatile and adaptive as humans. They remain trapped in a ‘linear’ understanding paradigm. Current state-of-the-art NLP methods work solely on the text level, processing documents in a sequential, linear fashion. This represents information as serialized lines and completely disregards the 2D layout. Therefore, in situations where layout structure and positioning are indispensable to understanding the textual content, NLP methods see only a jumbled sequence of characters, making the document even harder to understand.
Computer Vision (CV) approaches, on the other hand, can process documents as images, using pixel-level input. We can extract information from these documents through object detection and semantic segmentation tasks. This approach retains the 2D layout but operates only at the low level of pixel units, not the textual content. Such approaches are suitable if we only need to analyze the document’s layout without understanding its text, somewhat resembling a human trying to understand a document in a foreign language.
Therefore, standard NLP and CV methods force an either/or choice: work on the text level for an in-depth understanding of the semantics but lose the 2D layout, or work on the pixel level of documents as images to retain the layout but lose the textual content. How can we blend these two approaches? In our recent EMNLP 2018 paper, Chargrid: Towards Understanding 2D Documents, we present a novel document representation: the character grid, or chargrid. It uses a 2D grid of characters to preserve the layout structure of a document while simultaneously operating on its textual content.
Chargrid: Building Character-Pixels
The first step in building the chargrid document representation is mapping every character (e.g., a letter, digit, or punctuation mark) to a constant numeric value. For example, the character “A” could be represented by the integer 1, “B” by 2, “C” by 3, and so on. A document is composed of many characters, and each character occupies a certain area on the page. We extract the characters along with their locations from the input document (using systems such as OCR or pdf2text). We then create an empty canvas (the chargrid) and, for each extracted character, place its integer value onto the canvas in the region the character occupies in the original document. This builds a 2D document representation in which the textual content is placed as a grid of characters at its original locations, enabling us to work on the text within a document while preserving its 2D layout. We show an example for the characters “Ch” in the picture below.
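The construction above can be sketched in a few lines of code. This is a minimal illustration, not the paper's implementation: the character boxes, canvas size, and the simple letter-to-integer mapping are all assumptions made for the example.

```python
import numpy as np

# Hypothetical character boxes as an OCR or pdf2text system might return
# them: (character, x, y, width, height) in pixel coordinates.
char_boxes = [
    ("C", 10, 10, 8, 12),
    ("h", 18, 10, 8, 12),
]

def build_chargrid(char_boxes, height, width):
    """Paint each character's integer index onto an initially empty
    (all-zero) canvas, over the area the character occupies."""
    grid = np.zeros((height, width), dtype=np.int32)
    for char, x, y, w, h in char_boxes:
        # Illustrative mapping: a/A -> 1, b/B -> 2, ...; 0 is background.
        index = ord(char.lower()) - ord("a") + 1
        grid[y:y + h, x:x + w] = index
    return grid

grid = build_chargrid(char_boxes, height=40, width=40)
```

The resulting integer array has the same spatial structure as the document image, so it can be fed to a convolutional network just like an image, with characters instead of colors as the per-pixel values.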
Connecting the Dots: Understanding Documents with Chargrid
Document understanding means capturing the semantic content of a document at multiple levels: characters, words, paragraphs, and layout elements. Equipped with the chargrid representation, we formulate document understanding as instance-level semantic segmentation on the chargrid. This produces two outputs that facilitate information extraction: a semantic segmentation and bounding boxes. The semantic segmentation identifies the different classes or labels within a document, while the bounding boxes locate the multiple instances of the same class occurring in the document. We applied our approach to information extraction from invoices to test its ability to accurately extract key information from 2D documents of various layouts and formats.
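To make the two outputs concrete, here is a sketch of how a per-pixel class mask and predicted instance boxes could be decoded into extracted field values. The class list, box format, and helper names are assumptions for the example, not the paper's actual post-processing.

```python
import numpy as np

# Illustrative class labels; index 0 is background.
CLASSES = ["background", "invoice_number", "invoice_amount"]

def extract_fields(seg_mask, boxes, char_boxes):
    """Decode network outputs into (field name, text) pairs.

    seg_mask:   (H, W) array of per-pixel class indices (semantic output).
    boxes:      list of (class_index, x0, y0, x1, y1) instance boxes.
    char_boxes: list of (char, x, y) character positions from OCR.
    """
    fields = []
    for cls, x0, y0, x1, y1 in boxes:
        # Collect characters that lie inside the box and whose pixel
        # was assigned the box's class by the segmentation.
        chars = [c for c, x, y in char_boxes
                 if x0 <= x < x1 and y0 <= y < y1
                 and seg_mask[y, x] == cls]
        fields.append((CLASSES[cls], "".join(chars)))
    return fields
```

The segmentation alone would merge all "invoice number" pixels into one region; the boxes are what separate repeated instances of the same class, e.g., multiple line items.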
Chargrid in Action: Application on Information Extraction from Invoices
Large multinational companies receive tens of thousands to several million invoices a year from vendors across the globe. These invoices are written in different languages, use different formats for dates, currencies, and taxation, and come in various layouts. Extracting key information from them is a complex and tedious task. We applied our formulation of document understanding (instance-level segmentation on chargrid) to extract useful information from invoices. To this end, we first compiled a dataset of scanned sample invoices, ensuring that it contains invoices of various layouts, in different languages (such as English, French, German, Spanish, and Norwegian), and with at most six invoices from any single vendor. We then trained a model on this dataset to extract key information from invoices, i.e., header fields such as “Invoice Number,” “Invoice Date,” “Invoice Amount,” “Vendor Name,” and “Vendor Address,” as well as line items, including their associated price, description, and quantity. Below, we show some samples of the chargrid and the output of the neural network.
We compared our results with a sequential model working only on the text level and an image-based approach working on the pixel level. Chargrid and the sequential NLP approach performed equally well on simple (typically single-word) fields such as “Invoice Number” or “Invoice Amount.” For multi-word or larger fields such as line-item description, quantity, or amount, where the 2D layout and structure are crucial for accurate extraction, chargrid significantly outperformed the sequential model. Chargrid also beat the image-based model, especially on smaller fields and on complex extraction tasks that require an understanding of the text.
We are still far from building ‘literate’ models that can capture the complexity and nuance of textual content in all its forms. Our approach presents a first step towards integrating the 2D layout structure into document understanding tasks, enabling a model to aptly grasp 2D documents with different layouts and extract relevant information. This new document representation is not limited to invoices; it can be applied to other document types such as resumes, contracts, reports, web pages, and scientific papers. We are also curious to test chargrid on other NLP tasks or in scenarios where text and natural images are blended.
Check out our poster below, presented at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2018). For more details, please refer to our paper, and share your insights with us in the comments section.