An Introduction to Vision Transformers for Document Understanding
Here at Unstructured, we use advanced document understanding techniques to help data scientists extract key information from PDFs, images, and Word documents. The goal of this blog post is to provide an overview of the document understanding models that power our open source core library.
Document understanding algorithms analyze the content of documents with an encoder-decoder pipeline that combines computer vision (CV) and natural language processing (NLP) methods. The CV part of the pipeline analyzes the document as an input image to produce a representation that a transformer can process, similar to how an NLP model processes tokens. In the figure below, the CV model generates an image embedding that is fed into a multimodal transformer.
Traditionally, convolutional neural networks (CNNs) such as ResNet have dominated the CV field. Recently, however, vision transformers (ViTs) similar to NLP architectures such as BERT have gained traction as an alternative approach to CNNs. ViTs first split an input image into several patches, convert the patches into a sequence of linear embeddings, and then feed the embeddings into a transformer encoder. This process is depicted in Figure 2. The linear embeddings play a role similar to tokens in NLP. As with NLP models, the output of the transformer can be used for tasks such as image classification.
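As a rough illustration (not the exact ViT implementation), the patch-splitting and linear-embedding step can be sketched in a few lines of NumPy; the image size, patch size, and embedding dimension below are arbitrary choices:

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    return (
        image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
        .transpose(0, 2, 1, 3, 4)                       # group patch rows and columns
        .reshape(-1, patch_size * patch_size * c)       # one flat vector per patch
    )

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))      # dummy RGB image standing in for a document page
patches = patchify(image)              # (196, 768): a 14x14 grid of 16x16x3 patches
projection = rng.random((768, 384))    # stand-in for a learned linear projection
tokens = patches @ projection          # sequence of 196 patch embeddings of dimension 384
```

The resulting `tokens` array plays the same role as a sequence of token embeddings in an NLP model: it is what the transformer encoder attends over.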
ViTs have several advantages over CNNs: they can capture global relationships across an image and appear more resilient to adversarial attacks. A disadvantage is that ViTs need more training examples, because CNNs have inductive biases that allow them to learn from less data. We can mitigate this issue by pre-training vision transformers on large image datasets. ViTs are also compute intensive: the amount of compute required to run a transformer grows quadratically with the number of tokens. Vision transformers are now available as part of HuggingFace Vision Encoder Decoder models, as shown in the snippet below.
from transformers import BertConfig, ViTConfig, VisionEncoderDecoderConfig, VisionEncoderDecoderModel

# Pair a ViT image encoder with a BERT-style text decoder
config_encoder = ViTConfig()
config_decoder = BertConfig()
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
model = VisionEncoderDecoderModel(config=config)
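To make the quadratic scaling mentioned above concrete, here is a back-of-the-envelope calculation (the patch size and image sizes are arbitrary example values):

```python
def num_patches(image_size, patch_size=16):
    """Number of patch tokens for a square image split into square patches."""
    return (image_size // patch_size) ** 2

for size in (224, 448):
    n = num_patches(size)
    # Self-attention compares every token with every other token,
    # so its cost scales with n ** 2.
    print(f"{size}x{size} image -> {n} tokens -> {n ** 2} attention scores")
```

Doubling the side length of the input image quadruples the number of patch tokens, which in turn multiplies the attention cost by sixteen.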
Vision Encoder Decoders provide the foundation for many document understanding models. The Donut model first processes an input image with an image transformer and then feeds the result to a decoder that generates a structured representation of the input document. In the example below, we provide the image of a receipt and the model outputs structured JSON containing the line items of the receipt.
Whereas some document understanding models such as LayoutLMv3 require preprocessing to identify bounding boxes and perform OCR, Donut converts the input image directly into the target JSON, as shown in the code below. A downside to this approach is that the output does not include bounding boxes, and therefore does not indicate where in the document each extraction came from.
from donut.model import DonutModel
from PIL import Image

# Load a fine-tuned Donut model and run inference on a document image.
# The image path and task prompt below are illustrative placeholders.
model = DonutModel.from_pretrained("./custom-fine-tuned-model")
image = Image.open("./invoice.png").convert("RGB")
prediction = model.inference(image=image, prompt="<s_invoice>")

The prediction contains structured fields such as:

{
  "InvoiceId": "# 560578",
  "VendorName": "THE LIGHT OF DANCE ACADEMY",
  "VendorAddress": "680 Connecticut Avenue, Norwalk, CT, 6854, USA",
  "AmountDue": "Balance Due:",
  "CustomerName": "Eco Financing",
  "customerAddress": "2900 Pepperrell Pkwy, Opelika, AL, 36801, USA",
  "Description": "FURminator deShedding Tool",
  "Description": "Roux Lash & Brow Tint",
  "Description": "Cranberry Tea by Alvita - 24 Bags"
}
The Unstructured team is currently working on pipelines that use Donut to extract structured information from receipts and invoices. Stay tuned in the coming weeks as we prepare to release these models on GitHub and HuggingFace. Follow us on LinkedIn and Medium to keep up with all of our latest work!