Part 2: Information Extraction from ID Documents with Donut 🍩

Paul Lefevre
6 min read · Sep 22, 2023


This two-part series was written as part of an internship in the DataLab of the French Digital Agency for Law Enforcement (ANFSI). The ANFSI leverages OCR techniques to speed up the processing of ID documents, bank details, etc.

This article was written with the help of this article by Neha Desaraju.

Photo by NajlaCam on Unsplash

This tutorial is the second part of a two-part series, where we learn how to extract information from ID documents such as name, date of birth, etc.

In Part 1, we look at how to detect an ID document in an image using YOLOv8, crop the image around it, and correct its orientation.

In Part 2, we train a Donut model to extract information from the ID documents obtained in Part 1.

Prerequisites

For this second part, you will need a dataset of ID documents (at least 100–150 samples), preferably cropped and correctly oriented (refer to Part 1!). The dataset should ideally be balanced across document types (e.g. as many passports as ID cards). Ideally, you also have a file containing the ID information for each sample. If not, I’m sorry to tell you that you will have to manually annotate at least 100–150 samples 😉.

You will also need access to a system equipped with a GPU to train the model. If you don’t have one on your machine, I recommend using Google Colab. If you choose to use Google Colab, make sure to enable GPUs for the notebook by navigating to Edit > Notebook Settings and selecting GPU from the Hardware Accelerator drop-down.
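To quickly check that PyTorch can see your GPU (for example on a fresh Colab runtime), you can run:

import torch

# Should print True on a correctly configured GPU runtime
print(torch.cuda.is_available())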

What you will learn

In this tutorial, we are going to train a Visual Document Understanding (VDU) model, Donut 🍩, to extract information from images of identity documents without OCR preprocessing.

The code in this tutorial was written and tested using the following module versions:

  • Python 3.8
  • torch==2.0.1
  • torchvision==0.15.2
  • donut-python==1.0.9
  • pytorch-lightning==1.8.5
  • transformers==4.11.3
  • timm==0.5.4
  • Pillow==9.5

1. Labeling your data

If you already have a file containing the ID information from the documents you have, you can skip to step 2. Otherwise, the first step of this tutorial is to create such a file. I would recommend using a spreadsheet and filling in the information for each ID document, like the following:

Labeling your data in a spreadsheet

Next, export your spreadsheet to a labels.csv file. Make sure to label at least 100 to 150 images in order to have decent results.
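For reference, here is what the first few lines of labels.csv might look like, reusing the fake identities shown later in this article and assuming the column names expected by the script in step 2 (filename, surname, name, sex, birthday, birthplace):

filename,surname,name,sex,birthday,birthplace
image1.jpg,Sharapova,Maria Yourievna,F,04/19/1987,Niagan
image2.jpg,Berthier,Corinne,F,12/06/1965,Paris 1er (75)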

2. Preparing your dataset

Then, you have to process your labels so that they are in the format expected by Donut. The structure of the dataset for Donut is the following:

data/
├── train/
│   ├── image1.jpg
│   ├── image2.jpg
│   .
│   .
│   ├── metadata.jsonl
├── val/
│   ├── image151.jpg
│   ├── image152.jpg
│   .
│   .
│   ├── metadata.jsonl
├── test/
│   ├── image231.jpg
│   ├── image232.jpg
│   .
│   .
│   ├── metadata.jsonl

The metadata.jsonl files contain one label per line; each line has the following structure:

{"file_name": "image1.jpg", "ground_truth": "{\"gt_parse\": {\"surname\": \"Sharapova\", \"name\": \"Maria Yourievna\", \"sex\": \"F\", \"birthday\": \"04/19/1987\", \"birthplace\": \"Niagan\"}}"}

Here is a Python script that splits your images into three sets (train, val, and test) and creates the corresponding metadata.jsonl files:

import json
import os
import shutil

from tqdm import tqdm
import pandas as pd


def create_sets(df, train=0.7, val=0.2, test=0.1):
    """
    train, val and test are the proportions of the images
    that go in each split.
    """
    # Compare with a tolerance: 0.7 + 0.2 + 0.1 != 1.0 exactly in floating point
    if abs(train + val + test - 1) > 1e-9:
        raise ValueError("train + val + test != 1")

    # Create the folders
    train_folder = "data/train"
    val_folder = "data/val"
    test_folder = "data/test"
    os.makedirs(train_folder, exist_ok=True)
    os.makedirs(val_folder, exist_ok=True)
    os.makedirs(test_folder, exist_ok=True)

    # Shuffle your data
    samples = df.sample(frac=1.).reset_index(drop=True)

    # Compute the number of images for each split
    n = len(samples)
    n_train = train * n
    n_val = val * n
    n_test = test * n

    for idx, row in tqdm(samples.iterrows(), total=samples.shape[0]):
        data = {
            "surname": row["surname"],
            "name": row["name"],
            "sex": row["sex"],
            "birthday": row["birthday"],
            "birthplace": row["birthplace"],
        }
        file_name = row["filename"]

        gt_parse = {"gt_parse": data}

        line = {
            "file_name": file_name,
            "ground_truth": json.dumps(gt_parse)
        }

        # We assume that your images are in
        # a folder named "images/"; correct if necessary
        image_path = os.path.join("images", file_name)

        # Copy the image into one of the folders
        # and append a line to metadata.jsonl.
        # Note: the files are opened in append mode,
        # so delete the data/ folder before re-running.
        if idx < n_train:
            dest_path = os.path.join(train_folder, file_name)
            shutil.copyfile(image_path, dest_path)
            with open("data/train/metadata.jsonl", "a") as f:
                f.write(json.dumps(line) + "\n")

        elif n_train <= idx < n_train + n_val:
            dest_path = os.path.join(val_folder, file_name)
            shutil.copyfile(image_path, dest_path)
            with open("data/val/metadata.jsonl", "a") as f:
                f.write(json.dumps(line) + "\n")

        elif n_train + n_val <= idx < n_train + n_val + n_test:
            dest_path = os.path.join(test_folder, file_name)
            shutil.copyfile(image_path, dest_path)
            with open("data/test/metadata.jsonl", "a") as f:
                f.write(json.dumps(line) + "\n")


df = pd.read_csv("labels.csv")
create_sets(df)

3. Setting up Donut 🍩

Let’s now set up Donut! First, install the donut-python module with pip:

pip install donut-python

You also need to clone the Donut repository into your working directory, as we will need some files from it.

git clone https://github.com/clovaai/donut.git

Next, you need to prepare the configuration file that will be used for training. For the information extraction task, we will start with the train_cord.yaml configuration file found in the config/ folder of the Donut repository. Copy it to your working directory and rename it as you like (e.g. train_id.yaml), for instance:
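cp donut/config/train_cord.yaml train_id.yaml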

Then, you have to change some lines in this file:

dataset_name_or_paths: ["path/to/your/dataset/folder"]

train_batch_sizes: [2] # We lowered the batch size to 2 because of memory limitations on our machine

num_training_samples_per_epoch: 800 # Set it to the number of training images you have

max_epochs: 10 # Choose a number of epochs

warmup_steps: 400 # 10% of total steps, i.e. num_training_samples_per_epoch / train_batch_sizes * max_epochs / 10

When loading the model in Python, Donut will download pretrained weights from HuggingFace. You might encounter an error if your company’s network policy doesn’t allow downloading files with Python (mine doesn’t). To circumvent this issue, you need to manually download the model from the HuggingFace website. Go to this link, make sure you are on the official branch (and not main), and download all the files. Put them in a folder named donut-base. Finally, change the following line in your configuration file:

pretrained_model_name_or_path: "path/to/donut-base/folder"

4. Fine-tuning Donut 🍩

Now it’s time to train our model! This step is very straightforward; just run the train.py script from the Donut repository you cloned:

cd donut && python train.py --config train_id.yaml

You might need to change some lines in the train.py file if you encounter errors related to your machine. In my case, I had to comment out strategy="ddp" (line 146).

5. Running inference

Once you have trained your model, you can run some inference on an image with the following code snippet:

from donut import DonutModel
from PIL import Image
import torch

# Change the path here:
model = DonutModel.from_pretrained("path/to/result/train_id/20230911_140235")

if torch.cuda.is_available():
    model.half()
    device = torch.device("cuda")
    model.to(device)
else:
    model.encoder.to(torch.bfloat16)

model.eval()

image = Image.open("path/to/image.jpg").convert("RGB")  # Change here
with torch.no_grad():
    output = model.inference(image=image, prompt="<s_data>")

print(output)

Note that you need to change the inference prompt to <s_{dataset_folder_name}> (our dataset folder from step 2 is named data, hence the <s_data> prompt above). Running this code on this (fake) document:

I got the following result (I’ve replaced the single quotation marks with double ones for syntax highlighting):

{"predictions": [{
"surname": "Berthier",
"name": "Corinne",
"sex": "F",
"birthday": "12/06/1965",
"birthplace": "Paris 1er (75)"
}]}

Training also produces a *tfevents* file so that you can plot the training loss and validation loss curves with TensorBoard. I obtained the following validation loss curve:

Validation loss for our training
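To plot these curves yourself, you can point TensorBoard at the result folder created during training (the exact path depends on your experiment name and timestamp):

tensorboard --logdir path/to/result/train_id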

You can also get an accuracy score on your test set by running the following line:

python test.py --dataset_name_or_path path/to/your/data/folder --pretrained_model_name_or_path path/to/your/trained/model/folder --save_path ./result/output.json

What’s next?

Et voilà ! You have a model capable of extracting information from ID documents! We found that Donut was a very effective model at this task, performing best among all the methods we tried: its average character error rate on our test set was less than 3%.

You can read more about Donut in the official paper.

What is the next step? It is now up to you! For example, you could try building a demo web app using Dash, as sketched below. Good luck!
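As a starting point, here is a minimal Dash sketch (an assumption-laden illustration, not a finished implementation) that lets you upload an image and runs the inference code from step 5; the model path and the <s_data> prompt are carried over from the snippets above:

import base64
import io

from dash import Dash, dcc, html, Input, Output
from donut import DonutModel
from PIL import Image

# Assumed model path, same as in the inference snippet (see step 5 for GPU setup)
model = DonutModel.from_pretrained("path/to/result/train_id/20230911_140235")
model.eval()

app = Dash(__name__)
app.layout = html.Div([
    dcc.Upload(id="upload", children=html.Button("Upload an ID document")),
    html.Pre(id="output"),
])

@app.callback(Output("output", "children"), Input("upload", "contents"))
def extract(contents):
    if contents is None:
        return "No document uploaded yet."
    # dcc.Upload provides a base64 data URL ("data:image/jpeg;base64,...")
    _, b64 = contents.split(",", 1)
    image = Image.open(io.BytesIO(base64.b64decode(b64))).convert("RGB")
    # Use the prompt matching your dataset folder name, e.g. "<s_data>"
    output = model.inference(image=image, prompt="<s_data>")
    return str(output)

if __name__ == "__main__":
    app.run(debug=True)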
