How to Easily Create Custom Dataset for Document Understanding Transformer (Donut)?

Yana Stamenova
7 min read · Jan 14, 2023



Hey guys, this is my first article on Medium. I am diving into Deep Learning, but I am still learning and have a long way to go before I can call myself a true professional. Nevertheless, I hope you will find the following text useful, and I would appreciate any recommendations for improving my code.

I have recently faced one of the biggest fears of Deep Learning engineers: the lack of adequately formatted and tagged data for the model of my heart. My task is to read data from documents that exist only as images. That means OCR-ing each image and then “understanding” the text. And because life is not that easy, I only have the scanned images: no tags, no labels, no bounding boxes.

While doing my research on the task, I discovered Donut, the OCR-free Document Understanding Transformer (more info here). In short, the model uses an image Transformer encoder and an autoregressive text Transformer decoder and can be applied to tasks such as document image classification and form understanding. Of course, I was impressed, because I found everything I needed in one model. Moreover, this model seems to need very little: just an image and the ground truth data from that image.
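For orientation, this is roughly how the pretrained base checkpoint can be pulled from the Hugging Face Hub (a minimal sketch, assuming the transformers library is installed; "naver-clova-ix/donut-base" is the public base checkpoint, and the actual fine-tuning comes later):

from transformers import DonutProcessor, VisionEncoderDecoderModel

# Sketch only: load the pretrained Donut base checkpoint to get a feel for the API
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")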

And here comes the sad part: I only have the images. So I started thinking (not that I did not think before :D) about what I could do to create a dataset and test this great new Transformer.

After some head-banging-against-the-wall time, I finally got it. I would use some of the old goodies (Tesseract) to create a custom dataset for the new model.

Just a little addition here: I will not be using my own dataset because it’s private and the documents contain sensitive information. For this article, I created a small Kaggle notebook that contains part of the code and used the Kaggle dataset Invoices — Certificate of origin.

Step by step, ooh baby


Image tagging
For this part, I used an annotation tool called doccano. It is open-source and very easy to use after installation. You can use any other tool that suits you. What I get from doccano is a jsonl (JSON Lines) formatted file that contains the coordinates (x1, y1) of the upper-left corner of each bounding box, together with its height and width, plus the file name and id of the image; an illustrative record and the code for postprocessing the doccano tag information follow below.
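A single exported line might look roughly like this (an illustrative record only; the exact schema depends on your doccano version and project setup, but these are the fields the code consumes, with made-up coordinates):

{"id": 50975, "filename": "FORM-B-Certificate-of-Origin.jpg", "bbox": [{"x": 102.0, "y": 64.0, "width": 310.0, "height": 28.0, "label": "DOCUMENT_NAME"}, {"x": 98.0, "y": 210.0, "width": 260.0, "height": 24.0, "label": "SUPPLIER_NAME"}]}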

import re
import os
import json

import numpy as np
import pandas as pd

import cv2
import PIL
from PIL import Image, ImageDraw

import pytesseract
from pytesseract import Output

# JSON Lines: each line in the export is a separate JSON object
with open('/kaggle/input/tags-coordinates/tags-coordinates.jsonl') as f:
    image_tags = [json.loads(line) for line in f]

# One row per bounding box, keeping the image id and filename on every row
flatten = pd.json_normalize(image_tags, meta=["id", "filename"], record_path=["bbox"])

flatten.rename(columns={"x": "x1", "y": "y1"}, inplace=True)

After renaming is done, it is time to get the other important coordinates:

flatten['x2'] = flatten['x1'] + flatten['width']
flatten['y2'] = flatten['y1'] + flatten['height']

And create a more organized data frame (just for peace of mind):

# .copy() avoids SettingWithCopyWarning when adding the OCR text column later
final_coordinates = flatten[['id', 'filename', 'x1', 'y1', 'x2', 'y2', 'width', 'height', 'label']].copy()

Crop images
Now it was time to crop each image into smaller images containing only the information I need: the Ground Truth (it could be plain lowercase, but I like the importance the capitals give it).

def crop_image(row):
    """Crop image based on coordinates."""
    file_name = row['filename']
    # MAIN_FILE_PATH points to the directory with the original document scans
    image = cv2.imread(f'{MAIN_FILE_PATH}/{file_name}')

    x1 = round(row['x1'])
    y1 = round(row['y1'])
    x2 = round(row['x2'])
    y2 = round(row['y2'])

    # Slice out the tagged region (rows are y, columns are x)
    cropped_image = image[y1:y2, x1:x2]

    return cropped_image

OCR it!
The previous step is immediately followed by OCR-ing each cropped image and putting the result in its place, linked to the specific image-tag row in my data frame. Notice that ‘lang’ in the code is set to ‘bul+eng’. The reason is that the original documents I work with are in Bulgarian. Depending on your needs, feel free to make changes here.

custom_config = '--oem 3 --psm 6 -c preserve_interword_spaces=1'

def get_image_text(image, config=custom_config):
    """OCR image to get text from it.

    Args:
        image (np.ndarray): input image of a document (BGR array from cv2)
        config (str, optional): Settings for tesseract OCR. Defaults to custom_config.

    Returns:
        str: Extracted text from image
    """
    # Convert to gray scale before OCR
    gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    ocr_output = pytesseract.image_to_string(gray_image, lang='bul+eng', config=config)

    return ocr_output


final_coordinates['text'] = final_coordinates.apply(
    lambda row: get_image_text(crop_image(row)), axis=1)

# Edit output text to remove newline string, double quotes and extra whitespace
# ---- before json-ing the text
final_coordinates['text'] = final_coordinates['text'].apply(lambda row: row.strip())
final_coordinates['text'] = final_coordinates['text'].apply(lambda row: row.replace('\n', " "))
final_coordinates['text'] = final_coordinates['text'].apply(lambda row: row.replace('"', " "))
final_coordinates['text'] = final_coordinates['text'].apply(lambda row: re.sub(' +', ' ', row))

Reasoning
Some might argue that Tesseract returns coordinates and it is a bit of an overkill to get tags from another source and then crop images and extract text out of it. But in my case, I have documents that are pretty diverse in their formatting and the way things are named. For example, the words “buyer” and “recipient” can be used for the same element of the document. Also, Tesseract does not work well with tables and separating sections which also poses a challenge. Finally, you could say I can try to match coordinates from doccano with the tesseract coordinates. Well, I don’t think I could tag precisely the coordinates the way tesseract does it.
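For completeness, this is roughly how word-level boxes could be pulled straight from Tesseract instead (a sketch only, shown for comparison; it reuses MAIN_FILE_PATH and custom_config from above, and I did not go this route for the reasons just mentioned):

# Sketch: word-level boxes straight from Tesseract (not used in my pipeline)
image = cv2.imread(f'{MAIN_FILE_PATH}/FORM-B-Certificate-of-Origin.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
data = pytesseract.image_to_data(gray, lang='bul+eng', config=custom_config,
                                 output_type=Output.DICT)

for i, word in enumerate(data['text']):
    if word.strip():
        # Each recognized word comes with its own box; matching these boxes
        # to my doccano labels is exactly the step I wanted to avoid
        x, y = data['left'][i], data['top'][i]
        w, h = data['width'][i], data['height'][i]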

Now let’s look at the OCR output and compare it to the tagged image:

source_img = Image.open('/kaggle/input/invoices-certificate-of-origin/FORM-B-Certificate-of-Origin.jpg').convert("RGBA")

draw = ImageDraw.Draw(source_img)
for ind, row in final_coordinates.iterrows():
    draw.rectangle(((row['x1'], row['y1']), (row['x2'], row['y2'])), outline='cyan')

source_img
Cyan-colored bounding boxes.
Seems like Tesseract worked pretty well.

Time to output
Finally, the data was converted into jsonl format, ready to be given to a Donut.

def get_info_by_id(df: pd.DataFrame, id: int) -> pd.DataFrame:
    """Return data from table by id.

    Args:
        df (pd.DataFrame): input dataframe
        id (int): object linked id in the dataframe

    Returns:
        pd.DataFrame: DataFrame slice
    """
    df_new = df[df['id'] == id]

    return df_new

def tryloc(df, col_input, col_output, value, default=None) -> str:
    """Return text from one column based on another column value.
    If there is any issue with col_output, return default value.

    Args:
        df (pd.DataFrame): input dataframe
        col_input (str): column whose values to check
        col_output (str): column whose values to take based on col_input condition
        value (str): value to look for in col_input
        default (str, optional): if there is any problem with col_output, return default. Defaults to None.

    Returns:
        str: output from condition
    """
    try:
        return df.loc[df[col_input] == value, col_output].iloc[0]
    except IndexError:
        return default

def add_to_dict(df: pd.DataFrame) -> dict:
    """Format dataframe data into dict.

    Args:
        df (pd.DataFrame): dataframe slice for a single document

    Returns:
        dict: ground truth fields for that document
    """
    dct = dict()

    dct['recipient'] = {
        'doc_type': tryloc(df, 'label', 'text', 'DOCUMENT_NAME'),
        'issue_date': tryloc(df, 'label', 'text', 'DOCUMENT_DATE'),
        'supplier_name': tryloc(df, 'label', 'text', 'SUPPLIER_NAME'),
        'recipient_name': tryloc(df, 'label', 'text', 'RECIPIENT_NAME')
    }

    return dct

ids_list = final_coordinates['id'].unique()

# To save the data ready for output
dataset_donut = []

for id in ids_list:
    df_segment = get_info_by_id(final_coordinates, id)
    dataset_donut.append({
        'id': id,
        'file_name': final_coordinates.loc[final_coordinates['id'] == id, 'filename'].iloc[0],
        'ground_truth': {
            'gt_parse': add_to_dict(df_segment)
        }})


# Extract in jsonl format

with open('/kaggle/working/tesseract-ocred.jsonl', 'w', encoding='utf8') as outfile:
    for entry in dataset_donut:
        json.dump(entry, outfile, ensure_ascii=False)
        outfile.write('\n')

The result for one document looks like this:

{'id': 50975,
'file_name': 'FORM-B-Certificate-of-Origin.jpg',
'ground_truth': {'gt_parse': {'recipient': {'doc_type': 'CERTIFICATE OF ORIGIN',
'issue_date': 'SEP. 26,2013',
'supplier_name': '‘SHENZHEN NICE PIT IMP & EXP CO.,LTD.',
'recipient_name': 'E-MART CO.,LTD'}}}

Donut it!

There are specific requirements for an image dataset that will be used by a Transformer model from Hugging Face. You can see the documentation here. First, the data must be organized in a directory as follows:

folder
------train
-----------metadata.jsonl(txt,json)
-----------doc1.jpg
-----------doc2.jpg
------test
-----------metadata.jsonl(txt,json)
-----------doc3.jpg

Each train/test/valid folder should contain metadata with the ground truth for the images in the respective folder.
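Here is a minimal sketch of how that layout could be assembled from the dataset_donut list built above (the output path, the 80/20 split, and the use of MAIN_FILE_PATH as the source image directory are my own assumptions; adapt them to your setup):

import random
import shutil

# Sketch only: split the entries and copy images + metadata into the layout above
OUTPUT_DIR = '/kaggle/working/folder'  # placeholder output location

random.seed(42)
random.shuffle(dataset_donut)
split_point = int(len(dataset_donut) * 0.8)
splits = {'train': dataset_donut[:split_point], 'test': dataset_donut[split_point:]}

for split_name, entries in splits.items():
    split_dir = f'{OUTPUT_DIR}/{split_name}'
    os.makedirs(split_dir, exist_ok=True)
    with open(f'{split_dir}/metadata.jsonl', 'w', encoding='utf8') as meta:
        for entry in entries:
            # Hugging Face expects a 'file_name' column next to the extra metadata
            meta.write(json.dumps({'file_name': entry['file_name'],
                                   'ground_truth': entry['ground_truth']},
                                  ensure_ascii=False) + '\n')
            shutil.copy(f"{MAIN_FILE_PATH}/{entry['file_name']}", split_dir)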

Assembling part

If you work in Google Colab, it is very important to install several libraries, including ‘sentencepiece’:

!pip install -q git+https://github.com/huggingface/transformers.git datasets sentencepiece seqeval
!pip install pytorch-lightning
import datasets
from datasets import load_dataset, load_from_disk
from transformers import VisionEncoderDecoderConfig
from transformers import DonutProcessor, VisionEncoderDecoderModel, BartConfig

Then, load the dataset using the directory organized in the before-mentioned order:

dataset = load_dataset("imagefolder", data_dir="/content/drive/MyDrive/image-dataset/folder")

# Good if you save it too
dataset.save_to_disk("/content/drive/MyDrive/image-dataset/huggingface-dataset")
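Because it was saved to disk, the dataset can be reloaded in a later session without rebuilding it, which is why load_from_disk was imported above:

# In a later session, reload the saved dataset instead of rebuilding it
dataset = load_from_disk("/content/drive/MyDrive/image-dataset/huggingface-dataset")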

In my case, I have train and test folders and no validation folder.
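The printed DatasetDict should look roughly like this (row counts depend on how many documents you tagged, and any extra metadata columns, such as id, appear as additional features):

DatasetDict({
    train: Dataset({
        features: ['image', 'ground_truth'],
        num_rows: ...
    })
    test: Dataset({
        features: ['image', 'ground_truth'],
        num_rows: ...
    })
})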

Well, this is pretty much it for the first article. I hope it was helpful and interesting. Next time I plan to give you more info about the Donut Transformer and how I fine-tune it for document parsing tasks.

Take care :)
