Extracting structured data from scanned documents

Cecile Prat
Published in Dashdoc
Mar 10, 2022

At Dashdoc, our goal is to digitize the transportation sector. But our customers still have to deal with a lot of paper documents. Extracting information from these documents could help us reduce manual processing.

The aim here was to get the weight of the load of a transport from a ticket picture taken by a trucker.

These weight notes come in many different layouts: several weights are often present on the same ticket, the pictures are not all in the same orientation, and we have to deal with handwriting, blurred pictures, shadows…

The general approach was to proceed step by step, splitting the problem into successive subtasks:

  1. Straighten the document
  2. Extract the text from the image
  3. Find the requested weight parts in the text
  4. Rebuild the weight

Straighten the document

Thanks to a tool we use on the mobile side, most of the pictures are already cropped around the document we want. But the cropped document can still be rotated (90°, 180° or 270°).

We chose to straighten these images automatically as a preprocessing stage, to make the task easier for the following steps.

To do so, we built an image classification model. We used transfer learning, taking a MobileNet model pretrained on ImageNet, on top of which we added some layers to fit our 4-orientation (0°, 90°, 180°, 270°) classification problem. We trained these new layers on 1000 training tickets coming from our database, which we annotated with a home-made annotation tool based on OpenCV. At the end of the training, we kept the model that gave the best result on another 1000 validation tickets. This was done using Keras.

import tensorflow as tf

# Generators reading the annotated tickets (filename + orientation label) from dataframes
idg = tf.keras.preprocessing.image.ImageDataGenerator()
training_data_generator = idg.flow_from_dataframe(
    training_data, directory=output_path,
    x_col="filename", y_col="orientation",
    target_size=(224, 224),
)
validation_data_generator = idg.flow_from_dataframe(
    validation_data, directory=output_path,
    x_col="filename", y_col="orientation",
    target_size=(224, 224),
)

# MobileNet pretrained on ImageNet, used as a frozen feature extractor
inputs = tf.keras.Input(shape=(224, 224, 3))
scale_layer = tf.keras.layers.Rescaling(scale=1 / 127.5, offset=-1)
x = scale_layer(inputs)
base_model = tf.keras.applications.MobileNet(
    weights="imagenet",
    input_shape=(224, 224, 3),
    include_top=False,
)
base_model.trainable = False
x = base_model(x, training=False)

# New trainable head for the 4-orientation classification
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(4, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss="categorical_crossentropy",
    metrics=["categorical_accuracy"],
)

# Keep the weights that perform best on the validation tickets
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    save_dir + "mobilenet.h5",
    monitor="val_categorical_accuracy", verbose=1,
    save_best_only=True, mode="max",
)
callbacks = [checkpoint]
model.fit(
    training_data_generator,
    epochs=20,
    validation_data=validation_data_generator,
    callbacks=callbacks,
)

Thanks to this, we managed to reduce the number of wrong orientations from 13.5% to 2.5% on our 1000 test tickets.

Extract the text

The following step was to extract the text from the image. This is the OCR (Optical Character Recognition) part. For this, we used the docTR package from Mindee. DocTR divides the task into two parts: first text detection (isolating word regions), then text recognition (identifying the characters). We used the default pretrained detection and recognition models provided by docTR.

from doctr.models import ocr_predictor

ocr_model = ocr_predictor(pretrained=True)
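
As an illustration, here is a rough sketch of how this predictor can be applied to a ticket picture to collect the words and their relative bounding boxes (the file path and list names are placeholders; the nested pages/blocks/lines/words structure is the one exported by docTR):

from doctr.io import DocumentFile

# Run the OCR predictor defined above on a ticket picture (placeholder path)
doc = DocumentFile.from_images("ticket.jpg")
result = ocr_model(doc)

# Collect every detected word and its bounding box in relative coordinates
words, boxes = [], []
for page in result.export()["pages"]:
    for block in page["blocks"]:
        for line in block["lines"]:
            for word in line["words"]:
                (x0, y0), (x1, y1) = word["geometry"]  # values between 0 and 1
                words.append(word["value"])
                boxes.append((x0, y0, x1, y1))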

Find the requested weight

The next step was to find the right weight among all the extracted text. We did this part using token classification, which consists of assigning a label to each token in a sentence. Tokens can be words, letters or subwords. Here we try to identify tokens belonging to two classes (a possible mapping to integer labels is sketched right after the list):

  • O, Outside of a named entity
  • W, net Weight entity
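
In the code that follows, we assume these two classes are mapped to integer ids, for instance:

# Assumed mapping between the class names and the integer labels used for training
label2id = {"O": 0, "W": 1}
id2label = {value: key for key, value in label2id.items()}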

For this task we used a LayoutLM model, which is a transformer model that takes into account both the text and the layout of the document. This was done using the Transformers package from Hugging Face.

The LayoutLM model expects tokens and their box coordinates as input. So the first part of the token classification step is to build those tokens and their corresponding boxes. In our case, tokens correspond to subwords.

DocTR already gives us words and their bounding boxes.

First, LayoutLM expects the coordinates to be on a 0–1000 scale. Since docTR gives the boxes as relative coordinates (between 0 and 1), we just have to multiply the OCR result coordinates by 1000.
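
A minimal sketch of this scaling, assuming the OCR output is stored per image in list-valued columns x0, y0, x1, y1 of the dataframe used below (these unscaled column names are our assumption for this sketch):

# Scale the relative docTR coordinates to the 0-1000 range expected by LayoutLM;
# the unscaled column names x0, y0, x1, y1 are assumed for this sketch
for coordinate in ["x0", "y0", "x1", "y1"]:
    train_images_words_data_df[f"{coordinate}_scaled"] = train_images_words_data_df[
        coordinate
    ].apply(lambda values: [int(v * 1000) for v in values])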

Then we have to split these words into tokens that match the ones the model has been pretrained on. This can be done by using the tokenizer corresponding to the model given by Transformers.

import numpy as np
import tensorflow as tf
from transformers import (
    LayoutLMConfig,
    LayoutLMTokenizer,
)

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")

# Tokenize the words of each document (one "sentence" per image), with padding
# and truncation so that all documents share the model input size
train_features = tokenizer(
    list(train_images_words_data_df["sentence"]),
    padding="max_length",
    truncation=True,
    return_tensors="tf",
).data

Extending the bounding boxes to these tokens must be done manually. This extension must take into account the special tokens added by the tokenizer, such as the start and end of document tokens, or the padding tokens (added so that all our documents have the same size and can be processed in batches). We also have to take into account the possible truncation of the document (necessary to fit the model's maximum input size). For training, labels must also be extended: we set the labels of all special tokens to -100 (the index ignored by the loss function) and the labels of all other tokens to the label of the word they come from.

config = LayoutLMConfig.from_pretrained("microsoft/layoutlm-base-uncased")

# Build boxes and labels for tokens
max_tokens = config.max_position_embeddings
images_token_boxes_list = []
images_labels_list = []
for _, row in train_images_words_data_df.iterrows():
    words = row["value"]
    normalized_word_boxes = np.transpose(
        [row["x0_scaled"], row["y0_scaled"], row["x1_scaled"], row["y1_scaled"]]
    ).tolist()
    labels = row["label"]
    token_boxes = []
    token_labels = []
    # Word tokens: repeat each word's box and label for every subword token
    for word, box, label in zip(words, normalized_word_boxes, labels):
        word_tokens = tokenizer.tokenize(word)
        token_boxes.extend([box] * len(word_tokens))
        token_labels.extend([label] * len(word_tokens))
    # Truncation to the model's maximum input size (keeping room for [CLS] and [SEP])
    special_tokens_count = 2
    if len(token_boxes) > max_tokens - special_tokens_count:
        token_boxes = token_boxes[:(max_tokens - special_tokens_count)]
        token_labels = token_labels[:(max_tokens - special_tokens_count)]
    # Add bounding boxes and ignored labels of the special tokens ([CLS] and [SEP])
    token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]
    token_labels = [-100] + token_labels + [-100]
    # Padding
    padding_length = max_tokens - len(token_boxes)
    token_boxes += [[0, 0, 0, 0]] * padding_length
    token_labels += [-100] * padding_length
    # Add image result
    images_token_boxes_list.append(token_boxes)
    images_labels_list.append(token_labels)
train_features["bbox"] = tf.convert_to_tensor(images_token_boxes_list)
train_features["labels"] = tf.convert_to_tensor(images_labels_list)

Regarding the model, we fine-tuned the Transformers LayoutLM pretrained model on 135 tickets. As for the orientation model, these notes were annotated with a home-made annotation tool, also based on OpenCV.

import tensorflow as tf
from transformers import (
    TFLayoutLMForTokenClassification,
)

BATCH_SIZE = 2
EPOCH_NUMBER = 5

# Shuffle and batch the training features built above
train_tf_dataset = tf.data.Dataset.from_tensor_slices(train_features)
train_tf_dataset = train_tf_dataset.shuffle(len(train_tf_dataset)).batch(BATCH_SIZE)

# Fine-tune the pretrained LayoutLM model for our 2-class token classification
model = TFLayoutLMForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased", num_labels=2
)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5))
model.fit(train_tf_dataset, epochs=EPOCH_NUMBER)
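
To use the fine-tuned model on new tickets, we then keep, for each token, the class with the highest score. A minimal inference sketch, assuming eval_features is a hypothetical dict of tensors built exactly like train_features above (without the labels), and using the assumed label mapping from earlier:

# Run the fine-tuned model and take the most likely class for each token;
# eval_features is a hypothetical dict built like train_features (no "labels" key)
outputs = model(eval_features)
predicted_label_ids = tf.argmax(outputs.logits, axis=-1).numpy()

# predicted_label_ids has shape (number of documents, number of tokens);
# with the assumed mapping, "W" marks a token predicted as part of the net weight
predicted_labels = [
    [id2label[label_id] for label_id in document_label_ids]
    for document_label_ids in predicted_label_ids
]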

Rebuild the weight

After the token classification task, we have some tokens identified as weight. It is worth noticing that along the different steps, the weight can be split into several words (by the OCR task) and then into several tokens (by the tokenization for token classification). In an ideal world, we would just have to paste all the tokens identified as weight together to rebuild our weight. But in reality, this part can be messy.

For instance, tokens that are part of the requested weight could have been predicted as not weight (false negatives), and conversely some tokens that are not part of the requested weight can be predicted as weight (false positives). For the first issue, we tried to smooth the predictions using the boxes of the tokens: if one token of a box is predicted as weight, we consider that all the tokens inside that box are part of the weight. All the tokens of a box containing a weight token are therefore merged together to constitute a partial candidate string.

Then, to handle the weight split into several words (this can for instance happen when there is a space after the thousands), we use the order of the boxes: if two consecutive boxes are considered as weight, we merge them.
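
Here is a rough sketch of these two merging heuristics; the data structure (one dict per box, in reading order, with its text and a flag telling whether one of its tokens was predicted as weight) is ours, for illustration only:

# Merge boxes into candidate weight strings: a box containing at least one weight
# token contributes its whole text, and consecutive weight boxes are merged together
def merge_weight_boxes(boxes_in_reading_order):
    candidates = []
    current_parts = []
    for box in boxes_in_reading_order:
        if box["contains_weight_token"]:
            current_parts.append(box["text"])
        else:
            # A non-weight box ends the current candidate
            if current_parts:
                candidates.append(" ".join(current_parts))
                current_parts = []
    if current_parts:
        candidates.append(" ".join(current_parts))
    return candidates

With this kind of heuristic, two consecutive weight boxes containing "28" and "460 KG" would give the single candidate string "28 460 KG".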

After this, we have several potential weight strings that can contain any character. For each of them, we use a simple regular expression to extract the digits and add some basic business rules to compute a number, check whether it can constitute a plausible weight and also determine the unit. We finally select one of the plausible weights.
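
The exact rules are tied to our business, but a minimal sketch of this parsing step could look like the following; the unit handling and the plausibility bounds are illustrative placeholders, not Dashdoc's actual rules:

import re

# Parse a candidate string such as "28 460 KG" into a weight in kilograms;
# the unit rule and the plausibility bounds below are illustrative placeholders
def parse_candidate_weight(candidate):
    digits = re.sub(r"\D", "", candidate)
    if not digits:
        return None
    value = int(digits)
    # Hypothetical unit handling: a standalone "T" means tonnes, otherwise kilograms
    if re.search(r"\bt\b", candidate.lower()):
        value *= 1000
    # Hypothetical plausibility check for a truck load, in kilograms
    if not 100 <= value <= 60000:
        return None
    return value

For instance, parse_candidate_weight("28 460 KG") returns 28460, while a candidate coming from a date or a phone number usually falls outside the plausible range and is discarded.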

Results

Thanks to this approach, we managed to achieve an accuracy of 74% on a test set of 1000 weight notes, excluding the 3% of tickets where the weight is not visible at all (the weight is not on the picture, or the picture is extremely blurry).

This result can already be useful for some Dashdoc features, such as automatically validating the weight entered by the trucker before invoicing, so that our users no longer need to open the document and check it manually before making the invoice.

Here we built the frame of our solution, knowing that each step can be improved. We could try other models for orientation classification. We used docTR's default models, but we could train them on our own data to see if detection or recognition can be improved. For the weight identification part, we could train the model on more data or see whether some business rules could help. And the final heuristic, which is really basic for now, could also be improved.

Extracting information from documents has become a common problem and the applications are huge. Here we chose a specific use case to begin with, but the approach could be extended to many other subjects at Dashdoc. Technically, it covers several very interesting domains, and dividing the problem into successive parts highlights this and makes it easier to iterate. The results are promising. Stay tuned for the next step: integrating the weight extraction feature into the product!
