Information Extraction on Distorted Documents with YOLOv5, RectiNet and Tesseract

Nicolas Jomeau
ELCA IT
May 24, 2022

You have surely already used your smartphone to scan documents. Whether with a dedicated camera mode or an app such as Acrobat Reader, these techniques share the same constraints: the document must be flat and well exposed. This adds friction to the scanning process by forcing users to correct the shape and exposure of their paper sheet, and thus would benefit from automation. Performing these corrections automatically would remove that friction and enable a faster feedback loop with the user: e.g. extracting relevant information from a salary statement and bank reports to estimate the maximum loan a client could get.

In this article, we will build a fully end-to-end pipeline of deep learning models to extract a document from a picture, flatten it, and finally extract information from it. To flatten the document, we need to infer its shape; to extract information, we need to detect document objects (such as paragraphs, figures, etc.) within it.

The Dataset Problem

In order to work, deep learning models must ingest a lot of training data to learn the input domain they will be working with. As we are working on two different tasks (object detection for document localization and information extraction, shape detection for flattening the document), we need a dataset with appropriate training inputs and target features.

Unfortunately, such a dataset simply doesn’t exist and would be very costly to create: while annotating document objects is straightforward (e.g. via Mechanical Turk), creating a dataset with the shapes of distorted documents is impossible without appropriate hardware (e.g. Lidar). Fortunately, the state of the art in document shape correction now relies on the doc3D synthetic dataset and its renderer. By combining a scanned document, a (distorted) paper mesh, and a background, this renderer generates realistic-looking samples along with a mapping of the distorted document’s coordinates in the picture (also called a UV mapping).

Using the renderer with the object-annotated FUNSD documents, doc3D meshes, and Laval Indoor HDR backgrounds, we generated 2000 pictures and UV mappings (1500 for training, 500 for testing).

The doc3D renderer produces a realistic-looking sample and a UV mapping with 3 channels: the document’s X/Y coordinates and whether the document is present or not.

The Pipeline

Now that we have some data to train on, we can start building the actual pipeline. It can be described in a series of 4 steps:

  • Locate and crop the document in the picture
  • Detect and correct its shape
  • Apply some post-processing to get a “scan-like” document
  • Detect interesting objects in it (forms, text, tables, …)

Document Cropping

The first step is to locate and extract a document page from the picture in order to give the most valuable information possible to the shape correction algorithm. For this, we use one of the most popular Object Detection models: YOLOv5. This model was chosen as it is extremely lightweight (2M parameters with its nano variant) and thus can perform real-time detection even on limited hardware. A demo is even available for iPhone and Android!

Our dataset doesn’t yet contain the document-level bounding boxes needed to train the YOLOv5n model. We fix this by using the UV mappings: if a pixel is colored, the document is present at that pixel. By finding the minimum and maximum X/Y coordinates of these pixels, we can create the rectangular bounding box required for training, as sketched below. For the next step, we also add classes depending on whether the document is already flat (but warped due to perspective) or bent, simply by looking at the renderer parameters used.
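As a rough illustration, the box extraction can be sketched with NumPy as follows. The channel layout (presence mask in the last channel) and the YOLO-format label writer are our own assumptions for the example, not part of doc3D:

```python
import numpy as np

def bbox_from_uv(uv_map):
    """Derive a document bounding box from a doc3D-style UV mapping.

    Assumes uv_map has shape (H, W, 3) and that the last channel is a
    binary mask marking pixels covered by the document.
    Returns (x_min, y_min, x_max, y_max) in pixel coordinates.
    """
    ys, xs = np.nonzero(uv_map[..., 2] > 0)  # pixels where the document is present
    return xs.min(), ys.min(), xs.max(), ys.max()

def to_yolo_label(box, img_w, img_h, cls=0):
    """Convert the box to YOLO's normalized (class, xc, yc, w, h) label line."""
    x_min, y_min, x_max, y_max = box
    xc = (x_min + x_max) / 2 / img_w
    yc = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{cls} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"
```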

We then train a YOLOv5n model on the train/test split with an input size of 320px, the default YOLOv5 config (300 epochs), and data augmentation with flips, scaling, mosaic tiling, and some hue/saturation/value randomization. As this task is fairly simple compared to the usual COCO object detection, we obtain a highly performant detector with precise and accurate inference: mAP@0.5:0.95 caps at 0.999 and the model makes only 2 misclassifications on the test set.

Detecting and classifying documents with YOLOv5 works well even on real-life data!
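For reference, running such a fine-tuned detector takes only a few lines through YOLOv5’s torch.hub interface; the weights file name and thresholds below are illustrative, not our exact configuration:

```python
import torch

# Load the fine-tuned document detector (hypothetical weights file).
model = torch.hub.load("ultralytics/yolov5", "custom", path="yolov5n_documents.pt")
model.conf = 0.5  # confidence threshold

# Same 320px input size as used during training.
results = model("photo_of_document.jpg", size=320)
boxes = results.xyxy[0]  # tensor of [x1, y1, x2, y2, confidence, class]
print(results.pandas().xyxy[0][["name", "confidence"]])
```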

Shape Correction

With the document now extracted, we can start correcting its shape. For this step, we use the RectiNet CNN model. It is based on the famous U-Net medical segmentation model and works in two parts: a first U-Net takes the input image (resized to 256px) and creates a 256x256x50 feature map, and a second U-Net uses these features to infer the actual shape of the document. Instead of a UV mapping like the one we generated, this model outputs a backward (BW) mapping. The BW mapping is simply the inverse of the UV mapping: each pixel of the flattened output stores the position to sample from in the original picture, and it can be fed to PyTorch’s grid_sample to flatten the document.

We can retrieve a perfectly flat document using PyTorch’s grid_sample and a BW mapping (the colored square).
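Here is a minimal sketch of that flattening step, assuming the predicted BW map is already expressed in grid_sample’s normalized [-1, 1] coordinate convention:

```python
import torch
import torch.nn.functional as F

def unwarp(image: torch.Tensor, bw_map: torch.Tensor) -> torch.Tensor:
    """Flatten a distorted document with a backward (BW) mapping.

    image:  (1, 3, H, W) tensor of the cropped document photo.
    bw_map: (1, H, W, 2) tensor; each output pixel stores the (x, y)
            location to sample from in the input, normalized to [-1, 1]
            as expected by grid_sample.
    """
    return F.grid_sample(image, bw_map, mode="bilinear", align_corners=True)
```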

We generate these BW target features from our dataset and fine-tune the already trained model for 100 epochs. As the model is pretty beefy at 80M parameters, we tried reducing the number of convolutional channels or the input size to speed up inference. These experiments didn’t give satisfying results, but since the variants end training with roughly the same MSE, we believe the limitation is a lack of training data rather than the architecture itself.

One observation from using RectiNet was that it would introduce “waves” when correcting an already flat document. To correct this behavior, we make use of the classes predicted by the YOLO model in the first step: if the document is flat, we use a simple CNN to detect the document’s corners and then “pull” them back to a rectangular shape instead of applying RectiNet. This makes the shape correction process 20x faster than RectiNet while producing better-quality results.
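The corner-detection CNN itself is not shown here, but the “pull to a rectangle” step can be sketched as a plain OpenCV perspective warp, assuming the four corners come out ordered top-left, top-right, bottom-right, bottom-left:

```python
import cv2
import numpy as np

def rectify_flat_document(image: np.ndarray, corners: np.ndarray) -> np.ndarray:
    """Warp a flat but perspective-distorted document back to a rectangle.

    corners: (4, 2) array of detected corners, ordered TL, TR, BR, BL
             (e.g. as predicted by the small corner-detection CNN).
    """
    tl, tr, br, bl = corners
    out_w = int(max(np.linalg.norm(tr - tl), np.linalg.norm(br - bl)))
    out_h = int(max(np.linalg.norm(bl - tl), np.linalg.norm(br - tr)))
    target = np.array([[0, 0], [out_w - 1, 0],
                       [out_w - 1, out_h - 1], [0, out_h - 1]], dtype=np.float32)
    M = cv2.getPerspectiveTransform(corners.astype(np.float32), target)
    return cv2.warpPerspective(image, M, (out_w, out_h))
```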

Post-processing

While we have (hopefully) correctly flattened our document, the result is still nowhere near what you would get from a scanner. Due to lighting conditions and folds, a white document can have its color altered and some of its content shadowed.

There exist some well-known algorithms to correct such artifacts. For color, we can use white balancing to bring the altered white back to a perfect white. The algorithm works by rescaling each color channel so that its 95th percentile reaches the maximum intensity value. Looking at the distribution of each channel’s values, this effectively pushes them toward maximum intensity and thus yields pure white, since all channels end up with similar values. Note that this balancing works well in our case because we only apply it to mostly white documents. For more diverse documents, a more sophisticated method would be needed to get good results, such as manually selecting a pixel to act as a reference white or using a DNN to correct the color mapping.
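A minimal NumPy sketch of this percentile-based balancing (the 95th-percentile threshold matches the description above; everything else is an illustrative choice):

```python
import numpy as np

def white_balance(image: np.ndarray, percentile: float = 95) -> np.ndarray:
    """Percentile-based white balancing for mostly white documents.

    Rescales each channel so that its 95th-percentile value maps to 255,
    pushing the (slightly tinted) paper background back to pure white.
    """
    balanced = image.astype(np.float32)
    for c in range(balanced.shape[2]):
        ref = np.percentile(balanced[..., c], percentile)
        balanced[..., c] = np.clip(balanced[..., c] * 255.0 / max(ref, 1.0), 0, 255)
    return balanced.astype(np.uint8)
```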

For shadows, we use median filtering to create a shadow map of the document, which we then subtract from the document to remove the shadows. As this operation darkens the document, we rescale each pixel channel back to the full range between the minimum and maximum intensity values. Like the white balancing, this method is highly destructive for large regions of color and should therefore only be used on B&W documents.
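This is one common way of implementing that median-filter approach with OpenCV; the kernel size is an illustrative guess, and the exact subtraction and rescaling details may differ from our actual implementation:

```python
import cv2

def remove_shadows(image, kernel_size=21):
    """Median-filter-based shadow removal for B&W documents (8-bit input).

    A large median blur erases the text and keeps the slowly varying
    illumination (the shadow map); comparing it with the original leaves
    mostly uniform paper, which is then stretched back to full range.
    """
    channels = []
    for c in cv2.split(image):
        shadow_map = cv2.medianBlur(c, kernel_size)   # illumination / shadow estimate
        diff = cv2.absdiff(c, shadow_map)             # text stands out, shadows vanish
        flat = 255 - diff                             # back to dark text on white paper
        flat = cv2.normalize(flat, None, 0, 255, cv2.NORM_MINMAX)  # rescale intensities
        channels.append(flat)
    return cv2.merge(channels)
```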

Our input image (left) after white balancing (middle) and shadow removal (right).

Information Extraction

Last but not least, we want to extract meaningful information from the corrected document for automated processing. For this, we use YOLOv5 again, but this time with a beefier config: we take the S variant (8M parameters) with an input size of 800px, as we will be looking for small details. We train the model directly on the FUNSD dataset (150/49 train/test samples) to detect text, signatures, and figures.

For such a simple task, we get rather unconvincing results: precision is at 0.84, recall at 0.78 (we are missing 22% of the interesting objects!), and mAP is fairly low at 0.84 and 0.488 for mAP@0.5 and mAP@0.5:0.95 respectively. In such cases, it is often a good idea to look at how the model behaves and do a qualitative (instead of quantitative, metric-based) visual analysis of the results. Doing so, we found that the model is actually working fine and that the low metrics are due to inconsistently annotated documents in the FUNSD dataset: paragraphs are sometimes split into multiple line objects (but detected as a whole paragraph), some objects such as checkboxes are given annotations seemingly at random, and some annotation boxes are straight-up incorrect. Interestingly, and despite the “garbage in, garbage out” saying in machine learning, the YOLO model manages not to overfit on these wrong annotations.

Ground-truth annotations (blue) vs YOLOv5 detected objects (red) on a FUNSD sample. The ground truth is inconsistent, with paragraphs split into lines, missing text annotations, and checkboxes sometimes having their own annotation.

Finally, if an object is detected as text, we run it through Tesseract’s LSTM OCR engine to extract its content. With a character error rate of 0.27, we are in line with the results of other state-of-the-art shape correction models. We nonetheless think this result could be improved if the FUNSD samples were available in higher resolution (currently only 1000x700px).
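As an illustration, this is roughly what the OCR step looks like using the pytesseract wrapper; the wrapper and the exact flags are our assumptions for the example (--oem 1 selects the LSTM engine, --psm 6 assumes a uniform block of text):

```python
import pytesseract
from PIL import Image

def ocr_text_region(document: Image.Image, box: tuple) -> str:
    """Run Tesseract's LSTM engine on a text region detected by YOLOv5.

    box is (x_min, y_min, x_max, y_max) in pixel coordinates of the
    flattened document image.
    """
    crop = document.crop(box)
    return pytesseract.image_to_string(crop, config="--oem 1 --psm 6")
```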

Conclusion

By designing and building a full end-to-end pipeline, we are now able to efficiently scan documents even in the worst picture and paper conditions. With information extraction automated, additional algorithms or models can be plugged on top of the pipeline to further extend its capabilities: e.g. finding someone’s name, address, and salary on a salary statement for an insurance company without requiring the user to own a dedicated scanner.

Real-life (and unseen) input samples and corrected outputs of the pipeline. Results are encouraging, but we can see the limitations of the shadow removal algorithm in the colored regions of the right sample.

Overall, the pipeline could still be improved: RectiNet could be trained with more samples (synthetic data generation takes time) to make smaller, more efficient variants viable, and some steps like shadow removal could be handled with deep learning too (median filtering also takes time). The next step would be to put such a model into production following MLOps principles, to benefit from retraining, automated deployment, and monitoring.
