Part 1: ID Documents Detection with YOLOv8 and Orientation Correction

Paul Lefevre
10 min read · Sep 22, 2023


This two-part series was written as part of an internship in the DataLab of the French Digital Agency for Law Enforcement (ANFSI). The ANFSI leverages OCR techniques to speed up the processing of ID documents, bank details, etc.


This tutorial is the first part of a two-part series, where we learn how to extract information from ID documents such as name, date of birth, etc.

In Part 1, we look at how to detect an ID document in an image using YOLOv8, crop the image around it and correct the orientation.

In Part 2, we train a Donut model to extract information from the ID documents obtained in Part 1.

What you will learn

In this tutorial, we are going to train a deep learning model (YOLOv8) for the task of detecting ID documents in an image (a photo or a scan). Once they have been detected, we are also going to correct the orientation of the documents, if needed.

The code in this tutorial was written and tested using the following module versions:

  • Python 3.8
  • torch==2.0.1
  • ultralytics==8.0.119
  • Pillow==9.5
  • opencv-python==4.8.0.74

Prerequisites

For this first part, you will need a dataset of ID documents (at least 100–150 samples) in an image format (e.g., JPG). If you need to convert PDF scans into images, you can use the pdf2image module.
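For example, a minimal conversion sketch with pdf2image could look like this (the input and output paths are placeholders):

from pdf2image import convert_from_path

# Convert each page of a PDF scan into a JPG image
# (pdf2image requires the poppler utilities to be installed on your system)
pages = convert_from_path("scans/document.pdf", dpi=300)
for i, page in enumerate(pages):
    page.save(f"images/document_page_{i}.jpg", "JPEG")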

You will also need access to a system equipped with a GPU to train the model. If you don’t have one on your machine, I recommend using Google Colab. If you choose to use Google Colab, make sure to enable GPUs for the notebook by navigating to Edit > Notebook Settings and selecting GPU from the Hardware Accelerator drop-down.
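If you want to double-check that PyTorch actually sees a GPU before launching a training run, a quick check like the following should do:

import torch

# Should print True (and the name of the GPU) if CUDA is available
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))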

1. Labeling your data (e.g. with Label Studio)

Unless you are very lucky, the data in your hands likely did not come with detection labels, i.e. bounding box coordinates for the ID document in each image. So the first step of our work is to label the images in order to obtain training data. Luckily, you won’t have to annotate a ton of images; in our tests, we found that with 120–150 images we achieved good results.

There are many labeling tools on the Internet. I personally used Label Studio, an open-source data labeling platform. Here is all the information you need to get started with Label Studio. A particularly useful feature of Label Studio is the ability to export your labels directly in YOLO format.

Once you’ve launched Label Studio, create a new project. Name it whatever you want, and in the “Labeling Setup” tab, choose “Computer Vision > Object Detection with Bounding Boxes”. You can remove the default labels “Airplane” and “Car”, and add one label for each type of document in your data. In our case, we had two types of documents, ID cards and passports, so two labels. When you are done, click Save.

Creating a new project in Label Studio

Next, import your data using the “Import” button. Then, start labeling your images by clicking “Label all tasks”. Labeling is simple: you select a label by clicking on it or pressing the associated shortcut (usually a number key) and draw a box around your document. Then, click the “Submit” button to go to the next image.

As we said before, you will need at least 120–150 labeled images. Also, try to maintain a balance between the classes (e.g., a similar number of passports and identity cards) to avoid the class imbalance problem.

Once you have labeled a sufficient number of images, you can export your images and labels by clicking the “Export” button. Make sure to choose the “YOLO” format! You can then extract the archive into your project folder.
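As a quick sanity check on class balance, you can count the boxes per class in the exported label files. Here is a small sketch, assuming the export contains a labels/ folder with the YOLO .txt files (adjust the path to your setup):

import os
from collections import Counter

def count_classes(labels_folder):
    """Count the number of boxes per class id across all YOLO label files."""
    counts = Counter()
    for label_file in os.listdir(labels_folder):
        if not label_file.endswith(".txt"):
            continue
        with open(os.path.join(labels_folder, label_file)) as f:
            for line in f:
                if line.strip():
                    counts[int(line.split()[0])] += 1
    return counts

print(count_classes("labels"))  # e.g. Counter({0: 80, 1: 70})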

2. Preparing your dataset

The next step is to arrange your dataset in the format supported by YOLO. First, make sure your labels are in the YOLO format:

Labels for this format should be exported to YOLO format with one *.txt file per image. If there are no objects in an image, no *.txt file is required. The *.txt file should be formatted with one row per object in class x_center y_center width height format. Box coordinates must be in normalized xywh format (from 0 to 1). If your boxes are in pixels, you should divide x_center and width by image width, and y_center and height by image height. Class numbers should be zero-indexed (start with 0). (Source)

If you followed the previous step with Label Studio, there is nothing to do here!
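To make the format concrete, here is a small, made-up example of turning one pixel-space box into a YOLO label line (the image size and box values are hypothetical):

# A 1000x800 image containing one ID card (class 0) whose box, in pixels,
# is centered at (400, 300) with width 500 and height 320.
img_w, img_h = 1000, 800
x_center, y_center, width, height = 400, 300, 500, 320

# The corresponding line in the .txt label file:
# class x_center y_center width height (all normalized to [0, 1])
print(0, x_center / img_w, y_center / img_h, width / img_w, height / img_h)
# -> 0 0.4 0.375 0.5 0.4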

Next, you need to split your dataset into a train set and a validation set and arrange your dataset folder so that it has the following structure:

data/
├── train/
│   ├── images/
│   │   ├── sample_1.jpg
│   │   ├── sample_2.jpg
│   │   └── ...
│   └── labels/
│       ├── sample_1.txt
│       ├── sample_2.txt
│       └── ...
└── val/
    ├── images/
    │   └── sample_3.jpg
    └── labels/
        └── sample_3.txt

About 80% of your data should go in the train set and 20% in the val set. You can draw inspiration from the following script to split your dataset:

import os
import random
import shutil

def split_dataset(images_folder, labels_folder, data_path):
    """
    Split the data into a `train` set and a `val` set,
    written under `data_path`.
    """

    # Create the train and val folders
    train_images_folder = os.path.join(data_path, "train/images")
    train_labels_folder = os.path.join(data_path, "train/labels")
    val_images_folder = os.path.join(data_path, "val/images")
    val_labels_folder = os.path.join(data_path, "val/labels")
    os.makedirs(train_images_folder, exist_ok=True)
    os.makedirs(train_labels_folder, exist_ok=True)
    os.makedirs(val_images_folder, exist_ok=True)
    os.makedirs(val_labels_folder, exist_ok=True)

    # Get the list of image files and shuffle it
    image_files = os.listdir(images_folder)
    random.shuffle(image_files)

    # Split the list: 80% train, 20% val
    num_images = len(image_files)
    num_train = int(num_images * 0.8)
    train_image_files = image_files[:num_train]
    val_image_files = image_files[num_train:]

    # Move the files to their corresponding folders
    for image_file in train_image_files:
        shutil.move(os.path.join(images_folder, image_file), os.path.join(train_images_folder, image_file))
        label_file = image_file.replace(".jpg", ".txt")
        shutil.move(os.path.join(labels_folder, label_file), os.path.join(train_labels_folder, label_file))

    for image_file in val_image_files:
        shutil.move(os.path.join(images_folder, image_file), os.path.join(val_images_folder, image_file))
        label_file = image_file.replace(".jpg", ".txt")
        shutil.move(os.path.join(labels_folder, label_file), os.path.join(val_labels_folder, label_file))

    print("Dataset split completed.")

Finally, you need to create a data.yaml file based on the following format:

train: data/train/  # path to the train folder
val: data/val/  # path to the val folder

names:
  0: ID card
  1: Passport

You can put this file at the root of your project. Make sure the class number matches the corresponding class name. If you created your labels using Label Studio, the labels are listed in alphabetical order (if you are unsure of the order, check it in the classes.txt file).

3. Training a YOLOv8 model

Finally, we arrive at the step where we train our model. For the detection task, we chose to use Ultralytics’ YOLOv8 neural network, available in the ultralytics Python module. Make sure you have installed it by following the instructions here (to install a module in Google Colab, run the following command in a cell: !pip install module_name).

YOLOv8 comes in different sizes and versions, depending on the performance required and the task at hand (detection, segmentation, classification, pose). Our task is detection, and the nano model (YOLOv8n) is sufficient in our case. Download the corresponding model from the link above and put it in the root directory of your project.

Then, run the following code snippet to load and train the model. Make sure to adjust the number of epochs if needed.

from ultralytics import YOLO

# Load a COCO-pretrained YOLOv8n model
model = YOLO('path/to/yolov8n.pt')

# Train the model on your dataset for 100 epochs
results = model.train(data='path/to/data.yaml', epochs=100, fliplr=0)

The fliplr=0 argument ensures that the model doesn’t flip images horizontally when applying data augmentation: horizontally flipped identity documents don’t exist in the real world, so it doesn’t make sense to train the model on such images.

The results of training are saved in a runs/detect/train folder (or runs/detect/trainXX where XX is the number of your training). There, you can find metrics about your training (results.png, PR_curve.png).
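If you want to re-compute the validation metrics on the best checkpoint afterwards, you can use the val() method; here is a minimal sketch (the exact attributes of the returned metrics object may vary between ultralytics versions):

from ultralytics import YOLO

# Evaluate the best checkpoint on the validation set defined in data.yaml
model = YOLO("runs/detect/trainXX/weights/best.pt")  # Adjust trainXX
metrics = model.val(data="path/to/data.yaml")
print(metrics.box.map50)  # mean average precision at IoU threshold 0.5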

4. Visualizing the results

The fun part is, of course, running inference on your own images. To do so, load the best model produced by training, then call it on an image:

# Load the best model
model = YOLO("runs/detect/trainXX/weights/best.pt") # Adjust trainXX
results = model("path/to/image.jpg")

Here is some code to help you visualize the result of a prediction by drawing boxes with labels and confidence scores on an image. You will need to adjust the list of label names, the list of colors, and the path to a font file (I downloaded Space Mono from Google Fonts).

from PIL import Image, ImageDraw, ImageFont

def draw_boxes_on_image(image_path):
    """
    Parameters:
        image_path: path to the image
    Returns:
        image: image with boxes drawn on it
    """
    image = Image.open(image_path)

    # Run the inference and retrieve the boxes
    results = model(image)
    predictions = results[0].boxes.data.tolist()

    label_names = ["ID CARD", "PASSPORT"]  # Change if needed
    colors = ["orange", "blue"]  # Change if needed

    draw = ImageDraw.Draw(image)
    font_path = "SpaceMono-Regular.ttf"  # Change if needed
    font = ImageFont.truetype(font=font_path, size=24)

    for prediction in predictions:
        x1, y1, x2, y2, confidence, label = prediction
        label = int(label)
        # Draw the box
        draw.rectangle([(x1, y1), (x2, y2)], outline=colors[label], width=5)

        # Draw the text with the label name and confidence
        text = f"{label_names[label]} ({confidence:.3f})"
        text_width, text_height = font.getsize(text)
        text_x = x1 + 5
        text_y = y1 + 5
        draw.rectangle([(text_x, text_y), (text_x + text_width, text_y + text_height)], fill=colors[label])
        draw.text((text_x, text_y), text, font=font, fill=(255, 255, 255))

    return image

image_with_boxes = draw_boxes_on_image("path/to/image.jpg")
image_with_boxes.show()

Running a similar code on a sample image, I got the following result:

Detection of a (fake) passport

As a side note, the box coordinates returned by inference (for each box, the top-left and bottom-right corners) are expressed in pixels, unlike the labels, which use normalized image coordinates.
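If you prefer to work with the same normalized convention as the labels, the results object also exposes the boxes in other formats; a small sketch:

results = model("path/to/image.jpg")
boxes = results[0].boxes

print(boxes.xyxy)   # top-left / bottom-right corners, in pixels
print(boxes.xywhn)  # center x, center y, width, height, normalized to [0, 1]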

You can also retrieve the detected documents as a list of cropped np.ndarray images (paired with their predicted class) with some Python code like the following:

import numpy as np

def retrieve_documents_from_image(image_path):
    results = model(image_path)
    predictions = results[0].boxes.data.tolist()

    im = Image.open(image_path)
    im = np.array(im)

    docs = []

    for prediction in predictions:
        pred_class = int(prediction[-1])
        x1, y1, x2, y2 = prediction[:4]
        docs.append((im[int(y1):int(y2), int(x1):int(x2)], pred_class))

    return docs

5. Adjusting the orientation of the image

Lastly, we want to make sure that the cropped documents that we obtained thanks to our YOLOv8 model are correctly oriented. We will make the reasonable assumption that the documents are oriented 0°, 90°, 180° or 270° to the horizontal.

A first solution to this problem would be to train another YOLOv8 model to classify the documents into one of four classes corresponding to the rotation angles. That would require us to label another set of documents with the corresponding class. Though I’m pretty sure that such a method would yield great results, this isn’t what we’re going to do here. But don’t hesitate to give it a try if you feel like it!

What we’re going to do is something simpler that still gives good results, based on Otsu’s thresholding method. Let’s get right into it: the following image is a French ID card to which we have applied Otsu’s thresholding:

(Fake) ID Card with Otsu’s thresholding

Do you notice something interesting? The left part of the image, where the photo is, contains a lot more white pixels than the right part of the image, where the text is! With that in mind, the procedure to correct the orientation of an image is very simple!

Let’s assume that we have a cropped image obtained with our previous YOLO model. We first compare the width and the height of the image; if the height is greater than the width, that means that the image is oriented vertically, so we rotate it by 90° to have a horizontally oriented image. Then, we apply Otsu’s thresholding to the image and compare the number of white pixels on the left and right halves of the “thresholded image”. If the right half has more white pixels than the left half, it means that the image needs to be rotated by 180°.

Here is some Python code that performs this procedure using OpenCV. The input is a np.ndarray image.

import cv2
import numpy as np

def compare_white_pixels(image):
    """
    Returns True if the left half of `image`
    has more white pixels than the right half.

    Parameters:
        image : np.ndarray
    """

    width = image.shape[1]
    left_region = image[:, :int(width / 2)]
    right_region = image[:, int(width / 2):]

    left_white_pixels = np.sum(left_region == 255)
    right_white_pixels = np.sum(right_region == 255)

    return left_white_pixels > right_white_pixels

def rotate_if_necessary(image):
    """
    Resets an image that has been rotated by 90°/180°/270°.

    Parameters:
        image : np.ndarray
    """

    # Reset image to horizontal position if necessary
    if image.shape[0] > image.shape[1]:
        image = cv2.rotate(image, cv2.ROTATE_90_CLOCKWISE)

    # Convert image to grayscale
    image_binary = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)

    # Apply some Gaussian blur and then Otsu's thresholding
    image_binary = cv2.GaussianBlur(image_binary, (5, 5), 0)
    _, image_binary = cv2.threshold(image_binary, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Rotate image by 180° if necessary
    if not compare_white_pixels(image_binary):
        image = cv2.rotate(image, cv2.ROTATE_180)

    return image

Here is an example document:

A fake French ID Card, rotated 90°

We run this image through the following pipeline:

image_path = "path/to/image.jpg"
docs = retrieve_documents_from_image(image_path)
document, pred_class = docs[0]  # each entry is a (cropped image, class) tuple
document = rotate_if_necessary(document)

import matplotlib.pyplot as plt
plt.axis("off")
plt.imshow(document) # Display the document
plt.imsave("id_doc.jpg", document) # Save the document

And here is the resulting image:

What’s next?

Congratulations! You now have a model that detects ID documents on an image, and you are able to correct the orientation of the document if necessary!

You can head over to Part 2, where we will see how to extract information from those documents.
