The Path to Identity Validation (1/3)

Using Deep Learning to Detect and Classify IDs

Pablo Reynel
Signaturit Tech Blog
8 min read · Aug 6, 2019


Intro

If you are familiar with Signaturit's Electronic Signature Service, you might know that we offer a suite of authentication systems that allow document senders to confirm the identity of the receivers and/or signers. These systems range from voice verification to methods based on information that only the user holds (e.g. a file, an SMS code, or an ID document).

Being able to automatically validate the information contained in an ID document is key to ensuring that the signer of a document and the provider of the ID are the same person, and the rightful signer.

One of the main use cases of OCR technologies in our context involves reading the Machine Readable Zone (MRZ) code contained in ID documents and passports, as this MRZ is unique for each person.

Reading the rest of the text fields in a document, such as name, surname or birthdate, is also necessary in order to cross-validate the MRZ code against them. Moreover, the data contained in this code has to match the personal information that the signer enters in the document to be signed.

For this purpose, a photo of both sides of the signer's ID must be provided together with the signed document, either by taking them with a device camera, or by attaching and uploading them to the app. The information contained in the ID is checked against possible alterations or modifications. This verification occurs automatically within the signing process itself.

Machine Readable Zone (MRZ) highlighted in a Spain issued ID sample

If someone modifies some data on the ID document, the MRZ code will no longer match the scanned information and the document will not pass the validation process.

Given this scenario, an OCR solution becomes a solid authentication strategy within our current signature validation ecosystem.

Divide and Conquer

Just put on your Machine Learning practitioner glasses and suppose a document signer uploads an image with an ID document in it. Which pieces, or stages, do we need to implement in order to complete the ID validation process puzzle? What you’ll see next will (probably not) surprise you:

  1. Detect if there is an ID in the image and its location.
  2. Classify the ID type and side of the document.
  3. Detect the location of text fields in the ID.
  4. Extract the content of the text fields.

The scope of this article is limited to the 1st and 2nd stages. If you want to dive into the Text Detection stage, here is the link to our Second Article. Soon, you will also be able to learn about our Text Recognition solution.

Introducing SignaYOLO: Detection and classification all-in-one

Nowadays, if you put Object Detection and Machine Learning in the same sentence, you are most likely talking about a certain flavour of deep neural network architecture. Even though there is a vast variety of architectures for image recognition out there, each one with its own strengths and weaknesses, most of them have something in common: they are equipped with convolutional layers.

Do not panic if this is the first time you have heard about convolutional layers and CNNs and you're wondering what the heck they are: here is an article explaining them in plain English.

For our purpose of detecting and classifying an ID contained in an image, we have created SignaYOLO, an adaptation of version 3 of the YOLO (You Only Look Once) architecture backed by darknet, an open-source neural network framework.

Why YOLOv3 [1]? Because it is accurate and fast at detecting and classifying objects in images and videos. Why darknet? Because it is written in C and CUDA, which makes it way faster than other versions written in Darkflow or Keras.

We won’t get too technical on how and why SignaYOLO works, as it is not the goal of this post. Just getting the idea and concepts behind it is more than enough. If you want to dive into the details, here and here you can find more technical content.

The Basics

SignaYOLO is a 106-layer fully convolutional neural network architecture designed to detect and classify objects.

In order to train the network, it needs to be fed with samples, each composed of an image and its annotations. These samples make up our dataset, which will be split into a training set and a test set.

The annotations of an image are nothing more than a number specifying the class of the ID (a different class per type and side of the document), followed by another 4 numbers providing the location and dimensions of the object in normalized units: the x and y coordinates of the center of the ID, and its width and height. These 4 numbers define the bounding box holding the ID.

Just to give you an idea, below are the annotations of an ID of class 1 (ID type 1, front side), centered in the middle of the image (x and y), and taking up 85% of it (width and height):
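In darknet's plain-text annotation format (one line per object in a .txt file: class index, then the normalized center coordinates, width and height), that sample would look something like this:

    1 0.5 0.5 0.85 0.85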

These samples are fed to the network in batches. The batch size is one of the many hyperparameters we can tune in order to help the network learn and generalize better.
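In darknet, these hyperparameters live in the model's .cfg file. As a rough, illustrative fragment of its [net] section (the values below are examples, not our production settings):

    [net]
    # number of samples processed per training iteration
    batch=64
    # the batch is split into smaller chunks so it fits into GPU memory
    subdivisions=16
    # input resolution: a square whose sides are multiples of 32
    width=608
    height=608
    learning_rate=0.001
    # total number of training iterations
    max_batches=8000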

One of the best-known rules of thumb in Machine Learning is: the larger the dataset, the better. So the more samples provided to the network in the training stage, the better it will learn to detect and classify them. Ideally, our training set will contain a minimum of 2000 samples per class and 4000 in total [2].

Do not worry if you don't have the resources to get and/or annotate 4000 images: there are ways to artificially increase the size of your dataset using data augmentation techniques.

These techniques range from randomly and periodically resizing the images during training, to shifting their angle, to applying filters that modify colors, contrast, brightness and so on. You can think of it as pretty much what you do with your last holiday photos to make them look better before posting them on Instagram, just adapted to help the machine learn and generalize better.
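As a toy illustration (plain Python with Pillow, not the augmentation pipeline darknet applies internally), a couple of such random transformations could look like this:

    import random
    from PIL import Image, ImageEnhance

    def augment(image: Image.Image) -> Image.Image:
        """Return a randomly transformed copy of the input image."""
        # Random rotation of a few degrees, as if the ID were photographed at an angle.
        image = image.rotate(random.uniform(-10, 10), expand=True)
        # Random brightness and contrast changes, simulating different lighting conditions.
        image = ImageEnhance.Brightness(image).enhance(random.uniform(0.7, 1.3))
        image = ImageEnhance.Contrast(image).enhance(random.uniform(0.7, 1.3))
        # Random rescaling, so the network sees the ID at slightly different resolutions.
        scale = random.uniform(0.8, 1.2)
        return image.resize((int(image.width * scale), int(image.height * scale)))

Keep in mind that geometric transformations such as rotation or rescaling also require the bounding-box annotations to be adjusted accordingly, which is precisely where the annotation errors mentioned below can get amplified.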

The great benefit of this is that each modification of an image turns it into a completely different sample for the machine. But it comes with a couple of trade-offs that must be taken into account: any error in the annotations of your training set will be amplified and will introduce noise into the learning process, and, unless you take measures against it, you will give up reproducibility of your experiments, as the random nature of the image modifications will not allow repeating the exact same experiment twice.

The Training Stage

Once the training is launched, the network starts by resizing the samples, ideally to a square shape with dimensions that are a multiple of 32, in order to better fit the model architecture. Then the resized images pass through the layers, undergoing the already mentioned transformations and resizings in order to highlight the features that best describe what is contained in them.

Across the architecture, three detections are performed per training iteration, at layers 82, 94 and 106. These three stages aim to detect big, medium and small sized objects, respectively.
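This is also why the input dimensions should be multiples of 32: the three detection layers look at the image at strides of 32, 16 and 8 pixels, so a quick back-of-the-envelope check (illustrative Python, assuming an example 608-pixel input) shows the detection grid each one works on:

    # YOLOv3 detects at three scales, downsampling the input by 32, 16 and 8.
    input_size = 608  # example input resolution, a multiple of 32
    for stride in (32, 16, 8):
        cells = input_size // stride
        print(f"stride {stride}: {cells} x {cells} detection grid")
    # stride 32: 19 x 19 grid -> big objects
    # stride 16: 38 x 38 grid -> medium objects
    # stride  8: 76 x 76 grid -> small objects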

To better understand this setup, you can refer to the below diagram:

In each training iteration, the network compares the detections and classifications obtained with the annotations (ground truth) provided, calculates the difference or distance between them, and readjusts the weights in each layer to lower this difference. Depending on the number of classes in your dataset, its size in terms of number of samples, and the processing power of your GPU, this process might take hours, or even days, to complete.
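As a purely conceptual sketch of that loop (generic PyTorch-style code with dummy stand-ins, not darknet's actual C internals):

    import torch

    # Dummy stand-ins for the real detector, loss function and data.
    model = torch.nn.Linear(10, 5)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
    loss_fn = torch.nn.MSELoss()

    for iteration in range(100):
        images = torch.randn(8, 10)               # a batch of (dummy) samples
        annotations = torch.randn(8, 5)            # their (dummy) ground-truth annotations
        predictions = model(images)                # forward pass: detections + classifications
        loss = loss_fn(predictions, annotations)   # distance between predictions and ground truth
        optimizer.zero_grad()
        loss.backward()                            # compute how each weight should change
        optimizer.step()                           # readjust the weights to lower the difference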

The Evaluation Stage

Once our network is trained, it is time to evaluate how good it is at detecting and classifying the IDs in images it has never seen before. For this purpose we can use the following set of metrics on the detections the model has made on our test set:

Detection and Classification Evaluation Metrics

This evaluation log contains metrics with two different purposes: some, like IoU (Intersection over Union), evaluate the detection performance; others, like the F1-score, give us insight into the classification performance of the model. The goal is to train our model so as to optimize both types of metrics up to an acceptable benchmark. If you want to get technical on these metrics, here is a good read.
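To make the two families of metrics concrete, here is a minimal Python sketch of IoU and F1-score (helper names are our own, boxes given as (x_min, y_min, x_max, y_max)):

    def iou(box_a, box_b):
        """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
        ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    def f1_score(true_positives, false_positives, false_negatives):
        """Harmonic mean of precision and recall."""
        precision = true_positives / (true_positives + false_positives)
        recall = true_positives / (true_positives + false_negatives)
        return 2 * precision * recall / (precision + recall)

    # Example: a predicted box vs. its ground-truth box, and a toy set of detections.
    print(iou((10, 10, 110, 60), (20, 15, 120, 65)))                            # ~0.68
    print(f1_score(true_positives=90, false_positives=5, false_negatives=10))   # ~0.92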

This is how a detection & classification looks in a sample image:

Detection Bounding Box and Class Predicted in a Spain issued ID sample

Conclusions

In this post we have scratched the surface of the process of adapting a state-of-the-art object detection algorithm to our ID validation use case. This is just a starting point, as a deeper technical dive into this technology is necessary in order to understand it and make the model learn and generalize up to an acceptable standard.

Thanks for reading! Hopefully, this post gave you an entry point into the thrilling universe of machine learning for computer vision.

To learn about the following stages of our ID Validation Solution, you can jump into our second article: The Path to Identity Validation (2/3): How to start your own machine learning project?. There we describe the tech infrastructure behind it, as well as our solution for Text Detection in IDs.

In a future post, we will explain how to perform the last stage in the process: text recognition. Stay tuned!

[1] YOLOv3 paper: https://pjreddie.com/media/files/papers/YOLOv3.pdf

[2] When should I stop training: https://github.com/AlexeyAB/darknet#when-should-i-stop-training

About Signaturit

Signaturit is a trust service provider that offers innovative solutions in the field of electronic signatures (eSignatures), certified registered delivery (eDelivery) and electronic identification (EID).

Open Positions

We are always looking for talented people who share our vision and want to join our international team in sunny Barcelona :) Be a SignaBuddy > jobs
