The Role of Machine Learning in e-KYC

Published in Data Folks Indonesia · 4 min read · Apr 14, 2022

In this article, we will talk about the machine learning applications that are often used in e-KYC (electronic Know Your Customer). e-KYC is usually part of user registration: it verifies that the person registering is really who they claim to be. The user submits an identity card, selfie photos, and other legal documents. These documents then go through a verification process that matches the user's data against government data (in Indonesia, from Dukcapil).

Below are some applications that I have often applied in the e-KYC process.

Optical Character Recognition (OCR) & Information Extraction

Optical Character Recognition has long existed in the machine learning field. Here, the task is to extract the text on the identity card, such as the ID number, name, place and date of birth, address, etc. Each extracted text is then passed to an information extraction model, which classifies which field the text belongs to and automatically fills in the registration form. The goal is to minimize user input effort and human error, as well as to speed up the registration process.

In the ideal case, the input image looks something like this. The user usually takes the photo with a handheld camera and uploads it.

https://disdukcapil.cilacapkab.go.id/ktp-3/

The text detection step highlights the location of the text.

The annotation looks like this: a list of dictionaries, each containing the text, the confidence score from text detection, the bounding box, and the category. There are two ways to identify an entity in the text. One, you can map the coordinates as key-value pairs, because, as seen in the image, the information always comes right after its label text. Two, you can treat it as text classification: given a text and its coordinates, classify which category the text belongs to. NAME-B means the text is the name on the identity card; B means beginning and I means inside (the text continues a name that has already begun), as in the BIO scheme used in NER.
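As a rough illustration of the annotation format and the two approaches, here is a minimal sketch. The field values, the bounding box layout `(x_min, y_min, x_max, y_max)`, the `O` category for label text, the `NIK-B` tag, and the helper names are all assumptions made for this example, not the output of a real model.

```python
# Hypothetical annotations: one dict per detected text box.
# bbox is (x_min, y_min, x_max, y_max) in pixels. All values are made up.
annotations = [
    {"text": "NIK", "confidence": 0.98, "bbox": (20, 40, 70, 60), "category": "O"},
    {"text": "3301234567890001", "confidence": 0.95, "bbox": (150, 38, 420, 62), "category": "NIK-B"},
    {"text": "Nama", "confidence": 0.97, "bbox": (20, 80, 80, 100), "category": "O"},
    {"text": "BUDI", "confidence": 0.93, "bbox": (150, 79, 210, 101), "category": "NAME-B"},
    {"text": "SANTOSO", "confidence": 0.91, "bbox": (220, 80, 330, 100), "category": "NAME-I"},
]

# Printed labels on the card mapped to form fields (assumed subset).
FIELD_LABELS = {"NIK": "id_number", "Nama": "name"}


def center_y(bbox):
    """Vertical center of a bounding box."""
    return (bbox[1] + bbox[3]) / 2


def extract_by_position(boxes, labels=FIELD_LABELS, row_tolerance=10):
    """Approach one: pair each known label with the boxes to its right on the same row."""
    fields = {}
    for label_box in boxes:
        field = labels.get(label_box["text"])
        if field is None:
            continue
        same_row = [
            b for b in boxes
            if b is not label_box
            and b["bbox"][0] >= label_box["bbox"][2]
            and abs(center_y(b["bbox"]) - center_y(label_box["bbox"])) <= row_tolerance
        ]
        same_row.sort(key=lambda b: b["bbox"][0])  # read left to right
        fields[field] = " ".join(b["text"] for b in same_row)
    return fields


def extract_by_tags(boxes):
    """Approach two: merge boxes that a classifier tagged with BIO-style categories."""
    fields = {}
    for b in boxes:
        if b["category"] == "O":
            continue
        field, _, position = b["category"].rpartition("-")
        if position == "B" or field not in fields:
            fields[field] = b["text"]
        else:
            fields[field] += " " + b["text"]
    return fields


print(extract_by_position(annotations))
print(extract_by_tags(annotations))
```

The position-based version needs no training but depends on a fixed card layout, while the tag-based version depends on the quality of the classifier that produced the categories.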

To measure OCR performance on identity cards, I usually use exact match as the metric. Exact match means that every predicted text must be exactly the same as the ground truth. In some cases, you may find that a few character errors are acceptable, but take the identity number as an example. In Indonesia, it always has 16 digits, and each part encodes information. With a single misclassified digit, you could identify a different person. Therefore, exact match is used for this measurement.
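To see why exact match is stricter than a character-level metric, here is a small sketch that compares the two on a made-up 16-digit number. The function names and sample values are assumptions for illustration only.

```python
def exact_match_rate(pairs):
    """Fraction of (predicted, truth) pairs that are identical strings."""
    if not pairs:
        return 0.0
    return sum(pred == truth for pred, truth in pairs) / len(pairs)


def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming, row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(predicted, truth):
    """Edit distance normalized by the length of the ground truth."""
    return edit_distance(predicted, truth) / max(len(truth), 1)


# Made-up ID numbers that differ only in the last digit.
truth = "3301234567890001"
predicted = "3301234567890007"

print(character_error_rate(predicted, truth))  # one wrong character out of 16
print(exact_match_rate([(predicted, truth)]))  # the whole field counts as wrong
```

A character error rate of about 6% looks small, yet the predicted number could belong to another person, which is what the exact match score of 0 captures.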

Model

To build a proof-of-concept model, you can try tesseract-ocr. It has been developed over many years and has now reached its 5th version. It provides pre-trained models, and there is a Python wrapper for using them.
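A minimal proof-of-concept sketch might look like the following. It assumes the `pytesseract` wrapper and Pillow are installed, that the Tesseract binary is available, and that the Indonesian language data (`ind`) has been installed. The file name, the confidence threshold, and the helper names are hypothetical.

```python
def run_ocr(image_path, lang="ind"):
    """Run Tesseract on an image and return its word-level output as a dict."""
    # Imported here so that to_boxes() below can be used without Tesseract installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_data(
        Image.open(image_path),
        lang=lang,  # assumes the "ind" traineddata file is installed
        output_type=pytesseract.Output.DICT,
    )


def to_boxes(data, min_confidence=60.0):
    """Convert Tesseract's column-style dict into a list of text boxes.

    Entries with empty text or a confidence below the (assumed) threshold are dropped.
    Tesseract reports confidence on a 0-100 scale and -1 for non-word entries.
    """
    boxes = []
    columns = zip(data["text"], data["conf"], data["left"],
                  data["top"], data["width"], data["height"])
    for text, conf, left, top, width, height in columns:
        conf = float(conf)  # may be str, int, or float depending on the version
        if conf < min_confidence or not text.strip():
            continue
        boxes.append({
            "text": text.strip(),
            "confidence": conf / 100.0,
            "bbox": (left, top, left + width, top + height),
        })
    return boxes


if __name__ == "__main__":
    # "ktp_sample.jpg" is a placeholder path, not a file that ships with anything.
    for box in to_boxes(run_ocr("ktp_sample.jpg")):
        print(box)
```

The pre-trained models are not tuned for identity cards, so treat the output as a baseline to compare against rather than a finished solution.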

Challenges

In a perfect scenario, a simple model may work. In my experience, there are a lot of challenges in building OCR, especially for e-KYC, such as:

Image Blur
Low quality image
Distortion
Low contrast
Image rotation

Each of these challenges should be addressed properly; otherwise, the model may perform poorly.
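One possible first step is to reject bad uploads before they reach the OCR model. The sketch below covers only two of the challenges, blur and low contrast, using the variance of a Laplacian filter as a sharpness score and the intensity range as a contrast score. It is written in plain Python on a 2D list of grayscale values to stay self-contained; the thresholds and function names are assumptions that would need tuning on real images, and a production version would more likely use an image processing library.

```python
def laplacian_variance(gray):
    """Sharpness score: variance of the 4-neighbour Laplacian over interior pixels.

    gray is a 2D list of intensities (0-255), at least 3x3. Blurry images have
    weak edges, so their Laplacian responses vary little.
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def contrast_range(gray):
    """Contrast score: difference between the brightest and darkest pixel."""
    flat = [pixel for row in gray for pixel in row]
    return max(flat) - min(flat)


def quality_issues(gray, min_sharpness=100.0, min_contrast=50):
    """Return a list of detected problems. Both thresholds are assumed values."""
    issues = []
    if laplacian_variance(gray) < min_sharpness:
        issues.append("blur")
    if contrast_range(gray) < min_contrast:
        issues.append("low contrast")
    return issues


# Synthetic examples: a sharp checkerboard and a flat gray image.
checkerboard = [[0 if (x + y) % 2 == 0 else 255 for x in range(6)] for y in range(6)]
flat_gray = [[128] * 6 for _ in range(6)]

print(quality_issues(checkerboard))
print(quality_issues(flat_gray))
```

When a check fails, the application can ask the user to retake the photo instead of passing a poor image to the model.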

Face Image Search

Face image search is a task where the model tries to find a given query image in an existing dataset. The task is used to check whether the user is new to our database. You may wonder why a user who has already registered would create a new account. The marketing team usually runs campaigns that specifically target user acquisition, say, if you get 2 of your friends to register, you will get $10. This promo is then abused by some people to get the money by using other people's identity cards while still uploading their own picture.

CP-mtML: Coupled Projection multi-task Metric Learning for Large Scale Face Retrieval. Bhattarai et al.
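To make the duplicate check concrete, here is a minimal sketch of face search as nearest-neighbour lookup over embeddings. It is not the CP-mtML method cited above. It assumes that some face model has already turned each selfie into a fixed-length vector, and the tiny 3-dimensional vectors, the user IDs, and the similarity threshold are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_duplicates(query, gallery, threshold=0.8):
    """Return (user_id, similarity) pairs at or above the threshold, best match first.

    gallery maps user IDs to the stored embedding of each registered face.
    The threshold is an assumed value that must be tuned on labeled pairs.
    """
    matches = [
        (user_id, cosine_similarity(query, embedding))
        for user_id, embedding in gallery.items()
    ]
    matches = [m for m in matches if m[1] >= threshold]
    return sorted(matches, key=lambda m: m[1], reverse=True)


# Made-up embeddings of already registered users.
gallery = {
    "user_a": [1.0, 0.0, 0.0],
    "user_b": [0.0, 1.0, 0.0],
}

# A new selfie whose embedding is close to user_a's.
print(find_duplicates([0.9, 0.1, 0.0], gallery))

# A new selfie that resembles nobody in the gallery.
print(find_duplicates([0.0, 0.0, 1.0], gallery))
```

This brute-force scan compares the query against every stored face, which is fine for a prototype. A large user base would need an index for approximate nearest-neighbour search.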

Conclusion

I know this is far from perfect, but it may be a good starting point for building your project.