Developing Automated KYC Verification

Hendra Hadhil Choiri
Published in Bukalapak Data
5 min read · Feb 6, 2020

How to Extract Customer Data from Their KTP Image?

What is KYC?

Have you ever signed up for an app and been asked to upload an image of your ID card (in Indonesia, the KTP) to upgrade your account? This procedure is usually known as KYC (Know Your Customer): the process by which a business verifies the identity of its customers. It is necessary to ensure the reliability of our customer base and to reduce identity fraud.

At Bukalapak, KYC is done by asking users to upload an image of their KTP and a selfie photo of themselves holding the ID card. The user is verified by checking the identity data on the KTP and by comparing the KTP with the one in the selfie photo.

KYC Interface in Bukalapak

Automatic KYC Verification

The challenge in the KYC verification process is to make sure that the data submitted by users is correct. Thus, we need to 'read' the data in the KTP image and compare the KTP with the selfie photo. The conventional way to do this is to have CS agents manually check the images. Unfortunately, that requires a lot of time and resources, so we need to automate the process. Note that this article only focuses on automating the extraction of data from the KTP image.

There are third parties out there that provide services to automatically retrieve the data from KTP images and verify it, but they can be costly. Thus, we developed the service on our own. The service takes a KTP image as input and returns the user data on the KTP as output (i.e. province, city, NIK, name, birthplace and birthdate, gender, marital status, etc.).
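For illustration, the output is a simple key-value mapping. The field names and values below are made up for this example and do not reflect the service's actual schema:

# Hypothetical example output for a dummy KTP (all values are fake)
extracted = {
    'provinsi': 'JAWA BARAT',
    'kota': 'BANDUNG',
    'nik': '0123456789012345',
    'nama': 'BUDI',
    'tempat_tanggal_lahir': 'BANDUNG, 01-01-1990',
    'jenis_kelamin': 'LAKI-LAKI',
    'status_perkawinan': 'BELUM KAWIN',
}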

The architecture of this service is as the following:

Architecture of the KTP data extractor

Step 1: OCR (Optical Character Recognition)

To extract the data, we first need to 'read' the text inside the image, which is what OCR does. Instead of building an OCR system from scratch, we can use an existing one: OCR is a well-studied topic and there are already many libraries for it. In this case, we use Google Cloud Vision (GCV).

It is pretty easy to extract text from an image using the Vision API. For example, if we have an image 'sample_ktp.png' and an API key file 'my_gcvision_api_key.json', we can just run the following Python script:

from google.cloud import vision
from google.cloud.vision import types
from google.protobuf.json_format import MessageToDict

client = vision.ImageAnnotatorClient.from_service_account_file(
    "my_gcvision_api_key.json")

# Load the KTP image and wrap it in a Vision API Image object
with open("sample_ktp.png", "rb") as f:
    image = types.Image(content=f.read())

text_response = client.text_detection(image=image)
text_response = MessageToDict(text_response)
Dummy KTP image (sample_ktp.png). Note that it is not a real KTP, obviously.

After this process, we get the chunks of words inside the image together with their coordinates: for each word, there is a 'description' and four (x, y) pairs. For easier processing, we convert each word into a record with label, the four coordinates (x1, y1) to (x4, y4), w (width), and h (height). An example result can be seen in the following image:

Extracted texts via Google Cloud Vision and the converted format
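For reference, the conversion itself can be done with a short loop over the Vision API response. This is a minimal sketch assuming the dict produced by MessageToDict above (the Vision API returns the full text block as the first annotation, so we skip it):

# Convert the Vision API's textAnnotations into our word records
ls_word = []
for ann in text_response['textAnnotations'][1:]:
    vertices = ann['boundingPoly']['vertices']
    # A vertex may omit x or y when the value is 0, hence the defaults
    xs = [v.get('x', 0) for v in vertices]
    ys = [v.get('y', 0) for v in vertices]
    word = {'label': ann['description']}
    for i in range(4):
        word['x%d' % (i + 1)] = xs[i]
        word['y%d' % (i + 1)] = ys[i]
    word['w'] = max(xs) - min(xs)
    word['h'] = max(ys) - min(ys)
    ls_word.append(word)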

Step 2: Entity Recognition

From the previous step, we only know which words exist in the image and where they are positioned. However, we don't know which one represents the NIK, the name, the gender, etc. Thus, the next step is to find these entities. There are several approaches to do that. Previously, I used regex-based pattern matching, but the results were not very good. So, the method I use now is to search for the position of the entity's field name, and then find the corresponding value.

For example, suppose we want to find the name. First, we need to find where the word 'nama' is. We could directly filter ls_word for label = 'nama'. However, the OCR result sometimes contains typos, e.g. 'nama' becomes 'nma'. To handle that, I use the Levenshtein distance and pick the word with the smallest edit distance to the searched keyword. The script is as follows:

field_keywords = 'nama'
ls_dist = [levenshtein(field_keywords, word['label'].lower()) for word in ls_word]
index = np.argmin(ls_dist)
ls_word[index]
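The snippet assumes numpy is imported as np and that a levenshtein helper is available. In practice you could use an existing library such as python-Levenshtein; for completeness, a minimal pure-Python version looks like this:

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between strings a and b.
    # prev holds the distances for the previous row of the DP table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

For example, levenshtein('nama', 'nma') returns 1, so the mistyped OCR output still matches the 'nama' field.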

Now we have the location of the word 'nama'. The next task is to find its value, which is basically to the right of the word 'nama'. We could simply take the words with a similar y position. However, that is problematic if the image is tilted, as in the following example: we have found the position of the word 'nama', but if we follow the horizontal line to the right (red), we get the NIK instead of the name. A better approach is to follow the direction of the text based on its gradient.

Here is the code to do this:

import math
import numpy as np

def calDeg(x1, y1, x2, y2):
    # Angle (0-360 degrees) of the line from (x2, y2) to (x1, y1)
    myradians = math.atan2(y1 - y2, x1 - x2)
    mydegrees = math.degrees(myradians)
    return mydegrees if mydegrees >= 0 else 360 + mydegrees

x, y = ls_word[index]['x1'], ls_word[index]['y1']
w = ls_word[index]['w']
degree = calDeg(ls_word[index]['x1'], ls_word[index]['y1'], ls_word[index]['x2'], ls_word[index]['y2'])
# Keep words roughly on the same line and in the same direction as the label
ls_y = np.asarray([np.abs(y - word['y1']) < 300 for word in ls_word])
value_words = [ww for ww, val in zip(ls_word, ls_y)
               if val and np.abs(calDeg(x, y, ww['x1'], ww['y1']) - degree) < 3]
value_words = [val for val in value_words  # drop space/colon-only tokens
               if len(val['label'].replace(' ', '').replace(':', '')) > 0]
field_value = ""
for val in value_words:
    field_value = field_value + ' ' + str(val['label'])
field_value = field_value.lstrip()

With just these simple steps, we can already retrieve most of the entities on the KTP. Of course, some post-processing is required to get cleaner and more accurate results.
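As a rough illustration (the exact rules in our service differ and are more extensive), post-processing could strip leftover separators and snap categorical fields such as gender to the closest valid value, reusing the levenshtein helper from above:

# Hypothetical post-processing sketch; the real service's rules may differ
VALID_GENDERS = ['LAKI-LAKI', 'PEREMPUAN']

def clean_field(value):
    # Remove leftover separators and normalize whitespace and case
    return ' '.join(value.replace(':', ' ').split()).upper()

def normalize_gender(value):
    # Snap the OCR'd gender to the closest valid option by edit distance
    value = clean_field(value)
    return min(VALID_GENDERS, key=lambda g: levenshtein(value, g))

print(normalize_gender('LAK-LAKI'))  # -> 'LAKI-LAKI'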

Python Code on GitHub

The complete basic script of this KTP extraction tool is available in the following GitHub repository: http://github.com/bukalapak/KTPextractor. This is one of Bukalapak's open-source projects; you are welcome to contribute and help us improve it.

Evaluation

By using this service, KYC verification can be done much faster (50x! we can process up to 50,000 KYCs in one day, instead of only 1,000 per day with manual checking). We can also directly detect problematic KYC images (e.g. not a KTP image, a blurred or unreadable image, incomplete data on the KTP, etc.). Based on our internal evaluation, among successful KYCs, the service correctly extracts 85% of the data. So, it is efficient enough to replace human verification in the KYC process :)
