Machine Learning @ DKatalis: Generating Synthetic Data with Photoshop and Python For Great Good!

Benjamin Tan Wei Hao
Published in DKatalis
9 min read · Sep 21, 2021

Machine learning is an expensive affair. Training models costs money, and even more when GPUs are involved. However, as most companies that delve into any non-trivial machine learning find out, it is the data that accounts for a large part of the cost.

In this post, I detail how I used Photoshop and Python to generate thousands (10k+) of synthetic Indonesian identification cards to train a deep learning model for OCR segmentation and recognition.

The Problem Domain

This is an example of an Indonesian identification card, also known as Kartu Tanda Penduduk, or KTP for short.

Source: https://m.suarasindo.com/read-2236-2019-07-28-masyarakat-jangan-sembarang-unggah-data-ktpel-dan-kk-di-internet.html

As part of the on-boarding process for the Jago Bank application, we want to allow users to upload pictures of their KTP and have the important information, such as the NIK (identification number), name, and address, immediately pre-populated in the next step.
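To make the goal concrete, here is a minimal sketch of what "pre-populated" could look like on the receiving end. Everything in it is an assumption for illustration: the `KtpFields` container, the `looks_like_nik` and `prepopulate_form` helpers, and the form keys are hypothetical names, not part of the Jago application. The only domain fact it relies on is that a NIK is a 16-digit number.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KtpFields:
    """Hypothetical container for the values an OCR pipeline might return."""
    nik: Optional[str] = None
    name: Optional[str] = None
    address: Optional[str] = None


def looks_like_nik(text: str) -> bool:
    """Check only the shape of a NIK (16 ASCII digits), not whether it is valid."""
    return len(text) == 16 and text.isascii() and text.isdigit()


def prepopulate_form(fields: KtpFields) -> dict:
    """Build form defaults, leaving a field blank when OCR output is missing
    or, for the NIK, does not have the expected shape."""
    nik = fields.nik if fields.nik and looks_like_nik(fields.nik) else ""
    return {
        "nik": nik,
        "name": (fields.name or "").strip(),
        "address": (fields.address or "").strip(),
    }


if __name__ == "__main__":
    # Made-up values, purely for demonstration.
    ocr_output = KtpFields(nik="3171234567890123", name=" BUDI ", address="JL. MAWAR 1")
    print(prepopulate_form(ocr_output))
```

Leaving an implausible NIK blank rather than showing a wrong one is a design choice in this sketch: the user fills in an empty field, whereas a confidently wrong value is easy to miss.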

Now, the leap from random KTP pictures to pre-populated fields requires significant data science and engineering effort, but that’s for another blog post. In this post, I want to consider the problem of data acquisition.

Step 0: Getting and Labelling the Data

Depending on the use case, data acquisition can be extremely challenging. Consider this exact use case for a moment, and think about how you would go about getting images of KTP cards to train your OCR machine learning model on. You can trawl YouTube and Pinterest all you want, but you won’t be able to amass enough data to meaningfully train any model.

If you are fortunate enough to have customers (as we do!), then you’ll have some data to start with. But then, who is going to perform the annotations? Now, if you have folks in the company willing to do annotations (as we do!) then you should thank your lucky stars.

Here’s what we require our annotators to label:

Benjamin Tan Wei Hao
DKatalis

Author of The Little Elixir & OTP Guidebook, Mastering Ruby Closures, Building an ML Pipeline in Kubeflow. | Currently: Product Owner at @dkatalis.