Machine Learning @ DKatalis: Generating Synthetic Data with Photoshop and Python For Great Good!

Benjamin Tan Wei Hao
DKatalis

--

Machine learning is an expensive affair. Training models costs money, and even more so when GPUs are involved. However, as most companies that delve into any non-trivial machine learning find out, it is the data that accounts for a large part of the cost.

In this post, I detail how I used Photoshop and Python to generate thousands (10k+) of synthetic Indonesian identification cards to train a deep learning model for OCR segmentation and recognition.

The Problem Domain

This is an example of an Indonesian identification card, also known as Kartu Tanda Penduduk, or KTP for short.

Source: https://m.suarasindo.com/read-2236-2019-07-28-masyarakat-jangan-sembarang-unggah-data-ktpel-dan-kk-di-internet.html

As part of the onboarding process for the Jago Bank application, we want to allow users to upload a picture of their KTP and have the important information, such as the NIK (identification number), name, and address, pre-populated in the next step.
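To make the fields concrete, here is a minimal sketch of how the text side of a synthetic KTP could be produced in Python. It is only an illustration under stated assumptions, not the pipeline described in this post: the function names (`random_nik`, `random_record`, `generate_records`) and the sample names and streets are hypothetical, and the NIK is just a random 16-digit string (real NIKs encode region and date of birth, which this sketch ignores).

```python
import random

# Hypothetical sample values, made up for illustration only.
FIRST_NAMES = ["Budi", "Siti", "Agus", "Dewi", "Rina"]
LAST_NAMES = ["Santoso", "Wijaya", "Lestari", "Pratama", "Hidayat"]
STREETS = ["Jl. Merdeka", "Jl. Sudirman", "Jl. Diponegoro"]


def random_nik(rng):
    """Return a random 16-digit string standing in for a NIK.

    Assumption: only the length matches a real NIK; the digits carry
    no region or birth-date meaning here.
    """
    return "".join(rng.choice("0123456789") for _ in range(16))


def random_record(rng):
    """Build one synthetic record with the three fields named above."""
    return {
        "nik": random_nik(rng),
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        "address": f"{rng.choice(STREETS)} No. {rng.randint(1, 200)}",
    }


def generate_records(count, seed=0):
    """Generate `count` records; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    return [random_record(rng) for _ in range(count)]


records = generate_records(3)
for record in records:
    print(record)
```

Each record could then be rendered onto a card template, and because the text is generated, the ground-truth labels for every field come for free.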

Now, the leap from random KTP pictures to pre-populated fields requires significant data science and engineering effort, but that’s for another blog post. In this post, I…

--


Author of The Little Elixir & OTP Guidebook, Mastering Ruby Closures, Building an ML Pipeline in Kubeflow. | Currently: Product Owner at @dkatalis.