Machine Learning @ DKatalis: Generating Synthetic Data with Photoshop and Python For Great Good!
Machine learning is an expensive affair. Training models costs money, and even more so when GPUs are involved. However, as most companies that delve into any non-trivial machine learning will find out, it is the data that accounts for a large part of the costs.
In this post, I detail how I used Photoshop and Python to generate thousands (10k+) of synthetic Indonesian identification cards to train deep learning models for OCR segmentation and recognition.
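To give a flavour of the Python half of such a pipeline, here is a minimal sketch of producing randomized field values that could later be rendered onto a card template. Everything in it is an assumption made for illustration: the `CardFields` class, the name pools, and the helper functions are hypothetical, and the identifier is just 16 random digits rather than a structurally valid number. It is not the code used for the actual dataset.

```python
import random
from dataclasses import dataclass

# Hypothetical name pools, for illustration only.
FIRST_NAMES = ["Budi", "Siti", "Agus", "Dewi"]
LAST_NAMES = ["Santoso", "Wijaya", "Lestari", "Pratama"]


@dataclass
class CardFields:
    """Made-up text values for one synthetic card."""
    id_number: str
    name: str


def random_card_fields(rng: random.Random) -> CardFields:
    # Assumption: the identifier is treated as 16 random digits;
    # a real one encodes more structure than this.
    id_number = "".join(rng.choice("0123456789") for _ in range(16))
    # Assumption: names are rendered in upper case on the card.
    name = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}".upper()
    return CardFields(id_number=id_number, name=name)


def generate_dataset(n: int, seed: int = 0) -> list[CardFields]:
    # A seeded generator keeps the synthetic dataset reproducible.
    rng = random.Random(seed)
    return [random_card_fields(rng) for _ in range(n)]


if __name__ == "__main__":
    for fields in generate_dataset(3):
        print(fields)
```

Seeding the generator means the same labels can be regenerated later, which helps when the rendered images and their ground-truth text need to stay in sync.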
The Problem Domain
This is an example of an Indonesian identification card, also known as Kartu Tanda Penduduk, or KTP for short.
As part of the on-boarding process for the Jago Bank application, we want to allow users to upload pictures of their KTP and have the important information, such as the NIK (identification number), name, and address, immediately pre-populated in the next step.
Now, the leap from random KTP pictures to pre-populated fields requires significant data science and engineering effort, but that’s for another blog post. In this post, I…