Leveraging Machine Learning in Fraud Detection

Johann Ko
Government Digital Services, Singapore
Jan 13, 2020

In the past few weeks as an intern at GovTech, I have had the opportunity to work on fraud detection in applications for Government grants. This work has led me to explore using Generative Adversarial Networks (GANs) to model grant applications and, potentially, to discriminate between fraudulent and non-fraudulent ones.

So, what are GANs?

For those of you who are unfamiliar with GANs, they are a class of machine learning models that perform unsupervised learning (in the sense that the training data does not have to be labelled, for example as “positive” or “negative”) and learn to generate synthetic data with the same characteristics as the training data. Currently, the best-known applications of GANs are in fake image generation (e.g. Deepfakes, Speech2Face).

TV host Jimmy Fallon and John Oliver face swap using Deepfakes (Source)

In essence, a GAN consists of a Discriminator model and a Generator model. The Generator learns to create fake data and tries to pass it off as genuine in the eyes of the Discriminator, much like a forger aiming to re-create a painting. The Discriminator, on the other hand, acts like a detective on the lookout for forgeries: it must decide whether the data presented to it is authentic or generated. The two models are pitted against each other during training: the Generator gets better at fooling the Discriminator, while the Discriminator improves its ability to distinguish authentic data from generated data.
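The forger-and-detective loop can be sketched with a deliberately tiny example: a one-dimensional GAN written from scratch in NumPy (a toy illustration only, not the model we built), in which the Generator learns to mimic “real” values clustered around 4.0 while the Discriminator learns to tell the two apart.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D "real" data: authentic values cluster around 4.0
def real_batch(n):
    return rng.normal(4.0, 1.25, n)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Generator G(z) = a*z + b, Discriminator D(x) = sigmoid(w*x + c)
a, b = 1.0, 0.0          # generator parameters
w, c = 0.0, 0.0          # discriminator parameters
lr, n = 0.05, 64

for step in range(2000):
    z = rng.normal(0, 1, n)
    fake = a * z + b
    real = real_batch(n)

    # --- Discriminator step: push D(real) -> 1 and D(fake) -> 0 ---
    d_real, d_fake = sigmoid(w * real + c), sigmoid(w * fake + c)
    grad_w = np.mean((d_real - 1) * real) + np.mean(d_fake * fake)
    grad_c = np.mean(d_real - 1) + np.mean(d_fake)
    w -= lr * grad_w
    c -= lr * grad_c

    # --- Generator step: push D(fake) -> 1 (fool the detective) ---
    d_fake = sigmoid(w * fake + c)
    g = (d_fake - 1) * w          # dLoss/dfake for loss = -log D(fake)
    a -= lr * np.mean(g * z)      # chain rule through G's parameters
    b -= lr * np.mean(g)

samples = a * rng.normal(0, 1, 1000) + b
# The generated samples' mean drifts toward the real mean of 4.0
```

Both players improve together: once the Generator's output is close to the real distribution, the Discriminator's scores hover near 0.5 and neither side can gain much further advantage.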

Process flow in a GAN (Source)

Applying it to fraud detection in government grants

Typical implementations, like Deepfakes, isolate the Generator model after training to create synthetic images of people. In our case, however, we hypothesized that the Discriminator, having been trained to discern between real and fake data, would be able to flag grant applications that do not look realistic and are therefore possibly fraudulent.

Understanding the data

While the development of intricate GAN models is exciting, the first and most crucial stage in any data science application is understanding the data. We sieved through multiple datasets pertaining to Government grants to identify which of the available features are most likely to be indicative of abnormal or fraudulent behaviour in a grant application. This selection requires both common sense (for instance, to recognize that the “email address of Grant Applicant” would likely not be a useful feature) and domain knowledge of the grant processing workflow (much thanks to Chloe and other system experts for this!) to identify possible methods of fraud. We settled on a subset of 7 numerical features to test the viability of a GAN fraud detector.

Modelling

TensorFlow and Spark were used to create the GAN model, namely the Discriminator and Generator models. Both were trained with the RMSprop optimiser, which manages the parameter updates fed back to the Discriminator and Generator by scaling each gradient with a running average of recent squared gradients. It also supports a momentum term that can push the models out of a shallow local minimum during learning, making it more likely that each finds a near-optimal solution. Mini-batch training was then used to train the GAN model.
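As a rough sketch of what the optimiser does (the update rule below is a from-scratch illustration, not our actual training code), RMSprop divides each gradient by a root-mean-square of recent gradients, while the momentum term accumulates velocity across steps:

```python
import numpy as np

# RMSprop with momentum: the running average of squared gradients rescales
# each step, and the velocity term carries the parameters across steps,
# which can help roll them out of shallow local minima.
def rmsprop_step(param, grad, avg_sq, velocity,
                 lr=0.01, rho=0.9, momentum=0.5, eps=1e-8):
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2
    velocity = momentum * velocity + lr * grad / (np.sqrt(avg_sq) + eps)
    return param - velocity, avg_sq, velocity

# Toy objective standing in for a model's loss: f(x) = (x - 3)^2
x, s, v = 0.0, 0.0, 0.0
for _ in range(1000):
    grad = 2 * (x - 3)
    x, s, v = rmsprop_step(x, grad, s, v)
# x settles near the minimum at 3
```

In the real training loop this update is applied to every weight of both the Discriminator and the Generator on each mini-batch.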

Results, Conclusions, and Takeaways

We generated synthetic grant applications by uniformly sampling each selected feature from a reasonable numerical range. These were then tested against real grant applications using the Discriminator model, which discerned the real grants from the fake ones perfectly. This indicates that the model was able, at least to some extent, to learn a “criterion of normalcy” in grant feature values.
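That sampling step might look like the following sketch. The feature names and ranges here are invented for illustration, since the actual 7 features are not public:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical per-feature ranges (placeholders, not the real features);
# each synthetic "application" samples every feature uniformly.
feature_ranges = {
    "grant_amount":       (0.0, 50_000.0),
    "num_employees":      (1.0, 500.0),
    "annual_revenue":     (0.0, 5e6),
    "years_registered":   (0.0, 30.0),
    "num_prior_grants":   (0.0, 10.0),
    "project_duration_m": (1.0, 36.0),
    "cofunding_ratio":    (0.0, 1.0),
}

def synthetic_applications(n):
    lows = np.array([lo for lo, _ in feature_ranges.values()])
    highs = np.array([hi for _, hi in feature_ranges.values()])
    # Broadcasting samples all n rows across the 7 feature ranges at once
    return rng.uniform(lows, highs, size=(n, len(feature_ranges)))

fakes = synthetic_applications(1000)
```

Because the sampling ignores the correlations and clustering present in legitimate applications, most of these uniform draws land in regions of feature space that real applications never occupy, which is what gives the Discriminator a learnable signal.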

We can also attempt to use this Discriminator model to detect fraudulent claims among real data by assuming a correlation between fraud risk and anomalous data (i.e. fraudulent claims are anomalous in relation to legitimate claims). This can be achieved by ranking the grants by the model’s confidence of normalcy and inspecting those with the lowest scores. Unfortunately, my internship ended before we could embark on this work.
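A sketch of that ranking step, with a stand-in scoring function in place of the trained Discriminator (the helper names and the toy scorer are assumptions for illustration):

```python
import numpy as np

def rank_by_normalcy(applications, discriminator, top_k=10):
    """Return indices and scores of the top_k least 'normal' applications.

    `discriminator` maps an (n, d) feature matrix to one confidence score
    per row in [0, 1], where 1 means "looks like a real application".
    """
    scores = discriminator(applications)
    order = np.argsort(scores)            # lowest confidence first
    return order[:top_k], scores[order[:top_k]]

# Stand-in scorer: distance from the feature-wise mean, squashed so that
# outliers receive low confidence (a trained Discriminator would go here).
def toy_discriminator(x):
    dist = np.linalg.norm(x - x.mean(axis=0), axis=1)
    return 1.0 / (1.0 + dist)

apps = np.vstack([np.random.default_rng(1).normal(0, 1, (99, 7)),
                  np.full((1, 7), 8.0)])  # one planted outlier at index 99
flagged, _ = rank_by_normalcy(apps, toy_discriminator, top_k=5)
print(99 in flagged)  # the planted outlier ranks among the least normal
```

Investigators would then review only the handful of lowest-confidence applications rather than the full queue.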

Although GANs are traditionally used for image generation, our implementation, along with a number of recent academic studies, shows that there is merit in deploying the GAN Discriminator as a type of anomaly detector. Of course, the work done here must be extended to benchmark the GAN Discriminator’s performance against more traditional unsupervised anomaly detection methods. To learn more about GANs and their other applications, you can head over to

Shoutouts

Shoutout to GovTech DSAID and GDS for the incredible learning opportunity! It has definitely been a joy to work with the talented and helpful individuals in these two divisions. The cultures in both are empowering and energetic, which makes it easy to bounce novel ideas around. Special mention to Woon Peng (@woonpenggoh), Chloe (@chloeyieee) and the rest of the team for providing me with guidance throughout this experience.

Do check out more of the interesting stuff we work on at https://medium.com/dsaid-govtech and at https://blog.gds-gov.tech/

We’re hiring!

If this sounds interesting to you, we are looking for summer interns to join us! Get in touch with the team by sending your CV over to recruit@dsaid.gov.sg!

- Johann Ko
