Fraud Detection Methodology: Using Autoencoders to Reduce the Manual Labelling Effort

Jean Cupe
Axionable
May 28, 2018

This article describes a fraud detection methodology that requires a minimum of labelling effort. For this purpose we use a public dataset from Kaggle, based on credit card transactions described by 28 features (28 features as a result of a PCA process). Different analyses have already been done on this dataset, and building a model with the best accuracy is not the goal of this post. Instead, we use this dataset to develop a fraud detection methodology that can help us reduce the labelling effort using an unsupervised learning approach.

The lack of labelled datasets

The main constraint when developing a fraud detection mechanism is the lack of labelled datasets. Tagging a transaction as fraudulent or normal is a difficult task, not only because of the small number of experts but primarily because of the huge number of transactions to analyse. Labelling millions of transactions and identifying the features that characterize frauds is not easy, especially when frauds are much rarer than normal transactions: fraudulent transactions represent only 0.172% of the Kaggle dataset.

Using Autoencoders to reduce the labelling effort

An autoencoder is a neural network architecture composed of an encoder and a decoder. The goal of an autoencoder is to copy its input to its output through a reconstruction process: the encoder maps the input into a hidden (latent) space, and the decoder reconstructs the input from that space. Different autoencoder architectures exist, depending on the dimensionality of the hidden space and the inputs used in the reconstruction process.

In our approach, we use an undercomplete autoencoder, which performs a dimensionality reduction similar to PCA: the hidden space has fewer dimensions than the input, so the encoding phase can be seen as a feature extraction process.

Due to the size of the dataset and the small number of features characterizing each transaction in the Kaggle dataset, our autoencoder architecture has only two hidden layers in both the encoder and the decoder. To avoid overfitting, it is possible to add regularisers as well as dropout layers to the architecture, as in the sketch below.
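As an illustration, here is a minimal Keras sketch of such an undercomplete autoencoder. The layer sizes (16 and 8 units) and the dropout rate are assumptions for illustration, not the exact values used in our experiments.

```python
# A minimal sketch of an undercomplete autoencoder in Keras.
# Layer sizes and the dropout rate are illustrative assumptions.
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.models import Model

n_features = 28  # the Kaggle dataset's PCA features

inputs = Input(shape=(n_features,))
# Encoder: two hidden layers that compress the input
x = Dense(16, activation="relu")(inputs)
x = Dropout(0.1)(x)                      # optional regularisation against overfitting
encoded = Dense(8, activation="relu")(x)
# Decoder: a mirrored hidden layer, then the reconstruction of the input
x = Dense(16, activation="relu")(encoded)
outputs = Dense(n_features, activation="linear")(x)

autoencoder = Model(inputs, outputs)
# Squared error measures the reconstruction error of each datapoint
autoencoder.compile(optimizer="adam", loss="mse")
# Trained on unlabelled transactions only: the input is also the target
# autoencoder.fit(X_train, X_train, epochs=50, batch_size=256, validation_split=0.1)
```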

The common loss function used in autoencoders is the squared error, which measures the reconstruction error of a datapoint. The labelling effort can then be reduced by applying a threshold to this reconstruction error. We select a threshold that divides the dataset into two groups: one containing 95% of the transactions, with a reconstruction error below the threshold, and another containing the remaining 5%, with a large reconstruction error (above the threshold). Only the latter group (the transactions with a high reconstruction error) is labelled manually, i.e. classified as fraudulent or normal.
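The selection of this 5% subset can be sketched as follows, assuming X_test is a NumPy array of transactions and autoencoder is the trained model from above.

```python
# A sketch of the threshold selection: keep the 5% of transactions whose
# reconstruction error exceeds the 95th percentile.
import numpy as np

reconstructions = autoencoder.predict(X_test)
# Per-transaction squared reconstruction error
errors = np.mean(np.square(X_test - reconstructions), axis=1)

threshold = np.percentile(errors, 95)    # splits the data 95% / 5%
to_label = X_test[errors > threshold]    # only this subset is labelled by hand
print(f"threshold={threshold:.3f}, transactions to label: {len(to_label)}")
```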

Even though the Kaggle dataset labels every transaction as fraud or non-fraud, we did not use this information for all the transactions: we used it only for the 5% of transactions with a high reconstruction error, in order to validate our methodology.

A drawback of the autoencoder is that it cannot distinguish fraudulent from normal transactions when their reconstruction errors are similar. However, it can detect a group of abnormal transactions, which may include both fraudulent and normal transactions that are difficult to reconstruct. Labelling only this group requires much less effort than labelling the original dataset. This is the main advantage of the autoencoder: it reduces the number of samples that must be manually labelled.

Building an MLP model using the labelled dataset

The labelled subset obtained above can then be used to build a more precise prediction model. To keep using a deep learning approach, a Multi-Layer Perceptron (MLP) architecture was chosen.

The MLP model has three fully connected hidden layers. The output layer is a softmax layer with two outputs: one for fraudulent transactions and one for normal transactions. Using a softmax layer lets us predict whether a transaction is normal or fraudulent without any explicit threshold, i.e. we simply select the class with the maximum probability as the predicted class. The task is therefore a two-class classification, and the loss function used to train the model is the categorical cross-entropy, as in the sketch below.
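A minimal sketch of such an MLP in Keras could look as follows; the hidden layer sizes are illustrative assumptions.

```python
# A minimal sketch of the MLP classifier: three fully connected hidden
# layers and a two-output softmax. Layer sizes are illustrative assumptions.
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=(28,))
x = Dense(64, activation="relu")(inputs)
x = Dense(32, activation="relu")(x)
x = Dense(16, activation="relu")(x)
# Softmax over two classes: normal vs. fraudulent
outputs = Dense(2, activation="softmax")(x)

mlp = Model(inputs, outputs)
mlp.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# Predicted class = argmax of the softmax probabilities, no explicit threshold
# y_pred = mlp.predict(X_new).argmax(axis=1)
```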

Furthermore, an early stopping mechanism is used in the training process, together with a reduction of the learning rate on plateau: if the validation loss does not improve after some epochs, the learning rate is halved and training continues until the early stopping mechanism ends it.
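With Keras, both mechanisms are available as callbacks; the patience values below are illustrative assumptions.

```python
# A sketch of the training schedule: halve the learning rate when the
# validation loss plateaus, and stop early if it no longer improves.
# Patience values are illustrative assumptions.
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

callbacks = [
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3),
    EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True),
]
# mlp.fit(X_labelled, y_labelled, epochs=200, batch_size=256,
#         validation_split=0.1, callbacks=callbacks)
```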

The methodology to reduce labelling effort

Using the autoencoder and the MLP models, we can build a methodology that reduces the labelling effort on big datasets and constantly updates the prediction model, which can help us discover new types of fraudulent transactions.

The methodology that we propose has three phases:

Phase 1: This phase uses the autoencoder to select a data subset for manual labelling. The threshold keeps only the transactions that have a high reconstruction error. For the Kaggle dataset, we found that a reconstruction-error threshold of 3 selects only 5% of the test set for manual labelling.

Phase 2: This phase uses the MLP architecture to build a prediction model from the labelled subset alone (the 5% of the test set from the first phase). This prediction model can then be applied to new datasets.

[Figure] Phases 1 and 2: building models with unsupervised and supervised learning approaches.

Phase 3: This phase aims to update the prediction model and to reduce the manual labelling for new datasets. Analysing the new subsets selected for manual labelling will also help us discover new types of fraud (see the sketch after the figure below).

[Figure] Phase 3: update of the prediction model.
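A rough sketch of this update loop is shown below; the label_manually argument is a hypothetical stand-in for the expert labelling step, assumed to return one-hot labels to match the categorical cross-entropy loss.

```python
# A sketch of the phase-3 update loop: the autoencoder flags abnormal
# transactions in each new batch, an expert labels only that subset,
# and the MLP is fine-tuned on the new labels.
import numpy as np

def update_model(autoencoder, mlp, X_new, threshold, label_manually):
    reconstructions = autoencoder.predict(X_new)
    errors = np.mean(np.square(X_new - reconstructions), axis=1)
    X_subset = X_new[errors > threshold]   # only the abnormal transactions
    y_subset = label_manually(X_subset)    # hypothetical manual-labelling step
    # Fine-tune the existing MLP on the newly labelled subset
    mlp.fit(X_subset, y_subset, epochs=10, batch_size=256)
    return mlp
```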

Finally, it is very important to choose the right metrics to measure the performance of a fraud detection mechanism. The F1 score and the AUCPR (area under the precision-recall curve) are widely used on unbalanced datasets.
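With scikit-learn, both metrics can be computed as follows; average_precision_score is a common approximation of the AUCPR, and y_true stands for the manual labels of the evaluated subset (an assumed variable).

```python
# A sketch of the evaluation metrics with scikit-learn.
from sklearn.metrics import f1_score, average_precision_score

# y_true: manual labels of the evaluated subset (0 = normal, 1 = fraudulent)
probas = mlp.predict(X_test_labelled)   # softmax probabilities, shape (n, 2)
y_pred = probas.argmax(axis=1)          # class with the maximum probability

print("F1 score:", f1_score(y_true, y_pred))
print("AUCPR:   ", average_precision_score(y_true, probas[:, 1]))
```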

Conclusion

Building a fraud detection mechanism is possible with a deep learning approach, even when the dataset is unbalanced. By combining supervised and unsupervised learning approaches, it is possible to reduce the effort of manually labelling a dataset, and the discovery of new types of fraud can be incorporated when the model is updated. In the last phase of the methodology, it is also possible to use other approaches to update the model, such as Active Learning, but that is out of the scope of this post.

Thanks for reading! If you have any questions, I’ll be happy to answer them in the comments.

Jean from Axionable.

P.S. 1: This post was adapted from the one in Axionable Foodtruck.

P.S. 2: If you want to know more about Axionable, our projects and careers please visit us or follow us on Twitter.
