Credit Card Fraud Detection using OCR & Autoencoders in Keras.
Improved model with better accuracy.
Co-author: Tejas Shenoy
In today’s world, every one of us is vulnerable to credit card fraud. Millions of transactions are carried out around the world every day, and it is entirely possible for money to be debited from a person's account even though that person never made the transaction.
There are many ways to detect whether the transaction is a fraud or not.
In our project we classify a transaction on the basis of its behavior.
Fraud detection can be used by banks and other money-transfer applications to inform their customers when a fraudulent transaction occurs, so that strict measures can then be taken.
It can also be considered a safety measure to prevent fraudulent transactions.
If a fraudulent transaction occurs, the company can immediately inform its customers and verify it with them.
We use two different models to optimize our output and thereby keep our accuracy high.
OUTLINE
Module 1:
Module 1 relies on OCR-A, a font created specifically to aid Optical Character Recognition algorithms. We devise a computer vision and image processing algorithm that localizes the four groupings of four digits on a credit card, extracts each of these four groupings, segments each of the sixteen digits individually, and recognizes each digit by template matching against the OCR-A font.
So this is how Module 1 works:
1. Takes a reference image and extracts the digits.
2. Stores the digit templates in a dictionary.
3. Localizes the four credit card number groups, each holding four digits (for a total of 16 digits).
4. Extracts the digits to be “matched”.
5. Performs template matching on each digit, comparing each individual ROI to each of the digit templates 0–9, whilst storing a score for each attempted match.
6. Finds the highest score for each candidate digit, and builds a list called output which contains the credit card number.
7. Outputs the credit card number and credit card type to our terminal and displays the output image on our screen with the database details.
8. Checks the card number with the Luhn algorithm and reports whether the card is valid or invalid.
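Steps 5 and 6 above can be sketched in a few lines. This is a simplified, dependency-free illustration and not the project's actual code: it scores a digit ROI against each template with a mean-centered correlation and keeps the digit with the highest score. The tiny 3x3 binary "images" and the names `match_score`, `recognize_digit` and `templates` are hypothetical; a real pipeline would compare full-size grayscale ROIs extracted from the OCR-A reference image.

```python
def match_score(roi, template):
    """Mean-centered correlation between two equally sized 2D pixel grids."""
    roi_flat = [p for row in roi for p in row]
    tpl_flat = [p for row in template for p in row]
    roi_mean = sum(roi_flat) / len(roi_flat)
    tpl_mean = sum(tpl_flat) / len(tpl_flat)
    return sum((r - roi_mean) * (t - tpl_mean) for r, t in zip(roi_flat, tpl_flat))


def recognize_digit(roi, templates):
    """Score the ROI against every template and return the best-matching digit."""
    scores = {digit: match_score(roi, tpl) for digit, tpl in templates.items()}
    return max(scores, key=scores.get)


# Hypothetical 3x3 binary templates for two digits only (illustration).
templates = {
    "1": [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]],
    "7": [[1, 1, 1],
          [0, 0, 1],
          [0, 0, 1]],
}

# A noisy ROI that looks mostly like a "7" (one pixel is missing).
roi = [[1, 1, 0],
       [0, 0, 1],
       [0, 0, 1]]

# The list of recognized digits, one entry per ROI (here a single ROI).
output = [recognize_digit(roi, templates)]
print("".join(output))  # prints 7
```

Even with one pixel missing, the ROI correlates far better with the "7" template than with the "1" template, which is why keeping the highest score per ROI tolerates some noise.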
This helps us determine whether the card is present in our database or not.
If the card gets validated we move on to Module 2.
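The Luhn check from step 8 is a simple checksum over the card digits. Below is a minimal sketch of the standard algorithm; the function name `luhn_is_valid` is ours, and the sample number is a widely used test number, not one taken from this project.

```python
def luhn_is_valid(card_number: str) -> bool:
    """Return True if the digits in card_number pass the Luhn checksum."""
    digits = [int(ch) for ch in card_number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_is_valid("4111 1111 1111 1111"))  # True
print(luhn_is_valid("4111 1111 1111 1112"))  # False
```

Note that passing the Luhn check only shows that the number is well formed; whether the card actually exists still has to be confirmed against the database.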
Module 2:
First, we obtained our dataset from Kaggle (https://www.kaggle.com/mlg-ulb/creditcardfraud), a data analysis website that provides datasets. The dataset has 31 columns, 28 of which are named V1-V28 to protect sensitive data. The other columns are Time, Amount and Class.
● Time is the time elapsed between the first transaction in the dataset and the given transaction.
● Amount is the amount of money transacted.
● Class 0 represents a valid transaction and 1 represents a fraudulent one.
After checking this dataset, we plot a histogram for every column. This gives a graphical representation of the dataset, which we use to verify that it has no missing values. That way we know that no missing-value imputation is required and that the machine learning algorithms can process the dataset smoothly.
After this analysis, we plot a heatmap to get a coloured representation of the data and to study the correlation between our predictor variables and the class variable.
Algorithms used:
• Local Outlier Factor.
• Isolation Forest Algorithm.
We have a highly imbalanced dataset, so let's take a look:
Do fraudulent transactions occur more often at certain times?
Now let's take time into account and analyze our dataset.
It doesn’t seem like the time of the transaction really matters.
So now let's see why we used autoencoders:
An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal “noise”.
Reconstruction error
We optimize the parameters of our autoencoder in such a way that a special kind of error, the reconstruction error, is minimized.
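To make the idea concrete, here is a minimal sketch of a reconstruction error, assuming it is measured as the mean squared error between an input vector and the autoencoder's reconstruction of it (the text does not name the exact measure). The function name and the 4-feature vectors are hypothetical.

```python
def reconstruction_error(original, reconstructed):
    """Mean squared error between an input vector and its reconstruction."""
    if len(original) != len(reconstructed):
        raise ValueError("vectors must have the same length")
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)


# Hypothetical transactions (input) and what the model returned (output).
normal_input, normal_output = [0.5, -1.0, 0.25, 2.0], [0.5, -1.0, 0.25, 1.5]
odd_input, odd_output = [4.0, -3.0, 0.0, 2.0], [1.0, -1.0, 0.0, 2.0]

print(reconstruction_error(normal_input, normal_output))  # 0.0625
print(reconstruction_error(odd_input, odd_output))        # 3.25
```

A transaction that the model reconstructs poorly gets a large error, and that is the signal we use to flag it as a possible anomaly.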
Training our Autoencoder is gonna be a bit different from what we are used to. Let’s say you have a dataset containing a lot of non-fraudulent transactions at hand. You want to detect any anomaly in new transactions. We will create this situation by training our model on the normal transactions only. Keeping the true class labels for the test set gives us a way to evaluate the performance of our model.
Our Autoencoder uses 4 fully connected layers with 14, 7, 7 and 29 neurons respectively. The first two layers are used for our encoder, the last two go for the decoder.
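A rough Keras sketch of this architecture is shown below. Only the layer sizes (14, 7, 7, 29) come from the description above. The 29 input features (presumably V1-V28 plus Amount, with Time dropped), the activations, the optimizer and the loss are all assumptions made for illustration, and `build_autoencoder` is a hypothetical helper name.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Assumption: the input has 29 features, matching the 29 neurons of the last layer.
INPUT_DIM = 29


def build_autoencoder(input_dim=INPUT_DIM):
    inputs = keras.Input(shape=(input_dim,))
    # Encoder: compress the input down to 7 values.
    x = layers.Dense(14, activation="tanh")(inputs)
    x = layers.Dense(7, activation="relu")(x)
    # Decoder: expand back to the original number of features.
    x = layers.Dense(7, activation="tanh")(x)
    outputs = layers.Dense(input_dim, activation="linear")(x)
    model = keras.Model(inputs, outputs)
    # Assumed training setup: minimize the mean squared reconstruction error.
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model


autoencoder = build_autoencoder()
autoencoder.summary()
```

Training would then pass the normal transactions as both input and target, for example `autoencoder.fit(x_train_normal, x_train_normal, epochs=..., batch_size=...)`, where `x_train_normal` is a hypothetical array holding only the non-fraudulent training rows.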
Precision vs Recall
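Precision and recall are defined as follows: precision is the fraction of transactions flagged as fraud that really are fraudulent, TP / (TP + FP), and recall is the fraction of actual frauds that the model flags, TP / (TP + FN). Here TP, FP and FN are the numbers of true positives, false positives and false negatives.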
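As a quick worked example, precision is TP / (TP + FP) and recall is TP / (TP + FN), where TP, FP and FN count true positives, false positives and false negatives. The sketch below computes both from such counts; the numbers are hypothetical and are not results from our model.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) computed from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts: 75 frauds caught, 225 false alarms, 25 frauds missed.
precision, recall = precision_recall(tp=75, fp=225, fn=25)
print(precision, recall)  # 0.25 0.75
```

In this example the model catches three out of four frauds (high recall), yet only one in four of its alerts is a real fraud (low precision).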
Finally, we plot the reconstruction error for both classes:
Our output takes the form of a confusion matrix.
Our model seems to catch a lot of the fraudulent cases. However, the number of normal transactions classified as fraud is really high. Is this really a problem? Probably it is. You might want to increase or decrease the value of the threshold, depending on the problem. That one is up to you.
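The effect of the threshold can be sketched as follows: a transaction is flagged as fraud when its reconstruction error exceeds the threshold, and the confusion-matrix counts change as the threshold moves. The errors, labels and threshold values here are made up for illustration, and `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(errors, labels, threshold):
    """Flag a transaction as fraud (1) when its error exceeds the threshold."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for error, label in zip(errors, labels):
        predicted = 1 if error > threshold else 0
        if predicted == 1:
            counts["tp" if label == 1 else "fp"] += 1
        else:
            counts["fn" if label == 1 else "tn"] += 1
    return counts


# Hypothetical reconstruction errors and true classes (1 = fraud).
errors = [0.2, 0.4, 1.1, 2.5, 3.5, 0.9, 5.0, 0.3]
labels = [0, 0, 0, 1, 1, 0, 1, 0]

for threshold in (1.0, 3.0):
    print(threshold, confusion_counts(errors, labels, threshold))
# 1.0 {'tn': 4, 'fp': 1, 'fn': 0, 'tp': 3}
# 3.0 {'tn': 5, 'fp': 0, 'fn': 1, 'tp': 2}
```

Raising the threshold removes the false alarm but lets one fraud slip through, so the right value depends on which mistake is more costly for you.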
CONCLUSION :
Credit card fraud is a criminal act and an act of dishonesty. In this project we explore ways in which fraud can be stopped, either by analysing the transactions or by checking the credit card.
In the first module we use OCR to read the credit card number. It is less accurate on real-world photos but more accurate when the images are scanned properly, and it tells us about the originality of the card.
In the second module we use machine learning algorithms to analyse the dataset, which consists of various transactions made at different times by different people. The algorithm reaches 99.6% accuracy, but precision remains low at 28%. The high accuracy is due to the imbalance between valid and fraudulent transactions in the dataset.
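A small arithmetic sketch shows why accuracy alone is misleading on imbalanced data. The counts below are hypothetical and are not taken from our dataset or results: they describe a "model" that never flags anything as fraud.

```python
def accuracy(tn, fp, fn, tp):
    """Fraction of all transactions that were classified correctly."""
    return (tp + tn) / (tn + fp + fn + tp)


# Hypothetical: 10,000 transactions, 20 of them fraudulent,
# and a classifier that labels every single one as valid.
always_valid = accuracy(tn=9980, fp=0, fn=20, tp=0)
print(f"{always_valid:.1%}")  # 99.8%
```

This classifier misses every fraud yet still scores 99.8% accuracy, which is why precision and recall are the more informative numbers here.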
Future Scope:
While we couldn’t reach our goal of 100% accuracy in fraud detection, we did end up creating a system that can, with enough time and data, get very close to that goal. As with any such project, there is some room for improvement here.
The very nature of this project allows for multiple algorithms to be integrated together as modules and their results can be combined to increase the accuracy of the final result. This model can further be improved with the addition of more algorithms into it. However, the output of these algorithms needs to be in the same format as the others.
References:
1. Big thanks to Venelin Valkov.
2. Tom Fawcett (NY), Foster Provost (NY), “Combining Data Mining and Machine Learning for Effective Fraud Detection”, AAAI Technical Report WS-97-07.
3. “Survey on Credit Card Fraud Detection Using Hidden Markov Model”, International Journal of Advanced Research in Computer and Communication Engineering, Vol. 3, Issue 5, May 2014.