Credit Card Fraud Detection with Azure Data Science Virtual Machine

Published in Microsoft Azure · 5 min read · Sep 17, 2018

This blog post is co-authored by Jaya Mathew and Francesca Lazzeri, data scientists at Microsoft.

People from across the data world came together in New York last week for the Strata Data Conference. In our session “A day in the life of a data scientist: How do we train our teams to get started with AI?”, we presented a scientific framework to help organizations improve their data science skill sets, systematically discover opportunities to create value from data, qualify new opportunities and assess their fit and potential, smoothly implement end-to-end advanced analytics pilots and projects, and produce sustainable, ongoing business value from data.

We also walked through a detailed credit card fraud detection use case, from how the data typically gets collected to data wrangling, building a model, tuning the model, and operationalizing the model for a business to use in their production environment.

The goal of this blog post is to share with you more details on this end-to-end credit card fraud detection solution that we built using Python and Azure Data Science Virtual Machine.

Business Scenario

Recent advancements in computing technologies, along with the increasing popularity of eCommerce platforms, have radically amplified the risk of online fraud for financial services companies and their customers. Failing to properly recognize and prevent fraud costs the financial industry billions of dollars per year. This trend has urged companies to look into many popular artificial intelligence (AI) techniques, including deep learning, for fraud detection. Deep learning can uncover patterns in tremendously large data sets and independently learn new concepts from raw data without extensive manual feature engineering. For this reason, deep learning has shown superior performance in domains such as object recognition and image classification.

Data Set

For this solution we used a sample data set from Kaggle that contains transactions made by credit cards in September 2013 by European cardholders. These transactions occurred over two days.

The data set can be summarized as follows:

· Features V1, V2, … V28: the principal components obtained with PCA. The only features that have not been transformed with PCA are ‘Time’ and ‘Amount’

· Feature Time: contains the seconds elapsed between each transaction and the first transaction in the dataset

· Feature Amount: is the transaction Amount

· Feature Class: is the response variable and it takes value 1 in case of fraud and 0 otherwise

Machine Learning Approach

For this scenario we used a specific type of neural network called an autoencoder. This neural network is trained to attempt to copy its input to its output. Internally, it has a hidden layer h that describes a code used to represent the input.

The network may be viewed as consisting of two parts:

· an encoder function h = f(x)

· a decoder that produces a reconstruction r = g(h)

We optimize the parameters of our autoencoder model in such a way that a special kind of error, the reconstruction error, is minimized.
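
To make the encoder, decoder, and reconstruction error concrete, here is a minimal NumPy sketch. It is only an illustration under stated assumptions: the layer sizes, the tanh activation, the random (untrained) weights, and the use of mean squared error as the reconstruction error are choices made for this example and are not taken from our model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes, chosen only for illustration.
n_features, n_hidden = 29, 14

# Randomly initialised weights; a real model would learn these during training.
W_enc = rng.normal(scale=0.1, size=(n_features, n_hidden))
b_enc = np.zeros(n_hidden)
W_dec = rng.normal(scale=0.1, size=(n_hidden, n_features))
b_dec = np.zeros(n_features)


def f(x):
    """Encoder: map the input x to the hidden code h = f(x)."""
    return np.tanh(x @ W_enc + b_enc)


def g(h):
    """Decoder: map the code h back to a reconstruction r = g(h)."""
    return h @ W_dec + b_dec


def reconstruction_error(x):
    """Mean squared error between each input row and its reconstruction."""
    r = g(f(x))
    return np.mean((x - r) ** 2, axis=1)


x = rng.normal(size=(5, n_features))  # five made-up input rows
errors = reconstruction_error(x)
print(errors.shape)
```

Training adjusts the encoder and decoder weights so that the values returned by `reconstruction_error` become as small as possible on the training data.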

Environment Set Up

To build our solution, we used a Data Science Virtual Machine, which is a Windows Azure virtual machine (VM) image. It comes preinstalled and configured with several tools that are used for data analytics and machine learning. The Data Science Virtual Machine jump-starts your analytics project. You can work on tasks in various languages including R, Python, SQL, and C#.

To create an instance of the Microsoft Data Science Virtual Machine, follow these steps:

· Navigate to the virtual machine listing on the Azure portal. You may be prompted to log in to your Azure account if you are not already signed in.

· Select the Create button at the bottom to be taken into a wizard.

· The wizard that creates the Microsoft Data Science Virtual Machine requires input. The following input is needed to configure each of the steps shown on the right of the figure:

a. Basics:

i. Name. The name of the data science server you’re creating.

ii. VM Disk Type. Choose SSD or HDD. For an NC_v1 GPU instance (NVIDIA Tesla K80 based), choose HDD as the disk type.

iii. User Name. The admin account ID to sign in.

iv. Password. The admin account password.

v. Subscription. If you have more than one subscription, select the one on which the machine is to be created and billed.

vi. Resource Group. You can create a new one or use an existing group.

vii. Location. Select the data center that’s most appropriate. For fastest network access, it’s the data center that has most of your data or is closest to your physical location.

b. Size. Select one of the server types that meets your functional requirements and cost constraints. For more choices of VM sizes, select View All.

c. Settings:

i. Use Managed Disks. Choose Managed if you want Azure to manage the disks for the VM. If not, you need to specify a new or existing storage account.

ii. Other parameters. You can use the default values. If you want to use nondefault values, hover over the informational link for help on the specific fields.

d. Summary. Verify that all the information you entered is correct. Select Create.

Model Development

We developed our model using Python and Azure Notebooks. First of all, you need to prepare your environment and import the necessary components:

You can now enter the credentials to access the data from the cloud and then download the file for analysis:

Import the credit card data set:
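
As a hedged sketch of this step, the snippet below parses rows in the column layout described in the Data Set section and counts the fraudulent transactions. The helper name `read_transactions` and the three sample rows are made up for illustration, and the sample keeps only V1 and V2 for brevity, whereas the real file has V1 through V28.

```python
import csv

import numpy as np

# Made-up sample rows in the data set's layout (truncated to V1 and V2).
sample_lines = [
    '"Time","V1","V2","Amount","Class"',
    '0,-1.36,-0.07,149.62,"0"',
    '1,1.19,0.27,2.69,"0"',
    '406,-2.31,1.95,0.0,"1"',
]


def read_transactions(lines):
    """Parse CSV lines into (column names, 2-D float array of rows)."""
    reader = csv.reader(lines)
    header = next(reader)
    rows = [[float(value) for value in row] for row in reader]
    return header, np.array(rows)


header, data = read_transactions(sample_lines)

# 'Class' is 1 for fraud and 0 otherwise, so its sum is the fraud count.
labels = data[:, header.index("Class")]
n_fraud = int(labels.sum())

print(header)
print(data.shape, "fraud rows:", n_fraud)
```

Because `csv.reader` accepts any iterable of lines, the same helper can be given an open file handle for the downloaded file instead of the in-memory sample.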

For the modelling piece, you first need to exclude the variable ‘Time’. Since the spread of the variable ‘Amount’ is large, this variable is standardized. Then you have to define the framework for the autoencoder and then compile and fit using the training data:
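
A minimal NumPy-only sketch of these steps follows. It is a stand-in written for illustration, not the notebook code we ran: the synthetic data, the 14-unit hidden layer, the tanh activation, plain full-batch gradient descent, the learning rate, and the number of epochs are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the data set: columns are Time, V1..V28, Amount.
n_rows = 512
time = np.arange(n_rows, dtype=float).reshape(-1, 1)
components = rng.normal(size=(n_rows, 28))
amount = rng.exponential(scale=90.0, size=(n_rows, 1))
data = np.hstack([time, components, amount])

# Step 1: exclude 'Time' (column 0). Copy so the raw data stays untouched.
features = data[:, 1:].copy()

# Step 2: standardize 'Amount' (last column) to zero mean and unit variance.
amount_col = features[:, -1]
features[:, -1] = (amount_col - amount_col.mean()) / amount_col.std()

# Step 3: define the autoencoder (29 inputs -> 14 hidden units -> 29 outputs).
n_in, n_hidden = features.shape[1], 14
W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_in))
b2 = np.zeros(n_in)


def forward(x):
    """Return the hidden code h and the reconstruction r for inputs x."""
    h = np.tanh(x @ W1 + b1)
    r = h @ W2 + b2
    return h, r


def mse(x):
    """Mean squared reconstruction error over all rows and features."""
    return float(np.mean((x - forward(x)[1]) ** 2))


# Step 4: fit by full-batch gradient descent on the reconstruction error.
learning_rate = 0.2  # assumed value
initial_loss = mse(features)
for epoch in range(200):
    h, r = forward(features)
    grad_r = 2.0 * (r - features) / features.size  # d(loss)/d(r)
    grad_W2 = h.T @ grad_r
    grad_b2 = grad_r.sum(axis=0)
    grad_pre = (grad_r @ W2.T) * (1.0 - h ** 2)    # back through tanh
    grad_W1 = features.T @ grad_pre
    grad_b1 = grad_pre.sum(axis=0)
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2
final_loss = mse(features)

print("loss before:", initial_loss, "after:", final_loss)
```

In a deep learning framework, the explicit gradient loop would typically be replaced by the framework's own compile and fit calls, and the model would be fit on a training split rather than on all rows.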

Finally, you can save your model:
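
As one possible way to persist a model, the sketch below stores a set of weight arrays with NumPy and reads them back. This is an assumption-laden illustration: the weights are random placeholders rather than a trained autoencoder, the file name `autoencoder_weights.npz` is hypothetical, and a deep learning framework would normally provide its own save routine.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)

# Placeholder weights standing in for a trained autoencoder (29 -> 14 -> 29).
weights = {
    "W_enc": rng.normal(size=(29, 14)),
    "b_enc": np.zeros(14),
    "W_dec": rng.normal(size=(14, 29)),
    "b_dec": np.zeros(29),
}

# Hypothetical file name; a temporary directory keeps the example self-contained.
model_path = os.path.join(tempfile.mkdtemp(), "autoencoder_weights.npz")
np.savez(model_path, **weights)

# Reload the arrays to confirm the round trip.
with np.load(model_path) as archive:
    restored = {name: archive[name] for name in archive.files}

print(sorted(restored))
```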

Conclusion

In this blog post, we dived into a specific credit card fraud detection use case. Most importantly, we showed how the right cloud analytics environment, such as an Azure Data Science Virtual Machine, makes it easy to collect data, analyze, experiment, and build a model for any organization to use in a production environment.