Developing Fraud Detection with Python

Fraud

What is Fraud?

In law, fraud is deliberate deception to secure unfair or unlawful gain, or to deprive a victim of a legal right. Fraud itself can be a civil wrong (i.e., a fraud victim may sue the fraud perpetrator to avoid the fraud or recover monetary compensation), a criminal wrong (i.e., a fraud perpetrator may be prosecuted and imprisoned by governmental authorities), or it may cause no loss of money, property or legal right but still be an element of another civil or criminal wrong. The purpose of fraud may be monetary gain or other benefits, such as obtaining a passport or travel document, driver’s license or qualifying for a mortgage by way of false statements.

In this article we use a dataset from Kaggle to build such a fraud detection model step by step.

Introduction

The PwC Global Economic Crime Survey of 2016 suggests that approximately 36% of organizations experienced economic crime, so there is a clear need to solve the problem of credit card fraud detection. The task of fraud detection often boils down to outlier detection, in which a dataset is scanned for potential anomalies. In the past, this was done by employees who checked all transactions manually. With the rise of machine learning, artificial intelligence, deep learning and other relevant fields of information technology, it has become feasible to automate this process and save much of the intensive labor that goes into detecting credit card fraud. In the following sections, my machine-learning-based Pythonic approach is explained.

Requirements:

  • Python 3
  • The packages numpy, pandas, matplotlib, seaborn and scikit-learn (sklearn)
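
These packages can typically be installed with pip, for example:

pip install numpy pandas matplotlib seaborn scikit-learn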

1. Import the libraries:

import sys
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

# Suppress warnings unless they were explicitly enabled on the command line
if not sys.warnoptions:
    warnings.simplefilter("ignore")

2. Load the data:

# Read data from csv file
df = pd.read_csv('data/creditcard.csv')
print(df.head(2))
df.info()
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time      284807 non-null float64
V1        284807 non-null float64
V2        284807 non-null float64
V3        284807 non-null float64
V4        284807 non-null float64
V5        284807 non-null float64
V6        284807 non-null float64
V7        284807 non-null float64
V8        284807 non-null float64
V9        284807 non-null float64
V10       284807 non-null float64
V11       284807 non-null float64
V12       284807 non-null float64
V13       284807 non-null float64
V14       284807 non-null float64
V15       284807 non-null float64
V16       284807 non-null float64
V17       284807 non-null float64
V18       284807 non-null float64
V19       284807 non-null float64
V20       284807 non-null float64
V21       284807 non-null float64
V22       284807 non-null float64
V23       284807 non-null float64
V24       284807 non-null float64
V25       284807 non-null float64
V26       284807 non-null float64
V27       284807 non-null float64
V28       284807 non-null float64
Amount    284807 non-null float64
Class     284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
'''
print(df.describe())
'''
                Time            V1            V2            V3            V4  \
count  284807.000000  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05
mean    94813.859575  1.165980e-15  3.416908e-16 -1.373150e-15  2.086869e-15
std     47488.145955  1.958696e+00  1.651309e+00  1.516255e+00  1.415869e+00
min         0.000000 -5.640751e+01 -7.271573e+01 -4.832559e+01 -5.683171e+00
25%     54201.500000 -9.203734e-01 -5.985499e-01 -8.903648e-01 -8.486401e-01
50%     84692.000000  1.810880e-02  6.548556e-02  1.798463e-01 -1.984653e-02
75%    139320.500000  1.315642e+00  8.037239e-01  1.027196e+00  7.433413e-01
max    172792.000000  2.454930e+00  2.205773e+01  9.382558e+00  1.687534e+01

                 V5            V6            V7            V8            V9  \
count  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05
mean   9.604066e-16  1.490107e-15 -5.556467e-16  1.177556e-16 -2.406455e-15
std    1.380247e+00  1.332271e+00  1.237094e+00  1.194353e+00  1.098632e+00
min   -1.137433e+02 -2.616051e+01 -4.355724e+01 -7.321672e+01 -1.343407e+01
25%   -6.915971e-01 -7.682956e-01 -5.540759e-01 -2.086297e-01 -6.430976e-01
50%   -5.433583e-02 -2.741871e-01  4.010308e-02  2.235804e-02 -5.142873e-02
75%    6.119264e-01  3.985649e-01  5.704361e-01  3.273459e-01  5.971390e-01
max    3.480167e+01  7.330163e+01  1.205895e+02  2.000721e+01  1.559499e+01

       ...           V21           V22           V23           V24  \
count  ...  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05
mean   ...  1.656562e-16 -3.444850e-16  2.578648e-16  4.471968e-15
std    ...  7.345240e-01  7.257016e-01  6.244603e-01  6.056471e-01
min    ... -3.483038e+01 -1.093314e+01 -4.480774e+01 -2.836627e+00
25%    ... -2.283949e-01 -5.423504e-01 -1.618463e-01 -3.545861e-01
50%    ... -2.945017e-02  6.781943e-03 -1.119293e-02  4.097606e-02
75%    ...  1.863772e-01  5.285536e-01  1.476421e-01  4.395266e-01
max    ...  2.720284e+01  1.050309e+01  2.252841e+01  4.584549e+00

                V25           V26           V27           V28         Amount  \
count  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05  284807.000000
mean   5.340915e-16  1.687098e-15 -3.666453e-16 -1.220404e-16      88.349619
std    5.212781e-01  4.822270e-01  4.036325e-01  3.300833e-01     250.120109
min   -1.029540e+01 -2.604551e+00 -2.256568e+01 -1.543008e+01       0.000000
25%   -3.171451e-01 -3.269839e-01 -7.083953e-02 -5.295979e-02       5.600000
50%    1.659350e-02 -5.213911e-02  1.342146e-03  1.124383e-02      22.000000
75%    3.507156e-01  2.409522e-01  9.104512e-02  7.827995e-02      77.165000
max    7.519589e+00  3.517346e+00  3.161220e+01  3.384781e+01   25691.160000

               Class
count  284807.000000
mean        0.001727
std         0.041527
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         1.000000
'''
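
One figure in the summary above deserves attention: the mean of the Class column is only 0.001727, which means fraudulent transactions are extremely rare. A quick sanity check (this snippet is my addition, assuming the standard Kaggle creditcard.csv file):

# Class is 1 for fraud and 0 otherwise, so the sum is the number of frauds
fraud_count = df['Class'].sum()
print(fraud_count, len(df), fraud_count / len(df))
# Roughly: 492 284807 0.001727 (about 0.17% of transactions are fraud)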

3. Explore the data:

# Count the transactions in each class (0 = genuine, 1 = fraud)
count = df.groupby('Class')['Time'].count()
print(count)

sns.countplot(x='Class', data=df)
plt.show()

[Picture 1: count plot of transactions per class]

sns.barplot(x='Class', y='Amount', data=df)
plt.show()

[Picture 2: bar plot of the mean transaction amount per class]

4. Import the machine learning libraries:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

5. Prepare X and y for training and testing:

# Only use the 'Amount' and 'V1', ..., 'V28' features
features = ['Amount'] + ['V%d' % number for number in range(1, 29)]
print(features) # Column list for X
# ['Amount', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28']
# The target variable we would like to predict is the 'Class' column
target = 'Class'

# Now create an X variable (containing the features) and a y variable (containing only the target variable)
X = df[features]
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
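
Because fraud is so rare, a purely random split can leave the two halves with slightly different fraud rates. Passing stratify=y (my suggestion, not part of the original recipe) keeps the class proportions identical in both halves:

# Optional: a stratified split preserves the ~0.17% fraud rate on each side
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)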

6. Create the model:

# Define the model
model = LogisticRegression()

# Fit and predict
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
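
One caveat: 'Amount' lives on a much larger scale than the PCA components V1 to V28, which can trigger convergence warnings from the logistic regression solver. A minimal sketch of one way to deal with this, using a scikit-learn pipeline with standardization (this is my addition, not part of the original recipe):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize all features, then fit the same classifier
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_train, y_train)
y_pred_scaled = scaled_model.predict(X_test)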

7. Check the model results:

# And finally: show the results
print(classification_report(y_test, y_pred))
'''
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    142161
           1       0.88      0.59      0.71       243

avg / total       1.00      1.00      1.00    142404
'''
print(confusion_matrix(y_test, y_pred))
'''
[[142141     20]
 [    99    144]]
'''
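
The precision and recall reported for the fraud class follow directly from this matrix; as a sanity check (plain arithmetic, my addition):

# The matrix above reads [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 142141, 20, 99, 144
print(tp / (tp + fp))  # precision for class 1: 144 / 164, about 0.88
print(tp / (tp + fn))  # recall for class 1: 144 / 243, about 0.59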

8. Conclusion:

This is a promising first result, but it deserves a careful reading.

The 0 class (genuine transactions) is predicted with essentially 100% precision and recall, which is not surprising given that it makes up more than 99.8% of the data.

The model has more trouble with the 1 class (fraudulent transactions). A precision of 0.88 means that 88% of the transactions flagged as fraud really are fraudulent, so relatively few false alarms are raised. A recall of 0.59, however, means that only 59% of the actual fraud cases are caught: about 41% of fraudulent transactions remain undetected by the system. For an untuned logistic regression, 88% precision is quite good, but the low recall is the figure to improve next.
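
A common next step for such an imbalanced problem (again my suggestion, not part of the original recipe) is to re-weight the classes so that mistakes on the rare fraud class are penalized more heavily, which typically trades some precision for better recall:

# class_weight='balanced' weights each class inversely to its frequency
weighted_model = LogisticRegression(class_weight='balanced', max_iter=1000)
weighted_model.fit(X_train, y_train)
print(classification_report(y_test, weighted_model.predict(X_test)))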