Credit Card Fraud Detection

Sowmya Buddharaju
Jan 3, 2022 · 4 min read


Using a dataset from Kaggle, I applied three machine learning classification models (KNN, LDA, and logistic regression) to detect credit card fraud.

Other than the transaction amount and time, the original features and background information cannot be provided with the dataset due to confidentiality issues, as indicated by Kaggle. The features (excluding Time and Amount) have hence already been transformed with principal component analysis (PCA).

My code was inspired by Janio Martinez Bachmann’s work on Kaggle.

Exploratory Data Analysis

After reading the data into a pandas DataFrame, I checked for the presence of any null or NaN values. Then, I computed the percentage of fraud and non-fraud cases in the dataset. The data was also scaled to normalize the range of the features’ values, using RobustScaler from the scikit-learn library.

import pandas as pd
from sklearn.preprocessing import RobustScaler

data = pd.read_csv("creditcard.csv")
print("null and NaN values in dataset")
print("null: ", data.isnull().sum().sum())
print("nan: ", data.isna().sum().sum())
print("column names")
print(data.columns)
print("original dataset")
print("percentage of frauds: ", (len(data[data['Class'] == 1]) / len(data)) * 100)
print("percentage of NO frauds: ", (len(data[data['Class'] == 0]) / len(data)) * 100)

# scaling data
rob_scaler = RobustScaler()
scaled_amount = rob_scaler.fit_transform(data['Amount'].values.reshape(-1, 1))
scaled_time = rob_scaler.fit_transform(data['Time'].values.reshape(-1, 1))
data.drop(['Time', 'Amount'], axis=1, inplace=True)
data.insert(0, 'scaled_amount', scaled_amount)
data.insert(1, 'scaled_time', scaled_time)

Upon calculating the number of fraudulent and non-fraudulent cases in the data, I found that only about 0.17% of the instances in the dataset were fraudulent, meaning that the dataset is heavily imbalanced. To prevent the machine learning models from favouring the non-fraudulent class, I under-sampled the training dataset in order to create balanced training data to feed into the models.
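The under-sampling step can be sketched as follows. This is a minimal version, assuming the scaled DataFrame from above with its 'Class' column; the `undersample` helper name is my own, not from the original code.

```python
import pandas as pd

# Random under-sampling sketch: keep every fraud row and draw an equal
# number of non-fraud rows, then shuffle. (Assumes 'Class' == 1 marks fraud.)
def undersample(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    fraud = df[df['Class'] == 1]
    non_fraud = df[df['Class'] == 0].sample(n=len(fraud), random_state=seed)
    # shuffle so the two classes are interleaved in the result
    return pd.concat([fraud, non_fraud]).sample(frac=1, random_state=seed)
```

The trade-off of random under-sampling is that it discards most of the majority class, so the balanced training set is much smaller than the original data.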

To better understand the data, a correlation matrix was created from the dataset. From the matrix it can be seen that features such as V3, V10, V12, and V14 have a strong negative correlation with the output class, and thus would have a strong influence in training the model. Features V4 and V11, which appear to have a positive correlation with the output class, would also be important during training.

Correlation Matrix from training data
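As a sketch, the per-feature correlations with the label can be computed directly with pandas; the `class_correlations` helper below is hypothetical, not from the original code.

```python
import pandas as pd

# Rank every feature by its Pearson correlation with the fraud label.
# Strongly negative entries (e.g. V14) and strongly positive ones (e.g. V4)
# are the features a model is most likely to lean on.
def class_correlations(df: pd.DataFrame) -> pd.Series:
    return df.corr()['Class'].drop('Class').sort_values()
```

Passing the full `df.corr()` result to `seaborn.heatmap` produces a correlation matrix plot like the one described above.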

The final thing that I did before training the models was to identify and remove outliers from the training data. By creating box-plots of features against Class, it can be seen that most of the features have many outliers. To remove these outliers, the interquartile range method was used. This method involves calculating the interquartile range (the difference between the 75th and 25th percentiles of the data) and dropping values that lie beyond a multiple of it (1.75 × IQR here) below the first or above the third quartile.

Box-Plots to identify outliers
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# box-plots to check for the presence of outliers
f, axes = plt.subplots(ncols=5, nrows=6, figsize=(100, 50))
col = 0
row = 0
for column in training_data.columns:
    if column != 'Class':
        sns.boxplot(x='Class', y=column, data=training_data, ax=axes[row][col])
        if (col + 1) % 5 == 0:
            col = 0
            row += 1
        else:
            col += 1

# removing outliers with the interquartile range method
for column in training_data.columns:
    if column == 'Class':
        continue
    tmp = training_data[column].values
    q1, q3 = np.percentile(tmp, 25), np.percentile(tmp, 75)
    iqr = q3 - q1
    cut_off = iqr * 1.75  # slightly wider than the common 1.5 * IQR fence
    lower_boundary, upper_boundary = q1 - cut_off, q3 + cut_off
    training_data = training_data.drop(training_data[training_data[column] > upper_boundary].index)
    training_data = training_data.drop(training_data[training_data[column] < lower_boundary].index)

Training Classifiers

Three classification models, K-nearest neighbours, linear discriminant analysis, and logistic regression, were trained and optimised (through grid search) using scikit-learn. The test accuracies obtained before and after grid search, as well as their ROC curves, are shown below:

KNN test accuracy: 98.89 %
optimised KNN test accuracy: 99.14 %

Linear Discriminant Analysis test accuracy: 92.24 %
optimised Linear Discriminant Analysis test accuracy: 95.20 %

Logistic Regression test accuracy: 95.06 %
optimised Logistic Regression test accuracy: 97.79 %

ROC curves left to right: KNN, LDA, LR (all optimised) || Area under Curves left to right: 0.73, 0.85, 0.84
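The training-and-tuning step can be sketched roughly as follows. The parameter grids here are hypothetical placeholders (the actual grids used are not listed above), and the `tune_models` helper is my own naming.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical parameter grids; the article does not state the exact grids used.
MODELS = {
    'KNN': (KNeighborsClassifier(), {'n_neighbors': [3, 5, 7]}),
    'LDA': (LinearDiscriminantAnalysis(), {'solver': ['svd', 'lsqr']}),
    'LR': (LogisticRegression(max_iter=1000), {'C': [0.1, 1.0, 10.0]}),
}

def tune_models(X_train, y_train, X_test, y_test):
    # Fit each model with 5-fold cross-validated grid search, then report
    # the tuned model's accuracy on the held-out test set.
    results = {}
    for name, (model, grid) in MODELS.items():
        search = GridSearchCV(model, grid, cv=5, scoring='accuracy')
        search.fit(X_train, y_train)
        results[name] = search.best_estimator_.score(X_test, y_test)
    return results
```

Note that the grid search is run on the balanced (under-sampled) training data, while the test set keeps the original class distribution.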

Based on the accuracies shown above, the optimised KNN model appears to perform the best among the three models. However, the logistic regression model also appears to perform well, and it has the highest area under the ROC curve out of the three. A high AUC is ideal as it indicates that the model is able to distinguish between positive and negative classes well.
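Because AUC measures how well a model ranks positives above negatives, it is computed from predicted probabilities rather than hard class labels. A minimal sketch, assuming a fitted scikit-learn classifier with `predict_proba` (the `auc_of` helper is hypothetical):

```python
from sklearn.metrics import roc_auc_score

# AUC from the predicted probability of the positive (fraud) class.
# An AUC of 1.0 means perfect ranking; 0.5 is no better than chance.
def auc_of(clf, X_test, y_test) -> float:
    fraud_scores = clf.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, fraud_scores)
```

This is why a model can have high accuracy but a mediocre AUC, as with the KNN result above: accuracy scores only the final labels, while AUC scores the ranking across all thresholds.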
