Machine Learning Project - Credit Card Fraud Detection

Rahul Patodi
Published in DataFlair
6 min read · Apr 23, 2024

The Credit Card Fraud Detection project focuses on developing a machine learning model to detect fraudulent transactions in credit card data. The dataset used contains various attributes, including ‘Time’, ‘V1’ through ‘V28’, ‘Amount’, and ‘Class’. These attributes capture different aspects of credit card transactions, such as time, transaction amounts, and anonymized features derived from the transaction data.

The goal of the project is to leverage machine learning algorithms to accurately identify fraudulent transactions based on patterns and anomalies in the credit card data. By analyzing the transaction features and detecting unusual behavior, the model aims to differentiate between legitimate and fraudulent transactions effectively.

Credit card fraud detection presents several challenges due to the evolving nature of fraudulent activities and the need for real-time detection to prevent financial losses. One significant challenge is the sheer volume of transactions processed daily, making it difficult to distinguish between legitimate and fraudulent transactions accurately. Fraudsters constantly adapt their techniques, employing sophisticated methods such as stolen card details, identity theft, and account takeover schemes, which require advanced detection algorithms to identify patterns and anomalies effectively.

The Machine Learning Project endeavors to enhance fraud detection mechanisms employed by financial institutions and credit card companies. By identifying fraudulent activities early, the model contributes to reducing financial losses and protecting consumers from fraudulent transactions.

Credit Card Fraud Detection

Dataset

The dataset can be downloaded from this link. The attributes of the dataset are:

  • Time: The number of seconds elapsed between each transaction and the first transaction in the dataset. This attribute helps in analyzing transaction patterns over time.
  • V1-V28: Anonymized numerical features produced by principal component analysis (PCA), applied to protect the confidentiality of sensitive information. The original transaction fields behind them are not disclosed, so individual components cannot be mapped back to specific details such as merchant IDs.
  • Amount: Denotes the transaction amount involved in each credit card transaction. This attribute provides valuable information about the financial aspect of the transaction.
  • Class: Indicates whether a transaction is fraudulent or legitimate. It is a binary attribute where ‘1’ typically represents a fraudulent transaction, and ‘0’ represents a legitimate one. This attribute serves as the target variable for the fraud detection model.
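
To make the PCA-derived columns more concrete, here is a minimal sketch of how a PCA projection turns raw fields into uncorrelated components. The input data, the helper name pca_transform, and the number of components are all made up for illustration; the dataset's real inputs and preprocessing are not disclosed.

```python
import numpy as np

def pca_transform(features, n_components):
    """Project the rows of `features` onto their top principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
# Hypothetical raw fields, NOT the real dataset's undisclosed inputs.
raw = rng.normal(size=(500, 4))
raw[:, 1] += 0.8 * raw[:, 0]  # make two of the raw fields correlated

components = pca_transform(raw, n_components=3)
print(components.shape)
# Off-diagonal covariances are (numerically) zero: the components are uncorrelated.
print(np.round(np.cov(components, rowvar=False), 3))
```

Each output column mixes all of the input fields, which is why a single component such as V1 cannot be read as one original attribute.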

Prerequisites For Machine Learning Credit Card Fraud Detection Project

The prerequisites required are:

NumPy:

  • Understanding of arrays and matrix operations.
  • Ability to perform numerical computations efficiently.

Pandas:

  • Proficiency in handling and analyzing structured data.

Matplotlib:

  • Knowledge of basic plotting techniques, including line plots, scatter plots, and histograms.
  • Understanding of subplots for creating multiple plots in a single figure.
  • Familiarity with advanced plot types, such as heatmaps, contour plots, and geographical visualizations.

Seaborn:

  • Basic understanding of how to create various types of plots and charts in Seaborn.
  • Familiarity with Seaborn basics such as its syntax, functions, and methods.

scikit-learn:

  • Familiarity with machine learning concepts, such as supervised and unsupervised learning.
  • Understanding of model selection, training, and evaluation procedures.

Steps for Machine Learning Credit Card Fraud Detection Project

Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import gridspec

Code Explanation: This code snippet imports necessary libraries for data analysis and visualization, including Pandas for data manipulation, NumPy for numerical operations, Seaborn for statistical plotting, and Matplotlib for basic plotting functionalities. These libraries are commonly used in data analysis and visualization tasks to explore datasets, derive insights, and present findings through graphical representations.

Code:

data = pd.read_csv("/content/creditcard.csv")
data.head(10)

Code Explanation: The code snippet reads a CSV file named “creditcard.csv” into a Pandas DataFrame named “data”. It then displays the first 10 rows of the DataFrame using the head() function.

Output:

creditcard-csv

Code:

data.isnull().sum()

Code Explanation: This code checks for missing values in each column of the DataFrame by applying the isnull() function, which returns a DataFrame of boolean values indicating whether each element is missing or not. Then, the sum() function is applied to count the number of missing values in each column. This provides an overview of how many missing values are there in each column of the dataset.

Output:

Count of Null Values

Code:

data.dropna(inplace = True)

Code Explanation: The dropna() method in pandas removes rows or columns that contain missing values (NaN). Called with its defaults, as here, it drops every row that has at least one missing value. The inplace=True parameter applies the change directly to the DataFrame instead of returning a new one.
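
The missing-value check and the dropna() step can be tried on a small frame before running them on the full file. This is only a sketch: the three-row DataFrame below is hypothetical, and it reports how many rows were dropped, which the original code does not do.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame; the real file has Time, V1..V28, Amount, Class.
toy = pd.DataFrame({
    "Amount": [12.5, np.nan, 3.0],
    "Class": [0, 0, np.nan],
})

missing_per_column = toy.isnull().sum()  # number of NaN values in each column
rows_before = len(toy)
cleaned = toy.dropna()                   # returns a new frame; `toy` is unchanged
rows_dropped = rows_before - len(cleaned)

print(missing_per_column)
print("Rows dropped:", rows_dropped)
```

Checking the number of dropped rows is a cheap way to notice a truncated or partially uploaded CSV file.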

Code:

fraud = data[data['Class'] == 1]
valid = data[data['Class'] == 0]
outlierFraction = len(fraud)/float(len(valid))
print(outlierFraction)
print('Fraud Cases: {}'.format(len(data[data['Class'] == 1])))
print('Valid Transactions: {}'.format(len(data[data['Class'] == 0])))

Code Explanation: The code segment segregates the dataset into two subsets: one for fraudulent transactions and another for valid transactions. It calculates the outlier fraction, which represents the ratio of fraudulent transactions to valid ones. Additionally, it prints the counts of fraud cases and valid transactions in the dataset.
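
Note that the fraud-to-valid ratio is not quite the same as the share of fraud among all transactions. The sketch below shows both; the function name imbalance_summary and the counts are made up for illustration and are not taken from the dataset.

```python
def imbalance_summary(n_fraud, n_valid):
    """Return (fraud-to-valid ratio, fraud share of all transactions)."""
    if n_valid <= 0:
        raise ValueError("n_valid must be positive")
    ratio = n_fraud / n_valid              # what outlierFraction computes
    share = n_fraud / (n_fraud + n_valid)  # fraction of the whole dataset
    return ratio, share

# Hypothetical counts, chosen only to keep the arithmetic easy to follow.
ratio, share = imbalance_summary(n_fraud=50, n_valid=9950)
print(f"Fraud-to-valid ratio: {ratio:.5f}")
print(f"Fraud share of all transactions: {share:.5f}")
```

When fraud is rare, the two numbers are close, but the ratio is always slightly larger than the share.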

Output:

Fraud and Valid Count

Code:

X = data.drop(['Class'], axis = 1)
Y = data["Class"]
print(X.shape)
print(Y.shape)
xData = X.values
yData = Y.values

Code Explanation: This code segment separates the features (X) and the target variable (Y) from the dataset. It then prints the shapes of the feature and target arrays to confirm the dimensions. Finally, it converts the feature and target data into numpy arrays for further processing.

Output:

X and Y Shape

Code:

from sklearn.model_selection import train_test_split
xTrain, xTest, yTrain, yTest = train_test_split(xData, yData, test_size = 0.2, random_state = 42)

Code Explanation: This code utilizes the train_test_split function from scikit-learn to split the dataset into training and testing sets. It assigns 80% of the data to the training set (xTrain and yTrain) and 20% to the testing set (xTest and yTest). The random_state parameter ensures reproducibility by fixing the random seed used for the data shuffling.

The main purpose of the train-test split is to partition the dataset into two subsets: one for training the model and the other for evaluating its performance. The training set is used to fit the model, while the testing set is used to assess its generalization ability on unseen data.
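
With very few fraud cases, a purely random split can leave the test set with a different fraud rate than the training set. One option, not used in the code above, is the stratify argument of train_test_split. The sketch below uses made-up data (1000 rows, 20 labelled as fraud) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 1000 rows, 20 of them labelled as fraud (class 1).
rng = np.random.default_rng(42)
x_toy = rng.normal(size=(1000, 5))
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1

# stratify=y_toy keeps the class proportions the same in both subsets.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Fraud share in train:", y_tr.mean())
print("Fraud share in test:", y_te.mean())
```

Both subsets end up with the same fraud share as the full toy dataset.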

Code:

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(xTrain, yTrain)
yPred = rfc.predict(xTest)

Code Explanation: This code trains a Random Forest Classifier (RandomForestClassifier) on the training data (xTrain and yTrain) and then predicts the classes of the test data (xTest). The classifier is instantiated without any hyperparameters specified, so it uses the default settings.
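
Because the default classifier is unseeded, repeated runs can give slightly different results. The sketch below shows one possible configuration with a fixed seed and class weighting. The synthetic data from make_classification and the choice of class_weight="balanced" are assumptions for illustration, not settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the credit card data (about 5% positives).
x_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_syn, y_syn, test_size=0.2, random_state=0, stratify=y_syn
)

model = RandomForestClassifier(
    n_estimators=100,         # number of trees (the current scikit-learn default)
    class_weight="balanced",  # assumption: weight classes by inverse frequency
    random_state=0,           # fixes bootstrap and feature sampling for repeatability
    n_jobs=-1,                # use all available CPU cores
)
model.fit(x_tr, y_tr)

predicted = model.predict(x_te)                      # hard 0/1 labels
fraud_probability = model.predict_proba(x_te)[:, 1]  # estimated P(class == 1)
print(predicted[:10])
print(fraud_probability[:10])
```

Working with predict_proba instead of predict makes it possible to choose a decision threshold other than 0.5, for example to trade precision for recall.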

Code:

from sklearn.metrics import classification_report, accuracy_score  
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score, matthews_corrcoef
from sklearn.metrics import confusion_matrix

n_outliers = len(fraud)
n_errors = (yPred != yTest).sum()
print("The model used is Random Forest classifier")

acc = accuracy_score(yTest, yPred)
print("The accuracy is {}".format(acc))

prec = precision_score(yTest, yPred)
print("The precision is {}".format(prec))

rec = recall_score(yTest, yPred)
print("The recall is {}".format(rec))

f1 = f1_score(yTest, yPred)
print("The F1-Score is {}".format(f1))

MCC = matthews_corrcoef(yTest, yPred)
print("The Matthews correlation coefficient is {} ".format(MCC))

Code Explanation: This code evaluates the performance of the Random Forest Classifier using accuracy, precision, recall, F1-score, and the Matthews correlation coefficient, and prints each value. It also computes the number of outliers (fraudulent transactions) and the number of prediction errors, although these two values are not printed. The classification_report and confusion_matrix functions are imported but not used in this snippet.
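
To see what these metrics measure, they can be computed by hand from the four confusion-matrix counts. The sketch below is illustrative only: the helper name metrics_from_counts and the counts are hypothetical, and undefined ratios are returned as 0.0 by convention.

```python
import math

def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, F1, and MCC from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Hypothetical counts for 10,000 test rows that contain 100 frauds.
always_legit = metrics_from_counts(tp=0, fp=0, fn=100, tn=9900)
useful_model = metrics_from_counts(tp=80, fp=10, fn=20, tn=9890)

for name, result in [("always legit", always_legit), ("useful model", useful_model)]:
    print(name, {key: round(value, 4) for key, value in result.items()})
```

A model that labels every transaction as legitimate still scores 99% accuracy on these counts while catching no fraud at all, which is why recall, F1-score, and the Matthews correlation coefficient matter for imbalanced data.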

Output:

Metrics

Conclusion

The Credit Card Fraud Detection project utilized machine learning techniques, particularly the Random Forest Classifier, to identify fraudulent transactions in credit card data. The analysis involved preprocessing the dataset, splitting it into training and testing sets, and training the model on the training data. The model’s performance was evaluated using various metrics, including accuracy, precision, recall, F1-score, and Matthews correlation coefficient.

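Overall, the Random Forest Classifier demonstrated promising results in detecting fraudulent transactions. Because fraudulent transactions are rare outliers in the data, precision, recall, F1-score, and the Matthews correlation coefficient are more informative indicators of its effectiveness than accuracy alone.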

The credit card fraud detection project underscores the importance of proactive measures in combating fraud in the financial sector. Through the implementation of sophisticated machine learning algorithms and continuous monitoring of transactional data, financial institutions can enhance their fraud detection capabilities, safeguarding the interests of customers and maintaining the integrity of the financial system.
