A Comprehensive Guide to Credit Card Fraud Detection: Leveraging Data Science from Start to Finish

Chaitanya Gawande
8 min read · Sep 1, 2023


Introduction

Credit card fraud is an ever-escalating issue that costs businesses and individuals billions of dollars annually. As technology advances, so does the skill set of fraudsters. Therefore, it’s more important than ever to leverage sophisticated techniques for detecting fraudulent activities.

The aim of this article is to present a comprehensive, step-by-step guide to tackling credit card fraud detection using data science methodologies. Through a detailed analysis, we aim to construct a machine learning model that identifies fraudulent transactions reliably; on data this imbalanced, precision and recall matter far more than raw accuracy.

To achieve this, we will employ the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology, a structured approach to planning and executing data science tasks. The CRISP-DM model consists of six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.

Explaining the Dataset

The dataset is a collection of credit card transactions made over two days in September 2013 by European cardholders. It is a publicly available dataset from Kaggle with the following specifications:

Number of Rows: 284,807 transactions
Number of Columns: 31 (30 feature columns plus the “Class” target)

Anonymized Features: Due to confidentiality reasons, the majority of the dataset’s features (V1 to V28) are principal components obtained through PCA (Principal Component Analysis).

Time and Amount Features: The “Time” feature captures the seconds elapsed between each transaction and the first transaction. “Amount” is the transaction amount for each record.

Target Variable: The “Class” feature is our target variable, where ‘1’ indicates a fraudulent transaction and ‘0’ signifies a legitimate transaction.
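For reference, loading the data and verifying these figures takes only a few lines (the file path here is an assumption):

import pandas as pd

# Load the Kaggle "Credit Card Fraud Detection" dataset (path assumed)
df = pd.read_csv('creditcard.csv')

print(df.shape)                    # (284807, 31)
print(df['Class'].value_counts())  # 0 -> 284315 legitimate, 1 -> 492 fraudulent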

Exploratory Data Analysis (EDA)


Our initial step was to understand the data distribution, particularly the imbalance between fraudulent and legitimate transactions. Fraudulent transactions account for only 492 of the 284,807 records (about 0.17%), highlighting the challenge of imbalanced classification.

Correlation Heatmap

The heatmap allowed us to observe the relationships between different features. We noticed that certain features like V14, V10, and V17 have strong negative correlations with the "Class" label, making them important features for our model.
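A minimal sketch of how such a heatmap can be produced with seaborn, assuming the data is already loaded into df:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix across all features, including the Class label
corr = df.corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap')
plt.show()

# Features most negatively correlated with Class
print(corr['Class'].sort_values().head())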

Box Plots for Features


The box plots provided insights into the distribution and potential outliers among the anonymized features, crucial for understanding the data’s underlying structure.
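One way to draw these, sketched here with pandas’ built-in plotting:

import matplotlib.pyplot as plt

# One box per anonymized feature, to surface skew and extreme values
anonymized = [f'V{i}' for i in range(1, 29)]
df[anonymized].plot(kind='box', figsize=(16, 6), rot=90)
plt.title('Box Plots of Anonymized Features')
plt.show()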

Histograms for Time and Amount

The histograms for “Time” and “Amount” demonstrated the distribution and frequency of transactions over time and across different transaction amounts, respectively.
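A sketch of these two histograms side by side:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 4))
df['Time'].hist(bins=50, ax=axes[0])
axes[0].set_title('Time (seconds since first transaction)')
df['Amount'].hist(bins=50, ax=axes[1])
axes[1].set_title('Transaction Amount')
plt.tight_layout()
plt.show()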

Data Preparation

Addressing Imbalance through Oversampling
We used manual oversampling to duplicate instances of the minority class (fraudulent transactions), creating a balanced training dataset. Crucially, the oversampling is applied only after the train/test split, so duplicated fraud records never leak into the test set, and the model is not biased toward predicting only the majority class.

import pandas as pd
from sklearn.model_selection import train_test_split

# Redefine features and target variable from the dataset
X = df.drop('Class', axis=1)
y = df['Class']

# Split the data into training and testing sets, stratified so both sets
# preserve the original fraud rate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Identify the minority and majority classes
majority_class = y_train.value_counts().idxmax()
minority_class = y_train.value_counts().idxmin()

# Separate the majority and minority classes in the training dataset
X_train_majority = X_train[y_train == majority_class]
X_train_minority = X_train[y_train == minority_class]
y_train_majority = y_train[y_train == majority_class]
y_train_minority = y_train[y_train == minority_class]

# Oversample the minority class (with replacement) to match the number of
# observations in the majority class, then pick the labels by the same
# sampled index so features and labels stay aligned
X_train_minority_oversampled = X_train_minority.sample(n=len(X_train_majority), replace=True, random_state=42)
y_train_minority_oversampled = y_train_minority.loc[X_train_minority_oversampled.index]

# Combine the majority class with the upsampled minority class
X_train_oversampled = pd.concat([X_train_majority, X_train_minority_oversampled])
y_train_oversampled = pd.concat([y_train_majority, y_train_minority_oversampled])

# Check the new distribution of the target variable in the training set
y_train_oversampled.value_counts(normalize=True)

Distribution of Target Variable in Oversampled Training Set

The bar chart confirms that we have successfully balanced the classes in the training dataset. Both fraudulent (Class 1) and non-fraudulent (Class 0) transactions are now equally represented, each constituting 50% of the data.

With this balanced training set, we’re better positioned to train a machine learning model that can effectively identify both classes.

Feature Selection
Initial modeling using Random Forest indicated that not all features were equally important for predicting fraud. Therefore, we focused on a subset of features that Random Forest identified as the most important, namely V14, V10, V4, V12, and V17.
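A minimal sketch of how this ranking can be obtained from a fitted Random Forest (the fitting setup here, including the tree count, is an assumption):

from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Fit a Random Forest on the balanced training set and rank features by
# impurity-based importance (this can take a while on ~450k rows)
fs_model = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
fs_model.fit(X_train_oversampled, y_train_oversampled)

importances = pd.Series(fs_model.feature_importances_, index=X_train_oversampled.columns)
print(importances.sort_values(ascending=False).head(5))

# The five features we carry forward into modeling
important_features = ['V14', 'V10', 'V4', 'V12', 'V17']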

Outlier Analysis and Processing
Outliers can significantly affect the performance of machine learning models. We calculated Z-scores for the selected features and capped outliers to minimize their influence on the model.
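A sketch of this step, assuming a conventional |z| > 3 cutoff (the exact threshold and data slice used are assumptions, so the printed counts may differ from those below):

# Flag values with |z| > 3 in each selected feature, then cap them at the
# +/- 3 standard deviation boundaries
for feature in important_features:
    values = X_train_oversampled[feature]
    mean, std = values.mean(), values.std()
    z_scores = (values - mean) / std
    n_outliers = int((z_scores.abs() > 3).sum())
    print(f'{feature}: {n_outliers} outliers')
    X_train_oversampled[feature] = values.clip(mean - 3 * std, mean + 3 * std)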

Here are the counts of outliers detected in the top 5 important features within the subset of the balanced training data:

  • V14: 132 outliers
  • V10: 393 outliers
  • V4: 27 outliers
  • V12: 227 outliers
  • V17: 412 outliers

Modeling

Logistic Regression Model & Evaluation Metrics
Logistic Regression served as our baseline model. While it showed high recall, its precision was low, leading to many false positives.

from sklearn.metrics import classification_report, roc_auc_score
from sklearn.linear_model import LogisticRegression

# Select the top 5 important features for modeling
X_train_selected = X_train_oversampled[important_features]
X_test_selected = X_test[important_features]

# Initialize the Logistic Regression model
log_reg_model = LogisticRegression(random_state=42)

# Fit the model on the balanced training set with selected features
log_reg_model.fit(X_train_selected, y_train_oversampled)

# Make predictions on the test set
y_pred_log_reg = log_reg_model.predict(X_test_selected)

# Evaluate the model
classification_rep_log_reg = classification_report(y_test, y_pred_log_reg, target_names=['Non-Fraudulent', 'Fraudulent'])
roc_auc_log_reg = roc_auc_score(y_test, y_pred_log_reg)

classification_rep_log_reg, roc_auc_log_reg

Here are the key metrics for the Logistic Regression model:

  • Precision: The model has a precision of 0.06 for detecting fraudulent transactions, which is quite low.
  • Recall: The recall for fraudulent transactions is 0.93, which is quite high. This means the model is capturing most of the positive (fraudulent) cases.
  • F1-Score: The F1-score for fraudulent transactions is 0.11, indicating that there is room for improvement in the balance between precision and recall.
  • AUC-ROC: The area under the ROC curve is approximately 0.951, which is a good indicator of the model’s ability to distinguish between the two classes.
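One caveat on that last number: roc_auc_score above was computed on hard 0/1 predictions rather than scores. Passing predicted probabilities instead traces the full ROC curve, for example:

# ROC AUC from predicted probabilities rather than hard class labels
y_prob_log_reg = log_reg_model.predict_proba(X_test_selected)[:, 1]
print(roc_auc_score(y_test, y_prob_log_reg))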

Random Forest Model & Evaluation Metrics

Given that Random Forest models can be computationally intensive, especially on large datasets, we’ll use a subset of the balanced training set to fit the model initially. This will help us gauge the performance before fine-tuning.

We’ll use the top 5 important features (V14, V10, V4, V12, V17) for training the model and evaluate it using the same metrics as before: Precision, Recall, F1-Score, and AUC-ROC.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Sample a subset of the balanced training data for model training.
# Sampling row positions once keeps features and labels aligned, even though
# oversampling has left duplicate index labels in the training set.
model_subset_size = int(0.1 * len(X_train_oversampled))  # 10% of the data
subset_positions = np.random.RandomState(42).choice(len(X_train_oversampled), size=model_subset_size, replace=False)
X_model_subset = X_train_oversampled.iloc[subset_positions][important_features]
y_model_subset = y_train_oversampled.iloc[subset_positions]

# Initialize the Random Forest Classifier
rf_model = RandomForestClassifier(random_state=42)

# Fit the model on the subset of the balanced training set with selected features
rf_model.fit(X_model_subset, y_model_subset)

# Make predictions on the test set
y_pred_rf = rf_model.predict(X_test_selected)

# Evaluate the model
classification_rep_rf = classification_report(y_test, y_pred_rf, target_names=['Non-Fraudulent', 'Fraudulent'])
roc_auc_rf = roc_auc_score(y_test, y_pred_rf)

classification_rep_rf, roc_auc_rf

Here are the key metrics for the Random Forest model:

  • Precision: The model has a precision of 0.62 for detecting fraudulent transactions, which is an improvement over the Logistic Regression model.
  • Recall: The recall for fraudulent transactions is 0.87, still quite high, meaning the model is capturing a large portion of the positive (fraudulent) cases.
  • F1-Score: The F1-score for fraudulent transactions has improved to 0.72, indicating a better balance between precision and recall compared to the Logistic Regression model.
  • AUC-ROC: The area under the ROC curve is approximately 0.933, which is also a good indicator of the model’s ability to distinguish between the two classes.

Hyperparameter Tuning

Hyperparameter tuning involves adjusting the settings of a machine learning algorithm to optimize its performance. For Random Forest, some key hyperparameters include:

n_estimators: The number of trees in the forest.
max_depth: The maximum depth of the trees.
min_samples_split: The minimum number of samples required to split an internal node.
min_samples_leaf: The minimum number of samples required to be at a leaf node.
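Scikit-learn’s GridSearchCV automates this search. A minimal sketch of the simplified search (the exact grid and cross-validation settings used are assumptions):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# A small, assumed grid over the two hyperparameters tuned here
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [10, 20, None],
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring='f1',  # optimize the precision/recall balance
    cv=3,
    n_jobs=-1,
)
grid_search.fit(X_model_subset, y_model_subset)

best_params_simplified = grid_search.best_params_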

Results:
The simplified grid search has identified the following optimal hyperparameters for the Random Forest model:

max_depth: 20
n_estimators: 50

These hyperparameters are expected to yield the best F1-score based on the subset of the balanced training set.

Modeling — Hyperparameter Tuning

We’ll train the model on a subset of the balanced training data with the optimized hyperparameters and then evaluate its performance on the test set. The evaluation metrics will be the same as before: Precision, Recall, F1-Score, and AUC-ROC.

Let’s go ahead and evaluate the tuned Random Forest model.

# Initialize the Random Forest Classifier with the best hyperparameters
tuned_rf_model = RandomForestClassifier(n_estimators=best_params_simplified['n_estimators'],
                                        max_depth=best_params_simplified['max_depth'],
                                        random_state=42)

# Fit the model on the subset of the balanced training set with selected features
tuned_rf_model.fit(X_model_subset, y_model_subset)

# Make predictions on the test set
y_pred_tuned_rf = tuned_rf_model.predict(X_test_selected)

# Evaluate the model
classification_rep_tuned_rf = classification_report(y_test, y_pred_tuned_rf, target_names=['Non-Fraudulent', 'Fraudulent'])
roc_auc_tuned_rf = roc_auc_score(y_test, y_pred_tuned_rf)

classification_rep_tuned_rf, roc_auc_tuned_rf

Tuned Random Forest Model Evaluation:

Here are the key metrics for the Random Forest model with tuned hyperparameters:

  • Precision: The model has a precision of 0.60 for detecting fraudulent transactions, which is slightly lower than the untuned Random Forest model.
  • Recall: The recall for fraudulent transactions is 0.86, which is still high.
  • F1-Score: The F1-score for fraudulent transactions is 0.71, which is comparable to the untuned Random Forest model.
  • AUC-ROC: The area under the ROC curve is approximately 0.928, which is a good indicator of the model’s ability to distinguish between the two classes.

The tuned Random Forest model shows good performance in terms of recall and F1-score, making it a strong candidate for our final model.

Evaluation — Final Recommendations

After going through the CRISP-DM methodology, we have arrived at the following insights and recommendations:

Key Insights:

  1. Data Imbalance: The dataset was highly imbalanced, with fraudulent transactions making up a very small proportion. We used manual oversampling to balance the classes in the training set.
  2. Important Features: Features like V14, V10, V4, V12, and V17 were identified as the most important for predicting fraudulent transactions.
  3. Model Performance: Random Forest outperformed Logistic Regression on precision and F1-score (recall dipped slightly, from 0.93 to 0.87). The tuned Random Forest model achieved an F1-score of 0.71 and an AUC-ROC of approximately 0.928.

Recommendations:

  1. Model for Deployment: The tuned Random Forest model is recommended for detecting fraudulent transactions due to its superior performance.
  2. Feature Focus: Emphasize the important features (V14, V10, V4, V12, V17) for any future data collection or feature engineering.
  3. Continuous Monitoring: Given the nature of fraud detection, it is advisable to continuously monitor the model’s performance and update it as new data becomes available.
