Step-by-Step Guide: Credit Card Fraud Detection for Beginners Using XGBoost

May 23, 2024

Hi there! If you’re like me, you’ve probably struggled to find a complete, beginner-friendly guide for an end-to-end machine learning project. That’s why I decided to write this guide. Welcome to this step-by-step journey on detecting fraudulent credit card transactions using machine learning. We’ll go through each step together, from loading the data to evaluating our model. Let’s dive in and make this as straightforward as possible!

The dataset we’ll use is the Credit Card Fraud Detection dataset, available on Kaggle.

Step 1: Import Libraries

First, we need to import the necessary libraries.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import matplotlib.pyplot as plt

Step 2: Load the Dataset

Next, we’ll load the dataset and examine the first few rows to understand its structure.

# Load the dataset
data = pd.read_csv('/mnt/data/creditcard.csv')
data.head()

Output:

 Time    V1    V2    V3    V4    V5    V6    V7    V8    V9   V10  ...  V20  \
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599  0.098698  0.363787  0.090794  ... -0.509655   
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803  0.085102 -0.255425 -0.166974  ... -0.287924   
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461  0.247676 -1.514654  0.207643  ... -0.559825   
3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609  0.377436 -1.387024 -0.054952  ... -0.554088   
4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941 -0.270533  0.817739 -0.466751  ... -0.539627   

    V21   V22   V23   V24   V25   V26   V27   V28  Amount  Class  
0  0.115862 -0.144233  0.004713  0.104091  0.021491  0.021050 -0.018307  0.277838  149.62      0  
1  0.115046 -0.124201  0.053080  0.100683  0.068848  0.021490  0.021049  0.044742    2.69      0  
2 -0.174580  0.051824  0.271170  0.065653  0.059334  0.021349 -0.007011  0.001565  378.66      0  
3  0.017141  0.084379  0.314464  0.114338  0.065615  0.021123  0.021016  0.011654  123.50      0  
4 -0.108300  0.005274  0.189115  0.133558  0.050680  0.021492 -0.018307  0.278197   69.99      0  

[5 rows x 31 columns]

Step 3: Check for Missing Values

It’s essential to check for missing values in the dataset to ensure data quality.

# Check for missing values in the dataset
missing_values = data.isnull().sum()
missing_values

Output:

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

Step 4: Feature Scaling

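Unlike the V1 to V28 columns, ‘Amount’ and ‘Time’ are on very different scales, so we standardize these two features with StandardScaler. Tree-based models such as XGBoost do not strictly need scaled inputs, but scaling keeps the features comparable if you later try other models.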

# Feature scaling
scaler = StandardScaler()
data['scaled_amount'] = scaler.fit_transform(data['Amount'].values.reshape(-1, 1))
data['scaled_time'] = scaler.fit_transform(data['Time'].values.reshape(-1, 1))
data = data.drop(['Time', 'Amount'], axis=1)
data.head()
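One caveat with the snippet above: the scaler is fit on the full dataset, so statistics from rows that later land in the test set influence the transformation. Here is a minimal sketch of the alternative, fitting the scaler on the training rows only. It runs on a small synthetic frame, and the values and the names `toy`, `train_df`, and `test_df` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 'Amount' and 'Time' columns (illustrative values only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Amount": rng.exponential(scale=80.0, size=500),
    "Time": np.sort(rng.uniform(0, 172_800, size=500)),
})

# Split first, so the test rows never influence the scaler
train_df, test_df = train_test_split(toy, test_size=0.2, random_state=42)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_df[["Amount", "Time"]])  # learn mean/std on train
test_scaled = scaler.transform(test_df[["Amount", "Time"]])        # reuse them on test

print(train_scaled.mean(axis=0).round(6))  # close to 0 by construction
print(test_scaled.mean(axis=0).round(3))   # near 0, but usually not exactly 0
```

For a tree-based model the difference is likely small, but the habit matters as soon as you swap in a model that is sensitive to feature scale.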

Step 5: Define Features and Target

# Define features and target
X = data.drop('Class', axis=1)
y = data['Class']

Step 6: Split the Data

We split the dataset into training and test sets to evaluate the model’s performance on unseen data.

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f'Training set size: {X_train.shape[0]}')
print(f'Testing set size: {X_test.shape[0]}')

Output:

Training set size: 227845
Testing set size: 56962
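With a rare positive class, a purely random split can leave the two sets with slightly different fraud rates. One option is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both sets. The sketch below uses synthetic labels (`X_toy` and `y_toy` are placeholders, not the fraud data) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 1,000 rows, 20 of them positive (2% "fraud"), illustrative only
y_toy = np.array([1] * 20 + [0] * 980)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

# stratify=y_toy keeps the class proportions equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f"Positives in train: {y_tr.sum()} of {len(y_tr)}")
print(f"Positives in test:  {y_te.sum()} of {len(y_te)}")
```

In the guide's own code, this would mean adding `stratify=y` to the existing call. Note that doing so changes which rows end up in the test set, so the numbers reported later would shift.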

Step 7: Train the Model

We’ll use the XGBoost classifier for this task. It builds an ensemble of decision trees one after another, with each new tree correcting the errors of the previous ones, and it is widely used for tabular data like ours.

# Initialize and train the XGBClassifier
model = XGBClassifier(n_estimators=100, random_state=42, use_label_encoder=False, eval_metric='logloss')
model.fit(X_train, y_train)
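Because fraud is so rare, one commonly used option is XGBoost's `scale_pos_weight` parameter, for which the ratio of negative to positive examples is a typical starting value. The sketch below only computes that ratio and collects the settings in a dictionary. The helper `positive_class_weight` and the labels `y_toy` are hypothetical, and this guide has not tested whether the setting improves the results.

```python
import numpy as np

def positive_class_weight(labels):
    """Return the negatives-to-positives ratio, a common starting value
    for XGBoost's scale_pos_weight parameter."""
    labels = np.asarray(labels)
    n_pos = int((labels == 1).sum())
    n_neg = int((labels == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive examples in labels")
    return n_neg / n_pos

# Synthetic labels for illustration: 5 positives among 1,000 rows
y_toy = np.array([1] * 5 + [0] * 995)

xgb_params = {
    "n_estimators": 100,
    "random_state": 42,
    "eval_metric": "logloss",
    "scale_pos_weight": positive_class_weight(y_toy),
}
print(xgb_params["scale_pos_weight"])
```

With the real data, you would compute the ratio from `y_train` and pass the dictionary as `XGBClassifier(**xgb_params)`.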

Step 8: Make Predictions

After training the model, we make predictions on the test set.

# Perform predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
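For a binary classifier, `predict` effectively labels a transaction as fraud when its predicted probability is at least 0.5. If missing a fraud costs more than a false alarm, you can apply your own cutoff to the probabilities instead. Here is a small sketch with made-up probabilities standing in for `y_prob`; the 0.3 cutoff is arbitrary.

```python
import numpy as np

# Hypothetical predicted fraud probabilities, standing in for y_prob
probs = np.array([0.02, 0.10, 0.35, 0.55, 0.91])

default_pred = (probs >= 0.5).astype(int)  # roughly what predict() does for a binary model
lenient_pred = (probs >= 0.3).astype(int)  # a lower threshold flags more transactions

print(default_pred)
print(lenient_pred)
```

Lowering the threshold generally raises recall and lowers precision, so the right value depends on the cost of each kind of mistake.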

Step 9: Evaluate the Model

We evaluate the model using various metrics, including the classification report, confusion matrix, and ROC AUC score.

# Evaluate the model
classification_rep = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)

print("Classification Report:")
print(classification_rep)
print("Confusion Matrix:")
print(conf_matrix)
print("ROC AUC Score:")
print(roc_auc)

Output:

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.87      0.84      0.86       98

    accuracy                           1.00     56962
   macro avg       0.94      0.92      0.93     56962
weighted avg       1.00      1.00      1.00     56962

Confusion Matrix:
[[56857     7]
 [   16    82]]

ROC AUC Score:
0.9526095710497782
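ROC AUC can look optimistic when positives are rare, so average precision (a summary of the precision-recall curve) is often reported alongside it. The sketch below computes both on a tiny hand-made example. The arrays are illustrative and are not results from the fraud model.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Tiny hand-made example (not results from the fraud model)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

roc_auc_toy = roc_auc_score(y_true, y_score)
pr_auc_toy = average_precision_score(y_true, y_score)

print(f"ROC AUC:           {roc_auc_toy:.3f}")
print(f"Average precision: {pr_auc_toy:.3f}")
```

On the real data, the equivalent call would be `average_precision_score(y_test, y_prob)`.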

Step 10: Plot the ROC Curve

Finally, we plot the ROC curve to visualize the model’s performance.

# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, label=f'XGBoost (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

In this project, we aimed to build and evaluate a machine learning model to detect fraudulent credit card transactions using the XGBoost algorithm. The steps involved data preprocessing, feature scaling, model training, and evaluation.

Summary of Findings

1. Data Overview

The dataset contains transactions made by European cardholders in September 2013.
It includes 284,807 transactions and 31 columns: 30 features plus the Class label.
Only 0.17% of transactions are fraudulent, indicating a highly imbalanced dataset.
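A quick way to check the class balance yourself is `value_counts`. The sketch below uses a synthetic series (`toy_class`, with made-up counts) in place of `data['Class']`.

```python
import pandas as pd

# Synthetic stand-in for data['Class']: 3 fraud rows among 1,000 (illustrative only)
toy_class = pd.Series([1] * 3 + [0] * 997, name="Class")

counts = toy_class.value_counts()                # rows per class
shares = toy_class.value_counts(normalize=True)  # fraction per class

print(counts)
print(f"Fraud share: {shares.loc[1]:.2%}")
```

Running the same two calls on `data['Class']` should reproduce the fraud share quoted above.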

2. Model Performance

Classification Report:

Precision: A precision of 0.87 for the fraud class means that about 87% of the transactions the model flags as fraudulent are actually fraudulent.
Recall: A recall of 0.84 means the model catches 84% of the actual fraudulent transactions.
F1-Score: The F1-score (0.86) is the harmonic mean of precision and recall, summarizing both in a single number.

Confusion Matrix:

True Positives (TP): 82
True Negatives (TN): 56857
False Positives (FP): 7
False Negatives (FN): 16
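As a sanity check, precision, recall, and F1 can be recomputed by hand from the four counts listed above. Worked from these counts, recall matches the report (0.84), but precision comes out near 0.92 rather than 0.87. The two printed outputs may come from different runs, so it is worth recomputing both on your own machine.

```python
# Counts taken from the confusion matrix above
tp, tn, fp, fn = 82, 56857, 7, 16

precision = tp / (tp + fp)  # share of flagged transactions that are fraud
recall = tp / (tp + fn)     # share of fraud that was flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.4f}")
```

Accuracy is close to 1.0 mainly because almost every transaction is legitimate, which is why the fraud-class metrics are the ones to watch.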

ROC AUC Score:

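The ROC AUC score (0.95) means that, about 95% of the time, the model assigns a higher fraud probability to a randomly chosen fraudulent transaction than to a randomly chosen legitimate one. Because the classes are so imbalanced, it is best read alongside precision and recall rather than on its own.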

Deep Analysis

Hyperparameter Tuning

XGBoost has several hyperparameters that can be tuned to improve model performance. Techniques like Grid Search or Random Search try combinations of values and keep the one that scores best under cross-validation.

Key Hyperparameters:

n_estimators: Number of boosting rounds (trees).
max_depth: Maximum depth of each tree, controlling model complexity.
learning_rate: Step size shrinkage applied to each tree’s contribution, used to prevent overfitting.
subsample: Fraction of training rows sampled for each tree.
colsample_bytree: Fraction of features sampled for each tree.
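Here is a rough sketch of Random Search using scikit-learn's `RandomizedSearchCV`. To keep it small and dependency-light, it swaps in scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier` and runs on a synthetic dataset. The candidate values are arbitrary starting points, not recommendations, and `colsample_bytree` is left out because the stand-in does not have that parameter.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic, imbalanced dataset (about 10% positives), illustrative only
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=42
)

# Candidate values are arbitrary starting points, not tuned recommendations
param_distributions = {
    "n_estimators": [25, 50],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1, 0.3],
    "subsample": [0.7, 1.0],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),  # stand-in for XGBClassifier
    param_distributions=param_distributions,
    n_iter=4,                     # number of random combinations to try
    scoring="average_precision",  # suited to rare positive classes
    cv=3,
    random_state=42,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(f"Best CV average precision: {search.best_score_:.3f}")
```

Since `XGBClassifier` follows the scikit-learn estimator interface, the same pattern should carry over by replacing the estimator and adding `colsample_bytree` to the search space.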

Model Comparison

Comparing the performance of different models can provide insights into the strengths and weaknesses of each approach.

Potential Models for Comparison:

Logistic Regression: A simple yet effective model for binary classification tasks.
Random Forest: An ensemble of decision trees that can be adapted to imbalanced data (for example, through class weights) and provides feature importances.
Neural Networks: Deep learning models that can capture complex patterns in data.
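One simple way to run such a comparison is to score each candidate with the same cross-validation folds and the same metric. The sketch below does this for Logistic Regression and Random Forest on a synthetic imbalanced dataset. The data, the settings, and the choice of average precision as the metric are assumptions for illustration, and a neural network is left out to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced dataset (about 10% positives), illustrative only
X_toy, y_toy = make_classification(
    n_samples=800, n_features=10, weights=[0.9, 0.1], random_state=0
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "random_forest": RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=0
    ),
}

# Score every model with the same folds and the same metric
results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_toy, y_toy, cv=3, scoring="average_precision")
    results[name] = scores.mean()
    print(f"{name}: mean average precision = {results[name]:.3f}")
```

On the real data, you would pass `X_train` and `y_train` and could add the XGBoost model to the same dictionary.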

By following these strategies and continuously improving the model, we can achieve a more robust and accurate fraud detection system, ultimately helping to prevent financial losses due to fraudulent transactions.