Step-by-Step Guide: Credit Card Fraud Detection for Beginners Using XGBoost
Hi there! If you’re like me, you’ve probably struggled to find a complete, beginner-friendly guide for an end-to-end machine learning project. That’s why I decided to write this guide. Welcome to this step-by-step journey on detecting fraudulent credit card transactions using machine learning. We’ll go through each step together, from loading the data to evaluating our model. Let’s dive in and make this as straightforward as possible!
You can find the dataset we’ll use on Kaggle here: Credit Card Fraud Detection Dataset.
Step 1: Import Libraries
First, we need to import the necessary libraries.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
Step 2: Load the Dataset
Next, we’ll load the dataset and examine the first few rows to understand its structure.
# Load the dataset (adjust the path to wherever you saved creditcard.csv)
data = pd.read_csv('/mnt/data/creditcard.csv')
data.head()
Output:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V20 \
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 0.090794 ... -0.509655
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 -0.166974 ... -0.287924
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 0.207643 ... -0.559825
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 -0.054952 ... -0.554088
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 -0.466751 ... -0.539627
V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0.115862 -0.144233 0.004713 0.104091 0.021491 0.021050 -0.018307 0.277838 149.62 0
1 0.115046 -0.124201 0.053080 0.100683 0.068848 0.021490 0.021049 0.044742 2.69 0
2 -0.174580 0.051824 0.271170 0.065653 0.059334 0.021349 -0.007011 0.001565 378.66 0
3 0.017141 0.084379 0.314464 0.114338 0.065615 0.021123 0.021016 0.011654 123.50 0
4 -0.108300 0.005274 0.189115 0.133558 0.050680 0.021492 -0.018307 0.278197 69.99 0
[5 rows x 31 columns]
Step 3: Check for Missing Values
It’s essential to check for missing values in the dataset to ensure data quality.
# Check for missing values in the dataset
missing_values = data.isnull().sum()
missing_values
Output:
Time 0
V1 0
V2 0
V3 0
V4 0
V5 0
V6 0
V7 0
V8 0
V9 0
V10 0
V11 0
V12 0
V13 0
V14 0
V15 0
V16 0
V17 0
V18 0
V19 0
V20 0
V21 0
V22 0
V23 0
V24 0
V25 0
V26 0
V27 0
V28 0
Amount 0
Class 0
dtype: int64
Step 4: Feature Scaling
We need to scale the ‘Amount’ and ‘Time’ features to bring them to a comparable range. We’ll use the StandardScaler for this purpose.
# Feature scaling
scaler = StandardScaler()
data['scaled_amount'] = scaler.fit_transform(data['Amount'].values.reshape(-1, 1))
data['scaled_time'] = scaler.fit_transform(data['Time'].values.reshape(-1, 1))
data = data.drop(['Time', 'Amount'], axis=1)
data.head()
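To make the scaling step more concrete, here is a minimal sketch using the five Amount values from the Step 2 output. It checks that StandardScaler does the same thing as subtracting the column mean and dividing by the (population) standard deviation. The variable names are made up for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The five Amount values shown in the Step 2 output
amounts = np.array([149.62, 2.69, 378.66, 123.50, 69.99]).reshape(-1, 1)

scaler = StandardScaler()
scaled = scaler.fit_transform(amounts)

# Same result by hand: z = (x - mean) / std, using the population std (ddof=0)
manual = (amounts - amounts.mean()) / amounts.std()

print(np.allclose(scaled, manual))  # True
```

In a stricter setup, the scaler would be fitted on the training split only and then applied to the test split with transform, so that no test-set statistics leak into preprocessing. For tree-based models such as XGBoost the effect is usually small, because trees split on the ordering of feature values rather than their scale.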
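Step 5: Define Features and Target
Here we separate the input columns (X) from the label column, Class (y), where 1 marks a fraudulent transaction.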
# Define features and target
X = data.drop('Class', axis=1)
y = data['Class']
Step 6: Split the Data
We split the dataset into training and test sets to evaluate the model’s performance on unseen data.
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f'Training set size: {X_train.shape[0]}')
print(f'Testing set size: {X_test.shape[0]}')
Output:
Training set size: 227845
Testing set size: 56962
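With so few fraud cases, a purely random split can leave the training and test sets with slightly different fraud rates. One option worth considering is the stratify argument of train_test_split, which keeps the class proportions the same in both splits. The sketch below uses synthetic labels with roughly the dataset's imbalance; X_demo and y_demo are stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100,000 rows, 170 of them positive (0.17%)
rng = np.random.default_rng(42)
y_demo = np.zeros(100_000, dtype=int)
y_demo[:170] = 1
X_demo = rng.normal(size=(100_000, 3))

# stratify=y_demo keeps the positive share the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Fraud share in train: {y_tr.mean():.4%}")
print(f"Fraud share in test:  {y_te.mean():.4%}")
```

Applied to this project, that would mean adding stratify=y to the call in Step 6. Note that doing so changes which rows land in the test set, so the numbers reported in Step 9 would shift somewhat.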
Step 7: Train the Model
We’ll use the XGBoost classifier for this task. It’s a powerful gradient boosting algorithm known for its efficiency and performance.
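# Initialize and train the XGBClassifier
# Note: use_label_encoder only matters on older XGBoost releases; newer versions ignore it and may emit a warning
model = XGBClassifier(n_estimators=100, random_state=42, use_label_encoder=False, eval_metric='logloss')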
model.fit(X_train, y_train)
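Because fraud is so rare, one commonly used adjustment is XGBoost's scale_pos_weight parameter, which gives positive examples more weight during training. A frequent starting value is the ratio of negative to positive training examples. The sketch below only computes that ratio on made-up labels; y_train_demo is a hypothetical stand-in for y_train, and whether the weighting actually helps here would need to be checked against the evaluation metrics.

```python
import numpy as np

# Hypothetical stand-in for y_train: 10,000 labels with 17 frauds (0.17%)
y_train_demo = np.zeros(10_000, dtype=int)
y_train_demo[:17] = 1

n_neg = int((y_train_demo == 0).sum())
n_pos = int((y_train_demo == 1).sum())
ratio = n_neg / n_pos

print(f"negatives={n_neg}, positives={n_pos}, ratio={ratio:.1f}")

# The ratio could then be passed to the classifier, for example:
# model = XGBClassifier(n_estimators=100, random_state=42,
#                       eval_metric='logloss', scale_pos_weight=ratio)
```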
Step 8: Make Predictions
After training the model, we make predictions on the test set.
# Perform predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
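For a binary classifier, the hard labels from predict correspond in effect to cutting the predicted fraud probability at 0.5. Since we also have the probabilities, we can apply a different cut-off ourselves. Here is a small sketch with made-up probabilities; y_prob_demo and the 0.3 threshold are illustrative assumptions, not tuned values.

```python
import numpy as np

# Hypothetical fraud probabilities, standing in for y_prob
y_prob_demo = np.array([0.02, 0.40, 0.55, 0.91, 0.30])

# Mirrors the default 0.5 cut-off
default_labels = (y_prob_demo > 0.5).astype(int)

# A lower cut-off flags more transactions as fraud
lower_threshold = 0.3
cautious_labels = (y_prob_demo >= lower_threshold).astype(int)

print(default_labels)   # [0 0 1 1 0]
print(cautious_labels)  # [0 1 1 1 1]
```

Lowering the threshold generally catches more fraud (higher recall) at the cost of more false alarms (lower precision), so the right value depends on how costly each kind of mistake is.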
Step 9: Evaluate the Model
We evaluate the model using various metrics, including the classification report, confusion matrix, and ROC AUC score.
# Evaluate the model
classification_rep = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)
print("Classification Report:")
print(classification_rep)
print("Confusion Matrix:")
print(conf_matrix)
print("ROC AUC Score:")
print(roc_auc)
Output:
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.87      0.84      0.86        98

    accuracy                           1.00     56962
   macro avg       0.94      0.92      0.93     56962
weighted avg       1.00      1.00      1.00     56962

Confusion Matrix:
[[56857     7]
 [   16    82]]
ROC AUC Score:
0.9526095710497782
Step 10: Plot the ROC Curve
Finally, we plot the ROC curve, which shows how the true positive rate trades off against the false positive rate as the decision threshold changes.
# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, label=f'XGBoost (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
In this project, we aimed to build and evaluate a machine learning model to detect fraudulent credit card transactions using the XGBoost algorithm. The steps involved data preprocessing, feature scaling, model training, and evaluation.
Summary of Findings
1. Data Overview
- The dataset contains transactions made by European cardholders in September 2013.
- It includes 284,807 transactions in 31 columns: Time, Amount, the anonymized features V1 to V28, and the Class label.
- Only 0.17% of transactions are fraudulent, indicating a highly imbalanced dataset.
2. Model Performance
Classification Report:
- Precision: A precision of 0.87 for the fraud class means that about 87% of the transactions the model flags as fraudulent are actually fraudulent.
- Recall: A recall of 0.84 means the model catches about 84% of the actual fraudulent transactions, so roughly one in six still slips through.
- F1-Score: The F1-score (0.86) balances precision and recall, providing a single metric for performance evaluation.
Confusion Matrix:
- True Positives (TP): 82
- True Negatives (TN): 56857
- False Positives (FP): 7
- False Negatives (FN): 16
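As a quick sanity check, precision, recall, F1, and accuracy can all be recomputed from these four counts. The sketch below does just that arithmetic. Recall comes out at about 0.84, matching the report. Precision from these counts is about 0.92, a little higher than the 0.87 in the classification report, so the two outputs may come from different runs; rerunning the notebook would confirm which figure applies.

```python
# Counts from the confusion matrix above
tp, tn, fp, fn = 82, 56857, 7, 16

precision = tp / (tp + fp)   # share of flagged transactions that are fraud
recall = tp / (tp + fn)      # share of frauds that were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.4f}")
# precision=0.921 recall=0.837 f1=0.877 accuracy=0.9996
```

Notice that accuracy is about 0.9996 even though 16 of the 98 frauds were missed. On data this imbalanced, accuracy says very little, which is why the fraud-class precision and recall are the numbers to watch.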
ROC AUC Score:
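- The ROC AUC score (0.95) indicates that the model ranks fraudulent transactions above legitimate ones in most cases. Because the classes are so imbalanced, this score is best read alongside the precision and recall above rather than on its own.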
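One way to read a ROC AUC value is as the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one. The sketch below checks this on a tiny made-up example and also computes average precision, a summary of the precision-recall curve that is often reported for rare-event problems. The labels and scores are invented for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Made-up labels and scores for illustration (1 = fraud)
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_score_demo = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.40, 0.15, 0.80, 0.90])

# Fraction of (fraud, legitimate) pairs where the fraud gets the higher score
pos = y_score_demo[y_true_demo == 1]
neg = y_score_demo[y_true_demo == 0]
pairwise = np.mean([p > n for p in pos for n in neg])

auc = roc_auc_score(y_true_demo, y_score_demo)
ap = average_precision_score(y_true_demo, y_score_demo)

print(f"pairwise ranking rate: {pairwise:.4f}")
print(f"roc_auc_score:         {auc:.4f}")
print(f"average precision:     {ap:.4f}")
```

The first two numbers agree, which is the ranking interpretation at work. In this project, average_precision_score(y_test, y_prob) could be reported next to the ROC AUC in the same way.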
Deep Analysis
Hyperparameter Tuning
XGBoost has several hyperparameters that can be tuned to improve model performance. Techniques like Grid Search or Random Search try combinations of values from ranges you define and keep the one that scores best under cross-validation.
Key Hyperparameters:
- n_estimators: Number of boosting rounds.
- max_depth: Maximum depth of a tree, controlling model complexity.
- learning_rate: Step size shrinkage used to prevent overfitting.
- subsample: Fraction of samples used for fitting individual base learners.
- colsample_bytree: Fraction of features used for fitting individual base learners.
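A random search over parameters like these might look roughly like the sketch below. To keep it runnable with scikit-learn alone, it uses GradientBoostingClassifier as a stand-in for XGBClassifier and a small synthetic dataset instead of the real one. The stand-in shares the n_estimators, max_depth, learning_rate, and subsample parameter names; its max_features plays a role similar to colsample_bytree, although it is applied per split rather than per tree. The candidate values are illustrative, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic, imbalanced dataset (about 5% positives) so the search runs quickly
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.95], random_state=42
)

# Candidate values are illustrative only
param_distributions = {
    "n_estimators": [25, 50],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.05, 0.1, 0.3],
    "subsample": [0.7, 1.0],
    "max_features": [0.5, 1.0],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,                      # number of random combinations to try
    scoring="average_precision",   # better suited to rare positives than accuracy
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"best CV average precision: {search.best_score_:.3f}")
```

Since XGBClassifier follows the scikit-learn estimator interface, it could be passed to RandomizedSearchCV in the same way, with colsample_bytree in place of max_features.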
Model Comparison
Comparing the performance of different models can provide insights into the strengths and weaknesses of each approach.
Potential Models for Comparison:
- Logistic Regression: A simple yet effective model for binary classification tasks.
- Random Forest: An ensemble of decision trees that provides feature importances and can be adapted to imbalanced data, for example through class weights.
- Neural Networks: Deep learning models that can capture complex patterns in data.
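A comparison along these lines could be sketched as follows, using cross-validation so each model is scored the same way. This is only an illustration on synthetic data: the dataset, the settings, and the choice of average precision as the metric are assumptions, and a neural network is left out to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for the real data (about 5% positives)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.95], random_state=42
)

# class_weight="balanced" up-weights the rare class in both models
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "Random Forest": RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=42
    ),
}

results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(
        estimator, X_demo, y_demo, cv=3, scoring="average_precision"
    )
    results[name] = scores.mean()
    print(f"{name}: mean average precision = {results[name]:.3f}")
```

On the real dataset, the XGBoost model from Step 7 could be added to the same dictionary so all three are evaluated on identical folds.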
By following these strategies and continuously improving the model, we can achieve a more robust and accurate fraud detection system, ultimately helping to prevent financial losses due to fraudulent transactions.