Enhancing Credit Card Fraud Detection: A Dive into Data Augmentation

József Dudás
Sep 18, 2023


Introduction

Credit card fraud detection is an ongoing battle for businesses and financial institutions worldwide. With the sophistication of fraudulent activities on the rise, it is paramount to stay ahead in detecting and countering these threats.

In an era where data drives decision-making, the project’s core lies in harnessing the power of data augmentation techniques. The purpose? To enhance the diversity of our training data, making our models more adept at generalizing and identifying fraudulent patterns with even higher precision. However, the journey to enhanced detection is not without its challenges — from ensuring the unbiased nature of the augmented data to addressing essential ethical concerns related to data privacy.

In this article, I will shed light on our approach, the state-of-the-art techniques we plan to employ, and the potential implications of this project. Join me as we embark on this fascinating journey to push the boundaries of fraud detection, one synthetic data point at a time.

Technical specification

The project seeks to amplify the precision of credit card fraud detection by harnessing advanced data augmentation methods. The core goal revolves around countering the difficulties arising from skewed datasets, where authentic transactions far outnumber fraudulent ones, causing biased outcomes and diminished precision in model predictions.

Our outlined approach encompasses stages of data acquisition, refinement, and delving into contemporary data augmentation methods, namely SMOTE, ADASYN, and GANs. These methods are tailored to create synthetic representations of underrepresented classes — in this case, fraudulent transactions — thereby enriching the training data’s diversity and enhancing model adaptability.

We plan to test various machine learning techniques, including but not limited to Logistic Regression, Random Forest, SVM, and GBM. Training will occur on both the original skewed and the enriched datasets, gauging performance metrics such as accuracy, precision, recall, F1 score, and the area under the ROC curve (AUC).

Central to our approach is an emphasis on ethics, guaranteeing data confidentiality, and rectifying any potential biases in the enhanced data. We’ll also dive deep into the interpretability and clarity of the chosen models, providing a clearer understanding of the fraud detection mechanisms.

Through the assimilation of the refined model into our current fraud detection framework, this endeavor aims to bolster security in the financial domain, ensuring a heightened shield against credit card fraud for both consumers and financial entities.

Dataset

The dataset is stored in CSV format and encompasses 284,807 entries, each denoting a distinct credit card transaction. The data consists of obfuscated columns labeled V1 to V28, along with ‘Amount’ and ‘Class’ columns. The specifics of columns V1 to V28 remain unknown to us, but their exact nature is not essential for this project; the obfuscation helps safeguard user privacy. The ‘Amount’ column captures the transaction value, while the ‘Class’ column designates whether a transaction was fraudulent (indicated by the value 1) or a regular valid transaction (represented by the value 0).

Utilizing the Pandas library, we imported the CSV file, examined the dataset, searched for any missing entries, and delved into the distribution of values across each column.
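A rough sketch of those checks (the file path is a placeholder):

import pandas as pd

# Load the transactions and inspect the structure
data = pd.read_csv('creditcard.csv')  # placeholder path
print(data.shape)
data.info()

# Look for missing entries and examine per-column distributions
print(data.isnull().sum())
print(data.describe())

# Class balance: 0 = valid, 1 = fraudulent
print(data['Class'].value_counts())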

Our analysis revealed that the dataset predominantly contains numeric values, spanning columns V1 through V28 as well as the ‘Amount’ column. However, these values operate on varying scales, indicating the necessity of applying a scaling technique before model training.

Further, upon inspecting the ‘Class’ column, we saw that 284,315 transactions were marked as valid and only 492 were flagged as fraudulent. This confirms that we are working with a highly imbalanced dataset.

Training on the original imbalanced dataset

Before delving into our data augmentation experiments, it’s essential to first train the models on the existing imbalanced dataset to establish a benchmark.

For streamlined processing, we employed the scikit-learn library. This library not only encompasses the models we aim to evaluate — Logistic Regression, Random Forest, Support Vector Classification, and Gradient Boosting Classifier — but also offers comprehensive functions for extracting the performance metrics we’re targeting, such as accuracy, precision, and recall.

A side note: In our specific environment, model training was noticeably sluggish. To address this and optimize CPU utilization, we integrated the scikit-learn-intelex patching library, enhancing the performance of the primary scikit-learn library.
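For reference, the patch is typically applied before any scikit-learn estimators are imported; a minimal sketch:

# Patch scikit-learn with Intel-optimized implementations where available
from sklearnex import patch_sklearn
patch_sklearn()

# Subsequent scikit-learn imports pick up the accelerated versions
from sklearn.svm import SVC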

We created some reusable functions:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, roc_curve)


def load_and_preprocess_data(filepath):
    """
    Load dataset and preprocess the features.

    Parameters:
    - filepath: Path to the dataset CSV file.

    Returns:
    - Tuple containing feature matrix and target vector.
    """
    data = pd.read_csv(filepath)
    features = ['V' + str(i) for i in range(1, 29)] + ['Amount']
    X = data[features]
    y = data['Class']

    scaler = StandardScaler()  # Because of 'Amount', we need scaling
    X = scaler.fit_transform(X)

    return X, y


def split_data(X, y, test_size=0.2):
    """
    Split data into training and testing sets.

    Parameters:
    - X: Feature matrix.
    - y: Target vector.
    - test_size: Proportion of the dataset to be used as test set.

    Returns:
    - Tuple containing training and testing sets.
    """
    # stratify=y maintains the ratio of fraudulent to valid transactions
    # in both the training and test sets.
    return train_test_split(X, y, test_size=test_size, random_state=42, stratify=y)


def train_model(X_train, y_train, model_type):
    """
    Initialize and train a model based on the specified model type.

    Parameters:
    - X_train: Training feature matrix.
    - y_train: Training target vector.
    - model_type: Type of model to train. Accepted values:
      'logistic_regression', 'random_forest', 'svm', 'gbm'.

    Returns:
    - Trained model.
    """
    if model_type == 'logistic_regression':
        model = LogisticRegression(max_iter=1000)
    elif model_type == 'random_forest':
        model = RandomForestClassifier(n_estimators=100)
    elif model_type == 'svm':
        model = SVC(probability=True)  # Enable probability for ROC curve
    elif model_type == 'gbm':
        model = GradientBoostingClassifier()
    else:
        raise ValueError(f"Model type '{model_type}' not recognized.")

    model.fit(X_train, y_train)
    return model


def evaluate_model(model, X_test, y_test):
    """
    Evaluate a trained model on test data.

    Parameters:
    - model: Trained machine learning model.
    - X_test: Testing feature matrix.
    - y_test: Testing target vector.

    Returns:
    - Tuple containing evaluation metrics: accuracy, precision, recall, f1, and roc_auc.
    """
    y_pred = model.predict(X_test)
    y_pred_prob = model.predict_proba(X_test)[:, 1]

    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_prob)

    return accuracy, precision, recall, f1, roc_auc


def plot_roc_curve(y_test, y_pred_prob):
    """
    Plot ROC curve for model predictions.

    Parameters:
    - y_test: True target values.
    - y_pred_prob: Model predicted probabilities.
    """
    fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
    plt.plot(fpr, tpr)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.show()

And then we run the training:

def main():
    filepath = 'https://xxxxxxxxxxxxx/creditcard.csv'
    X, y = load_and_preprocess_data(filepath)
    X_train, X_test, y_train, y_test = split_data(X, y)

    # List of models
    model_types = ['logistic_regression', 'random_forest', 'svm', 'gbm']

    for model_type in model_types:
        model = train_model(X_train, y_train, model_type)
        metrics = evaluate_model(model, X_test, y_test)
        print(f"{model_type.replace('_', ' ').title()} Metrics:\n")
        print(f"Accuracy: {metrics[0]:.2f}")
        print(f"Precision: {metrics[1]:.2f}")
        print(f"Recall: {metrics[2]:.2f}")
        print(f"F1 Score: {metrics[3]:.2f}")
        print(f"ROC AUC: {metrics[4]:.2f}\n\n")
        plot_roc_curve(y_test, model.predict_proba(X_test)[:, 1])
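To run the script end to end, main() can be invoked behind the usual entry-point guard, for example:

if __name__ == '__main__':
    main()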

Results on imbalanced dataset

As anticipated, given the imbalanced dataset, the results were not especially impressive.

In this section, I’ll present a brief overview of the results. A comprehensive comparison and detailed findings will be elaborated upon in the article’s conclusion.

Logistic Regression Metrics:

Accuracy: 1.00
Precision: 0.83
Recall: 0.65
F1 Score: 0.73
ROC AUC: 0.96

Random Forest Metrics:

Accuracy: 1.00
Precision: 0.96
Recall: 0.77
F1 Score: 0.85
ROC AUC: 0.92

Data augmentation

We delved into cutting-edge data augmentation methods, including SMOTE, ADASYN, and GANs. These approaches are tailored to produce synthetic instances of the underrepresented class — in our case, fraudulent transactions. By doing so, they enrich the variety of the training dataset and bolster the model’s ability to generalize effectively.

SMOTE (Synthetic Minority Over-sampling Technique):

SMOTE is a popular data augmentation technique primarily designed to tackle the issue of class imbalance. It works by creating synthetic samples in the feature space. For any instance in the minority class, SMOTE selects ‘k’ nearest neighbors, and depending on the amount of oversampling needed, it generates synthetic instances between the chosen instance and its neighbors. Essentially, it draws lines between the feature vectors of the selected instance and its neighbors and creates new points along those lines, thus bolstering the minority class’s representation.
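The core interpolation step can be illustrated in a few lines of NumPy; this is a simplified sketch of the idea, not the library’s actual implementation:

import numpy as np

def smote_interpolate(x_i, x_neighbor, rng=np.random.default_rng(42)):
    """Create one synthetic point on the line segment between a minority
    sample and one of its k nearest minority-class neighbors."""
    lam = rng.random()                      # random factor in [0, 1)
    return x_i + lam * (x_neighbor - x_i)   # new point lies between the two

# Toy example: two minority-class feature vectors
x_i = np.array([0.2, 1.5, -0.3])
x_nn = np.array([0.4, 1.1, -0.1])
print(smote_interpolate(x_i, x_nn))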

ADASYN (Adaptive Synthetic Sampling):

ADASYN takes a slightly modified approach compared to SMOTE. The core idea of ADASYN is to use a weighted distribution for different minority class instances according to their level of difficulty in learning. In other words, those minority class instances that are harder to learn have more synthetic data generated. ADASYN calculates the number of synthetic samples required for each minority instance based on its difficulty level, thus ensuring that the classifier focuses more on areas where it’s challenging to distinguish between classes.
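The adaptive weighting itself is easy to sketch; below is a simplified illustration of how the per-sample generation counts are derived (following the original ADASYN paper, not the imbalanced-learn internals):

import numpy as np

def adasyn_weights(majority_neighbor_counts, k, G):
    """Given, for each minority sample, how many of its k nearest neighbors
    belong to the majority class, return how many synthetic samples to
    generate for each one (G is the total number wanted)."""
    r = np.asarray(majority_neighbor_counts) / k   # difficulty ratio per sample
    r_hat = r / r.sum()                            # normalize to a distribution
    return np.round(r_hat * G).astype(int)         # synthetic samples per minority point

# Toy example: 3 minority samples, k = 5 neighbors, 100 synthetic samples wanted
print(adasyn_weights([4, 2, 1], k=5, G=100))       # harder samples get more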

GANs (Generative Adversarial Networks):

In the realm of data augmentation, GANs have taken a unique spot. GANs comprise two neural networks, the Generator and the Discriminator, which are trained simultaneously through adversarial training. The Generator aims to generate data, while the Discriminator tries to distinguish between genuine and synthetic data. Over time, the Generator becomes proficient in producing data that are almost indistinguishable from real data. In the context of data augmentation, GANs can be used to create synthetic instances of minority classes, ensuring that the generated data captures the intricate patterns and complexities of the original data, aiding in improving model generalization and performance.

To experiment with SMOTE and ADASYN, we utilized the imbalanced-learn library in Python. This library encompasses a wide array of oversampling techniques, including the aforementioned ones.

We can apply these techniques to the imbalanced dataset using the following functions:

from imblearn.over_sampling import SMOTE, ADASYN


def apply_smote(X, y):
    """
    Apply SMOTE to generate synthetic samples for minority class.

    Parameters:
    - X: Feature matrix.
    - y: Target vector.

    Returns:
    - Resampled feature matrix and target vector.
    """
    smote = SMOTE(random_state=42)
    X_resampled, y_resampled = smote.fit_resample(X, y)
    return X_resampled, y_resampled


def apply_adasyn(X, y):
    """
    Apply ADASYN to generate synthetic samples for minority class.

    Parameters:
    - X: Feature matrix.
    - y: Target vector.

    Returns:
    - Resampled feature matrix and target vector.
    """
    adasyn = ADASYN(random_state=42)
    X_resampled, y_resampled = adasyn.fit_resample(X, y)
    return X_resampled, y_resampled
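A hypothetical way to wire these into the earlier pipeline is shown below; note that the resampling is applied to the training split only, so the test set stays untouched, real data:

X, y = load_and_preprocess_data(filepath)
X_train, X_test, y_train, y_test = split_data(X, y)

# Oversample only the training portion (example with SMOTE; ADASYN is analogous)
X_train_res, y_train_res = apply_smote(X_train, y_train)

model = train_model(X_train_res, y_train_res, 'random_forest')
metrics = evaluate_model(model, X_test, y_test)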

As you can see, implementing these techniques is quite straightforward. Deploying a GAN, however, demands a more hands-on coding approach. Using Generative Adversarial Networks to augment tabular data stored in CSV format is a more specialized application; most GAN tutorials focus on image, audio, or video generation, but the underlying principles can be applied to structured data as well.

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np


def apply_gan(X, y, num_steps=10000):
    """
    Apply a GAN to generate synthetic samples for the minority class.

    Parameters:
    - X: Feature matrix.
    - y: Target vector.
    - num_steps: Number of training iterations.

    Returns:
    - Resampled feature matrix and target vector.
    """

    # 1. Extract the minority class samples
    X_minority = X[y == 1]
    X_minority_tensor = torch.FloatTensor(X_minority)

    # 2. Define the GAN model layers
    generator_layers = nn.Sequential(
        nn.Linear(in_features=100, out_features=128),
        nn.ReLU(),
        nn.Linear(in_features=128, out_features=X_minority.shape[1])
    )

    discriminator_layers = nn.Sequential(
        nn.Linear(in_features=X_minority.shape[1], out_features=128),
        nn.ReLU(),
        nn.Linear(in_features=128, out_features=1),
        nn.Sigmoid()  # sigmoid => real or fake
    )

    criterion = nn.BCELoss()
    optimizer_g = optim.Adam(generator_layers.parameters(), lr=0.001)
    optimizer_d = optim.Adam(discriminator_layers.parameters(), lr=0.0001)

    # 3. Train GAN
    for step in range(num_steps):
        # Train Discriminator
        optimizer_d.zero_grad()
        real_labels = torch.ones(len(X_minority_tensor), 1)
        fake_data = generator_layers(torch.randn(len(X_minority_tensor), 100))
        fake_labels = torch.zeros(len(X_minority_tensor), 1)

        logits_real = discriminator_layers(X_minority_tensor)
        logits_fake = discriminator_layers(fake_data.detach())

        loss_real = criterion(logits_real, real_labels)
        loss_fake = criterion(logits_fake, fake_labels)
        loss_d = loss_real + loss_fake
        loss_d.backward()
        optimizer_d.step()

        # Train Generator
        optimizer_g.zero_grad()
        logits_fake = discriminator_layers(fake_data)
        loss_g = criterion(logits_fake, real_labels)
        loss_g.backward()
        optimizer_g.step()

    # 4. Generate synthetic samples
    num_synthetic_samples = len(X[y == 0]) - len(X_minority)
    with torch.no_grad():
        synthetic_samples = generator_layers(torch.randn(num_synthetic_samples, 100)).numpy()

    # Append the synthetic samples to X and adjust y accordingly
    X_resampled = np.vstack([X, synthetic_samples])
    y_resampled = np.hstack([y, [1] * num_synthetic_samples])

    return X_resampled, y_resampled

Step-by-step breakdown for GAN when applied to an imbalanced dataset:

1. Data Extraction:
You’ll first extract the samples of the minority class. The idea is to teach the GAN to generate more samples that look like these minority samples.

X_minority = X[y == 1]
X_minority_tensor = torch.FloatTensor(X_minority)

Here, `X_minority` contains all rows of data corresponding to the minority class (fraudulent transactions). Each row has all 29 features.

2. Generator and Discriminator Input/Output Definitions:
Generator:
Input: A noise vector (usually drawn from a normal distribution). The size of the noise vector can vary, but for simplicity, let’s assume it’s of size 100.
Output: A synthetic data sample that has all 29 features. The generator’s job is to make this output indistinguishable from the real minority samples.

Discriminator:
Input: A data sample with all 29 features. This could be a real sample from `X_minority` or a synthetic one from the generator.
Output: A single value (between 0 and 1) indicating the discriminator’s belief about whether the input sample is real (from the dataset) or fake (from the generator).

3. Training the GAN:
For each training iteration:
Discriminator Training:
1. Get a batch of real samples from `X_minority_tensor`.
2. Generate a batch of synthetic samples using the generator.
3. Feed both the real and synthetic samples to the discriminator.
4. Calculate the loss for the discriminator, considering real samples should be classified as 1 and synthetic samples as 0.
5. Update the discriminator’s weights using backpropagation.

Generator Training:
1. Generate a batch of synthetic samples.
2. Feed the synthetic samples to the discriminator.
3. Calculate the loss for the generator, considering that it wants the discriminator to believe these synthetic samples are real (i.e., the generator aims for the discriminator to output 1s for these samples).
4. Update the generator’s weights using backpropagation.

4. Generating Synthetic Samples:
Once training is deemed sufficient, use the trained generator to produce synthetic samples.

noise = torch.randn(num_synthetic_samples, 100)
with torch.no_grad():
    synthetic_samples = generator_layers(noise).numpy()

5. Appending to the Dataset:
Combine the original dataset with the synthetic samples.
Adjust the target array accordingly (adding 1s for each synthetic sample since they are supposed to represent the minority class).

So, in essence, for the vanilla GAN, all 29 features of the credit card dataset are utilized during training and generation. The GAN learns to generate synthetic samples that resemble real samples from the minority class, using all 29 features.
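As with SMOTE and ADASYN, the GAN-based resampling can, hypothetically, be dropped into the same pipeline, again oversampling only the training split:

X_train_gan, y_train_gan = apply_gan(X_train, y_train, num_steps=10000)
model = train_model(X_train_gan, y_train_gan, 'gbm')
metrics = evaluate_model(model, X_test, y_test)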

Results

Upon implementing various data augmentation methods and assessing the model’s performance for each technique individually, the following outcomes were observed:

[Results table comparing the models’ metrics across the augmentation techniques]

Conclusion

From the observed outcomes, it’s evident that every model experienced notable improvements due to the data augmentation processes. However, the most striking performance enhancement came from the application of the sophisticated Generative Adversarial Network (GAN).
