Machine Learning Project on Transaction Fraud Detection

Classification algorithms using ML Pipeline

Pritisagar
10 min read · Aug 8, 2023

Machine learning, with its ability to analyze vast amounts of data and identify patterns, has emerged as a powerful tool in predicting transaction fraud. In this blog post, we will explore the key components of a machine learning pipeline for transaction fraud detection and delve into the various stages involved.

We will build a machine learning model using classification algorithms to accurately detect whether a transaction is fraudulent. Classification algorithms predict discrete categories or classes for given input data; since our task is to decide whether a transaction is fraudulent or not, they are a natural fit.

We will be using Google Colab to carry out this project.

First, we will build the ML model without using a Pipeline. Here are the steps to follow:

Importing Libraries:

First, we need to install and import all the libraries used in the project.

#Installing necessary libraries
!pip install catboost

#Importing necessary libraries
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder

Getting the Data:

The next step in building a machine learning model for transaction fraud detection is collecting relevant data.

We can read the data directly from the given URL using the following piece of code:

url = "https://drive.google.com/uc?export=download&confirm=6gh6&id=1VNpyNkGxHdskfdTNRSjjyNa5qC9u0JyV"
data = pd.read_csv(url)
df = data.copy()
df.head()

A sample of the DataFrame (the output of df.head()) is shown here:

The description of each column is given below:

  • step : A unit of time in the simulation; 1 step corresponds to 1 hour.
  • type : Type of transaction that took place. There are 5 categories in this column, namely 'PAYMENT', 'TRANSFER', 'CASH_OUT', 'DEBIT', and 'CASH_IN'.
  • amount : Total amount of the transaction.
  • nameOrig : Name/ID of the sender.
  • oldbalanceOrg : Sender balance before the transaction took place.
  • newbalanceOrig : Sender balance after the transaction took place.
  • nameDest : Name/ID of the recipient.
  • oldbalanceDest : Recipient balance before the transaction took place.
  • newbalanceDest : Recipient balance after the transaction took place.
  • isFraud : Whether the transaction was made by a fraudulent agent inside the simulation (1 = fraud, 0 = legitimate).
  • isFlaggedFraud : Whether the transaction was flagged by the business rule that aims to control massive transfers from one account to another. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.

Once the data is collected, we need to preprocess it for analysis. This involves tasks such as data cleaning, handling missing values, and encoding categorical variables. Additionally, feature engineering techniques can be applied to extract meaningful features from the raw data, enabling the machine learning algorithms to better detect patterns and anomalies.
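As a concrete (and purely illustrative) example of feature engineering on this dataset, balance-consistency features are sometimes derived from the sender and recipient balances. The code below is a minimal sketch on a copy of the data; the engineered column names are hypothetical and are not used in the rest of this post.

# Illustrative only: derive balance-discrepancy features on a copy of the data.
# A non-zero discrepancy between the expected and recorded balances can be a useful fraud signal
# (sign conventions differ by transaction type, so treat these as rough indicators).
df_fe = df.copy()
df_fe['errorBalanceOrig'] = df_fe['oldbalanceOrg'] - df_fe['amount'] - df_fe['newbalanceOrig']
df_fe['errorBalanceDest'] = df_fe['oldbalanceDest'] + df_fe['amount'] - df_fe['newbalanceDest']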

Exploratory Data Analysis (EDA)

EDA plays a crucial role in understanding the characteristics of the dataset and identifying potential insights. By visualizing and analyzing the data, patterns and relationships between variables can be uncovered.
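A quick structural check (a minimal sketch, assuming the df loaded above) gives the figures referenced next:

# Inspect the overall shape and column types of the dataset
print(df.shape)   # (number of rows, number of columns)
df.info()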

We can see above that the dataset contains a total of 11 columns and more than 6 million rows.

# Checking missing values
df.isnull().sum().sum()

# Output: 0

It can be observed that the dataset is highly imbalanced, with 8,213 fraud cases and 6,354,407 non-fraud cases. The Synthetic Minority Oversampling Technique (SMOTE) or undersampling techniques can be used to address this issue. Undersampling is rarely used in real-world scenarios because it discards data. For simplicity, we do not apply SMOTE here.
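The class distribution can be verified with a simple count (a minimal sketch using the df loaded above; the counts match the figures quoted in the text):

# Class distribution of the target column
print(df['isFraud'].value_counts())
# Expected output:
# 0    6354407
# 1       8213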

Feature Engineering and Selection:

With potentially hundreds or even thousands of features available, it is important to select the most relevant ones for training the machine learning model. Feature selection techniques, such as correlation analysis and feature importance ranking, help identify the features that contribute the most to fraud detection. This step not only improves the model’s efficiency but also reduces the risk of overfitting.

import seaborn as sns
import matplotlib.pyplot as plt

# Correlation heatmap of the numeric columns
correlation_matrix = df.corr(numeric_only=True)
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()

# Collect pairs of features whose absolute correlation exceeds the threshold
threshold = 0.7

correlated_features = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > threshold:
            col_i = correlation_matrix.columns[i]
            col_j = correlation_matrix.columns[j]
            correlated_features.append((col_i, col_j, correlation_matrix.iloc[i, j]))

print(correlated_features)

The output shows the most highly correlated feature pairs.
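The paragraph above also mentions feature-importance ranking. Below is a minimal sketch (not the author's code) that ranks features by the same ANOVA F-score (f_classif) used later in the pipeline; the label encoding here is only a rough convenience for this univariate ranking.

from sklearn.feature_selection import f_classif
from sklearn.preprocessing import LabelEncoder

# Label-encode the object columns on a copy, purely for a quick univariate ranking
X_rank = df.drop('isFraud', axis=1).copy()
for col in X_rank.select_dtypes(include=['object']).columns:
    X_rank[col] = LabelEncoder().fit_transform(X_rank[col])

# Rank features by their ANOVA F-score against the fraud label
f_scores, p_values = f_classif(X_rank, df['isFraud'])
print(pd.Series(f_scores, index=X_rank.columns).sort_values(ascending=False))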

Training the Model:

The next step involves training a machine learning model using the preprocessed data. Various algorithms can be employed, including logistic regression, decision trees, random forests, support vector machines, or neural networks. The choice of algorithm depends on the characteristics of the data and the specific requirements of the business.

To evaluate the performance of the model, it is necessary to split the dataset into training and testing sets.

  • We set our X variable by dropping the ‘isFraud’ column because that is our target column.
  • The ‘isFraud’ column is subsequently assigned to our y variable.
#Splitting the dataset
X = data.drop('isFraud', axis=1) # Features
y = data['isFraud'] # Target variable

Before training, we label-encode the categorical columns and split the data; Logistic Regression, Decision Tree, XGBoost, and CatBoost classifiers are then fitted on the training set.

# Label-encode the categorical (object) columns
categorical_cols = X.select_dtypes(include=['object']).columns

for col in categorical_cols:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])

# 70/30 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Logistic Regression
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)

# Decision Tree
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)

# XGBoost
xgb_model = XGBClassifier()
xgb_model.fit(X_train, y_train)

# CatBoost
catboost_model = CatBoostClassifier()
catboost_model.fit(X_train, y_train)

Analyzing the performance of the trained models:

Validation of the model involves assessing its performance on an independent dataset or during a defined period.

The testing set is used to assess its performance on unseen data. Common evaluation metrics for fraud detection models include accuracy, precision, recall, and F1-score. However, due to the imbalanced nature of fraud detection datasets, additional metrics like Area Under the Precision-Recall Curve (AUPRC) or Receiver Operating Characteristic (ROC) curve are often employed to provide a more comprehensive evaluation.
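The evaluation below uses hard class predictions. As an optional, hedged sketch, probability-based AUPRC and ROC AUC could also be computed for the models trained above (average_precision_score summarizes the area under the precision-recall curve):

from sklearn.metrics import average_precision_score

# Probability-based metrics (sketch): usually more informative than accuracy on imbalanced data
for name, model in [('Logistic Regression', lr_model),
                    ('Decision Tree', dt_model),
                    ('XGBoost', xgb_model),
                    ('CatBoost', catboost_model)]:
    proba = model.predict_proba(X_test)[:, 1]  # predicted probability of the fraud class
    print(name,
          '- AUPRC:', average_precision_score(y_test, proba),
          'ROC AUC:', roc_auc_score(y_test, proba))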

  • Using the evaluate_model function defined below, we evaluate each model.
# Model evaluation
def evaluate_model(model, X, y):
    y_pred = model.predict(X)
    accuracy = accuracy_score(y, y_pred)
    precision = precision_score(y, y_pred)
    recall = recall_score(y, y_pred)
    roc_auc = roc_auc_score(y, y_pred)
    return accuracy, precision, recall, roc_auc

lr_accuracy, lr_precision, lr_recall, lr_roc_auc = evaluate_model(lr_model, X_test, y_test)
dt_accuracy, dt_precision, dt_recall, dt_roc_auc = evaluate_model(dt_model, X_test, y_test)
xgb_accuracy, xgb_precision, xgb_recall, xgb_roc_auc = evaluate_model(xgb_model, X_test, y_test)
catboost_accuracy, catboost_precision, catboost_recall, catboost_roc_auc = evaluate_model(catboost_model, X_test, y_test)

# Print the evaluation metrics
print("Logistic Regression - Accuracy:", lr_accuracy, "Precision:", lr_precision, "Recall:", lr_recall, "ROC AUC:", lr_roc_auc)
print("Decision Tree - Accuracy:", dt_accuracy, "Precision:", dt_precision, "Recall:", dt_recall, "ROC AUC:", dt_roc_auc)
print("XGBoost - Accuracy:", xgb_accuracy, "Precision:", xgb_precision, "Recall:", xgb_recall, "ROC AUC:", xgb_roc_auc)
print("CatBoost - Accuracy:", catboost_accuracy, "Precision:", catboost_precision, "Recall:", catboost_recall, "ROC AUC:", catboost_roc_auc)

The output we get is:

Now, we are going to implement the ML model using Pipeline.

Final Machine Learning Pipeline:

A machine learning pipeline consists of multiple sequential steps that do everything from data extraction and preprocessing to model training and deployment.

Here are the explained steps:

Importing libraries:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

Then, we need to define the ML pipeline. The pipeline consists of several steps, each contributing to the overall process of transaction fraud detection. Let's break down the pipeline and explain each step:

  1. Data Preprocessing : The first step is to preprocess the data using a ColumnTransformer. This transformer applies specific preprocessing steps to different types of features. Numerical features are standardized using StandardScaler, which scales the features to have zero mean and unit variance. Categorical features are left unchanged using the 'passthrough' option.
numerical_features = ['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']
categorical_features = ['step', 'type']

preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', StandardScaler(), numerical_features),    # Apply standardization to numerical features
        ('categorical', 'passthrough', categorical_features),   # Preserve categorical features
    ]
)

2. Feature Selection : Next, a feature selection step is applied using SelectKBest. This step selects the top k features based on their ANOVA F-value. In this case, the top 7 features are chosen for further analysis and model training.

feature_selector = SelectKBest(score_func=f_classif, k=7)  # Select top 7 features using ANOVA F-value

3. Model Initialization : The pipeline defines four classifier models for fraud detection: LogisticRegression, DecisionTreeClassifier, XGBClassifier, and CatBoostClassifier. These models will be trained and evaluated to determine their effectiveness in detecting fraudulent transactions.

lr_model = LogisticRegression()
dt_model = DecisionTreeClassifier()
xgb_model = XGBClassifier()
catboost_model = CatBoostClassifier()

4. Pipeline Construction : The pipeline is constructed by combining the preprocessing, feature selection, and classifier steps in a sequential manner. For each model, the pipeline consists of the following steps:

a. Preprocessor: Applies the data preprocessing steps defined earlier.

b. Feature Selector: Selects the top features based on the ANOVA F-value.

c. Classifier: Utilizes one of the initialized classifiers to make predictions.

# Build the pipelines
pipeline_lr = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_selector', feature_selector),
    ('classifier', lr_model)
])

pipeline_dt = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_selector', feature_selector),
    ('classifier', dt_model)
])

pipeline_xgb = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_selector', feature_selector),
    ('classifier', xgb_model)
])

pipeline_catboost = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_selector', feature_selector),
    ('classifier', catboost_model)
])

5. Model Training : The constructed pipelines are then fitted to the training data (X_train and y_train) using the fit method. This process trains the models on the provided training data, allowing them to learn patterns and relationships between features and fraud labels.

# Fit the pipelines
pipeline_lr.fit(X_train, y_train)
pipeline_dt.fit(X_train, y_train)
pipeline_xgb.fit(X_train, y_train)
pipeline_catboost.fit(X_train, y_train)

6. Evaluation : To evaluate the performance of each pipeline, a separate evaluation function is defined. This function takes a pipeline, test data (X_test and y_test), and calculates several evaluation metrics, including accuracy, precision, recall, and ROC AUC (Area Under the Receiver Operating Characteristic Curve).

# Evaluate the pipelines
def evaluate_pipeline(pipeline, X, y):
    y_pred = pipeline.predict(X)
    accuracy = accuracy_score(y, y_pred)
    precision = precision_score(y, y_pred)
    recall = recall_score(y, y_pred)
    roc_auc = roc_auc_score(y, y_pred)
    return accuracy, precision, recall, roc_auc

lr_accuracy, lr_precision, lr_recall, lr_roc_auc = evaluate_pipeline(pipeline_lr, X_test, y_test)
dt_accuracy, dt_precision, dt_recall, dt_roc_auc = evaluate_pipeline(pipeline_dt, X_test, y_test)
xgb_accuracy, xgb_precision, xgb_recall, xgb_roc_auc = evaluate_pipeline(pipeline_xgb, X_test, y_test)
catboost_accuracy, catboost_precision, catboost_recall, catboost_roc_auc = evaluate_pipeline(pipeline_catboost, X_test, y_test)

# Print the evaluation metrics for the pipelines
print("Logistic Regression - Accuracy:", lr_accuracy, "Precision:", lr_precision, "Recall:", lr_recall, "ROC AUC:", lr_roc_auc)
print("Decision Tree - Accuracy:", dt_accuracy, "Precision:", dt_precision, "Recall:", dt_recall, "ROC AUC:", dt_roc_auc)
print("XGBoost - Accuracy:", xgb_accuracy, "Precision:", xgb_precision, "Recall:", xgb_recall, "ROC AUC:", xgb_roc_auc)
print("CatBoost - Accuracy:", catboost_accuracy, "Precision:", catboost_precision, "Recall:", catboost_recall, "ROC AUC:", catboost_roc_auc)

By following this machine learning pipeline, organizations can develop an effective fraud detection system that leverages the power of machine learning algorithms to identify and mitigate transaction fraud risks.

The whole code can be summarized as:

numerical_features = ['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']
categorical_features = ['step', 'type']

# Define the pipeline steps
preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', StandardScaler(), numerical_features),    # Apply standardization to numerical features
        ('categorical', 'passthrough', categorical_features),   # Preserve categorical features
    ]
)

feature_selector = SelectKBest(score_func=f_classif, k=7)  # Select top 7 features using ANOVA F-value

lr_model = LogisticRegression()
dt_model = DecisionTreeClassifier()
xgb_model = XGBClassifier()
catboost_model = CatBoostClassifier()

# Build the pipeline
pipeline_lr = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_selector', feature_selector),
    ('classifier', lr_model)
])

pipeline_dt = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_selector', feature_selector),
    ('classifier', dt_model)
])

pipeline_xgb = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_selector', feature_selector),
    ('classifier', xgb_model)
])

pipeline_catboost = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_selector', feature_selector),
    ('classifier', catboost_model)
])

# Fit the pipelines
pipeline_lr.fit(X_train, y_train)
pipeline_dt.fit(X_train, y_train)
pipeline_xgb.fit(X_train, y_train)
pipeline_catboost.fit(X_train, y_train)

# Evaluate the pipelines
def evaluate_pipeline(pipeline, X, y):
    y_pred = pipeline.predict(X)
    accuracy = accuracy_score(y, y_pred)
    precision = precision_score(y, y_pred)
    recall = recall_score(y, y_pred)
    roc_auc = roc_auc_score(y, y_pred)
    return accuracy, precision, recall, roc_auc

lr_accuracy, lr_precision, lr_recall, lr_roc_auc = evaluate_pipeline(pipeline_lr, X_test, y_test)
dt_accuracy, dt_precision, dt_recall, dt_roc_auc = evaluate_pipeline(pipeline_dt, X_test, y_test)
xgb_accuracy, xgb_precision, xgb_recall, xgb_roc_auc = evaluate_pipeline(pipeline_xgb, X_test, y_test)
catboost_accuracy, catboost_precision, catboost_recall, catboost_roc_auc = evaluate_pipeline(pipeline_catboost, X_test, y_test)

# Print the evaluation metrics for the pipelines
print("Logistic Regression - Accuracy:", lr_accuracy, "Precision:", lr_precision, "Recall:", lr_recall, "ROC AUC:", lr_roc_auc)
print("Decision Tree - Accuracy:", dt_accuracy, "Precision:", dt_precision, "Recall:", dt_recall, "ROC AUC:", dt_roc_auc)
print("XGBoost - Accuracy:", xgb_accuracy, "Precision:", xgb_precision, "Recall:", xgb_recall, "ROC AUC:", xgb_roc_auc)
print("CatBoost - Accuracy:", catboost_accuracy, "Precision:", catboost_precision, "Recall:", catboost_recall, "ROC AUC:", catboost_roc_auc)

After evaluating the model’s performance, it is essential to optimize its hyperparameters to achieve better results. Techniques like grid search or randomized search can be employed to find the optimal combination of hyperparameters for the chosen algorithm. It is crucial to perform this optimization process while preventing overfitting by using cross-validation techniques like k-fold cross-validation.
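As a hedged illustration (not part of the original code), a grid search with cross-validation can be wired directly onto one of the pipelines defined above; the parameter names follow the pipeline's step names, and the grid values below are arbitrary examples:

from sklearn.model_selection import GridSearchCV

# Illustrative grid for the logistic regression pipeline; the values are examples only
param_grid = {
    'feature_selector__k': [5, 7],        # number of features kept by SelectKBest
    'classifier__C': [0.1, 1.0, 10.0],    # inverse regularization strength for LogisticRegression
}

grid_search = GridSearchCV(pipeline_lr, param_grid, cv=3, scoring='roc_auc', n_jobs=-1)
grid_search.fit(X_train, y_train)

print('Best parameters:', grid_search.best_params_)
print('Best cross-validated ROC AUC:', grid_search.best_score_)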

Conclusion

By collecting, preprocessing, and analyzing data, selecting relevant features, training and evaluating models, and continuously monitoring and adapting, organizations can develop robust fraud detection systems.

Thanks for reading the article! I hope you learned a lot. If you like my content, you can follow me on Medium.
