Fraud Detection: Credit Card

Nirmal Kumar
Published in Analytics Vidhya
3 min read · Jul 18, 2021

A lot of things have been affected by the Covid-19 pandemic. One of these areas is online transactions, which have increased on a large scale. This, in turn, has drastically changed credit card usage. Fraudulent activity in credit card transactions is not a new problem, but one can't deny that it has increased far more than before.

There are a few noticeable challenges in fraud detection from a modeling perspective. Some of them are listed below:

  1. Data imbalance (biased class data)
  2. Data availability
  3. Explainability of ML models

There are other challenges as well. In this blog, we will build a prediction model to detect fraudulent transactions. The dataset can be found here. Let's start with the first challenge, which you are almost certain to face while solving this case study.

Techniques to handle the data imbalance problem:

  1. Using the "class_weight" attribute, which most sklearn classifiers support. For example:
log_reg = LogisticRegression(class_weight='balanced')

(Note: XGBoost's XGBClassifier does not take class_weight; its counterpart is the scale_pos_weight parameter, which we will use later.)

Other than 'balanced', one can pass a dictionary that contains a weight for each class (in our case 0/1). By default its value is None.
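For instance, a minimal sketch of the dictionary form (the weights here are illustrative, not tuned for this dataset):

from sklearn.linear_model import LogisticRegression

# Hypothetical weights: errors on the rare fraud class (1) are penalized
# ten times more than errors on the majority class (0)
log_reg = LogisticRegression(class_weight={0: 1, 1: 10})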

More precisely, the weight assigned to a class can be understood through the formula below (shown here for class 1):

w_1 = number_of_samples/(number_of_classes * number_of_samples_in_class_1)
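As a quick sanity check of this formula (the class counts below are made up for illustration, not taken from the actual dataset), sklearn can compute the same weights for us:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels: 99,000 genuine (0) and 1,000 fraudulent (1) transactions
y_demo = np.array([0] * 99000 + [1] * 1000)

weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]), y=y_demo)
print(weights)
# w_0 = 100000 / (2 * 99000) ≈ 0.505
# w_1 = 100000 / (2 * 1000)  = 50.0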

I hope this helps in understanding the class distribution well. In most cases 'balanced' gives good results; in extreme cases, try assigning the weights manually. There is another technique as well: SMOTE.

2. SMOTE: Synthetic Minority Oversampling Technique

As the name suggests, this technique generates synthetic data for the minority class. It works by randomly picking a point from the minority class and computing the k-nearest neighbours of that point. If that sounds abstract, the diagram below should help:

(Figure: the SMOTE technique)

Synthetic points are added between the chosen point and its neighbours. Refer to the code snippet below for the implementation:

# import the SMOTE oversampler
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)  # fix the seed for reproducible resampling
# Fit on the predictors and target variable
X_smote, y_smote = smote.fit_resample(X, y)
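A quick way to verify that the resampling worked, assuming X and y are loaded as above:

from collections import Counter

print(Counter(y))        # before SMOTE: counts heavily skewed toward class 0
print(Counter(y_smote))  # after SMOTE: both classes have equal counts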

Utilize X_smote and y_smote for further modeling. If "imblearn" is not installed, the snippet below will install it:

pip install imbalanced-learn

This resolves one of the main problems in our modeling.

Fraud detection in credit card transactions is a typical classification problem, in which we have to predict whether a particular transaction is fraud or not fraud.

To solve this problem, I tried 3 different modeling techniques along with a tuned version of each. Before fitting any of them, make sure the data is scaled; otherwise the models' feature importances may be interpreted incorrectly.
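A minimal sketch of the scaling step, assuming the data has already been split into X_train and X_test (fitting the scaler on the training split only avoids leaking test-set statistics):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Learn mean/std from the training data only, then apply to both splits
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)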

  1. Logistic regression
  • Make sure the features are not strongly correlated with each other. If two features are highly correlated, remove one of them before modeling (see the sketch just below).
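A hedged sketch of such a correlation filter, assuming the numeric features live in a pandas DataFrame named df (the 0.9 cutoff is illustrative):

import numpy as np

# Absolute pairwise correlations, keeping only the upper triangle
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# Drop one feature from every pair correlated above the cutoff
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df = df.drop(columns=to_drop)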

And the model itself:

from sklearn.linear_model import LogisticRegression

# LR model with balanced class weights
log_reg = LogisticRegression(class_weight='balanced')
reg_model = log_reg.fit(X_train, y_train)

2. Decision Tree

Refer to the code snippet below:

from sklearn.tree import DecisionTreeClassifier

# Decision Tree classifier with balanced class weights
dt = DecisionTreeClassifier(max_depth=3, class_weight='balanced')
dt.fit(X_train, y_train)

# Tuned model
dt_final = DecisionTreeClassifier(max_depth=3,
                                  class_weight='balanced',
                                  min_samples_leaf=100,
                                  criterion='gini')
dt_final.fit(X_train, y_train)

3. XGBoost

Refer to the code snippet below:

from xgboost import XGBClassifier

# XGBoost has no class_weight attribute; its counterpart is scale_pos_weight,
# usually set to the ratio of negative to positive training samples
ratio = float((y_train == 0).sum()) / (y_train == 1).sum()

# XGBoost model
xgb_model = XGBClassifier(scale_pos_weight=ratio)
xgb_model.fit(X_train, y_train)

# Hyperparameter-tuned model
xgb_final = XGBClassifier(scale_pos_weight=ratio,
                          learning_rate=0.2,
                          max_depth=4,
                          min_child_weight=11,
                          n_estimators=100)
xgb_final.fit(X_train, y_train)

Models Summary:

We can visualize the models' summary below:

(Figure: summary of all 3 models)

Conclusion:

It is tough to choose the best model because most of the models have very good accuracy. But accuracy is not a meaningful metric for this use case: since we have a biased/imbalanced class distribution, it is better to evaluate the models with precision and recall. More precisely, we should focus on the recall score, since missing a fraudulent transaction is costlier than flagging a genuine one.
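A sketch of computing these metrics for the tuned XGBoost model from above (the same pattern works for the other models):

from sklearn.metrics import classification_report, precision_score, recall_score

y_pred = xgb_final.predict(X_test)
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print(classification_report(y_test, y_pred))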

As per our evaluation metric, we can go for XGBoost. We could improve this score further through more exploration of hyperparameter tuning and by applying the SMOTE technique to the class imbalance problem, which I have not used here.

You can find complete code here.
