Fraud Detection: Credit Card
A lot of things have been affected by the Covid-19 pandemic. One of those areas is online transactions, which have increased on a large scale, and this has in turn drastically changed credit card usage. Fraudulent activity in credit card transactions is not a new problem, but one can’t deny that it has increased far beyond its earlier levels.
From the perspective of modeling technique, there are a few noticeable challenges in fraud detection. Some of them are listed below:
- Data imbalance (biased class data)
- Data availability
- Explainability of ML models
There are other challenges as well. In this blog, we will build a prediction model to detect fraudulent cases. The dataset can be found here. Let us start with the biggest modeling challenge in this case study: class imbalance.
Techniques to handle the Data Imbalance problem:
1. Using the attribute “class_weight”, which most sklearn classifiers accept. For example:
log_reg = LogisticRegression(class_weight='balanced')
(Note that XGBoost is not an sklearn estimator and has no class_weight parameter; its counterpart for binary imbalance is scale_pos_weight, covered in the modeling section below.)
Other than “balanced”, one can pass a dictionary that contains a weight for each class (in our case 0/1). By default, its value is None.
More precisely, the weight assigned to a class (shown here for class 1) is calculated as per the formula below:
w_1 = number_of_samples/(number_of_classes * number_of_samples_in_class_1)
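To make this concrete, here is a minimal sketch of the same calculation using sklearn’s compute_class_weight utility (the toy label counts below are assumptions for illustration only):
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_toy = np.array([0] * 950 + [1] * 50)  # assumed toy labels: 950 genuine, 50 fraud
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y_toy)
print(dict(zip([0, 1], weights)))  # {0: 0.526..., 1: 10.0}, matching the formula above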
I hope this helps in understanding the class weight calculation. Mostly, “balanced” gives a better result; in extreme cases, try assigning the weights manually. We have another technique as well, i.e. SMOTE.
2. SMOTE: Synthetic Minority Oversampling Technique
As its name conveys, this technique generates synthetic data for the minority class. It works by randomly picking a point from the minority class and computing the k-nearest neighbours of that point; the synthetic points are then added between the chosen point and its neighbours. Please refer to the code snippet below for the implementation:
# import the SMOTE oversampler from imbalanced-learn
from imblearn.over_sampling import SMOTE

# fit on the predictor and target variables
smote = SMOTE()
X_smote, y_smote = smote.fit_resample(X, y)
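To sanity-check the resampling, one can compare the class counts before and after (assuming y holds the 0/1 labels):
from collections import Counter

print('before:', Counter(y))        # e.g. Counter({0: 950, 1: 50})
print('after :', Counter(y_smote))  # SMOTE balances the two classes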
Utilize X_smote and y_smote for further modeling activity. If “imblearn” is not installed, the snippet below will install it:
sudo pip install imbalanced-learn
This resolves one of the main problems of modeling.
Fraudulent behaviour in credit card transactions is a typical classification problem, in which we have to predict whether a particular case is Fraud or Not Fraud.
To solve this problem I have tried 3 different modeling techniques, along with a tuned version of each. Before fitting any of them, make sure the data is scaled; otherwise, the models’ feature importances may be interpreted incorrectly.
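As a minimal sketch, scaling could look like this (StandardScaler is one common choice; X_train and X_test are the usual train/test splits):
from sklearn.preprocessing import StandardScaler

# fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)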
1. Logistic regression
- Make sure the features are not correlated with each other. In case of correlated data, one must remove the correlated variables before modeling; a quick way to check is sketched after this list.
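A rough sketch of such a check, assuming X is a pandas DataFrame and using an illustrative threshold of 0.9:
import numpy as np

# absolute pairwise correlations, keeping only the upper triangle
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# drop any feature strongly correlated with an earlier one
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X = X.drop(columns=to_drop)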
For the model itself, refer to the code snippet below:
# LR Model
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(class_weight='balanced')
reg_model = log_reg.fit(X_train, y_train)
2. Decision Tree
Refer to the code snippet below:
# Decision Tree Classifier Model
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=3, class_weight='balanced')
dt.fit(X_train, y_train)

# Tuned model
dt_final = DecisionTreeClassifier(max_depth=3,
                                  class_weight='balanced',
                                  min_samples_leaf=100,
                                  criterion='gini')
dt_final.fit(X_train, y_train)
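To inspect what the tuned tree relies on, one can look at its feature importances (a sketch, assuming X_train is still a pandas DataFrame with named columns; otherwise pass the column list explicitly):
import pandas as pd

# impurity-based importances of the tuned tree, highest first
importances = pd.Series(dt_final.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head())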
3. XGBoost
Refer to the code snippet below:
# XGBoost model
from xgboost import XGBClassifier

# XGBoost has no class_weight parameter; its counterpart for binary
# imbalance is scale_pos_weight = n_negative / n_positive
pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

xgb_model = XGBClassifier(scale_pos_weight=pos_weight)
xgb_model.fit(X_train, y_train)

# Hyperparameter tuned model
xgb_final = XGBClassifier(scale_pos_weight=pos_weight,
                          learning_rate=0.2,
                          max_depth=4,
                          min_child_weight=11,
                          n_estimators=100)
xgb_final.fit(X_train, y_train)
Models Summary:
[Figure: models summary (evaluation scores of the three models)]
Conclusion:
It is tough to choose the best model because all of the models have very good accuracy. However, accuracy does not give a good interpretation for this use case: as we have a biased/imbalanced class, it is better to use Precision and Recall for evaluating the models. More precisely, we should focus on the Recall score.
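As a sketch, precision and recall can be read off sklearn’s classification report (assuming the usual X_test and y_test hold-out split):
from sklearn.metrics import classification_report

# per-class precision, recall, and F1 for the tuned XGBoost model
y_pred = xgb_final.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))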
As per our evaluation metric, we can go for XGBoost. We can also improve this score through more exploration of hyperparameter tuning, and through the SMOTE technique for the class imbalance problem, which I have not used here.
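For instance, further tuning could be explored with a grid search scored on recall (the parameter grid below is only an illustrative assumption):
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [3, 4, 5],
              'learning_rate': [0.1, 0.2, 0.3]}

# 3-fold grid search optimized for recall, per the conclusion above
search = GridSearchCV(XGBClassifier(scale_pos_weight=pos_weight),
                      param_grid, scoring='recall', cv=3)
search.fit(X_train, y_train)
print(search.best_params_)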
You can find the complete code here.