Extracting Insights from a Dataset using Supervised Learning

Jasmi Kevadia
INST414: Data Science Techniques
May 3, 2023

As the world becomes increasingly data-driven, extracting insights from large datasets is becoming more important than ever. In this post, I will demonstrate how I applied supervised learning techniques to extract insights from a dataset.

Dataset of interest:

My dataset of interest is the Kaggle credit card fraud detection dataset. It contains credit card transactions made by European cardholders over a two-day period in September 2013: 284,807 transactions in total, of which only 492 are fraudulent. The dataset is therefore highly imbalanced.
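A quick look at the class distribution makes this imbalance concrete. The snippet below is a minimal sketch, assuming the Kaggle CSV has been downloaded and saved locally as creditcard.csv:

```python
import pandas as pd

# Load the Kaggle credit card fraud dataset (local file path is an assumption)
df = pd.read_csv("creditcard.csv")

# "Class" is 1 for fraudulent transactions and 0 for legitimate ones
print(df.shape)                    # (284807, 31)
print(df["Class"].value_counts())  # 0: 284315, 1: 492
```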

Type of supervision problem:

The type of supervision problem I will be solving is binary classification: labeling each transaction as either fraudulent or non-fraudulent.

Features and ground-truth labels:

The features used for classification are the transaction amount, the transaction time, and 28 anonymized features labeled V1 through V28. The ground-truth labels were obtained through human supervision and indicate whether each transaction was fraudulent or not.
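In code, separating the features from the label is straightforward; the sketch below follows the column names used in the Kaggle dataset and reuses the df DataFrame loaded above:

```python
# Features: Time, Amount, and the anonymized V1 through V28 columns
X = df.drop(columns=["Class"])

# Ground-truth label: 1 = fraudulent, 0 = non-fraudulent
y = df["Class"]
```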

Choice of supervised learning model:

The supervised learning model I chose for this problem is the Random Forest Classifier. I chose this model because it handles imbalanced datasets reasonably well and, as an ensemble of decision trees, is relatively robust to overfitting. Additionally, Random Forest can handle a large number of features and is relatively fast to train and predict.
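The sketch below shows how such a model can be fit with scikit-learn. The train/test split ratio, n_estimators, and class_weight="balanced" are illustrative assumptions rather than the exact settings I used:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hold out a stratified test set so the rare fraud class appears in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# class_weight="balanced" weights the rare fraud class more heavily during training
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)
```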

Model evaluation:

To determine whether the model was performing well or not, I used precision, recall, F1 score, and ROC AUC score as evaluation metrics. The model achieved a precision of 97%, recall of 82%, F1 score of 89%, and ROC AUC score of 94%.
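These metrics can be computed with scikit-learn as in the sketch below, which assumes the clf, X_test, and y_test variables from the previous snippet:

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_pred = clf.predict(X_test)
# ROC AUC is computed from predicted probabilities rather than hard class labels
y_proba = clf.predict_proba(X_test)[:, 1]

print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1 score: ", f1_score(y_test, y_pred))
print("ROC AUC:  ", roc_auc_score(y_test, y_proba))
```

Plain accuracy would be misleading here: a model that predicted "non-fraudulent" for every transaction would still be over 99% accurate on this dataset, which is why precision and recall on the fraud class matter more.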

Misclassified samples:

After testing the model, I identified five transactions that were misclassified. Upon closer inspection, I found that the model misclassified these transactions because they had high transaction amounts, which the model had not encountered during training.
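One way to pull out the misclassified transactions for inspection is to compare the predictions against the ground truth on the test set, again building on the variables from the earlier sketches:

```python
# Boolean mask of test samples the model got wrong
misclassified = y_pred != y_test.to_numpy()

# Inspect the misclassified transactions, e.g. the distribution of their amounts
errors = X_test[misclassified]
print(errors["Amount"].describe())
```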

Software used:

I used Python and its scikit-learn library for this analysis.

Data cleaning:

I performed some data cleaning to remove missing values and normalize the transaction amount and time features. Additionally, I used over-sampling to balance the dataset.
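A sketch of these steps is shown below, using scikit-learn's StandardScaler for the Amount and Time columns. The over-sampling here uses imbalanced-learn's RandomOverSampler as one illustrative option; the same effect can be achieved by manually resampling the minority class. In the full pipeline the missing-value removal happens before the train/test split from the earlier snippet:

```python
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler

# Remove rows with missing values (done on the full DataFrame before splitting)
df = df.dropna()

# Normalize Amount and Time to zero mean and unit variance,
# fitting the scaler on the training split only to avoid leakage into the test set
scaler = StandardScaler()
X_train[["Amount", "Time"]] = scaler.fit_transform(X_train[["Amount", "Time"]])
X_test[["Amount", "Time"]] = scaler.transform(X_test[["Amount", "Time"]])

# Over-sample the minority (fraud) class in the training data only,
# so the test set keeps its real-world class balance
ros = RandomOverSampler(random_state=42)
X_train_res, y_train_res = ros.fit_resample(X_train, y_train)
```

The Random Forest is then fit on X_train_res and y_train_res rather than the original training split.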

Summary of findings:

In summary, I was able to use supervised learning techniques to classify credit card transactions as either fraudulent or non-fraudulent. The model performed well at identifying fraudulent transactions with a low rate of false positives. However, it misclassified transactions with unusually high amounts, indicating that additional feature engineering may be necessary to improve performance.

Limitations:

One limitation of this analysis is that the dataset is highly imbalanced, with only 492 fraudulent transactions out of 284,807 transactions. Therefore, more sophisticated techniques may be needed to handle imbalanced data. Additionally, the dataset only covers two days of transactions and may not be representative of all credit card transactions.

Conclusion:

Overall, this analysis demonstrates the power of supervised learning techniques in extracting insights from datasets. With proper feature engineering and model selection, it is possible to achieve high performance in identifying fraudulent credit card transactions.

GitHub: https://github.com/jasmi01/INST414Exercises/blob/main/assignment6
