Fraud Seller Detection in an Online Marketplace: Application of Random Forest Algorithm

In this project, let’s develop a machine-learning model to identify notorious fraudulent sellers within a C2C marketplace

Koyel Chakraborty
Analytics in Market Research
6 min read · Jun 9, 2021


Just a couple of weeks ago, a fraudulent seller blatantly cheated me on a renowned e-commerce website. Although this is hardly unusual news in today’s cyberspace, the incident compelled me to ask myself a serious question: ‘Should I trust this site anymore?’

Now, if I stop using this site (and if, eventually, everyone follows the same path), that would penalize the marketplace but not necessarily the seller who did me the actual harm. The cheater could still find a way to lure my fellow shoppers elsewhere. That sounds creepy, doesn’t it? But let’s set our fear aside and dive deep into this problem.

About the Data:

The original dataset was retrieved from Jeffrey Mvutu Mabilama on Kaggle (see License). It contains data on the sellers (and customers) of a C2C e-commerce website. To make this dataset suitable for our analysis, I have cleaned, preprocessed and made the necessary changes to it (for more details, please find the code here). Note that the column ‘Fraud’ does not exist in the original dataset. In the modified dataset, this column has been added to denote whether a sale was a fraudulent activity or not. (A sale is labeled fraudulent if the shipped product does not match the product description during the verification process before shipping.)

The modified dataset looks something like this (note that this is just an abridged preview, not the whole dataset):

About the Code:

Let’s import the required libraries first:
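The code cells in this post are shown as minimal sketches; variable names such as df, X, y and paramdict, the file name, and the grid values are illustrative placeholders rather than the exact cell contents.

```python
# Core libraries assumed in this walkthrough: pandas for data handling,
# scikit-learn for modeling, and plotly for the interactive ROC plot.
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_curve, recall_score
import plotly.express as px
```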

Let’s load the data:
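A sketch of the loading step, assuming the modified dataset is stored as a CSV file (the file name is a placeholder):

```python
# Load the preprocessed seller data and print a quick numerical summary.
df = pd.read_csv("c2c_sellers_modified.csv")
df.describe()
```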

>> Output:

The output cell prints a dataframe. Since the screenshot can be difficult to read, I have converted the dataframe into a CSV file and embedded it here.

Notice two important points here. First, an unnecessary column, ‘Unnamed: 0’, has appeared. Second, the ‘Fraud’ column (which will be our target variable in this project) has a mean far lower than 0.5. Since this column has only two unique values (1 for a fraudulent sale, 0 otherwise), such a low mean indicates that the dataset is imbalanced and biased towards class 0. Let’s fix these problems one by one.

Delete the unnecessary column:
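A one-line sketch:

```python
# Drop the stray index column created while saving the preprocessed file.
df = df.drop(columns=["Unnamed: 0"])
```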

Specify the features and the target:
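Here, every column other than ‘Fraud’ is treated as a feature:

```python
# 'Fraud' is the binary target; the remaining columns are the features.
X = df.drop(columns=["Fraud"])
y = df["Fraud"]
```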

Perform the train-test split:
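A sketch of the two-stage split; the exact proportions are placeholders:

```python
# Split off a test portion, then split it again into two halves:
# test set 1 (for threshold selection) and test set 2 (for final evaluation).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
X_test1, X_test2, y_test1, y_test2 = train_test_split(
    X_test, y_test, test_size=0.5, stratify=y_test, random_state=42
)
```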

You may wonder why we are further dividing the test set into two separate parts. The reason will be clear to you later.

Fit the model:
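A sketch of the grid search; the grid values are placeholders, but ‘class_weight’ is tuned and AUC is the scoring metric, as described below:

```python
# Candidate hyperparameters, including 'class_weight' to counter the class 0 bias.
paramdict = {
    "n_estimators": [100, 200],
    "max_depth": [5, 10, 15],
    "min_samples_leaf": [1, 3, 5],        # 5 is the upper limit discussed later
    "class_weight": [None, "balanced"],
}

# Cross-validated grid search scored by ROC AUC.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=paramdict,
    scoring="roc_auc",
    cv=5,
)
grid.fit(X_train, y_train)
```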

Notice that we have included the ‘class_weight’ parameter within paramdict to tackle the dataset’s class 0 bias. In this way, our model will be better equipped to detect both classes efficiently. [Suggested Read]

Since we have explicitly defined our scoring metric, grid search will identify the best model based on the optimum AUC score found during its cross-validation process. [Suggested Read]

View the best set of parameters:
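For instance:

```python
# Best hyperparameter combination found during cross-validation.
print(grid.best_params_)
```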

>>Output:

It is evident that the best ‘min_samples_leaf’ value sits at the upper limit of the range we defined in paramdict. So there is a possibility that an even higher value may perform better if specified. Let’s extend its range and perform the grid search again.

Modify the set of parameters and fit again:
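A sketch of the modified search, extending ‘min_samples_leaf’ beyond its previous upper limit (again with placeholder values):

```python
# Extend the 'min_samples_leaf' range past its previous upper limit and refit.
paramdict["min_samples_leaf"] = [5, 7, 9, 11]
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=paramdict,
    scoring="roc_auc",
    cv=5,
)
grid.fit(X_train, y_train)
best_rf = grid.best_estimator_   # the model used for all later predictions
```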

View the best set of parameters again:

>>Output:

Good news for us! All the best parameters now lie strictly inside the specified ranges, away from their upper and lower limits. So we may conclude that this is the best model identified by grid search.

Print the classification report for the second test set:
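A sketch, predicting on the second test set with the default 0.5 probability threshold:

```python
# Default predictions (0.5 probability threshold) on the second test set.
y_pred2 = best_rf.predict(X_test2)
print(classification_report(y_test2, y_pred2))
```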

>>Output:

Now, detecting only 69.77% of frauds (the recall for class 1.0) may not suffice for our actual needs, right? So let’s follow a different route to increase the fraud detection rate further.

Selecting the threshold:

Let’s consider two adverse possibilities:

  1. False Negative: A seller is actually a fraud but remains undetected by our model.
  2. False Positive: Our model wrongly detects an innocent seller as a fraud.

Although both of the above cases indicate model failure, the first one appears more dangerous in its potential consequences. As customers, we are not really harmed by avoiding an innocent seller, since plenty of other sellers would still be there to cater to our needs. But dealing with a fraud is one of the worst nightmares we can experience. So it feels safer to accept a higher number of false positives in exchange for a higher class 1 recall.

For this purpose, we can move the probability threshold to a more suitable level. Let’s examine the ROC curve for class 1. To identify the optimum threshold, we will use the first test set preserved earlier.
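A sketch of the computation, using scikit-learn’s roc_curve on the first test set:

```python
# Predicted probabilities for class 1 on the first test set,
# converted into ROC points with their corresponding thresholds.
proba1 = best_rf.predict_proba(X_test1)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test1, proba1)
roc_df = pd.DataFrame({"FPR": fpr, "TPR": tpr, "Threshold": thresholds})
roc_df = roc_df.iloc[1:]  # drop the artificial first point whose threshold exceeds every score
```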

In this way, we get a dataframe that stores the True Positive Rates and False Positive Rates with their corresponding probability thresholds for class 1. Our goal is to maximize the True Positive Rate, so let’s plot the ROC curve to select a suitable threshold value. [Suggested Read]
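One way to build such an interactive plot (plotly is an assumption here; any interactive plotting library would work):

```python
# Interactive 3-D view of FPR vs TPR vs threshold; rotate it to read off
# the probability threshold behind a chosen (FPR, TPR) point.
fig = px.scatter_3d(roc_df, x="FPR", y="TPR", z="Threshold")
fig.show()
```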

>>Output:

The output cell generates an interactive 3-d plot. Here, I have attached two screenshots capturing notable aspects of it.

ROC Curve

To move the probability threshold, we need to check the ROC curve first. The ROC curve plots the False Positive Rate (FPR) along the X-axis and the True Positive Rate (TPR) along the Y-axis. Here, we need to maximize the TPR without increasing the FPR too much. In short, our task is to navigate the TPR-FPR trade-off.

Finding the best point on the ROC curve is quite subjective and depends on the customer’s FPR aversion. To illustrate, I have marked three points (notice the red, orange and green strikes) on the curve. An individual who is more afraid of getting a false positive may fix the FPR at around 36% (the red strike). But, in that case, the person would be more vulnerable to frauds than people who accept a higher FPR (the orange and green strikes). Personally, I am comfortable preferring the green strike, as e-commerce sites generally offer plenty of options to buy from.

After you have selected your desired FPR, rotate the plot to view its corresponding probability threshold. It should look something like this,

ROC threshold selection

So, the preferred thresholds are (approximately),

Now that we have selected our probability threshold, let’s define a function to determine the class 1 recall for a different test set based on this threshold value.
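A sketch of such a function:

```python
# Class 1 recall when class membership is decided by a custom probability threshold.
def recall_at_threshold(model, X_new, y_new, threshold):
    proba = model.predict_proba(X_new)[:, 1]
    preds = (proba >= threshold).astype(int)
    return recall_score(y_new, preds, pos_label=1)
```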

We can call this function to calculate the new recall value for our second test set. This step is crucial for evaluating how well threshold shifting improves prediction.
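For instance (the threshold below is a placeholder; substitute the value read off your ROC plot):

```python
# Evaluate the shifted-threshold recall on the held-out second test set.
new_recall = recall_at_threshold(best_rf, X_test2, y_test2, threshold=0.35)
print(f"Class 1 recall at the new threshold: {new_recall:.2%}")
```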

>>Output:

Finally, the recall for class 1 has increased to 90.11%. Undoubtedly, it is a huge leap from its initial value of 69.77%.

Interpretation:

Thus, our model successfully detected 90.11% of the total frauds present in a held-out test set. Although the performance may vary across different test sets, the detection rate seems quite satisfactory in this case.

Do you feel the same? Please post your views and queries in the comment section. I would love to hear from you as well.

Thanks for your precious attention. For more, follow my profile and publication page.🤍🤍
