Ethereum Fraud Detection with a Classification Model

Srun Sompoppokasest
7 min read · Sep 4, 2022


As cryptocurrency transactions have become very prevalent, only a handful of currencies have managed to stay active in the market for a long period of time. Ethereum is in the top 10 by market capitalization. Because Ethereum is so popular, it also attracts a lot of fraudulent and suspicious transactions, and these fraudulent transactions can cause severe damage. However, it is difficult to seek compensation, because cryptocurrencies are not yet subject to government regulation. So if we can detect fraud in Ethereum's smart contract transactions beforehand, it will really reduce the number of victims in the cryptocurrency market.

I downloaded the dataset from this site. I'm going to publish my code on GitHub anyway, but I'd like you to read through this article first, because I will deliberately describe some details that may enlighten you.

Ethereum fraud detection is a classification problem with imbalanced classes, so keep in mind that you have to apply an oversampling or undersampling technique to your data before putting it into the model. In this article I will show you and compare the results of:

  1. Classification without oversampling
  2. Classification with oversampling
  3. Classification with oversampling + Dimensionality reduction

So, let's get started!

Pre-processing data

Let's start by loading the data into a DataFrame and looking at some basic information. I didn't capture all the features in the screenshot because, as you can see, the dataset has around 50 features in total.

Img 1: Load data into DataFrame
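A minimal sketch of this loading step. The file name transaction_dataset.csv and the label column FLAG are assumptions here; adjust them to match your copy of the dataset.

import pandas as pd

# File name and label column are assumptions; change them to match your download
df = pd.read_csv('transaction_dataset.csv')
label = 'FLAG'

print(df.shape)   # roughly 50 features
df.info()         # column types and non-null counts
df.head()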

The data also has some duplicated rows.

Img 2: Dropping duplicated rows
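A minimal sketch of this step:

# Drop exact duplicate rows, keeping the first occurrence
print('rows before:', len(df))
df = df.drop_duplicates().reset_index(drop=True)
print('rows after :', len(df))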

There is also a lot of null data. To handle it, I looked through the data and found that the rows containing nulls have nulls in many features. Since I didn't know how to calculate and impute all of them, I decided to just remove those rows.

Img 3: Handling Null data
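A minimal sketch of the null handling described above:

# Check which columns have nulls, then drop every row that contains one
print(df.isnull().sum().sort_values(ascending=False).head(10))
df = df.dropna().reset_index(drop=True)
print('shape after dropping nulls:', df.shape)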

I created a heatmap to see the correlations.

Img 4: Correlation
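A minimal sketch of the correlation heatmap, assuming seaborn and matplotlib are available:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the numeric features only
corr = df.select_dtypes(include='number').corr()

plt.figure(figsize=(20, 16))
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.title('Feature correlation')
plt.show()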

I also did some EDA, which you can see later in my published code. In this dataset, most of the features are numerical; only two features are categorical. I felt these two features might not be useful, so I decided to drop them. You can see in Img 5 that they have 305 and 467 categories respectively, and most of the values are just 0.

Img 5: Count plot of top 10 categories in two categorical features
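A minimal sketch of dropping the two categorical columns. Selecting them by dtype (rather than by name) is my own assumption about how to identify them:

# Drop the object (categorical) columns; everything left is numeric
cat_cols = df.select_dtypes(include='object').columns
print('dropping:', list(cat_cols))
df = df.drop(columns=cat_cols)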

Classification model

As I mentioned earlier, I will show the results of:

  1. Classification without oversampling
  2. Classification with oversampling
  3. Classification with oversampling + Dimensionality reduction

I chose traditional models, ranging from very old but still popular ones to models that are very popular this year (2022). This is the list:

  1. DecisionTree
  2. LogisticRegression
  3. SVM with linear kernel
  4. SVM with rbf kernel
  5. RandomForest
  6. GradientBoosting
  7. XGBoost
  8. LightGBM

Classification without oversampling

## Imports
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_curve, auc)

## Split train/test dataset
features = [col for col in df.columns if col != label]
x_train, x_test, y_train, y_test = train_test_split(
    df[features], df[label], test_size=0.30, random_state=17)

## Create models
models = {'LogisticRegression': LogisticRegression(),
          'DecisionTree': DecisionTreeClassifier(),
          'LinearSVM': LinearSVC(),
          'rbfSVM': SVC(kernel='rbf'),
          'RandomForest': RandomForestClassifier(),
          'GradientBoosting': GradientBoostingClassifier(),
          'XGBoost': XGBClassifier(),
          'LightGBM': LGBMClassifier()}

## Evaluation
model_name_list = list(models)
accuracy_list = list()
tn_list = list()
fp_list = list()
fn_list = list()
tp_list = list()
precision_list = list()
recall_list = list()
f1_list = list()
auc_dict = dict()
fpr_dict = dict()
tpr_dict = dict()

for _model_name, _model in tqdm(models.items()):
    classification_model = _model
    classification_model.fit(x_train, y_train)
    y_pred = classification_model.predict(x_test)

    # sklearn metrics expect (y_true, y_pred)
    acc_score = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    accuracy_list.append(acc_score)
    precision_list.append(prec)
    recall_list.append(rec)
    f1_list.append(f1)

    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    tn_list.append(tn)
    fp_list.append(fp)
    fn_list.append(fn)
    tp_list.append(tp)

    # Use decision_function where available (SVMs), otherwise predict_proba
    if hasattr(classification_model, "decision_function"):
        y_score = classification_model.decision_function(x_test)
    else:
        y_score = classification_model.predict_proba(x_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_score)
    auc_score = auc(fpr, tpr)
    auc_dict[_model_name] = auc_score
    fpr_dict[_model_name] = fpr
    tpr_dict[_model_name] = tpr

print('---------Finished---------')

I trained the models and collected the following evaluation metrics:

  1. Accuracy
  2. Confusion matrix data (TP, FP, TN, FN)
  3. Precision
  4. Recall
  5. F1
  6. AUC
  7. ROC curve
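To compare the models side by side, the lists collected in the loop above can be assembled into one summary table; a minimal sketch (the column names are my own choice):

import pandas as pd

results = pd.DataFrame({
    'model': model_name_list,
    'accuracy': accuracy_list,
    'precision': precision_list,
    'recall': recall_list,
    'f1': f1_list,
    'tn': tn_list, 'fp': fp_list, 'fn': fn_list, 'tp': tp_list,
    'auc': [auc_dict[m] for m in model_name_list],
})
print(results.sort_values('f1', ascending=False))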

We have to talk in detail about which evaluation metric fits this dataset. The first is accuracy. This is a very common metric for evaluating the performance of a classification model, but when the dataset has a very skewed class distribution (imbalanced classes), accuracy may lead us to misinterpret the model's performance. So it comes down to F1 or AUC as our final choice.

F1 vs AUC

F1

  • Calculated directly from precision and recall
  • Requires predicted labels
  • Suitable for imbalanced datasets

AUC

  • Area under the ROC curve, which is calculated from the true positive rate and the false positive rate (note: the true positive rate is not the same thing as the true positive count TP, and the false positive rate is not the same thing as the false positive count FP)
  • Requires prediction scores or probabilities (computed by decision_function or predict_proba; you can see this in my code)
  • Less suitable for imbalanced datasets, since the ROC curve can look overly optimistic when the classes are very skewed
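To make the difference concrete, precision, recall, and F1 can all be derived from the confusion-matrix counts. A small illustrative example with made-up numbers for an imbalanced test set:

# Hypothetical counts, for illustration only
tp, fp, fn, tn = 80, 20, 40, 9860

precision = tp / (tp + fp)                                   # 0.80
recall    = tp / (tp + fn)                                   # ~0.67
f1        = 2 * precision * recall / (precision + recall)    # ~0.73

# Accuracy looks great even though a third of the frauds are missed
accuracy = (tp + tn) / (tp + tn + fp + fn)                   # ~0.994
print(precision, recall, f1, accuracy)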

So for this dataset, we will base our comparison on the F1 score. If you look at the results in Img 6, the boosting models really do perform better. In particular, XGBoost and LightGBM have very high F1 scores, and you can see that their precision and recall are also very high. GradientBoosting and RandomForest have high F1 too, but for some reason their precision is not as high. DecisionTree is the model that surprised me a little, because it produced quite a good result.

Img 6: Evaluation metrics for classification without oversampling

I also plotted the ROC curves to show the result. You can see that they are consistent with the table in Img 6.

Img 7: ROC curve of classification model without oversampling
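A minimal sketch of the ROC plot, reusing the fpr_dict, tpr_dict, and auc_dict collected in the evaluation loop above:

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
for name in fpr_dict:
    plt.plot(fpr_dict[name], tpr_dict[name],
             label=f'{name} (AUC = {auc_dict[name]:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='chance')  # diagonal reference line
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curves')
plt.legend()
plt.show()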

Classification with oversampling

For an imbalanced dataset, everyone in the data field will tell you to use a sampling technique, either oversampling or undersampling. I chose SMOTE because it is rather easy to implement and very popular.

from imblearn.over_sampling import SMOTE

# fit_resample is the current imbalanced-learn API (the old fit_sample was removed)
smote_model = SMOTE(random_state=10)
x_smote, y_smote = smote_model.fit_resample(x_train, y_train)
x_smote_test, y_smote_test = smote_model.fit_resample(x_test, y_test)
Img 8: Distribution of label before and after SMOTE
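The same training and evaluation loop can then be re-run on the resampled data. A minimal sketch of the idea, reusing the models dict from above and showing only the F1 part:

from sklearn.metrics import f1_score

f1_smote = {}
for name, model in models.items():
    model.fit(x_smote, y_smote)            # retrain on the oversampled training set
    y_pred = model.predict(x_smote_test)   # evaluate on the resampled test set, as in the code above
    f1_smote[name] = f1_score(y_smote_test, y_pred)

print(sorted(f1_smote.items(), key=lambda kv: kv[1], reverse=True))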

With SMOTE, you can see that the F1 scores of the models that were already very good without any sampling technique become even better. I still don't know exactly why it is so much better, but the result is really satisfying.

Img 9: Evaluation metrics for classification with oversampling
Img 10: ROC curve of classification model with oversampling

Classification with oversampling + Dimensionality reduction

This dataset also contains a lot of features, so I wanted to know how a dimensionality reduction technique would impact performance. I chose PCA for dimensionality reduction. You can see in my code below that I applied a scaler first, because PCA is affected by differences in variance. You may wonder whether you should do PCA before or after SMOTE. I had the same question in mind, and after reading through many articles I think doing PCA first is the right decision: just as we have to scale the data before PCA because of the variance effect, SMOTE also changes the variance of the data.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Scale first because PCA is sensitive to differences in variance
scaler = StandardScaler()
scaler.fit(x_train)
x_train_scaler = scaler.transform(x_train)
x_test_scaler = scaler.transform(x_test)

# Keep enough components to explain 95% of the variance
pca = PCA(.95)
pca.fit(x_train_scaler)
x_train_pca = pca.transform(x_train_scaler)
x_test_pca = pca.transform(x_test_scaler)

# Oversample after PCA
smote_model = SMOTE(random_state=10)
x_pca_smote, y_pca_smote = smote_model.fit_resample(x_train_pca, y_train)
x_pca_smote_test, y_pca_smote_test = smote_model.fit_resample(x_test_pca, y_test)
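PCA(.95) keeps just enough components to explain 95% of the variance; a quick check of how many components that actually is:

print('components kept   :', pca.n_components_)
print('explained variance:', pca.explained_variance_ratio_.sum())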

The result is worse than when using the sampling technique alone.

Img 11: Evaluation metrics for classification with oversampling + dimensionality reduction
Img 12: ROC curve of classification model with oversampling + dimensionality reduction

Conclusion

  1. Every dataset has its own uniqueness, so you should use an evaluation metric that is suitable for that dataset, like this dataset where I chose the F1 score for selecting the best model.
  2. A sampling technique improves performance on an imbalanced dataset.
  3. This may not strictly be a conclusion, but it is also very important. When we do data science projects, remember the word 'science' in the name. We do not know beforehand which model will be the best, so just test every model you think could be a good one and evaluate them. Besides model performance, you may also have to consider the computation time: you may try hyperparameter tuning, but if it requires much more time to train, you can choose not to do it.
  4. You can see my code in my Github: https://github.com/RunnyKub/classification_problem/blob/main/Ethereum%20flaud%20classification%20with%20SKLearn/02%20-%20Ethereum%20flaud%20classification.ipynb
