Ethereum Fraud Detection with a Classification Model

Srun Sompoppokasest
7 min read · Sep 4, 2022


As cryptocurrency transactions have become very prevalent, only a handful of currencies have managed to stay active in the market for a long period of time. Ethereum is in the top 10 by market capitalization. Because Ethereum is so popular, it also attracts a lot of fraudulent and suspicious transactions, and these fraudulent transactions can cause severe damage. However, it is difficult to seek compensation, because cryptocurrencies are not yet subject to government regulation. So if we can detect fraud in Ethereum's smart contract transactions beforehand, it will really reduce the number of victims in the cryptocurrency market.

I downloaded the dataset from this site. I'm going to publish my code on GitHub anyway, but I'd like you to read through this article first, because I will deliberately describe some details that may enlighten you.

Ethereum fraud detection is a classification problem with imbalanced classes, so keep in mind that you have to apply an oversampling or undersampling technique to your data before putting it into the model. In this article I will show you and compare the results of:

  1. Classification without oversampling
  2. Classification with oversampling
  3. Classification with oversampling + Dimensionality reduction

So, let's get started!

Pre-processing data

Let's start by loading the data into a DataFrame and looking at some basic information. I didn't capture all the features in the screenshot because, as you can see, the dataset has around 50 features in total.

Img 1: Load data into DataFrame
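A minimal sketch of this loading step. The file name transaction_dataset.csv and the label column FLAG are assumptions here; adjust them to match your copy of the dataset.

import pandas as pd

# File name and label column are assumptions; change them to match your download
df = pd.read_csv('transaction_dataset.csv')
label = 'FLAG'

print(df.shape)   # roughly 50 features
df.info()         # column types and non-null counts
df.head()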

The data also has some duplicated rows.

Img 2: Dropping duplicated rows
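A minimal sketch of this step:

# Drop exact duplicate rows, keeping the first occurrence
print('rows before:', len(df))
df = df.drop_duplicates().reset_index(drop=True)
print('rows after :', len(df))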

There is also a lot of null data. To handle it, I looked through the data and found that the rows containing nulls have nulls in many features. Since I didn't know how to calculate and impute all of them, I decided to just remove those rows.

Img 3: Handling Null data
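A minimal sketch of the null handling described above:

# Check which columns have nulls, then drop every row that contains one
print(df.isnull().sum().sort_values(ascending=False).head(10))
df = df.dropna().reset_index(drop=True)
print('shape after dropping nulls:', df.shape)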

I created a heatmap to see the correlations.

Img 4: Correlation
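A minimal sketch of the correlation heatmap, assuming seaborn and matplotlib are available:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the numeric features only
corr = df.select_dtypes(include='number').corr()

plt.figure(figsize=(20, 16))
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.title('Feature correlation')
plt.show()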

I also did some EDA, which you can see later in my published code. In this dataset, most of the features are numerical; only two features are categorical. I felt these two features might not be useful, so I decided to drop them. You can see in Img 5 that they have 305 and 467 categories respectively, and most of the values are just 0.

Img 5: Count plot of top 10 categories in two categorical features
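A minimal sketch of dropping the two categorical columns. Selecting them by dtype (rather than by name) is my own assumption about how to identify them:

# Drop the object (categorical) columns; everything left is numeric
cat_cols = df.select_dtypes(include='object').columns
print('dropping:', list(cat_cols))
df = df.drop(columns=cat_cols)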

Classification model

As I mentioned earlier, I will show the results of:

  1. Classification without oversampling
  2. Classification with oversampling
  3. Classification with oversampling + Dimensionality reduction

I chose traditional models, ranging from very old but still popular ones to models that are very popular this year (2022). This is the list:

  1. DecisionTree
  2. LogisticRegression
  3. SVM with linear kernel
  4. SVM with rbf kernel
  5. RandomForest
  6. GradientBoosting
  7. XGBoost
  8. LightGBM

Classification without oversampling

## Imports
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_curve, auc)

## Split train/test dataset
features = [col for col in df.columns if col != label]
x_train, x_test, y_train, y_test = train_test_split(
    df[features], df[label], test_size=0.30, random_state=17)

## Create models
models = {'LogisticRegression': LogisticRegression(),
          'DecisionTree': DecisionTreeClassifier(),
          'LinearSVM': LinearSVC(),
          'rbfSVM': SVC(kernel='rbf'),
          'RandomForest': RandomForestClassifier(),
          'GradientBoosting': GradientBoostingClassifier(),
          'XGBoost': XGBClassifier(),
          'LightGBM': LGBMClassifier()}

## Evaluation
model_name_list = list(models)
accuracy_list = list()
tn_list = list()
fp_list = list()
fn_list = list()
tp_list = list()
precision_list = list()
recall_list = list()
f1_list = list()
auc_dict = dict()
fpr_dict = dict()
tpr_dict = dict()

for _model_name, _model in tqdm(models.items()):
    classification_model = _model
    classification_model.fit(x_train, y_train)
    y_pred = classification_model.predict(x_test)

    # sklearn metrics expect (y_true, y_pred)
    acc_score = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    accuracy_list.append(acc_score)
    precision_list.append(prec)
    recall_list.append(rec)
    f1_list.append(f1)

    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    tn_list.append(tn)
    fp_list.append(fp)
    fn_list.append(fn)
    tp_list.append(tp)

    # Use decision_function where available (SVMs), otherwise predict_proba
    if hasattr(classification_model, "decision_function"):
        y_score = classification_model.decision_function(x_test)
    else:
        y_score = classification_model.predict_proba(x_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_score)
    auc_score = auc(fpr, tpr)
    auc_dict[_model_name] = auc_score
    fpr_dict[_model_name] = fpr
    tpr_dict[_model_name] = tpr

print('---------Finished---------')

I trained the models and collected the following evaluation metrics:

  1. Accuracy
  2. Confusion matrix data (TP, FP, TN, FN)
  3. Precision
  4. Recall
  5. F1
  6. AUC
  7. ROC curve
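To compare the models side by side, the lists collected in the loop above can be assembled into one summary table; a minimal sketch (the column names are my own choice):

import pandas as pd

results = pd.DataFrame({
    'model': model_name_list,
    'accuracy': accuracy_list,
    'precision': precision_list,
    'recall': recall_list,
    'f1': f1_list,
    'tn': tn_list, 'fp': fp_list, 'fn': fn_list, 'tp': tp_list,
    'auc': [auc_dict[m] for m in model_name_list],
})
print(results.sort_values('f1', ascending=False))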

We have to talk in detail about which evaluation metric fits this dataset. The first is accuracy. This is a very common metric for evaluating the performance of a classification model, but when the dataset has a very skewed class distribution (imbalanced classes), accuracy may lead us to misinterpret the model's performance. So it comes down to F1 or AUC as our final choice.

F1 vs AUC

F1

  • Calculated directly from precision and recall
  • Requires predicted labels
  • Suitable for imbalanced datasets

AUC

  • Area under the ROC curve, which is calculated from the true positive rate and the false positive rate (note: the true positive rate is not the same thing as the true positive count TP, and the false positive rate is not the same thing as the false positive count FP)
  • Requires prediction scores or probabilities (computed by decision_function or predict_proba; you can see this in my code)
  • Less suitable for imbalanced datasets, since the ROC curve can look overly optimistic when the classes are very skewed
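To make the difference concrete, precision, recall, and F1 can all be derived from the confusion-matrix counts. A small illustrative example with made-up numbers for an imbalanced test set:

# Hypothetical counts, for illustration only
tp, fp, fn, tn = 80, 20, 40, 9860

precision = tp / (tp + fp)                                   # 0.80
recall    = tp / (tp + fn)                                   # ~0.67
f1        = 2 * precision * recall / (precision + recall)    # ~0.73

# Accuracy looks great even though a third of the frauds are missed
accuracy = (tp + tn) / (tp + tn + fp + fn)                   # ~0.994
print(precision, recall, f1, accuracy)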

So for this dataset, we will base our comparison on the F1 score. If you look at the results in Img 6, the boosting models really do perform better. In particular, XGBoost and LightGBM have very high F1 scores, and you can see that their precision and recall are also very high. GradientBoosting and RandomForest have high F1 too, but for some reason their precision is not as high. DecisionTree is the model that surprised me a little, because it produced quite a good result.

Img 6: Evaluation metrics for classification without oversampling

I also plotted the ROC curves to show the result. You can see that they are consistent with the table in Img 6.

Img 7: ROC curve of classification model without oversampling
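A minimal sketch of the ROC plot, reusing the fpr_dict, tpr_dict, and auc_dict collected in the evaluation loop above:

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
for name in fpr_dict:
    plt.plot(fpr_dict[name], tpr_dict[name],
             label=f'{name} (AUC = {auc_dict[name]:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='chance')  # diagonal reference line
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curves')
plt.legend()
plt.show()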

Classification with oversampling

For an imbalanced dataset, everyone in the data field will tell you to use a sampling technique, either oversampling or undersampling. I chose SMOTE because it is rather easy to implement and very popular.

from imblearn.over_sampling import SMOTE

# fit_resample is the current imbalanced-learn API (the old fit_sample was removed)
smote_model = SMOTE(random_state=10)
x_smote, y_smote = smote_model.fit_resample(x_train, y_train)
x_smote_test, y_smote_test = smote_model.fit_resample(x_test, y_test)
Img 8: Distribution of label before and after SMOTE
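The same training and evaluation loop can then be re-run on the resampled data. A minimal sketch of the idea, reusing the models dict from above and showing only the F1 part:

from sklearn.metrics import f1_score

f1_smote = {}
for name, model in models.items():
    model.fit(x_smote, y_smote)            # retrain on the oversampled training set
    y_pred = model.predict(x_smote_test)   # evaluate on the resampled test set, as in the code above
    f1_smote[name] = f1_score(y_smote_test, y_pred)

print(sorted(f1_smote.items(), key=lambda kv: kv[1], reverse=True))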

With SMOTE, you can see that the F1 scores of the models that were already very good without any sampling technique become even better. I still don't know exactly why it is so much better, but the result is really satisfying.

Img 9: Evaluation metrics for classification with oversampling
Img 10: ROC curve of classification model with oversampling

Classification with oversampling + Dimensionality reduction

This dataset also contains a lot of features, so I wanted to know how a dimensionality reduction technique would impact performance. I chose PCA for dimensionality reduction. You can see in my code below that I applied a scaler first, because PCA is affected by differences in variance. You may wonder whether you should do PCA before or after SMOTE. I had the same question in mind, and after reading through many articles I think doing PCA first is the right decision: just as we have to scale the data before PCA because of the variance effect, SMOTE also changes the variance of the data.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Scale first because PCA is sensitive to differences in variance
scaler = StandardScaler()
scaler.fit(x_train)
x_train_scaler = scaler.transform(x_train)
x_test_scaler = scaler.transform(x_test)

# Keep enough components to explain 95% of the variance
pca = PCA(.95)
pca.fit(x_train_scaler)
x_train_pca = pca.transform(x_train_scaler)
x_test_pca = pca.transform(x_test_scaler)

# Oversample after PCA
smote_model = SMOTE(random_state=10)
x_pca_smote, y_pca_smote = smote_model.fit_resample(x_train_pca, y_train)
x_pca_smote_test, y_pca_smote_test = smote_model.fit_resample(x_test_pca, y_test)
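PCA(.95) keeps just enough components to explain 95% of the variance; a quick check of how many components that actually is:

print('components kept   :', pca.n_components_)
print('explained variance:', pca.explained_variance_ratio_.sum())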

The result is worse than when using the sampling technique alone.

Img 11: Evaluation metrics for classification with oversampling + dimensionality reduction
Img 12: ROC curve of classification model with oversampling + dimensionality reduction

Conclusion

  1. Every dataset has its own uniqueness, so you should use an evaluation metric that is suitable for that dataset, like this dataset where I chose the F1 score for selecting the best model.
  2. A sampling technique improves performance on an imbalanced dataset.
  3. This may not strictly be a conclusion, but it is also very important. When we do data science projects, remember the word 'science' in the name. We do not know beforehand which model will be the best, so just test every model you think could be a good one and evaluate them. Besides model performance, you may also have to consider the computation time: you may try hyperparameter tuning, but if it requires much more time to train, you can choose not to do it.
  4. You can see my code in my Github: https://github.com/RunnyKub/classification_problem/blob/main/Ethereum%20flaud%20classification%20with%20SKLearn/02%20-%20Ethereum%20flaud%20classification.ipynb
