Fraudulent Transaction Prediction in Bitcoin Network

Bhavay Aggarwal
Published in ML45 · Dec 30, 2020

Objective

Bitcoin is a popular cryptocurrency used for transactions. Among all the transactions that take place, some may be used for malware attacks or ransom payments (referred to as illicit transactions). Our objective was to classify transactions into licit and illicit classes.

Motivation

As the earliest cryptocurrency to achieve widespread popularity and success, Bitcoin has inspired a host of other projects in the blockchain space. Among all the transactions that take place, some may be used for malware attacks or ransom payments. Analysing illicit transactions and identifying characteristics that would lead to their early identification is an essential security measure.

State-of-the-art Methods

Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional Networks for Financial Forensics [1]:
This is the baseline paper for the Elliptic dataset, in which the authors describe the dataset and the purpose behind its release. The authors also apply standard machine learning techniques to the dataset. Another interesting (state-of-the-art) methodology they provide is the use of a Graph Convolutional Network and its variant EvolveGCN (which takes the temporality of the dataset into account) to predict illicit transactions. The code for EvolveGCN has been released on GitHub; however, some users have reported that they were unable to replicate the results.

Dataset Description

Link: https://www.kaggle.com/ellipticco/elliptic-data-set

This anonymised data set is a transaction graph collected from the Bitcoin blockchain. A node in the graph represents a transaction; an edge can be viewed as a flow of Bitcoins from one transaction to another. Each node has 166 features and has been labelled as being created by a “licit”, “illicit” or “unknown” entity. The graph is made of 203,769 nodes and 234,355 edges. 2% (4,545) of the nodes are labelled class 1 (illicit), and 21% (42,019) are labelled class 2 (licit). The remaining transactions are not labelled with regard to licit versus illicit.

Distribution of classes across all the timesteps

There is also a time step associated with each node, representing a measure of when a transaction was broadcast to the Bitcoin network. The first 94 features represent local information about the transaction. The remaining 72 features are aggregated features, obtained using transaction information one hop backwards/forwards from the centre node, giving the maximum, minimum, standard deviation and correlation coefficients of the neighbouring transactions for the same local features. Interestingly, every transaction has at least one value with a z-score > 3 for the respective column.
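For reference in the sketches that follow, the Kaggle CSVs can be loaded and merged with pandas roughly as below. The file names and the column layout (transaction id, time step, then the anonymised feature columns) follow the public Kaggle release; adjust them to your local copy.

```python
import pandas as pd

# File names as distributed in the Kaggle "Elliptic Data Set" release.
features = pd.read_csv("elliptic_txs_features.csv", header=None)
classes = pd.read_csv("elliptic_txs_classes.csv")    # columns: txId, class ('1', '2', 'unknown')
edges = pd.read_csv("elliptic_txs_edgelist.csv")     # columns: txId1, txId2

# Column 0 is the transaction id, column 1 the time step,
# the remaining columns are the anonymised local + aggregated features.
n_feats = features.shape[1] - 2
features.columns = ["txId", "time_step"] + [f"feat_{i}" for i in range(n_feats)]

# Attach labels: '1' = illicit, '2' = licit, 'unknown' = unlabelled.
data = features.merge(classes, on="txId", how="left")
print(data["class"].value_counts(dropna=False))
```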

Graph Analysis

Degree Distribution
  • Number of nodes: 203769
  • Number of edges: 234355
  • Average degree: 2.3002
  • Density: 1.1288341056918834e-05
  • Average Clustering: 0.013762190724244798

From the above stats, we observe that the graph is very sparse, as its density is very low.
Moreover, the average clustering coefficient is also very small, which shows that there is very little clustering among the nodes of the graph.
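These statistics can be reproduced with networkx; a minimal sketch, assuming the edges DataFrame from the loading sketch above (the clustering computation can take a while on the full graph):

```python
import networkx as nx

# Undirected transaction graph built from the edge list.
G = nx.from_pandas_edgelist(edges, source="txId1", target="txId2")

n, m = G.number_of_nodes(), G.number_of_edges()
print("Number of nodes:", n)
print("Number of edges:", m)
print("Average degree:", 2 * m / n)          # every edge contributes to two node degrees
print("Density:", nx.density(G))
print("Average clustering:", nx.average_clustering(G))
```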

Methodology

Preprocessing

  1. Standard preprocessing was performed on the data, which includes proper labelling and removing unknown values as well as node IDs.
  2. We tried different preprocessing strategies to handle outliers and to normalise the data, such as scaling each feature so that its mean is 0 and its standard deviation is 1, detecting correlation between features and removing feature vectors showing >95% correlation with other feature vectors, and removing columns containing values more than 5 standard deviations from the mean.

However, all these strategies resulted in worse performance. We hypothesize this may be because of the nature of the dataset, where we do not have information about what the features represent.
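A minimal sketch of the standardisation and correlation-filtering steps described above. The 0.95 correlation threshold and the z-score cut-off of 5 are the values mentioned in the text; everything else (variable names, clipping instead of dropping) is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Keep only labelled rows and drop identifiers before modelling.
labelled = data[data["class"].isin(["1", "2"])].copy()
X = labelled.drop(columns=["txId", "class"])
y = (labelled["class"] == "1").astype(int)           # 1 = illicit, 0 = licit

# Scale every feature to zero mean and unit variance.
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

# Drop one of every pair of features with |correlation| > 0.95.
corr = X_scaled.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
X_reduced = X_scaled.drop(columns=to_drop)

# Clip extreme outliers (|z| > 5) rather than removing them outright.
X_clipped = X_reduced.clip(lower=-5, upper=5)
```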

Methods used for classification

The dataset obtained after preprocessing was used for PCA and t-SNE analysis, but neither was able to differentiate the two classes.
We trained various machine learning models, and the performance of some of them can be seen in the results section. We tried SVM, Logistic Regression, XGBoost, Random Forest and a Multi-layer Perceptron.
The dataset is heavily imbalanced, which causes models to perform poorly; to overcome this issue we implemented the techniques described below (a sketch of the GMM- and SMOTE-based oversampling follows the list).

  • Undersampling: We perform stratified undersampling on the two classes, removing random observations from the respective classes to create the train and test sets.
  • Gaussian Mixture Model: GMM is an unsupervised learning algorithm, similar in spirit to k-means clustering, that uses Expectation-Maximization to estimate a mixture of normal distributions fitting the data. To prevent overfitting, the Bayesian Information Criterion (BIC) is used to estimate the number of components required. We found the optimal number of components to be 6.
Valley between 5 and 7.5 indicates 6 as a good choice for n_components
  • SMOTE: a widely used oversampling algorithm that synthesizes new minority-class samples by interpolating between existing samples and their nearest neighbours. We, however, found SMOTE to be inferior to GMM when it comes to estimating the probability distribution and sampling from it.
  • CTGAN: We also explore CTGAN (Conditional GAN for Tabular Data), a GAN-based synthesizer developed explicitly for generating tabular data. CTGAN uses fully connected networks and, to prevent mode collapse and stabilize learning, it uses WGAN-GP (Wasserstein GAN with gradient penalty, based on the Wasserstein distance).
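A minimal sketch of the GMM-based oversampling of the illicit class, with BIC used to choose the number of components, and the corresponding SMOTE call from imbalanced-learn for comparison. It assumes the X_clipped / y variables from the preprocessing sketch; the component search range is illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from imblearn.over_sampling import SMOTE

X_train = X_clipped.to_numpy()                       # from the preprocessing sketch
y_train = y.to_numpy()

# --- GMM oversampling of the illicit class (label 1) ---
X_illicit = X_train[y_train == 1]

# Choose the number of components by minimising the BIC.
bics = {k: GaussianMixture(n_components=k, random_state=0)
            .fit(X_illicit).bic(X_illicit)
        for k in range(1, 11)}
best_k = min(bics, key=bics.get)                     # ~6 in our experiments

gmm = GaussianMixture(n_components=best_k, random_state=0).fit(X_illicit)
n_needed = int((y_train == 0).sum() - (y_train == 1).sum())
X_synth, _ = gmm.sample(n_needed)                    # synthetic illicit samples

X_aug = np.vstack([X_train, X_synth])
y_aug = np.concatenate([y_train, np.ones(n_needed, dtype=int)])

# --- SMOTE baseline for comparison ---
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X_train, y_train)
```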

Clustering: We also perform unsupervised learning on the unlabelled transactions. We have used K-nearest neighbor (k = 4) and Gaussian Mixture Models (n_components = 4) for this purpose.
Embeddings using DeepWalk: DeepWalk is a novel approach for learning graph representations by performing random walks over the vertices. We generated node embeddings using DeepWalk and trained machine learning models on them. DeepWalk uses the local information gathered during its random walks to learn representations of the graph; the random walks can be thought of as short sentences or phrases. These walks are essential for capturing local community information in the graph, which ultimately gives a better understanding of the features and the neighbourhood of a node. The DeepWalk algorithm then uses this information to generate the embeddings of the graph. A minimal implementation sketch is given after the figure below.

Representation of the DeepWalk algorithm
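A minimal sketch of the DeepWalk idea: plain truncated random walks over the networkx graph fed into gensim's Word2Vec (gensim >= 4.0 API assumed). The walk counts and embedding size are illustrative; a dedicated DeepWalk implementation can of course be used instead.

```python
import random
from gensim.models import Word2Vec

def random_walks(G, num_walks=10, walk_length=40, seed=0):
    """Generate truncated random walks starting from every node."""
    rng = random.Random(seed)
    walks, nodes = [], list(G.nodes())
    for _ in range(num_walks):
        rng.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbours = list(G.neighbors(walk[-1]))
                if not neighbours:
                    break
                walk.append(rng.choice(neighbours))
            walks.append([str(n) for n in walk])     # Word2Vec expects string tokens
    return walks

walks = random_walks(G)                              # G from the graph-analysis sketch
model = Word2Vec(sentences=walks, vector_size=128, window=5,
                 min_count=0, sg=1, workers=4, epochs=5)

# Dictionary mapping each transaction id to its learned embedding.
emb = {int(node): model.wv[node] for node in model.wv.index_to_key}
```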

Experiment on the nature of graphs: Furthermore, we also analyzed the provided graph using the networkx library to test the hypothesis: ‘Do illicit transactions generally flow through a long chain of illicit transactions?’
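One way to probe this hypothesis is to restrict the graph to illicit nodes only and inspect the sizes of the resulting connected components: long chains of illicit transactions should show up as large components rather than isolated nodes. A minimal sketch, assuming G and data from the earlier sketches:

```python
import networkx as nx
from collections import Counter

# Transactions labelled illicit ('1' in the classes file).
illicit_ids = set(data.loc[data["class"] == "1", "txId"])

# Subgraph induced by illicit transactions only.
G_illicit = G.subgraph(illicit_ids)

# Distribution of connected-component sizes: many multi-node components
# would support the "chains of illicit transactions" hypothesis.
component_sizes = Counter(len(c) for c in nx.connected_components(G_illicit))
print(sorted(component_sizes.items(), reverse=True)[:10])
```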

Results and analysis

RF: 65% Accuracy

Random Forest performs the best for the classification task: among all the models we tried, it unequivocally gives the best results when the features provided by the dataset are used.
SVM performed the best in terms of raw accuracy but was not able to predict illicit transactions, which can be attributed to the heavy class imbalance in the dataset. Random Forest was better at predicting illicit transactions but also produced many false positives. Since the dataset is concerned with anti-money laundering, the main purpose of this exercise should be to minimize false negatives while keeping a tight threshold on false positives.

Feature Importance of Random Forest Model

We extracted the features which are the most important for the Random Forest model. Upon building a classifier with only the top 20 features (fig. 3), we ended up with worse results than the original. Thus, we set the stratified-sampling-based Random Forest as the baseline for our experiments.
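A minimal sketch of fitting the Random Forest baseline on a stratified split and ranking its features by importance (hyperparameters are illustrative, not the exact ones we used):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stratified split keeps the licit/illicit ratio the same in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_train, y_train, test_size=0.3, stratify=y_train, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
rf.fit(X_tr, y_tr)

# Rank features by impurity-based importance and retrain on the top 20.
top20 = rf.feature_importances_.argsort()[::-1][:20]
rf_top20 = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
rf_top20.fit(X_tr[:, top20], y_tr)

print("full-feature model accuracy:", rf.score(X_te, y_te))
print("top-20-feature model accuracy:", rf_top20.score(X_te[:, top20].shape and X_te, y_te) if False else rf_top20.score(X_te[:, top20], y_te))
```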

Undersampling and Oversampling

Stratified undersampling led to superior results compared to training without undersampling, which is expected because the class-imbalance issue gets partially resolved. Oversampling using GMM vs SMOTE: we used t-SNE plots to compare how well the two models capture the underlying probability distribution (a sketch of this comparison follows the figure below).

t-SNE on SMOTE (left) and GMM (right) generated data
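A minimal sketch of the t-SNE comparison: embed the real illicit samples together with the synthetic ones from a given generator and colour the points by origin. It reuses X_illicit and X_synth from the oversampling sketch; the synthetic set is subsampled only to keep t-SNE fast.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def tsne_compare(real, synthetic, title):
    """Jointly embed real and synthetic samples and colour them by origin."""
    stacked = np.vstack([real, synthetic])
    proj = TSNE(n_components=2, random_state=0).fit_transform(stacked)
    plt.figure(figsize=(5, 5))
    plt.scatter(*proj[: len(real)].T, s=4, label="real illicit")
    plt.scatter(*proj[len(real):].T, s=4, label="synthetic")
    plt.legend(); plt.title(title); plt.show()

# Compare the GMM-generated samples against the real illicit population;
# the same call can be repeated with the samples produced by SMOTE or CTGAN.
tsne_compare(X_illicit, X_synth[: len(X_illicit)], "GMM-generated illicit samples")
```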

SMOTE tends to overfit to the data, which means it essentially replicates existing samples. The probability distribution appears to be captured much better in the case of GMM. Augmenting the original matrix with the generated samples from the illicit class resulted in markedly better results.
GMM and CTGAN were used for synthesizing only the illicit class. Therefore, if a synthesizer fails to capture the probability distribution and produces samples significantly different from the original population, an inherent bias is introduced into the dataset, where samples from one class are systematically different from the other class. Thus, since the population generated by CTGAN is significantly different from the original population, the corresponding results might be biased.

t-SNE on data generated using CTGAN

Clustering

The t-SNE plot of the unlabelled dataset is given below. As can be seen from the plot, the dataset does not show a very clear demarcation between the two classes. We performed GMM clustering and KNN clustering, both of which were used further down the pipeline to build a classifier. The ROC curve for classification using the RF classifier is shown below.

t-SNE plot and ROC curve of the model trained

Experiments on the nature of the graphs

We analyzed the graph at certain time steps and found that many illicit transactions formed tightly bound clusters. Recall that the nodes represent transactions while the edges represent the flow of coins.

We hypothesize that illicit transactions are generally part of a larger flow of illicit transactions. This can to some extent be confirmed by looking at the visualizations of graphs with only the illicit transactions given below. We also hypothesize that these clusters are unique to illicit transactions. These properties are manifested in the embeddings we obtained using DeepWalk, leading to better separability between the classes.

The dataset is temporal in nature, and it is known that a dark market shutdown occurred in the later time steps. To explore this property, we trained a Random Forest model on the first n-1 time steps and tested it on the nth time step, for n ranging from 2 to 49, and plotted the illicit F1 score against the time step. The graph shows a major dip in the illicit F1 score, indicating that a significant amount of unpredictability is introduced into the dataset after the 34th time step.
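A minimal sketch of this rolling evaluation: for each n, train on all time steps before n, test on time step n, and record the F1 score of the illicit class. It assumes the labelled table and the X_train / y_train arrays from the earlier sketches.

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

steps = labelled["time_step"].to_numpy()
f1_per_step = {}

for n in range(2, 50):                               # time steps run from 1 to 49
    train_mask, test_mask = steps < n, steps == n
    if test_mask.sum() == 0 or y_train[train_mask].sum() == 0:
        continue                                     # skip steps with no usable data
    rf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    rf.fit(X_train[train_mask], y_train[train_mask])
    pred = rf.predict(X_train[test_mask])
    f1_per_step[n] = f1_score(y_train[test_mask], pred)   # F1 of the illicit class

plt.plot(list(f1_per_step), list(f1_per_step.values()))
plt.xlabel("time step"); plt.ylabel("illicit F1 score"); plt.show()
```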

Embeddings generated using DeepWalk

SVM with 95% accuracy

As hypothesized earlier, illicit transactions are generally part of a larger flow of illicit transactions. This means that using graph convolutions or learned graph representations could further support our hypothesis.
SVM and Random Forest models were trained on the embeddings generated by DeepWalk. SVM gave an accuracy of 95%, which is the highest accuracy we achieved. t-SNE and PCA visualizations of the embeddings generated using DeepWalk give some indication of why SVM is able to perform well.

t-SNE and PCA on node embeddings generated by DeepWalk

Both plots show that the illicit transactions are mostly clustered in the centre, whereas the licit transactions are scattered around them. The t-SNE plot also shows smaller clusters within the illicit transactions, which could be investigated further if more information about the transactions were known. The performance of SVM can be attributed to its ability to transform the data into a higher-dimensional space where the distinction between the classes becomes clearer.
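A minimal sketch of training the SVM on the DeepWalk embeddings from the earlier sketch (RBF kernel with default hyperparameters assumed; the exact accuracy depends on the walks and the split):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Keep only labelled transactions that actually received an embedding.
ids = [tx for tx in labelled["txId"] if tx in emb]
X_emb = np.array([emb[tx] for tx in ids])
y_emb = (labelled.set_index("txId").loc[ids, "class"] == "1").astype(int).to_numpy()

X_tr, X_te, y_tr, y_te = train_test_split(
    X_emb, y_emb, test_size=0.3, stratify=y_emb, random_state=0)

svm = SVC(kernel="rbf")          # the RBF kernel maps the data to a higher-dimensional space
svm.fit(X_tr, y_tr)
print(classification_report(y_te, svm.predict(X_te)))
```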

Conclusion

We have applied a variety of machine learning techniques to the dataset so far. Synthetic data generated using GMM captures the distribution of the original data well. There is temporality in the data, which we explored using the train-on-the-first-(n-1)-time-steps, test-on-the-nth method. The graph structure also provides further insight at a higher level of abstraction, which we explored in the context of connected illicit transactions, though there remain possibilities that could yield a deeper understanding of the nature of the transactions. DeepWalk helped us achieve our best results and supported our hypothesis about the nature of illicit transactions. Although Random Forest performed better in terms of F1 score on the original features, graph features proved to be better identifiers of illicit transactions, and SVM performed much better on the embeddings because the data is sparse, making it harder for the Random Forest model to find distinctions in the features.

This work was our course project for the Machine Learning course at IIIT Delhi, taught by Dr. Jainendra Shukla.

Blog Authors

  1. Bhavay Aggarwal, B.Tech. CSB, IIIT Delhi [LinkedIn]
  2. Saad Ahmad, B.Tech. CSB, IIIT Delhi [LinkedIn]
  3. Prasham Narayan, B.Tech. CSSS, IIIT Delhi [LinkedIn]

References

(1) Weber, M., et al. “Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional Networks for Financial Forensics.” arXiv:1908.02591 [cs, q-fin], 2019.

(2) Perozzi, B., et al. “DeepWalk: Online Learning of Social Representations.” Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’14), 2014, pp. 701–710. doi:10.1145/2623330.2623732.

(3) Xu, L., et al. “Modeling Tabular Data Using Conditional GAN.” arXiv:1907.00503 [cs, stat], 2019. http://arxiv.org/abs/1907.00503.
