Illegal Bitcoin Transactions? Not on Graph ML’s Watch!
In this blog post, we explain the GCN-LPA model and apply it to detect illicit transactions on the Bitcoin blockchain. Our Google Colab Notebook can be accessed here.
By Alex Donovan, Grant Hugh, and Kazuki Mogi as part of the Stanford CS224W course project.
Introduction
Many types of data can be interpreted as graphs. Examples include computer networks, molecules, food webs, disease paths, social networks, and public transportation systems. Despite the abundance of graph-based data, traditional machine learning methods are often ill-suited to it. Luckily, there is a rapidly growing field, graph machine learning, whose dedicated researchers have devised wonderful machine learning methods specifically for graphs.
In this blog post, we explore one of those wonderful methods, GCN-LPA [1], a new graph machine learning technique proposed by Wang and Leskovec in 2020.
We hope that the reader gains a better understanding of GCN-LPA as well as graph ML techniques in general, and learns how to apply them to the real world!
Table of Contents
Here’s a quick overview of what’s in this post:
- Explanation of GCN-LPA
- Description of the problem of detecting illicit Bitcoin transactions
- Code walkthrough of our implementation and application of GCN-LPA to the problem
- Discussion of our GCN-LPA model results
Graph ML Models: LPA, GCN, and GCN-LPA
In this section, we will first cover the mechanism behind LPA, then GCN, and finally GCN-LPA for the problem of node classification.
Notation Introduction
We begin by introducing relevant notation. Consider a graph G = (V, A, X, Y).
- V = {v₁, …, vₙ} denotes the set of nodes in the graph.
- A is the n × n adjacency matrix of the graph.
- X is the n × p feature matrix of nodes, where the ith row is the feature vector of node vᵢ.
- Y = {y₁, …, yₙ} stores the labels of the nodes in the graph. An unlabeled node vᵢ has the corresponding label yᵢ = 0.
Additionally, define D⁻¹ to be the diagonal matrix whose entry Dᵢᵢ⁻¹ is the reciprocal of the degree of node vᵢ.
The goal of node classification is to learn a mapping from V to L (our set of labels), and predict the labels of unlabeled nodes.
LPA
The Label Propagation Algorithm (LPA) [2] is a graph ML technique that can be used to classify nodes in a graph.
LPA makes the assumption that two nodes connected by an edge are likely to have the same label. Under this assumption, LPA propagates labels iteratively along the edges. This process can be formalized by the following update equation:

Yᵏ⁺¹ = D⁻¹AYᵏ

where Yᵏ is the soft label matrix after the kth iteration. We initialize row i of Y⁰ to a one-hot label indicator vector if node vᵢ is labeled, and to a vector of zeros otherwise. We repeat this update k times, and before each iteration we reset the rows of Yᵏ for labeled nodes back to their one-hot indicator vectors, ensuring that known ground-truth labels persist and are not overpowered by the propagation.
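To make this concrete, here is a minimal sketch of the LPA update in plain PyTorch; the function and argument names are our own, purely for illustration.

```python
import torch

def label_propagation(adj, y_onehot, labeled_mask, k):
    """Run k iterations of LPA on a dense adjacency matrix.

    adj:          (n, n) dense adjacency matrix A
    y_onehot:     (n, c) one-hot labels, with zero rows for unlabeled nodes
    labeled_mask: (n,) boolean mask marking nodes with known labels
    """
    # D^{-1} as a vector of reciprocal degrees (clamped to avoid division by zero)
    deg = adj.sum(dim=1).clamp(min=1.0)
    y = y_onehot.clone()
    for _ in range(k):
        # Reset known labels to their ground-truth one-hot vectors
        y[labeled_mask] = y_onehot[labeled_mask]
        # Y <- D^{-1} A Y: each node averages its neighbors' soft labels
        y = (adj @ y) / deg.unsqueeze(1)
    return y
```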
GCN
A graph convolutional network (GCN) [3] is a specific instance of a graph neural network (GNN). A GNN is a neural network that adapts the idea of convolutional neural networks (CNNs) to graph data and generates node embeddings, which are then used for downstream tasks such as node classification. For those not familiar with CNNs, please refer to this article or this paper. We often cannot directly apply CNNs to graphs because graphs have no fixed node ordering or fixed notion of locality, both of which are essential to CNNs, as illustrated in the figure below.
A GNN overcomes this issue by using a node’s neighborhood as a computational graph and propagating information according to that computational graph. We give an example below in Fig. 3: for node A in the graph, its computational graph is represented by the diagram on the right. A is connected to its three neighbors B, C, and D, and each of these nodes is connected to their respective neighbors, and so on.
GCN, being a specific instance of GNN, uses this definition of a computational graph and computes node embeddings in the following steps:
- Sending messages: each node sends a message (its current node embedding) to each of its neighbors.
- Aggregation: each node averages the messages it receives from its neighbors.
- Updating: each node updates its embedding based on the aggregation output.
The procedure above constitutes one layer in a GCN. Each additional layer adds another level of message passing, so with k layers, nodes that are k-hops away can influence each other’s embeddings. For instance, in Fig. 3, a 2-layer GCN would allow all nodes to influence all other nodes’ embeddings since all nodes in the input graph are 2-hops away from each other, as can be seen in the computational graph.
Using our notation, a GCN layer can be formalized by the equation

Xᵏ⁺¹ = σ(D⁻¹AXᵏWᵏ)

where Xᵏ is the matrix of node embeddings at the kth layer, Wᵏ is a learnable weight matrix for the kth layer, and σ is a sigmoid function that maps values to between 0 and 1.
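As a sanity check on this equation, a single mean-aggregating GCN layer takes only a few lines of PyTorch. This is a bare-bones sketch; libraries like PyTorch Geometric add refinements such as self-loops and symmetric normalization.

```python
import torch

def gcn_layer(adj, x, weight):
    """One GCN layer with mean aggregation: X' = sigmoid(D^{-1} A X W)."""
    deg = adj.sum(dim=1).clamp(min=1.0)  # node degrees
    h = (adj @ x) / deg.unsqueeze(1)     # average each node's neighbor features
    return torch.sigmoid(h @ weight)     # linear transform, then squash to (0, 1)
```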
GCN-LPA
One drawback of GCNs is that the aggregation function treats all neighbors equally (taking the mean), whereas we might want to weight the features of certain neighbors more heavily.
The GCN-LPA model introduces the idea of learning edge weights in addition to node embeddings. Specifically, we want to learn edge weights such that nodes of the same class are more strongly connected to each other, yielding clearer class boundaries. This makes intuitive sense: a learned node embedding is good if all nodes of the same class have nearby embeddings, so by weighting intra-class edges (edges between two nodes of the same class) more heavily, the GCN naturally pushes same-class embeddings together, resulting in better node classification.
How do we promote learning edge weights that emphasize intra-class edges though? One of Wang and Leskovec’s [1] key findings was that
“increasing the strength of edges between the nodes of the same class is equivalent to increasing the accuracy of LPA’s predictions”
Thus, the overarching premise for GCN-LPA is that LPA serves as regularization for GCN to learn edge weights. To see the mathematical proof for why this method works, please refer to Section 2 in the original paper.
Now that we have seen that increasing intra-class edge weights is equivalent to increasing LPA accuracy, we can use LPA in conjunction with a normal GCN to define GCN-LPA. Simply put, in GCN-LPA we want to find the optimal weight matrices W* and edge weights A* by setting

W*, A* = argmin_{W, A} [ L_GCN(W, A) + λ·L_LPA(A) ]

where the two L terms are the cross-entropy loss functions for the GCN and LPA components, and λ is a balancing hyperparameter. So in the end, GCN-LPA has the added benefit of being relatively simple to implement.
Applying GCN-LPA to Detect Illicit Bitcoin Transactions
Now that we have covered the GCN-LPA model, let’s try applying it to a real-world scenario: identifying transactions as licit versus illicit on the Bitcoin blockchain. Don’t worry, you won’t need a thorough understanding of Bitcoin to follow along. The most important thing about the Bitcoin blockchain for us is that it can be nicely modeled as a directed graph. The nodes in this graph represent transactions posted to the blockchain, and edges represent the flow of Bitcoin between two transactions.
Our goal is to classify unlabelled nodes in a dataset as either licit or illicit. Since we have framed this as a node classification problem on a graph, we can try using a GCN-LPA model! In the following sections we will introduce the dataset we used and walk through our code applying a GCN-LPA model to tackle this problem.
Dataset
We use the Elliptic Bitcoin Data Set [4], contributed by Weber and colleagues as part of their paper at KDD 2019 [5]. This dataset can be interpreted as a directed graph in which nodes represent transactions and edges represent flows of Bitcoin from one transaction to another. Below is a visualization of a subset of the dataset represented as a graph:
Each transaction is either licit or illicit. Yellow nodes in the figure represent illicit transactions, while purple nodes represent licit transactions. Some transactions in the dataset are unlabeled, and we remove these nodes in our pre-processing step, along with isolated nodes. After these steps, the dataset contains 35,874 nodes and 36,624 edges; 2,484 nodes (approximately 7% of the total) are labeled illicit.
Every node has 165 recorded features as well as a timestamp telling us when the transaction was posted. There are 49 distinct timestamps in total, and each timestamp contains a single connected component of transactions. Due to resource constraints on Google Colab, we use the timestamps to subset the dataset; accordingly, we do not use the timestamps as features when training our models.
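As a rough illustration of this pre-processing, here is a sketch using pandas; the file names follow the Kaggle release of the dataset, but treat the details as illustrative rather than our exact notebook code.

```python
import pandas as pd

# File names as in the Kaggle release of the Elliptic dataset
classes = pd.read_csv("elliptic_txs_classes.csv")    # columns: txId, class
edges = pd.read_csv("elliptic_txs_edgelist.csv")     # columns: txId1, txId2
features = pd.read_csv("elliptic_txs_features.csv", header=None)

# Drop unlabeled transactions (class == 'unknown')
labeled_ids = set(classes.loc[classes["class"] != "unknown", "txId"])

# Keep only edges between labeled nodes; nodes left without any edges
# (isolated nodes) are dropped along with their features
edges = edges[edges["txId1"].isin(labeled_ids) & edges["txId2"].isin(labeled_ids)]
kept_ids = set(edges["txId1"]) | set(edges["txId2"])
features = features[features[0].isin(kept_ids)]
```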
For more detailed information about the data set, please refer to Kaggle or the paper.
Code Walkthrough
Now let’s dive into our GCN-LPA model for classifying nodes as licit or illicit. We will be implementing a stacked GCN-LPA model where consecutive GCN layers are connected via batch normalization, ReLU, and dropout. The final output of the GCN-LPA model is passed through a Softmax layer to yield probabilistic estimates of the label.
We won’t be going through all of the code here, only the most important bits. Our full Colab notebook can be accessed here. Shoutout to PyTorch [6] and PyTorch Geometric [7] for powering our code!
Recall that for GCN-LPA, we simply need to implement a GCN component and an LPA component, which are tied together by common edge weight parameters and are updated simultaneously.
We start in the __init__ function of our GCNLPA class, which defines the GCN component.
We first define all of our convolutional layers. Here we lean heavily on PyTorch Geometric’s GCNConv, which implements the GCN layer for us. We can treat it just like other layers in PyTorch, meaning that it suffices to provide required information such as the input and output dimensions.
Then we define numlayers - 1 batch normalization layers so that we can apply batch normalization between all consecutive convolutional layers. Finally, we define a softmax function, a dropout percentage, a torch parameter to store our edge weights, and an integer k representing the number of LPA iterations we run.
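Since the full listing lives in the notebook, here is a condensed sketch of what this constructor looks like; hyperparameter names such as num_layers and lpa_iters are our own.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

class GCNLPA(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim, num_layers,
                 num_edges, dropout, lpa_iters):
        super().__init__()
        # Stacked GCN layers: input -> hidden -> ... -> hidden -> output
        self.convs = nn.ModuleList(
            [GCNConv(in_dim, hidden_dim)]
            + [GCNConv(hidden_dim, hidden_dim) for _ in range(num_layers - 2)]
            + [GCNConv(hidden_dim, out_dim)]
        )
        # One batch-norm layer between each pair of consecutive conv layers
        self.bns = nn.ModuleList(
            [nn.BatchNorm1d(hidden_dim) for _ in range(num_layers - 1)]
        )
        self.softmax = nn.Softmax(dim=1)
        self.dropout = dropout
        # Learnable edge weights shared by the GCN and LPA components
        self.edge_weight = nn.Parameter(torch.ones(num_edges))
        self.k = lpa_iters  # number of LPA iterations
```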
Now let’s move to the forward function, which tells us how to propagate data through our GCN and run the LPA iterations.
In the GCN component, we pass the node features, the edge index, and the sigmoid of the edge weights into each convolutional layer to get new node embeddings x. After each convolutional layer, we apply batch normalization, ReLU, and dropout. Finally, we apply softmax to the resulting x to squeeze its values between 0 and 1.
Next we turn to the second part of the forward function, which implements LPA.
Here we first initialize the labels as one-hot vectors. We then lean on torch_geometric.utils.to_dense_adj, which takes our sparse adjacency matrix, given by edge indices and edge attributes, and returns the corresponding dense adjacency matrix. Then we run k iterations of LPA by repeatedly multiplying this matrix by our labels.
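Putting the two halves together, the body of the forward method might look roughly like this; it is a sketch consistent with the constructor above, not the exact notebook code.

```python
import torch
import torch.nn.functional as F
from torch_geometric.utils import to_dense_adj

def forward(self, x, edge_index, y_onehot):
    # --- GCN component ---
    edge_weight = torch.sigmoid(self.edge_weight)   # squash edge weights into (0, 1)
    for conv, bn in zip(self.convs[:-1], self.bns):
        x = conv(x, edge_index, edge_weight)        # message passing + aggregation
        x = F.relu(bn(x))                           # batch norm, then ReLU
        x = F.dropout(x, p=self.dropout, training=self.training)
    x = self.convs[-1](x, edge_index, edge_weight)  # final conv layer
    out = self.softmax(x)                           # probabilistic label estimates

    # --- LPA component ---
    # Build the dense weighted adjacency matrix from the sparse edge list
    adj = to_dense_adj(edge_index, edge_attr=edge_weight).squeeze(0)
    y_hat = y_onehot
    for _ in range(self.k):
        y_hat = adj @ y_hat                         # propagate soft labels along edges
        y_hat = F.normalize(y_hat, p=1.0, dim=1)    # row-normalize (the D^{-1} step)
    return out, y_hat
```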
Now we are ready to train the GCN-LPA model. We do 100 trials of training, with each trial following the four-step procedure below:
- Sample a set of hyperparameters from a predefined hyperparameter space.
- Train the model using the sampled set of hyperparameters on the train set.
- Evaluate the trained model with the validation set.
- Save the model parameters.
Note that we use binary cross-entropy loss for both our GCN and LPA components, and we set our GCN-LPA loss to be the GCN loss plus λ times the LPA loss, as discussed in the paper. Another quick thing to highlight is that the variable t is the cutoff for a node to be considered fraudulent (if the predicted score xₐ is greater than t for a node vₐ, then vₐ is classified as fraudulent).
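In code terms, the loss and the cutoff rule might look like the following sketch, where lam and t stand for the hyperparameters described above.

```python
import torch.nn.functional as F

def gcn_lpa_loss(gcn_out, lpa_out, y_onehot, lam):
    """Combined objective: BCE(GCN output) + lambda * BCE(LPA output)."""
    loss_gcn = F.binary_cross_entropy(gcn_out, y_onehot)
    loss_lpa = F.binary_cross_entropy(lpa_out, y_onehot)
    return loss_gcn + lam * loss_lpa

def classify(gcn_out, t):
    """Flag a node as fraudulent when its predicted illicit score exceeds t."""
    return (gcn_out[:, 1] > t).long()
```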
Finally, in the evaluation stage, we run 100 trials using objective_gcnlpa defined above, and then choose the trained model with the best validation performance. We evaluate this chosen model on the test set and report the F1 score, test accuracy, precision, and recall.
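For completeness, the reported metrics can be computed with scikit-learn, for example as below; our notebook’s exact code may differ.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def test_metrics(y_true, y_pred):
    """F1 score, accuracy, precision, and recall for the chosen model."""
    return {
        "f1": f1_score(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }
```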
Discussion of Results
The table below summarizes the results on the task of classifying nodes as licit vs. illicit transactions. We found that GCN-LPA achieved superior performance compared to LPA and GCN (see Fig. 6 below). Recall that GCN-LPA improves on GCN by strengthening intra-class edge weights via LPA, so GCN-LPA may have performed better because our data satisfies the assumption that illicit transactions interact more with other illicit transactions.
However, it should be noted that the performance difference between GCN-LPA and GCN is relatively small. These results are in line with what Wang and Leskovec found when comparing GCN-LPA to GCN on five citation/co-authorship datasets [1].
It is also noteworthy that the nature of the Elliptic Bitcoin Data Set is radically different from the citation and co-authorship datasets used in the original paper by Wang and Leskovec, which speaks to the versatility of GCN-LPA as a graph ML technique. To explore this further, it would be interesting to see how GCN-LPA performs in other node classification domains, or on other tasks such as graph classification and community detection.
Conclusion
Our results demonstrate the power of GCN-LPA for detecting illicit transactions in Bitcoin transaction graphs. Earlier, we introduced LPA and GCN, models designed for classification tasks on graphs. GCN-LPA improves upon the results of GCN by leveraging the strength of LPA to boost intra-class edge weights. We showed how GCN-LPA can be applied to the Elliptic Bitcoin Data Set, made easy by the versatility of PyTorch and PyTorch Geometric, and it can similarly be applied to a wide variety of problems involving graph data.
If this article sparked your interest in graph ML, we recommend that you watch lectures by Professor Leskovec on YouTube and/or that you try out PyTorch Geometric!
References
[1] Hongwei Wang and Jure Leskovec. Unifying Graph Convolutional Neural Networks and Label Propagation. CoRR, abs/2002.06755, 2020.
[2] Zhu, X., Lafferty, J., and Rosenfeld, R. Semi-Supervised Learning with Graphs. PhD thesis, Carnegie Mellon University, Language Technologies Institute, 2005.
[3] Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations, 2017.
[4] Elliptic, https://www.kaggle.com/ellipticco/elliptic-data-set
[5] M. Weber, G. Domeniconi, J. Chen, D. K. I. Weidele, C. Bellei, T. Robinson, C. E. Leiserson, “Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional Networks for Financial Forensics”, KDD ’19 Workshop on Anomaly Detection in Finance, August 2019, Anchorage, AK, USA.
[6] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., … Chintala, S. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32 (pp. 8024–8035). Curran Associates, Inc.
[7] M. Fey and J. E. Lenssen. Fast Graph Representation Learning with PyTorch Geometric, 2019.