Fraud Detection on Bitcoin Transaction Graphs Using Graph Convolutional Networks

Published in

Stanford CS224W GraphML Tutorials

11 min readFeb 14, 2022

By Cynthia Li, Feiyang Liu, and Yuchen Wang, as part of the Stanford CS224W course project.

Overview

Graph Neural Networks (GNN) have attracted much attention in the machine learning community in recent years. It obtained promising results on a form of data that is more general and flexible than images or tables — graphs. In this post, we propose an application of Graph Convolutional Network (GCN) to detect fraudulent transactions on dynamic Bitcoin transaction graphs. In particular, we explored and compared four ways of splitting training / validation / test datasets: temporal split (inductive), temporal split by timestamps (inductive), random split (transductive), and community split (transductive).

We assume that readers are familiar with machine learning and deep learning concepts, and have a working knowledge of Pytorch.

Our full code can be found here.

Background and Problem Domain

Malicious Bitcoin transactions

Bitcoin is a decentralized and distributed digital currency that allows for peer-to-peer (P2P) transfer without the need of third-party authorization. All transactions of Bitcoins are transparent and recorded on the ledger in real time, while users were given cryptographic identifiers that anonymize their real identities. While its decentralized characteristic of Bitcoin offers low transaction fees that encourages financial inclusion, its pseudonymity also leads to risks of illegal activities like money laundering and underground market purchases[4]. During the process, the application may generate malicious transactions by injecting fake data not intended by the user, which cannot be undone after being digitally signed[1]. As a result, the user permanently loses some funds.

Bitcoin transaction graphs

Graph illustration of Bitcoin dataset. An edge from A to B means there is a Bitcoin flow from transaction A to transaction B.

The Bitcoin blockchain contains rich information for tracking and analyzing the transaction flow over time. Each transaction could contain multiple inputs and outputs. The use of multiple inputs corresponds to the use of multiple coins, and multiple outputs refers to different receivers. So the blockchain could be thought of as a large directed acyclic graph (DAG), where each transaction is a node and edges represent the flow of Bitcoins between transactions.

According to Battista et al.[4], a transaction transfers Bitcoins from a set of inputs to a set of outputs, each of which is associated with a cryptographic identifier and an amount, in such graphs. Two nodes A and B are connected by a directed edge (A, B) if one output of A is used as an input of B.

Dataset

We leverage the Elliptic Data Set, which maps Bitcoin transactions anonymously to a dynamic graph containing 203,769 nodes and 234,355 edges. Each node represents a transaction with 166 features — 94 features of local information (such as time step, transaction fee, number of inputs and outputs, output volume) and 72 features aggregated from neighbors. The time steps (ranging from 1 to 49) are evenly spaced with an interval of about two weeks. A time step also describes one connected subgraph representing transactions added into the blockchain within less than three hours between each other[3].

Notably, the graph can be viewed as dynamic, with a subset of nodes and edges added at each time step. However, no edge connects nodes of different time steps[3].

Dataset of each timestamp, where the black nodes are licit and the red ones are illicit

After removing unlabeled nodes, we have 4,545 nodes labelled illicit and 42,019 labelled licit, a decent size of data on which we can do the node classification task with graph neural networks.

Number of illicit and licit labels in nodes

Method

Graph Data In PyTorch Geometric

In this work, we store our graph as a torch_geometric.data.Data object in PyTorch Geometric. When defining the object, we specified its node features, node labels, and edge_index. edge_index defines the source and target nodes of all edges with the same node indexing as in node labels and node features.

Below is a sample of reading node feature and edge list and construct a torch_geometric.data.Data object (line 46):

CNNs vs GNNs

Graph Neural Network (GNN) has recently become popular for its ability to analyze graph-like data. A graph G is structured by two components: nodes V and edges E, and it is represented as G = (V, E). Unlike images or text that have spatial locality, graphs do not exist in Euclidean space. Thus, graphs cannot be represented by coordinate systems, which makes it hard to apply other existing deep learning algorithms like Convolutional Neural Network (CNN).

A CNN layer with a 3x3 filter could be viewed as the following: each node has a computation graph that aggregates messages from itself and the eight nodes that have plus or minus one difference from the center node’s coordinates. For each node in the input, the number of neighbors is fixed, and the ordering is fixed.

CNN layer convolves images with grid order while GNNs convolves graphs by node neighborhoods

In contrast to this, graphs do not have canonical orders of the nodes, and each node can be of different degrees. GNNs tackle the obstacles of CNNs by defining permutation invariant and equivariant computation graphs. Each node has an embedding vector usually initialized by its node features, and messages are aggregated from its neighbors.

Graph Convolutional Networks

Graph Convolutional Network (GCN) is a kind of GNNs that aggregate information from each node’s neighbors using neural networks. Each node passed through a GCN layer sums up the weighted message from neighboring nodes, normalized by its node degree. A sigmoid or ReLU activation function is optionally added to message or aggregation to involve some nonlinearity, which increases the model’s expressiveness.

Message passing with normalized adjacency matrix and aggregation. Image credit: Stanford CS224W course Powerpoint

The weight of messages from neighboring nodes can have different designing options. The most commonly used one is just the nodes embedding scaled by the degree of the central target node i as in the above figure to prevent exploding gradients in the network. Let H represent the feature matrix with node features on its columns, then the normalized feature matrix is D⁻¹AH where D is the diagonal degree matrix and A is the adjacency matrix.

Our implementation of message passing with diffusion matrix and aggregation

Another type of normalization is through diffusion matrix. Instead of using D⁻¹AH, we can use the normalized diffusion matrix D^-½AD^-½H. The intuition behind using diffusion matrix is that, if two nodes i and j are connected only with each other ( by a directed edge in our case), then the message passed between i and j will be large, while if i and j are also connected with other nodes, the message weight will be small. This allows softer messages passing among nearby nodes, which encourages nodes closely linked to be predicted to the same label.

For our Elliptic dataset, since it is very likely that Bitcoin flows from illicit transactions to another illicit transaction, and same for licit ones, we chose to use the diffusion matrix. Our implementation is the following (directed = False implies the use of diffusion matrix, while True would be normalized adjacency matrix):

Pipeline Construction

We adopt deep learning modules like batch normalization, dropout, activation functions, or attention mechanisms into our GCN architecture.

model architecture

Batch Normalization: Normalize over node embeddings to stabilize neural network training

Dropout: Randomly exclude some nodes from passing message with probability p during training, but include them during testing. Dropout acts as a regularization to prevent overfitting. For engineering convenience, dropout is usually applied before (except for the first) GCN layers.

Activation functions: Choices of ReLU or Sigmoid after each layer of message passing and aggregation. We used ReLU in our model.

In our implementation we used in total three GCN layers.

For our binary node classification task (predicting whether a transaction node is licit or illicit), we used cross entropy loss as our training and validation objectives. (In our implementation, we used negative log likelihood loss with a log softmax function, which is equivalent to cross entropy loss). We use the feature vectors of each node in the Elliptic data set, which has a dimension of 166, as our initial embeddings hᵤ⁰. For evaluation metrics, we adopt accuracy, F1-score and area under receiver operating characteristic curve (AUC-ROC), measuring how well the model performs from different aspects.

Dataset split and training

Splitting dataset into train/validation/test sets for graphs is different from other data structures, since nodes in different sets could be connected with edges, so they are not independently predicted due to message passing. One could choose to split the graph in a transductive or inductive setting. In the inductive splitting, edges connecting nodes in different sets are eliminated, so that each train/validation/test set is an independent graph. In contrast, in the transductive splitting, edges connecting nodes remain and participate in message passing for all nodes. The split happens after message passing and aggregation in this setting.

Left: Transductive setting; Right: Inductive setting. Image credit: Stanford CS224W course Powerpoint

In this project we compare different splitting of data in the below 4 ways.

For each way of splitting we trained for 50 epochs with an Adam optimizer with learning rate of 0.01. The results we show below are ones with best validation accuracy in each experiment.

Temporal group split into three graphs (inductive)

Financial data are usually temporal, so it should be reasonable to assume underlying temporal dynamics in the Bitcoin transaction graph[3]. It would be useful if the model could capture this dynamism, which is achieved by training on previous time steps and make predictions on subsequent time steps. For this purpose we tried the temporal split of the data into train/validation/test sets. We split nodes in the first 1 to 31 time steps into the training set, 32 to 36 for the validation set, and 37 to 49 for the test set.

In this setting each set has one graph of nodes and edges in the designated time step range. Since in the Elliptic dataset there is no edge between nodes in different time steps, there is no edge connecting the train/validation/test sets. This can be considered either as inductive or transductive split.

The training graph is used as input to the GCN during training time, and all three graphs are used during evaluation.

Temporal step split with time step subgraphs (inductive)

In this experiment we split all nodes in the same time step to an individual graph, and with the same time ranges as in the previous experiment, we split 31 graphs into training set, 5 graphs into validation set, and 13 graphs into testing set.

Again, since there is no edge between any two time steps, the three sets are independent in terms of message passing and node predictions.

When training with this splitting method, we pass in each graph in the training set to the GCN, and compute an average classification loss across the training set for back propagation. For evaluation, all evaluation metrics also become the average within each split set.

Random split (transductive)

Next we tried the random split with a transductive setting, i.e. for all train/validation/test sets the model can see all nodes and edges in the graph.

To make sure the three sets have similar number of nodes as in the previous two experiments, we split the data with train:validation:test ratio being roughly 0.6:0.1:0.3. The dataset constructed is a single graph and split_idx is a dictionary that specifies node index for each set.

During training/ validation/ testing, messages are passed and aggregated across the whole graph. Then the resulting node embeddings from the network are split into three sets according to split_idx. Below is an example during training:

Then the training set embeddings are used to compute loss for back propagation. All three sets of embeddings were evaluated at evaluation time.

Community split (transductive)

Lastly, We can also have a correlated split for the dataset. We first define a modified Laplacian for our directed, acyclic, not connected subgraph:

where D is the diagonal degree matrix of A’. As we reconstruct A’, A’ represents the adjacency matrix of a connected, undirected subgraph. (A weak component of the subgraph suffices to indicate the neighborhood of the nodes) and now L only has the smallest eigenvalue to be 0 (as the subgraph has only one connected component). Then we can perform an eigendecomposition to get k smallest non-zero eigenvalues and their corresponding eigenvectors.

Next, we consider the (|N| by k) matrix formed by these k eigenvectors where row i can represent an embedding of node i. Finally, we can perform a k-means clustering to form small community clusters of the graph and merge the communities to reach the number of nodes in the training set. In this way, the split allocates the nodes by leveraging a feature matrix generated from the graph networks and tends to group the nodes that are close to each other. (Takes about 2 minutes to compute)

In this transductive split we have only one graph as input to the network. Similar to random split, node indices for train/validation/test sets were specified in dictionary split_idx.

For this spliting the training and testing procedure is exactly the same as for random split, so we omit here.

Results and discussions

Comparison of different splitting methods

We compare AUC-ROC, F1, and accuracy scores of models with different ways of data splitting. As we can see from the table below, random split and community split result in much better performance on the test set than the two temporal splitting methods. Random split and community split performs very similarly on all scores. The model suffers from overfitting when we split the dataset based on time. We speculate that the time steps are overly sparsely sampled (each interval is about two weeks) so that data of different times hold a weak temporal relationship. However, the results also show that temporal group split does outperforms temporal step split with better AUC-ROC and F1 score. We think this serves as evidence that the model extracts helpful information from time steps when it serves as a feature. Since with the temporal step split all nodes in each of the subgraphs come from the same time step, the time step feature becomes futile. In contrast to this, with the temporal group split the time step feature is still active and could help the classification.

Reference

[1] Blaž Podgorelec, Muhamed Turkanovic ́, and Sašo Karakaticˇ. A machine learning-based method for automated blockchain transaction signing including personalized anomaly detection. Sensors, 20(1), 2020.

[2] Elliptic. www.elliptic.co.

[3] Mark Weber, Giacomo Domeniconi, Jie Chen, Daniel Karl I. Weidele, Claudio Bellei, Tom Robinson, and Charles E. Leiserson. Anti-money laundering in bitcoin: Experimenting with graph convolutional networks for financial forensics. ArXiv, abs/1908.02591, 2019.

[4] Giuseppe Di Battista, Valentino Di Donato, Maurizio Patrignani, Maurizio Pizzonia, Vincenzo Roselli, and Roberto Tamassia. Bitconeview: visualization of flows in the bitcoin transaction graph. In 2015 IEEE Symposium on Visualization for Cyber Security (VizSec), pages 1–8, 2015.