Applying graph ML to classify (il)licit bitcoin transactions

Alex Ke
Stanford CS224W GraphML Tutorials
Feb 9, 2022

By Alex Ke and Andy Hwang as part of the Stanford CS224W course project. Colab here to reproduce our results.

Motivation

In 2011, a StarCraft tournament inadvertently made its losers into millionaires.

At the time, the 25 Bitcoin prize was worth a modest $41.25; today, the 5th-8th place finishers' winnings are worth roughly $1.2 million. Bitcoin's meteoric rise, along with that of other cryptocurrencies, has awakened worldwide technological and entrepreneurial interest in payment processing on the blockchain. These projects champion the cause of financial inclusion and have started to win over academics and even policy advocates, like the president of El Salvador:

However, the pseudonymity that protects Bitcoin's users also lets criminals hide in plain sight. Transferring money for ransomware or dark marketplaces has never been easier.

On the other hand, the transparent data available on the blockchain gives investigators more power and allows forensics to be crowdsourced. Anti-money-laundering efforts built on this data have taken off, promising to keep crypto safe and legal and thereby solidify its legitimacy.

Elliptic Data Set

To study money laundering ourselves, we leverage the Elliptic Data Set, consisting of over 200K bitcoin transactions with a total value of $6 billion. These transactions were labeled by Elliptic as made by illicit parties (scams, ransomware, terrorist organizations, etc.) or licit ones (exchanges, miners, wallet services, etc.). Although this dataset is large, it is still only a sliver of the broader bitcoin transaction graph (there are around 300K new transactions per day).

We can interpret the bitcoin blockchain as a directed acyclic graph (DAG) where nodes represent transactions and edges represent the flow of bitcoins from one transaction to another. In this view, the only nodes without incoming edges are coinbase transactions, which reward miners for mining a new block. Additionally, each transaction broadcasts a timestamp to the bitcoin network, which adds a time component to this graph (Figure 1).

Figure 1. Example of a DAG initiated by a coinbase transaction. Credit to Elliptic.

Exploratory Data Analysis

The Elliptic Data Set consists of 203,769 nodes and 234,355 edges. Two percent (4,545) of the nodes are labeled illicit and twenty-one percent (42,019) are labeled licit, while the remaining transactions are unknown. Each transaction also carries a timestep from 1 through 49, spaced approximately two weeks apart, with each timestep representing a single connected component of transactions that occurred within three hours of each other. Figure 2 decomposes our data classes over time; note the significant regime shift at timestep 43, when a dark market was shut down, causing far fewer illicit transactions.

Figure 2. Number of transactions per class over time. Licit transactions are well correlated (0.855) with the number of total transactions, whereas illicit transactions are less correlated (0.679).

Since each timestep's connected components are independent of those in other timesteps, we choose to neglect the temporal element so that our models can learn structures that generalize across all timesteps. Figure 3 displays a visualization of the subgraph induced by labeled nodes in a given timestep.

Figure 3. Labeled node induced subgraph for timestep 27.

The features CSV is anonymized, so it does not have column names. 93 of the features are transaction-level (local) features, such as the "number of inputs/outputs, transaction fee, output volume and aggregated figures such as average BTC received (spent) by the inputs/outputs and average number of incoming (outgoing) transactions associated with the inputs/outputs." The remaining 72 features represent one-hop information about the transaction, "giving the maximum, minimum, standard deviation and correlation coefficients of the neighbour transactions for the same information data (number of inputs/outputs, transaction fee, etc.)."
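As a rough sketch of how the graph can be assembled, here is one way to load the raw CSVs into a PyTorch Geometric Data object. The file names and column layout (txId, timestep, then the 165 anonymized features) follow the public Kaggle release of the Elliptic Data Set and may need adjusting for your copy.

import pandas as pd
import torch
from torch_geometric.data import Data

# Assumed file names from the public Kaggle release of the Elliptic Data Set.
features = pd.read_csv("elliptic_txs_features.csv", header=None)
classes = pd.read_csv("elliptic_txs_classes.csv")   # columns: txId, class in {'1', '2', 'unknown'}
edges = pd.read_csv("elliptic_txs_edgelist.csv")    # columns: txId1, txId2

# Column 0 is txId, column 1 is the timestep, the rest are the 165 anonymized features.
tx_ids = features[0].values
timestep = torch.tensor(features[1].values, dtype=torch.long)
x = torch.tensor(features.iloc[:, 2:].values, dtype=torch.float)

# Map labels: illicit ('1') -> 1, licit ('2') -> 0, unknown -> -1 (masked out during training).
label_map = {"1": 1, "2": 0, "unknown": -1}
classes = classes.set_index("txId")
y = torch.tensor(classes.loc[tx_ids, "class"].map(label_map).values, dtype=torch.long)

# Re-index transactions to 0..N-1 and build the edge_index tensor.
idx = {tx: i for i, tx in enumerate(tx_ids)}
edge_index = torch.tensor(
    [[idx[a] for a in edges["txId1"]], [idx[b] for b in edges["txId2"]]],
    dtype=torch.long,
)

data = Data(x=x, edge_index=edge_index, y=y)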

Classical ML Techniques

Because this task is binary classification with class imbalance, we opt for the F1 score as our evaluation metric, as in the debut paper (Weber et al. (2019)). Because our eventual application is detecting future illicit transactions, we use an approximately 60:20:20 temporal split, where the first 29 timesteps are used for training and the transactions in the final 20 timesteps are randomly split between validation and test sets (to overcome the regime shift).
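For concreteness, here is a minimal sketch of how such a split can be built as boolean masks, assuming the `timestep` and `y` tensors from the loading sketch above (the random seed is arbitrary, and unknown transactions labeled -1 are excluded):

import torch

labeled = y >= 0  # drop 'unknown' transactions

# First 29 timesteps for training; the rest split 50/50 into validation and test.
train_mask = labeled & (timestep <= 29)
holdout = torch.where(labeled & (timestep > 29))[0]

g = torch.Generator().manual_seed(0)
perm = holdout[torch.randperm(len(holdout), generator=g)]
val_idx, test_idx = perm[: len(perm) // 2], perm[len(perm) // 2:]

val_mask = torch.zeros_like(labeled)
val_mask[val_idx] = True
test_mask = torch.zeros_like(labeled)
test_mask[test_idx] = True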

Table 1. Classical ML Results. Note we diverge from Elliptic because we split our dataset differently.

As an array of baselines, we test Logistic Regression, Random Forest, Naive Bayes, and KNN classifiers on the local (transaction-level) features alone and on all features. As expected, the Random Forest trained on all features proves to be a formidable baseline, reaching an F1 score of 0.83 (see Table 1).
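A minimal sketch of the strongest baseline, assuming the feature matrix `x`, labels `y`, and masks from the sketches above; the hyperparameters here are illustrative rather than tuned:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Random Forest on all 165 features, trained on the first 29 timesteps only.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(x[train_mask].numpy(), y[train_mask].numpy())

test_pred = clf.predict(x[test_mask].numpy())
print("Random Forest test F1:", f1_score(y[test_mask].numpy(), test_pred))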

Graph ML Techniques

Graph ML has recently become one of the hottest research topics, inspired by developments in spectral graph theory. To more naturally leverage the data's graph structure, we apply graph neural networks to this transaction graph, where each node's features are initialized with the 165 provided features.

Graph Convolutional Network

Unlike the pixels of an image, the nodes of a graph have no canonical ordering, so ordinary convolutions cannot be applied directly. The key innovation of Kipf and Welling (2016) was a permutation-invariant convolution in which each node's neighborhood forms its own computation graph (Figure 4). Here, the depth of a graph convolutional network corresponds to the number of hops of neighbors that contribute to the computation graph.

Figure 4. The 2-hop neighborhood of node A (left) defines its 2 layer computation graph (right). The boxes represent transformations that share parameters regardless of target node. Credit to CS224W.

Each node in the computation graph passes a message up the network, and parent nodes aggregate the messages they receive. In the simplest case, we can aggregate the neighbors' messages by averaging and then apply a one-layer perceptron (with a nonlinearity) to add expressivity.

Mathematically, if we let A be the adjacency matrix, D the diagonal node degree matrix, H^{(k)} the layer-k representations of the nodes, W^{(k)} the layer-k weight matrix, and \sigma a nonlinearity, the operation we've just described can be written
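H^{(k+1)} = \sigma\left(D^{-1} A H^{(k)} W^{(k)}\right)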

The Graph Convolutional Network of Kipf and Welling (2016) takes this further by introducing self-loops into A, setting \hat{A} = A + I, so that a node's layer-k embedding can more easily inform its layer-(k+1) embedding. The second modification is a symmetric normalization \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} (where \hat{D} is the degree matrix of \hat{A}) instead of the simple averaging \hat{D}^{-1} \hat{A}. Thus their propagation rule can be written
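H^{(k+1)} = \sigma\left(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^{(k)} W^{(k)}\right)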

Finally, we implement a stack of these GCNConv, Batch Normalization, ReLU, and Dropout operations in our GCN module:

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv


class GCN(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_layers, dropout):
        super(GCN, self).__init__()
        # num_layers GCNConv layers: input -> hidden -> ... -> hidden -> output.
        self.convs = torch.nn.ModuleList(
            [GCNConv(input_dim, hidden_dim)]
            + [GCNConv(hidden_dim, hidden_dim) for _ in range(num_layers - 2)]
            + [GCNConv(hidden_dim, output_dim)]
        )
        # One BatchNorm per hidden layer (none after the output layer).
        self.bns = torch.nn.ModuleList(
            [torch.nn.BatchNorm1d(hidden_dim) for _ in range(num_layers - 1)]
        )
        self.dropout = dropout

    def reset_parameters(self):
        for conv in self.convs:
            conv.reset_parameters()
        for bn in self.bns:
            bn.reset_parameters()

    def forward(self, x, adj_t, return_emb=False):
        for i in range(len(self.convs)):
            x = self.convs[i](x, adj_t)
            # Optionally return the output of the second-to-last conv layer as a node embedding.
            if return_emb and i == len(self.convs) - 2:
                return x
            if i < len(self.convs) - 1:
                x = self.bns[i](x)
                x = F.relu(x)
                x = F.dropout(x, p=self.dropout, training=self.training)
            else:
                out = torch.sigmoid(x)
        return out

After some hyperparameter search and attempts at learning rate scheduling, we settled on 3 layers, a hidden dimension of 128, no dropout, and a 0.01 learning rate with the Adam optimizer for 200 epochs. The model was allowed to overfit, and we kept the epoch with the best validation F1 score. This model achieved 0.510 F1 (with 0.456 precision and 0.577 recall) on the test set.
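A minimal sketch of this training loop, reusing the `data` object and masks from the earlier sketches. Here we assume a single sigmoid output with binary cross-entropy loss; the original Colab may make different choices (e.g. two output classes or class weighting).

from sklearn.metrics import f1_score

model = GCN(input_dim=165, hidden_dim=128, output_dim=1, num_layers=3, dropout=0.0)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = torch.nn.BCELoss()  # the model already applies a sigmoid

best_val_f1, best_state = 0.0, None
for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    out = model(data.x, data.edge_index).squeeze(-1)
    loss = criterion(out[train_mask], data.y[train_mask].float())
    loss.backward()
    optimizer.step()

    # Keep the weights from the epoch with the best validation F1.
    model.eval()
    with torch.no_grad():
        pred = (model(data.x, data.edge_index).squeeze(-1) > 0.5).long()
    val_f1 = f1_score(data.y[val_mask], pred[val_mask])
    if val_f1 > best_val_f1:
        best_val_f1 = val_f1
        best_state = {k: v.clone() for k, v in model.state_dict().items()}

model.load_state_dict(best_state)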

Graph Attention Networks

Attention mechanisms have seen a meteoric rise in popularity in recent years, becoming the state-of-the-art method for many natural language processing and computer vision tasks. The benefit is that attention allows the network to focus on the subset of inputs that matter most for the task at hand, letting us reserve model expressiveness for the features most important for classifying transactions as illicit or licit.

We built upon the Graph Convolutional Network by implementing the Graph Attention Network, or GAT (Veličković et al. (2018)), to perform node classification over the graph-structured data. The building block of the GAT is the graph attention layer, a variant of the aggregation function mentioned above. In the GAT, the input to each layer is a set of N node feature vectors, where N is the number of nodes, and the output is likewise a set of N node feature vectors, potentially of a different dimensionality. The layer then applies a shared attention function that computes attention coefficients capturing the importance of one node's features to another.

This general formulation of attention allows each node to attend to all other nodes and can be formalized as
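e_{ij} = a\left(W\vec{h}_i, W\vec{h}_j\right)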

where a denotes the shared attention function, W denotes the shared weight matrix applied to every node's features, and \vec{h}_i denotes the feature vector of node i.

A visual representation of this attention mechanism would be

Figure 5. Left: The attention mechanism a(W\vec{h_i} ,W\vec{h_j}) employed by our model, parametrized by a weight vector \vec{a} ∈ R^{2F} , applying a LeakyReLU activation. Right: An illustration of multihead attention (with K = 3 heads) by node 1 on its neighborhood. Credit to Veličković et al. (2018).

Our implementation follows:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GATConv


class GAT(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, args, emb=False):
        super(GAT, self).__init__()
        conv_model = GATConv
        self.convs = nn.ModuleList()
        self.convs.append(conv_model(input_dim, hidden_dim, heads=args.heads))
        assert args.num_layers >= 1, 'Number of layers is not >= 1'
        for l in range(args.num_layers - 1):
            # Each head outputs hidden_dim features, so later layers take heads * hidden_dim inputs.
            self.convs.append(conv_model(args.heads * hidden_dim, hidden_dim, heads=args.heads))

        # Post-message-passing MLP that maps the concatenated heads to class scores.
        self.post_mp = nn.Sequential(
            nn.Linear(args.heads * hidden_dim, hidden_dim), nn.Dropout(args.dropout),
            nn.Linear(hidden_dim, output_dim))

        self.dropout = args.dropout
        self.num_layers = args.num_layers

        self.emb = emb

    def reset_parameters(self):
        for conv in self.convs:
            conv.reset_parameters()

    def forward(self, x, adj_t):
        for i in range(self.num_layers):
            x = self.convs[i](x, adj_t)
            x = F.relu(x)
            x = F.dropout(x, p=self.dropout, training=self.training)

        x = self.post_mp(x)

        # Optionally return the raw post-MLP representation instead of probabilities.
        if self.emb:
            return x

        return torch.sigmoid(x)

We settled on a GAT with 2 layers, a hidden dimension of 128, 0.5 dropout, and a 0.01 learning rate with the Adam optimizer for 200 epochs. The model was allowed to overfit, and we kept the epoch with the best validation F1 score. This model achieved 0.636 F1 (with 0.681 precision and 0.597 recall) on the test set.

In summary, our results are below:

Table 2. Graph ML results in comparison with classical ML techniques.

Correct and Smooth

Finally, we introduce Correct and Smooth (C&S; Huang et al., ICLR 2021), a recent state-of-the-art collective classification method. C&S is a post-processing procedure applied on top of a base predictor, such as the GAT above: the base model first predicts soft labels, and these first-pass predictions are then combined with the structure of the graph to obtain the final predictions for all nodes. For example, we might have the class predictions below before applying the C&S post-processing procedure.

Credit to CS224W.

The intuition of C&S is to 1) correct the soft predictions, since the magnitude of their errors tends to be biased across different parts of the graph, and then 2) smooth the soft labels, since they are often not smooth over the entirety of the graph.

The correct step diffuses the training errors along the edges using the normalized diffusion matrix A' = D^{-1/2} A D^{-1/2}, where D is the diagonal degree matrix and A is the adjacency matrix. The assumption is that errors are similar for nearby nodes, so we can update the training errors E^{(t)} along the edges as
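E^{(t+1)} = (1 - \alpha) \cdot E^{(t)} + \alpha \cdot A' E^{(t)}

where \alpha is a hyperparameter controlling how strongly the errors are diffused at each step, and E^{(0)} holds the residual errors on the training nodes (and zeros elsewhere).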

This process is depicted below.

Credit to CS224W.

Next, we smooth the corrected labels along the edges, under the assumption that neighboring nodes tend to share labels: illicit transactions are likely to occur between parties that frequently transact illicitly together, and licit transactions between parties that are typically licit. We again diffuse the labels along the graph's edges with a similar post-processing step.
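The smooth step uses the same kind of diffusion, now applied to the corrected soft labels Z (initialized with the ground-truth labels on training nodes and the corrected predictions elsewhere); here A' again denotes a normalized adjacency matrix, and the implementation below uses D^{-1} A for this step:

Z^{(t+1)} = (1 - \alpha) \cdot Z^{(t)} + \alpha \cdot A' Z^{(t)}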

Our PyTorch implementation is shown below:

import torch
from sklearn import metrics
from torch_geometric.nn import CorrectAndSmooth
from torch_sparse import SparseTensor


post = CorrectAndSmooth(num_correction_layers=50, correction_alpha=1.0,
                        num_smoothing_layers=50, smoothing_alpha=0.8,
                        autoscale=False, scale=20.)

# Build the normalized adjacency matrices for the correct (DAD) and smooth (DA) steps.
row, col = data.edge_index
N = len(data.y)
adj = SparseTensor(row=row, col=col, sparse_sizes=(N, N))
deg = adj.sum(dim=1).to(torch.float)
deg_inv_sqrt = deg.pow(-0.5)
deg_inv_sqrt[deg_inv_sqrt == float('inf')] = 0
DAD = deg_inv_sqrt.view(-1, 1) * adj * deg_inv_sqrt.view(1, -1)  # D^{-1/2} A D^{-1/2}
DA = deg_inv_sqrt.view(-1, 1) * deg_inv_sqrt.view(-1, 1) * adj   # D^{-1} A

# Post-process the trained GAT's soft predictions with Correct and Smooth.
with torch.no_grad():
    y_soft = best_model_GAT(data.x, data.edge_index)  # shape [N, num_classes]
y_soft = post.correct(y_soft, data.y[split_idx["train"]], split_idx["train"], DAD)
y_soft = post.smooth(y_soft, data.y[split_idx["train"]], split_idx["train"], DA)

# Convert the smoothed soft labels into hard predictions and evaluate F1 on the test split.
y_pred = y_soft.argmax(dim=-1)
test_f1 = metrics.f1_score(data.y[split_idx["test"]], y_pred[split_idx["test"]])
print(f'Test F1 {test_f1:.4f}')

Conclusion

As suggested by the debut paper (Weber et al. (2019)), techniques relying entirely on graph neural networks show significant promise in learning meaningful representations of nodes' neighborhoods, but fall short as pure classifiers because they do not efficiently leverage the existing node features. Thus, combination techniques that use the embeddings extracted from graph neural networks to complement the existing feature set may show superior performance in detecting illicit transactions.
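As a rough sketch of such a combination (not something we evaluated here), one could concatenate the GCN's intermediate node embeddings with the original features and train a Random Forest on top, reusing the `return_emb` flag from our GCN module and the data and masks from the earlier sketches:

import numpy as np
import torch
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Extract intermediate node embeddings from the trained GCN.
model.eval()
with torch.no_grad():
    emb = model(data.x, data.edge_index, return_emb=True).cpu().numpy()

# Concatenate the learned embeddings with the original 165 features.
combined = np.hstack([data.x.cpu().numpy(), emb])

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(combined[train_mask.numpy()], data.y[train_mask].numpy())

pred = clf.predict(combined[test_mask.numpy()])
print("Combined features test F1:", f1_score(data.y[test_mask].numpy(), pred))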
