Detecting Suspicious Transactions with Graph Neural Networks and Anomaly Scores
By Eric Lee, Alex Shan, and Brent Ju as part of the Stanford CS224W course project.
With billions of dollars transferred through online transactions, detecting fraud has become increasingly challenging yet more critical than ever. We demonstrate that graph neural networks offer a powerful approach to analyzing transaction networks, exploring the use of graph convolutional networks, graph attention networks, and anomaly detection techniques to identify suspicious Bitcoin transactions in the Elliptic dataset. By combining these methods with domain-specific enhancements, our approach achieves F1 scores of 0.99 and 0.89 for detecting licit and illicit transactions, respectively. Our results highlight the potential of graph neural networks in combating financial fraud.
Code
The full code for these experiments and results can be found here:
Illicit Financial Transactions
Problem
In the modern era of online transactions and cryptocurrency, bad actors have found it easier to commit fraud through increasingly sophisticated means. These illicit activities often involve discreetly tunneling money through complex networks of accounts, which makes detection a significant challenge. In these scenarios, traditional rule-based systems struggle to identify fraudulent behavior, because they are ill-equipped to handle the layered transaction patterns that define modern fraud. The inherent complexity of online transaction networks demands innovative approaches to analyze and detect such illicit behavior.
Importance and Impact
Especially as our world becomes increasingly digital, it is crucial to ensure online security. With billions of dollars flowing through transactions every day, fraudulent activities have the potential to disrupt financial ecosystems on a massive scale.
As technology has advanced, it has become increasingly difficult to comprehensively trace transactions and identify consistent patterns indicative of fraud. Developing effective methods to identify such illicit transactions is not only important to safeguard individual assets, but also to maintain public trust in the financial systems we rely on.
The Elliptic Dataset
Background
For this project, we are experimenting with fraud detection methods using the Elliptic Dataset, which features an anonymized network of Bitcoin transactions [1]. The dataset contains 203,769 individual transactions, of which 4,545 are known to be processed by illicit entities such as scammers, and 42,019 are known to be handled by trusted entities. The labels of the remaining 157,205 transactions are unknown. Noting that two transactions are related when Bitcoin flows from one into the other, there are 234,355 unique transaction relationships in the data.
Features
There are 166 features for each transaction, given as unnamed variables to comply with intellectual property regulations. The first 94 features of each node correspond to attributes of the transaction itself, such as the “number of inputs/outputs, transaction fee, output volume, and aggregated figures such as average [Bitcoin] received by the inputs/outputs” [1]. The remaining 72 features represent aggregated attributes “giving the maximum, minimum, standard deviation, and correlation coefficients of the neighboring transactions” [1].
Splitting
To ensure the generalizability of our results, we split our dataset transductively, with a 75%-15%-10% split allocated between training, validation, and test transactions, respectively. Namely, we split over the licit/illicit transaction labels, and evaluate the classification of our model on the test transactions only during final evaluation. Since this problem only involves predicting the nature of specific transactions rather than conducting inference on unseen graphs, this transductive approach is preferable to an inductive one.
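As a minimal sketch of this split (assuming a labels tensor where 0 = licit, 1 = illicit, and 2 = unknown, built when loading the dataset; the mask variable names are ours):

import torch

# Indices of labeled (licit/illicit) nodes only; unknown nodes are excluded
labeled_idx = torch.nonzero(labels != 2).view(-1)

# Shuffle, then carve out 75% / 15% / 10% for train / validation / test
perm = labeled_idx[torch.randperm(len(labeled_idx))]
n_train = int(0.75 * len(perm))
n_val = int(0.15 * len(perm))

train_mask = torch.zeros(len(labels), dtype=torch.bool)
val_mask = torch.zeros(len(labels), dtype=torch.bool)
test_mask = torch.zeros(len(labels), dtype=torch.bool)
train_mask[perm[:n_train]] = True
val_mask[perm[n_train:n_train + n_val]] = True
test_mask[perm[n_train + n_val:]] = True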
Why Graph Neural Networks?
Modeling Bitcoin Transactions as a Graph
Given that the flow of transactions creates a network, this web of Bitcoin transactions can be naturally modeled as a graph. Specifically, each of the 203,769 unique transactions can be represented as a node in the graph, which can have one of three labels: licit, illicit, or unknown. Further, each of the 234,355 relationships between transactions can be expressed as an edge directing the flow of Bitcoin from one transaction to the next. Then, we see that the task reduces to node classification on this graph.
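For concreteness, here is a minimal sketch of assembling this graph into a PyG Data object, assuming the file and column names (and class encoding) of the Kaggle release; the id-to-index maps here match the variables used in our later snippets:

import pandas as pd
import torch
from torch_geometric.data import Data

# Assumed file names from the Kaggle release of the Elliptic dataset
features_df = pd.read_csv("elliptic_txs_features.csv", header=None)
classes_df = pd.read_csv("elliptic_txs_classes.csv")
edgelist_df = pd.read_csv("elliptic_txs_edgelist.csv")

# Column 0 of the features file holds the transaction id; map ids to node indices
id_to_index = {tx_id: i for i, tx_id in enumerate(features_df[0])}
index_to_id = {i: tx_id for tx_id, i in id_to_index.items()}

# Node features (dropping the id column) and directed edges as index pairs
node_features = torch.tensor(features_df.drop(columns=[0]).values, dtype=torch.float)
edge_index = torch.tensor(
    [[id_to_index[s], id_to_index[t]]
     for s, t in zip(edgelist_df["txId1"], edgelist_df["txId2"])],
    dtype=torch.long,
).T

# Labels in the classes file: '1' = illicit, '2' = licit, 'unknown' = unlabeled
class_map = dict(zip(classes_df["txId"], classes_df["class"].astype(str)))
labels = torch.tensor([{"1": 1, "2": 0, "unknown": 2}[class_map[t]] for t in features_df[0]])

data = Data(x=node_features, edge_index=edge_index, y=labels)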
To predict the class of each node, we proceed with applying graph neural networks (GNNs). GNNs are a flexible class of machine learning models that are able to capture complex relationships in graph data by propagating information across nodes and their neighbors. In doing so, GNNs create a robust embedding for each node that encodes not only its attributes but also captures the structure of its neighborhood. In this context, each node combines its transaction features with a measured aggregation of its neighboring nodes’ transaction features.
Context-Aware Node Representations
In this context, GNNs are a good choice because they are able to leverage information about the structure of transaction flows that may not be immediately obvious. Since GNNs aggregate features from neighboring nodes and edges, they are able to detect anomalies based on both individual transaction features and their context in the graph. This helps us deal with the cases where a transaction might seem normal when analyzed in isolation, but is clearly suspicious when viewed in the context of its connections.
Capturing High-Level Dependencies
Fraudulent activity often involves the use of many co-operating intermediaries to launder money through progressive transactions. While these relationships may be difficult to detect with simple rule-based techniques, GNNs can capture these patterns as they propagate information through many hops. As GNNs feature many layers, with aggregation at each layer, their architecture allows them to reason across multiple levels of transaction flow to identify high-level dependencies that may be indicative of fraud.
Detection with a Graph Convolutional Network
Network Architecture
To begin, we proceed with a graph convolutional network (GCN), where nodes aggregate information from neighbors with neural networks. The GCN fits our purpose well, as we hope to combine and aggregate information across multiple transactions to predict whether certain transactions are illicit or not.
Intuitively, we stack several such convolutional layers in our GCN, then minimize the loss of the network to adjust the embeddings (feature-vector representations) for each node.
The Convolutional Network Layer
In our application, we use the standard GCN layer popularized in the “Semi-Supervised Classification with Graph Convolutional Networks” paper [4]. More specifically, at each layer and for each node v, we update its embedding h by:
- Applying a linear transformation B to the embedding for v, and applying a linear transformation W to the mean of the embeddings for the neighbors of v, where B and W are learnable matrices.
- Summing the linear transformations into one term.
- Applying non-linearity, such as the ReLU function.
Rigorously, we update h as follows:

$$h_v^{(l+1)} = \sigma\left( W^{(l)} \cdot \frac{1}{|N(v)|} \sum_{u \in N(v)} h_u^{(l)} \;+\; B^{(l)} h_v^{(l)} \right)$$
Training and Evaluation
We implemented a two-layer GCN with the above update rule, using PyG’s GCNConv function [5]. We applied batch normalization at each layer and chose ReLU for non-linearity.
import torch
import torch.nn.functional as F
from torch.nn import BatchNorm1d
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, dim_in, dim_h, dim_out):
        super(GCN, self).__init__()
        # First layer
        self.b_norm1 = BatchNorm1d(dim_in)
        self.conv1 = GCNConv(dim_in, dim_h)
        # Second layer
        self.b_norm2 = BatchNorm1d(dim_h)
        self.conv2 = GCNConv(dim_h, dim_out)  # Output dim is 1

    def forward(self, x, edge_index):
        # First GCN layer with batch normalization and ReLU
        x = self.b_norm1(x)
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        # Second GCN layer with batch normalization and output logits
        x = self.b_norm2(x)
        x = self.conv2(x, edge_index)
        return x  # Logits for binary classification
We trained the GCN for 300 epochs with learning rate 0.005, weight decay 0.0001, and a hidden layer size of 125. To evaluate, we focus on confusion matrix metrics, since the problem is a classification task. In particular, we focus on F1 score because both precision and recall are important: we want to limit the number of “false alarms” that may disrupt the day-to-day flow of transactions, while not missing instances of fraud that could result in an impactful loss of assets.
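A sketch of this training setup, assuming a single-logit output trained with binary cross-entropy and the mask variables from our split above:

from sklearn.metrics import f1_score

model = GCN(dim_in=node_features.shape[1], dim_h=125, dim_out=1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=0.0001)
criterion = torch.nn.BCEWithLogitsLoss()

for epoch in range(300):
    model.train()
    optimizer.zero_grad()
    logits = model(data.x, data.edge_index).squeeze()
    loss = criterion(logits[train_mask], labels[train_mask].float())
    loss.backward()
    optimizer.step()

# Per-class F1 on the held-out test transactions (0 = licit, 1 = illicit)
model.eval()
with torch.no_grad():
    preds = (model(data.x, data.edge_index).squeeze() > 0).long()
print(f1_score(labels[test_mask], preds[test_mask], average=None))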
In the confusion matrix below, we see the performance on our different classes, where licit transactions = 0 and illicit transactions = 1.
Additionally, we see our desired metrics here. Notably, the GCN performs relatively well in classifying licit transactions, with high scores ≥ 0.98 across precision, recall, and F1 score. However, the GCN performs notably worse on the illicit transactions, which are the important ones! Thus, we need to make improvements to our model.
Concatenating Anomaly Detection Subscores
Noting that illicit transactions are the exception to the norm, we believe that we can apply anomaly detection techniques to improve our model. To give the model more signal about which transactions are illicit, we want to compute additional features to concatenate to our node input vectors in the form of anomaly scores. Existing work [6] distinguishes between two types of anomalies: (1) substructure anomalies, where the structure of a group of nodes is significantly different than other groups of nodes within the graph, and (2) attribute anomalies, where a node is significantly different from its neighbors. Modeling this approach, we calculate these anomaly scores for our transaction graph.
Substructure Subscores
First, we want to calculate an anomaly score measuring the extent to which a node belongs in a cluster of nodes with an unusual structure. As in the original paper [6], we follow the intuition that densely connected clusters of nodes (representing many back and forth transactions) are indicative of anomalous and hence illicit nodes.
We define these densely connected clusters as cliques, where every node within the group is connected to every other node in that group. We then populate R, the set of anomalous regions, by iterating through the graph's cliques of various sizes. For each such region, we increase the anomaly score of every node in the region, so that a node belonging to more cliques receives a higher anomaly score.
Formally, with sub-region (clique) r in the set of anomalous regions R, the substructure subscore s for each node i is:

$$s_i = \sum_{r \in R \,:\, i \in r} |r|$$
In our code, we conducted this computation by first finding the cliques:
"""
Find and sort cliques to compute substructure subscores.
"""
# Convert to undirected so we can find cliques easier
edge_list = data.edge_index.T.numpy()
G_undirected = nx.Graph()
G_undirected.add_edges_from(edge_list)
# Now find cliques in the undirected graph
cliques = list(nx.find_cliques(G_undirected))
Then, we proceed by computing substructure subscores in accordance with the equation defined above, for each node:
"""
Compute the substructure subscores for all nodes.
"""
# Only consider strongly connected components of 3 or more; 2 is trivial (just an edge)
new_cliques = []
for clique in cliques:
if len(clique) >= 3:
new_cliques.append(clique)
id_to_s_subscore_dict = {}
for node_id in id_to_index.keys():
id_to_s_subscore_dict[node_id] = 0
# Calculate unnormalized scores for substructures
for clique in new_cliques:
for index in clique:
node_id = index_to_id[index]
id_to_s_subscore_dict[node_id] += len(clique)
unnormalized_id_to_s_subscore_dict = id_to_s_subscore_dict.copy()
Attribute Subscores
To account for attribute anomalies, we compute a separate anomaly subscore. For this calculation, we use the intuition that an illicit transaction should have similar attributes to its neighboring transactions as bad actors work together to funnel money into distant accounts. To encode this idea, we generated contrastive subgraph examples for each node, consistent with prior work [7].
For each node i, we define its positive subgraph to be the set of its neighboring nodes, and its negative subgraph to be the set of neighboring nodes of a random node j ≠ i. We then sum the dot products of the features of i with the features of every node m within each subgraph, as a measure of the combined similarity between node i and that subgraph. Finally, we subtract the positive sum from the negative sum, so that nodes which look unlike their own neighborhood receive higher scores.
Formally, with p(i) being the positive subgraph of i, n(i) being the negative subgraph of i, and x_i the feature vector of node i, we calculate subscore s as follows:

$$s_i = \frac{1}{|n(i)|} \sum_{m \in n(i)} x_i \cdot x_m \;-\; \frac{1}{|p(i)|} \sum_{m \in p(i)} x_i \cdot x_m$$
We conduct this calculation in our code to arrive at the attribute subscore for each node. We start with creating the desired subgraphs:
"""
Calculate attribute subscores for each node i
"""
# Create subgraphs with positive and negative examples
edge_list = data.edge_index.T.tolist()
G = nx.Graph()
G.add_edges_from(edge_list)
id_to_embedding = {} # node id to vector reps
for index, row in features_df.iterrows():
node_id = index_to_id[index]
id_to_embedding[node_id] = row.values.tolist()
Then, using the subgraphs, we compute the attribute subscore:
def get_k_hop_neighbors(graph, node, k):
    return list(nx.single_source_shortest_path_length(graph, node, cutoff=k).keys())

def calculate_attribute_subscore(node_id, k=1):
    # Positive subgraph: the node's own k-hop neighborhood
    positive_subgraph = get_k_hop_neighbors(G, id_to_index[node_id], k)
    # Negative subgraph: the k-hop neighborhood of a random node
    random_node = random.choice(list(G.nodes))
    negative_subgraph = get_k_hop_neighbors(G, random_node, k)

    # Get embeddings for the node and both subgraphs
    vi = torch.tensor(id_to_embedding[node_id])
    pos_embeddings = torch.tensor([id_to_embedding[index_to_id[idx]] for idx in positive_subgraph])
    neg_embeddings = torch.tensor([id_to_embedding[index_to_id[idx]] for idx in negative_subgraph])

    # Average dot-product similarity with each subgraph
    pos_similarity = torch.sum(vi @ pos_embeddings.T) / len(positive_subgraph)
    neg_similarity = torch.sum(vi @ neg_embeddings.T) / len(negative_subgraph)

    subscore = -pos_similarity + neg_similarity
    return subscore.item()

# Now, we compute attribute subscores for every node
node_id_to_score = {}
for node_id in node_ids:
    node_id_to_score[node_id] = calculate_attribute_subscore(node_id, k=1)

unnormalized_node_id_to_score = node_id_to_score.copy()
Sanity Check on Anomaly Scores
As a quick sanity check, we compare the mean combined anomaly score (the sum of the two subscores) between the licit and illicit nodes to verify that these scores provide signal.
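The combined-score dictionary used below is not built in the snippets above; a plausible construction, assuming the full score is simply the sum of the two unnormalized subscores, would be:

# Assumed: full score = substructure subscore + attribute subscore (both unnormalized)
unnorm_node_id_to_full_score = {
    node_id: unnormalized_id_to_s_subscore_dict[node_id] + unnormalized_node_id_to_score[node_id]
    for node_id in node_ids
}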
"""
Compute average anomaly score for all licit nodes.
"""
node_tuples = []
licit_sum = 0
for node in licit_nodes:
licit_sum += unnorm_node_id_to_full_score[node]
node_tuples.append((node, unnorm_node_id_to_full_score[node], 'licit'))
print(f"Average score for licit nodes is {licit_sum / len(licit_nodes)}")
"""
Compute average anomaly score for all illicit nodes.
"""
illicit_sum = 0
for node in illicit_nodes:
illicit_sum += unnorm_node_id_to_full_score[node]
node_tuples.append((node, unnorm_node_id_to_full_score[node], 'illicit'))
print(f"Average score for illicit nodes is {illicit_sum / len(illicit_nodes)}")
We see that the average score for licit nodes is -1478.8911, while the average score for illicit nodes is -3360.4476, which appears to yield a meaningful difference. Hence, we proceed.
Concatenation of Subscores and Evaluation
Now that we have a substructure subscore and attribute subscore for each node, we concatenate these subscores to the feature vectors for each node.
# u_id_to_s_subscore_dict and u_node_id_to_score are the unnormalized
# subscore dictionaries computed above
subscore_a_list = []
subscore_s_list = []
for node in features_df[0]:
    subscore_s_list.append(u_id_to_s_subscore_dict[node])
    subscore_a_list.append(u_node_id_to_score[node])

# Append each subscore as an extra feature column (cast to float to match node_features)
node_features = torch.cat((node_features, torch.tensor(subscore_s_list, dtype=torch.float).unsqueeze(1)), dim=1)
node_features = torch.cat((node_features, torch.tensor(subscore_a_list, dtype=torch.float).unsqueeze(1)), dim=1)
Then, we train the GCN on these augmented feature vectors with the same hyperparameters as before (300 epochs, learning rate 0.005, weight decay 0.0001, hidden layer size 125). As before, licit transactions have label 0 and illicit transactions have label 1.
Unfortunately, we see that our precision for licit transactions dropped by 0.01, the recall for illicit transactions dropped by 0.04, and the F1 score for illicit transactions dropped by 0.01! Something went wrong when we added our subscores, pointing to potential problems with the expressivity of the model or the subscores themselves. To progress, we work on each of these parts in isolation.
Incorporating the Attention Mechanism
The Graph Attention Network (GAT)
One potential problem is that our graph neural network treats all neighbors equally, hence failing to recognize signals that are strongly indicative of illicit nodes. To address this, we adopt the Graph Attention Network (GAT), whose architecture features an attention mechanism analogous to the self-attention that powers transformers in natural language processing (NLP).
The appeal of the GAT is that the network learns not only the usual weight matrices, but also attention weights that tell the model which neighbors each node should weigh more heavily when aggregating signals through the layers.
Multi-Head Attention Mechanism
We follow the attention layer defined in “How Attentive are Graph Attention Networks,” where the network learns a scoring function e that designates the importance or “attention” to be placed on each neighbor’s messages [8]. The scoring function e takes the embeddings of two nodes i and j as input, and transforms them with concatenation, a linear layer, and a non-linearity. Rigorously, with the vector a and the matrix W to be learned:

$$e(h_i, h_j) = \mathbf{a}^{\top} \, \mathrm{LeakyReLU}\left( \mathbf{W} \left[ h_i \,\Vert\, h_j \right] \right)$$
Then, we calculate the attention score between nodes i and j using the scoring function, normalized across neighbors:

$$\alpha_{ij} = \mathrm{softmax}_j\big(e(h_i, h_j)\big) = \frac{\exp\big(e(h_i, h_j)\big)}{\sum_{k \in N(i)} \exp\big(e(h_i, h_k)\big)}$$
With these attention scores, the GAT modifies the GCN layer by weighing each transformed message with its attention score to create the embedding at the next layer:

$$h_i^{(l+1)} = \sigma\left( \sum_{j \in N(i)} \alpha_{ij} \, \mathbf{W} \, h_j^{(l)} \right)$$
Training and Evaluation
We now create a GAT with the same structure as our previous GCN for comparison, but with GAT layers replacing the GCN layers. In our implementation, we use GATv2Conv, as it implements the attention mechanism outlined above [9]. We use a two-layer GAT, with batch normalization at each layer and Leaky ReLU as the non-linearity between layers.
from torch_geometric.nn import GATv2Conv

class GAT(torch.nn.Module):
    def __init__(self, dim_in, dim_h, dim_out, heads=8):
        super(GAT, self).__init__()
        # First layer: multi-head attention, concatenating the heads' outputs
        self.b_norm1 = BatchNorm1d(dim_in)
        self.gat1 = GATv2Conv(dim_in,
                              dim_h,
                              heads=heads,
                              dropout=0.3)
        # Second layer: heads are averaged (concat=False) to produce the logits
        self.b_norm2 = BatchNorm1d(dim_h * heads)
        self.gat2 = GATv2Conv(dim_h * heads,
                              dim_out,
                              heads=heads,
                              concat=False,
                              dropout=0.6)

    def forward(self, x, edge_index):
        h = self.b_norm1(x)
        h = self.gat1(h, edge_index)
        h = self.b_norm2(h)
        h = F.leaky_relu(h)
        out = self.gat2(h, edge_index)
        return out  # Logits for binary classification
Again, we trained with 300 epochs, learning rate 0.005, weight decay 0.0001, and hidden layer size 125. After training, we saw noticeable improvement in both the confusion matrix and the metrics.
Note the improvement in the confusion matrix: the numbers of false positives and false negatives decreased from 36 to 25 and from 122 to 75, respectively.
Additionally, we see that the metrics reflect this change, with improvements across all metrics for both licit transactions and illicit transactions. This performance suggests that the change in model to add GAT layers was effective, with the attention mechanism helping the model focus on which neighbors held critical information about fraud status.
Subscore Error Analysis
With the improvement in performance from changing models, the next step is to investigate the subscores. While the mean scores by category showed a difference separating the two groups, it is important to note their distributions as well.
After examining the distribution of subscores, it becomes clear that these scores are extremely noisy, with the differences in means driven by outliers rather than by signal! Clearly, we need to rework these subscores to better fit this domain.
Branch-Based Substructure Subscores
While the idea of scoring nodes based on anomalous substructures still holds [6], the assumption that anomalies appear as densely connected clusters, as in the original paper, may not transfer to this domain. We explore these structures by sampling random illicit and licit nodes from the training set, and then running breadth-first search (BFS) to examine the branching structure.
Illicit Branching Structure
For illicit nodes, it seems that connections often do not go far, and when they do travel, they progress in a chain. Intuitively, this makes sense: bad actors are likely to funnel Bitcoin through a chain of wallets, without interacting with other entities. Meanwhile, licit entities are more likely to have high branching factors, such as reputable exchanges that interface with many different customers and accounts.
Licit Branching Structure
Examining the branching structure of licit nodes supports the idea that the transactions of good entities have larger branching factors. While some of the branching factors are still low, we see examples of wide-spreading networks that are not seen when starting to branch from illicit nodes.
Branching Comparison
To confirm that there is a meaningful difference between branching for illicit nodes and licit nodes, we plot the size of their BFS tree on the training set (the number of nodes within the tree when running directed breadth-first-search) as histograms.
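A minimal sketch of this comparison, assuming the directed transaction graph G (an nx.DiGraph built from the edge list) and the licit_nodes/illicit_nodes lists from earlier; the plotting choices are ours:

import matplotlib.pyplot as plt

# Size of the directed BFS tree rooted at each sampled training node
licit_sizes = [len(list(nx.bfs_tree(G, source=node))) for node in licit_nodes]
illicit_sizes = [len(list(nx.bfs_tree(G, source=node))) for node in illicit_nodes]

plt.hist(licit_sizes, bins=30, alpha=0.5, label="licit")
plt.hist(illicit_sizes, bins=30, alpha=0.5, label="illicit")
plt.xlabel("BFS tree size")
plt.ylabel("count")
plt.legend()
plt.show()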
Clearly, there is a significant difference! While there is only one instance of an illicit node with tree size larger than 15, there are many such licit nodes. Moreover, the average tree size is much higher for licit nodes as well.
New Substructure Subscore
Thus, having validated that this property separates licit and illicit nodes by distinguishing structure, we use BFS tree size as our domain-specific substructure score:
new_substructure_subscore = len(list(nx.bfs_tree(G, source=node)))
Streamlined Attribute Subscores
We conduct a similar analysis of our attribute subscores to improve them for this specific problem.
Transaction Attribute Relationships
For illicit nodes, we see that they often feed into another illicit node, and these nodes should be similar due to their roles as co-conspirators.
Meanwhile, licit nodes seem to branch out often, leading to more nodes which have unrelated features. This is intuitive, as Bitcoin exchanges will deal with all types of different transactions.
New Attribute Subscore
Seeing these patterns, we aim to simplify our subscore calculation to reduce potential noise. For our new attribute subscore, we calculate the average Euclidean distance between the given node's features and those of each of its neighbors, as a more straightforward measure of dissimilarity.
import numpy as np

def average_euclidean_distance(node, G):
    neighbors = list(G.neighbors(node))
    if not neighbors:
        return 0  # Handle cases where a node has no neighbors
    node_embedding = node_features[id_to_index.get(node)]
    neighbor_embeddings = [node_features[id_to_index.get(neighbor)] for neighbor in neighbors]
    distances = []
    for neighbor_embedding in neighbor_embeddings:
        distances.append(np.linalg.norm(node_embedding - neighbor_embedding))
    return sum(distances) / len(neighbors)

new_attribute_subscore = average_euclidean_distance(node, G)
Final Model with Updated Subscores
Concatenation of New Subscores
Now that we have the new subscores, we want to concatenate them to our feature vectors for the GAT, as we did with our subscores in the GCN.
# Concatenate the substructure subscores here
G = nx.from_pandas_edgelist(edgelist_df, source="txId1", target="txId2", create_using=nx.DiGraph())
bfs_tree_list = []
for node in features_df[0]:
    bfs_tree = list(nx.bfs_tree(G, source=node))
    # Scale the BFS tree size by the (shifted) clique-based subscore from before
    bfs_tree_list.append(len(bfs_tree) * (u_id_to_s_subscore_dict[node] + 1))
node_features = torch.cat((node_features, torch.tensor(bfs_tree_list, dtype=torch.float).unsqueeze(1)), dim=1)

# Concatenate the new attribute subscores here
attr_subscore = []
for node in features_df[0]:
    attr_subscore.append(average_euclidean_distance(node, G))
node_features = torch.cat((node_features, torch.tensor(attr_subscore, dtype=torch.float).unsqueeze(1)), dim=1)
Training and Final Evaluation
Then, using the same GAT architecture and hyperparameters as before (increasing dim_in to account for the two subscore columns), we run our final experiment.
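Concretely, assuming the same single-logit output as in our GCN, the instantiation becomes:

# node_features now carries the two extra subscore columns
model = GAT(dim_in=node_features.shape[1], dim_h=125, dim_out=1)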
We see some improvement from our original GAT model, especially in classifying illicit transactions.
Further, we see an increase in illicit recall and F1 score by 0.01. We believe this increase is notable given the difficulty of the benchmark, as well as the monetary impact a 1% improvement can bring in preventing large-scale fraud.
Concluding Insights
Comparison Between Models
Ultimately, the GCN offered a successful baseline, but the addition of anomaly subscores reduced performance. Upon further analysis, these original subscores did not properly capture distinguishing signals between licit and illicit transactions, and instead simply added noise. This points to the importance of domain-specific feature augmentation, demonstrating that the generalized approach to creating attribute and substructure subscores [6] needs to be modified for each specific problem.
Applying the attention mechanism in switching to the GAT layer from the GCN led to significant improvement, as the neural network was able to learn discriminately in placing weight on different neighbors’ messages. Then, combining the modified subscores fit to this domain with the GAT yielded the best results, with an F1 score of 0.99 on licit nodes and 0.89 on illicit nodes. In practice, this would raise very few false alarms and flag the vast majority of fraudulent transactions!
Future Work
Given the success of concatenating anomaly subscores after exploring the data, a natural extension for future work would be to delve deeper into feature augmentation and discover additional techniques for identifying illicit transactions.
References
[1] Elliptic. “Elliptic Data Set,” Kaggle. 2019. https://www.kaggle.com/datasets/ellipticco/elliptic-data-set
[2] Mark Weber et al. “Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional Networks for Financial Forensics”, KDD ’19 Workshop on Anomaly Detection in Finance, August 2019. Anchorage, AK, USA.
[3] Tom Robinson. “The Elliptic Data Set: Working With the Community to Combat Financial Crime in Cryptocurrencies,” Elliptic. August 12, 2019. https://www.elliptic.co/blog/elliptic-dataset-cryptocurrency-financial-crime
[4] Thomas N. Kipf & Max Welling. “Semi-Supervised Classification with Graph Convolutional Networks,” ICLR, February 22, 2017. https://arxiv.org/abs/1609.02907
[5] PyG Team. “GCNConv,” PyTorch Geometric. https://pytorch-geometric.readthedocs.io/en/2.5.2/generated/torch_geometric.nn.conv.GCNConv.html
[6] Jincan Duan et al. “ARISE: Graph Anomaly Detection on Attributed Networks via Substructure Awareness,” IEEE TNNLS. October 1, 2023. https://arxiv.org/pdf/2211.15255
[7] Bo Chen et al. “GCCAD: Graph Contrastive Coding for Anomaly Detection,” IEEE, 2023. https://ieeexplore.ieee.org/abstract/document/9870034
[8] Shaked Brody, Uri Alon, & Elan Yahav. “How Attentive are Graph Attention Networks?” ICLR, 2022, https://arxiv.org/abs/2105.14491
[9] PyG Team. “GATv2Conv,” PyTorch Geometric. https://pytorch-geometric.readthedocs.io/en/2.5.1/generated/torch_geometric.nn.conv.GATv2Conv.html