Fraud detection with Graph Attention Networks

Anthony Taing
Stanford CS224W GraphML Tutorials
15 min read · Jan 18, 2022

By Sing Kwan NG and Anthony TAING as part of the Stanford CS224W course project.


Introduction to Fraud Detection

Fraud detection is a set of processes and analyses that allow businesses to identify and prevent unauthorized financial activity. This can include fraudulent credit card transactions, identity theft, cyber hacking, insurance scams, and more. A fraud occurs when someone takes money or other assets from you through deception or criminal activity.

Consequently, an effective fraud detection system helps institutions identify suspicious behaviors or accounts and minimize losses while the fraud is ongoing. Building a fraud detection model with ML algorithms is challenging for several reasons: fraud represents only a small portion of all daily transactions, its distribution evolves quickly over time, and true transaction labels only become available after several days, because investigators cannot check all transactions in time.

But traditional machine learning methods still fail to detect fraud, because most data science models omit something critically important: network structure.

Fraud detection, like social network analysis, calls for the power of a graph. The following figure is an example of a transaction graph network, where we can see nodes such as bank accounts, credit cards, and persons, together with their relationships:

Figure 1: Transactions graph example from https://neo4j.com/blog/enterprise-fraud-detection/

In fact, tabular data models, with data organized in rows and columns, are not designed for capturing the complex relationships and network structure inherent in your data. Analyzing data as a graph enables us to reveal and use its structure for predictions.

Graphs therefore handle many complex issues: they work for huge amounts of data with multiple edge relationships that can change over time. Graph machine learning improves the accuracy of fraud predictions by using more relevant feature information from the network.

Here is the link to our Colab notebook described in this blog:

Graph Dataset

In this tutorial, we use the Bitcoin dataset, also called the Elliptic Data Set [1][2], which maps Bitcoin transactions to real entities belonging to 2 categories: licit or illicit transactions. This is a public dataset and can easily be downloaded from the data science platform Kaggle [3].

It is composed of 3 CSV files: the first maps nodes to their label (licit/illicit/unknown), the second lists the edges between pairs of nodes using their IDs, and the last contains the node IDs with their 166 features.

A node in the graph represents a transaction; an edge represents a flow of Bitcoins between one transaction and another. Each node has 166 features and has been labelled as being created by a “licit”, “illicit” or “unknown” entity.

There are 203,769 nodes and 234,355 edges in the graph. Two percent (4,545) of the nodes are labelled class 1 (illicit), and twenty-one percent (42,019) are labelled class 2 (licit), as we can see in Figure 2. The remaining transactions are not labelled with regard to licit versus illicit. There are 49 distinct time steps.

Figure 2: Visualization of numbers of nodes per class.

The first 94 features represent local information about the transaction, and the remaining 72 are aggregated features obtained using transaction information one hop backward/forward from the center node: the maximum, minimum, standard deviation, and correlation coefficients of the neighbor transactions for the same information (number of inputs/outputs, transaction fee, etc.).

So with this dataset, we will work on a node classification task, where the goal is to predict if a node is licit or illicit.

A working knowledge of PyTorch is required to understand the programming examples. We also assume a basic understanding of machine learning, neural networks, and backpropagation.

Data transformations

After downloading the 3 CSV files, we load them into pandas dataframes. We also reformat the classes from names to integers, and join the features and classes dataframes.
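A minimal sketch of this step, assuming the file names from the Kaggle download and one possible integer encoding of the classes:

```python
import pandas as pd

# Load the three CSV files from the Kaggle download.
df_features = pd.read_csv("elliptic_txs_features.csv", header=None)
df_edges = pd.read_csv("elliptic_txs_edgelist.csv")
df_classes = pd.read_csv("elliptic_txs_classes.csv")

# The features file has no header: column 0 is the transaction id and
# column 1 the time step, followed by the remaining features.
df_features = df_features.rename(columns={0: "txId", 1: "time_step"})

# One possible encoding: 0 = licit ("2"), 1 = illicit ("1"), 2 = unknown.
df_classes["class"] = df_classes["class"].map({"2": 0, "1": 1, "unknown": 2})

# Join features with their labels on the transaction id.
df_merge = df_features.merge(df_classes, how="left", on="txId")
```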

Next, we need to construct 3 different inputs to create the PyG dataset: edge index, node features and edge weights. The original edge data is a table where each row represents an edge, connecting 2 different node ids, as seen below:

Figure 3: Table representing edges, connecting 2 different node ids

We convert transaction IDs to node IDs, as the node features use the dataframe index as the node ID. We also need to reshape the edges into a [2, num_edges] tensor.
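A sketch of this conversion; the column names txId1 and txId2 come from the edge list file, and df_merge is the joined dataframe from above:

```python
import torch

# Map raw transaction ids to contiguous node indices (the row order
# of the merged dataframe serves as the node id).
map_id = {txid: idx for idx, txid in enumerate(df_merge["txId"].values)}

edges = df_edges.copy()
edges["txId1"] = edges["txId1"].map(map_id)
edges["txId2"] = edges["txId2"].map(map_id)

# Transpose to the [2, num_edges] layout expected by PyG.
edge_index = torch.tensor(edges[["txId1", "txId2"]].values.T,
                          dtype=torch.long)
print(edge_index.shape)  # torch.Size([2, 234355])
```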

Once this is done, we do a bit of processing on the node features, dropping unnecessary columns such as the transaction ID and the class. We also retain lists of indices for the known and unknown IDs, which we will use in the dataset.
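For instance, along the following lines (a sketch; the exact columns dropped depend on how the dataframes were merged):

```python
# Keep only the 166 numeric features as the node feature matrix.
node_features = torch.tensor(
    df_merge.drop(columns=["txId", "class"]).values, dtype=torch.float)

# Index lists: labelled (licit/illicit) vs unknown nodes.
classified_idx = df_merge.index[df_merge["class"] != 2].values
unclassified_idx = df_merge.index[df_merge["class"] == 2].values

labels = torch.tensor(df_merge["class"].values, dtype=torch.float)
```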

Figure 4: Visualizations of nodes features

We also need a train/validation split on the classified IDs, using scikit-learn's train_test_split function. The dataset has a lot of unknown nodes, so training and validation can only be run on the nodes that have been labelled, even though the entire graph structure of all nodes is loaded into the dataset. The edge weights are also set to 1 by default.
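For example (15% held out for validation, matching the setting reported in the Performance section):

```python
from sklearn.model_selection import train_test_split

# Split only the labelled node indices; unknown nodes stay in the graph
# but never contribute to the loss or the metrics.
train_idx, valid_idx = train_test_split(
    classified_idx, test_size=0.15, random_state=42)
```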

Finally, we convert the data to the PyTorch Geometric (PyG) graph data format by using the node features, edge index, weights, and labels initialized before.
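A sketch of the final assembly into a PyG Data object:

```python
from torch_geometric.data import Data

# Default edge weight of 1 for every edge.
edge_weight = torch.ones(edge_index.shape[1], dtype=torch.float)

data = Data(x=node_features, edge_index=edge_index,
            edge_attr=edge_weight, y=labels)
```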

Graph Neural Networks

Graph Neural Networks (GNNs) [8, 9, 10] are gaining increasing popularity. GNNs are neural networks that can be directly applied to graphs and provide an easy way to perform node-level, edge-level, and graph-level prediction tasks.

In recent years, GNNs have seen significant developments and successes in many problems in fields such as biology, chemistry, social science, and physics, leading to state-of-the-art performance on several benchmarks. GNNs consider not only instance-level features but also graph structure, performing message passing to agglomerate information from neighbors when making decisions, as shown in the following figure. This leads to great performance on this kind of task.

Figure 5: Illustration of GNN mechanism

A GNN converts a graphical relationship into a system where information messages are passed from neighboring nodes through edges and aggregated into the target node (Figure 5, right). There are many variants of GNN, which differ from each other in how each node aggregates and combines the representations of its neighbors with its own.

In this tutorial, for a fraud detection task on Bitcoin transactions, we present the attention mechanism through the original GAT model, and then show you a second version called GATv2.

Graph Attention Networks (GATs) [4], introduced by Veličković et al. (2018), are one of the most popular GNN architectures and perform better than other models on several benchmarks and tasks. Both GAT and GATv2 leverage the “attention” mechanism [5], which has shown great success in various ML fields, e.g. in NLP with Transformers.

We will use PyTorch Geometric (PyG), the most popular graph deep learning framework built on top of PyTorch. PyG is well suited to quickly implementing GNN models, with abundant graph models already implemented for a wide range of applications related to structured data. Moreover, GraphGym, available as part of PyG, allows a training/evaluation pipeline to be built in a few lines of code.

Graph Attention Networks

1. GAT v1

GraphSAGE and many other popular GNN architectures weigh all neighbor messages with equal importance (e.g. mean or max pooling as the AGGREGATE function). In a GAT model, however, every node updates its representation by attending to its neighbors, using its own representation as the query. Thus, every node computes a weighted average of its neighbors and selects its most relevant ones.

This model uses attention coefficients $\alpha_{ij}$ to determine the importance of each message passed by different nodes in the neighborhood, as in the following figure, which shows a single attention mechanism.

Figure 6: Illustration of a single attention mechanism of GAT

To compute the attention score between two neighbors, a scoring function $e$ computes a score for every edge $(j, i)$, which indicates the importance of the features of the neighbor $j$ to the node $i$. A shared attentional mechanism $a$ and a shared linear transformation parametrized by the weight matrix $\mathbf{W}$ are learned (illustrated below in Figure 7); $\cdot^T$ represents transposition and $\Vert$ denotes vector concatenation, as in Equation (1):

$$e_{ij} = \mathrm{LeakyReLU}\!\left(\vec{a}^{\,T}\left[\mathbf{W}\vec{h}_i \,\Vert\, \mathbf{W}\vec{h}_j\right]\right) \qquad (1)$$

Figure 7: Illustration of the attention mechanism with weight vector a, from [4].

Our attention mechanism $a$ is a single-layer feedforward neural network parametrized by a weight vector $\vec{a}$, followed by a non-linear activation function (LeakyReLU), as in Equation (1). We only compute attention coefficients $e_{ij}$ for nodes $j$ in some neighborhood $\mathcal{N}_i$ of node $i$ in the graph (masked attention).

These attention scores are normalized across all neighbors $\mathcal{N}_i$ using softmax, which turns the attention coefficients into a probability distribution, so the attention function is defined as:

$$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})} \qquad (2)$$

The whole computation of this model can be visualized below:

Figure 8: Illustration of all steps of the GAT model

Then the embeddings from the neighbors are aggregated together, scaled by the learned attention scores (rather than by fixed structural coefficients of the graph such as node degree). The GAT model computes a weighted average of the transformed features of the neighbor nodes (followed by a non-linearity $\sigma$) as the new representation of node $i$, using the normalized attention coefficients:

$$\vec{h}_i' = \sigma\!\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij}\, \mathbf{W}\vec{h}_j\right) \qquad (3)$$

The attention mechanism has trainable parameters and is dynamic, in contrast to a standard Graph Convolutional Network or GraphSAGE, where all messages are weighted equally.

An additional trick is multi-headed attention, which improves performance and, in particular, stabilizes learning. It extends the expressivity of the network, allowing it to capture different notions of importance among nodes of the same neighborhood.

Figure 9: Illustration of the multi-headed attention mechanism with 3 attention heads; colors denote independent attention computations. Inspired by [4] and the course.

When using GAT with the multi-headed attention shown in Figure 9, K independent attention mechanisms execute the transformation of Equation (3), and their outputs are concatenated ($\Vert$ denoting the concatenation operation), as shown in Equation (4):

$$\vec{h}_i' = \big\Vert_{k=1}^{K}\, \sigma\!\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k}\, \mathbf{W}^{k}\vec{h}_j\right) \qquad (4)$$

On the final (prediction) layer of the network, concatenation is no longer sensible; instead, we use averaging, and we get the following formula:

$$\vec{h}_i' = \sigma\!\left(\frac{1}{K} \sum_{k=1}^{K} \sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k}\, \mathbf{W}^{k}\vec{h}_j\right) \qquad (5)$$

Consequently, in this first version, the GAT model computes a static attention: the ranking of attention coefficients is global for all nodes in the graph, shared across all nodes, and unconditioned on the query node. The attention function is monotonic with respect to the neighbor (key) scores; this limits the expressiveness of GAT.

Now, to show you the GAT model, we will first implement the previous equations using functions from the PyG library, and then we will show you how to use the GAT model with PyG's built-in layers.

Coding GAT with PyG

We can implement a simplified version of a GAT conv layer using PyG, based on the equations listed above. The equations and the dimensions of each layer's output are commented to improve readability and understanding.

PyG provides the MessagePassing base class, which helps in creating these kinds of message passing graph neural networks by automatically taking care of message propagation. The base class provides a few helpful functions.

Firstly, the message() function lets us define what node information to pass along each edge, while the aggregation scheme defines how we merge the messages from all edges arriving at the target node ("add", "mean", "max", etc.). Finally, the propagate() function runs the message passing and aggregation over all edges and nodes in the graph. Further details can be found in the official PyG documentation [11].

Even though a prebuilt GATConv is available, let's start by writing a custom GAT layer using the MessagePassing base class to better understand how it works.
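Here is a minimal single-head sketch of such a layer; the full version also handles multiple heads and dropout, so treat this as illustrative rather than the exact original code:

```python
import torch
import torch.nn.functional as F
from torch import nn
from torch_geometric.nn import MessagePassing
from torch_geometric.utils import add_self_loops, softmax

class GATLayer(MessagePassing):
    """Simplified single-head GAT convolution (Equations 1-3)."""

    def __init__(self, in_channels, out_channels):
        super().__init__(aggr="add", node_dim=0)  # sum over neighbors (Eq. 3)
        self.W = nn.Linear(in_channels, out_channels, bias=False)  # shared W
        self.a = nn.Linear(2 * out_channels, 1, bias=False)        # weight vector a

    def forward(self, x, edge_index):
        # Add self-loops so every node also attends to itself.
        edge_index, _ = add_self_loops(edge_index, num_nodes=x.size(0))
        x = self.W(x)                            # [N, out_channels]
        return self.propagate(edge_index, x=x)   # [N, out_channels]

    def message(self, x_i, x_j, index, ptr, size_i):
        # Eq. 1: e_ij = LeakyReLU(a^T [W h_i || W h_j]), shape [E, 1]
        e = F.leaky_relu(self.a(torch.cat([x_i, x_j], dim=-1)),
                         negative_slope=0.2)
        # Eq. 2: softmax-normalize the scores over each target neighborhood.
        alpha = softmax(e, index, ptr, size_i)
        # Messages are neighbor features scaled by attention; the "add"
        # aggregation then completes Eq. 3.
        return alpha * x_j
```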

Now, with the layer set up, in practice we use these convolution layers to create a neural network. Each layer consists of the convolution, followed by a ReLU non-linearity and dropout. The layers can be stacked multiple times, and at the end we add some output layers.

Figure 10: Architecture of our implemented 2-layer GAT model

The example code below is for 2 complete layers, with a linear MLP output layer.
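A minimal sketch of such a network, reusing the GATLayer above; the hidden size (128) and dropout (0.5) follow the defaults reported in the training section:

```python
class GAT2Layers(nn.Module):
    """Two GAT conv layers followed by a linear MLP output head."""

    def __init__(self, in_channels, hidden_dim=128, dropout=0.5):
        super().__init__()
        self.dropout = dropout
        self.conv1 = GATLayer(in_channels, hidden_dim)
        self.conv2 = GATLayer(hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, 1)  # binary licit/illicit output

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = F.relu(self.conv2(x, edge_index))
        x = F.dropout(x, p=self.dropout, training=self.training)
        return torch.sigmoid(self.out(x))  # probability of "illicit"
```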

Model training

For model training and evaluation, trainer and metric manager classes are created to make this much easier. The GNNtrainer runs the training epochs, and helper functions make training and testing easier: save_metrics() logs all key metrics of each epoch during training, save_model() saves the model after training is completed, and predict() runs predictions on any dataset.

Next, the metric manager calculates all the required metrics at each epoch. It also has a built-in function to compute the best results over the entire run.
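A rough sketch of the idea, using scikit-learn metrics; the class name and methods here are illustrative, not the exact original implementation:

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

class MetricManager:
    def __init__(self):
        self.history = []  # one dict of metrics per epoch

    def store(self, y_true, y_prob, threshold=0.5):
        y_pred = (y_prob >= threshold).astype(int)
        metrics = {
            "accuracy": accuracy_score(y_true, y_pred),
            "f1_macro": f1_score(y_true, y_pred, average="macro"),
            "auc": roc_auc_score(y_true, y_prob),
        }
        self.history.append(metrics)
        return metrics

    def best(self, key="f1_macro"):
        # Best epoch over the entire run, by the chosen metric.
        return max(self.history, key=lambda m: m[key])
```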

As default parameters for all our experiments, we used a learning rate of 0.01, 100 epochs, 2 attention heads, dropout of 0.5, a hidden dimension of 128, and weight decay of 1e-5. We also use a binary cross-entropy loss, as this is a binary classification problem.

After configuring the optimizer and criterion, training runs smoothly. The Adam optimizer is used as it generally offers a good balance between convergence speed and stability.
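A condensed sketch of the loop under those defaults, assuming the model, data, and train_idx objects from the earlier sketches:

```python
import torch
from torch import nn

# GAT2Layers, data, and train_idx come from the sketches above.
model = GAT2Layers(in_channels=data.num_node_features)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-5)
criterion = nn.BCELoss()

for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    out = model(data.x, data.edge_index).squeeze()
    # The loss is computed only on the labelled training nodes.
    loss = criterion(out[train_idx], data.y[train_idx])
    loss.backward()
    optimizer.step()
```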

2. GAT v2

GATv2 computes a dynamic attention that overcomes the above limitation of the GAT model: every query has a different ranking of attention coefficients over the keys. We can now define static attention and dynamic attention, which is what makes GAT and GATv2 different.

Attention is a mechanism for computing a distribution over a set of input key vectors, given an additional query vector.

Static attention: A family of scoring functions $\mathcal{F}$ computes static scoring for a given set of key vectors $\mathbb{K} = \{k_1, \ldots, k_n\}$ and query vectors $\mathbb{Q} = \{q_1, \ldots, q_m\}$ if, for every $f \in \mathcal{F}$, there exists a “highest scoring” key $j_f \in [n]$ such that for every query $i \in [m]$ and key $j \in [n]$ it holds that $f(q_i, k_{j_f}) \geq f(q_i, k_j)$.

Every function $f \in \mathcal{F}$ (the family of scoring functions) has a key that is always selected, regardless of the query, so static attention cannot model situations where different keys have different relevance to different queries. To remove this limitation, we have dynamic attention.

Dynamic attention: A family of scoring functions $\mathcal{F}$ computes dynamic scoring for a given set of key vectors $\mathbb{K} = \{k_1, \ldots, k_n\}$ and query vectors $\mathbb{Q} = \{q_1, \ldots, q_m\}$ if, for any mapping $\varphi: [m] \to [n]$, there exists $f \in \mathcal{F}$ such that for any query $i \in [m]$ and any key $j \neq \varphi(i)$ it holds that $f(q_i, k_{\varphi(i)}) > f(q_i, k_j)$.

Dynamic attention strictly generalizes static attention: choosing a constant mapping $\varphi$ that maps all queries to the same key recovers the static case.

The GATv2 model performs better than the first version because it uses a dynamic graph attention variant whose attention function is a universal approximator, making it more expressive than the original model, which is based on static attention. We can see these differences in the following figure:

Figure 11: Comparison of static vs dynamic attention from original paper[6]

The main problem in the standard GAT scoring function is that the learned layers W and a are applied consecutively, and can be collapsed into a single linear layer.

The new GATv2 model modifies the order of operations: it applies the “a” layer after the non-linearity and the “W” layer after the concatenation, effectively applying an MLP to compute the score for each query-key pair. We can see these differences in the following formulas:

GAT (static attention):
$$e(\vec{h}_i, \vec{h}_j) = \mathrm{LeakyReLU}\!\left(\vec{a}^{\,T}\left[\mathbf{W}\vec{h}_i \,\Vert\, \mathbf{W}\vec{h}_j\right]\right)$$

GATv2 (dynamic attention):
$$e(\vec{h}_i, \vec{h}_j) = \vec{a}^{\,T}\,\mathrm{LeakyReLU}\!\left(\mathbf{W}\left[\vec{h}_i \,\Vert\, \vec{h}_j\right]\right)$$

Moreover, this dynamic attention provides much better robustness to noise, because it has the ability to “focus” on the most relevant inputs for a given query by decaying the other inputs, i.e. giving them lower scores. This helps the model distinguish between given data edges (E) and noise edges (E′), whereas GAT's performance decreases severely as noise increases.

Finally, multi-head attention is often used to stabilize the learning process of GAT; however, a single GATv2 head generalizes better than a multi-head GAT.

Now, for the GATv2 model, we implement the previous equations using functions from the PyG library. Then we show you how to use the GATv2 model with PyG's built-in layers.

Coding GAT v2 with PyG

As described above, we code a slightly modified version of the message passing object. The comments highlight the changes from the earlier version.
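A sketch of the modified layer; only the scoring in message() changes relative to the GAT sketch. For simplicity we use a single shared W over the concatenated pair (the official GATv2 uses separate left/right transformations, which is equivalent in expressive power) and a hypothetical W_out for the output transform:

```python
class GATv2Layer(MessagePassing):
    """Simplified single-head GATv2 convolution (dynamic attention)."""

    def __init__(self, in_channels, out_channels):
        super().__init__(aggr="add", node_dim=0)
        self.W = nn.Linear(2 * in_channels, out_channels, bias=False)
        self.a = nn.Linear(out_channels, 1, bias=False)
        self.W_out = nn.Linear(in_channels, out_channels, bias=False)

    def forward(self, x, edge_index):
        edge_index, _ = add_self_loops(edge_index, num_nodes=x.size(0))
        return self.propagate(edge_index, x=x)

    def message(self, x_i, x_j, index, ptr, size_i):
        # GATv2 scoring: e_ij = a^T LeakyReLU(W [h_i || h_j]).
        # The non-linearity now sits between W and a, so the two layers
        # no longer collapse into a single linear one.
        e = self.a(F.leaky_relu(self.W(torch.cat([x_i, x_j], dim=-1)),
                                negative_slope=0.2))
        alpha = softmax(e, index, ptr, size_i)
        return alpha * self.W_out(x_j)
```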

Similar to the first model implementation, we replace the previous layer with our GATv2 layer and use 2 complete layers with a linear MLP output layer. We can also test the prebuilt GATv2 layers from PyG.
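Swapping in the prebuilt layers is a small change; here is a sketch using PyG's GATv2Conv (GATConv has the same call signature for the GAT variant):

```python
from torch_geometric.nn import GATv2Conv

class GATv2Prebuilt(nn.Module):
    """Two prebuilt GATv2Conv layers with a linear output head."""

    def __init__(self, in_channels, hidden_dim=128, heads=2):
        super().__init__()
        self.conv1 = GATv2Conv(in_channels, hidden_dim, heads=heads,
                               dropout=0.5)
        # With concat=True (the default), each layer outputs
        # hidden_dim * heads features.
        self.conv2 = GATv2Conv(hidden_dim * heads, hidden_dim, heads=heads,
                               dropout=0.5)
        self.out = nn.Linear(hidden_dim * heads, 1)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        return torch.sigmoid(self.out(x))
```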

Performance

We ran various combinations of experiments; the results are shown below. We set aside 15% of the known classified nodes as a validation set, and trained for 100 epochs.

Figure 12: Visualization of model performance (Custom = custom GAT layers, Prebuilt = prebuilt GATConv layers, 3layers = custom with 3 GNN layers instead of 2).

Based on the model performance, we can see that the prebuilt GATv2 is the best model, achieving good performance across all metrics. Due to the highly imbalanced data, the F1 score is a good representation of overall performance, and 0.92 macro F1 is very good. It outperforms the GCN benchmark [2].

One observation is that the prebuilt GAT layers from PyG perform quite a bit better than our custom-built GAT layers. This suggests some small tweaks and optimizations on their side. They also converge much faster to their optimum performance.

There is a slight performance improvement on most metrics for GATv2 vs. GAT; however, for our custom-built layers we were unable to demonstrate the same uplift.

Figure 13: Table of performance by model.

Last but not least, varying the architecture showed different results. For example, we tried a version with 3 GNN layers instead of 2, but this seemed to lower performance. This can happen: more is not always better with GNNs, because stacking many layers can lead to over-smoothing, depending on the size of the graph. Over-smoothing is caused by a large overlap in the neighborhoods (the receptive fields) of the different target nodes.

Visualization

After setting up the model, we can visualize how its predictions look. Firstly, we pick one time period to reduce the size of the graph, as the dataset is segmented into 49 different time periods. Then we use the NetworkX library to create a graph object, which we use to plot the diagram.
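A sketch of the plotting step, reusing the remapped edges dataframe from the data transformations (the per-node coloring by actual/predicted label is omitted for brevity):

```python
import networkx as nx
import matplotlib.pyplot as plt

time_period = 28  # one of the 49 time steps
sub_nodes = df_merge.index[df_merge["time_step"] == time_period]
sub_edges = edges[edges["txId1"].isin(sub_nodes)
                  & edges["txId2"].isin(sub_nodes)]

# Build the NetworkX graph from the filtered edge list and draw it.
G = nx.from_pandas_edgelist(sub_edges, source="txId1", target="txId2")
nx.draw(G, pos=nx.spring_layout(G), node_size=10)
plt.show()
```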

The diagram below has the following legend: green = not illicit (not fraud), red = illicit (fraud), blue = predicted not illicit, orange = predicted illicit.

Figure 14: Visualization of original nodes and predicted nodes for time period 28.

Looking at the graph, the majority of transaction nodes are heavily linked in a central cluster. The actual fraud and the fraud predicted by the new model are fairly distributed among the central cluster and the shorter transaction chains.

Conclusion

Graph Attention Networks assign different importance to nodes of the same neighborhood, enabling a leap in model capacity while still operating over entire neighborhoods.

In this tutorial, we showed a detailed implementation of GAT and of the improved GATv2, a more expressive variant that achieves dynamic attention by modifying the order of operations and is more robust to noisy edges.

The prebuilt GAT models both perform very well on the fraud dataset, achieving 0.92 macro F1 and 0.97 accuracy, beating existing benchmarks using GCN. As GATv2 outperforms GAT on several benchmarks, the authors suggest that the GATv2 model should be considered as a baseline, replacing the original GAT model.

References

  1. Elliptic website, www.elliptic.co.
  2. M. Weber et al., Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional Networks for Financial Forensics, arXiv:1908.02591, 2019.
  3. M. Weber, G. Domeniconi, J. Chen, D. K. I. Weidele, C. Bellei, T. Robinson, C. E. Leiserson, https://www.kaggle.com/ellipticco/elliptic-data-set, 2019.
  4. Graph Attention Networks, Petar Veličković and Guillem Cucurull and Arantxa Casanova and Adriana Romero and Pietro Liò and Yoshua Bengio, arXiv:1710.10903, 2018
  5. Attention Is All You Need, Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin, arXiv:1706.03762, 2017.
  6. How Attentive are Graph Attention Networks?, Shaked Brody, Uri Alon, Eran Yahav, arXiv:2105.14491, 2021
  7. Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges, arXiv:2104.13478, M. M. Bronstein, J. Bruna, T. Cohen, P. Velickovic, 2021.
  8. https://tkipf.github.io/graph-convolutional-networks/
  9. https://towardsdatascience.com/a-gentle-introduction-to-graph-neural-network-basics-deepwalk-and-graphsage-db5d540d50b3
  10. https://towardsdatascience.com/an-introduction-to-graph-neural-network-gnn-for-analysing-structured-data-afce79f4cfdc
  11. PyG documentation: pytorch-geometric.readthedocs.io/en/latest/
