From One Line to Fully Customized — A Three-Level Introduction to Graph ML Fraud Detection with PyG

By Eric Zhao and Yikun Chi as part of the Stanford CS224W course project.

Stanford CS224W GraphML Tutorials · Jan 26, 2022

--

With the rise of the internet over the past few decades, the number of fraudsters and the variety of their schemes have grown as well. From identity theft to phishing attempts to fake reviews, these fraudsters commit a wide variety of crimes on the internet. In fact, in 2021 alone, digital fraud attempts were up 25% in the United States and up 109% in the financial services industry.

Many fraud detection scenarios can be framed as classifying nodes (transactions, reviews, etc.) in a connected graph. Looking at the data from a graph perspective brings additional benefits by leveraging the connections between data points. The traditional approach is to combine graph-based features with classical ML models. Over the past few years, with the development of Graph Neural Networks (GNNs) such as GCN, GAT, GraphSage, and many others, we can build more advanced GNN-based fraud detection models.

In this project, our team demonstrates the use of such methods for tackling a specific kind of fraud: fake reviews. Using PyG, we provide implementation examples of established GraphML methods (e.g., GCN) as well as more customized methods (GraphConsis).

A Gentle Introduction to GraphML

A graph is a collection of nodes and edges

A simple graph

Before diving into the details, we first want to provide a brief introduction to GraphML: after all, what is GraphML, and what makes it different from “conventional” machine learning? At its core, GraphML is a branch of machine learning that deals specifically with graph data. In general, a graph (or relational graph, to be specific) can be thought of as a data structure consisting of a collection of entities called nodes (e.g., a person, review, or transaction) and a collection of interactions between pairs of nodes called edges, each of which may carry a feature vector. Graphs come in many different kinds: undirected and directed, homogeneous and heterogeneous, among others. Phuc Nguyen has a good blog on the different kinds of graphs.

A Graph Neural Network (GNN) is a neural network framework designed specifically for graph data

Taking machine learning one step further into the sphere of graph learning, one common way of solving the node classification problem above is through Graph Neural Networks (GNNs), where we apply a deep learning approach in the domain of relational graphs. In general, training a GNN can be thought of in four steps:

  1. Graph Augmentation: We need to ensure that every node in the graph has node-level features (just like in regular machine learning). Some relational graphs come with no node features at all, only an adjacency matrix. For GNNs to work, each node needs a feature vector (even if it is as simple as a vector of p zeros).
  2. Create each GNN layer: Just as a regular neural network has different kinds of layers (convolutional layers, linear layers, etc.), a GNN has its own layer types, such as the GCN layer or the GraphSage layer. This is where different types of GNNs tend to differ the most; e.g., a GraphSage model is built from GraphSage layers.
  3. Stack the layers together: We stack multiple GNN layers, along with common neural network layers such as Dropout and ReLU, to build a GNN model.
  4. Define the learning objective: A graph model can have many learning objectives, such as node-level prediction, edge-level prediction, or even graph-level prediction.

GNN Layer Detail

As we can see from the introduction above, building a GNN is very similar to building a regular neural network; the main difference is that we use GNN layers instead of standard neural network layers. So let's now dive into what exactly a single GNN layer does:

A single GNN layer passes information from neighbor nodes to the target node (message passing) and then aggregates all of that information (aggregation). It does this for every node.

A 2-layer GNN

The figure above shows an example of a two-layer GNN model. Each box in the figure represents a layer. Each GNN layer involves two processes: message passing and aggregation.

During the message-passing phase, every node receives information from its neighbors. For example, in the middle of the figure, we can see that node C receives information from its neighbors A, B, E, and F.

The node then aggregates all received messages (usually through a sum or mean) and applies a linear layer (matrix multiplication) followed by a non-linear function. We call this output the node's embedding at this particular layer. A node's embedding at layer 0 is simply its feature vector.

To see things from a top-down perspective: in a two-layer GNN, the output for node A is node A's embedding at layer 2. To obtain it, we need to aggregate messages from node A's neighbors (B, C, D), and those messages are based on the neighbors' layer-1 embeddings. For A's neighbor C, to get its layer-1 embedding, we need to aggregate messages from node C's neighbors, and those messages are based on C's neighbors' layer-0 embeddings.

To put message passing and aggregation in a more concrete perspective, below is the message-passing and aggregation formula for one type of graph neural network, the Graph Convolutional Network (GCN):

GCN Layer
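In symbols, writing N(v) for the neighbors of node v, W^(l) for the layer's learnable weight matrix, h_v^(l) for node v's embedding at layer l, and σ for the non-linearity, the two lines read:

$$
\mathbf{h}_v^{(l)} = \sigma\left(\frac{1}{|N(v)|} \sum_{u \in N(v)} \mathbf{W}^{(l)} \mathbf{h}_u^{(l-1)}\right)
$$

$$
\mathbf{h}_v^{(l)} = \sigma\left(\mathbf{W}^{(l)} \cdot \frac{1}{|N(v)|} \sum_{u \in N(v)} \mathbf{h}_u^{(l-1)}\right)
$$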

In the first line, the sigma (σ) represents a non-linear function and the mean is the aggregation function. The message is the previous layer's node embedding passed through a linear transformation. So overall, given a target node v, each of v's neighbors u generates a message by linearly transforming its embedding from the previous layer. Node v then takes the mean of those messages and applies a non-linear transformation.

In practice, and in the rest of this blog, since all of v's neighbors go through the same linear transformation, we can simply average all the embeddings first and then apply the linear transformation (the second line).

A single GNN layer performs message passing and aggregation for every node and its neighbors. This may sound complicated to code, but PyG provides a very intuitive framework for building GNNs. With this overview in mind, we will now discuss our specific implementation of GNNs in PyG to concretely illustrate how such methods are used for fake review detection.

Data Overview: Yelp Chicago Restaurant and Hotel Reviews

The YelpChi dataset contains 45,894 reviews for a set of hotels and restaurants in the Chicago area. The node features are derived from the content of each review using Word2vec. The label of each review (fraud or non-fraud) is determined by Yelp's internal algorithm.

The details of the data processing can be found in the accompanying Colab. For now, we just want to highlight that a graph object data in PyG has to contain the following (a small made-up example follows the list):

  • data.x : Node feature matrix with shape [num_nodes, num_node_features]. In the Yelp Chicago dataset, the number of node features is 32.
  • data.edge_index : Edge index in COO format with shape [2, num_edges]. Given a particular edge, its slice in the first dimension, data.edge_index[:, edge_idx], has the form [source_node_index, target_node_index].
  • data.y : Node labels for node prediction, with shape [num_nodes].
  • data.train_mask/val_mask/test_mask : Boolean masks with shape [num_nodes] indicating which nodes belong to the training/validation/test set, respectively.
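For illustration, here is a tiny, made-up graph assembled in this format (three nodes, two edges; the feature values and labels are arbitrary and not taken from YelpChi):

```python
import torch
from torch_geometric.data import Data

x = torch.randn(3, 32)                                 # [num_nodes, num_node_features]
edge_index = torch.tensor([[0, 1],                     # source node indices
                           [1, 2]], dtype=torch.long)  # target node indices
y = torch.tensor([0, 1, 0])                            # node labels (0 = non-fraud, 1 = fraud)
train_mask = torch.tensor([True, True, False])         # which nodes are used for training

data = Data(x=x, edge_index=edge_index, y=y, train_mask=train_mask)
```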

One-line solution: Creating a Graph ML model using PyG Built-in Model

PyG provides a variety of pre-built GraphML models, such as GCN, GraphSage, and more. Using these pre-built models, we can construct our graph machine learning model in one line of code. Below, we construct a two-layer GCN for our spam review classification task. The input channel size is 32 (the node feature dimension) and the output channel size is 2 (binary classification using the node embedding).

Using PyG pre-built model
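A minimal sketch of this one-liner might look like the following; the hidden size of 64 is our own choice rather than something prescribed by the task:

```python
from torch_geometric.nn import GCN

# Two-layer GCN: 32 input channels (node feature dimension),
# 2 output channels (fraud / non-fraud logits per node).
model = GCN(in_channels=32, hidden_channels=64, num_layers=2, out_channels=2)
```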

Boom! Just like that, we constructed our first Graph Convolutional Network! We can now train the network using the code below:

Training codes

Notice that, in comparison to normal deep learning models, a graph neural network requires not only x as input but also the edge information. This is why we feed both data.x and data.edge_index into the model.

One small side note: since our dataset is very imbalanced (over 95% one class), we use the Focal Loss developed by Facebook. The technical details of Focal Loss are outside the scope of this PyG tutorial, so we will skip them for now. The only relevant parts are that 1) Focal Loss takes class probability output, and 2) by increasing the parameter gamma, Focal Loss forces the model to pay more attention to cases that are hard to classify. In comparison to binary cross-entropy loss, Focal Loss helps us obtain results that are not just all predicting non-fraud. Bhattacharya has a good blog about Focal Loss if you want more detail.
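A minimal sketch of such a training loop, with a bare-bones focal loss of our own (no alpha weighting) and an assumed learning rate of 0.01, might look like this:

```python
import torch
import torch.nn.functional as F

def focal_loss(probs, targets, gamma=2.0):
    # Down-weight easy examples by (1 - p_t)^gamma, where p_t is the
    # probability the model assigns to the true class.
    pt = probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return (-(1.0 - pt) ** gamma * torch.log(pt + 1e-8)).mean()

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

model.train()
for epoch in range(50):
    optimizer.zero_grad()
    # Unlike a plain MLP, the GNN needs the edge information as well as the node features.
    out = model(data.x, data.edge_index)
    probs = F.softmax(out, dim=1)
    loss = focal_loss(probs[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
```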

Model Evaluation

After training for 50 epochs (no hyperparameter tuning), we evaluate the model's performance. Although we got over 94% accuracy, we only recovered 2 out of 263 fraud labels. The high accuracy is due to the dataset being extremely imbalanced.

Multi-line Solution: Creating a Custom Graph ML Model Using PyG Built-in Message-passing Layers

Let’s forge ahead! What if you want more variety in the hidden dimensions, or want to customize a bit more by using different layers? Similar to PyTorch, PyG provides a variety of pre-built layers. We can construct our own custom graph ML models by composing different message-passing layers, as shown below:

Building a Custom Model with PyG Pre-built Message-passing Layers
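A sketch of such a custom model, assuming a hidden dimension of 64 and a dropout probability of 0.5 (the class name and both values are our own choices), could look like this:

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class CustomGCN(torch.nn.Module):
    def __init__(self, in_channels=32, hidden_channels=64, out_channels=2):
        super().__init__()
        # Two message-passing layers; either GCNConv could be swapped
        # for SAGEConv, GATConv, or any other PyG layer.
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, data):
        # The whole graph object comes in; the model unpacks what it needs.
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return x
```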

You should find this very familiar; it is almost exactly like building ML models in PyTorch! Our model contains two message-passing layers, both of them Graph Convolutional Network (GCNConv) layers. If we want, we can easily replace any GCNConv layer with another layer type such as SAGEConv, or add other kinds of layers, such as pooling, in the model's forward process. In comparison to the previous section, we also added a dropout layer to our custom model.

Some differences are worth noting when comparing our graph ML model to traditional deep learning models or to the pre-built PyG models: our model now takes a graph object as input. So instead of calling model(data.x, data.edge_index), we can just call model(data), and within the forward process the model automatically obtains the edge information from the overall graph object. As you will see in the next section, when building fully custom graph ML models, we will append more information to the graph object data in this way.

After training for 50 epochs (no hyperparameter tuning), our model achieved an accuracy of 84%, but we recovered 52 out of 263 fraud labels.

So far, things are not too hard. Grab a coffee and take a rest! Next, we will dive into how to develop fully customized graph ML models and write our own message-passing layer.

Welcome Back!

Fully Customized Graph ML model using Custom Message-passing Layer

To dive further into PyG, what if we want to try out our own message-passing layers? PyG provides a base class called MessagePassing, and we can develop our own message-passing layer on top of it.

The paper “Alleviating the Inconsistency Problem of Applying Graph Neural Network to Fraud Detection” introduced a particular kind of GNN: GraphConsis. In this tutorial and the accompanying Colab, we provide an implementation of some of the message-passing strategies discussed in the paper.

A quick walkthrough of the GraphConsis message-passing mechanics and theory

Recall from our very brief introduction to graph neural networks that, at each message-passing layer, the center node's neighbors pass their embeddings to the center node. In our Yelp review use case, the number of fake reviews is far smaller than the number of legitimate reviews. As a result, when we build a graph of reviews, most of a fake review's neighbors are legitimate reviews, so directly aggregating all neighbors will mostly propagate information from legitimate reviews to fake reviews. Ideally, given a fake review as the central node, we want to propagate information only from neighbors that are also very likely to be fake, and vice versa. The paper proposes two modifications to the message-passing process: 1) context embeddings, and 2) neighbor sampling.

Context embedding: With context embeddings, we simply add a trainable embedding to each node's features. For a particular node, the original feature is a vector of length 32. The context embedding prepends another trainable vector of length 32 to the original feature, so overall the input feature is a vector of length 64, whose first 32 entries are trainable and initialized to 0. Through training, this context embedding captures the local structure around the node and can help distinguish fraud from non-fraud.

Neighbor Sampling: Neighbor sampling applies weights to the traditional mean aggregation process. Given a pair of nodes u, v, their consistency score s at layer l is defined as follows (h is the embedding):

Consistency Score Definition
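In our notation, with h_u^(l) denoting node u's embedding at layer l, the consistency score takes the form:

$$
s^{(l)}(u, v) = \exp\left(-\left\lVert \mathbf{h}_u^{(l)} - \mathbf{h}_v^{(l)} \right\rVert_2^2\right)
$$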

The final weight of the message from node u to node v is defined as the consistency score between u and v, normalized by the consistency scores of all of v's neighbors:

Final Weight Definition
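With N(v) denoting v's neighbors, the weight applied to the message from u to v is:

$$
\alpha^{(l)}(u, v) = \frac{s^{(l)}(u, v)}{\sum_{u' \in N(v)} s^{(l)}(u', v)}
$$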

This weighted mean aggregation captures the intuition that we should propagate messages from fraud cases to fraud cases and vice versa. (Note that the original paper also applies a threshold to discard any neighbor with a very low consistency score; we skip this step for now.)

Now with the theory out of the way, let’s see how we can implement context embedding and neighbor sampling in PyG by building a custom message-passing layer.

Implementing GraphConsis Models

First, we want to modify our data to add trainable embeddings:
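A sketch of this data modification, assuming the context embedding occupies the first 32 feature columns, might be:

```python
import torch

# Pad every node's 32-dim feature vector with 32 zeros that the
# context embedding will fill in, giving features of shape [num_nodes, 64].
data.x = torch.cat([torch.zeros(data.num_nodes, 32), data.x], dim=1)

# One trainable 32-dim context embedding per node, stored on the graph object
# so that it can later be handed to the optimizer alongside the model parameters.
data.trainable_embedding = torch.nn.Parameter(torch.zeros(data.num_nodes, 32))
```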

The first step pads every node's features with a zero vector of length 32 (the slots the context embedding will occupy). The second step initializes a trainable embedding of shape [num_nodes, 32] and adds it to our graph object data. This trainable embedding is added to the optimizer's parameter group and updated by the optimizer.

Next, let’s take a look at our GraphConsis model:

GraphConsis Model
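A sketch of the model, with hidden size and dropout again being our own choices, might look like this (custom_Layer is the message-passing layer defined in the next part):

```python
import torch
import torch.nn.functional as F

class GraphConsis(torch.nn.Module):
    def __init__(self, in_channels=64, hidden_channels=64, out_channels=2):
        super().__init__()
        # Our own message-passing layers instead of PyG's GCNConv.
        self.conv1 = custom_Layer(in_channels, hidden_channels)
        self.conv2 = custom_Layer(hidden_channels, out_channels)

    def forward(self, data):
        edge_index = data.edge_index
        # Add the trainable context embedding into the first 32 (zero-initialized)
        # columns of the [num_nodes, 64] feature matrix.
        pad = data.x.size(1) - data.trainable_embedding.size(1)
        x = data.x + F.pad(data.trainable_embedding, (0, pad))
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return x
```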

As we can see, the GraphConsis class is virtually the same as our custom model with GCN layers from the previous section. There are only two differences: 1) instead of using the GCNConv provided by PyG, we use our own custom layer, and 2) we obtain the trainable embedding from the data object and add it to the feature vector data.x, which has shape [num_of_nodes, 64]. This way, the trainable embeddings are updated during the model training process.

Finally, let’s take a look at our custom message-passing layer, which implements the neighbor sampling strategy. The custom_Layer has the following structure:
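Roughly, the skeleton looks like this (method bodies elided here; the internal linear layer self.lin is our own naming, and each method is sketched in more detail below):

```python
import torch
from torch_geometric.nn import MessagePassing

class custom_Layer(MessagePassing):
    def __init__(self, in_channels, out_channels):
        super().__init__()  # aggregation behavior is customized below by overriding aggregate()
        self.lin = torch.nn.Linear(in_channels, out_channels)
        self.reset_parameters()

    def reset_parameters(self):
        self.lin.reset_parameters()

    def forward(self, x, edge_index):
        ...  # build the neighbor-sampling weight matrix, then call self.propagate(...)

    def message(self, x_j):
        ...  # the message is simply the neighbor's embedding

    def aggregate(self, inputs, index, weight_mat_j, dim_size=None):
        ...  # weighted mean aggregation using the consistency-based weights
```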

Our custom_Layer is built on top of the MessagePassing base class provided by PyG. It overrides the initialization, reset_parameters, forward, message, and aggregate methods. A more detailed introduction to MessagePassing and its methods is in the Colab, but here are the basics:

High-Level Flow of Custom Message-passing Layer

Forward Function: The forward function takes the input x, which has shape [num_of_nodes, dimension_of_previous_embedding], and the edge index of the graph. It starts the message-passing process by calling the propagate function. We already added the trainable embedding to the node features data.x in the GraphConsis model; now we need to build the neighbor-sampling weight matrix in the forward function:

Forward Function Snippet

Since the forward function receives the previous layer's embedding matrix x, we can calculate the weighting coefficients here and feed the result into the downstream process. First, torch.cdist returns the pairwise 2-norm distances, which give us a consistency score matrix cm of shape [num_of_nodes, num_of_nodes]; entry cm[i][j] stores the consistency score between nodes i and j based on the previous layer's embeddings. The remaining lines simply build the column-stochastic (each column sums to 1) weight matrix from the edge information and the consistency matrix. The final matrix result has shape [num_of_nodes, num_of_nodes]; entry result[i][j] indicates the weight of node i as a neighbor of node j. To put it another way, if we want to know node i's weight as a neighbor in node j's aggregation process, we take the j-th entry of row i of the result matrix.

The forward function then calls the propagate function and obtains the post-aggregation output, which has shape [num_of_nodes, dimension_of_previous_embedding]. Row i contains the aggregated messages from all of i's neighbors. Within the forward function, we can perform additional operations on the aggregated messages, such as applying a linear transformation (think of the second line in our GCN formulation). The final output of the forward function is the computed embedding at this layer.
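Putting this together, a compact sketch of the forward method (inside custom_Layer) could look like the following; the dense masking used to build result here follows the description above, though the original Colab may assemble the weight matrix differently:

```python
def forward(self, x, edge_index):
    # Pairwise 2-norm distances between previous-layer embeddings,
    # turned into consistency scores (closer embeddings -> larger score).
    dist = torch.cdist(x, x, p=2)
    cm = torch.exp(-dist.pow(2))

    # Keep scores only along existing edges, then normalize every column so that
    # result[i][j] is node i's weight as a neighbor in node j's aggregation
    # and each column sums to 1 (column-stochastic).
    mask = torch.zeros_like(cm)
    mask[edge_index[0], edge_index[1]] = 1.0
    cm = cm * mask
    result = cm / cm.sum(dim=0, keepdim=True).clamp(min=1e-12)

    # Kick off message passing, handing the weight matrix along for aggregation.
    out = self.propagate(edge_index, x=(x, x), weight_mat=(result, result))

    # Post-aggregation linear transformation, as in the second line of the GCN formulation.
    return self.lin(out)
```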

Propagate function: The propagate function is called during the forward process. It takes the edge index, a pair of embedding matrices, and any other information we want to pass along. For the pair of embedding matrices, one is interpreted as if each row were a target node, and the other as if each row were a source (neighbor) node.

Calling Propagate Function

In our call to the propagate function, we also pass our weight matrix result. When passing a pair of contents, weight_mat = (result, result), PyG automatically creates two objects, weight_mat_i and weight_mat_j. In weight_mat_i, each row is treated as a target node; in weight_mat_j, each row is treated as a neighbor node. In our case, row i of result holds node i's weights as a neighbor node in all other nodes' aggregation processes, so we treat each row as a neighbor node. This means we will subsequently only use weight_mat_j.

Message Function: The message function is called by the propagate function. It receives an embedding matrix x_j and any other information passed along by propagate. Each row in x_j is assumed to be a neighbor node, and the function performs whatever transformation is necessary so that the output is the message from that neighbor node. In many cases, including our GraphConsis, there is no transformation, so the output of the message function is just its input.

Message Function
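In our sketch, the message function is therefore a one-liner:

```python
def message(self, x_j):
    # x_j has one row per edge, holding the neighbor (source) node's embedding.
    # No transformation is needed; the linear layer is applied after aggregation.
    return x_j
```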

Again, recall the second line of our GCN layer formulation: the message that node u sends as a neighbor node is simply its embedding, since we apply the linear transformation in the forward function after aggregation.

GCN Layer

Aggregate Function: The aggregate function is also called by propagate and performs the aggregation. It receives inputs of shape [num_of_edges_in_graph, dimension_of_previous_layer_embedding] and index of shape [num_of_edges_in_graph]. Each row in inputs is a particular message sent from a neighbor. This means there can be many duplicated rows in inputs, because a particular node A can send messages to many other nodes (A can be the neighbor of many nodes). The value in index indicates the target node index for the corresponding row in inputs; i.e., the value stored at index[i] specifies the target node that the message at inputs[i] is being sent to. In our aggregate function, due to the modification we made to the propagate call, we also receive weight_mat_j.

Aggregation with Neighbor Sampling
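A sketch of this aggregation step is below. We use torch_geometric.utils.scatter (available in recent PyG versions; older versions would import scatter from torch_scatter instead), and because each target node's weights already sum to 1, a weighted scatter-sum gives the weighted mean described below; the original Colab may organize these steps slightly differently:

```python
from torch_geometric.utils import scatter  # older PyG: from torch_scatter import scatter

def aggregate(self, inputs, index, weight_mat_j, dim_size=None):
    # inputs:       [num_edges, emb_dim]   one message per edge
    # index:        [num_edges]            target node of each message
    # weight_mat_j: [num_edges, num_nodes] row e is the weight-row of edge e's source node
    edge_ids = torch.arange(inputs.size(0), device=inputs.device)
    # Weight of this neighbor with respect to its target node: weight_mat_j[e][index[e]].
    w = weight_mat_j[edge_ids, index].unsqueeze(-1)
    # Weighted sum over messages sharing the same target node = weighted mean,
    # since the weights for each target node sum to 1.
    return scatter(inputs * w, index, dim=0, dim_size=dim_size, reduce='sum')
```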

We first construct the neighbor-sampling weights. Similar to inputs, each row in weight_mat_j corresponds to a particular message sent from a neighbor, and index[row_index] indicates the target node of that message. The content of weight_mat_j is the sampling weights: column j of weight_mat_j contains the weights to apply when the center node is j. So for a specific row i, the message is inputs[i], the target node of that message is index[i], and the weight of this message for target node index[i] is stored at weight_mat_j[i][index[i]]. After weighting the inputs accordingly, we perform the mean aggregation using scatter. This concludes the implementation of neighbor sampling on node embeddings.

After 50 epochs of training (no hyperparameter tuning), our model has an accuracy of 57% but recovered 77 fraud labels.

Yay! We made it to the end! What else?

This blog provided a quick overview of PyG's graph ML capabilities, from using pre-built models all the way to fully customized models. Below is a quick summary of the performance of our three models. Note that the numbers are mostly for illustrative purposes; very little hyperparameter tuning and architecture exploration was done.

Model Accuracy Summary

Thanks for sticking with us this far. We hope that this blog and the accompanying Colab will serve as a starting point to jumpstart your GraphML models. Have fun exploring!

References:

  • PyG (PyTorch Geometric) official website
  • The pre-processed YelpChi data matrix can be found in the GitHub repository of the GraphConsis paper.
  • The 2-layer GNN figure is copied from the Stanford CS224W lecture notes.
  • The GraphConsis paper can be found here.
  • All other images are self-constructed or obtained from open-license images at unsplash.com.
  • Implementation Colab
