Graph Machine Learning for Credit Card Fraud Analysis

Great Learning
Sep 23, 2021 · 17 min read

Author: Kovendhan Venugopal

Graphs are everywhere

Right from Social Networks to Protein Molecular Structures to Financial Transactions, Graphs are everywhere.

Analysis and detection of fraudulent transactions is a herculean task, given the complexity and number of factors involved. In this post, we will look at how Graphs can help us to accomplish this complex task. It is assumed that the reader is aware of Basic Machine Learning concepts.

The write-up is structured into the following topics:

  • An Introduction to Graph Analytics
  • What is Graph Machine Learning?
  • Hands-on Credit Card Fraud Analysis using Graph Machine Learning
  • About the Dataset
  • Exploratory Data Analysis
  • Creating Graph Nodes and Edges from Dataframe
  • Network Analysis
  • Community Detection for Fraud vs Benign Nodes
  • Model Building — Supervised Learning
  • Model Building — Unsupervised Learning
  • Summary

An Introduction to Graph Analytics:

A Graph is a special kind of mathematical structure, which is used to represent relationships between entities. A Graph is made up of nodes and edges where the nodes represent entities and edges represent relationships between the entities.

Before we take the plunge, let us look at the fundamental concepts of Graphs.

Undirected Graph:

A simple undirected graph G is defined as a pair

G = (V,E),

where,

  • V = {V1, V2,…Vn} is a set of nodes (also called vertices) and
  • E = {{Vk, Vw}, ….. {Vi, Vj}} is a set of edges (also called links), where each edge is an unordered pair of nodes belonging to V

The order of a graph is the number of its vertices (nodes).

The size of a graph is the number of its edges (links).

The degree of a vertex is the number of edges that are adjacent to it.

The neighborhood graph (also known as ego graph) of a vertex v in a graph G is a subgraph of G, composed of the vertices adjacent to v and all edges connecting vertices adjacent to v.

Digraph:

A Digraph (directed graph) G is defined as a pair

G = (V,E)

where,

V = {V1,V2,….Vn} is a set of nodes and

E = {(Vk, Vw), …. (Vi, Vj)} is a set of ordered pairs representing the connection between two nodes belonging to V.

Since each element of E is an ordered pair, it enforces the direction of the connection.

k-partite Graphs:

A bipartite, tripartite, or in general k-partite graph is one in which the vertices can be partitioned into two, three, or in general k disjoint sets of nodes, such that edges only connect nodes belonging to different sets.

What is Graph Machine Learning?

Machine Learning is a subset of Artificial Intelligence (AI) that provides machines the ability to learn and improve from the input data.

The difference between traditional programming and a Machine Learning system is that the latter takes both the data and the expected results as input and learns a set of rules that closely model the problem as a mathematical function. In short, Machine Learning models are mathematical functions learned from the input data.

With the increase in complexity of the problems at hand, it becomes imperative to leverage the capabilities of Machine Learning models. This becomes more pronounced when it comes to complex data structures such as Graphs. Machine Learning for Graphs comes under a broader umbrella of ‘Representational Learning’.

Now that we have got a brief gist of Graph Machine Learning, let us look at how Graph Machine Learning can help us to do Credit Card Fraud Analysis.

Hands-on Credit Card Fraud Analysis using Graph Machine Learning

For demonstration purposes, we will be working with an open-source dataset available on Kaggle.

About the Dataset

The dataset is made up of simulated credit card transactions for the period 01-Jan-2019 to 31-Dec-2020. It contains both legitimate and fraudulent transactions of 1000 customers with a pool of 800 merchants. The dataset was generated using the Sparkov Data Generation tool (available on GitHub), created by Brandon Harris.

Dataset: https://www.kaggle.com/kartik2112/fraud-detection

Data Dictionary:

index — Row ID

cc_num — Credit Card Number of the customer

merchant — Merchant Name

amt — Transaction Amount (in USD)

is_fraud — Target Variable. (0=genuine, 1=fraudulent)

Library Requirements:

networkx ==2.5

scikit-learn==0.24.0

pandas==1.1.3

node2vec==0.3.3

numpy==1.19.2

communities==2.2.0

Objective — Fraudulent Transactions as Edge Classification Problem:

We will be representing the transactions as graphs with -

  • Customers & Merchants as Nodes
  • Transactions as Edges
  • Weight of the Edges in line with Amount of Transaction carried out
  • Label of the Edges as ‘Genuine’ or ‘Fraudulent’.

Hence, our objective of detecting fraudulent credit card transactions can be translated into an 'Edge Classification' task in Graph Machine Learning parlance.

Exploratory Data Analysis

Step 1: Load the libraries

Step 2: Load the dataset

From the original dataset, we have selected 20% of the genuine transactions and all of the fraudulent transactions for demonstration purposes.
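A minimal sketch of these two steps is given below, assuming the Kaggle CSV has been downloaded locally; the filename, sampling fraction, and random seed are illustrative assumptions, not the article's exact settings.

import pandas as pd
import networkx as nx

# Step 1 & 2: load the libraries and the dataset (filename is an assumption)
df = pd.read_csv("fraudTrain.csv")

# Keep all fraudulent transactions and a 20% sample of the genuine ones
genuine = df[df["is_fraud"] == 0].sample(frac=0.20, random_state=42)
fraud = df[df["is_fraud"] == 1]
df = pd.concat([genuine, fraud]).reset_index(drop=True)

print(df.shape)                         # number of records and features
print(df["is_fraud"].value_counts())    # target variable distribution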

Step 3: Data Intuition

The dataset has 23 features and 265,340 records. No null values are present in the dataset.

Step 4: Target Variable Distribution

Out of the total 265,340 records, 257,834 are genuine and 7506 are fraudulent. Hence, the dataset exhibits class imbalance with 2.83% of fraudulent transactions.

The dataset can be converted into a graph network using the networkx python library.

Data Preparation — Creating Graph Nodes and Edges from Dataframe

As discussed in the introduction section, there are multiple ways to represent a graph structure namely —

  • Bipartite
  • Tripartite
  • k-th partite

Since all the transactions in our Credit Card dataset are temporal in nature (time-based), multiple transactions can occur over time between the same customer and merchant.

We will be representing our network as both bipartite and tripartite representations and later finalize the better representation structure based on graph metrics (more in subsequent sections). For our representations, we will be collapsing multiple transactions into a single edge for demonstration purposes.

G = (V, E, w)

  • where V = Vc ∪ Vm, each node c ∈ Vc represents a customer, and
  • each node m ∈ Vm represents a merchant.

For each of the graph edges, we will assign a positive weight corresponding to the size of the transaction amount carried out.

The following functions will convert the dataframe into bi & tripartite graphs.

  • The networkx library has the inbuilt function from_edgelist, which builds a graph from a list of (from, to) node pairs derived from the pandas dataframe.
  • In order to set the node attributes, we will be using the inbuilt set_node_attributes() function.
  • A node_id is mapped to each customer and merchant.
  • Edge attributes: weight represents the total amount of the transactions between two nodes; label represents the target variable, i.e., whether the transaction is genuine or fraudulent.

The bipartite and tripartite graphs can be built using the below code snippet. The syntax remains the same except for the usage of nx.Graph() for Undirected graphs and nx.DiGraph() for Directed graphs.
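The sketch below illustrates the bipartite construction under the assumptions above (column names from the data dictionary, one collapsed edge per customer-merchant pair); a build_graph_tripartite function would be analogous, with an extra node created for every transaction.

def build_graph_bipartite(df_input, graph_type=nx.Graph):
    df = df_input.copy()
    # Map every customer (cc_num) and merchant to an integer node_id
    mapping = {x: node_id for node_id, x in enumerate(
        set(df["cc_num"].tolist() + df["merchant"].tolist()))}
    df["from"] = df["cc_num"].map(mapping)
    df["to"] = df["merchant"].map(mapping)

    # Collapse multiple transactions between the same pair into a single edge
    agg = (df.groupby(["from", "to"])
             .agg(amt=("amt", "sum"), is_fraud=("is_fraud", "max"))
             .reset_index())

    G = nx.from_edgelist(agg[["from", "to"]].values, create_using=graph_type)
    # Edge attributes: total transaction amount (weight) and fraud label
    nx.set_edge_attributes(
        G, {(r["from"], r["to"]): r["amt"] for _, r in agg.iterrows()}, "weight")
    nx.set_edge_attributes(
        G, {(r["from"], r["to"]): r["is_fraud"] for _, r in agg.iterrows()}, "label")
    return G

G_bi = build_graph_bipartite(df, nx.Graph)        # undirected bipartite graph
G_bi_dir = build_graph_bipartite(df, nx.DiGraph)  # directed variant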

Now that we have built both Bipartite and Tripartite graphs, how can we validate the actual structure of our graph? Again, the built-in function of the networkx library called bipartite comes to our rescue.
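A quick check, assuming G_bi from the construction sketch above:

from networkx.algorithms import bipartite

print(bipartite.is_bipartite(G_bi))   # expected: True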

In line with our assumption, our graph is indeed a bipartite one as the is_bipartite() returns True.

Network Topology — Statistical Analysis

Now that we have built our graphs, let us get into the Network Topology analysis using statistical methods for graphs.

We will start by getting to know the Graph attributes for the bipartite and tripartite networks. This can be done using the nx.info() function in networkx.
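For example (a sketch, assuming G_bi and G_tri hold the bipartite and tripartite graphs built earlier; nx.info() is available in networkx 2.5):

print(nx.info(G_bi))   # number of nodes, edges and average degree (bipartite)
print(nx.info(G_tri))  # same summary for the tripartite graph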

As expected, the two topologies differ: the bipartite graph has 1676 nodes (customers and merchants) and 201,725 edges (aggregated transactions) between customers and merchants, whereas the tripartite graph has 267,016 nodes (customers, transactions, and merchants) and 530,680 edges (two per transaction: customer-to-transaction and transaction-to-merchant).

The average degree for the bipartite graph is 240.72, which means that each node has approximately 240 adjacent edges. The average degree for the tripartite graph is only 3.97; the degree is lower mainly because, in addition to customers and merchants, we also include an intermediary transaction node for every transaction.

In other words, since the connections between merchants and customers are split by the presence of transaction nodes, the average degree is on the lower side.

Nodes Degree Distribution:

In order to gain deeper intuition about our graph structures, we start with plotting the distribution of the node degrees for both bipartite and tripartite graph structures.

The following code generates the histogram of node degrees using the inbuilt function nx.degree().
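A possible sketch of this plot for the bipartite graph (matplotlib is assumed; the tripartite plot is identical with G_tri):

import matplotlib.pyplot as plt

degrees = [d for _, d in nx.degree(G_bi)]   # degree of every node
plt.hist(degrees, bins=50)
plt.xlabel("Node degree")
plt.ylabel("Number of nodes")
plt.title("Bipartite — node degree distribution")
plt.show()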

Bipartite Nodes Degree Distribution:

The bipartite graph has a more variegated distribution, with a peak around degree 300.

Tripartite Nodes Degree Distribution:

The tripartite graph has a peak for degree ‘2’, while the rest of the degrees exhibit a similar distribution as that of the bipartite graph.

If the bipartite graphs are made by connections from customer to merchant, the tripartite graphs are made by connections from customer to transaction to merchant. Hence, the bin representing degree ‘2’ is approximately equal to the number of transaction nodes present in the dataset.

Edge Weight Distribution:

Similar to the node-level degree distribution, let us also plot the edge weight distribution for both bipartite and tripartite graphs.
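A sketch of the edge-weight histogram (the same idea applies to the tripartite graph):

weights = [w for _, _, w in G_bi.edges(data="weight")]   # aggregated transaction amounts
plt.hist(weights, bins=50)
plt.xlabel("Edge weight (total transaction amount)")
plt.ylabel("Number of edges")
plt.title("Bipartite — edge weight distribution")
plt.show()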

Bipartite — Edges Weight Distribution; Image by Author

Tripartite — Edges Weight Distribution; Image by Author

Due to the aggregation of all transactions into single edges in the bipartite graph, its distribution is slightly shifted to the right (right-skewed) compared to the tripartite graph, where the individual transaction nodes are preserved.

Information Extraction from Graphs:

We started with a dataframe and used the networkx library to convert it into a graph. Next, we will extract graph-specific metrics from it, such as node degree, Betweenness Centrality, Closeness Centrality, Eigenvector Centrality, etc.

Degree Centrality:

Degree centrality is defined as the number of links incident upon a node (i.e., the number of ties that a node has). It is calculated using the inbuilt function nx.degree_centrality().

Betweenness Centrality:

It measures the number of shortest paths that pass through a given node, providing an intuition about how ‘central’ that node is for message passing within the network.

It is calculated using the inbuilt function nx.betweenness_centrality().

Closeness Centrality:

Closeness centrality is a way of detecting nodes that are able to spread information very efficiently through a graph. The closeness centrality of a node measures its average farness (inverse distance) to all other nodes. Nodes with a high closeness score have the shortest distances to all other nodes. In networkx, it is calculated using the inbuilt nx.closeness_centrality().

Eigenvector Centrality:

The eigenvector centrality captures the centrality for a node based on the centrality of its neighbors. This means that a node with a high score is a node that is connected to other influential nodes with high scores as well.

The eigenvector centrality x_i for node i is

x_i = (1/λ) Σ_j A_ij x_j

where A is the adjacency matrix of graph G with eigenvalue λ. By virtue of the Perron–Frobenius theorem, there is a unique and positive solution if λ is the largest eigenvalue associated with the eigenvector of the adjacency matrix A. In networkx, it is calculated using the inbuilt nx.eigenvector_centrality().

Assortativity

Assortativity, or assortative mixing is a preference for a network’s nodes to attach to others that are similar in some way. Though the specific measure of similarity may vary, network theorists often examine assortativity in terms of a node’s degree.
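The metrics discussed above can all be computed with the inbuilt networkx functions; a sketch on the bipartite graph (note that betweenness centrality can be slow on a graph of this size):

degree_cent = nx.degree_centrality(G_bi)
betweenness_cent = nx.betweenness_centrality(G_bi)     # computationally expensive
closeness_cent = nx.closeness_centrality(G_bi)
eigenvector_cent = nx.eigenvector_centrality(G_bi)
assortativity = nx.degree_pearson_correlation_coefficient(G_bi)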

Community Detection & Fraudulent Edges Ratio Analysis:

Next, we will perform Community Detection to understand the interaction among the transactions in our Credit Card dataset. We will use the community library (python-louvain) for this, which can be installed using the below pip command.

pip install python-louvain

First, we will import the library, then the communities are extracted based on the edge weights using community.best_partition command.
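A minimal sketch of this step (the python-louvain package is imported as community):

import community

# Louvain community detection on the bipartite graph, using edge weights
partition = community.best_partition(G_bi, weight="weight")
community_sizes = pd.Series(partition).value_counts()   # community id -> number of nodes
print(len(community_sizes), "communities detected")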

Now, let us look at the number of communities detected by the community.best_partition method for the bipartite graph. There are a total of 10 communities, sorted in the order of their complexity. Community 4 has the highest weight, whereas Community 3 has the lowest.

Communities List from the Graph

Let us plot the communities as a histogram to understand the frequency distribution across different bins. We can see that many communities have fewer than ~100 nodes, along with a sparse tail of a few communities with over 300 and 500 nodes. Communities with more than 500 nodes can be considered anomalies in our dataset.

For each of the communities detected by the algorithm, we can compute the ratio of fraudulent to benign transactions. This will help us identify specific sub-graphs in the dataset with a higher concentration of fraudulent transactions.

The following code generates a node-induced subgraph using the nodes present in a particular community. Since, in the bipartite graph, all the transactions are denoted by edges between the nodes, we compute the fraudulent ratio percentage as fraud_edges / tmp.number_of_edges().
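A sketch of that computation, assuming the partition dictionary from the previous step:

for comm_id in set(partition.values()):
    nodes = [n for n, c in partition.items() if c == comm_id]
    tmp = G_bi.subgraph(nodes)                                   # node-induced subgraph
    fraud_edges = sum(nx.get_edge_attributes(tmp, "label").values())
    ratio = 100 * fraud_edges / max(tmp.number_of_edges(), 1)    # guard against empty communities
    print(f"Community {comm_id}: {ratio:.2f}% fraudulent edges")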

The output of the above code is given as follows. Community 1 has the highest fraudulent transaction percentage with 26.50%, closely followed by Community 9 and Community 3. Community 4 has the lowest, with just 1.54% of its transactions being fraudulent.

The above fraudulent percentages are provided at the overall community level. We can go one step further and plot the node-induced subgraphs for a particular community. In the code below, gId indicates the Community ID.

  • Benign edges → Represented by Green
  • Fraudulent edges → Represented by Red
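A plotting sketch along these lines (node positions and sizes are arbitrary choices for illustration):

import matplotlib.pyplot as plt

gId = 1                                                   # Community ID to visualise
nodes = [n for n, c in partition.items() if c == gId]
tmp = G_bi.subgraph(nodes)
edge_colors = ["red" if tmp[u][v]["label"] == 1 else "green" for u, v in tmp.edges()]
nx.draw_networkx(tmp, pos=nx.spring_layout(tmp), node_size=10,
                 with_labels=False, edge_color=edge_colors)
plt.show()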

Community ID (gID) = 1

Though Community 1 had the highest percentage of fraudulent transactions, looking at the graph we can see that its transaction volume is very low. Now let us plot Community 9, which had the second-highest percentage of fraudulent transactions.

Community ID (gID) = 9

In Community 9, we can see that both the volume and the number of fraudulent transactions are much higher compared to Community 1. Finally, let us visualize Community 4, which had the lowest percentage of fraudulent transactions.

Community (gID) = 4

In Community 4, though the transactions (edges) are densely connected, only 1.54% of them are fraudulent.

Key takeaways from the Community Detection Exercise:

  • Community Detection is similar to the Clustering method in Unsupervised Learning.
  • Ten communities are identified with various levels of density
  • All the communities are evaluated for the Ratio of Fraud to Benign transaction edges and shown in plots.
  • We detected a few communities (clusters) with a high ratio of benign transactions and a few with a high ratio of fraudulent transactions.

Now that we have seen enough Exploratory Data Analysis on the dataset, let us get into the main part of building a predictive model using Supervised Learning.

Model Building — Supervised Learning

Detecting fraudulent transactions is a Binary Classification task, wherein we classify each edge (transaction) as benign or fraudulent.

The Machine Learning pipeline will consist of the below tasks:

  • Handling class imbalance
  • Using Graph Embedding to create feature vectors for each of the edges
  • Predictive Modeling

Handling Class Imbalance:

Since our dataset deals with fraudulent vs benign transactions, it is only natural for the distribution to be skewed in favor of benign transactions. To ensure fairness in our Machine Learning model building process, it is pertinent to balance both classes via resampling methods.

We will be using the simple resampling method from scikit-learn library.

The code is straightforward: we take the benign and fraudulent transaction subsets, represented by df.is_fraud == 0 and df.is_fraud == 1 respectively. Then we keep the fraudulent transactions untouched and downsample the benign class. Finally, a bipartite graph is built on the downsampled dataset, which now has both classes in equal proportions.
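A sketch of the downsampling step using scikit-learn's resample (the random seed is an assumption):

from sklearn.utils import resample

df_benign = df[df.is_fraud == 0]
df_fraud = df[df.is_fraud == 1]

# Downsample the benign class to the size of the fraudulent class
df_benign_down = resample(df_benign, replace=False,
                          n_samples=len(df_fraud), random_state=42)

df_balanced = pd.concat([df_benign_down, df_fraud])
G_down = build_graph_bipartite(df_balanced, nx.Graph)   # balanced bipartite graph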

Train Test Split:

The train-test split is a method for evaluating the performance of a Supervised Machine Learning model by splitting the original dataset into Train and Test splits. The model will be trained on the train set and evaluated on the unseen test set. We can use the inbuilt scikit-learn function train_test_split to accomplish this. The test_size=0.20 parameter indicates a train test ratio of 80:20.

Once the train and test datasets are created, we create train_graph by extracting the sub-graph corresponding to the train edges.
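A sketch of the split and of the training sub-graph extraction (edge_subgraph keeps only the training edges and their endpoint nodes):

from sklearn.model_selection import train_test_split

edges = list(G_down.edges())
labels = [G_down[u][v]["label"] for u, v in edges]

train_edges, test_edges, train_labels, test_labels = train_test_split(
    edges, labels, test_size=0.20, random_state=42)

train_graph = G_down.edge_subgraph(train_edges).copy()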

Graph Embeddings:

Machine Learning models need a numeric representation of the input data in a vector format in order to learn the patterns present in the data.

For numeric tabular datasets, this is achieved using the values already available in the dataset. In the case of text datasets, the representation is done via Word Embeddings such as Word2Vec, GloVe, FastText, ELMo, etc.

Since our dataset is in a graph structure, we cannot feed it to a model directly. A numeric representation of the graph for Machine Learning models is achieved using the concept of Graph Embeddings.

There are various ways in which a graph can be represented as numeric vectors, at multiple constituent levels of the graph: node level, edge level, or whole-graph level. The corresponding embedding algorithms are node2vec, edge2vec, and graph2vec.

We will be using node2vec embedding for our dataset.


For more details on the node2vec algorithm, the research paper below can be used as a reference.

Node2Vec paper: https://arxiv.org/pdf/1607.00653.pdf
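A sketch of fitting node2vec on the training sub-graph; the dimensions, walk length, number of walks, and window size below are illustrative values, not the article's exact settings.

from node2vec import Node2Vec

node2vec_train = Node2Vec(train_graph, dimensions=64, walk_length=10,
                          num_walks=20, weight_key="weight", workers=1)
model_train = node2vec_train.fit(window=10)   # returns a gensim Word2Vec model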

The above piece of code generates random walks on the graph structure and fits the node2vec model on the training data. The term 'walk' refers to traversing from one node to another across the entire graph; we can imagine it as a person walking across the graph, learning each node and the edges between the nodes.

Now that our graph is converted into a numerical representation, we can start building Machine Learning models and evaluate them on the test set.

The node2vec results are used to build the edge embeddings, which form the final feature space for our classification algorithm. The following edge embedders are used in our model:

  • Hadamard Embedder
  • Average Embedder
  • Weighted L1 Embedder
  • Weighted L2 Embedder

The mathematical operators governing the edge embedding methods are given below. Explaining each of the edge embedders in detail is beyond the scope of this article.
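For reference, the binary operators defined in the node2vec paper are as follows, where f(u) denotes the embedding vector of node u and the i-th component of the edge embedding for edge (u, v) is:

  • Hadamard: f_i(u) * f_i(v)
  • Average: (f_i(u) + f_i(v)) / 2
  • Weighted L1: |f_i(u) − f_i(v)|
  • Weighted L2: |f_i(u) − f_i(v)|²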

Source: node2vec original paper https://arxiv.org/pdf/1607.00653.pdf

Supervised Model Building:

We will be using the famous Random Forest Classification algorithm, which will make use of the different edge embedding representations such as Hadamard, Average, L1, L2, etc.

  • We take the trained node vectors of the training dataset via model_train.wv, pass them to each of the Edge Embedding classes, and store the results in the embeddings_train list, as shown in the sketch after this list.
  • Then, we prepare train_embeddings and test_embeddings from the embeddings_train list.
  • Then a Random Forest Classifier is initialized with 1000 trees (a hyperparameter represented by n_estimators) and fitted on the generated train_embeddings.
  • Once the training is complete, the classifier is evaluated against test_embeddings.
  • Classification evaluation metrics such as Precision, Recall, and F1-Score are printed for each of the Edge Embedding classes.
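A sketch of this loop, assuming model_train, train_edges, test_edges, and the corresponding labels from the previous steps (and that every test node also appears in the training sub-graph):

from node2vec.edges import (HadamardEmbedder, AverageEmbedder,
                            WeightedL1Embedder, WeightedL2Embedder)
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

edge_embedders = [HadamardEmbedder, AverageEmbedder, WeightedL1Embedder, WeightedL2Embedder]

for embedder_class in edge_embedders:
    embeddings = embedder_class(keyed_vectors=model_train.wv)
    # Edge feature vectors: node names are stored as strings inside node2vec
    train_embeddings = [embeddings[str(u), str(v)] for u, v in train_edges]
    test_embeddings = [embeddings[str(u), str(v)] for u, v in test_edges]

    rf = RandomForestClassifier(n_estimators=1000, random_state=42)
    rf.fit(train_embeddings, train_labels)
    preds = rf.predict(test_embeddings)

    print(embedder_class.__name__,
          "Precision:", metrics.precision_score(test_labels, preds),
          "Recall:", metrics.recall_score(test_labels, preds),
          "F1:", metrics.f1_score(test_labels, preds))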

The output of the above code is as follows -

Out of all the edge embedding methods used, the 'Average' Embedder gave the best evaluation metrics with a 71.6% F1-Score. We could improve the performance further through hyperparameter tuning, but I will leave that to the readers as a self-exercise.

So, we have converted the network into a graph representation, trained a model using a Supervised approach, and evaluated it against the unseen test dataset. Now, let us move on to the Unsupervised Learning way of building a Machine Learning model.

Model Building — Unsupervised Learning

The major difference between the Supervised and Unsupervised Learning approaches is that no ground-truth labels are provided in the case of Unsupervised Learning. We also do not split the dataset into train and test sets.

We will use the entire dataset and try to identify the inherent clusters present in the data using the K-Means algorithm.

  • The Node2vec algorithm will be fitted on the entire downsampled dataset. This will convert the dataset into Graph Embeddings representation.
  • Next, we will import KMeans from Scikit-Learn
  • We will store the true labels aka Ground Truth in true_labels list.
  • Then for each of the Edge Embedding methods, we will run KMeans.
  • Identified clusters will then be evaluated using the Adjusted Mutual Information Score, Homogeneity Score, Completeness Score, V-Measure Score, etc., as shown in the sketch after this list.
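A sketch of the unsupervised pipeline, reusing the same four edge embedders as before; n_clusters=2 reflects the two underlying classes, and the node2vec settings remain illustrative:

from sklearn.cluster import KMeans
from sklearn import metrics

# Fit node2vec on the entire downsampled graph this time
node2vec_full = Node2Vec(G_down, dimensions=64, walk_length=10,
                         num_walks=20, weight_key="weight", workers=1)
model_full = node2vec_full.fit(window=10)

edges = list(G_down.edges())
true_labels = [G_down[u][v]["label"] for u, v in edges]

for embedder_class in edge_embedders:
    embeddings = embedder_class(keyed_vectors=model_full.wv)
    edge_vectors = [embeddings[str(u), str(v)] for u, v in edges]

    kmeans = KMeans(n_clusters=2, random_state=42).fit(edge_vectors)
    print(embedder_class.__name__,
          "AMI:", metrics.adjusted_mutual_info_score(true_labels, kmeans.labels_),
          "Homogeneity:", metrics.homogeneity_score(true_labels, kmeans.labels_),
          "Completeness:", metrics.completeness_score(true_labels, kmeans.labels_),
          "V-measure:", metrics.v_measure_score(true_labels, kmeans.labels_))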

The output of the above piece of code is as follows —

Based on the K-Means clustering, we can see that the Weighted L1 Embedder has performed better than all the other embedding methods.

Summary

In this article, we have seen how a Credit Card Fraud Detection problem can be described as a graph problem. To summarize, we started with a credit card transactions dataset in tabular form, converted it into graph nodes and edges using networkx library, then used various statistical visualization techniques to gain data intuition, then transformed the graph structure into embedding vector-based representation, and finally used both supervised and unsupervised learning algorithms to build Machine Learning models.

