CS 224W: Unveiling the Patterns in Random Graphs

Kanu Grover — Wed, 06 Dec 2023 22:13:06 GMT

Authors: Shayan Talaei, Kamyar Rajabalifardi, Kanu Grover

Colab Notebook

Introduction

In this blog post, we discuss an application of graph machine learning techniques in random graph detection. Our primary focus is on developing a Graph Neural Network (GNN) that can accurately classify different types of random graphs. In particular, we are keen on understanding three families of random graphs — Erdős–Rényi (ER), Barabasi-Albert (BA), and Stochastic Block Models (SBM). Each of these families exhibits unique characteristics, making them ideal candidates for our study.

A high-level outline of our project is as follows:

Overview of Random Graphs: In this section, we’ll describe the characteristics of the three random graph families by delving into the mathematics behind each one. Additionally, we will explain the PyG modules tailored for generating samples of graphs from each family. An important element in our discussion will be the exploration of various parameter choices, which play a crucial role in shaping the unique patterns that each family exhibits.
Modeling: In this section, we will tackle a two-fold modeling problem. We will first embark on a classification task that aims to predict the family that some particular graph is sampled from. After exploring various architecture choices for this model, we will introduce a variation that performs regression to estimate the unknown parameters of that family. We’ll analyze some results from our work, describe some pitfalls, and propose additional questions for further study.

Without further ado, let’s get into it!

Overview of Random Graphs

Random graphs exist everywhere in the world around us — in everything from epidemiology to social network analysis. Many real-world phenomena can be modeled using random graphs, and understanding the underlying structure of these networks can provide invaluable insights into their behavior. Listed below are a few examples of where random graphs are especially useful in modeling real-world events.

Epidemiology: Random graphs are useful in understanding and modeling stochastic processes that happen over a network, as with disease transmission during epidemics. By simulating transmission over a network, researchers can determine the effectiveness of various treatments and interventions on a population.
Social Networks: Various types of social networks can be modeled as random graphs. By modeling nodes as people and edges as connections/friendships, random graphs can give insight into how social networks form, multiply, and spread information.
Neural Subgraph Matching: Distinguishing between the structured patterns and randomness in real-world graphs has helped researchers measure the presence of significant motifs in the area of neural subgraph matching.

Random graphs are instances of random graph families, which are broadly defined in the following sections. Each family is uniquely parameterized by a different distribution — this allows us to generate a rich variety of random graphs that closely resemble real-world networks.

In the following sections, we’ll describe properties of the three families of random graphs — Erdős–Rényi, Barabasi-Albert, Stochastic Block Models — and explain the various PyG functionalities surrounding them.

Erdős–Rényi

Erdős–Rényi (ER) graphs are perhaps the simplest and most intuitive class of random graphs. ER graphs are defined by two required parameters:

n (int): the total number of nodes in the graph.
p (float): the probability of an edge between any two nodes.

Due to their simplicity, ER graphs are unable to model many complex, real-world relationships, but they are important in the field of graph theory. They often serve as a benchmark for the study of more complex graph structures and random graphs.

Implementation

PyG provides a simple implementation of an ER Graph Generator.

# Below is PyG's implementation of the ER Graph Generator

from torch_geometric.data import Data
from torch_geometric.datasets.graph_generator import GraphGenerator
from torch_geometric.utils import erdos_renyi_graph

class ERGraph(GraphGenerator):
    r"""Generates random Erdos-Renyi (ER) graphs.
    See :meth:`~torch_geometric.utils.erdos_renyi_graph` for more information.

    Args:
        num_nodes (int): The number of nodes.
        edge_prob (float): Probability of an edge.
    """
    def __init__(self, num_nodes: int, edge_prob: float):
        super().__init__()
        self.num_nodes = num_nodes
        self.edge_prob = edge_prob

    def __call__(self) -> Data:
        # erdos_renyi_graph() constructs the edge index of the random graph
        # See below for a description of the erdos_renyi_graph function
        edge_index = erdos_renyi_graph(self.num_nodes, self.edge_prob)
        return Data(num_nodes=self.num_nodes, edge_index=edge_index)

    def __repr__(self) -> str:
        return (f'{self.__class__.__name__}(num_nodes={self.num_nodes}, '
                f'edge_prob={self.edge_prob})')

We can construct an ER graph (n=6, p=0.6) as follows:

from torch_geometric.datasets.graph_generator import ERGraph
er_graph_generator = ERGraph(6,0.6)

Here, the er_graph_generator variable simply holds an instance of ERGraph; to produce a graph sampled from the ER distribution specified by n and p, we call the instance itself, which invokes its __call__ method, as follows.

er_graph = er_graph_generator()
print(er_graph.num_nodes, er_graph.edge_index)

In the code above, er_graph is an object of type Data, with attributes num_nodes and edge_index that specify the number of nodes and the edge connectivity of the randomly sampled graph, respectively.

Here, we’ve defined a helper function to plot the ER graph.

import matplotlib.pyplot as plt
import networkx as nx
from torch_geometric.data import Data
from torch_geometric.utils import to_networkx

def plot_graph(graph: Data) -> None:
  G = to_networkx(graph, to_undirected=True)
  nx.draw(G, with_labels=True)
  plt.show()

plot_graph(er_graph)

ER (n=6, p=0.6) Graph

As shown below, if we increase the edge probability parameter p to 0.9, we observe a denser graph. A larger p is especially useful for modeling highly interconnected graphs.

ER(n=6, p=0.9) Graph

Earlier, we glossed over how PyG actually generates an ER graph. It does so by calling the erdos_renyi_graph function inside the __call__ method of the ERGraph class. The function is defined in torch_geometric.utils.random.erdos_renyi_graph. At a high level, PyG first enumerates every possible edge of a fully-connected graph, and then drops each edge with probability 1-p (equivalently, keeps it with probability p). For brevity, we’ve omitted further details.
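The enumerate-then-mask idea can be sketched in a few lines of plain Python. This is an illustrative sketch rather than PyG's code: the helper name sample_er_edges and its rng argument are our own assumptions, and it returns a list of undirected node pairs instead of an edge_index tensor.

```python
import itertools
import random

def sample_er_edges(num_nodes, edge_prob, rng=None):
    """Hypothetical helper: sample the undirected edges of an ER graph."""
    rng = rng or random.Random()
    # Enumerate every unordered node pair (the fully-connected candidate set)
    candidates = itertools.combinations(range(num_nodes), 2)
    # Keep each candidate independently with probability edge_prob
    return [pair for pair in candidates if rng.random() < edge_prob]

edges = sample_er_edges(6, 0.6, rng=random.Random(0))
print(len(edges), "of", 6 * 5 // 2, "possible edges kept")
```

On average, a fraction p of the n(n-1)/2 candidate pairs survives, which is why raising p produces a denser graph.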

Barabasi-Albert

Barabasi–Albert (BA) graphs are a second, more complex family of graphs that model various real-world activities, including network growth processes. This family of graphs incorporates two important phenomena observed in many real-world graphs: growth and preferential attachment.

Growth is the ability of a graph to enlarge over time (just as many social networks grow over time).
Preferential Attachment describes how new nodes in the graph tend to connect with existing nodes that already have a high degree, analogous to the popular saying “the rich get richer”. More specifically, the probability of forming an edge with an existing node is proportional to that node’s degree. As such, BA graphs tend to have a few high-degree “hub” nodes, while most nodes have far fewer connections.
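The two ingredients above can be sketched in plain Python. This is a textbook-style illustration under our own assumptions, not PyG's implementation: the helper name preferential_attachment_edges, the star-shaped seed graph, and the rng argument are all hypothetical choices made for the example.

```python
import random
from collections import Counter

def preferential_attachment_edges(num_nodes, num_edges, rng=None):
    """Hypothetical helper: each new node attaches to num_edges distinct
    existing nodes, chosen with probability proportional to their degree.
    Assumes 1 <= num_edges < num_nodes."""
    rng = rng or random.Random()
    # Seed graph (our assumption): a star on the first num_edges + 1 nodes
    edges = [(0, v) for v in range(1, num_edges + 1)]
    # Every node appears here once per incident edge, so a uniform draw
    # from this list is a degree-proportional draw over nodes
    endpoints = [node for edge in edges for node in edge]
    # Growth: nodes arrive one at a time
    for new_node in range(num_edges + 1, num_nodes):
        targets = set()
        while len(targets) < num_edges:
            targets.add(rng.choice(endpoints))
        for target in sorted(targets):
            edges.append((target, new_node))
            endpoints.extend([target, new_node])
    return edges

edges = preferential_attachment_edges(50, 2, rng=random.Random(0))
degree = Counter(node for edge in edges for node in edge)
print("largest degrees:", sorted(degree.values(), reverse=True)[:5])
```

Printing the largest degrees next to the typical ones is a quick way to see the hub structure emerge.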

By nature of their interconnectivity, BA graphs are better at modeling some real-world events than ER graphs. A few examples are listed below.

Citation Networks: Citation networks are often described as “scale-free”, meaning their degree distribution roughly follows a power law: most nodes have a low degree, while a few hub nodes have a very high degree. Citation networks also tend to display preferential attachment; for example, new papers are likely to cite results from existing, well-known papers.
Internet Networks: Hub nodes in BA graphs are similar to high-traffic websites and social media networks that receive many regular visits. Newer websites are more likely connected to hub nodes over other smaller websites (i.e. there are more outgoing links from a hub node like Google to a smaller website).

Now that we understand BA graphs well, let’s see how we can leverage PyG to randomly generate them!

Implementation

PyG’s implementation of BA graphs is quite similar to that of ER graphs, albeit with different parameters. BA graphs are defined by two required parameters:

n (int): the total number of nodes in the graph.
e (int): the number of edges a new node forms with existing nodes.

# Below is PyG's implementation of the BA Random Graph Generator

from torch_geometric.utils import barabasi_albert_graph
class BAGraph(GraphGenerator):
    r"""Generates random Barabasi-Albert (BA) graphs.
    See :meth:`~torch_geometric.utils.barabasi_albert_graph` for more
    information.

    Args:
        num_nodes (int): The number of nodes.
        num_edges (int): The number of edges from a new node to existing nodes.
    """
    def __init__(self, num_nodes: int, num_edges: int):
        super().__init__()
        self.num_nodes = num_nodes
        self.num_edges = num_edges

    def __call__(self) -> Data:
        edge_index = barabasi_albert_graph(self.num_nodes, self.num_edges)
        return Data(num_nodes=self.num_nodes, edge_index=edge_index)

    def __repr__(self) -> str:
        return (f'{self.__class__.__name__}(num_nodes={self.num_nodes}, '
                f'num_edges={self.num_edges})')

Below, we’ve constructed a BA graph (n=10, e=1):

from torch_geometric.datasets.graph_generator import BAGraph
ba_graph_generator = BAGraph(10, 1)
ba_graph = ba_graph_generator()
plot_graph(ba_graph)

BA (n=10, e=1) Graph

With a small value of e, we are more likely to observe a possibly disconnected graph with only a small number of high-degree nodes. The disconnectedness may seem counter-intuitive: when node 1 is added to the graph, shouldn’t it connect to node 0? Not necessarily. If we look at torch_geometric.utils.random.barabasi_albert_graph, we’ll see that when e is small, depending on the initialization of the nodes and edge connections, some nodes may not be connected at all!

As e increases to 20 (with n increased to 50 as well), we notice many more high-degree hub nodes in the BA graph. The nodes with fewer edges tend to be the ones added earlier in the graph’s formation. This is because only a limited number of existing nodes were available for connection at that time.

BA (n=50, e=20) Graph

Stochastic Block Models

Stochastic Block Models (SBM) are the final class of random graphs, distinguished primarily by their node clusterings. In SBMs, nodes are divided into a predetermined number of groups, and the nodes within each group share similar properties. Node clusters in an SBM typically have a high degree of connectivity within each group (these edges are referred to as intra-group connections). In most SBMs observed in the real world, edges from one group to another (inter-group connections) occur less frequently than intra-group connections.

Due to their high degree of clustering connectivity, SBMs are useful in community detection applications. A few examples are listed below.

Social Networks: In social network analysis, individuals are often grouped into communities depending on their shared interactions. SBMs can help discover patterns and underlying trends between users in similar groups, and help inform strategies to cluster future nodes together.
Recommender Systems: SBMs may be used in recommender systems to find clusters of users with a similar purchase history, which may be useful in informing future recommendations for individuals within the group. This community-based approach is able to capture complex patterns in users and items.

Implementation

SBM graphs are defined by two required parameters:

block_sizes ([int]): the total number of nodes in each cluster. The length of this list is assumed to be the number of clusters in the graph.
edge_probs ([[float]]): the density of edges going from each block to each other block.
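To make these two parameters concrete, here is a minimal pure-Python sketch of how an undirected graph could be sampled from them. It is an illustration under our own assumptions (the helper name sample_sbm_edges and the rng argument are hypothetical), not the code PyG uses.

```python
import itertools
import random

def sample_sbm_edges(block_sizes, edge_probs, rng=None):
    """Hypothetical helper: sample undirected SBM edges and block labels."""
    rng = rng or random.Random()
    # block_of[i] is the index of the block that node i belongs to
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    edges = []
    for u, v in itertools.combinations(range(len(block_of)), 2):
        # The edge probability depends only on the blocks of the endpoints
        if rng.random() < edge_probs[block_of[u]][block_of[v]]:
            edges.append((u, v))
    return edges, block_of

probs = [[0.9, 0.05], [0.05, 0.9]]
edges, block_of = sample_sbm_edges([10, 10], probs, rng=random.Random(0))
intra = sum(block_of[u] == block_of[v] for u, v in edges)
print("intra-group edges:", intra, "inter-group edges:", len(edges) - intra)
```

When every entry of edge_probs is the same value p, the block labels no longer matter and the sampler behaves like an ER sampler with that p.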

Unfortunately, PyG does not provide a class to randomly generate SBM graphs, as it does for ER and BA graphs. Below, we propose an addition to the PyG module torch_geometric.datasets.graph_generator that does exactly this.

import torch
from torch_geometric.utils import stochastic_blockmodel_graph
from typing import List, Union

# Below is our custom addition of the SBM Graph Generator

class SBMGraph(GraphGenerator):
    r"""Generates random Stochastic Block Model (SBM) graphs.
    See :meth:`~torch_geometric.utils.stochastic_blockmodel_graph` for more
    information.

    Args:
        block_sizes ([int] or LongTensor): The sizes of blocks.
        edge_probs ([[float]] or FloatTensor): The density of edges going
            from each block to each other block. Must be symmetric if the
            graph is undirected.
        directed (bool, optional): If set to :obj:`True`, will return a
            directed graph. (default: :obj:`False`)
    """

    def __init__(self, block_sizes: Union[List[int], torch.LongTensor],
                 edge_probs: Union[List[List[float]], torch.FloatTensor],
                 directed: bool = False):
        super().__init__()
        self.block_sizes = block_sizes
        self.edge_probs = edge_probs
        self.directed = directed

    def __call__(self) -> Data:
        edge_index = stochastic_blockmodel_graph(self.block_sizes,
                                                 self.edge_probs,
                                                 self.directed)
        # int(sum(...)) works for both Python lists and 1-D tensors
        return Data(num_nodes=int(sum(self.block_sizes)), edge_index=edge_index)

    def __repr__(self) -> str:
        return (f'{self.__class__.__name__}(block_sizes={self.block_sizes}, '
                f'edge_probs={self.edge_probs}, directed={self.directed})')

The code that generates the SBM lives in the __call__ method. We take advantage of torch_geometric.utils.stochastic_blockmodel_graph to generate a random set of edge connections sampled from an SBM distribution, which we wrap in a Data object and return from __call__.

Below, we’ve plotted an SBM with the parameters specified in the following code cell. Notice how the three clusters of nodes contain many more intra-group connections than inter-group connections.

sbm_graph_generator = SBMGraph([10, 10, 10], [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.05, 0.05, 0.90]])
sbm_graph = sbm_graph_generator()
plot_graph(sbm_graph)

SBM Graph (parameters specified above)

However, when we assign an equal probability of edge connection from each node cluster to every other (and to itself), we end up with a graph that doesn’t resemble an SBM at all; this is just an Erdős–Rényi graph!

sbm_graph_generator = SBMGraph([10, 10, 10], [[0.33, 0.33, 0.33], [0.33, 0.33, 0.33], [0.33, 0.33, 0.33]])
sbm_graph = sbm_graph_generator()
plot_graph(sbm_graph)

SBM Graph (parameters specified above)

Data Preparation

The section above provides a comprehensive review of random graphs and PyG. We will now propose a practical application of random graphs in the field of random graph detection.

Question: Can we design a Graph Neural Network (GNN) capable of accurately classifying the family that a randomly sampled graph originates from? Can we predict the parameters associated with that family?

Dataset Generation

In order to tackle this problem, we will begin by generating synthetic data from each of the three graph families. To design a robust GNN capable of distinguishing different graph families, we will diversify our dataset by incorporating example graphs with a range of parameters.

In our code, we implement a class RandomGraphDataset which generates multiple random graphs sampled from a particular graph family with specified parameters. By inheriting from InMemoryDataset, we’ve created a versatile PyG class that is useful for generating random graphs in a multitude of other contexts. The _download, _process, __len__, and __getitem__ functions define this interface, and we’ve adapted these functions for our particular use case.

import torch
import os
from torch_geometric.data import Data, InMemoryDataset

class RandomGraphDataset(InMemoryDataset):
    def __init__(self, graph_generator, num_samples, seed, root):
        super(RandomGraphDataset, self).__init__(root, None, None)
        self.graph_generator = graph_generator
        self.num_samples = num_samples
        self.seed = seed
        self.root = root

        if isinstance(graph_generator, ERGraph):
            self.graph_generator_params = {'name': "ER", 'num_nodes': graph_generator.num_nodes, 'p': graph_generator.edge_prob}
        elif isinstance(graph_generator, BAGraph):
            self.graph_generator_params = {'name': "BA", 'num_nodes': graph_generator.num_nodes, 'm': graph_generator.num_edges}
        elif isinstance(graph_generator, SBMGraph):
            self.graph_generator_params = {'name': "SBM", 'block_sizes': graph_generator.block_sizes, 'edge_probs': graph_generator.edge_probs, 'directed': graph_generator.directed}
        else:
            raise ValueError('graph_generator must be an instance of ERGraph, BAGraph or SBMGraph')

        file_path = "_".join([str(value) for value in self.graph_generator_params.values()]) + "_" + str(seed) + "_" + str(num_samples)
        # print(file_path)
        self.data_list = []

        # if the file exists in the root folder, load it
        if os.path.isfile(os.path.join(self.root, file_path + '.pt')):
            self.data_list = torch.load(os.path.join(self.root, file_path + '.pt'))

        else:
            self._download(file_path)
            self._process()

    def _download(self, download_path):
        # set the seed (note: this seeds PyTorch only, not NumPy's generator)
        torch.manual_seed(self.seed)
        # generate the graphs
        self.data_list = [self.graph_generator() for _ in range(self.num_samples)]

        # save only the list of graphs, which is what __init__ loads back
        torch.save(self.data_list, os.path.join(self.root, download_path + '.pt'))

    def _process(self):
        pass

    def __len__(self):
        return len(self.data_list)

    def __getitem__(self, idx):
        data = self.data_list[idx]
        return data

The RandomGraphDataset takes in a few important parameters:

graph_generator: an instance of the graph generator family, which includes the parameters of the family. These parameters are contained as attributes of the graph_generator instance (i.e. graph_generator.num_nodes or graph_generator.edge_prob).
num_samples: the number of random graphs to generate from the family specified in graph_generator.

This class either loads previously generated data from the root directory or calls the _download function to generate and save the data. The random graphs are saved in a .pt file whose name encodes the ground-truth parameters of the distribution they were sampled from. For instance, the file ER_5_0.3_100.pt contains 100 Erdős–Rényi graphs with 5 nodes, whose edges exist with a probability of 0.3 (in the code above, the seed is also inserted into the name, just before the sample count). We save these files in our root directory so that we can access them with ease.
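The naming scheme can be reproduced in isolation. The sketch below mirrors the string-building line in RandomGraphDataset; the helper name dataset_file_name and the seed value 0 are assumptions made for this example.

```python
def dataset_file_name(params, seed, num_samples):
    """Hypothetical helper mirroring the naming scheme in RandomGraphDataset."""
    # Family name and parameters come first, in dictionary order
    parts = [str(value) for value in params.values()]
    # The seed and the sample count are appended last
    return "_".join(parts + [str(seed), str(num_samples)]) + ".pt"

name = dataset_file_name({'name': 'ER', 'num_nodes': 5, 'p': 0.3},
                         seed=0, num_samples=100)
print(name)  # ER_5_0.3_0_100.pt
```

Because the ground-truth parameters are recoverable from the file name alone, labels for both the classification and regression tasks can be derived without storing extra metadata.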

We create multiple such files of data, where each file consists of 100 randomly sampled graphs from a fixed graph family with a particular set of parameters (for example, one such file was ER_5_0.3_100.pt). For brevity, the code that generates these files is excluded here, but can be accessed through our Google Colab linked above.

Incorporating Position-Aware Features

Before combining the files into one larger dataset, we add positional features to the node features of each graph (via add_position_features), a step that turned out to be critical. Provided below is our implementation of this function; to our knowledge, PyG does not currently provide this anchor-based functionality out of the box.

import torch
import numpy as np
import networkx as nx
from torch_geometric.utils import to_networkx

def add_position_features(data, num_anchors=10):
    # Convert to a NetworkX graph so we can run shortest-path queries
    graph = to_networkx(data)
    nodes = list(graph.nodes)

    # Select anchor nodes; sample with replacement when the graph has fewer
    # nodes than anchors, so the feature width stays fixed at num_anchors
    anchor_nodes = np.random.choice(nodes, size=num_anchors,
                                    replace=len(nodes) < num_anchors)

    # Calculate shortest-path distances from every node to every anchor
    distances = []
    for node in nodes:
        node_distances = []
        for anchor in anchor_nodes:
            try:
                distance = nx.shortest_path_length(graph, source=node, target=anchor)
            except nx.NetworkXNoPath:
                # Unreachable anchor: use a finite sentinel instead of infinity
                distance = len(nodes)
            node_distances.append(distance)
        distances.append(node_distances)

    # Convert to tensor
    distances_tensor = torch.tensor(distances, dtype=torch.float)

    # Min-max normalize to [0, 1]; the clamp avoids division by zero
    min_val = torch.min(distances_tensor)
    max_val = torch.max(distances_tensor)
    distances_tensor = (distances_tensor - min_val) / (max_val - min_val).clamp(min=1e-12)

    # Append to the existing node features, or create them if there are none
    if data.x is None:
        data.x = distances_tensor
    else:
        data.x = torch.cat([data.x, distances_tensor], dim=1)

    # Add a small noise
    data.x = data.x + 0.001 * torch.randn(data.x.shape)

    return data

This step is vital in enabling the GNN to discriminate between random graph families. Without positional embeddings, our experiments have shown that a basic GNN is unable to differentiate between two families of graphs that have a similar number of nodes and average node degree. In the example below, the GNN may take an incorrect shortcut in classifying the graphs by relying heavily on the node count and average degree, but incorporating position-aware features helps mitigate this problem.

A GNN will struggle to distinguish between the two graphs (ER — left, SBM — right) because they have the same average node degree.
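To illustrate why distances to anchors carry information that degree alone does not, here is a small pure-Python sketch. It is separate from add_position_features; the helper names bfs_distances and anchor_features and the adjacency-dictionary input are our own assumptions for the example.

```python
from collections import deque

def bfs_distances(adjacency, source):
    """Shortest-path hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def anchor_features(adjacency, anchors):
    """One row per node: hop distance to each anchor.
    Unreachable anchors get the sentinel len(adjacency)."""
    unreachable = len(adjacency)
    per_anchor = [bfs_distances(adjacency, a) for a in anchors]
    return {node: [d.get(node, unreachable) for d in per_anchor]
            for node in adjacency}

# Path graph 0 - 1 - 2 - 3: nodes 1 and 2 both have degree 2
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
features = anchor_features(path, anchors=[0])
# Same degree, but different distance to the anchor
print(features[1], features[2])
```

Two nodes that look identical to a degree-based view end up with different feature rows, which is the extra signal the classifier can exploit.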

Modeling (Classification / Regression)

We are finally ready to begin our modeling task! We will first embark on a classification task, where our objective will be to accurately predict the random graph family that some arbitrary graph was sampled from. Then, we will attempt to predict the family’s parameters.

Classification

Our initial task involves solving a graph-level classification task. First, we will compute the individual node embeddings and apply a global pooling operator to generate a graph-level embedding. We will then design a classification head that returns the most probable graph family.
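As a rough picture of what the global pooling step does, the sketch below averages node embeddings per graph in plain Python. The function name global_mean_pool_py is hypothetical; it follows the convention that batch[i] holds the index of the graph that node i belongs to, and it assumes every graph index has at least one node.

```python
def global_mean_pool_py(node_embeddings, batch):
    """Average node embeddings per graph; batch[i] is node i's graph index."""
    num_graphs = max(batch) + 1
    dim = len(node_embeddings[0])
    sums = [[0.0] * dim for _ in range(num_graphs)]
    counts = [0] * num_graphs
    for embedding, graph_id in zip(node_embeddings, batch):
        counts[graph_id] += 1
        for k, value in enumerate(embedding):
            sums[graph_id][k] += value
    # Divide each per-graph sum by the number of nodes in that graph
    return [[s / counts[g] for s in sums[g]] for g in range(num_graphs)]

# Three nodes: the first two belong to graph 0, the last one to graph 1
pooled = global_mean_pool_py([[1, 2], [3, 4], [10, 20]], batch=[0, 0, 1])
print(pooled)  # [[2.0, 3.0], [10.0, 20.0]]
```

The result has one row per graph regardless of how many nodes each graph contains, which is what lets a single linear classification head operate on graphs of different sizes.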

Here, we consider two architectures to generate embeddings: a simple Graph Convolutional Network (GCN) and a more expressive message-passing architecture, the Graph Isomorphism Network (GIN). After trial and error, we found that the GIN model outperformed the GCN. This was expected: GIN is designed to be as expressive as the Weisfeiler-Lehman (WL) graph isomorphism test, which is the upper bound for standard message-passing GNNs.
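For readers unfamiliar with the Weisfeiler-Lehman test, here is a minimal pure-Python sketch of 1-WL color refinement. The function name wl_colors and the adjacency-dictionary input are assumptions made for this illustration; it is not part of our model code.

```python
def wl_colors(adjacency, num_rounds=3):
    """Run 1-WL color refinement; return the sorted multiset of final colors."""
    # Every node starts with the same (empty) color
    colors = {node: () for node in adjacency}
    for _ in range(num_rounds):
        # New color = (own color, sorted multiset of neighbor colors)
        colors = {
            node: (colors[node],
                   tuple(sorted(colors[n] for n in adjacency[node])))
            for node in adjacency
        }
    return sorted(colors.values())

path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(wl_colors(path) == wl_colors(star))  # False: the test tells them apart

two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(wl_colors(two_triangles) == wl_colors(hexagon))  # True: indistinguishable
```

The second pair shows the limit of the test: both graphs are 2-regular, so every node receives the same color in every round. Cases like this are one motivation for supplementing message passing with extra node features such as the position-aware ones described earlier.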

The GNN architecture is shown below.

from torch.nn import Linear
import torch.nn.functional as F
from torch_geometric.nn import GINConv
from torch_geometric.nn import global_mean_pool


class GIN(torch.nn.Module):
    def __init__(self, num_node_features, hidden_channels, out_channels):
        super(GIN, self).__init__()
        torch.manual_seed(12345)
        self.conv1 = GINConv(Linear(num_node_features, hidden_channels))
        self.conv2 = GINConv(Linear(hidden_channels, hidden_channels))
        self.conv3 = GINConv(Linear(hidden_channels, hidden_channels))

        # out_channels is the number of classes (or 1 for regression)
        self.lin = Linear(hidden_channels, out_channels)

    def forward(self, x, edge_index, batch):
        # 1. Obtain node embeddings
        x = self.conv1(x, edge_index)
        x = x.relu()
        x = self.conv2(x, edge_index)
        x = x.relu()
        x = self.conv3(x, edge_index)

        # 2. Readout layer
        # average the node embeddings within each graph (mean pooling)
        x = global_mean_pool(x, batch)

        # 3. Apply a final classifier
        x = F.dropout(x, p=0.2, training=self.training)
        x = self.lin(x)

        return x

After partitioning our examples into training, validation, and testing sets, we wrote a training and evaluation loop to train and tune our model.

The results of our final model are listed below.

num_classes = train_dataset.num_classes
dataset_x_features = train_dataset[0].x.shape[1]

model = GIN(num_node_features=dataset_x_features,
            hidden_channels=64,
            out_channels=num_classes)

Training Loss of the GNN Classifier

Training / Testing Accuracy of the GNN Classifier

Further Study: Although our classifier performed very well, there are a few areas where we can improve. In particular, our model has difficulty classifying very dense graphs. As we increase the number of edges in a graph, families of random graphs begin to resemble each other. This is a hard problem to solve, and position-aware node embeddings seemed to offer the largest improvement in accuracy. Supplementing this approach with a random-walk strategy may help.

Moreover, we would love to extend our project to identify whether some real-world graphs can be approximated with random graphs. This would involve supplementing our dataset with positive and negative samples of real-world graphs to help our model better learn the intricacies of this task. Simultaneously, we’d have to increase the complexity of our GNN model. Due to the time constraints of this project, this was not a realistic goal, but it’s something that we’re looking forward to experimenting with in the future!

Regression

In the final part of our project, we are interested in a regression task — predicting the most likely parameters of the random graph family that an example graph was generated from. However, we quickly realized that estimating the parameters of Barabasi-Albert and Stochastic Block Model random graphs is a very complex task. Instead, we invested our time into designing a GNN that can infer the underlying probability of edge connection parameter p in Erdős–Rényi graphs. We deliberately focused on a data-dense regime for this task, only including ER graph examples in our dataset where the parameter p > 0.5. We found that accuracy decreased significantly for p < 0.5.

Our regression model closely resembled the GIN model from the classification task. As this was a regression task, we used a single output channel that returns the predicted p parameter. Additionally, we minimized mean squared error (MSE) loss instead of cross-entropy (CE) loss. The final model and results are presented below.
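For reference, the two losses can be written out in plain Python. This is a sketch of the standard definitions for illustration (the function names are ours), not the loss code used in training.

```python
import math

def mse_loss(predictions, targets):
    """Mean squared error between predicted and true values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy_loss(logits, target_class):
    """Negative log-probability of the target class under a softmax."""
    # Subtract the max before exponentiating for numerical stability
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[target_class]

# Regression: one predicted p per graph, compared with the true p
print(mse_loss([0.6, 0.8], [0.5, 1.0]))
# Classification: one logit per graph family, compared with the true family
print(cross_entropy_loss([2.0, 0.5, -1.0], target_class=0))
```

MSE penalizes the numeric distance between the predicted and true p, which suits a continuous target, whereas cross-entropy only scores how much probability is placed on the correct class.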

dataset_x_features = train_dataset[0].x.shape[1]

model = GIN(num_node_features=dataset_x_features,
            hidden_channels=64,
            out_channels=1)
model = model.to(device)

Regression Training Loss

Though it’s hard to discern the precise empirical loss from the plot above, our GNN model achieved a low MSE loss after 100 epochs. Below, we compare the loss of our GNN model with that of the trivial p prediction derived from the average node degree of the graph. While our GNN didn’t perform quite as well, it still showed comparable results. With an enhanced architecture, we hope to surpass this benchmark.
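One natural form of such a degree-based baseline divides the average degree by n - 1, since each node in an ER graph has n - 1 potential neighbors, each connected with probability p. The sketch below assumes that form and an undirected edge list with each edge listed once; the function name estimate_edge_prob is hypothetical.

```python
def estimate_edge_prob(num_nodes, edges):
    """Estimate ER's p as (average degree) / (n - 1).
    Assumes num_nodes >= 2 and each undirected edge appears once."""
    # Each undirected edge contributes to the degree of two nodes
    mean_degree = 2 * len(edges) / num_nodes
    return mean_degree / (num_nodes - 1)

# Complete graph on 4 nodes: all 6 possible edges are present
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(estimate_edge_prob(4, k4))  # 1.0
```

This is equivalent to dividing the number of observed edges by the n(n-1)/2 possible edges. If the edges come from a PyG edge_index that stores both directions, the edge count would need to be halved first.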

Further Study: One pitfall of our regression model was that it had difficulty estimating the p parameter for very sparse Erdős–Rényi graphs. This makes intuitive sense, as graphs with fewer edges naturally lead to less precise estimates and greater variability. Once again, this is a difficult problem to solve, and it likely requires increasing the complexity of the GNN, among other challenges.

Looking ahead, we’re eager to continue our work to estimate model parameters for the more intricate random graph families (Barabasi-Albert, Stochastic Block Model). While time constraints prevented us from delving into this area during our current project, it remains an exciting and promising direction for our future endeavors!

Conclusion

In this project, we explored the field of random graphs, particularly the Erdős–Rényi, Barabasi-Albert, and Stochastic Block Model families. We examined the characteristics and parameters of each graph family, and explored PyG’s source code for generating samples from each one. We also proposed several potential functional enhancements to PyG!

In the second half of our project, we built a position-aware GNN to predict the family that an arbitrary random graph was sampled from, and began to embark on the process of estimating the parameters. While we achieved some promising results, we recognize the need for more time and a sophisticated model to fully address both challenges. Nonetheless, we’re excited to build off of this work in the future.

Hope you enjoyed!
