# OhMyGraphs: GraphSAGE in PyG

In a (much) earlier post, I described the intuition and some of the math behind a basic graph neural network (GNN) algorithm, GraphSAGE. How can we implement GraphSAGE for an actual task?

I’m a PyTorch person and PyG is my go-to for GNN experiments. For much larger graphs, DGL is probably the better option and the good news is they have a PyTorch backend!

If you’ve used PyTorch before, most of this will be intuitive so let’s jump in!

# Installation

PyG is rapidly being developed and new releases are frequent. I find that I always have some sort of conflict between the various required packages whenever there is a new release. Below are the versions I’m using in this notebook. You can install with `pip` or `conda`, but be careful to select the right device version: ie, `cuda10`, `cuda9` or `cpu`. Installation instructions in the docs are here.

```
torch                         1.8.0
torch-cluster                 1.5.9
torch-geometric               1.7.0
torch-scatter                 2.0.6
torch-sparse                  0.6.9
torch-spline-conv             1.2.1
```

# The convolution layer

The goal of graph convolution is to change the feature space of every node in the graph. It’s important to realize the graph structure doesn’t change: ie, in the before-and-after visual below, the same nodes are connected to each other. The magic behind graph convolution is in how that new feature is computed for each node.

PyG has various types of convolution layers; in this post, we’ll use the `SAGEConv` layer, which performs one iteration of the aggregate-and-update step (see previous post!). You can instantiate one layer of graph convolution by simply specifying the expected input and output feature shapes — very similar to a normal convolution in PyTorch.

```python
from torch_geometric.nn import SAGEConv

conv = SAGEConv(input_dim, output_dim)
```

A forward pass through the convolution layer requires two things: the node feature matrix `X` and the adjacency matrix.

```python
x = conv(data.x, data.adj_t)
```

Recall, the `X` matrix is an `(n x D)` matrix where `D` is the dimensionality of every node’s feature vector. Alternatively, if you cannot create an adjacency matrix (they can explode in size with a large number of nodes!), you can use an edge list. The edge list is expected to be a `(2 x E)` matrix, where `E` is the number of edges: the first row holds source nodes and the second row holds target nodes.
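To make the edge-list shape concrete, here is a hypothetical toy graph in plain `torch` (the node IDs are made up for illustration):

```python
import torch

# Toy graph: 4 nodes, 5 directed edges.
# Row 0 holds source nodes, row 1 holds target nodes.
edge_index = torch.tensor([[0, 1, 1, 2, 3],
                           [1, 0, 2, 3, 2]])

print(edge_index.shape)  # torch.Size([2, 5]) -> (2 x E) with E = 5 edges
```

Each column is one edge, e.g. the first column says node 0 points to node 1.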

## What’s happening under the hood?

The default aggregation function for `SAGEConv` is mean aggregation, which simply means: take my neighbours’ node features and average them (that’s the second term). The update step is a linear combination of the aggregated neighbour representation and the newly transformed current node representation (the first term). PyG handles the message passing, figuring out the neighbours of every node, etc.
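To make the two terms concrete, here is a rough sketch of mean aggregation in plain `torch`. This is not PyG’s actual implementation — the weight names `W_self` and `W_neigh` and the toy adjacency are purely illustrative:

```python
import torch

torch.manual_seed(0)
n, d_in, d_out = 4, 3, 2
x = torch.randn(n, d_in)  # one d_in-dim feature vector per node

# dense adjacency for a toy 4-node cycle (no self loops)
adj = torch.tensor([[0., 1., 0., 1.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 1.],
                    [1., 0., 1., 0.]])

# aggregate: average each node's neighbour features (second term)
neigh_mean = (adj @ x) / adj.sum(dim=1, keepdim=True)

# update: transformed self features (first term) + transformed neighbour mean
W_self = torch.nn.Linear(d_in, d_out, bias=False)
W_neigh = torch.nn.Linear(d_in, d_out, bias=False)
h = W_self(x) + W_neigh(neigh_mean)

print(h.shape)  # torch.Size([4, 2]) -- every node now lives in the new feature space
```

Note the graph structure (who is connected to whom) never changes; only the per-node features do.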

# Creating a model

The GraphSAGE model is simply a stack of `SAGEConv` layers. The model below has 3 layers of convolutions. In the forward method, you’ll notice we can add activation layers and dropout (you could even throw in some batch norm!)

The model below is set up for node classification: the last layer has the same number of neurons as there are classes in the dataset, and the (log) softmax at the end trains the model to output the most likely class for each node.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv


class GraphSAGE(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim, dropout=0.2):
        super().__init__()
        self.dropout = dropout
        self.conv1 = SAGEConv(in_dim, hidden_dim)
        self.conv2 = SAGEConv(hidden_dim, hidden_dim)
        self.conv3 = SAGEConv(hidden_dim, out_dim)

    def forward(self, data):
        x = self.conv1(data.x, data.adj_t)
        x = F.elu(x)
        x = F.dropout(x, p=self.dropout)

        x = self.conv2(x, data.adj_t)
        x = F.elu(x)
        x = F.dropout(x, p=self.dropout)

        x = self.conv3(x, data.adj_t)
        x = F.elu(x)
        x = F.dropout(x, p=self.dropout)

        return torch.log_softmax(x, dim=-1)
```

Beware, I’m calling this model `GraphSAGE`, but the original paper’s setup of conv layers, activations, etc. is described here. The only “SAGE” thing about this model is the `SAGEConv` layers.

# Datasets

I haven’t talked too much about datasets because much of the research in GNNs uses standard datasets available in PyG. There’s definitely nothing stopping you from creating a custom dataset, but that’s another post for another day (especially when you have a large graph!). In this example, I’ll use a dataset that comes packaged with OGB. To load it into your notebook, make sure to `pip install ogb`.

The below snippet loads an Amazon products dataset from `ogb`.

```python
import torch
import torch_geometric.transforms as T
from ogb.nodeproppred import PygNodePropPredDataset

device = 'cuda' if torch.cuda.is_available() else 'cpu'
device = torch.device(device)

dataset = PygNodePropPredDataset(name='ogbn-products',
                                 transform=T.ToSparseTensor())
data = dataset[0]

# this dataset comes with train-val-test splits predefined for benchmarking
split_idx = dataset.get_idx_split()
train_idx = split_idx['train'].to(device)
```

Some basic information about the dataset is packaged in the `data` object:

```python
print(f'dataset has {data.num_nodes} nodes where each node has a {data.num_node_features} dim feature vector')
print(f'dataset has {data.num_edges} edges where each edge has a {data.num_edge_features} dim feature vector')
print(f'dataset has {dataset.num_classes} classes')
```

This particular dataset has the train, val and test indexes split out for us.

```python
print(split_idx['train'].shape)
print(split_idx['valid'].shape)
print(split_idx['test'].shape)
```

The adjacency matrix is pre-populated in `data.adj_t` as a `SparseTensor`, since an `n x n` dense matrix would be huge — there are ~2.4M nodes!

```
SparseTensor(row=tensor([      0,       0,       0,  ..., 2449028, 2449028, 2449028]),
             col=tensor([    384,    2412,    7554,  ..., 1787657, 1864057, 2430488]),
             size=(2449029, 2449029), nnz=123718280, density=0.00%)
```
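For intuition on why sparse storage helps, here is a toy COO sparse tensor in plain `torch` (not PyG’s `SparseTensor`, and the edges are made up): only the nonzero entries are stored, so memory scales with the number of edges rather than with `n²`.

```python
import torch

# hypothetical 5-node graph with 4 directed edges
row = torch.tensor([0, 1, 2, 3])
col = torch.tensor([1, 2, 3, 4])
adj = torch.sparse_coo_tensor(torch.stack([row, col]),
                              torch.ones(4), size=(5, 5))

# density = nnz / n^2; for ogbn-products that's ~123.7M / (2.45M)^2, i.e. ~0.002%
print(adj._nnz() / (5 * 5))  # 0.16
```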

# Training

When I first started playing with GNNs, I thought it was weird that we always pass the entire graph in the train loop. But we have to, because the full structure must be available to compute the aggregate-and-update steps. Since we train on one set of nodes and validate/test on another, we simply select the outputs (and hence the gradients) using the indexes of the nodes in our train/val/test set!

P.S. below code stolen from Matthias Fey’s ogb submission!

```python
# compute activations for train subset
out = model(data)[train_idx]

# get gradients for train subset
loss = F.nll_loss(out, data.y.squeeze(1)[train_idx])

# evaluate model on test set
out = model(data)[test_idx]
```

For this `ogb` dataset, the `train` and `test` functions can be packaged like so:

```python
def train(model, data, train_idx, optimizer):
    model.train()
    optimizer.zero_grad()
    out = model(data)[train_idx]
    loss = F.nll_loss(out, data.y.squeeze(1)[train_idx])
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def test(model, data, split_idx, evaluator):
    model.eval()
    out = model(data)
    y_pred = out.argmax(dim=-1, keepdim=True)

    train_acc = evaluator.eval({
        'y_true': data.y[split_idx['train']],
        'y_pred': y_pred[split_idx['train']],
    })['acc']
    valid_acc = evaluator.eval({
        'y_true': data.y[split_idx['valid']],
        'y_pred': y_pred[split_idx['valid']],
    })['acc']
    test_acc = evaluator.eval({
        'y_true': data.y[split_idx['test']],
        'y_pred': y_pred[split_idx['test']],
    })['acc']

    return train_acc, valid_acc, test_acc
```

`ogb` comes packaged with an `Evaluator` to help score output predictions.

```python
from ogb.nodeproppred import Evaluator

lr = 1e-4
epochs = 50
hidden_dim = 75

evaluator = Evaluator(name='ogbn-products')
model = GraphSAGE(in_dim=data.num_node_features,
                  hidden_dim=hidden_dim,
                  out_dim=dataset.num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

for epoch in range(1, 1 + epochs):
    loss = train(model, data, train_idx, optimizer)
    result = test(model, data, split_idx, evaluator)

    if epoch % 10 == 0:
        train_acc, valid_acc, test_acc = result
        print(f'Epoch: {epoch}/{epochs}, '
              f'Loss: {loss:.4f}, '
              f'Train: {100 * train_acc:.2f}%, '
              f'Valid: {100 * valid_acc:.2f}% '
              f'Test: {100 * test_acc:.2f}%')
```

# TL;DR: rapidly building GNNs in PyG is ez!

Also, if you want to experiment with `GAT` or other types of convolution layers, it would (for the most part) be a simple swap-in-swap-out scenario. Check out the other available layers in the docs here.

The full notebook script is available here although it is mostly a broken down version of Matthias’ code.