First-timer’s Guide to Pytorch-geometric — Part 1: The Basics

Mill Apichonkit
CJ Express Tech (TILDI)
8 min read · Nov 28, 2022

Part 1 — The Basics of building datasets with graph-based information and plugging them into models

Introduction

Here’s my first attempt with Pytorch-geometric (PyG) and Graph Neural Network (GNN) models. Perhaps this post will help you through some tricky spots that I struggled with.

In this basic section, I will go through the steps of passing graph-based information into a PyG-style dataset, splitting it into mini-batches, and using the batches in GNN models.

This requires that you

  • have designed and built your graph data with networkx or equivalent (I’ve added links to the networkx documentation)

1. Let’s Build a Dataset Object

I found that most PyG tutorials use a ready-made Dataset. I have my own graph built in networkx (see an example of the output of networkx’s node_link_data() below). So, my next step is to transform the graph-based information into something PyTorch can easily use.
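If you haven’t seen it before, here is a toy graph and what its node_link_data() output looks like (the attribute names here are hypothetical, not from my project):

import networkx as nx

G = nx.Graph()
G.add_node(0, category="A", score=0.7)
G.add_node(1, category="B", score=0.1)
G.add_edge(0, 1)

print(nx.node_link_data(G))
# {'directed': False, 'multigraph': False, 'graph': {},
#  'nodes': [{'category': 'A', 'score': 0.7, 'id': 0},
#            {'category': 'B', 'score': 0.1, 'id': 1}],
#  'links': [{'source': 0, 'target': 1}]}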

PyG offers a quick tutorial for us here. We’ll pay attention to the InMemoryDataset class, from which we’ll inherit and adapt to our own use cases.

First, let’s take a closer look at the initialisation step. InMemoryDataset checks whether the directory specified with the “root” argument in __init__() (more precisely, its “processed” subdirectory) already contains the file(s) listed in “processed_file_names”. If it does, it skips the “process” method and goes straight to loading “self.data” and “self.slices”. (Somehow I’ve never had to use “slices” directly, but I’ll just go with the flow for now.)

# Code from https://pytorch-geometric.readthedocs.io/en/latest/notes/create_dataset.html
# Some comments are my own.
import torch
from torch_geometric.data import InMemoryDataset


class MyOwnDataset(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None, pre_filter=None):
        # The root argument should point to the directory where you have
        # saved the data or want to save it.
        super().__init__(root, transform, pre_transform, pre_filter)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def processed_file_names(self):
        # Checks whether the file(s) in this list already exist under "root".
        return ['data.pt']

    def process(self):
        # Read data into huge `Data` list.
        data_list = [...]

        if self.pre_filter is not None:
            data_list = [data for data in data_list if self.pre_filter(data)]

        if self.pre_transform is not None:
            data_list = [self.pre_transform(data) for data in data_list]

        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])

💡 Learn from my mistakes — Make sure this part works correctly, since it will save you from reprocessing the same data over and over. Note that it checks ALL the files in the processed_file_names list. Also, make sure you specify the right directory each time you create an object, to avoid overwriting existing files.

The important part is the “process” method. This is where you decide what happens before you save the transformed data. There are three pre-defined hooks for you to use: transform, pre_transform, and pre_filter. The detailed description is on the tutorial page. However, as first-timers, we can also just define the steps directly, make sure things work the first time, and refactor later.
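If you do want to use the hooks, here is a minimal sketch (the filter and transform shown are hypothetical examples, not from my project):

import torch_geometric.transforms as T

def keep_non_trivial(data):
    # Hypothetical pre_filter: drop graphs with fewer than two nodes.
    return data.num_nodes > 1

# pre_filter and pre_transform run once inside process(), before saving to
# disk; transform runs every time a sample is accessed.
dataset = MyOwnDataset(
    root='data/my_graph',          # hypothetical path
    pre_filter=keep_non_trivial,
    pre_transform=T.AddSelfLoops(),
)

For my first pass, though, I skipped the hooks entirely. Below is the flow I used.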

# 1. Read the graph-json file and other graph-related properties
# 2. Create (or read) edge_index from the graph. KEY - Important to understand how PyG deals with the graph information.
# 3. Create and process (or read) node attributes from the graph
- read categorical and numerical node attributes
- one-hot encode categorical attributes
- tensorise
# 4. Create (or read) other information you want to pass into the Data object
# 5. Create a Data object with all the properties I want to use later
- x (tensorised and processed node attributes)
- edge_index (a tensor of shape (2, num_edges); the first row holds the source node indices and the second row the destination node indices)
- y (edge labels - optional, can be defined as node labels if needed)
- any other things you want to use later

💡 Understanding the x and edge_index attributes is the key to understanding how to pass the graph-based information. The first holds the node features (or embeddings), one row per node. The latter is a pair of two equally sized lists: source node indices and destination node indices. A node’s index corresponds to its row position in x. See more here.
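To make this concrete, here is a tiny hand-built example (toy values, not from my project):

import torch

# Three nodes, each with two features (one row per node).
x = torch.tensor([[0.5, 1.0],   # node 0
                  [0.3, 0.0],   # node 1
                  [0.9, 2.0]])  # node 2

# Two directed edges: 0 -> 1 and 1 -> 2.
# Row 0 holds the source indices, row 1 the destination indices,
# so edge_index has shape (2, num_edges).
edge_index = torch.tensor([[0, 1],
                           [1, 2]])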

Here is an example of what the process method looks like for my case.

def process(self):
    self.G = self.read_graph_json()
    edge_index = self.get_edge_index()
    x = self.process_node_features()

    # Now create a Data object. More detail in
    # <https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html>
    data = Data(edge_index=edge_index, x=x)
    # Save it to the correct path. Next time you can skip process().
    torch.save(self.collate([data]), self.processed_paths[0])
    print(f"Saved data to {self.processed_paths[0]}")
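Using it then looks something like this (a hypothetical instantiation; the root path is made up):

dataset = MyOwnDataset(root='data/my_graph')
# The first call runs process() and saves data.pt; later calls with the
# same root load the saved file and skip process() entirely.
data = dataset[0]
print(data)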

Here is another example with more bells and whistles. Note that the Data object I created at the end has a few more attributes.

def process_train(self):
    self.G = self.read_graph_json()
    # Update the graph with spam_df
    self.add_labeled_target_edges()
    # Get information from the updated graph
    from graphbuilder import GraphBuilder  # Note - This is a custom class I defined. Ignore it.
    self.gb = GraphBuilder()
    self.gb.G = self.G
    index_map = self.gb.get_index_map()
    # Tensorise target_edges
    target_edge_index = self.get_target_edge_index(index_map)
    # Tensorise edges from known links
    edge_index = self.get_edge_index(index_map)
    # Get and tensorise node features
    x = self.process_node_features()
    # Tensorise edge_label
    edge_label = torch.tensor(self.spam_df['Label'].values, dtype=torch.float)

    # We can add more optional attributes to the Data object, like the target
    # edges we want to predict and the edge labels.
    data = Data(x=x, edge_index=edge_index, target_edge_index=target_edge_index, y=edge_label)
    torch.save(self.collate([data]), self.processed_paths[0])
    print(f"Saved data to {self.processed_paths[0]}")

💡 I will discuss how I add “target_edge_index”, a subset of the edge index for a task with heterogeneous edges, in the Applied section.

2. Mini-batching with DataLoader Class

Now, you could actually use the Dataset object in some GNN models, but you may not want to. With larger graphs, you will want to split the graph into several batches. Luckily, many ways to mini-batch graph data are already built into the PyG library. For well-known models in the GNN field, chances are there is already some form of graph-splitting implementation for you. See more discussion on mini-batching with graphs in PyG here.

If you remain confused after reading that page, fear not; that was my reaction as well. It turns out there are many kinds of Loader, and each one is generally tied to the implementation in a particular paper, including the heterogeneous graph transformer, Cluster-GCN, and GraphSAINT. Note that the table in the link contains both “Loader” and “Sampler” classes. So, if your goal is to use some ready-made model, you may be in luck. There are also examples with a variety of datasets, models, and tasks to choose from here.

💡 There are so many tutorials out there. I found that looking at too many examples at once can be more confusing than illuminating. Perhaps it’s wiser to stick to a single source the first time around than to jump around.

In my case, I wanted to try out the GraphSAGE model from the paper “Inductive Representation Learning on Large Graphs”. There are loaders built around this paper, so I just had to choose the appropriate one and set the correct parameters.

For GraphSAGE, I had NeighborLoader and LinkNeighborLoader to choose from. For a given batch_size parameter, NeighborLoader will pick that exact number of nodes per batch, whereas LinkNeighborLoader will pick that exact number of edges. Since I wanted to do a link prediction task later, I wanted a uniform number of edges in each batch, so I picked the latter.

Some interesting parameters are the following:

  • batch_size: for edge samplers, this is the number of (seed) edges in a single batch (for other Loaders that sample nodes, it is the number of nodes in a batch).
  • num_neighbors: this is specific to this Loader for GraphSAGE. It specifies the number of neighbors to sample for each node in each hop. (See more details in the GraphSAGE paper — Appendix.)
  • edge_label_index: this is how you pass in the edges of interest. The Loader will use these edges as the seeds it builds each mini-batch around; if you leave it out, all edges are considered.
  • edge_label: optional; only use this for labelled data. It must be the same length as edge_label_index.

💡 It’s important to re-pass edge_label=data.y; otherwise we won’t have the edge_label property in the LinkNeighborLoader’s batches, even if we’ve added it to the Data object. This point caused me quite a bit of confusion.

  • neg_sampling_ratio: usually your graph data will only contain positive edges, so you can add negative edges with this parameter. The negative edges will have the label 0; the positive edges have the label 1 by default. (Alternatively, you could build the graph with negative edges in the first place by adding the label as an edge property, but that would make your graph rather messy. You may prefer to keep the graph clean and handle negative sampling here.) A sketch tying these parameters together follows.
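Here is a minimal sketch of how these fit together (assuming data is the Data object built earlier, with the optional target_edge_index and y attributes):

from torch_geometric.loader import LinkNeighborLoader

loader = LinkNeighborLoader(
    data,
    batch_size=128,                            # seed edges per batch
    num_neighbors=[10, 10],                    # neighbors sampled per node, per hop
    edge_label_index=data.target_edge_index,   # the edges to predict on
    edge_label=data.y,                         # re-pass the labels (see the tip above)
    neg_sampling_ratio=1.0,                    # one sampled negative per positive edge
    shuffle=True,
)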

Let’s see how to tie together the whole pipeline: processing raw data into a Dataset, loading and splitting it into mini-batches, and using the batches in a model. Below is partial code from this tutorial. Comments are my own.

# Code from <https://github.com/pyg-team/pytorch_geometric/blob/master/examples/graph_sage_unsup_ppi.py>
# Comments are my own.
# ... skip imports

# 1. Create Datasets (PPI inherits from InMemoryDataset) - this will download
#    and process() the raw files, or load the existing processed files at the
#    specified path.
path = osp.join(osp.dirname(osp.realpath(__file__)), '..', 'data', 'PPI')
train_dataset = PPI(path, split='train')
val_dataset = PPI(path, split='val')
test_dataset = PPI(path, split='test')

# Group all training graphs into a single graph to perform sampling:
# 2.1 Create a Batch object from the Dataset
train_data = Batch.from_data_list(train_dataset)

# 2.2 Minibatch - pass the merged graph into the loader
loader = LinkNeighborLoader(train_data, batch_size=2048, shuffle=True,
                            neg_sampling_ratio=0.5, num_neighbors=[10, 10],
                            num_workers=6, persistent_workers=True)

# 2.3 Using attributes from the Dataset
model = GraphSAGE(
    in_channels=train_dataset.num_features,  # you can refer to any attribute of a dataset
    hidden_channels=64,
    num_layers=2,
    out_channels=64,
).to(device)

# 2.4 Passing the mini-batches into the model
def train():
    model.train()
    total_loss = total_examples = 0
    # tqdm is just a fancy way to add a progress bar. <https://github.com/tqdm/tqdm>
    for data in tqdm.tqdm(loader):  # KEY! Use this to iterate through minibatches
        data = data.to(device)
        optimizer.zero_grad()
        h = model(data.x, data.edge_index)  # KEY! Equivalent to calling model.forward().
        # Just pass the attributes the model needs. Look into the Model (GraphSAGE) class to see the details.
        # ... skip

Conclusion

So that was the usual flow for creating a Dataset from graph-based information, splitting it into mini-batches, and plugging the batches into GNN models. The next part will cover some specific details of a promotion-customer link prediction model I built: in particular, how to select a specific subset of edges for the prediction task and how to deal with different graphs at training time and at serving time.

See First-timer’s Guide to Pytorch-geometric — Part 2: The Applied for more implementation details.
