First-timer’s Guide to Pytorch-geometric — Part 2 The Applied

Jan 10, 2023

Part 2 — Application details with an example from building a Graph-based embedding for a link prediction task

1. Introduction

In Part 1, I discussed my first brush with the Pytorch-geometric library: how to prepare graph-based data and use it in some models. Here I would like to shift your attention to some minor, but hopefully useful, points I ran into while applying the library to my own problem.

Key points of this part cover graph design with multiple node and edge types, train/val/test split, mini-batch splitting with graphs, and getting node embeddings for prediction tasks.

Here is the simplified problem statement to give you some context for the applied part.

Problem Statement

“For a given promotion, find n customers who are most likely to redeem that promotion. Knowing this, we can limit the cost of notification messages and avoid spamming customers with promotions they are not interested in.”

With the historical transaction data, the customer master table, and the promotion master table, we know who (customer id) redeemed which promotions (promotion id), what the customers look like (customer demographics), and what types of products and promotions they like (promotion details).

Our hypothesis is that the “graph-based information” will help us better predict who would redeem a certain promotion because the graph would allow the model to learn about one node from its neighbours. One idea we tried was to build a graph based on past transactions and then build a promotion-customer link prediction model for future transactions. Our simplest version embeds nodes using their features and the transaction-based graph. Then we use the embeddings to fit a good old binary classification model to predict whether a given customer node and a promotion node will have an edge (an interaction) between them or not.

💡 As a starter, see the Pytorch-geometric tutorial on link prediction on a static graph: the MovieLens link prediction tutorial. It features a bipartite graph with two types of nodes (users and movies) and one type of edge (connecting a user node and a movie node).

2. Graph Building For Train/Val/Test and Mini-batch Splitting

There are several choices to make when you build the graph. Think about what should be represented as nodes, whether there will be only one or several types of nodes in the graph, what should be represented as edges, whether node features are useful or just node ids are fine, what kind of relationships the edges represent, etc. These choices are experimental in nature. I found it useful to start small and simple, then build up the more complex graph later once we know what information is useful for that task.

In our first iterations, we started with only a few basic features such as age, gender of customers, how long a promotion runs, and what brand a product belongs to. Those features are encoded as node features. We drew edges to connect a customer node with all the promotion nodes which that customer has redeemed. For a promotion node, we drew edges to all the products associated with it, and also to all associated product category nodes.

Based on the data above, we can draw a graph where each customer id becomes a customer node, each promotion becomes a promotion node, and each product becomes a product node. We can go through a period of transaction data and draw edges (links) conditional on the interactions, for example:

between a customer node and a promotion node when the customer redeemed that promotion
between a promotion node and all the product nodes associated with it.
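As a rough illustration of the two edge rules above, here is a minimal plain-Python sketch that turns toy records into an index map and an edge list. The record contents and the names `redemptions`, `promo_products`, and `build_graph` are hypothetical; the real tables and the author's GraphBuilder are not shown in this article.

```python
# Hypothetical toy records; the real tables and column names are assumptions.
redemptions = [("c1", "p1"), ("c2", "p1"), ("c2", "p2")]  # (customer id, promotion id)
promo_products = [("p1", "prodA"), ("p2", "prodA"), ("p2", "prodB")]  # (promotion id, product id)


def build_graph(redemptions, promo_products):
    """Return (index_map, edge_index) for a graph with three node types."""
    index_map = {}  # (node_type, raw_id) -> integer node index

    def node(node_type, raw_id):
        key = (node_type, raw_id)
        if key not in index_map:
            index_map[key] = len(index_map)  # next free integer index
        return index_map[key]

    src, dst = [], []
    # Edge rule 1: customer -> promotion when the customer redeemed it
    for customer_id, promo_id in redemptions:
        src.append(node("customer", customer_id))
        dst.append(node("promotion", promo_id))
    # Edge rule 2: promotion -> every product associated with it
    for promo_id, product_id in promo_products:
        src.append(node("promotion", promo_id))
        dst.append(node("product", product_id))
    # COO layout: row 0 holds source indices, row 1 holds destination indices
    return index_map, [src, dst]


index_map, edge_index = build_graph(redemptions, promo_products)
print(len(index_map), "nodes,", len(edge_index[0]), "edges")
```

The two-row layout mirrors the `[2, num_edges]` shape that Pytorch-geometric expects for `edge_index`, so the nested list could be wrapped in a tensor later.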

Our whiteboard during the graph design session

Another key point is to think about what the graph will look like at serving time. The training graph may contain all the nodes and edges known at training time, but at serving time there will be unseen nodes and edges, so you need to update the graph correctly before making predictions.

In our case, the training graph contains, for example, customer nodes up to March 2022 and transactions between January and March 2022. However, our val and test sets are transactions from June 2022 and September 2022, respectively. Some customers signed up after March 2022. Naturally, the promotions we want to evaluate are all new nodes. There are even some new, unseen products. We want to add these nodes and edges to the old graph so that each node can learn from its closest and distant neighbours.

There are also different types of edges to think about. Our target is to predict whether an edge between a given promotion node and a given customer node exists. We can also frame the problem as predicting an edge label: 1 if there is an interaction between the customer and the promotion, and 0 otherwise. With the Pytorch-geometric library, we can choose either framing, since the library offers negative sampling functionality.
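To make the label 1 / label 0 framing concrete, here is a minimal plain-Python sketch of negative sampling. It is not Pytorch-geometric's implementation; the function name, the `ratio` argument, and the toy ids are assumptions for illustration only.

```python
import random


def sample_negative_edges(positive_edges, customers, promotions, ratio=1.0, seed=0):
    """Sample (customer, promotion) pairs that are NOT among the positive edges."""
    rng = random.Random(seed)
    positives = set(positive_edges)
    # Never ask for more negatives than there are non-positive pairs
    max_negatives = len(customers) * len(promotions) - len(positives)
    num_negatives = min(int(len(positives) * ratio), max_negatives)
    negatives = set()
    while len(negatives) < num_negatives:
        pair = (rng.choice(customers), rng.choice(promotions))
        if pair not in positives:  # rejection sampling
            negatives.add(pair)
    return sorted(negatives)


positive_edges = [("c1", "p1"), ("c2", "p1"), ("c2", "p2")]
negative_edges = sample_negative_edges(
    positive_edges, customers=["c1", "c2", "c3"], promotions=["p1", "p2"]
)

# Labeled edge set: 1 for observed redemptions, 0 for sampled non-redemptions
labeled_edges = positive_edges + negative_edges
labels = [1] * len(positive_edges) + [0] * len(negative_edges)
print(list(zip(labeled_edges, labels)))
```

Note that a sampled "negative" only means no redemption was observed, not that the customer would never redeem the promotion.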

Similarly, at serving time, we can update the original graph with information that is known before prediction, such as which products the new promotions are associated with and the new members’ history up to that point. We can separately specify the node indices of the edges we want to predict. This is useful when you have pre-selected a segment of customers to score instead of making predictions for all customers.
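The update step can be sketched in plain Python as follows. This is only an illustration under assumed names (`extend_graph`, `known_edges`, `target_pairs`) and toy ids, not the code we ran: edges known before prediction time are added to the graph, while the pairs to be scored are indexed but kept out of it.

```python
def extend_graph(index_map, edge_index, known_edges, target_pairs):
    """Add nodes/edges known before prediction time; keep target pairs separate."""
    index_map = dict(index_map)  # copy so the training graph is left untouched
    src, dst = list(edge_index[0]), list(edge_index[1])

    def node(raw_id):
        if raw_id not in index_map:
            index_map[raw_id] = len(index_map)  # unseen node gets the next free index
        return index_map[raw_id]

    # Known edges (e.g. new promotion -> its products) become part of the graph
    for a, b in known_edges:
        src.append(node(a))
        dst.append(node(b))
    # Target pairs get node indices but are NOT added to edge_index
    target_src = [node(a) for a, _ in target_pairs]
    target_dst = [node(b) for _, b in target_pairs]
    return index_map, [src, dst], [target_src, target_dst]


train_index_map = {"c1": 0, "p1": 1, "prodA": 2}
train_edge_index = [[0, 1], [1, 2]]

new_map, new_edge_index, target_edge_index = extend_graph(
    train_index_map,
    train_edge_index,
    known_edges=[("p9", "prodA")],                # new promotion p9 sells prodA
    target_pairs=[("c1", "p9"), ("c7", "p9")],    # pre-selected customers to score
)
print(new_edge_index, target_edge_index)
```

In this toy case the new customer `c7` has no known edges yet, so its embedding would rest on its node features alone.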

Implementation

For the train/val/test sets, we updated and saved one graph for each set. During the Dataset instantiation, each set will refer to different graphs. The code below is how I implemented my InMemoryDataset object to refer to each graph.

💡 Make sure to keep the target edges (what we want to predict) separately. The edge_index attribute does NOT need to contain the target_edge_index.

import json

import networkx as nx
import torch
from networkx.readwrite import json_graph
from torch_geometric.data import Data, InMemoryDataset


class MyDataset(InMemoryDataset):
    # skip other methods ...
    def process(self):
        self.G = self.read_graph_json()
        # Update the graph with spam_df
        self.add_labeled_target_edges()
        # Get information from the updated graph
        from graphbuilder import GraphBuilder  # project-specific helper, not part of Pytorch-geometric
        self.gb = GraphBuilder()  # Don't need transaction data as input since we already have an updated graph
        self.gb.G = self.G
        index_map = self.gb.get_index_map()
        # Split edges in the updated graph into non-target edges (no edge labels) and target edges (have edge labels)
        non_target_edge_index, target_edge_index = self.split_edge_index(index_map)
        # Get and tensorise node features
        x = self.process_node_features()
        # Tensorise edge_label - just read the Label column from a dataset we create for each split
        # (to_numpy() avoids index-alignment surprises; float32 matches the model's output dtype)
        edge_label = torch.tensor(self.train_df['Label'].to_numpy(), dtype=torch.float)

        # Creating a Data object
        # - NOTE that edge_index contains only non_target_edge_index (partially updated graph)
        # - whereas the target_edge_index contains only the edges we want to make predictions on
        # - the length of the target_edge_index and the edge_label ('y') must be the same
        data = Data(x=x, edge_index=non_target_edge_index, target_edge_index=target_edge_index, y=edge_label)

        torch.save(self.collate([data]), self.processed_paths[0])
        print(f"Saved data to {self.processed_paths[0]}")

    def read_graph_json(self):
        # We saved the graph for each split separately.
        # For example, 'graphv1-train-G.json', 'graphv1-val-G.json', etc.
        # self.split is set to 'train', 'val', or 'test'.
        # Using the attribute 'split' to point to different graphs for train/val/test
        path = f'{self.prefix}-{self.split}-G.json'
        with open(path, 'r') as f:
            G = nx.Graph(json_graph.node_link_graph(json.load(f)))
        return G

    # skip other methods ...

Create one dataset for each split.

train_dataset = MyDataset(root=root,
                          encoder_path='./model/encoder',
                          data=train_df,
                          split='train')

val_dataset = MyDataset(root=root,
                        encoder_path='./model/encoder',
                        data=val_df,
                        split='val')

test_dataset = MyDataset(root=root,
                         encoder_path='./model/encoder',
                         data=test_df,
                         split='test')

Split each dataset into mini-batches.

from torch_geometric.loader import LinkNeighborLoader

data = train_dataset[0]  # the single Data object stored in the dataset
train_loader = LinkNeighborLoader(
    data,
    batch_size=1024,
    # This refers to the edge indices for which neighbours are sampled to create mini-batches.
    # KEY: must specify only target edges here, otherwise it will sample any edge
    edge_label_index=data.target_edge_index,
    # The edge_label corresponds to the target edge labels.
    edge_label=data.y,
    shuffle=True,
    num_neighbors=[10, 10],
    num_workers=6,
    persistent_workers=True,
)
# skip other splits...
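To give an intuition for what `num_neighbors=[10, 10]` means, here is a simplified plain-Python sketch of hop-by-hop neighbour sampling around the end nodes of a target edge. It is not the loader's actual implementation; `sample_subgraph` and the toy adjacency list are assumptions for illustration.

```python
import random


def sample_subgraph(adjacency, seed_nodes, num_neighbors, seed=0):
    """Return the set of nodes reached by sampling up to k neighbours per node, per hop."""
    rng = random.Random(seed)
    visited = set(seed_nodes)
    frontier = list(seed_nodes)
    for k in num_neighbors:  # one list entry per hop
        next_frontier = []
        for node in frontier:
            neighbours = adjacency.get(node, [])
            picked = rng.sample(neighbours, min(k, len(neighbours)))
            for n in picked:
                if n not in visited:
                    visited.add(n)
                    next_frontier.append(n)
        frontier = next_frontier
    return visited


# Toy undirected graph as an adjacency list
adjacency = {0: [1, 2, 3], 1: [0, 4], 2: [0], 3: [0], 4: [1]}

# Two hops with a generous budget reach every node within distance 2 of node 0
print(sample_subgraph(adjacency, seed_nodes=[0], num_neighbors=[10, 10]))
# One hop with a budget of 2 keeps only two of node 0's three neighbours
print(sample_subgraph(adjacency, seed_nodes=[0], num_neighbors=[2]))
```

Each mini-batch is then a small subgraph around its target edges, which is why memory stays bounded even when the full graph is large.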

3. Get Node Embeddings

To get node embeddings (a real-valued vector representation of each node), we train a model on some task (it does not need to be the link prediction task) and use the resulting embeddings as the input to our link prediction model in the final step.

In our first iteration, we trained the embedding model to predict whether any pair of nodes has an edge or not (binary classification). This is a bit different from our intended task, where we only want to predict edges between customer nodes and promotion nodes.

💡 By setting the task in this step to predicting any link, we can use Pytorch-geometric’s “neg_sampling_ratio” parameter in the LinkNeighborLoader object, without explicitly creating a labeled dataset ourselves.

Using the neg_sampling_ratio to sample and predict any edges. See details in this doc.

train_data = Batch.from_data_list(train_dataset)
loader = LinkNeighborLoader(train_data, batch_size=2048, shuffle=True,
                            # Use the built-in neg_sampling_ratio
                            neg_sampling_ratio=0.5, 
                            num_neighbors=[10, 10],
                            num_workers=6, persistent_workers=True)

Alternatively, we can explicitly train the model on labeled data using only the edges between customer nodes and promotion nodes.

train_data = Batch.from_data_list(train_dataset)
loader = LinkNeighborLoader(train_data, batch_size=2048, shuffle=True, num_neighbors=[10, 10],
                            num_workers=6, persistent_workers=True,
                            # Explicitly add the edge index and the edge label attributes
                            # (the customer-promotion target edges and their 0/1 labels; the Data object
                            # built in Section 2 stores them as target_edge_index and y)
                            edge_label_index=train_data.edge_label_index, edge_label=train_data.edge_label)

With the graph I described above, our embedding of each node relies on

node features (e.g. age, gender, brand, etc.)
its neighbours (both immediate and distant).

Therefore, two customers with the same age and gender will get different embeddings if their past relationships with promotions differ. With random neighbour sampling, even two nodes with exactly the same attributes and history may end up with different embeddings.
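The effect can be seen in a stripped-down, plain-Python sketch of a single mean-aggregation step. A real GraphSAGE layer also applies learned weights and a non-linearity, so treat this only as an illustration; the feature values and the name `mean_aggregate` are made up.

```python
def mean_aggregate(features, adjacency, node):
    """One simplified aggregation step: own features + mean of neighbour features."""
    neighbours = adjacency.get(node, [])
    dim = len(features[node])
    if neighbours:
        mean = [sum(features[n][i] for n in neighbours) / len(neighbours) for i in range(dim)]
    else:
        mean = [0.0] * dim  # isolated node: nothing to aggregate
    return features[node] + mean  # list concatenation


# Two customers with identical features but different redemption histories
features = {
    "custA": [30.0, 1.0],
    "custB": [30.0, 1.0],
    "promo1": [7.0, 0.0],
    "promo2": [14.0, 1.0],
}
adjacency = {"custA": ["promo1"], "custB": ["promo1", "promo2"]}

emb_a = mean_aggregate(features, adjacency, "custA")
emb_b = mean_aggregate(features, adjacency, "custB")
print(emb_a)
print(emb_b)
```

If the neighbour list passed in were a random sample rather than the full list, repeated calls for the same node could also return different vectors.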

Below is the code for getting the embeddings from our mini-batches.

import torch
import tqdm
from torch_geometric.nn import GraphSAGE


class Embedder:
    def __init__(self, model_path):
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        # The NUM_* constants are defined elsewhere and must match the trained model
        self.model = GraphSAGE(
            in_channels=NUM_IN_CHANNELS,
            hidden_channels=NUM_HIDDEN_CHANNELS,
            num_layers=NUM_LAYERS,
            out_channels=NUM_OUT_CHANNELS,
        ).to(self.device)
        # map_location lets a model trained on a GPU load on a CPU-only machine
        self.model.load_state_dict(torch.load(model_path, map_location=self.device))
        self.model.eval()

    def get_edge_embedding_minibatch(self, data):
        # Node embeddings for every node in the mini-batch, shape = [num_nodes, embedding_dim]
        h = self.model(data.x, data.edge_index)
        h_src = h[data.edge_label_index[0]]
        h_dst = h[data.edge_label_index[1]]
        return h_src, h_dst

    @torch.no_grad()
    def encode_minibatch(self, loader):
        xs, ys = [], []
        for data in tqdm.tqdm(loader):
            data = data.to(self.device)
            h_src, h_dst = self.get_edge_embedding_minibatch(data)
            one_x = torch.cat([h_src, h_dst], dim=1) # concatenate two node embeddings into one vector
            xs.append(one_x)
            ys.append(data.edge_label)
        x = torch.cat(xs, dim=0)
        y = torch.cat(ys, dim=0)
        return x, y

4. Link Prediction Model

This part is rather straightforward. After getting the embeddings for all pairs of nodes we want predictions for, we can pass the embedding pairs and the labels into whichever binary classification model we like.

One baseline we used in our experiment is the simple dot product + sigmoid function.
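As a minimal plain-Python sketch of that baseline (an illustration with made-up vectors, not our production code), the score for a pair is the sigmoid of the dot product of the two node embeddings. The function names and the 0.5 threshold are assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def link_probability(h_src, h_dst):
    """Score one (customer, promotion) pair from its two embedding vectors."""
    dot = sum(a * b for a, b in zip(h_src, h_dst))
    return sigmoid(dot)


def predict(pairs, threshold=0.5):
    """Turn scores into 0/1 edge predictions."""
    return [int(link_probability(s, d) >= threshold) for s, d in pairs]


pairs = [
    ([1.0, 0.0], [1.0, 0.0]),   # aligned embeddings -> dot product 1
    ([1.0, 0.0], [-2.0, 1.0]),  # opposing embeddings -> dot product -2
]
print([round(link_probability(s, d), 3) for s, d in pairs])
print(predict(pairs))
```

With the tensors returned by get_edge_embedding_minibatch, the same score could be written as `torch.sigmoid((h_src * h_dst).sum(dim=-1))`. This baseline has no trainable parameters of its own, so its quality depends entirely on the embeddings.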

You can also use the embedding pairs in other ways, for example, by concatenating the embeddings of the two nodes into one tensor.
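Whichever classifier is used, the scores still need to be turned into the "n customers" from the problem statement. A minimal sketch, assuming a hypothetical dict of predicted probabilities for one promotion:

```python
import heapq


def top_n_customers(scores, n):
    """Return the n customer ids with the highest predicted redemption probability."""
    return heapq.nlargest(n, scores, key=scores.get)


# Hypothetical predicted probabilities for a single promotion
scores = {"c1": 0.2, "c2": 0.9, "c3": 0.5, "c4": 0.7}
print(top_n_customers(scores, n=2))
```

Only these customers would then receive the notification for that promotion.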

5. Conclusion

That’s a wrap for the applied part of this first-timer’s guide to Pytorch-geometric. There are many parts you can experiment with, from how you design and update your graphs to contain the relational information you want to feed into your model, to how you use the built-in parameters in Pytorch-geometric to help with tasks such as negative sampling of edges. Happy experimenting.