What’s Cooking? Using GNNs to redefine Culinary Boundaries

Flavor Labs
Stanford CS224W GraphML Tutorials
Mar 22, 2023

Stanford CS 224w (Machine Learning with Graphs) course project by Noah Cowan, Will Shabecoff, and Kushagra Gupta

Our team getting ready to generate the next best dish

Graph neural networks (GNNs) have become a popular approach for modeling graph-structured data in recent years, with applications in various fields, including social networks, biology, and recommendation systems. In this blog post, we’ll explore how we can use GNNs to predict new edges between ingredients in a food network, based on their flavor profiles.

The idea of using flavor profiles to model food ingredients is not new. Food scientists have long used flavor descriptors such as sweet, bitter, and umami to characterize the taste of different ingredients and products. In recent years, this information has been used to create flavorgraphs, which visualize the flavor characteristics of different ingredients and products in a graph-like structure.

By treating the flavorgraph as a network and using GNNs, we can leverage the rich flavor information to predict new edges between ingredients, indicating potential flavor pairings or substitutions. This approach can have practical applications in food product development, recipe recommendation, and menu planning, among others.

In this blog post, we’ll go through the steps of building a GNN-based model for predicting new edges between ingredients in a flavorgraph. We’ll start by exploring the flavorgraph dataset and preprocessing it for GNN training. Then, we’ll build and train the GNN model using PyTorch Geometric, a popular library for GNNs in PyTorch. Finally, we’ll evaluate the performance of our model and discuss potential future directions for this research.

Our code can be found in this Google Colab.

Problem Statement (What’s Cooking?)

We want to discover new recipes that no chef ever dreamt of! The main objective of this task is to generate novel yet plausible subgraphs of ingredients. Creating new recipes has numerous benefits: it brings diversity to the dining table, lets us explore and appreciate different cuisines, cultures, and traditions, introduces healthier ingredients and cooking methods, and provides the personal satisfaction of sharing new recipes with others.

Link prediction is the key task here: by finding new links between ingredient nodes, we can identify new and interesting ingredient pairings. Our FlavorGraph database contains food-compound information that can help us create compelling flavor combinations that may not have been explored before, whether for cultural or other reasons. We use link prediction for two related problems: finding a pairing for a specific ingredient, and constructing a recipe iteratively by repeatedly adding the most likely next edge and stopping once the edge likelihood is no longer high enough (a sketch of this loop follows below). We are excited to see the fresh and innovative food pairings discovered by link prediction with graph neural networks.
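To make the iterative construction concrete, here is a minimal sketch of the greedy loop we have in mind, assuming a trained link predictor. The helper edge_probability and the threshold value are hypothetical stand-ins, not part of FlavorGraph or our final code:

# Minimal sketch of greedy recipe construction via link prediction.
# edge_probability(a, b) is a hypothetical stand-in for a trained model
# that scores how likely ingredients a and b are to pair well.
def build_recipe(seed_ingredient, candidates, edge_probability, threshold=0.8):
    recipe = [seed_ingredient]
    remaining = set(candidates) - {seed_ingredient}
    while remaining:
        # Score each candidate against the most recently added ingredient
        best = max(remaining, key=lambda c: edge_probability(recipe[-1], c))
        if edge_probability(recipe[-1], best) < threshold:
            break  # stop once the most likely next edge is not confident enough
        recipe.append(best)
        remaining.remove(best)
    return recipe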

Dataset (Basically whatever’s left in the fridge)

FlavorGraph Embeddings

We use the FlavorGraph[1] dataset for our task. FlavorGraph is a large-scale graph that captures both the culinary and chemical relationships between food ingredients. Its nodes represent food ingredients and flavor compounds, and its edges connect ingredients that co-occur in recipes as well as ingredients to the chemical compounds they contain. After restricting to ingredient nodes (described below), our graph contains 6,653 ingredients connected by 111,355 edges. FlavorGraph has been used in prior work to explore the relationships between different ingredients and to predict new flavor pairings or substitutions.

Preprocessing (Cutting and Chopping)

FlavorGraph contains nodes corresponding to both food ingredients and chemical compounds. For our project, we restrict ourselves to food ingredient nodes. This allows us to work with a homogeneous graph, since all nodes and edges are then of the same type. One advantage of using a homogeneous graph for link prediction is that it is simpler to model, and the GNNs designed for such graphs are often computationally efficient. We load the node and edge datasets and restrict the nodes to ingredients observed in edges. This results in 6,653 ingredient nodes with 111,355 edges between them.

import pandas as pd

# Load the FlavorGraph edge list and inspect the edge types
edge_url = "https://raw.githubusercontent.com/lamypark/FlavorGraph/master/input/edges_191120.csv"
edges_df = pd.read_csv(edge_url)
print(edges_df.edge_type.value_counts())

# Keep only the ingredient-ingredient edges (the first 111,355 rows of the file)
edges_df = edges_df.iloc[:111355, :]
edges_df.head()

Next, we create an `edge_index` using only the nodes that have edges to other ingredient nodes (i.e. removing nodes which only have edges to food compounds). It’s worth noting that the edge weights in FlavorGraph are not derived from explicit taste tests or sensory evaluations of food, but rather from the chemical composition of the ingredients. The chemical fingerprints are obtained by representing each ingredient as a vector of molecular descriptors, which are quantitative properties that describe the chemical and physical characteristics of the molecule. The cosine similarity between two ingredient vectors is then computed, which measures the similarity between the two molecules based on their chemical fingerprints.
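As a minimal sketch of this step (the column names id_1 and id_2 are our assumption about the FlavorGraph edge CSV; adjust to the actual schema):

import torch

# Build a contiguous edge_index over ingredient nodes only.
# Nodes that appear in no ingredient-ingredient edge are dropped.
ingredient_ids = sorted(set(edges_df['id_1']) | set(edges_df['id_2']))
id_map = {raw: i for i, raw in enumerate(ingredient_ids)}  # raw id -> 0..N-1

edge_index = torch.tensor(
    [[id_map[s] for s in edges_df['id_1']],
     [id_map[t] for t in edges_df['id_2']]],
    dtype=torch.long,
)  # shape [2, num_edges], as expected by PyG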

Our next step is to randomly partition the edges into training, validation, and test sets. To accomplish this, we use the `RandomLinkSplit` transform available in PyG.

from torch_geometric.transforms import RandomLinkSplit

# Randomly split edges; negatives for training are sampled later,
# inside the training loop
transform = RandomLinkSplit(is_undirected=True,
                            add_negative_train_samples=False,
                            disjoint_train_ratio=0.35)
train_data, val_data, test_data = transform(flavorGraph)

There are several things to note about this output data.

First, edge_index is split into training, validation, and test splits in such a way that the edge_index of the validation and test data never contains the held-out edges those splits are meant to predict. This prevents target leakage into the node embeddings when making predictions on validation and test data, since edge_index (and x) is used by the encoder to create the node embeddings.

Next, two new attributes, edge_label and edge_label_index, are appended to each split. These hold the labels of the edges to predict and their indices, respectively: edge_label_index is used by the decoder to make predictions, while edge_label is used for model evaluation.

Third, negative links are introduced into both val_data and test_data, in the same number as positive links (neg_sampling_ratio=1.0). The negative links are added to the edge_label and edge_label_index attributes but not to edge_index, since we do not want negative links to participate in the encoder’s node embedding creation. Moreover, we do not add negative links to the training set here (add_negative_train_samples=False); instead, they are resampled in every iteration of the training loop shown below, and this randomization aims to enhance the model’s robustness.
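A quick way to sanity-check this behavior is to print, for each split, how many edges the encoder sees versus how many it is asked to label:

# Sanity check: message-passing edges vs. supervision edges in each split
for name, split in [('train', train_data), ('val', val_data), ('test', test_data)]:
    print(name,
          'edge_index:', split.edge_index.size(1),              # seen by the encoder
          'edge_label_index:', split.edge_label_index.size(1),  # edges to predict
          'positives:', int(split.edge_label.sum()))            # rest are negatives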

PyG Link Prediction (Our Masterchef)

PyG performs link prediction in the style of a graph auto-encoder[2]. Here, the auto-encoder is a graph neural network (GNN) model that learns node embeddings, which are low-dimensional representations of the nodes in the graph. The auto-encoder is trained on the input graph: the encoder learns to create node embeddings by aggregating the features of neighboring nodes, and the decoder tries to reconstruct the original graph from those embeddings.

After training the autoencoder on the input graph, the decoder is used for link prediction. The decoder can predict the likelihood of the existence of an edge between two nodes by computing the similarity of their node embeddings. Specifically, given two nodes’ embeddings, the decoder can output a score representing the probability of the existence of an edge between these nodes. If the score is above a certain threshold, the edge is predicted to exist; otherwise, it is predicted not to exist.

PyG employs a random link split to partition the input graph into training, validation, and test sets. During training, the model is fit on the training set and its performance is monitored on the validation set. After the model has been trained, it is used for link prediction on the test set.

Overall, PyG’s approach to link prediction using autoencoder decoders involves learning node embeddings with an autoencoder and using these embeddings to predict edges’ existence between nodes.

The prediction steps are described below:

1. Node embeddings are created by an encoder that processes the graph with two convolution layers.

2. To enable the model to perform binary classification, with positive links from the original edges and negative links from added edges, we randomly add negative links to the original graph.

3. The decoder performs link prediction (i.e., binary classification) on all edges, including the negative links, using the node embeddings. To do this, it takes the element-wise product of the node embeddings from the pair of nodes on each edge and sums across the embedding dimension. This yields a single value per edge, representing the probability of edge existence (see the worked example below).
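As a toy illustration of step 3, with arbitrary 3-dimensional embeddings:

import torch

# Toy dot-product decoder on two hand-picked 3-dimensional embeddings
z_u = torch.tensor([0.5, -1.0, 2.0])   # embedding of node u
z_v = torch.tensor([1.0,  0.5, 1.0])   # embedding of node v
logit = (z_u * z_v).sum()              # 0.5 - 0.5 + 2.0 = 2.0
prob = torch.sigmoid(logit)            # ~0.88: the edge is predicted to exist
print(float(logit), float(prob))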

Link Prediction Architecture

The following image provides a summary of the edge split process for both the encoder and decoder, where only the colored edges are used in each stage.

Edge Split for Auto-Encoder.

Cuisines of Model Architectures

GCN

The GCN (Graph Convolutional Network)[3] is a type of convolutional neural network designed specifically for graph data. It operates by aggregating node features of neighboring nodes to compute node embeddings, which can then be used for various graph-based tasks such as node classification, link prediction, and graph classification.

In the context of link prediction, GCN is used to predict the presence or absence of edges between pairs of nodes in a graph. The process involves learning node embeddings using a GCN encoder and using these embeddings to predict the existence of edges between nodes using a GCN decoder.

The GCN encoder applies graph convolutions to the input graph by aggregating the features of a node’s neighbors and updating the node’s own feature representation. The process is repeated for each node, resulting in a set of node embeddings that represent the graph’s topology.
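Concretely, a single GCN layer as defined in [3] updates all node representations at once:

$$H^{(l+1)} = \sigma\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \right)$$

where $\tilde{A} = A + I$ is the adjacency matrix with added self-loops, $\tilde{D}$ is its diagonal degree matrix, $H^{(l)}$ holds the node features at layer $l$, $W^{(l)}$ is a learnable weight matrix, and $\sigma$ is a nonlinearity such as ReLU. Stacking two such layers, as we do below, lets each node embedding incorporate information from its 2-hop neighborhood.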

Next, the GCN decoder utilizes the learned node embeddings to predict the existence of edges between pairs of nodes. This is done by computing a score for each potential edge between the embeddings of the two corresponding nodes. If the score exceeds a certain threshold, the edge is predicted to be present; otherwise, it is predicted to be absent.

Overall, GCN-based models for link prediction can be trained using a supervised learning approach, where the model is trained on a labeled dataset of positive and negative edges to learn the optimal edge-prediction function. This function can then be used to predict the existence of edges in unseen graphs.

Our PyG implementation looks like the following:

import torch
from torch_geometric.nn import GCNConv

class Net(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        # The main GNN layers: two graph convolution layers
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def encode(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv2(x, edge_index)

    # Simple dot-product decoder: one logit per candidate edge
    def decode(self, z, edge_label_index):
        return (z[edge_label_index[0]] * z[edge_label_index[1]]).sum(dim=-1)

    # Score every possible pair of nodes; sigmoid turns logits into probabilities
    def decode_all(self, z):
        prob_adj = z @ z.t()
        return torch.sigmoid(prob_adj)

model = Net(flavorGraph.num_features, 128, 64)
optimizer = torch.optim.Adam(params=model.parameters(), lr=0.01)
criterion = torch.nn.BCEWithLogitsLoss()

GraphSAGE

GraphSAGE is a type of graph neural network that learns node representations by aggregating feature information from its local neighborhood[4]. GraphSAGE can be used for link prediction by training the model to predict whether a given edge exists between two nodes in the graph.

To use GraphSAGE for link prediction, each node’s representation is first computed from its local neighborhood: a fixed number of neighbors are sampled for each node and their feature information is aggregated with a trainable aggregation function. These aggregated representations can then be scored for each potential edge, for instance by a multi-layer perceptron (MLP) or, as in our implementation, by the same simple dot-product decoder used above.
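In equation form, a GraphSAGE layer from [4] updates the representation of a node $v$ as:

$$h_v^{(l+1)} = \sigma\left( W^{(l)} \cdot \mathrm{CONCAT}\left( h_v^{(l)},\ \mathrm{AGG}\left( \{ h_u^{(l)} : u \in \mathcal{N}(v) \} \right) \right) \right)$$

where $\mathcal{N}(v)$ is a (sampled) set of neighbors of $v$ and AGG is an aggregation function such as the mean, an LSTM, or pooling. Unlike GCN’s fixed normalized-adjacency filter, the aggregator treats a node’s own representation and its neighborhood separately.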

The decoder can be trained in a supervised manner using a binary cross-entropy loss function that penalizes incorrect edge predictions. During training, it learns to make edge predictions based on the local-neighborhood features of each node pair.

Once the model is trained, it can be used to predict edges in new graphs by feeding in the local neighborhood features of the nodes in the graph and applying the trained MLP to predict the presence or absence of edges between node pairs.

Overall, GraphSAGE has shown promising results in link prediction tasks and can be easily applied to a variety of graph-based applications such as social networks, recommendation systems, and knowledge graphs.

Our PyG implementation for GraphSAGE can be found below:

import torch
from torch_geometric.nn import SAGEConv

class Net(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        # Two GraphSAGE convolution layers
        self.conv1 = SAGEConv(in_channels, hidden_channels)
        self.conv2 = SAGEConv(hidden_channels, out_channels)

    def encode(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv2(x, edge_index)

    # Dot-product decoder: one logit per candidate edge
    def decode(self, z, edge_label_index):
        return (z[edge_label_index[0]] * z[edge_label_index[1]]).sum(dim=-1)

    # Score every possible pair of nodes; sigmoid turns logits into probabilities
    def decode_all(self, z):
        prob_adj = z @ z.t()
        return torch.sigmoid(prob_adj)

model = Net(flavorGraph.num_features, 128, 64)
optimizer = torch.optim.Adam(params=model.parameters(), lr=0.01)
criterion = torch.nn.BCEWithLogitsLoss()

Training (Turn On the Oven!)

Due to computational limitations, we utilized the GCP resources provided by the course to train our models. For hyperparameters, we employed the Adam optimizer with a learning rate of 0.01. Our model architecture consisted of two layers of either GCN or GraphSAGE, with a hidden dimension of 128 and an output embedding size of 64, as in the code above. We used BCEWithLogitsLoss as our loss and ROC-AUC as our metric for the link prediction task, and trained both models for 150 epochs.

Here’s our training code:

import torch
from sklearn.metrics import roc_auc_score
from torch_geometric.utils import negative_sampling

def train():
    model.train()
    optimizer.zero_grad()
    z = model.encode(train_data.x, train_data.edge_index)

    # We perform a new round of negative sampling for every training epoch:
    neg_edge_index = negative_sampling(
        edge_index=train_data.edge_index, num_nodes=train_data.num_nodes,
        num_neg_samples=train_data.edge_label_index.size(1), method='sparse')

    # Concatenate positive and freshly sampled negative supervision edges
    edge_label_index = torch.cat(
        [train_data.edge_label_index, neg_edge_index],
        dim=-1,
    )
    edge_label = torch.cat([
        train_data.edge_label,
        train_data.edge_label.new_zeros(neg_edge_index.size(1))
    ], dim=0)

    out = model.decode(z, edge_label_index).view(-1)
    loss = criterion(out, edge_label)
    loss.backward()
    optimizer.step()
    return loss

@torch.no_grad()
def test(data):
    model.eval()
    z = model.encode(data.x, data.edge_index)
    out = model.decode(z, data.edge_label_index).view(-1).sigmoid()
    y = data.edge_label.cpu().numpy()
    pred = out.cpu().numpy()
    return roc_auc_score(y, pred)

validationMetrics_SAGE = []
best_val_auc = final_test_auc = 0
for epoch in range(1, 151):  # 150 epochs
    loss = train()
    val_auc = test(val_data)
    test_auc = test(test_data)
    if val_auc > best_val_auc:
        best_val_auc = val_auc
        final_test_auc = test_auc
    validationMetrics_SAGE.append([val_auc, test_auc])
    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}, Val: {val_auc:.4f}, '
          f'Test: {test_auc:.4f}')
print(f'Final Test: {final_test_auc:.4f}')

# Compute final edge probabilities on the test graph
z = model.encode(test_data.x, test_data.edge_index)
final_edge_probs_SAGE = model.decode_all(z)
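To turn the dense probability matrix from decode_all into concrete pairing suggestions, we can mask out edges the model already knows about and take the highest-scoring remaining pairs. Below is a rough sketch; index_to_name (a mapping from node index back to ingredient name) is a hypothetical helper, and ideally all 111,355 known edges would be masked, not just the test split’s message-passing edges:

import torch

# Sketch: recover the top-k most probable *new* ingredient pairings
probs = final_edge_probs_SAGE.clone()
probs.fill_diagonal_(0)  # ignore self-pairings
# Zero out edges already present in the graph seen by the encoder
probs[test_data.edge_index[0], test_data.edge_index[1]] = 0
probs = torch.triu(probs)  # undirected: keep one copy of each pair

top = torch.topk(probs.flatten(), k=10)
rows = torch.div(top.indices, probs.size(1), rounding_mode='floor')
cols = top.indices % probs.size(1)
for u, v, p in zip(rows.tolist(), cols.tolist(), top.values.tolist()):
    print(index_to_name[u], '+', index_to_name[v], f'(p={p:.2f})')  # hypothetical helper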

The final plots after training were:

Test results for our GNN models.

Evaluation (The Taste Test)

Big Brother is watching

To evaluate how well our models predict ingredient pairings, we employed the ROC-AUC (Receiver Operating Characteristic, Area Under the Curve) metric, which measures the area under the curve of the true positive rate plotted against the false positive rate. GraphSAGE achieved a test AUC of only 50-55%, whereas GCN blew the metric out of the frying pot, achieving a test AUC close to 90%. This makes sense, as GCNs generally perform better than GraphSAGE when the graph has a homogeneous structure, meaning the nodes play similar roles and have similar connectivity patterns across the graph. GCN’s message-passing mechanism, which applies a fixed filter over the graph, is particularly effective at capturing such homogeneous structure.
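Intuitively, ROC-AUC is the probability that a randomly chosen positive edge is scored above a randomly chosen negative one. A quick toy illustration (values are arbitrary, not from our data):

from sklearn.metrics import roc_auc_score

# Two positive and two negative edges; one positive/negative pair is
# mis-ordered, so 3 of the 4 comparisons are correct -> AUC = 0.75
y_true = [1, 1, 0, 0]
y_score = [0.9, 0.6, 0.7, 0.2]
print(roc_auc_score(y_true, y_score))  # 0.75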

While some of our recipe predictions make sense (e.g., {tortilla, pepper, green onion, garlic powder, egg, soy sauce} or {almond half, almond syrup, apple-smoked bacon}), some clearly do not! For example, {ground beef, acorn, cream of mushroom soup, tortilla} is simply a crime against humanity. Predicting whole subgraphs instead of individual links might make some recipes less weird, since that would account for the overall interaction among all ingredients rather than just asking which ingredient pairs best with the one most recently added.

Our algorithm makes amazing tacos!

References

[1] Park, D., Kim, K., Kim, S. et al. FlavorGraph: a large-scale food-chemical graph for generating food representations and recommending food pairings. Sci Rep 11, 931 (2021).

[2] Kipf, T.N. & Welling, M. Variational Graph Auto-Encoders (2016).

[3] Kipf, T.N. & Welling, M. Semi-Supervised Classification with Graph Convolutional Networks (2017).

[4] Hamilton, W.L., Ying, R. & Leskovec, J. Inductive Representation Learning on Large Graphs (2018).
