Unraveling Culinary Secrets with the Restricted Boltzmann Machine

A Deep Dive into Ingredient Patterns

Alexander Wei
8 min read · Jan 6, 2024
A chef stands in front of a cutting board, illuminated with a point-cloud. Image by author

Welcome to our latest exploration where data science meets culinary art! We’ve embarked on a fascinating journey to decode the intricate relationships between ingredients in recipes. To uncover these ingredient dynamics, we use

  1. a powerful tool from the realm of machine learning: the Restricted Boltzmann Machine (RBM), and
  2. a Unicode-text dataset of Food.com recipe summaries and ingredients.

In this article, we will see how a raw dataset of recipe ingredients, consisting of Unicode strings, can be transformed into a format suitable for machine learning. Next, we build an RBM to uncover hidden relationships between recipe ingredients. Spatial embeddings of ingredients are inferred by training on the Food.com dataset of 500k recipes.

After establishing an efficient data pipeline in PyTorch, a natural progression is to explore advanced machine learning models that can leverage this setup. One such model is the Restricted Boltzmann Machine. RBMs are a type of generative stochastic neural network that can learn a probability distribution over its set of inputs. They are particularly effective in feature detection and have applications in dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling.

Introduction

In the culinary world, ingredients form complex relationships, creating a web of flavors and textures. Our goal is to untangle this web and discover how ingredients co-occur in recipes. We feed the RBM a dataset comprising various recipes, each with a unique combination of ingredients.

The RBM analysis leads to some eye-opening insights. For instance, we find distinct ingredient groups that frequently appear together. Among the groups, we observe

  • Baking Essentials and Nuts: This cluster contains ingredients commonly used in baking, such as vanilla and almonds, along with corn, which is versatile in many baked goods.
  • Flavorful and Exotic Mixes: Ingredients like powder, cornmeal, and spices (clove) suggest a mix used in diverse and flavorful dishes, along with fruits like strawberries.
  • Sweet and Spiced Treats: This cluster, with honey, cookies, and turmeric, seems to represent ingredients used in sweet treats that are either spiced or have healthful properties.
  • Savory and Rich Ingredients: Ingredients like curry, flour, and fat suggest a focus on rich, savory dishes, possibly in Asian or Indian cuisine.
  • Gourmet and Specialty: With ingredients like pear, Dijon, and sharp flavors, this cluster might represent more refined, possibly gourmet cooking ingredients.
  • Hearty Staples: Ingredients like frozen vegetables, carrots, and potatoes are staples in hearty, everyday meals.
  • Meaty and Bold Flavors: This cluster, with sausage, broth, and hot spices, suggests ingredients used in meaty dishes with bold flavors.
  • Fresh and Earthy Produce: With mushrooms, portabella, and fresh sprigs, this cluster seems to focus on fresh, earthy produce, likely used in healthful and vegetarian dishes.
  • Delicate and Fine Dining: Ingredients like cider, crabmeat, and asparagus are often found in more delicate dishes, suggesting a fine dining theme.

Remarkably, the RBM learns all of these ingredient groupings without supervision. Future analyses could identify ingredients through context, for example by training on recipes from distinct cuisines.

RBM Architecture and Principles

The RBM consists of two layers and a corresponding two-step procedure. A visible layer represents the input data, and a hidden layer learns features from this data. In the most common variant, the visible units are binary (Bernoulli) variables.

The input layer — visible units for probabilities of each Bernoulli event — is mapped to the hidden layer.
The hidden layer is mapped back to the visible units.
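
For the Bernoulli RBM used here, these two steps are the standard conditional probabilities, labeled (1) and (2) so we can refer back to them later:

p(h_j = 1 \mid v) = \sigma\Big( b^{h}_j + \sum_i W_{ij}\, v_i \Big)    (1)

p(v_i = 1 \mid h) = \sigma\Big( b^{v}_i + \sum_j W_{ij}\, h_j \Big)    (2)

where \sigma is the logistic sigmoid, W is the weight matrix connecting the two layers, and b^{v}, b^{h} are the visible and hidden biases.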

Unlike the traditional Boltzmann Machine and other neural models like recurrent networks, there are no intra-layer connections — the layers are ‘restricted’ to connections across layers only. This structure allows for efficient training algorithms, such as contrastive divergence, to adjust the weights of the network.

Diagram of an RBM showing connections between the visible and hidden layers. Credit: commons.wikimedia.org

Integrating the RBM with the Data Pipeline

We previously built a pipeline for streaming training data from the filesystem. The dataset we use consists of nearly 1 GB of Unicode strings representing raw recipe data. The RBM learns a unique embedding for each ingredient name, and these embeddings occupy far more RAM than the original strings. We therefore use the PyTorch DataLoader to stream samples into memory during training. Individual ingredient names are one-hot encoded from the column

  • ingredients: Cleaned and processed list of ingredients.

containing a recipe’s ingredient names in list format:

["cherry pie filling", "condensed milk", "melted margarine", "cinnamon", "nutmeg"]

Training Data: Because ingredient names tend to be short, and principal nouns dominate the start of each name, it suffices to consider only the leading two words in each ingredient. The recipe above is thus encoded as the multi-hot vector of ingredients

cherry, pie, condensed, milk, melted, margarine, cinnamon, nutmeg

Constructing a multi-hot recipe vector from an ingredient list

These multi-hot encoded vectors of ingredients are our training samples. The RBM takes as input batches of recipes x and learns weights, specifically W, b in equations 1–2, to reconstruct the input ingredients.
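
As a concrete illustration, here is a minimal sketch of this encoding step, using only the example recipe above as the vocabulary (the full pipeline below builds the vocabulary from the entire dataset):

from sklearn.preprocessing import MultiLabelBinarizer

ingredients = ["cherry pie filling", "condensed milk", "melted margarine",
               "cinnamon", "nutmeg"]

# keep only the leading two words of each ingredient name
words = [w for name in ingredients for w in name.split(" ")[:2]]
# ['cherry', 'pie', 'condensed', 'milk', 'melted', 'margarine', 'cinnamon', 'nutmeg']

mlb = MultiLabelBinarizer()
x = mlb.fit_transform([words])[0]  # multi-hot vector with one entry per vocabulary word
print(dict(zip(mlb.classes_, x)))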

Integrating an RBM with our established PyTorch data pipeline involves feeding the data streamed from the disk directly into the RBM for training. This integration is seamless thanks to PyTorch’s flexible data handling, allowing the RBM to process large datasets that are otherwise challenging to handle in memory.

Now that we have gone over the preliminaries, let us start coding. We can extend the pandas DataFrame with methods for processing the raw string data.

from pandas import DataFrame as pdf, Series, read_csv
from tqdm.auto import tqdm
from random import sample, randint
from ast import literal_eval

class DataFrame(pdf):
    """DataFrame structure for loading ingredient lists from the Food.com dataset.
    Extends pandas.DataFrame.
    Implements preprocess(), collect_first_words() methods for parsing raw ingredient data.
    """
    ingredients: Series
    sampled_words: Series
    id: Series

    def __init__(self, data: str | pdf, *ac, **av) -> None:
        """Init from a CSV path or a pandas.DataFrame."""
        if isinstance(data, str):
            data = DataFrame.read_csv(data, *ac, **av)

        super().__init__(data=data)
        self.preprocess()
        self.collect_first_words()

    def collect_first_words(self):
        """Sample a random subset of each recipe's ingredients and keep the first two words of each name."""
        ingr_aggregator = []
        self['sampled_words'] = Series(dtype=object)
        idx_ingredients = list(zip(self.id, self.ingredients))
        for idx, ingred_list in tqdm(idx_ingredients):
            # idx: recipe id; ingred_list: list of ingredient names for this recipe
            for ingr_group in sample(ingred_list, randint(0, len(ingred_list))):
                # random subset of the recipe's ingredients
                for w in collect_first_k_words(ingr_group, 2):
                    # first two words (split on ' ') of each ingredient name
                    ingr_aggregator.append({'id': idx,
                                            'ingredients': w})
        parsed_ingrs = pdf(ingr_aggregator)
        parsed_ingr_groups = parsed_ingrs.groupby("id").agg(list)
        idx_lookup = self.reset_index().set_index('id').to_dict()['index']

        for idx, ingreds in tqdm(parsed_ingr_groups.iterrows(),
                                 total=len(parsed_ingr_groups)):
            self.at[idx_lookup[idx], 'sampled_words'] = ingreds.to_list()[0]

    @staticmethod
    def read_csv(path, *ac, **av):
        return read_csv(path, *ac, **av)

    def preprocess(self):
        """Parse the raw 'ingredients' strings into Python lists."""
        ingredients = self.ingredients.apply(try_literal_eval)
        self['ingredients'] = ingredients

def collect_first_k_words(ingredient_name: str, k: int):
    """Yield the first k whitespace-separated words of an ingredient name."""
    yield from ingredient_name.split(" ")[:k]

def try_literal_eval(s):
    """Convert a raw string into a Python object, falling back to an empty list."""
    try:
        return literal_eval(s)
    except (ValueError, SyntaxError):
        return []
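
A minimal usage sketch (the CSV path below is hypothetical; point it at your local copy of the Food.com recipes file):

df = DataFrame("./data/recipes.csv")  # hypothetical path to the raw Food.com CSV
print(df.sampled_words.head())        # per-recipe lists of sampled ingredient words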

The MultiLabelBinarizer from scikit-learn is used to convert recipe ingredient lists into multi-hot encoded vectors:

from typing import List
from types import NoneType
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np

from ingest import DataFrame

class Binarizer(MultiLabelBinarizer):
    """Multi-hot encoder for sampled ingredient words."""
    dataframe = None

    def __init__(self, dataframe: DataFrame, classes: List | None = None, sparse_output: bool = False) -> None:
        super().__init__(classes=classes, sparse_output=sparse_output)
        # learn the vocabulary of ingredient words from the dataframe
        self.fit(dataframe.sampled_words)

    @staticmethod
    def load(dataframe: DataFrame, mlb: MultiLabelBinarizer | NoneType = None):
        """Return the multi-hot training matrix, fitting a new Binarizer if none is given."""
        if isinstance(mlb, NoneType):
            mlb = Binarizer(dataframe=dataframe)
        xtrain = mlb.transform(dataframe.sampled_words)

        return xtrain.astype(np.float16)
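
Continuing the sketch above, constructing the training matrix from the preprocessed DataFrame then looks like this (illustrative only; the full pipeline fits one binarizer and reuses it across batches):

mlb = Binarizer(dataframe=df)         # fit the ingredient-word vocabulary
xtrain = Binarizer.load(df, mlb=mlb)  # multi-hot float16 matrix, one row per recipe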

Training the RBM

Training an RBM involves adjusting its weights to minimize the reconstruction error of the input data. This process typically uses a contrastive divergence algorithm. We can train the RBM using mini-batches of data provided by our data pipeline, making the training process both efficient and scalable.
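
To make the update rule concrete, here is a minimal CD-1 sketch for a Bernoulli RBM written in plain PyTorch (tensor names are illustrative; the actual training below relies on Bianconi's implementation, which extends this to CD-k with momentum and weight decay):

import torch

def cd1_step(v0, W, b_v, b_h, lr=1e-3):
    """One contrastive-divergence (CD-1) update for a Bernoulli RBM."""
    # positive phase: hidden probabilities given the data
    p_h0 = torch.sigmoid(v0 @ W + b_h)
    h0 = torch.bernoulli(p_h0)
    # negative phase: one Gibbs step back to the visible layer and up again
    p_v1 = torch.sigmoid(h0 @ W.t() + b_v)
    p_h1 = torch.sigmoid(p_v1 @ W + b_h)
    # parameter updates: data statistics minus model statistics
    W += lr * (v0.t() @ p_h0 - p_v1.t() @ p_h1) / v0.shape[0]
    b_v += lr * (v0 - p_v1).mean(dim=0)
    b_h += lr * (p_h0 - p_h1).mean(dim=0)
    # reconstruction error as a rough progress signal
    return torch.sum((v0 - p_v1) ** 2)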

We modify Gabriel M. Bianconi’s CUDA implementation of the RBM in PyTorch, adding density (L1) penalties

torch.norm(self.hidden_bias, 1)
torch.norm(self.negative_visible_probabilities, 1)

and a wrapper method that accepts ingredient names as Unicode strings:

import torch
import numpy as np
from typing import List
from tqdm.auto import tqdm
from torch.utils.data import DataLoader
from rbm import RBM  # Gabriel M. Bianconi's PyTorch RBM implementation (see resources)

class IngredientsRBM(RBM):
    data_loader: DataLoader

    def __init__(self, data_loader, num_visible, num_hidden, k, learning_rate=0.001,
                 momentum_coefficient=0.5, weight_decay=0.0001, use_cuda=True):
        super().__init__(num_visible, num_hidden, k, learning_rate,
                         momentum_coefficient, weight_decay, use_cuda)
        self.data_loader = data_loader

    def fit(self, n_epochs: int = 4, batch_size: int = 4):
        for epoch in range(n_epochs):
            train_loader = self.data_loader
            epoch_error = 0.0
            frame_log = tqdm(total=0, position=2, bar_format='{desc}')
            i = 0
            for batch_idx, batch in tqdm(enumerate(train_loader), total=len(train_loader)):
                # each loader item is a pre-built mini-batch of multi-hot recipe vectors
                batch = batch.view(batch_size, self.num_visible)
                batch_error = self.contrastive_divergence(batch.cuda())
                # L1 density penalties on the hidden bias and reconstructed visible probabilities
                batch_error += 1 * torch.norm(self.hidden_bias, 1)
                batch_error += 1 * torch.norm(self.negative_visible_probabilities, 1)
                epoch_error += batch_error.cpu()
                frame_log.set_description_str("bat %d: %.4f" % (i, epoch_error / (i + 1)))
                i += 1
            print('Epoch Error (epoch=%d): %.4f' % (epoch, epoch_error))

    def str_sample_hidden(self, x: List[List[str]], binarizer: Binarizer, external=True):
        """Embed lists of ingredient-name strings via the hidden layer."""
        rbm_embeds = super().sample_hidden(torch.Tensor(
            np.array([binarizer.transform([u]) for u in x])[:, 0, :]).cuda())
        if external:
            rbm_embeds = rbm_embeds.detach().cpu().numpy()
        return rbm_embeds

# Training code (one epoch ~ 1 hour); Dataset and BatchedDataFrame come from the
# streaming data pipeline built previously
dataset = Dataset("./out/batches_0_4.job")
data_loader = DataLoader(dataset, batch_size=1, shuffle=True, num_workers=3, prefetch_factor=64)
mlb, _ = BatchedDataFrame.batches(df, batch_size=4)

VISIBLE_UNITS = len(mlb.classes_)
HIDDEN_UNITS, CD_K = 12 * 10**3, 12

rbm = IngredientsRBM(data_loader, VISIBLE_UNITS, HIDDEN_UNITS, CD_K)
rbm.fit(n_epochs=1, batch_size=4)

Exploring the Learned Embeddings

The RBM is trained to minimize the standard Bernoulli-RBM energy function

E(v, h) = -\sum_i b^{v}_i\, v_i - \sum_j b^{h}_j\, h_j - \sum_{i,j} v_i\, W_{ij}\, h_j

over visible units v and hidden units h, each representing discrete ingredient-presence events in a product space over all ingredients. To recap, the inputs were multi-hot vectors x representing the ingredients present in a recipe. We can now isolate individual ingredients through one-hot vectors and explore the embeddings

h(x) = \sigma\big( b^{h} + W^{\top} x \big)

that the RBM produces. Applying dimensionality reduction to the space of embeddings yields intriguing patterns. The plot below shows a 2-D point cloud produced via t-SNE, with colored groupings from K-Means (initialized to 12 clusters) in the embedding space.
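
A sketch of how such a plot can be produced from the trained model above (the library choices and plotting details here are assumptions; the article's notebooks may differ):

from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# one embedding per vocabulary word: pass each word through the trained RBM
ingredient_words = [[w] for w in mlb.classes_]
embeds = rbm.str_sample_hidden(ingredient_words, binarizer=mlb)

clusters = KMeans(n_clusters=12, n_init=10).fit_predict(embeds)  # 12 colored groupings
points = TSNE(n_components=2).fit_transform(embeds)              # 2-D point cloud

plt.scatter(points[:, 0], points[:, 1], c=clusters, s=4, cmap="tab20")
plt.title("RBM ingredient embeddings (t-SNE), colored by K-Means cluster")
plt.show()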

Dimension-reduced embeddings of sampled recipe ingredients learned by an RBM. Image by author

Conclusion

Aside from generating thought-provoking embeddings, RBMs have a wide range of applications. They can be used for collaborative filtering in recommendation systems, feature extraction in classification tasks, and as building blocks in more complex deep learning architectures like Deep Belief Networks (DBNs). Integrating a Restricted Boltzmann Machine with our PyTorch data pipeline opens a plethora of opportunities in advanced machine learning tasks. The ability to handle large datasets efficiently makes RBMs a powerful tool in the data scientist’s toolkit.

The implications of these findings extend far beyond the kitchen. They can revolutionize how we approach recipe recommendation systems, making them more nuanced and personalized. Nutritionists and diet planners can use these insights to create balanced meal plans, considering ingredient synergies for health and flavor.

Furthermore, this project exemplifies the power of data science in transforming how we understand everyday activities like cooking. By applying RBM, we’ve gleaned valuable insights from raw data, demonstrating the potential of machine learning in culinary innovation and beyond.

Through this exploration, we’ve just scratched the surface of what’s possible when data science techniques are applied to the culinary world. The future is ripe with possibilities for more such delicious data-driven discoveries!

Happy data cooking! Image by author

Further Reading and Resources

[1] Dataset: Food.com Recipes with Ingredients and Tags.
[2] Complete Code: Preprocessing, Training, Clustering, and notebooks.
[3] Friendly Video-Guide for Understanding the RBM, Gibbs Sampling, and Contrastive Divergence.
[4] CUDA implementation of the RBM in PyTorch.

Acknowledgments

Thanks to the machine learning and Food.com community for their continuous research and contributions to the field.
