Search Engine for Food Recipes Using a Bipartite Graph

Sukesh
4 min read · Oct 30, 2022


Introduction

There are a lot of food websites on the internet, and every site has its own search engine to retrieve recipes based on user queries. The goal of each site's search engine (SE) is to return the recipes that most closely match the user's query. However, users often don't have all the ingredients of an SE-recommended recipe, which is one of the main drawbacks of this kind of search engine. To address this problem, we built a search engine that retrieves food recipes based on the ingredients the user has, instead of a recipe name.

Dataset

Where there is data smoke, there is business fire. — Thomas Redman

As we know, data is the fuel for a machine learning model, so we gave dataset preparation special attention. In the food space, many datasets are publicly available (Recipe1M+, Food101, IIITD CulinaryDB, etc.). However, they are missing some important features such as the recipe image, cuisine, ratings, or food category (i.e., dinner or lunch), so we decided to scrape data from websites such as Times of India and Allrecipes. We scraped the recipe title, ingredients, cuisine, category, instructions, recipe image, ratings, and webpage link. Next, we will discuss the techniques used and the challenges solved during data cleaning.

Data Cleaning

No data is clean, but most is useful. — Dean Abbott

Since the collected data had a set of ingredient strings for each dish, we needed to build a high-quality ingredient vocabulary, so we passed the data through a custom-built pipeline, shown in Figure 1, that eliminates noise in the data

Figure 1: Data Cleaning Pipeline for the graph

such as numerical quantity words (1/2 cups, 2 leaves). It also removes the most frequently appearing words (salt, water, etc.) that we do not consider food ingredients, because they do not help distinguish between recipes. We created a regex function to remove adjectives, adverbs, and action words (mashed, remove, grated, fresh). Finally, we lemmatize the ingredients (chillies, cloves, almonds). Figure 2 shows the output of the pipeline.

Figure 2: Pipeline Output
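A minimal sketch of such a cleaning step in Python might look like the following; the word lists and the quantity regex here are illustrative assumptions, not the exact ones used in the project.

import re
from nltk.stem import WordNetLemmatizer  # nltk.download("wordnet") may be needed on first run

# Illustrative word lists -- the real pipeline uses larger, corpus-driven sets.
FREQUENT_NON_INGREDIENTS = {"salt", "water", "oil"}
ACTION_WORDS = {"mashed", "grated", "chopped", "fresh", "finely", "remove"}
UNITS = r"(cups?|tbsps?|tsps?|grams?|leaves|leaf)"

lemmatizer = WordNetLemmatizer()

def clean_ingredient(raw):
    text = raw.lower()
    # Drop numeric quantities and units, e.g. "1/2 cups", "2 leaves"
    text = re.sub(r"\d+(\s*/\s*\d+)?\s*" + UNITS + r"?", " ", text)
    # Keep only word tokens that are not frequent non-ingredients or action words
    tokens = [t for t in re.findall(r"[a-z_]+", text)
              if t not in FREQUENT_NON_INGREDIENTS and t not in ACTION_WORDS]
    # Lemmatize so plural forms like "almonds" collapse to "almond"
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)

print(clean_ingredient("2 cups finely grated fresh Almonds"))  # -> "almond"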

Model

A theory is just a mathematical model to describe the observations.

After this rigorous preprocessing, we have a high-quality ingredient vocabulary. Now we can create a model to retrieve food recipes based on user-given ingredients. We tried two methodologies for this problem:

  • Embedding-based technique
  • Graph-based technique

Embedding-based technique

People in this space use embedding-based techniques to retrieve similar recipes, so we also created an embedding for each ingredient from the recipe instructions. To represent a recipe in the embedding space, we simply average the embeddings of its ingredients. Figure 3 shows the embedding for a lemon juice recipe.

Figure 3: Embedding for lemon juice (example)

Based on the user-given ingredients, we can retrieve recipes that are close in the embedding space using the cosine distance function shown in Figure 4.

Figure 4: Cosine Distance formula
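As a rough illustration, the sketch below uses made-up three-dimensional ingredient vectors to show how a recipe vector is built by averaging ingredient embeddings and then compared to a query with cosine similarity; the actual vectors come from the trained embeddings.

import numpy as np

# Hypothetical 3-dimensional ingredient vectors; in practice these come from
# embeddings trained on recipe instructions.
ingredient_vecs = {
    "lemon": np.array([0.2, 0.7, 0.1]),
    "sugar": np.array([0.6, 0.1, 0.3]),
    "water": np.array([0.1, 0.2, 0.9]),
}

def recipe_vector(ingredients):
    # A recipe is represented as the mean of its ingredient embeddings.
    return np.mean([ingredient_vecs[i] for i in ingredients], axis=0)

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (||a|| * ||b||); closer to 1 means more similar
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = recipe_vector(["lemon", "sugar"])
lemon_juice = recipe_vector(["lemon", "sugar", "water"])
print(cosine_similarity(query, lemon_juice))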

Graph-based technique

For the search engine, the embedding-based technique did not give significant results, because creating an embedding for every ingredient is complex. To overcome this problem, we took a graph-based approach. Using a bipartite graph, we can retrieve recipes simply and effectively. Figure 5 shows a bipartite graph.

Figure 5: Bipartite graph

A bipartite graph has two sets of nodes, and nodes within the same set are not connected to each other. Nodes {RA, RB, RC, RD} belong to the Recipes set and nodes {IA, IB, IC, ID} belong to the Ingredients set. We connect each ingredient to the recipes it belongs to and assign a weight to each edge using the TF-IDF method. Then, based on the user-given ingredients, we retrieve the recipes from the Recipes set that match the most user-given ingredients in the Ingredients set, and sort them by the sum of the edge weights of the matched ingredients.
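A toy version of this retrieval logic, with the node names from Figure 5 and made-up edge weights standing in for the real TF-IDF values, could look like this:

from collections import defaultdict

# Toy bipartite graph: recipe -> {ingredient: TF-IDF weight}.
# Node names follow Figure 5; the weights are invented for illustration.
recipe_edges = {
    "RA": {"IA": 0.8, "IB": 0.3},
    "RB": {"IB": 0.5, "IC": 0.6},
    "RC": {"IA": 0.4, "IC": 0.2, "ID": 0.7},
}

# Inverted index: ingredient -> [(recipe, edge weight)]
ingredient_index = defaultdict(list)
for recipe, edges in recipe_edges.items():
    for ingredient, weight in edges.items():
        ingredient_index[ingredient].append((recipe, weight))

def search(user_ingredients, top_k=3):
    scores = defaultdict(float)
    for ing in user_ingredients:
        for recipe, weight in ingredient_index.get(ing, []):
            scores[recipe] += weight  # sum the weights of matched edges
    # Recipes matching more (and more important) ingredients rank higher.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

print(search(["IA", "IC"]))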

Ingredient Alternative

In some cases, the user doesn't have an ingredient mentioned in the recipe's instructions. To address this problem, we created a context-based embedding for each ingredient and, using the cosine distance function, recommend an alternative or substitute ingredient that is very close to the missing ingredient in the embedding space.

Data

To create an embedding for each ingredient, we used the recipe instructions as input data. Because the instructions contain some noise, we created a custom preprocessing pipeline to eliminate it: removing stopwords, lemmatizing the text, lowercasing characters, and grouping multi-word ingredients into single tokens (e.g., red pepper becomes red_pepper).

Figure 6: Data Cleaning Pipeline
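A simplified sketch of this instruction-preprocessing step, with a small hand-picked multi-word list standing in for the real one, might be:

import re
from nltk.corpus import stopwords          # nltk.download("stopwords") may be needed
from nltk.stem import WordNetLemmatizer

STOPWORDS = set(stopwords.words("english"))
# Hypothetical list of multi-word ingredients to merge into single tokens.
MULTI_WORD = {"red pepper": "red_pepper", "lemon juice": "lemon_juice"}
lemmatizer = WordNetLemmatizer()

def preprocess_instruction(text):
    text = text.lower()
    for phrase, merged in MULTI_WORD.items():
        text = text.replace(phrase, merged)   # group multi-word ingredients
    tokens = re.findall(r"[a-z_]+", text)
    return [lemmatizer.lemmatize(t) for t in tokens if t not in STOPWORDS]

print(preprocess_instruction("Add the Red Pepper and squeeze lemon juice over it."))
# -> ['add', 'red_pepper', 'squeeze', 'lemon_juice']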

Model

The Word2Vec technique is commonly used to create context-based word embeddings, but Word2Vec can't handle unknown words (out-of-vocabulary, OOV), so we use the FastText model instead: it builds its vocabulary from subword tokens, which lets it create an embedding for every ingredient.
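A minimal sketch using gensim's FastText on a toy corpus (the token lists and hyperparameters here are assumptions for illustration, not the project's actual settings) could look like this:

from gensim.models import FastText

# Toy corpus: token lists as produced by the preprocessing pipeline above.
tokenized_instructions = [
    ["boil", "red_pepper", "with", "onion"],
    ["squeeze", "lemon_juice", "over", "salad"],
    ["mix", "lime_juice", "and", "sugar"],
]

# FastText builds vectors from character n-grams (subwords), so it can still
# produce an embedding for an ingredient it has never seen (OOV words).
model = FastText(sentences=tokenized_instructions, vector_size=100,
                 window=5, min_count=1, epochs=50)

# Recommend substitutes: nearest neighbours of the missing ingredient.
print(model.wv.most_similar("lemon_juice", topn=3))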

Thanks for your time, I'm looking forward to your comments :)

