A Tale of Two Models

“The key to artificial intelligence has always been the representation.” — Jeff Hawkins

Mikhail Seryy
5 min read · Nov 22, 2019

This is Part 2 of a study of approximately 40,000 recipes from different cuisines across the globe, collected in the United States.

In the dataset, recipes are given with a cuisine class as the target to predict. Each recipe is a list of ingredients (water, salt, wheat, etc.) paired with a class indicating its origin (Greek, Mexican, etc.).

I formulated the question as follows: can a machine learning model learn the data well enough that, given a specific list of ingredients, it can predict which cuisine they belong to?

This means I need a model that can identify the ingredients unique to each cuisine, ignore common ingredients like salt, and classify recipes correctly.

Two learning models were created.

Model #1. Random Forest Classifier

The dataset has a large number of recipes, about forty thousand, and each recipe has a potentially slightly different set of ingredients. A typical ingredient list contains four to twenty items, while roughly 7,000 other possible ingredients do not appear in it. So when each example is encoded with a one for every ingredient that is included and a zero for every one that is not, the vast majority of entries are zeros, making the matrix extremely sparse.
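The encoding described above can be sketched with scikit-learn's MultiLabelBinarizer; the two recipes below are made up for illustration, while the real dataset has ~40,000 recipes and ~7,000 unique ingredients:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical mini-corpus standing in for the real recipe lists.
recipes = [
    ["salt", "water", "tomato", "basil"],
    ["salt", "tortilla", "beans"],
]

mlb = MultiLabelBinarizer()
X = mlb.fit_transform(recipes)  # one column per unique ingredient, 1 if present

print(mlb.classes_)  # sorted ingredient vocabulary
print(X)             # binary matrix: one row per recipe, mostly zeros at scale
```

On the full dataset the same call produces a 40,000 x ~7,000 binary matrix, which is why sparsity matters for the model.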

With any machine learning (ML) model, we start with a baseline.

A baseline is a method that uses heuristics, simple summary statistics, randomness, or machine learning to create predictions for a dataset. You can use these predictions to measure the baseline’s performance (e.g., accuracy); this metric then becomes what you compare any other machine learning algorithm against. Here I am using a simple calculation to see how many times each cuisine appears in the dataset. The results are as follows:

What this means is that Italian recipes make up about 20% of the dataset, Mexican recipes about 16%, and so on. So if I simply guessed that every recipe is Italian, I would be accurate about 20% of the time. This is not a very strong baseline, and it corresponds directly to the Recipe Count graph below.
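This majority-class baseline amounts to counting labels; a minimal sketch with a hypothetical handful of labels in place of the real target column:

```python
import pandas as pd

# Hypothetical cuisine labels; the real column has ~40,000 rows.
y = pd.Series(["italian", "italian", "mexican", "greek", "italian"])

# Normalized value counts give each cuisine's share of the dataset;
# the largest share is the accuracy of always guessing that cuisine.
shares = y.value_counts(normalize=True)
baseline = shares.iloc[0]

print(shares)
print(f"baseline accuracy: {baseline:.2%}")
```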

Let’s apply a Random Forest Classifier (RFC) to the dataset.

The random forest is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree.

Applying the RFC model achieves an Accuracy Score of 65.87%. By adjusting the model’s parameters, a better Accuracy Score can be obtained. First, let’s look at the optimal number of trees to build.

As we can see from the graph, after about 40 trees the Random Forest’s Accuracy Score does not change significantly. Therefore, 40 trees will be used in the model.
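One way to produce this kind of sweep is to train the forest at several tree counts and compare held-out accuracy. The sketch below uses synthetic data in place of the recipe matrix, so the exact scores are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the binary recipe matrix and cuisine labels.
X, y = make_classification(n_samples=500, n_features=50, n_informative=10,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Record test accuracy for each tree-count setting; accuracy typically
# plateaus once the forest is large enough.
scores = {}
for n in [5, 10, 20, 40, 80]:
    rfc = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_tr, y_tr)
    scores[n] = rfc.score(X_te, y_te)

print(scores)
```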

There are close to 7,000 ingredients (features) in the dataset. Using all of them to train the RFC model would not be cost-effective. Let us see how many features should be used for the most advantageous result.

According to the graph, 45 features is the optimal number for improving the model’s Accuracy Score.

What are the ingredients that our model has selected that were the most useful in making predictions? Here are the top 20 Ingredients.
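Extracting a top-k ingredient list like this can be done through the fitted forest’s feature_importances_ attribute. A sketch on synthetic data (top 5 here rather than the 20 shown or the 45 used):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the recipe matrix.
X, y = make_classification(n_samples=300, n_features=30, n_informative=8,
                           random_state=0)
rfc = RandomForestClassifier(n_estimators=40, random_state=0).fit(X, y)

# feature_importances_ ranks features by mean impurity reduction;
# keeping only the top-k columns shrinks the model's input.
k = 5
top_idx = np.argsort(rfc.feature_importances_)[::-1][:k]
X_top = X[:, top_idx]

print(top_idx, X_top.shape)
```

On the real data the column indices would map back to ingredient names, giving the ranked list shown above.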

Random Forest Classifier model parameters were adjusted to build 40 trees and use 45 ingredients. The Accuracy Score was improved to 70.86%. That is to say our model can predict Cuisine based on the Ingredients in the Recipe approximately 71% of the time.

Here is how the RFC model performed according to the Confusion Matrix. A confusion matrix is a table that describes the performance of a classification model. It contains information about the actual and predicted classifications made by the classifier, and this information is used to evaluate the classifier’s performance.
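A small illustration of building such a matrix with scikit-learn, using toy cuisine labels rather than the model’s real predictions:

```python
from sklearn.metrics import confusion_matrix

# Toy actual and predicted labels standing in for the RFC's output.
y_true = ["italian", "mexican", "italian", "greek", "mexican"]
y_pred = ["italian", "italian", "italian", "greek", "mexican"]

labels = ["greek", "italian", "mexican"]
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(cm)  # rows: actual cuisine, columns: predicted cuisine
```

Correct predictions sit on the diagonal; here one Mexican recipe is misclassified as Italian, appearing off-diagonal in the Mexican row.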

Model #2. CountVectorizer with Logistic Regression Model

In this model, each recipe is processed through CountVectorizer (CV). CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary. To the processed data I applied a Logistic Regression (LR) model. Logistic Regression is a classification algorithm used where the response variable is categorical; the idea is to find a relationship between the features and the probability of a particular outcome.
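A minimal sketch of this pipeline, with a handful of made-up recipes joined into strings (CountVectorizer expects text documents, so each ingredient list becomes one space-separated string):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical recipes as text documents, with their cuisine labels.
docs = ["salt water tomato basil pasta",
        "salt tortilla beans cilantro",
        "olive feta oregano salt",
        "pasta tomato parmesan salt"]
cuisines = ["italian", "mexican", "greek", "italian"]

# CountVectorizer builds the vocabulary and encodes each document as
# word counts; Logistic Regression then classifies the encoded vectors.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(docs, cuisines)

print(model.predict(["tomato basil pasta"]))
```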

LR Model Accuracy Score

CountVectorizer with Logistic Regression Model produces the Accuracy Score of 77.70%.

Below is the breakdown of cuisines and top 20 words used to predict each cuisine.

In summary, machine learning techniques can increase our ability to predict cuisine type from the ingredients provided. In this study, the Accuracy Score of predicting a recipe’s cuisine increased from the 19.70% baseline to 77.70% by applying the Logistic Regression model.
