A Walkthrough NLP Workflow Using MeaLeon: Part 2, Rise and Cosine

Aaron Chen
Published in Analytics Vidhya
7 min read · Feb 27, 2020

Hey folks!

Photo by Brad Barmore on Unsplash

Last time, I started a walkthrough of a Natural Language Processing (NLP) workflow by talking about my project: a full-stack machine learning food recommender called MeaLeon.

In the previous article, I covered how I took a collection of documents (recipes) and converted the raw text ingredients into tokens of root words by combining Natural Language Toolkit (NLTK), WordNet lemmatization, and “expert” knowledge (my familiarity with cooking). In this article, I’ll go over vectorization and similarity analysis.

To the Vectors Go the Spoils

Why are we vectorizing words at all? Remember, computers still cannot process language the way humans can. We have to convert raw text into something that makes sense to process numerically. We partially did that by creating tokenized root words, and in the next step we'll add a little math.

What we're going to do is create a vector for each ingredient list in a high-dimensional space.

Relax if you’re not familiar with math. Think of it this way: when you ask Google Maps or (shudder) Apple Maps how to get somewhere, you’re going to get back directions in “2D” space.

Digression: If there are elevation changes, it'll technically be 3D, but I'd argue you're always at ground level, and no one gives driving directions that tell you to go up or down in elevation.

When you take those directions, you could mathematically add them all up to get one directional vector that tells you how far you need to go on the north/south and east/west axes. We're going to expand on this for our recipes by turning each ingredient into an axis, so that each ingredient list can be described as one vector in a much larger space. I would draw out a sample recipe here, but that would be impossible: MeaLeon's vocabulary is actually somewhat small at a little over 2,000 ingredients, but that still means over 2,000 axes, and that can't be shown in 2D space.
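To make that concrete, here's a toy sketch with a made-up five-ingredient universe (the ingredient names and vectors are illustrative, not MeaLeon's actual data):

```python
import numpy as np

# Hypothetical 5-ingredient "universe": each ingredient is one axis.
axes = ["flour", "salt", "butter", "sugar", "duck"]

# A recipe becomes one vector in that space: 1 if the ingredient
# appears in the recipe's ingredient list, 0 otherwise.
cookie = np.array([1, 1, 1, 1, 0])  # flour, salt, butter, sugar
confit = np.array([0, 1, 0, 0, 1])  # salt, duck

# Just like summing map directions component by component, each axis
# tells you how far a vector extends along that ingredient.
print(cookie + confit)  # -> [1 2 1 1 1]
```

MeaLeon's real space works the same way, just with 2,000+ axes instead of five.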

Now, the exact implementation of this is done via scikit-learn's CountVectorizer. It is possible to do all of these steps by calling individual functions and libraries, which I initially did, but you'll quickly find that you have to adapt your results to something scikit-learn prefers to work with anyway…so you might as well keep everything in one pipeline.

Weights and Measures: Some Like It (One) Hot

There are a few ways to create the word vectors. The least sophisticated is probably a one-hot encoding, which simply represents the presence of a word as a 1 or its absence as a 0. Here, I'll actually be using CountVectorizer to perform the one-hot encoding of each ingredient list: in MeaLeon's workflow, scikit-learn's actual OneHotEncoder presented unusual problems and was incompatible with the likely better method, vectorizing with Term Frequency-Inverse Document Frequency (TF-IDF).
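A sketch of the one-hot trick, using CountVectorizer's binary flag on made-up ingredient strings:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "salt salt pepper butter",  # "salt" listed twice
    "salt lemongrass duck",
]

# binary=True makes CountVectorizer record presence/absence (1/0)
# instead of raw counts -- effectively a one-hot encoding of each
# ingredient list, without reaching for OneHotEncoder.
onehot = CountVectorizer(binary=True)
X = onehot.fit_transform(docs).toarray()
print(X)  # every entry is 0 or 1, even for the doubled "salt"
```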

TF-IDF TL;DR

What is TF-IDF? To quote literally the first line of the Wikipedia article on this: “In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.”
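In its common textbook form (scikit-learn's version differs slightly in smoothing), the score for a term t in a document d, given N documents total, is:

```
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}
```

where tf(t, d) is how often t appears in d and df(t) is how many documents contain t. A term that appears everywhere has df(t) ≈ N, so its log factor, and thus its weight, collapses toward zero.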

A food/drink analogy I’d give is this: When McDonalds and Starbucks first debuted, it was cool and interesting to check out the individual locations. Now, since they’re everywhere, they’re actually not that special. No offense to you if you like both or either of those places, but let’s be real: When you walk out of a Starbucks and see another Starbucks, does your Frappuccino order feel irreplaceable?

With that hot take over, let’s go back to MeaLeon!

After doing EDA and reading the docs, I realized that CountVectorizer can be configured to record only the presence or absence of each token. This meant I could refactor my logic very easily while preserving the data pipeline for later use with TfidfVectorizer. When you look at the docs, you can see the parameters it shares with CountVectorizer. The reason why? Well, it's mentioned in the TfidfVectorizer page: TfidfVectorizer is built on top of CountVectorizer anyway.

Now, TfidfVectorizer is a better way to generate vectors because it factors in how often each ingredient appears across all ingredient lists and reduces the weight of frequently appearing words. With MeaLeon, the ingredients "salt" and "pepper" should logically be considered less important: those two show up in almost every recipe and should not count as much as something like "lemongrass" or "duck". As an aside, if you are not putting at least a little salt in your desserts (like chocolate chip cookies), I think you should consider it =)
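A quick sketch of that down-weighting on made-up ingredient lists, where "salt" appears in every document and "duck" in only one:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "salt pepper duck",
    "salt pepper beef",
    "salt flour butter",
]

tfidf = TfidfVectorizer()
tfidf.fit(docs)

vocab = tfidf.vocabulary_
idf = tfidf.idf_
# The ubiquitous "salt" earns a lower IDF weight than the rare "duck",
# so it contributes less to every recipe's vector.
print(idf[vocab["salt"]], idf[vocab["duck"]])
```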

Ok, so now that the vector spaces and their transformer objects have been created, we can use this transformer to convert new recipes into search vectors. More importantly, these search vectors can then be used with some metric to calculate similarity with the database. For MeaLeon, I have used either OneHot via CountVectorizer or TFIDF and now must decide the similarity measure. Generally, these break down to using one of two categories: distance or cosine similarity.

Paneer and Far: Distance or Cosine Similarity?

Distance is generally Euclidean distance and is pretty similar to what we use in two dimensions. This is basically "here's something we're looking at, and how far away is that other thing we're looking at?" For MeaLeon, Euclidean distance would not be a good metric. Why? Well, an individual recipe can have a lot of different ingredients, so for simplicity's sake, say one recipe has 10 ingredients: its vector has a 1 in 10 dimensions and 0 everywhere else. If we compare it to a new recipe with 10 entirely different ingredients, the vectors differ in 20 dimensions, and the Euclidean distance is Sqrt(20). But if we compare it to a recipe with 20 ingredients, where 10 are the same and 10 are completely different, the vectors now differ in only the 10 extra dimensions, so the distance drops to Sqrt(10). The bigger, half-unrelated recipe looks closer than the same-size, totally different one, purely because of the shared ingredients. You may be asking, "Isn't this too specific a scenario? When would this ever come up?"
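Checking that arithmetic with toy one-hot vectors in a hypothetical 30-ingredient space (recipe A uses ingredients 0-9, C uses 10-19, and D uses A's ten plus ten new ones):

```python
import numpy as np

# Recipe A: 10 ingredients.
a = np.zeros(30); a[:10] = 1
# Recipe C: 10 entirely different ingredients.
c = np.zeros(30); c[10:20] = 1
# Recipe D: 20 ingredients -- A's ten plus ten new ones.
d = np.zeros(30); d[:10] = 1; d[20:] = 1

print(np.linalg.norm(a - c))  # sqrt(20), differ in 20 dimensions
print(np.linalg.norm(a - d))  # sqrt(10), differ in only 10 dimensions
```

So under Euclidean distance, the 20-ingredient recipe D reads as closer to A than the 10-ingredient recipe C, which is exactly the failure mode the cookie example below runs into.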

I immediately thought of chocolate chip cookies and the nightmare that is Cincinnati Chili on Spaghetti. Here are very simplified versions of both:

Chocolate chip cookies:

Flour

Baking soda

Salt

Butter

Sugar

Eggs

Vanilla extract

Chocolate

Cincinnati Chili with Spaghetti:

Beef

Onions

Tomato

Vinegar

Worcestershire sauce

Garlic

Chili powder

Cumin

Cinnamon

Cayenne

Cloves

Allspice

Bay leaf

Salt

Chocolate

Yes, Cincinnati chili has chocolate in it. Other than that, all the ingredients are different! Oh wait, we haven’t added the unique ingredients from the spaghetti/pasta:

Flour

Eggs

Butter

The Euclidean distance here shrinks because the five shared ingredients (flour, eggs, butter, salt, chocolate) zero out their difference dimensions, but no one should be suggesting Cincinnati chili with spaghetti as a substitute for chocolate chip cookies. Or at all, actually. And this simplified version neglects the watery/thin consistency of the chili and the mountain of cheese typically piled on top.

Anyway, Euclidean distance would not be a good choice here, as it tends to emphasize the magnitudes of vectors in distance calculations.

Instead, MeaLeon uses cosine similarity. This reduces the importance of magnitude and is instead more concerned with the directions of the ingredient vectors. Here, this should allow us to make better comparisons between different recipes in a large-dimensional space.
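A sketch of that comparison with sklearn's cosine_similarity, on simplified, made-up versions of the recipes above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "flour baking_soda salt butter sugar egg vanilla chocolate",   # cookies
    "flour egg butter sugar chocolate",                            # brownie-ish
    "beef onion tomato garlic chili_powder cumin salt chocolate",  # chili
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)

# Cosine similarity compares vector *directions*, so a long ingredient
# list can't dominate the way a large magnitude does with Euclidean
# distance.
sims = cosine_similarity(X[0], X)  # cookies vs. everything
print(sims)
```

Here the cookies score highest against themselves (similarity 1.0), next against the brownie-like recipe that points in a similar direction, and lowest against the chili, despite the shared salt and chocolate.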

Ok, so we've got the recipe vectors and the method to make comparisons between vectors…it's time to look at what we get, right? Let's save that for next time, since we're already over 1,000 words!

In the meantime, if you have questions or comments, do leave them and I will do my best to address them! Hope to see y’all soon!

