Nine months ago, I decided to undertake some side projects in order to earn extra money after learning that my wife was pregnant with our third baby. Since I knew that my normal earnings would not be enough to support my family in the near future, I had to find other sources of income.
It was then that I decided to build a niche website to do affiliate marketing.
After I started this website project, which specializes in juicing, I only managed to post about 30 articles in nine months.
Furthermore, I wasn’t truly satisfied with these articles since they weren’t much different from other similar articles all over the web.
As such, I knew I had to make a difference and start writing high-quality articles which feature unique content that would easily be promoted on the web.
This was around the same time that I read the article “I analyzed every book ever mentioned on Stack Overflow. Here are the most popular ones,” by Vlad Wetzel on freeCodeCamp.
It suddenly hit me…
I decided to apply the same principle to juicing recipes.
I had to collect as many juicing recipes as possible to get meaningful results.
However, there was a big problem.
All the recipes were spread out on the web and written in different formats. To collect recipes manually would be very tedious and take at least two months, if not longer.
Thankfully, I stumbled upon yummly, which had already collected thousands of these recipes from around the web.
Things got even better when I discovered that yummly has an API service for the recipes.
Given my experience, it was right up my alley to leverage this API to accomplish my project.
I immediately registered for a free, two-week trial and began to use the API with Python and SQLite.
However, I noticed that it was impossible to fully differentiate juicing recipes from the other recipes that also included the word ‘juice’. But I realized that I could do so with “smoothie recipes” so I decided to use “smoothie recipes” for the analysis.
To this end, I have written two Python scripts. The first gets the recipes and writes the data to a SQLite database, while the second handles the analysis.
I utilized this Python module for acquiring the recipes from yummly.
Here is the ER diagram of the database:
It took around five hours to collect 10,765 smoothie recipes from yummly.com. After inserting these recipes into the database, it was much easier to work with the data.
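As a rough illustration of the storage step (not the original script), here is a minimal sketch using Python's built-in `sqlite3` module. The ER diagram is not reproduced in this text, so the table and column names, along with the `open_db` and `save_recipe` helpers, are assumptions made up for the example.

```python
import sqlite3

# Hypothetical schema: these table and column names are assumptions,
# not the schema from the article's ER diagram.
SCHEMA = """
CREATE TABLE IF NOT EXISTS recipe (
    id   TEXT PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS ingredient (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS recipe_ingredient (
    recipe_id     TEXT    NOT NULL REFERENCES recipe(id),
    ingredient_id INTEGER NOT NULL REFERENCES ingredient(id),
    PRIMARY KEY (recipe_id, ingredient_id)
);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def save_recipe(conn, recipe_id, name, ingredients):
    """Insert one recipe and link it to its ingredients.

    INSERT OR IGNORE makes the call safe to repeat for the same recipe.
    """
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO recipe (id, name) VALUES (?, ?)",
            (recipe_id, name),
        )
        for ingredient in ingredients:
            conn.execute(
                "INSERT OR IGNORE INTO ingredient (name) VALUES (?)",
                (ingredient,),
            )
            (ingredient_id,) = conn.execute(
                "SELECT id FROM ingredient WHERE name = ?", (ingredient,)
            ).fetchone()
            conn.execute(
                "INSERT OR IGNORE INTO recipe_ingredient VALUES (?, ?)",
                (recipe_id, ingredient_id),
            )
```

Keeping ingredients in their own table with a link table in between means each ingredient name is stored once, which makes counting across thousands of recipes a matter of simple joins.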
For the analysis, my aim was to find the most preferred combination of ingredients; however, the problem was that some ingredients in the recipes were written differently.
For example, the ingredient banana was called “bananas” in some recipes, but referred to as “banana” in others.
To overcome this problem, I used the lemmatizer from the nltk module to convert all of the ingredients to their singular forms.
The other problem I encountered was noun phrases like “frozen strawberries”. For the sake of my analysis, I only cared about “strawberry” in that phrase.
To solve this issue, I used the tokenizer and part-of-speech tagger from nltk and only accepted tokens tagged ‘NN’ or ‘NNS’, which correspond to singular and plural nouns, respectively.
Although I used all of these methods to refine the data, I had to use hard-coded exceptions for some specific ingredients that could not be captured by the logic of my methods.
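To make the clean-up logic concrete, here is a small standard-library sketch of the same idea. It deliberately does not use nltk: the crude `singularize` rules stand in for the lemmatizer, and "keep the last word" stands in for the noun filter, so it will get many real ingredients wrong. The function names and the entries in `EXCEPTIONS` are hypothetical examples, not the ones from the actual scripts.

```python
# Hard-coded overrides for phrases the simple rules below get wrong.
# These entries are made-up examples.
EXCEPTIONS = {
    "ice cubes": "ice",
    "greek yogurt": "yogurt",
}


def singularize(word):
    """Strip common English plural endings (a crude stand-in for a lemmatizer)."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"            # strawberries -> strawberry
    if word.endswith(("ches", "shes", "sses", "xes")):
        return word[:-2]                  # peaches -> peach
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                  # bananas -> banana
    return word


def normalize(ingredient):
    """Reduce an ingredient phrase to a single singular noun."""
    text = ingredient.strip().lower()
    if not text:
        return ""
    if text in EXCEPTIONS:                # exceptions win over the rules
        return EXCEPTIONS[text]
    head = text.split()[-1]               # assume the last word is the noun
    return singularize(head)
```

With these rules, `normalize("frozen strawberries")` and `normalize("Strawberry")` both map to the same key, which is what the later counting step needs. A real lemmatizer and part-of-speech tagger handle far more cases, which is why the exception table stays small in practice.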
Once I finally sorted all the ingredients in the recipes, it was easier to find the combinations of them.
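Finding the most common combinations can be sketched with `collections.Counter` and `itertools.combinations` from the standard library. This is an illustration under assumptions rather than the original analysis script: `count_combinations` is a hypothetical helper, and it expects each recipe as a list of already normalized ingredient names.

```python
from collections import Counter
from itertools import combinations


def count_combinations(recipes, size=2):
    """Count how often each group of `size` ingredients appears together.

    `recipes` is an iterable of ingredient lists. Ingredients are
    de-duplicated and sorted per recipe so that the same group always
    produces the same tuple, regardless of the order in the recipe.
    """
    counts = Counter()
    for ingredients in recipes:
        unique = sorted(set(ingredients))
        counts.update(combinations(unique, size))
    return counts


# Tiny made-up sample, only to show the call.
sample = [
    ["banana", "strawberry", "yogurt"],
    ["strawberry", "banana"],
    ["banana", "spinach"],
]
top_pairs = count_combinations(sample, size=2).most_common(3)
```

Changing `size` to 3 or 4 gives the most common triples or quadruples with the same code, at the cost of many more candidate groups per recipe.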
Here is one of the visualizations from the article.
Check out more in “I analyzed 10765 Smoothie Recipes. Here are the Results.”
If you are wondering at this point…
I am an embedded software engineer.
I am experienced in embedded systems and mostly the C programming language.
If you haven’t noticed, I am new to Python. Actually, I have been trying to learn Python for only six months.
What do you think about the results?
Am I on the right track in terms of Python programming (data science?) and content marketing?
I appreciate your feedback.