As an English philology graduate, I have always been interested in language-related data, but doing anything even remotely connected to NLP (Natural Language Processing) seemed like complete science fiction. Well, until I found out about the Digital Academy run by Czechitas. When I was choosing the topic for my project, I thought about what would best keep me motivated during the sleepless nights I expected to spend trying to make my Python scripts work. And I found it.
An internet phenomenon has emerged in the Czech Republic in the past few years that I have not seen in Western Europe (if you have, please write me an e-mail). There is a big community of mothers of all ages who gather on the Internet and, among other things, share their family pictures, stories and recipes, and buy and sell used baby goods. These mothers have developed their own dialect, almost unintelligible to outsiders, and produce tremendous amounts of data each day. Therefore, I decided to use this opportunity and look at the most visited Czech website dedicated to mothers and their babies, specifically the perhaps infamous but certainly popular recipes section of mimibazar.cz. The website claims to be a “family advertisement server”. As far as I know, it started as a place where people could sell baby clothes their offspring no longer needed, which is ecological and has proved to be a successful business plan. In the meantime, the users of this service have generated over 130,000 recipes, which is roughly 7.5 times more than on NYT Cooking, according to this article. There is also a group on Facebook where the opposite social group gathers to mock them and share the crème de la crème from Mimibazar and similar websites.
Before I share my findings with you, I would like to give you some background on why so much data is produced in such a small market. The Czech Republic, self-proclaimed to be the most western country of the former Eastern Bloc, is a place where motherhood is highly valued. Czech labor law is so supportive that maternity leave is paid for 6 months at 60% of the mother's former salary, and some employers even choose to top this up to 100% as a benefit. Moreover, after this period, the state provides financial aid to the family, and one of the parents can choose to stay at home with the child for up to 4 years. During this time, the employer must keep the position reserved in case the parent decides to return to work. According to finance.cz, this makes the Czech Republic the leading country in the European Union with regard to the time allowed on parental leave. When a couple decides to have a second child, the period starts over. It is quite usual for women not to come back to work for 5 to 6 years after they have given birth to their first child. Maybe you are asking, “What do they do all those years at home?” or “Apart from the baby, what keeps them mentally occupied?” Some of them, like my mother, go to school or get a part-time job. Others spend their time socializing on the Internet and producing the vast amounts of data my work is based on.
To parse the data from the website, I used the Python library BeautifulSoup. You can find a link to my GitHub at the end of this article, where I have shared the script I used to scrape the data. It was not an easy exercise and required a lot of reading, self-study, and trial and error. First I had to become familiar with HTML structures, then I tried a few web scraping tutorials I had googled, and in the last phase I took a crash course in regular expressions. The most difficult task was to separate the ingredients part of each recipe from the instructions, as there was no special class for them in the HTML code.
This is what it looks like when you inspect the page:
At this stage, I created an account on Mimibazar and tried to enter an empty recipe, which was possible. My account was erased by the administrator the next day because I had not filled in all the mandatory fields. The lack of governance on their end meant that I could not rely on a recipe starting or ending with a block of text.
I was advised to use regex101 as a playground and to replace the b tag with @@@ (a sequence not used in natural language), so that the marker could serve as a reliable split point. With that in place, the script managed to download all 130,000 recipes without any difficulties.
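The marker trick can be sketched roughly as follows. This is not the script from my GitHub, only a minimal illustration that uses the Python standard library instead of BeautifulSoup. The function name `split_on_bold` and the sample HTML fragment (including its "Suroviny" and "Postup" labels) are assumptions made up for the example, not the real Mimibazar markup.

```python
import re
from html import unescape

MARKER = "@@@"  # a sequence that should not occur in natural language


def split_on_bold(html_fragment: str) -> list[str]:
    """Split a recipe fragment into text sections at every <b> or </b> tag."""
    # Replace opening and closing <b> tags with the marker.
    # The \b after "b" keeps tags such as <br> or <body> untouched.
    marked = re.sub(r"</?b\b[^>]*>", MARKER, html_fragment, flags=re.IGNORECASE)
    # Drop all remaining tags and decode HTML entities.
    text = unescape(re.sub(r"<[^>]+>", " ", marked))
    # Split at the marker, normalize whitespace, discard empty pieces.
    parts = [re.sub(r"\s+", " ", part).strip() for part in text.split(MARKER)]
    return [part for part in parts if part]


# Hypothetical fragment: bold labels followed by free text.
sample = "<div><b>Suroviny:</b> 2 vejce, mouka<br><b>Postup:</b> Smíchat a upéct.</div>"
sections = split_on_bold(sample)
print(sections)
```

Because empty pieces are discarded, the same function copes with recipes that do not start or end with a block of text, which was the situation the missing validation forced me to expect.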
After the scraping exercise, I was left with a JSON file containing all 130,000 recipes. I replaced unwanted characters (see my GitHub) and, using pandas, kept only the recipes that have at least 10 characters in both the ingredients and the instructions columns. There were many recipes which did not include any text, mainly in the cake category. I have a theory that a certain user called Evík is running a business by posting pictures of her creations so that the ladies can order them via private messaging. After I was done, I was left with 72,919 recipes, which means a 44% loss. I have noticed that many users use emoji like smiley faces or hearts when they are too lazy to fill in a field. Or they just take a picture of the meal and do not insert any other information.
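The length filter could look something like the sketch below. The column names `ingredients` and `instructions` match the description above, but the helper name `keep_complete_recipes` and the three sample rows are assumptions for illustration only.

```python
import pandas as pd


def keep_complete_recipes(df: pd.DataFrame, min_chars: int = 10) -> pd.DataFrame:
    """Keep rows whose ingredients and instructions both have at least min_chars characters."""
    # Missing values count as empty strings; surrounding whitespace is ignored.
    ingredients_len = df["ingredients"].fillna("").str.strip().str.len()
    instructions_len = df["instructions"].fillna("").str.strip().str.len()
    mask = (ingredients_len >= min_chars) & (instructions_len >= min_chars)
    return df[mask].reset_index(drop=True)


# Made-up sample rows: one complete recipe, one emoji-only, one without instructions.
recipes = pd.DataFrame(
    {
        "name": ["Bábovka", "Dort", "Guláš"],
        "ingredients": ["mouka, cukr, vejce", "", "hovězí maso, cibule, paprika"],
        "instructions": ["Vše smíchat a upéct v troubě.", ":-)", None],
    }
)

clean = keep_complete_recipes(recipes)
loss = 1 - len(clean) / len(recipes)  # share of recipes dropped by the filter
print(len(clean), round(loss, 2))
```

The same ratio applied to my numbers, taking 130,000 as the starting count, gives 1 - 72,919 / 130,000, which is roughly 0.44.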
To help me lemmatize the data (convert the words to their dictionary form), I visited Geneea, a small Czech startup, in their beautiful office right next to the Charles Bridge. We used their tool Frida (see the demo), which gave me a list of keywords and phrases with the frequency of their use in a nice user interface. I also received a complicated but substantial JSON file to play with, which I thoroughly enjoyed. The tool also did well at restoring Czech diacritics in recipes where the contributor had not used Czech special characters. The JSON file happened to be 4.5 GB, which gave me an opportunity to learn how to upload and download files from AWS S3 buckets using the command line. It was also difficult to work with the file, since I ran out of memory when I tried to load it into my Jupyter Notebook. More about this in the next sections.
Some screenshots from Frida:
Using the CSV file I had exported with my growing Python skills and Jupyter Notebook, together with the downloads from Frida, I was now able to produce Tableau visualizations. I learned that it is better to feed Tableau with data pre-filtered in pandas than to filter it in Tableau directly, as Tableau was unable to separate the columns in the CSV file correctly. Using MS Excel for this was out of the question, as I had too much data to work with and I would, again, run out of memory (my poor computer has been complaining a lot in the past few days). There is a link to the dashboard I created at the bottom of this article.
As a starting data analyst, I suffer from the same thing as my more experienced colleagues: I kept digging and digging into the code, forgetting to put at least some preliminary results into some nice visuals to see what I had. After I did this, I suddenly understood why there is so much sugar and flour in all the output from Frida.
The two leading categories of recipes were cakes and sweet pastry.
The 10 most used ingredients include sugar, flour, salt, eggs, butter, oil, milk, black pepper and onions. Czech rum (made from potatoes, often used in baking) has a surprising number of occurrences. It is used more than, for example, caraway seeds, which are used widely in Czech cuisine.
Why is that? Try to imagine you are a mother on a long parental leave and you cook every day to feed your husband and children. Which of your creations would you be most likely to share with your virtual friends? Probably the one for special occasions where celebration is involved. Thus the cake category has the biggest number of hits. The ratio of sweet recipes is, however, quite alarming considering that more than half of the adult population of this country suffers from obesity and the percentage among children is also rising, to 20–30% (numbers). Let’s not turn this article into preaching about healthy food, however.
After I saw this, I decided that filtering out sugar-based recipes from my analysis would give me better results, as all of them contain the same basic ingredients (sugar, flour, eggs, butter and baking soda). Therefore, I went back to my original JSON and, using pandas, excluded all sweets. I put the categories into MS Excel first (the same file I fed Tableau with above) to remove the duplicates and pick the exclusions.
After this exercise, there were only 28,149 recipes left, which means I excluded 61% of the clean data set.
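Excluding whole categories is a one-liner in pandas once the list of exclusions exists. In the sketch below, the column name `category`, the helper `drop_categories` and the category labels are hypothetical placeholders; the real list of exclusions was picked by hand in Excel.

```python
import pandas as pd

# Hypothetical labels standing in for the hand-picked sweet categories.
SWEET_CATEGORIES = {"cakes", "sweet pastry"}


def drop_categories(df: pd.DataFrame, excluded: set[str]) -> pd.DataFrame:
    """Return only the rows whose category is not in the excluded set."""
    return df[~df["category"].isin(excluded)].reset_index(drop=True)


# Made-up sample data.
sample = pd.DataFrame(
    {
        "name": ["Bábovka", "Guláš", "Koláč", "Polévka"],
        "category": ["cakes", "meat", "sweet pastry", "soups"],
    }
)

savory = drop_categories(sample, SWEET_CATEGORIES)
excluded_share = 1 - len(savory) / len(sample)
print(list(savory["name"]), excluded_share)
```

With my counts, the excluded share is 1 - 28,149 / 72,919, which is roughly 0.61.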
Then we ran Frida again and here is the result:
Let’s look into Tableau:
The most used ingredients now changed to salt, oil, black pepper, onion, flour, garlic, meat, cheese, eggs and butter.
So far, these look like standard ingredients used in any recipe, regardless of whether it was created by a haute cuisine chef or a mother of four. What then makes the recipes on Mimibazar so special that they became a cult in the Czech cyberspace? It is not the ingredients themselves but the endless human creativity that can put them into very bizarre combinations to save time or money, or just to make the food look edible for hubbies (in Czech manžové) and children.
The Fun Part
Almost all my friends became excited when I told them I had this data, so they gave me a lot of input on what they would like me to do with it. I must admit that sometimes I was just staring at the CSV file, laughing at the names of the recipes or even some ingredients (middle-sized cucumber is the new running joke). They wanted to know the occurrences of certain words that made Mimibazar famous: “manža” (hubby), “čvachtat” (a verb that literally means to slush and, in the Mimibazar dialect, to eat with appreciation), “boule za ušima” (this comes from one of the famous sentences: “po tom si bude manža tak čvachtat, až se mu budou dělat boule za ušima… a nejen boule” which is totally untranslatable) and some ingredients like ketchup, vegeta, and gothaj salami.
And here are the results:
Because I purposefully included the URL of each recipe when downloading the data from the web, it is easy to trace back the recipes that contain a certain set of words. Here I looked for recipes containing both ketchup and tartar sauce. I found 72 recipes with this filter.
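A filter of this kind might be written as in the sketch below. The column names `ingredients` and `url`, the helper `recipes_with_all`, the Czech search terms and the example.com addresses are all assumptions for illustration. It matches plain substrings on the raw text, whereas a search over lemmatized data would also catch inflected forms.

```python
import pandas as pd


def recipes_with_all(df: pd.DataFrame, words: list[str], column: str = "ingredients") -> pd.DataFrame:
    """Return the rows whose text column contains every one of the given words."""
    text = df[column].fillna("").str.lower()
    mask = pd.Series(True, index=df.index)
    for word in words:
        # regex=False treats the word as a plain substring, not a pattern.
        mask = mask & text.str.contains(word.lower(), regex=False)
    return df[mask]


# Made-up sample rows with placeholder URLs.
sample = pd.DataFrame(
    {
        "url": [
            "https://example.com/recipe/1",
            "https://example.com/recipe/2",
            "https://example.com/recipe/3",
        ],
        "ingredients": [
            "Kečup, tatarská omáčka, rohlík",
            "kečup, párek",
            "tatarská omáčka, sýr",
        ],
    }
)

hits = recipes_with_all(sample, ["kečup", "tatarská omáčka"])
print(list(hits["url"]))
```

Keeping the `url` column in the data frame is what makes it possible to jump from a match straight back to the original page.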
After downloading the JSON file from Geneea's S3 bucket to my computer, I started planning my next steps. I would like to run the PMI (pointwise mutual information) algorithm on the data and see what the relationship is between the words in the ingredients. I am interested in seeing which ingredients are most likely to go together, which could potentially lead to a random recipe generator.
I managed to get familiar with the structure of the file and run some basic extracts from it using Python.
I also signed up for a machine learning course on Coursera, where I plan to acquire the skills to train my own program to recognize the difference between a quantity and an ingredient. I would like to feed it with the data above. Once I have this, I would like to pair it with prices and create a calculator to see whether the argument of many Mimibazar users, that they cook like this because they are short on money and healthy food is more expensive, is valid. I will be updating this article as I progress.
I forgot to mention that the course I attended started exactly 2 months ago, and before that I only had very basic Python skills (let’s say I had seen a for loop from a passing train and had never worked with a dictionary). It is amazing to see how much you can learn in such a short time while having a full-time job when you have the right people to guide you. I am highly motivated to continue and maybe one day make NLP the source of my living.
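I have not run this on the real data yet, so the following is only a sketch of what I have in mind. PMI compares how often two ingredients appear in the same recipe with how often that would happen if they were independent: PMI(a, b) = log2(p(a, b) / (p(a) · p(b))). Here the probabilities are estimated as shares of recipes, which is one common choice among several. The function name `pmi_scores` and the toy ingredient lists are made up.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(recipes: list[list[str]], min_count: int = 1) -> dict[tuple[str, str], float]:
    """Compute PMI for every ingredient pair that co-occurs in at least min_count recipes."""
    n = len(recipes)
    single = Counter()  # number of recipes containing each ingredient
    pair = Counter()    # number of recipes containing each (sorted) pair
    for ingredients in recipes:
        unique = sorted(set(ingredients))  # count each ingredient once per recipe
        single.update(unique)
        pair.update(combinations(unique, 2))
    scores = {}
    for (a, b), n_ab in pair.items():
        if n_ab < min_count:
            continue
        # (n_ab / n) / ((n_a / n) * (n_b / n)) simplifies to n_ab * n / (n_a * n_b)
        scores[(a, b)] = math.log2(n_ab * n / (single[a] * single[b]))
    return scores


# Toy data: each inner list stands for the lemmatized ingredients of one recipe.
toy = [
    ["salt", "onion", "garlic"],
    ["salt", "onion"],
    ["salt", "flour"],
    ["flour", "sugar"],
]

scores = pmi_scores(toy)
for pair_key, value in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair_key, round(value, 3))
```

A positive score means the pair shows up together more often than chance would suggest, a negative one less often. Rare ingredients tend to get inflated scores, which is why the `min_count` threshold is there.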
Special thanks go to:
Czechitas for giving us this opportunity
My mentor Bert Šváb for mental support and keeping me motivated
Radoslav Klíč from Geneea for the language part and the time you invested in me
Petr Krebs from Avast Software for the crazy web scraping afternoon
Kryštof Večerek for being a great friend
Special no thank you goes to:
My cat, who started peeing on my couch around 2 weeks into the course because of the lack of attention I was giving her