Give me a drink, Cocktail Recommender

Published in Analytics Vidhya · 5 min read · Dec 21, 2020

Everybody should believe in something
I believe I’ll have another drink.
– Albert Einstein

I’m not sure Albert Einstein said these words, but I’m sure he would have said them if he had had the chance to use our Cocktail Recommender.

Cocktail Recommender is a system that recommends new cocktails to users from a database of 1049 unique cocktails.

It is deployed on https://cocktail-recommender.herokuapp.com/ using Streamlit and Heroku.

Note: The web page may take a few seconds to load because it is deployed using Heroku's free trial.

Overview

There are hundreds of cocktails, yet most people know only the famous ones. Many people hesitate to try something new because there are so many options with different ingredients, and it is hard to guess which ones they will love. Our cocktail recommender system encourages people to try new flavours by applying data science to the many cocktails in its database.

The web app provides recommendations based on two types of input: cocktail names or ingredient names. If the user gives the name of a cocktail, the system recommends cocktails with similar ingredients.

The assumption here is that if people love a cocktail, they will also love cocktails similar to it.

When a user provides a cocktail name that is present in the database, a bar chart appears showing the recommended cocktails and their similarity scores.

Getting Data

Two different datasets were used to create the database of the system.

Hotaling & Co. Cocktail Dataset on Kaggle: This dataset contains 600+ cocktails with ingredients, recipes, locations, bartender names, etc.
TheCocktailDB: 500+ cocktails with their ingredients and recipes were retrieved using their API. Details of the retrieval can be found in the Jupyter notebook CocktailDbStreaming.

Merging these two datasets yielded a single dataset of 1049 unique cocktails with their ingredients and recipes.
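To make the merge step concrete, here is a minimal sketch of one way to combine two sources and keep only unique cocktails. It is not the project's actual code: the record fields, the sample rows, and the rule of treating names as equal after lowercasing are all assumptions made for illustration.

```python
def normalize_name(name: str) -> str:
    """Lowercase and collapse whitespace so 'Mai  Tai' and 'mai tai' compare equal."""
    return " ".join(name.lower().split())


def merge_cocktails(*sources):
    """Merge lists of cocktail records, keeping the first record seen for each name."""
    merged = {}
    for source in sources:
        for record in source:
            key = normalize_name(record["name"])
            # setdefault keeps the earlier record when a name appears twice.
            merged.setdefault(key, record)
    return list(merged.values())


# Hypothetical sample rows; the real datasets have many more fields and records.
kaggle_rows = [
    {"name": "Smoky Hibiscus", "ingredients": "1.5 oz mezcal, 1 oz hibiscus syrup"},
    {"name": "Margarita", "ingredients": "2 oz tequila, 1 oz lime juice"},
]
cocktaildb_rows = [
    {"name": "margarita", "ingredients": "tequila, triple sec, lime juice"},
    {"name": "Mojito", "ingredients": "white rum, lime, mint, sugar, soda water"},
]

cocktails = merge_cocktails(kaggle_rows, cocktaildb_rows)
print([c["name"] for c in cocktails])
```

Keeping the first record is an arbitrary choice here; a real merge might prefer whichever source has the more complete recipe.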

Methodology

Preprocessing for Tf-idf Vectorizer

Some preprocessing is done before applying the actual tf-idf vectorization.

Stop words: In addition to ENGLISH_STOP_WORDS, the words ‘oz’, ‘simple’, ‘dash’, ‘bsp’, ‘drops’ are added as stop words. These words are present in many ingredients and don’t provide much information about what makes a cocktail unique.

Token pattern: Only elements made of alphabetical characters are tokenized. This prevents numbers and symbols from producing unnecessary tokens.
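The two preprocessing choices above could be configured roughly as follows. This is a sketch, not the project's exact settings: the extra stop words come from the text, but the regular expression is an assumed pattern that simply keeps runs of two or more letters.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Extra stop words named in the article, added to scikit-learn's English list.
EXTRA_STOP_WORDS = {"oz", "simple", "dash", "bsp", "drops"}
stop_words = sorted(ENGLISH_STOP_WORDS.union(EXTRA_STOP_WORDS))

# Assumed pattern: runs of two or more ASCII letters, so "2", "1.5" and "&" are dropped.
vectorizer = TfidfVectorizer(stop_words=stop_words, token_pattern=r"[a-zA-Z]{2,}")

# build_analyzer() returns the lowercase -> tokenize -> stop-word-filter pipeline,
# which lets us inspect the tokens without fitting anything.
analyze = vectorizer.build_analyzer()
tokens = analyze("2 oz London dry gin, 1 dash orange bitters, 3 drops saline")
print(tokens)
```

Inspecting the analyzer output like this is a quick way to check that measurement words and quantities never reach the vocabulary.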

Implementation

The system makes recommendations based on the similarity of ingredients. Each cocktail's ingredients are vectorized using scikit-learn’s TfidfVectorizer class. This technique is simple yet powerful. It transforms cocktails with similar ingredients into similar vectors. One major advantage of this method over the Bag of Words method is that rare tokens get higher weights than common ones. In this way, similar cocktails with rare ingredients can be related more successfully.

Cosine similarities of the vectorized ingredients are calculated using the sklearn.metrics.pairwise.linear_kernel function. Because TfidfVectorizer L2-normalizes each vector by default, this dot product equals the cosine similarity. The details of the recommendation system can be found in this Python file.
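To see why rare ingredients weigh more, it helps to compute the inverse document frequency by hand. The sketch below uses the smoothed formula that scikit-learn applies by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1. The dataset size comes from the article, while the two document frequencies are hypothetical numbers chosen only for illustration.

```python
import math


def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, scikit-learn's default smoothing."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


N_COCKTAILS = 1049  # size of the merged dataset, from the article
common_df = 400     # hypothetical: number of cocktails that mention "lime"
rare_df = 5         # hypothetical: number of cocktails that mention "absinthe"

idf_common = smoothed_idf(N_COCKTAILS, common_df)
idf_rare = smoothed_idf(N_COCKTAILS, rare_df)
print(f"common ingredient idf: {idf_common:.2f}")
print(f"rare ingredient idf:   {idf_rare:.2f}")
```

A plain Bag of Words count would give both ingredients the same weight of 1 per occurrence, so two cocktails sharing the rare ingredient would look no more related than two sharing the common one.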

If the user gives the name of a cocktail that is present in our database:

The cocktails with the highest similarity values to the given cocktail are recommended to the user with their ingredients and recipes.
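A name-based lookup along these lines can be sketched in a few lines. The mini-database and the function name recommend_by_name are hypothetical, and the vectorizer uses default settings for brevity; only the use of TfidfVectorizer and linear_kernel comes from the description above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical mini-database: cocktail name -> ingredient text.
COCKTAILS = {
    "Margarita": "tequila triple sec lime juice salt",
    "Tommy's Margarita": "tequila lime juice agave syrup",
    "Daiquiri": "white rum lime juice sugar",
    "Negroni": "gin campari sweet vermouth",
    "Martini": "gin dry vermouth olive",
}
names = list(COCKTAILS)

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(list(COCKTAILS.values()))


def recommend_by_name(name: str, top_n: int = 3):
    """Return (cocktail, similarity) pairs most similar to a known cocktail."""
    idx = names.index(name)
    # Rows are L2-normalised by default, so the dot product is the cosine similarity.
    scores = linear_kernel(tfidf[idx], tfidf).ravel()
    ranked = scores.argsort()[::-1]
    # Skip the query cocktail itself, which always has similarity 1.
    return [(names[i], float(scores[i])) for i in ranked if i != idx][:top_n]


for cocktail, score in recommend_by_name("Margarita"):
    print(f"{cocktail}: {score:.2f}")
```

The returned pairs are exactly what a bar chart of names against similarities would need as input.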

If the user gives the names of ingredients:

The given ingredients are vectorized using the same vectorizer. Cosine similarities between the vectorized input ingredients and the cocktails’ ingredients are calculated. The cocktails with the highest similarity values are recommended to the user.
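The ingredient-based path differs only in where the query vector comes from. Here is a minimal sketch, again with a hypothetical mini-database and a hypothetical function name; the key point is calling transform rather than fit_transform so the query shares the vocabulary and idf weights of the database.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical mini-database: cocktail name -> ingredient text.
COCKTAILS = {
    "Margarita": "tequila triple sec lime juice salt",
    "Daiquiri": "white rum lime juice sugar",
    "Negroni": "gin campari sweet vermouth",
    "Martini": "gin dry vermouth olive",
}
names = list(COCKTAILS)

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(list(COCKTAILS.values()))


def recommend_by_ingredients(ingredients: str, top_n: int = 3):
    """Return (cocktail, similarity) pairs for free-text ingredients."""
    # transform() reuses the fitted vocabulary; unknown words are simply ignored.
    query = vectorizer.transform([ingredients])
    scores = linear_kernel(query, tfidf).ravel()
    ranked = scores.argsort()[::-1][:top_n]
    # Drop cocktails that share nothing with the query.
    return [(names[i], float(scores[i])) for i in ranked if scores[i] > 0]


print(recommend_by_ingredients("gin, vermouth"))
```

If none of the input words appear in the vocabulary, the query becomes a zero vector and the function returns an empty list, which the app could turn into a "no match" message.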

Refinements

In the first version of the system, only the Hotaling & Co. Cocktail Dataset on Kaggle was used. Although this dataset has many original cocktails, it lacks some popular ones. It is essential for our database to include popular cocktails because these are the ones users will provide most of the time. TheCocktailDB is a great website with data on both popular and rare cocktails. Therefore, data from TheCocktailDB was retrieved through their API and merged with the initial dataset.

Photo by Louis Hansel @shotsoflouis on Unsplash

Another refinement was adding the stop words ‘oz’, ‘simple’, ‘dash’, ‘bsp’, ‘drops’ to the default English stop words. These words are present in most of the ingredients and don’t provide valuable information for recommendation. Additionally, we observed that they degrade the performance of the system. For example, if two cocktails contain the words ‘drops’ and ‘oz’ many times, the calculated similarity between them is higher than their ingredients justify.

The final refinement was using a token pattern to tokenize only alphabetical elements.
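The inflation effect is easy to reproduce on a toy case. In the sketch below, the two recipes are hypothetical and share no real ingredient, only measurement words; the token pattern is an assumed one that keeps alphabetical tokens of two or more letters.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Two hypothetical recipes with no ingredient in common, only measurement words.
recipes = [
    "2 oz bourbon, 3 drops saline, 1 oz amaro",
    "1 oz mezcal, 2 drops absinthe, 1 oz grapefruit cordial",
]


def similarity(stop_words):
    """Cosine similarity between the two recipes for a given stop-word list."""
    vectorizer = TfidfVectorizer(stop_words=stop_words, token_pattern=r"[a-zA-Z]{2,}")
    tfidf = vectorizer.fit_transform(recipes)
    return float(linear_kernel(tfidf[0], tfidf[1])[0, 0])


default_only = sorted(ENGLISH_STOP_WORDS)
with_extras = sorted(ENGLISH_STOP_WORDS.union({"oz", "simple", "dash", "bsp", "drops"}))

print(similarity(default_only))  # positive: "oz" and "drops" count as shared tokens
print(similarity(with_extras))   # zero: no real ingredient overlaps
```

With only the default stop words, the recipes look related purely because of how their quantities are written; with the extended list, the similarity drops to zero as it should.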

Conclusion

The system is built by transforming text data into vectors and calculating the cosine similarity between these vectors. Before that, we needed to obtain data from different sources, merge it, and do some preprocessing.

The most challenging part of the project was finding suitable data. There are many great cocktail websites with ingredients, recipes and user reviews. However, all of them prohibit web scraping. Hence, finding open source data was one part of the challenge.

Another difficulty is that it is hard to evaluate the quality of the recommendations objectively. Therefore, we asked many users to try the system and collected their feedback. Most of them stated that they were generally happy with the recommendations.

One improvement might be to use user reviews of the cocktails. This might improve performance because the recommendations would then also take into account how people perceive cocktails, such as refreshing or fruity.

What do you think about the recommendations you get?
Which other techniques or data could be used to improve recommendation performance?

All the code and data are available on GitHub.

Hope you enjoy this post! I would be happy to receive your feedback; you can contact me on LinkedIn.

Cheers!

Acknowledgment

I would like to thank Patryk Oleniuk for the helpful post about quickly deploying ML models to the web. I used the template mentioned in that post.