Recommending the Perfect Bottle of Wine
Like most good brothers, I try to buy thoughtful gifts for my sister for special occasions. As it happens, my sister likes wine, and I like data, so when I came across this data set that has scores and reviews for more than 130,000 wines, it was a perfect chance to use my interest in data for a good purpose.
It started with an innocent text to my sister:
The goal was to build a recommendation system that takes a list of wines she likes and picks out a few that I could buy for her as a birthday gift. The sample of known choices is small (a handful of wines against 130,000), but we can use filtering and language processing to make initial predictions. Over time, as the list of wines that she does and doesn’t enjoy grows, we can keep feeding that data into the recommendation system and re-run the code.
The data and code
The data was scraped from WineEnthusiast and posted by a Kaggle user in .csv format. It includes ~130,000 reviews of different wines, with variables such as the wine’s country, description by the reviewer, points ranking, price, region, variety, and winery.
Full code for the project is posted on my GitHub here.
Recommendation Systems: What are they?
Recommendation systems typically take one of three forms: collaborative filtering, content-based filtering, and a hybrid of the two.
Collaborative filtering analyzes one user’s behavior and makes recommendations based on what similar users have done. For example, Netflix knows which shows I’ve watched based on my viewing history and can recommend shows that other users “like me” (i.e. with a similar viewing history) have also watched.
Content-based filtering is isolated to a single user and, taking the items that they’ve purchased or used in the past, looks for similar items by description or other identifying variables. So, when I buy a certain coffee blend on Amazon, I could be recommended other varieties that have the same roast, are from a similar region, and/or have overlapping words in the description.
The hybrid method is just as it sounds — a combination of both content-based and collaborative filtering. If we go back to Netflix, they could (and do) build a model that can identify similar shows to my viewing history, then rank those shows based on the viewing behavior of people like me.
For this project, because we don’t have other users to compare against, we’ll be using strictly content-based filtering. The inputs will be the wines that we know that my sister likes, and the recommendation will be those that exhibit similar characteristics — region, description, variety — to those wines.
I decided to build three separate algorithms and then blend the three for the final choice(s). The first uses the characteristics of the wines that she already has enjoyed to find the highest rated wines within those characteristic filters; the second uses text data and similarity to rank the wines; the final uses language frequency techniques in Python’s scikit-learn library.
Basic Recommendation System
The basic recommendation system identifies the characteristics of the wines that Arielle likes (“training” wines) and searches for other wines that match those characteristics, then ranks them based on points and price.
This is accomplished by setting up a series of filters on the structured variables of year, price, country, and variety; in later iterations, we’ll use the more unstructured text data in the description and title columns.
This will give us only the wines that come from the list of countries from our “training” data, that were produced within a fixed range of years, are within our price range, and are the types/varieties of wines that we know she likes.
After setting up these filters, in order to get a mix of types and price points, I identified the top ranked wine for each of the unique varieties, with one choice below the midpoint pricing and one above. The final result was a set of 8 wines from 4 varieties (GSM, Pinot Noir, Sparkling Blend, and Syrah), with each variety having a low and high-priced option.
    def basic_recommendation(df):  # Function name assumed; the original definition line was lost
        """
        Look for the wines that match the most common characteristics of the ones that Arielle already enjoys.
        This includes setting up year, price, variety, and country filters. For each unique variety, find a high and low priced wine within our range that is the top rated and fits our filtered variables.
        The function will return a data frame with two recommendations per variety.
        """
        current_wine = df[df['arielle_choice'] == 1]
        current_filter = (df['arielle_choice'] != 1) # We don't want a wine that she already has tried
        year_min = current_wine.year.min()
        year_max = current_wine.year.max()
        year_filter = ((df.year >= year_min) & (df.year <= year_max)) # Filter for the year range of wines that she enjoys
        price_min = current_wine.price.min()
        price_max = current_wine.price.max()
        price_mid = price_min + ((price_max - price_min) / 2) # Create midpoint for high/low price adjuster
        price_filter = ((df.price >= price_min) & (df.price <= price_max)) # Filter for her typical price points
        countries = list(set(current_wine.country))
        country_filter = (df.country.isin(countries)) # Filter for country
        varieties = list(set(current_wine.variety))
        variety_filter = df.variety.isin(varieties) # Filter for wine type
        filtered_df = df[current_filter & year_filter & price_filter & country_filter & variety_filter] # Filtered data frame
        recommendation = [] # Index values of the recommended wines
        for variety in varieties: # Recommendation for each of her favorite varieties
            var_df = filtered_df[filtered_df.variety == variety]
            lower_price = var_df[var_df.price <= price_mid] # One pick that's below the mid-point pricing
            higher_price = var_df[var_df.price > price_mid] # and one that's above
            top_rec_low = lower_price[['country','designation','points','price','title','variety']].sort_values(
                by='points',ascending=False)[:1] # Extract the highest rated lower-priced wine
            top_rec_high = higher_price[['country','designation','points','price','title','variety']].sort_values(
                by='points',ascending=False)[:1] # Extract the highest rated higher price wine
            recommendation.extend(top_rec_low.index) # Add the index value of the lower priced wine to the list
            recommendation.extend(top_rec_high.index) # Add the index value of the higher priced wine to the list
        rec_df = df.loc[df.index.isin(recommendation),:] # Extract only the recommendation index values from the data
        rec_df = rec_df.sort_values(by=['variety', 'price'], ascending=True) # Sort/group by wine type then price (no in-place sort on a slice)
        return rec_df # Return the recommendation data frame
The limitation of the above is that we’re narrowing our view of the universe to a set of common and consistent characteristics. The recommendations are all from the West Coast of the U.S. (since that’s where she is now and what she tries the most) and from vintages between 2005 and 2014. So, while this is a good start, we still want to expand our horizons beyond the U.S. borders and into new areas, which leads us to…
The second recommender system that we’ll build will combine all of the text and values in a given row, then count the unique words in that observation that match the words in our “training” choices.
We’ll do this by using the NLTK library in Python to “tokenize” the text in each observation based on a pre-defined pattern.
Tokenization is a way to split a set of text values into unique items based on a pattern. For instance, we could extract only words; words and digits; or words, digits, and special characters from a text string.
For example, tokenizing the first sentence in the last paragraph would yield a list of values [“Tokenization”, “is”, “a”, “way”, “to”…]. To take it a step further, we remove the common “filler words” (also referred to as stopwords) that don’t add value to the meaning of a sentence (“a”, “an”, “the”,…). The final tokenized version of the sentence with stopwords removed would look like: [“Tokenization”, “way”, “split”, “set”, “text”, “values”…].
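To make the two steps concrete, here's a minimal sketch of tokenizing and then dropping stopwords. It uses Python's built-in re module as a stand-in for the NLTK tokenizer, and the stopword set is a tiny hand-picked subset that I made up for illustration (NLTK's English list is much longer), so treat the names and the pattern as assumptions rather than the project's actual code.

```python
import re

# Illustrative stopword subset (an assumption); NLTK's English list is much longer.
STOPWORDS = {"a", "an", "the", "is", "to", "of", "into", "on", "and"}

def tokenize(text):
    """Split text into word tokens (runs of letters, digits, or underscores)."""
    return re.findall(r"\w+", text)

def remove_stopwords(tokens):
    """Drop tokens whose lowercase form is in STOPWORDS."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

sentence = ("Tokenization is a way to split a set of text values "
            "into unique items based on a pattern")
tokens = tokenize(sentence)
filtered = remove_stopwords(tokens)
print(filtered)
```

Changing the pattern is how you would switch between "only words" and "words plus special characters"; the stopword step stays the same either way.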
With each sentence split into its set of components, we can then work with text on a more micro level to look for patterns and similarity across observations.
Next, we’ll create a single list of all of the unique words in the combined “training” reviews (i.e. the wines that Arielle already told me that she likes).
We’ll then loop through the tokenized words in each observation; for each tokenized word, if it appears in our single training list of words, we’ll add 1 to a counter for that variable, with the final counter value being a proxy for similarity.
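A rough sketch of that counter, assuming each review has already been tokenized into a list of words. The function name, the toy reviews, and the choice to count each distinct word only once per observation are my assumptions for illustration.

```python
def match_count(tokens, training_vocab):
    """Count the unique tokens in one observation that also appear in the training vocabulary."""
    return len(set(tokens) & training_vocab)

# Toy tokenized reviews for the wines she likes (made-up words)
training_reviews = [
    ["dark", "berry", "spice", "smooth"],
    ["berry", "pepper", "oak"],
]
# Single collection of all unique words across the training reviews
training_vocab = set().union(*training_reviews)

# Toy tokenized candidate wines (also made up)
candidates = {
    "wine_a": ["berry", "oak", "vanilla", "berry"],
    "wine_b": ["citrus", "crisp", "mineral"],
}
scores = {name: match_count(toks, training_vocab) for name, toks in candidates.items()}
print(scores)
```

Using a set for the training vocabulary keeps each lookup cheap, which matters when the loop runs over more than 100,000 reviews.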
Once we’ve prepared the data by joining each entry of a row into a single string of text, tokenized the values, and removed the stopwords, we’re ready to look for the similarity of the data to our desired string of training text.
After applying some additional sorting and filtering, I scaled the new similarity variable (number of matching words) using a MinMaxScaler (which puts the data on a scale of 0–1 based on its value relative to the data’s range) and found the top results for both U.S. wines and international wines.
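For reference, the min-max scaling itself is simple arithmetic: subtract the minimum and divide by the range. The sketch below does it in plain Python on made-up match counts instead of calling scikit-learn's MinMaxScaler; returning zeros when every value is identical is a choice I made for the sketch.

```python
def min_max_scale(values):
    """Rescale numbers to the 0-1 range based on their position within the data's range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Every value is identical, so there is no range to scale against
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

match_counts = [2, 5, 8, 11]  # made-up similarity counts
scaled = min_max_scale(match_counts)
print(scaled)
```

The smallest count always maps to 0 and the largest to 1, so the scaled score only tells you how a wine ranks relative to the others in the same run.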
The limitation here is that the vocabulary used to describe a wine is essentially unlimited. One person can call a wine “fruity” while another can say “bursting with fruit flavors”. The meaning is essentially the same, but our system, which only matches exact words, wouldn’t pick up that similarity.
Though we’re bringing in more data and factors, we still have an inherent bias toward U.S. wines because the country and regions that match what she likes are included in our text strings. Therefore, to try and offset this, we’ll use only the description column, which is the review associated with each wine that describes flavor, ingredients, or other perceptions of quality by the reviewer.
While the first two recommenders were built mainly from scratch with the help of other libraries, we’ll be leveraging a pre-packaged library to do the heavy lifting for the final system. The goal will be to use only the description column to find the other wines in the data set that are the most similar to our training data. It’s roughly comparable to what was done in the second recommender but takes a more mathematical approach.
The core of this system will be built on what’s called Term Frequency Inverse Document Frequency (TFIDF). I found this article that explains the concept really well.
The high-level summary is that, given a series of text observations, term frequency (TF) is a measure of how often a word appears within a single text string (observation), while the inverse document frequency (IDF) is a measure of how rare that word is across all of the observations in the sample: the more observations a word appears in, the lower its IDF. The TFIDF value is then calculated as the product of the TF and IDF, so words that are frequent in one observation but uncommon overall score the highest.
In our data, the TF value for a word (let’s say “berry”) would be the number of times it appears within a wine description as a percentage of all words in that description, while the IDF would be based on how many of the wine descriptions in the data contain “berry”, shrinking as that number grows.
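Here's a small worked sketch of the textbook calculation on three made-up, already-tokenized descriptions, with TF as a word's share of its description and IDF as log(N / number of descriptions containing the word). This is an assumption-laden illustration: scikit-learn's TfidfVectorizer uses a smoothed, normalized variant by default, so its numbers would differ from these.

```python
import math
from collections import Counter

def tf(tokens):
    """Term frequency: each word's share of all words in one description."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

def idf(docs):
    """Inverse document frequency: log(N / number of descriptions containing the word)."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {word: math.log(n_docs / freq) for word, freq in doc_freq.items()}

def tfidf(docs):
    """TFIDF weight per word per description: TF multiplied by IDF."""
    idf_weights = idf(docs)
    return [{word: freq * idf_weights[word] for word, freq in tf(doc).items()}
            for doc in docs]

# Made-up tokenized descriptions
docs = [
    ["ripe", "berry", "finish"],
    ["berry", "oak", "spice"],
    ["crisp", "citrus", "finish"],
]
weights = tfidf(docs)
print(weights[0])
```

In the first description, "ripe" ends up with a higher weight than "berry" because "berry" also appears in a second description, which is exactly the down-weighting that IDF is meant to provide.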
After computing the TFIDF value for each word in each observation, we need a way to measure similarity between a single observation and the others in the data set. Again, bringing it back to the task at hand, this means taking Arielle’s preferred wines and finding the wines that have a description most similar to those wines based on their TFIDF values.
There are a number of different methods to compute similarity. After trying a few and doing some reading on different methods, I settled on cosine similarity. The article linked above on TFIDF does a good job of explaining cosine similarity too.
Using our data, which has been converted from text to numbers (their TFIDF value), we can plot those numeric points in a “vector space” and compute, using linear algebra and trigonometry, the cosine of the angle between that point’s vector and the vector of each of the other points plotted in the vector space.
If that doesn’t make much sense, it’s okay, because it’s a pretty confusing concept for me too. The output of cosine similarity ranges from -1 to 1, with -1 meaning two items point in exactly opposite directions and 1 meaning they point in the same direction. When working with text data, the TFIDF values are never negative, so the output range shrinks to 0 to 1.
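If it helps, the calculation is shorter than the explanation: take the dot product of the two vectors and divide by the product of their lengths. The sketch below is a plain-Python version on made-up word weights stored as dictionaries; the function and variable names are mine, and in practice a library routine would do this over the whole matrix at once.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty description has no direction, so call it "not similar"
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up TFIDF-style weights
liked = {"berry": 0.5, "spice": 0.3}
similar = {"berry": 0.4, "spice": 0.2, "oak": 0.1}
unrelated = {"citrus": 0.6, "crisp": 0.4}

print(cosine_similarity(liked, similar))
print(cosine_similarity(liked, unrelated))
```

Two descriptions that share no words score exactly 0, and because every weight is non-negative the score can never drop below that.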
The result, by looping through each of the wines that she enjoys, is a set of recommendations that are the most similar descriptions in the rest of the data.
As with the U.S. bias in the previous exercises, there’s a good chance that the flavors of U.S. wines, and the way they are described, differ from those of international wines, so in addition to the general recommendations, I filtered for only non-US wines and picked the top matching wines for that group as well.
One final note for this method was that, for the sake of computing similarity, I only used the wines that were already in Arielle’s “preferred varieties” (i.e. types of wine: Syrah, GSM, etc.) because the size of the file made it impractical to use the whole list of 130K+ descriptions locally on my computer.
Conclusion: Which wine should we pick?
Interestingly, there was no overlap between the three different systems, which goes to show the subjectivity that’s present within recommender systems. Each of the three recommenders takes a separate but plausible route to its recommendations, yet no two of them landed on the same wine.
Given that, after aggregating the results from each of the methods, we had about 60 “recommendations” to sort through. Since I can’t buy her all 60, it’s time to apply some “art” to the “science” of recommending.
I started by eliminating the wines that didn’t have an associated price or that were more than $50 (sorry Arielle — I can’t afford the really good stuff yet). Plus, as a Stax big data analysis shows, there’s plenty of value to be had in sparkling wines and Champagne under $50.
Then, knowing my sister and that GSM is one of her favorites within the varieties that she gave me, and inserting my own bias for Syrah, I filtered the data for only those two types of wine as well as the wines with a score of 90 or better.
Finally, I decided that I wanted one of each variety, and that one would be local to the U.S. while the other would be from outside of the country. Because there was only one international wine left at this point and it was a GSM, that became the easy choice for the GSM variety. For the Syrah, I applied a filter that would return the highest ranked (by points) Syrah made in the U.S.
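The narrowing steps above boil down to a handful of filters. Here's a sketch of them on a few made-up rows; the titles, scores, prices, and the "US" country label are hypothetical placeholders, not wines from the actual results.

```python
# Hypothetical aggregated recommendations (placeholder values only)
candidates = [
    {"title": "Wine A", "variety": "Syrah", "country": "US", "points": 93, "price": 40.0},
    {"title": "Wine B", "variety": "Syrah", "country": "US", "points": 91, "price": 28.0},
    {"title": "Wine C", "variety": "GSM", "country": "Australia", "points": 92, "price": 35.0},
    {"title": "Wine D", "variety": "GSM", "country": "US", "points": 94, "price": None},
    {"title": "Wine E", "variety": "Pinot Noir", "country": "France", "points": 95, "price": 45.0},
    {"title": "Wine F", "variety": "Syrah", "country": "US", "points": 96, "price": 80.0},
]

# Keep priced wines at $50 or less, GSM or Syrah only, scoring 90 or better
shortlist = [
    w for w in candidates
    if w["price"] is not None and w["price"] <= 50
    and w["variety"] in {"GSM", "Syrah"}
    and w["points"] >= 90
]

# One international GSM, plus the highest-scoring U.S. Syrah
international_gsm = [w for w in shortlist if w["variety"] == "GSM" and w["country"] != "US"]
us_syrah = [w for w in shortlist if w["variety"] == "Syrah" and w["country"] == "US"]
top_us_syrah = max(us_syrah, key=lambda w: w["points"])

print(international_gsm[0]["title"], top_us_syrah["title"])
```

The order of the filters doesn't change the outcome here, but applying the price cut first is the cheapest way to shrink the list before making the more subjective choices.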
And that’s it! Using data, we went from 130,000 wines down to just two. Our final winners were a 2010 California Syrah from the Fenaughty Vineyard in the Sierra Foothills and a 2011 Plexus GSM from South Australia.
The reality of finding somewhere that I can buy them and get them delivered is the next challenge, but the fun of the journey to get to the right wines makes it all worth it in the end.