Exploring and Classifying Wine Enthusiast Reviews

Introduction

Published in Data Insights · Jul 13, 2017 · 8 min read

The other day, I came across a dataset on Kaggle posted by Zach Thoutt which contained roughly 150,000 wine reviews scraped from Wine Enthusiast magazine. The dataset contains the price of the wine, where it originates, a written description from a sommelier, and an 80–100 point rating. I thought it would be interesting to explore the data and see if I could use the text descriptions to classify wines (and sound smarter when I review my next one).

Warning: When inspecting the data, I noticed there was quite a bit of duplication. After running drop_duplicates() in Pandas, total reviews decreased from 150,930 to 97,851.
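As a rough sketch of that deduplication step, here is the same call on a tiny made-up frame. The column names and values are assumptions for illustration, not rows from the Kaggle file.

```python
import pandas as pd

# Tiny stand-in for the Kaggle file; the rows are invented for illustration.
reviews = pd.DataFrame({
    "description": ["Ripe cherry and spice.", "Ripe cherry and spice.", "Crisp citrus finish."],
    "points": [90, 90, 87],
    "price": [25.0, 25.0, 15.0],
})

# Drop rows that are identical across every column, keeping the first copy.
deduped = reviews.drop_duplicates().reset_index(drop=True)
print(len(reviews), "->", len(deduped))
```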

You can find the Jupyter Notebook used to generate these charts at the GitHub repository.

Grading Wines

Wine Enthusiast rates wines on a 100-point scale, with only wines scoring 80 points or more receiving a written review. According to this blog post, the scores roughly correspond to:

Classic 98–100: The pinnacle of quality.
Superb 94–97: A great achievement.
Excellent 90–93: Highly recommended.
Very Good 87–89: Often good value; well recommended.
Good 83–86: Suitable for everyday consumption; often good value.
Acceptable 80–82: Can be employed in casual, less-critical circumstances.
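If you want to attach these bands to the data, one option is `pd.cut`. This is a minimal sketch with a handful of invented scores; the bin edges are simply the band boundaries from the list above.

```python
import pandas as pd

# Right-closed bins: (79, 82], (82, 86], (86, 89], (89, 93], (93, 97], (97, 100].
bins = [79, 82, 86, 89, 93, 97, 100]
labels = ["Acceptable", "Good", "Very Good", "Excellent", "Superb", "Classic"]

# Invented scores, one at the upper or lower edge of each band.
points = pd.Series([80, 86, 87, 93, 97, 100])
bands = pd.cut(points, bins=bins, labels=labels)
print(pd.DataFrame({"points": points, "band": bands}))
```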

Wine Spectator lists a few more details for its own rating program. Wikipedia has some additional background on the rating system and the controversy around it.

Point Distributions

If we look at the scores in the data, we see a pleasant, near-normal distribution of scores centered around the high 80s. Only 11 wines scored a perfect 100. The cheapest of them is the En Chamberlin Syrah from Cayuse Vineyards in Washington State for around $65 a bottle (probably more now since that was the 2007 vintage). The most expensive is the Clos du Mesnil Champagne from France, costing over $1,000 a bottle.

When looking at the distribution of points across different price ranges, there doesn’t seem to be too much of a bias, nor should there be. Since the wines are tasted blind, we should assume that they are earning their score based on their quality, not perceived value. The chart below uses a log scale for price because some bottles cost more than $3,000. We see a general upward trend with price, but nothing too suspicious.

To get a better view (with wines I could afford), the next chart limits prices to below $100. We see that more expensive wines tend to receive fewer scores in the low 80s. There was less of an impact on the maximum points. My takeaway from these charts is that once you get above $30-$40 per bottle, you start weeding out lesser-quality wines and can expect more consistency.
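One way to check this pattern without a chart is to bucket prices and compare the spread of points in each bucket. The sketch below uses invented prices and scores, and the bucket edges are an arbitrary choice for illustration.

```python
import pandas as pd

# Invented prices and scores; not taken from the dataset.
wines = pd.DataFrame({
    "price": [8, 12, 18, 25, 35, 45, 60, 90],
    "points": [80, 84, 83, 88, 87, 91, 90, 94],
})

# Keep bottles under $100, then bucket them into three price ranges.
under_100 = wines[wines["price"] < 100]
price_bin = pd.cut(under_100["price"], bins=[0, 20, 40, 100])

# Lowest, middle, and highest score observed in each price bucket.
summary = under_100.groupby(price_bin, observed=True)["points"].agg(["min", "median", "max"])
print(summary)
```

In this toy data the minimum score rises with each price bucket, which is the "fewer low scores" effect described above.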

Wines by Geography

I was curious how point distributions varied across regions. Wine Enthusiast breaks down geography by Country > Province > Region. I combined the last two layers as {Region, Province} and plotted the 80–100 point distributions for each one. The top 20 are ranked in descending order by number of reviews.
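A minimal sketch of building that combined label and picking the most-reviewed regions follows. It assumes the columns are named `province` and `region_1`, which may not match your copy of the data, and the rows are invented.

```python
import pandas as pd

# Invented rows; column names are assumptions about the dataset layout.
wines = pd.DataFrame({
    "province": ["California", "California", "California", "Tuscany", "Washington"],
    "region_1": ["Napa Valley", "Napa Valley", "Sonoma County", "Chianti Classico", None],
    "points": [92, 88, 86, 90, 91],
})

# Drop rows without a region, then join region and province into one label.
labelled = wines.dropna(subset=["region_1"]).copy()
labelled["region_label"] = labelled["region_1"] + ", " + labelled["province"]

# value_counts sorts by review count in descending order; keep the top 20 labels.
top_regions = labelled["region_label"].value_counts().head(20).index
top = labelled[labelled["region_label"].isin(top_regions)]
print(top["region_label"].value_counts())
```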

It’s interesting to see how some regions, like Napa Valley, span the full range of scores, while in other regions, like Tuscany or Champagne, scores are higher across the board. Other regions, like Sonoma County in California, tend to receive lower points overall than neighboring Napa Valley.

This made me think that perhaps some regions may excel in producing specific types of wine. So another way to look at this is to choose the top 30 regions and plot how well each region does against the top 30 types of wine.

In the heatmap below, I computed the 90th percentile of points for each combination of region and grape variety. In other words, 90% of the rated wines in that combination scored at or below this value. We’d expect variation among different wineries, so using percentiles helps cut down some noise.
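The percentile table behind such a heatmap can be sketched with a groupby and `quantile`. The frame below is invented, and `region_label` is a hypothetical column holding the combined region name.

```python
import pandas as pd

# Invented scores for two regions and two varieties.
wines = pd.DataFrame({
    "region_label": ["Napa Valley"] * 3 + ["Walla Walla Valley"] * 5,
    "variety": ["Merlot"] * 3 + ["Merlot"] * 3 + ["Syrah"] * 2,
    "points": [85, 90, 95, 88, 90, 92, 90, 94],
})

# 90th percentile of points per (region, variety), pivoted so that
# regions are rows and varieties are columns.
p90 = (
    wines.groupby(["region_label", "variety"])["points"]
    .quantile(0.9)
    .unstack("variety")
)
print(p90)
```

Combinations with no reviews come out as NaN, which corresponds to the cells with missing data in the heatmap.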

Within each column I highlighted the highest scoring region (dark purple) along with the 2nd & 3rd place regions (light purple). Regions scoring outside of the top three were shaded in gray while missing data was left white.
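The shading rule can be expressed as a per-column rank. This is only a sketch on an invented percentile table; the `shade` helper is hypothetical and just returns the color names used above.

```python
import numpy as np
import pandas as pd

# Invented 90th-percentile table: regions as rows, varieties as columns.
p90 = pd.DataFrame(
    {"Merlot": [94.0, 91.6, 90.0, 89.0], "Syrah": [np.nan, 93.6, 92.0, 95.0]},
    index=["Napa Valley", "Walla Walla Valley", "Tuscany", "Sonoma County"],
)

# Rank regions within each column; the highest score gets rank 1, NaN stays NaN.
ranks = p90.rank(ascending=False, method="min")

def shade(rank):
    """Map a rank to the shade described in the text (hypothetical helper)."""
    if pd.isna(rank):
        return "white"
    if rank == 1:
        return "dark purple"
    if rank <= 3:
        return "light purple"
    return "gray"

shades = ranks.apply(lambda col: col.map(shade))
print(shades)
```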

Two regions clearly stand out. Napa Valley receives high marks for its Cabernet Sauvignon, Red Blend, Malbec, and a couple of white wine blends. The Walla Walla Valley takes 1st across a few white varieties, like Chardonnay and Sauvignon Blanc, and also a couple of reds like Tempranillo. Both regions also hold top-three ranks for several grape varieties, making them the clear overall winners. The scores of other regions now have more context: Tuscany, for example, produces fewer varieties but higher quality Merlot and Syrah.

Classifying Wines

Since each wine should have been expertly described by someone trained in the art of wine tasting, I thought I could use the data to:

Classify wines based on description
Recreate the proper red-white groupings from descriptors

I ran the descriptions through scikit-learn’s TfidfVectorizer to find words which were useful in classifying the different varieties. The TF-IDF score gives more weight to words that occur frequently within a review, but down-weights words that occur across many reviews. For example, if “fruity” occurs in 99% of wine descriptions, then it is not a very informative word and is consequently down-weighted. Other words like “tobacco” may only appear in a subset of fuller-bodied wines and would be useful in classification, so they receive a higher weight.

In a slight twist, I only used a binary representation, meaning a 1 or 0 if the review did or did not mention the word. I was more interested in the number of reviews containing a term than in the number of times that word appeared within a review.
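Here is a minimal sketch of a binary TF-IDF on three invented descriptions. It assumes the binary behavior was obtained with `binary=True`, which makes the term-frequency part 0 or 1 while still applying the inverse-document-frequency weight; the original notebook may have done this differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented descriptions: "fruity" appears in every one, "tobacco" in only one.
descriptions = [
    "fruity fruity cherry with tobacco",
    "fruity citrus and crisp acidity",
    "fruity peach and honey",
]

# binary=True counts a word once per review, however often it is repeated.
vectorizer = TfidfVectorizer(binary=True)
tfidf = vectorizer.fit_transform(descriptions)
vocab = vectorizer.vocabulary_

# Weights for the first review: the rare word outweighs the common one.
row0 = tfidf[0].toarray().ravel()
print("fruity :", row0[vocab["fruity"]])
print("tobacco:", row0[vocab["tobacco"]])
```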

I removed some junk words like: “red”, “wine”, “white”, etc. On my first attempt, I didn’t realize that the review descriptions often mention the name of the wine in the text. Obviously this is cheating when building a predictive model (data leakage) so I removed any mentions of the varieties within the text as well.
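One way to strip both the junk words and the variety names is to pass them to the vectorizer as stop words. In this sketch the variety list and sentences are invented, and adding scikit-learn's built-in English stop words is my own assumption rather than something stated above.

```python
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Invented examples; in practice the variety names would come from the data.
varieties = ["Pinot Noir", "Chardonnay"]
junk_words = {"red", "wine", "white"}

# Split multi-word variety names into single lowercase tokens ("pinot", "noir").
variety_tokens = {tok for name in varieties for tok in re.findall(r"\w+", name.lower())}
stop_words = list(ENGLISH_STOP_WORDS | junk_words | variety_tokens)

vectorizer = TfidfVectorizer(binary=True, stop_words=stop_words)
vectorizer.fit([
    "This Pinot Noir is a red wine with cherry and spice",
    "A buttery Chardonnay, a white wine with citrus",
])

# Only the descriptive words survive, so the label can no longer leak in.
kept = set(vectorizer.vocabulary_)
print(sorted(kept))
```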

The chart below from Wine Folly shows what we are after: a hierarchy which uses these specialized words to group together wines based on their flavor profile.

[Figure: Wine Folly, Different Types of Wine]

Since each wine variety has multiple reviews, I took the TF-IDF matrix and averaged the scores of each variety’s reviews, term by term, in order to represent how the “average” wine in that variety was typically described.

Using scipy’s clustering library, I performed a hierarchical clustering calculation with cosine distances. Conceptually, this starts with a single variety, finds another variety that uses similar words with similar frequency, and groups them. The process is repeated with the new group to find another larger group until we end up with a single family. For brevity, I limited the chart below to only the top 60 varieties by review count.
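The clustering step can be sketched with `scipy.cluster.hierarchy`. The profiles below are invented, and the choice of average linkage is an assumption on my part, since only the cosine distance is stated above.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

varieties = ["Merlot", "Cabernet Franc", "Syrah", "Riesling", "Albariño"]

# Invented average TF-IDF profiles; columns: tannins, cherry, pepper, citrus, peach.
profiles = np.array([
    [0.8, 0.6, 0.1, 0.0, 0.0],
    [0.8, 0.5, 0.2, 0.0, 0.0],
    [0.6, 0.2, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.7, 0.6],
    [0.0, 0.1, 0.0, 0.8, 0.4],
])

# Pairwise cosine distances between variety profiles (condensed form).
distances = pdist(profiles, metric="cosine")

# Agglomerative clustering: repeatedly merge the two closest groups.
Z = linkage(distances, method="average")

# no_plot=True returns the tree layout without needing matplotlib.
tree = dendrogram(Z, labels=varieties, no_plot=True)
print(tree["ivl"])  # leaf order of the dendrogram
```

Each row of `Z` records one merge, so the first row holds the pair of varieties that are described most alike in this toy data.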

I’ve had mixed results doing this with other datasets so I was prepared to be let down. Instead, I was pleasantly surprised when I saw the above chart. Similar wines are actually very well grouped together. We even got a result that split out the reds from the whites! The lengths of the branches indicate how similar the descriptions of the wines are. For example, Merlot and Cabernet Franc are described much more similarly than either would be against a Syrah. I flipped through my Wine Folly guide and most of these seem to align very well to the proper families of wine.

Describing the wines

The chart is great, but what do the groups really mean? For each of the 8+1 clusters, I looped through and printed up to 5 random wine varieties and the top 20 words used to describe these wines (ranked by average TF-IDF score). The output below represents descriptive words for that grouping (in bottom-up order). Cluster 1 below represents the green bottom group in the chart above.
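A loop of this kind might look like the sketch below. It cuts the tree into flat clusters with `fcluster` and ranks terms by their mean score within each cluster. The profiles are invented, and the linkage method and the way the tree is cut are assumptions.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

# Invented variety profiles: rows are varieties, columns are terms.
variety_profiles = pd.DataFrame(
    [
        [0.8, 0.6, 0.1, 0.0, 0.0],
        [0.8, 0.5, 0.2, 0.0, 0.0],
        [0.6, 0.2, 0.8, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.7, 0.6],
        [0.0, 0.1, 0.0, 0.8, 0.4],
    ],
    index=["Merlot", "Cabernet Franc", "Syrah", "Riesling", "Albariño"],
    columns=["tannins", "cherry", "pepper", "citrus", "peach"],
)

Z = linkage(variety_profiles.to_numpy(), method="average", metric="cosine")

# Cut the tree into at most 2 flat clusters (9 in the article's case).
cluster_ids = fcluster(Z, t=2, criterion="maxclust")

summary = {}
for cid in np.unique(cluster_ids):
    members = variety_profiles[cluster_ids == cid]
    # Up to 5 random varieties and the top 3 terms by mean TF-IDF score.
    sample = members.sample(n=min(5, len(members)), random_state=0).index.tolist()
    top_words = members.mean().sort_values(ascending=False).head(3).index.tolist()
    summary[int(cid)] = {"varieties": sample, "top_words": top_words}
    print(f"Cluster {cid}: {', '.join(sample)} | {', '.join(top_words)}")
```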

Cluster 1
Varieties: Garganega, Glera, Prosecco, Moscato
Top Words: peach, aromas, honey, sweet, stone, fruit, mineral, flower, creamy, fresh, citrus, offers, tones, almond, pair, floral, mouth, apricot, crisp, opens

Cluster 2
Varieties: Sauvignon
Top Words: tomato, aromas, peach, exotic, fruit, palate, offers, crisp, fresh, acidity, mineral, stone, delivers, citrus, yellow, grapefruit, aromatic, green, mouth, note

Cluster 3
Varieties: Bordeaux-style White Blend, Rosé, Portuguese White, Champagne Blend
Top Words: acidity, crisp, fresh, ripe, character, fruity, drink, light, fruits, rich, texture, fruit, citrus, soft, attractive, ready, dry, bright, great, sample

Cluster 4
Varieties: Albariño, Verdejo, Torrontés
Top Words: aromas, palate, finish, citrus, nose, green, peach, lime, tropical, feels, fresh, lemon, drink, clean, good, fruit, acidity, feel, grapefruit, pineapple

Cluster 5
Varieties: Riesling, White Blend, Sémillon, Grüner Veltliner, Sauvignon Blanc
Top Words: acidity, citrus, pear, finish, fruit, crisp, peach, palate, dry, fresh, sweet, aromas, lemon, honey, ripe, green, notes, clean, rich, drink

Cluster 6
Varieties: Portuguese Red, Port, Gamay, Bordeaux-style Red Blend
Top Words: tannins, acidity, fruits, fruit, ripe, structure, wood, firm, juicy, rich, fruity, aging, character, drink, soft, sweet, dark, structured, fresh, dry

Cluster 7
Varieties: Barbera, Red Blend, Aglianico, Dolcetto, Primitivo
Top Words: cherry, aromas, spice, fruit, tannins, berry, bright, blackberry, offers, notes, tobacco, opens, palate, leather, dark, tones, pair, finish, expression, ripe

Cluster 8
Varieties: Tempranillo, Malbec, Tempranillo Blend, Garnacha, Carmenère
Top Words: aromas, palate, finish, berry, plum, herbal, nose, fruit, cherry, raspberry, feels, blackberry, oak, good, dark, notes, chocolate, spice, earthy, feel

Cluster 9
Varieties: Pinot Noir, Shiraz, Syrah, Petit Verdot, Cabernet Blend
Top Words: tannins, cherry, fruit, finish, blackberry, drink, dry, soft, oak, chocolate, sweet, palate, plum, aromas, ripe, rich, berry, dark, tannic, spice

Next Steps

In a follow-up to this analysis, I’d like to dive deeper into these three questions:

How are particular words correlated with one another? If we detect one or two flavors, is it more or less likely that we detect a third?
Using kNN, SVM, or LogisticRegression, can we build a predictive model to rank the likelihood of a wine belonging to one of the nine families based on descriptive words?
Can we build a recommendation engine to suggest possible wines to a user based on which flavors they like and which they dislike?