Creating the ‘Perfect’ Wine Using a Suite of Analytics Techniques (2 of 4)

Jan 13, 2020

Random forests and neural networks are just a few tools used to predict the ‘perfect’ wine in this 4-part explorative series.

Overview

This article is Part 2 of a four-part series exploring the various techniques used in a full analysis and decision modeling process. Using a collection of nearly 300,000 wine reviews scraped from Wine Enthusiast Magazine, we’ll explore possible patterns based on a number of factors. Topic modeling and text grouping will be used to describe the common aspects of wines based on particular groups like the type of wine, origin winery, or country. We’ll use random forests and neural networks to build predictive models, which we can then use to try estimating the rating of wines based on their features. Finally, we’ll look at which variables play a role in the rating of a wine and try to develop the “perfect” wine.

This portion of the study will focus on exploring the data. The data for this study can be downloaded from Kaggle.

Click here for part 1 of this study.

Exploring the Data

With our dataset prepared, we can now look for patterns in the wine reviews. As shown below, ratings are approximately normally distributed within the 80–100 range, with a mean rating of about 88. It’s important to consider the source of our data: Wine Enthusiast only posts reviews for wines with a rating of at least 80, so the distribution is cut off at that floor. Very few wines (about 0.5%) were rated at 96 or higher, which is important to remember for later.
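The article doesn't show how these summary figures were computed, so here is a minimal sketch in plain Python. The function name and the tiny sample list are made up for illustration; with the real data, the ratings would come from the dataset's points column (an assumed column name).

```python
from statistics import mean

def rating_summary(points, top_cutoff=96):
    """Return the mean rating and the share of wines rated at or above top_cutoff."""
    points = list(points)
    top_share = sum(p >= top_cutoff for p in points) / len(points)
    return mean(points), top_share

# Tiny made-up sample, NOT the real review data.
sample_points = [80, 84, 86, 87, 88, 88, 89, 90, 92, 96]

avg_rating, top_share = rating_summary(sample_points)
print(f"mean rating: {avg_rating:.1f}, share rated 96+: {top_share:.1%}")
```

On the full dataset, the same two numbers should land near the mean of 88 and the roughly 0.5% share described above.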

Price (below) is not normally distributed. A few outliers sit far above the majority of wine prices in this study. As mentioned earlier, outliers can affect imputed values if we’re not careful. However, because there are so few in such a large dataset, their impact is not significant: 97.5% of wines are priced at $105 or lower. Adjusting the scale to only include prices up to $150, we can see the distribution is still skewed right but much more informative. Most wines are priced within the $10–40 range, with 75% of wines priced at no more than $40. In our model building, we used wines in the $0–100 price range.
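As a rough illustration of the two price checks described here (the share of wines at or below a given price, and the $0–100 filter used for modeling), the sketch below uses plain Python. The helper names and sample rows are hypothetical, and the "price" key is an assumed column name.

```python
def share_at_or_below(prices, threshold):
    """Fraction of prices that are at or below the threshold."""
    prices = list(prices)
    return sum(p <= threshold for p in prices) / len(prices)

def within_price_range(rows, low=0, high=100):
    """Keep only the rows whose price falls inside [low, high]."""
    return [row for row in rows if low <= row["price"] <= high]

# Made-up rows with one extreme outlier, NOT the real data.
sample_prices = [12, 18, 25, 30, 38, 40, 65, 450]
sample_rows = [{"title": f"wine {i}", "price": p} for i, p in enumerate(sample_prices)]

share_40 = share_at_or_below((row["price"] for row in sample_rows), 40)
modeling_rows = within_price_range(sample_rows)

print(f"share priced at $40 or less: {share_40:.0%}")
print(f"rows kept for modeling: {len(modeling_rows)} of {len(sample_rows)}")
```

Filtering by a fixed range rather than trimming by percentile keeps the cutoff easy to explain, at the cost of dropping the few very expensive wines entirely.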

Most wines in this study came from the United States, France, or Italy. The top four producers of wine worldwide are (in ranking order) Italy, France, Spain, and the U.S., so if the sample were proportional to production, the U.S. would rank lower in this list. This indicates the dataset leans toward American wines. We can see in the lower chart that California is by far the largest source of wines in this study, which is unsurprising as it has nearly 4,400 wineries against 772 in Washington.

Our data includes 51 countries, 491 provinces, 757 varieties of wine, and over 19,000 unique wineries. Even if we had the computing power to process so many variations, the amount of information we’d gain would be overwhelming. So, we’ll instead use the 8 most common countries, provinces, and varieties for our analysis.
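One way to restrict the data to the most common values of a column is sketched below. This is an assumed approach in plain Python, not the author's code; the function name, the sample rows, and the "country" key are illustrative, and n is set to 2 only because the sample is tiny (the article uses 8).

```python
from collections import Counter

def keep_top_categories(rows, column, n=8):
    """Return the rows whose value in `column` is among the n most frequent, plus that set."""
    counts = Counter(row[column] for row in rows)
    top_values = {value for value, _ in counts.most_common(n)}
    kept = [row for row in rows if row[column] in top_values]
    return kept, top_values

# Made-up rows, NOT the real data.
sample_rows = [
    {"country": "US"}, {"country": "US"}, {"country": "US"},
    {"country": "France"}, {"country": "France"},
    {"country": "Italy"},
]

kept_rows, top_countries = keep_top_categories(sample_rows, "country", n=2)
print(sorted(top_countries), len(kept_rows))
```

The same helper could be applied in turn to the province and variety columns.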

Deciphering the Descriptions

The graphs above show the most common words used in descriptions of wines from the 8 most common countries. While some words are common between them, we can start to develop a profile of what defines wines from each country. Argentina tends to sport wines with a berry or plum flavor, while French wines seem to have a richer taste with terms like “crisp” and “black.” We see similar trends below in the descriptors by province. The Bordeaux province of France has similar mentions of ripe, rich flavors along with mentions of wood, presumably from the wooden vessels used to store the wine. The Mendoza province of Argentina uses words like berry, plum, and herbal.
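A word-frequency comparison like this can be approximated with a simple count of words per group. The sketch below is a minimal version under stated assumptions: the stop-word list is a short hand-picked one, the column names "country" and "description" are assumed, and the sample reviews are invented. The article's own text-processing pipeline is not shown and may differ.

```python
import re
from collections import Counter, defaultdict

# Short illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "and", "the", "of", "with", "wine", "is", "this", "in", "it", "flavors"}

def top_words_by_group(rows, group_col, text_col="description", n=3):
    """Map each group value to its n most frequent non-stop words."""
    counts = defaultdict(Counter)
    for row in rows:
        words = re.findall(r"[a-z]+", row[text_col].lower())
        counts[row[group_col]].update(w for w in words if w not in STOP_WORDS)
    return {group: [w for w, _ in c.most_common(n)] for group, c in counts.items()}

# Made-up reviews, NOT the real data.
sample_rows = [
    {"country": "Argentina", "description": "Berry and plum flavors with herbal notes"},
    {"country": "Argentina", "description": "Ripe plum and berry aromas"},
    {"country": "France", "description": "Crisp and rich with wood notes"},
    {"country": "France", "description": "Rich, ripe fruit and crisp acidity"},
]

top_words = top_words_by_group(sample_rows, "country", n=2)
print(top_words)
```

Passing a different group_col (for example a province or variety column) would give the other breakdowns discussed in this section.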

Looking at the comparison of wine varieties, we can see more obvious differences. Red wines like the Cabernet Sauvignon and Pinot Noir focus on cherry and berry flavors. White wines like the Sauvignon Blanc use more words like citrus and crisp, reflecting their lighter flavor. Rieslings, known for their fruity aromas, tend to mention terms like apple, lemon, and lime.

Lastly, we can see some patterns in the words associated with ratings. Using a scale of only 80–100 provides a small window for comparison, but we can still see some differences between the extremes of our scale. Higher-rated wines make frequent mention of tannins and rich or black flavors, suggesting richer wines are rated higher. Wines lower on the scale tend to mention “sweet”, “aromas”, and “cherry” more often, which suggests the fruitier wines aren’t rated as highly. In general, it appears rich red wines are considered higher quality than lighter white wines.
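To compare the extremes of the scale, one option is to measure how often a term appears in descriptions within a rating band. The sketch below assumes "points" and "description" column names, and the bands (94 and above, 83 and below) are arbitrary choices for illustration rather than the cutoffs used in the study. The sample reviews are invented.

```python
import re

def mention_rate(rows, term, low, high):
    """Share of reviews rated within [low, high] whose description contains the term as a word."""
    band = [row for row in rows if low <= row["points"] <= high]
    if not band:
        return 0.0
    hits = sum(term in re.findall(r"[a-z]+", row["description"].lower()) for row in band)
    return hits / len(band)

# Made-up reviews, NOT the real data.
sample_rows = [
    {"points": 97, "description": "Firm tannins and rich black fruit"},
    {"points": 95, "description": "Rich, dense, with polished tannins"},
    {"points": 88, "description": "Crisp citrus"},
    {"points": 82, "description": "Sweet cherry aromas"},
    {"points": 81, "description": "Simple and sweet with light aromas"},
]

rich_high = mention_rate(sample_rows, "rich", 94, 100)
sweet_high = mention_rate(sample_rows, "sweet", 94, 100)
sweet_low = mention_rate(sample_rows, "sweet", 80, 83)
cherry_low = mention_rate(sample_rows, "cherry", 80, 83)
print(rich_high, sweet_high, sweet_low, cherry_low)
```

Using a rate rather than a raw count keeps the comparison fair when one band contains far more reviews than the other, as is the case here with so few wines rated 96 or higher.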

Don’t miss the earlier work that got us to this point! Click here for part 1 of this study.

Written by Damon Roberts