Creating the ‘Perfect’ Wine Using a Suite of Analytics Techniques (1 of 4)

Damon Roberts
5 min read · Dec 11, 2019


Random forests and neural networks are just two of the tools used to predict the ‘perfect’ wine in this four-part exploratory series.

Photo by Maksym Kaharlytskyi on Unsplash

Overview

This article is Part 1 of a four-part series exploring the techniques used in a full analysis and decision-modeling process. Using a collection of nearly 300,000 wine reviews scraped from Wine Enthusiast Magazine, we’ll explore patterns across a number of factors. Topic modeling and text grouping will be used to describe the common aspects of wines within particular groups, such as the type of wine, the winery of origin, or the country. We’ll use random forests and neural networks to build predictive models, which we can then use to estimate a wine’s rating from its features. Finally, we’ll look at which variables play a role in a wine’s rating and try to develop the “perfect” wine.

This first part of the study focuses on cleaning and preparing the data, which can be downloaded from Kaggle.

Data Preparation

In this study, we’re provided two large datasets of wine reviews, which we’ll refer to as Wine130k and Wine150k after the approximate number of observations in each. The first step in preparing for analysis is merging these two files into a single data source for easier manipulation. We’ll use R (through RStudio) for most of our analysis, which requires some pre-processing before the files can be merged.

Wine130k contains 11 variables, while Wine150k contains those same 11 plus 3 more for the reviewer’s name, Twitter handle, and the wine’s title. Both files must contain the same variables before they can be merged, so the 3 extra variables are created in Wine130k with a value of NA for every observation. This lets us stack the two files by matching variables. The final result is a data frame with 280,901 observations of 14 variables each.
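In R, the merge itself takes only a few lines. The sketch below makes some assumptions: the file names and the three column names (taster_name, taster_twitter_handle, title) follow the Kaggle download, so adjust them to match your local copies.

```r
library(dplyr)

# Read both files; treat empty strings as missing values.
# File names are assumed from the Kaggle download.
wine130k <- read.csv("winemag-data-130k.csv", na.strings = c("", "NA"),
                     stringsAsFactors = FALSE)
wine150k <- read.csv("winemag-data-150k.csv", na.strings = c("", "NA"),
                     stringsAsFactors = FALSE)

# Create the three reviewer-related columns missing from Wine130k,
# filled with NA for every observation
wine130k$taster_name           <- NA_character_
wine130k$taster_twitter_handle <- NA_character_
wine130k$title                 <- NA_character_

# Stack the two frames by matching column names
wine <- bind_rows(wine130k, wine150k)
nrow(wine)  # 280,901 observations in the merged frame
```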

With our single dataset, we can now look at missing data and decide how to work around it. An initial analysis shows almost every observation is missing at least one variable: of our roughly 281,000 observations, less than 8% have a value for every variable. The taster_name, taster_twitter_handle, and title variables are expected to be missing most often, since half our dataset didn’t include them to begin with. region_2 is missing in about 60% of observations, designation in about 30%, and, perhaps most importantly, price in about 10%.
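A quick way to quantify this in base R, assuming the merged frame from above:

```r
# Share of observations with a value for every variable (a bit under 8% here)
mean(complete.cases(wine))

# Per-variable missingness, as a fraction of all rows
sort(colMeans(is.na(wine)), decreasing = TRUE)
```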

There are two main ways to deal with missing data: ignore incomplete observations, or attempt to fill in the missing values. We’ll use a combination of both approaches to build a suitable dataset. Using predictive mean matching from the ‘mice’ package in R, we can fill each missing price with an observed price borrowed from reviews whose model-predicted prices are closest to the prediction for the missing one, so imputed values always look like real ones. One risk of using this method is susceptibility to outliers, so we’ll want to double-check the data to make sure nothing alarming happened.
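Here is a minimal sketch of the imputation, assuming the numeric points rating serves as the predictor for price; the exact predictor set and seed used in the original analysis aren’t specified:

```r
library(mice)

# Impute missing prices with predictive mean matching (PMM). Only the
# two numeric columns are passed in to keep the run fast; m = 1 keeps
# a single completed dataset.
to_impute <- wine[, c("points", "price")]
imp <- mice(to_impute, m = 1, method = "pmm", seed = 123, printFlag = FALSE)
wine$price <- mice::complete(imp)$price

# Sanity check: compare the price distribution before and after
summary(to_impute$price)
summary(wine$price)
```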

Comparing summary statistics before and after imputing the missing price values, the mean changed from 34.18 to 34.47 while all other statistical measures stayed the same; even with some outliers, the dataset was large enough to dilute their effect. We can then decide how strictly we want to remove incomplete observations. Removing all observations with any missing value cuts the dataset down to about 22,000. If we instead tolerate missing values in the taster name, Twitter handle, and title variables, the final dataset contains 73,659 observations. While we do plan to study the relationship between tasters and their ratings, most of our analysis won’t suffer from missing those three variables, so we’ll move forward with this larger pruned dataset.
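The pruning step can be expressed directly with complete.cases(), treating the three reviewer-related columns as optional:

```r
# Keep rows that are complete in every variable except the three
# reviewer-related fields we've agreed to tolerate as missing
optional <- c("taster_name", "taster_twitter_handle", "title")
required <- setdiff(names(wine), optional)
wine_pruned <- wine[complete.cases(wine[, required]), ]
nrow(wine_pruned)  # 73,659 in the article's run
```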

Considerations

Something to consider for this study is how well the results will represent the larger population of all wines worldwide. Since we have neither the resources nor the interest to study the estimated millions of unique wines in the world, we’re working from a sample of roughly 300,000. Ideally this will give us information we can use to describe the worldwide selection, but there are some shortcomings we need to be aware of. First, the majority of wines in our data come from the United States, and specifically from California. By the end of the study we should feel confident making claims about wines from California, and possibly France and Italy, but we’ll need to be cautious about applying our findings to all wines in the world.

We also need to consider the size of our sampling frame. The equivalent of an estimated 36 billion 750mL bottles of wine is produced each year worldwide, so it’s safe to assume there are millions of variants in existence, from which we’ve collected reviews for about 280,000. After cleaning, we were left with about 73,600 wine reviews, and that’s without checking for duplicates. Our findings will therefore come from a set of wines making up well under 1% of the worldwide population. To be truly rigorous, we would collect more reviews from other sources after completing this study and repeat the same steps many times over. For the sake of this series, though, we’ll be satisfied with one full pass.

When we begin building our predictive models, we’ll still want to plan for future wine reviews in case the study is repeated. And in the case of predicting ratings from descriptions, we want a model that can take a brand-new review and accurately predict the wine’s rating. To do this, we’ll split our dataset into two subsets: a training set and a testing set. The ratio is flexible, but here we’ll randomly assign 70% of reviews to the training set and 30% to the testing set. The training set will be fed into our models to fit the algorithms that predict wine ratings.
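A simple way to make the 70/30 split in base R, with an arbitrary seed for reproducibility:

```r
set.seed(123)  # arbitrary; any fixed seed makes the split reproducible

# Randomly assign 70% of reviews to training and 30% to testing
n <- nrow(wine_pruned)
train_idx <- sample(n, size = round(0.7 * n))
train <- wine_pruned[train_idx, ]
test  <- wine_pruned[-train_idx, ]
```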

Once a model is complete, we’ll apply it to the testing set and check its rating predictions against the actual rating of each wine. Performing this test on every model gives us an overall accuracy measure for comparing which model fits best. A model with strong predictive accuracy on the testing set should perform similarly on other ‘unknown’ data.
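As a sketch of that evaluation step, assume a fitted model fit with a predict() method and points as the true rating column; the article doesn’t pin down its exact accuracy metric, and root-mean-square error is one reasonable choice for a numeric rating:

```r
# Score the held-out testing set and compare predictions to actual
# ratings; a lower RMSE means stronger predictive accuracy.
# 'fit' is a hypothetical fitted model from a later part of the series.
preds <- predict(fit, newdata = test)
rmse  <- sqrt(mean((preds - test$points)^2))
rmse
```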


Damon Roberts

Data Visualization Architect & Deep Learning Aficionado