Creating the ‘Perfect’ Wine Using a Suite of Analytics Techniques (3 of 4)

Damon Roberts
6 min read · Jan 14, 2020


Random forests and neural networks are two of the tools used to predict the ‘perfect’ wine in this four-part exploratory series.

Photo by Ales Me on Unsplash

Overview

This article is Part 3 of a four-part series exploring the various techniques used in a full analysis and decision modeling process. Using a collection of nearly 300,000 wine reviews scraped from Wine Enthusiast Magazine, we’ll explore possible patterns based on a number of factors. Topic modeling and text grouping will be used to describe the common aspects of wines based on particular groups like the type of wine, origin winery, or country. We’ll use random forests and neural networks to build predictive models, which we can then use to try estimating the rating of wines based on their features. Finally, we’ll look at which variables play a role in the rating of a wine and try to develop the “perfect” wine.

This portion of the study will focus on building predictive models from the data. The data for this study can be downloaded from Kaggle.

Click here for part 1 and part 2.

Pruning the Dataset

Before building a predictive model for our data, we need to make sure we have the resources to meet our needs. Our data includes 51 countries, 491 provinces, 757 varieties of wine, and over 19,000 unique wineries. Since these are all potentially significant predictors, we want to be able to include them in our model. Short of using a data center’s supercomputer, it’s unlikely we’ll have the hardware to process so many categorical levels in a single model. Thus, we’ll filter some of our variables down to what we expect to be the most useful information.

We decided to use the 8 most common provinces, varieties, and wineries for our model. As we saw in the previous data exploration, the majority of reviews come from a small percentage of geographic locations, so this shouldn’t drastically affect our model’s accuracy.
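As a rough sketch of that filter in Python with pandas (the file name follows the Kaggle 130k CSV, and requiring every row to sit in the top 8 of all three columns simultaneously is an assumption; the original analysis may have filtered each column differently):

```python
import pandas as pd

# Load the Kaggle wine reviews; column names ('province', 'variety',
# 'winery', 'points', 'price') follow the Kaggle CSV.
wine = pd.read_csv("winemag-data-130k-v2.csv")

# Keep only rows whose province, variety, and winery each fall among
# the 8 most common values for that column.
for col in ["province", "variety", "winery"]:
    top8 = wine[col].value_counts().nlargest(8).index
    wine = wine[wine[col].isin(top8)]

print(wine.shape)
```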

We also want to approach our ratings a little differently to make our results more accessible to the end user once the model is completed. A numeric scale can be useful to a shopper or winery, but giving each wine a quality designation should be more informative. We’ll assign each wine one of five quality labels based on the numeric rating it received: Perfection (97–100), Excellence (93–96), Very Good (89–92), Good (85–88), and Fair (80–84). After categorizing, we want to check the distribution to ensure this system makes sense before continuing.
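A minimal pandas sketch of that binning, continuing the `wine` DataFrame from above (the `rating` column name is hypothetical):

```python
# Bin the 80-100 point scale into the five quality labels.
bins = [79, 84, 88, 92, 96, 100]  # right-inclusive edges
labels = ["Fair", "Good", "Very Good", "Excellence", "Perfection"]
wine["rating"] = pd.cut(wine["points"], bins=bins, labels=labels)

# Sanity-check the distribution before continuing.
print(wine["rating"].value_counts())
```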

Distribution of Ratings Categories

The distribution of ratings fits our previous findings: roughly normal with a slight right skew. As we saw earlier, the distribution of points was similarly shaped, as few wines had a rating greater than 96. The graph below shows how ratings are spread among our top 8 countries.

Cross-Validation

When developing a model, there’s a risk of the model becoming “overfit” to the data. Overfitting means the model is great at making predictions on the data used to build it, but not very accurate when presented with new information. Because we want this model to be useful for future wine reviews, we need to prepare it for unknown data using a method called cross-validation. The full dataset is split into two parts: a training set, used to develop our model, and a testing set, which acts as ‘unknown’ data. We expect accuracy to be high on the training data, but the real test will be how accurate the model is on the testing dataset. We’ll compare each model’s accuracy on both sets after it’s developed.
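A minimal sketch of that holdout split with scikit-learn (the predictor columns, the 25% test fraction, and the random seed are all assumptions; `get_dummies` one-hot encodes the categorical columns so the models can use them):

```python
from sklearn.model_selection import train_test_split

# Drop rows missing a price, one-hot encode the categorical predictors,
# and hold out a portion of the data as the 'unknown' test set.
wine = wine.dropna(subset=["price"])
X = pd.get_dummies(wine[["country", "province", "variety", "winery", "price"]])
y = wine["rating"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
```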

For this study, we’ll use a method of cross-validation known as “k-fold” validation. K-fold splits the training set into k groups, usually 5 or 10 depending on the dataset’s size. One group is held out for evaluation while the rest of the data is used for training. A model is trained, evaluated on the held-out group, and then discarded, keeping only the evaluation score. This process repeats once for each group, and the scores are then summarized into a mean, variance, and other statistics for the model. Used in conjunction with the basic train/test split above, this should leave us with a strong model and minimal bias.
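Here’s a sketch of k-fold scoring in scikit-learn (the decision tree is just a stand-in; any classifier can be dropped in, and the shuffle seed is arbitrary):

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 10-fold CV: each fold takes one turn as the held-out evaluation group,
# and only the ten accuracy scores are kept.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(), X_train, y_train, cv=cv)
print(f"mean accuracy: {scores.mean():.4f}, std: {scores.std():.4f}")
```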

Random Forest Model

A random forest builds numerous decision trees, each with factors varied at random, then combines their predictions into an optimal ensemble of trees. Our forest contains 75 trees and uses 10-fold cross-validation, as described above, to prevent a biased model. After training the model, we evaluated its accuracy on both the training and testing datasets. The results for each test are below.

Random Forest summary as evaluated on the training and testing data sets

We can see that evaluating the forest on our training set resulted in an accuracy of about 0.6644. Accuracy on our testing set is somewhat lower at 0.5724, which still indicates a reasonably strong model. The naïve model, which predicts with no information beyond how common each rating class is, scores around 0.3787. Since our model performs significantly better than no model at all, it’s safe to assume this is a good fit. A p-value of virtually 0 supports that decision: at the 95% confidence level used here, a p-value below 0.05 indicates the model’s accuracy is significantly better than the naïve baseline.
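A rough scikit-learn sketch of this comparison (the seed is arbitrary and the exact scores above won’t reproduce; the `DummyClassifier` plays the role of the naïve model):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier

# Fit a 75-tree forest and compare train vs. test accuracy.
forest = RandomForestClassifier(n_estimators=75, random_state=42)
forest.fit(X_train, y_train)
print("train accuracy:", forest.score(X_train, y_train))
print("test accuracy: ", forest.score(X_test, y_test))

# Naive baseline: always predict the most common rating class.
naive = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("naive accuracy:", naive.score(X_test, y_test))
```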

Next we’ll try building a neural network.

Neural Network Model

Neural networks are loosely structured like the human brain. Input nodes send data from the variables to layers of nodes, which process the data in an attempt to identify patterns, much as our brains evolved to examine an object’s characteristics and decide whether it’s food or a predator. Also like our brains, neural networks get stronger as they gain experience from studying more data. Ultimately, the model should be able to take in information and output the classification of an observation. In our case, the network will look at information like a wine’s country of origin or price and identify which rating label to expect.

Neural networks are complex models, and there are a number of ways to fine-tune their parameters for the best fit. The most important are the size of the network, which sets how many nodes are used in a layer, and the decay. Decay has a complicated background, but in essence it helps prevent overfitting by gradually reducing the effect certain parameters have on the model. We could specify fixed values for each, but we want to find the optimal model for predicting our wine ratings. We’ll search for the best model with a size between 1 and 12 and a decay rate between 0.1 and 0.5, comparing the accuracy of all 60 configurations.
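A sketch of that grid search using scikit-learn’s `MLPClassifier` (an assumption here: its L2 penalty `alpha` stands in for the weight-decay parameter described above, and `max_iter` is set arbitrarily to let training converge):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Grid-search all 60 configurations: hidden-layer sizes 1-12 crossed
# with five decay rates, each scored with 10-fold cross-validation.
param_grid = {
    "hidden_layer_sizes": [(n,) for n in range(1, 13)],
    "alpha": [0.1, 0.2, 0.3, 0.4, 0.5],
}
search = GridSearchCV(MLPClassifier(max_iter=500, random_state=42),
                      param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 4))
```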

The resulting accuracy plot shows the outcome of our search. Each colored line represents a decay rate, and each point on the graph shows the accuracy of a model with a particular layer size. In the top-right corner, the 0.1 decay rate achieves the best accuracy with a size of 10, so we’ll build our network with those specifications. We can see the results of this network below.

Fitted Neural Network summary as evaluated on the training and testing sets

Most importantly, the accuracy on our testing set was about 0.6265, suggesting a stronger model than our random forest. Our p-value also supports the model’s strength. The kappa value compares the model’s accuracy to that of a random system: higher values indicate greater accuracy over random guessing, while values near zero suggest a model is not much better than chance. In this case, a kappa of 0.4508 indicates our model is substantially stronger than a random one and would be a useful tool for predicting wine ratings.
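The statistic described here is Cohen’s kappa; a sketch of computing both metrics for the tuned network from the search above:

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Evaluate the tuned network on the held-out test set.
y_pred = search.best_estimator_.predict(X_test)
print("test accuracy:", accuracy_score(y_test, y_pred))
print("kappa:        ", cohen_kappa_score(y_test, y_pred))
```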

Don’t miss the earlier work that got us to this point! Click here for part 1 and part 2.
