Pairing Wine with Machine Learning

JP Velasquez
May 7 · 4 min read

Building a series of ML models to identify what determines the highest quality wine

Image Source: https://www.independent.co.uk/extras/indybest/food-drink/wine/best-non-alcoholic-wine-low-alcohol-b1780023.html

I’ll never forget my first time ordering wine at a restaurant.

I had just turned 21 and was excited to finally order a drink at dinner with my parents. When I asked the waitress for a wine recommendation, she flipped it back on me and asked what I was looking for in the wine. Having zero clue what to say (or what the question even meant), I sat there in silence for a second before admitting my ignorance — it made me felt pretty dumb!

When I got home that night I spent a few minutes on google to find some wine vocab words to keep in my back pocket. While these have been helpful, the truth is that I still have no idea what deems a wine “good”.

So in order to answer that question (and avoid feeling dumb when ordering wine again), I decided to build a series of Machine Learning models to predict the quality of a wine based on its characteristics.

In this article, I’ll walk you through the process of building the models and discuss my findings.

Data Overview

The dataset I used contains 1,600 wines with information on the following characteristics for each:

  • Fixed Acidity: contribute to sourness

Each wine in the dataset also has a 1–10 rating of its quality. So the goal of building the models is to try and predict the quality of the wine based on the levels of the above characteristics.

Building the Models

In Machine Learning, its important to do some pre-processing on the data in order to get the best possible model performance and avoid problems like overfitting.

The first thing I did was manipulate the raw data so the ‘Quality’ variable went from being a score of 1–10 to choosing an arbitrary cutoff point so that it becomes a binary value of the wine being either “good” or “bad”. By simplifying the data and creating a true classification problem, the accuracy of each model increased significantly. After that, I used a tool to normalize the range of the features.

Next, a key step in any Machine Learning problem is splitting the data into training and testing sets. These models work by learning trends in the training data since the training data contains the “answer” (in this case, the wine quality). The testing set is a new segment of the same data with the “answers” removed. By feeding the the testing set into the model so it can make predictions, we’re able to get an accuracy score to see how well the model works on new data.

I built five different types of models to find which one would perform the best: Logistic Regression, Bagging, Boosting, Random Forest, and Voting Ensemble.

Figure 1

Key Findings

As figure 1 shows, all five of the models performed similarly well. The Random Forest model performed the best with an overall accuracy score of 83%. This means that if you give this model information on 100 different wines, it will be able to accurately predict whether the wine is good or bad 83 times.

The Random Forest works by making a bunch of decision trees and joining them together to create the most accurate outcome. This structure is particularly helpful because it provides empirical insight into which features in the data are most important for prediction.

Figure 2

For our wine problem, the level of alcohol seems to be the most important feature by a significant margin, according to the Random Forest. Sulphates, Volatile Acidity, and Total Sulfur Dioxide are also significant factors in determining the quality of wine.

These findings can be useful for winemakers because it provides a framework for which wine properties they should focus on during production in order to create the highest quality wine.

And perhaps more importantly, I can use these findings to go back to that restaurant where I ordered my first glass of wine and not feel dumb. When the waitress asks me what I’m looking for, I’ll take a peak at my Feature Importance chart and gladly tell her I’m looking for a balanced amount of Alcohol, Sulphates, and Volatile Acidity.

Geek Culture

Proud to geek out. Follow to join our +500K monthly readers.

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store