I’ll never forget my first time ordering wine at a restaurant.
I had just turned 21 and was excited to finally order a drink at dinner with my parents. When I asked the waitress for a wine recommendation, she flipped it back on me and asked what I was looking for in the wine. Having zero clue what to say (or what the question even meant), I sat there in silence for a second before admitting my ignorance — it made me feel pretty dumb!
When I got home that night I spent a few minutes on Google finding some wine vocab words to keep in my back pocket. While those have been helpful, the truth is that I still have no idea what makes a wine “good”.
So in order to answer that question (and avoid feeling dumb when ordering wine again), I decided to build a series of Machine Learning models to predict the quality of a wine based on its characteristics.
In this article, I’ll walk you through the process of building the models and discuss my findings.
- Fixed Acidity: contributes to sourness
- Volatile Acidity: high levels give bad vinegar taste
- Citric Acid: contributes to ‘fresh’ flavor
- Residual Sugar: makes wine mellow and flavorful
- Chlorides: saltiness
- Free Sulfur Dioxide: protects wine from spoiling
- Total Sulfur Dioxide: also protects from spoilage
- Density: correlated with high quality and wine ‘legs’
- pH: acidity
- Sulphates: preservative
Each wine in the dataset also has a 1–10 rating of its quality. So the goal of building the models is to predict the quality of a wine based on the levels of the above characteristics.
Building the Models
In Machine Learning, it’s important to do some pre-processing on the data in order to get the best possible model performance and avoid problems like overfitting.
The first thing I did was transform the ‘Quality’ variable: instead of keeping the 1–10 score, I chose an arbitrary cutoff point and converted it into a binary label of either “good” or “bad”. Simplifying the target this way turns the task into a true classification problem, and the accuracy of each model increased significantly as a result. After that, I normalized the range of the features.
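As a rough sketch of those two pre-processing steps (the cutoff value and feature values here are made up for illustration; the real dataset supplies eleven features per wine):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical quality scores — the real dataset rates each wine 1-10
quality = np.array([3, 5, 6, 7, 8, 4])

# Binarize: scores at or above an arbitrary cutoff (here 7) count as "good"
cutoff = 7
labels = (quality >= cutoff).astype(int)  # 1 = good, 0 = bad

# Normalize the feature ranges to [0, 1] with min-max scaling
features = np.array([[7.4, 0.70], [7.8, 0.88], [6.3, 0.30],
                     [8.1, 0.56], [7.2, 0.23], [6.7, 0.58]])
scaled = MinMaxScaler().fit_transform(features)

print(labels)                       # [0 0 0 1 1 0]
print(scaled.min(), scaled.max())  # 0.0 1.0
```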
Next, a key step in any Machine Learning problem is splitting the data into training and testing sets. A model learns trends from the training data, which contains the “answer” (in this case, the wine quality). The testing set is a held-out segment of the same data with the “answers” removed. By feeding the testing set into the model so it can make predictions, we get an accuracy score that shows how well the model works on new data.
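With scikit-learn this split is one function call. The toy arrays and the 80/20 split below are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the wine features (X) and binary quality labels (y)
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the wines as the unseen testing set;
# stratify keeps the good/bad ratio the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```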
As Figure 1 shows, all five of the models performed similarly well. The Random Forest model performed the best with an overall accuracy score of 83%. This means that if you give this model information on 100 different wines, it will correctly predict whether the wine is good or bad for about 83 of them.
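Training the Random Forest and scoring it looks roughly like this (I'm substituting a synthetic dataset here, so the printed accuracy won't match the 83% from the real wine data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in: 200 "wines" with 11 features, like the real dataset
X, y = make_classification(n_samples=200, n_features=11, random_state=0)
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Accuracy = fraction of held-out wines classified correctly
acc = accuracy_score(y_test, model.predict(X_test))
print(round(acc, 2))
```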
The Random Forest works by building a bunch of decision trees and combining their votes to produce a more accurate prediction. This structure is particularly helpful because it provides empirical insight into which features in the data are most important for prediction.
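A fitted forest exposes those insights through its `feature_importances_` attribute. The feature names below mirror the wine characteristics, but the scores come from synthetic data, so treat the output shape (not the values) as the point:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names mirroring the wine characteristics
names = ["fixed acidity", "volatile acidity", "citric acid", "residual sugar",
         "chlorides", "free SO2", "total SO2", "density", "pH",
         "sulphates", "alcohol"]

X, y = make_classification(n_samples=200, n_features=11, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances sum to 1; a higher score means the trees split on
# that feature more usefully
for name, score in sorted(zip(names, forest.feature_importances_),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {score:.3f}")
```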
For our wine problem, the level of alcohol seems to be the most important feature by a significant margin, according to the Random Forest. Sulphates, Volatile Acidity, and Total Sulfur Dioxide are also significant factors in determining the quality of wine.
These findings can be useful for winemakers because they provide a framework for which wine properties to focus on during production in order to create the highest quality wine.
And perhaps more importantly, I can use these findings to go back to that restaurant where I ordered my first glass of wine and not feel dumb. When the waitress asks me what I’m looking for, I’ll take a peek at my Feature Importance chart and gladly tell her I’m looking for a balanced amount of Alcohol, Sulphates, and Volatile Acidity.