Wine quality prediction with Python

Published in Analytics Vidhya, 7 min read, Sep 21, 2020

Warning: This is a long article. If you only want the machine learning code, go straight to the last section. If you came to learn something from the data, please read through the whole process and bring me your questions.

First, let’s address the elephant in the room: “Why do we have to detect the quality of wine?” and “Can quality be predicted?” The latter question arises because a person’s preference in wine tasting is subjective. In assessing quality, we will focus only on the objective part, the actual quality such as balance and aroma, not the flavors. The former question tends to come from people who have not yet entered the realm of wine drinking. Unlike many other foods, wine cannot simply be classified as good or bad; many metrics have to be considered.

Now for the real question of this statistical investigation:

“Can the quality of wine be determined without wine tasting techniques? In other words, can we use the predictive techniques of machine learning to quantify wine quality at scale and with less expertise?”

This article examines whether statistical techniques can determine the quality of wine without the need for individual inspection by a wine taster.

DATA

This dataset is a common open dataset on Kaggle. You can find plenty of articles on Medium that apply cool, advanced machine learning algorithms to it, such as Random Forest, neural networks and Support Vector Machines, and report the accuracy, precision, recall and F1 score of the model without doing the proper groundwork of understanding where wine quality comes from. Of course, the power of machine learning is that the person who implements it does not need to understand the inputs to get a great result. But the machine will surely learn better if the person supervising it understands how the inputs relate to the output.

Let’s do machine learning the proper way, as in statistics. Our data is the red and white wine data with their physicochemical properties. The wines are red and white variants of the Portuguese Vinho Verde, and the document for the data mining study can be found here. First, let’s get an overview of the physicochemical features.

Understanding data

Fixed acidity: the non-volatile acids found in wine, which are tartaric, malic, citric, and succinic. All of these acids originate in grapes with the exception of succinic acid, which is produced by yeast during the fermentation process.

Volatile acidity: the acidic elements of a wine that are gaseous rather than liquid, and can therefore be sensed as a smell (an aroma) rather than on the palate.

citric acid: a weak organic acid, often used as a natural preservative or as an additive that gives food or drink a sour taste and freshness.

residual sugar: the amount of sugar remaining after fermentation stops. It’s rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet.

chlorides: the amount of salt in the wine

free sulfur dioxide(SO2): SO2 is used throughout all stages of the wine-making process to prevent oxidation and microbial growth. Excessive amounts of SO2 can inhibit fermentation and cause undesirable sensory effects.

total sulfur dioxide: the amount of free and bound forms of SO2. In low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of the wine.

density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

pH: Winemakers use pH as a way to measure ripeness in relation to acidity. Low pH wines will taste tart and crisp, while higher pH wines are more susceptible to bacterial growth. Most wines have a pH of around 3 to 4.

Sulphates: a wine additive that can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.

alcohol: the percent alcohol content of the wine.

Now that we have a rough understanding of our features, let’s look at our dataset.

RED VS. WHITE

In the world of wine, the difference between red and white wine is not just the color but also the ingredients and the methods of wine-making and aging. Because of these differences, the physicochemical properties will act as different determinants when it comes to quantifying the quality of wine.

The white and red wine datasets have 4898 and 1599 data points respectively. They all contain the same features. Luckily, the data collection is well-defined, with no missing data, and is ready for analysis right away.
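The row count and missing-value check described above can be sketched with the standard library alone. This is a minimal sketch, not the author’s code: `SAMPLE_CSV`, its values, and the `summarize_csv` helper are hypothetical, and the `;` delimiter is an assumption about how the file is stored. The sample deliberately contains one blank cell so the check has something to find.

```python
import csv
import io

# Toy stand-in for a wine CSV; the values are made up for illustration only.
SAMPLE_CSV = """fixed acidity;alcohol;quality
7.4;9.4;5
7.8;;5
6.3;10.1;6
"""

def summarize_csv(text, delimiter=";"):
    """Return (row_count, {column: number_of_empty_cells})."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    columns = list(reader.fieldnames)
    missing = {name: 0 for name in columns}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in columns:
            value = row.get(name)
            # A cell counts as missing if it is absent or blank.
            if value is None or value.strip() == "":
                missing[name] += 1
    return n_rows, missing

n_rows, missing = summarize_csv(SAMPLE_CSV)
print(n_rows, missing)
```

On the real files, a result with zero missing cells in every column would match the statement that the data needs no cleaning.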

Bar chart of the quality ratings of white and red wine

The chart shows the quality ratings of white and red wine. The top panel shows white wine quality and the bottom panel shows red wine quality. As you can see, both have a similar distribution: it is rare to see a wine rated higher than 7 or lower than 5. This makes sense, since wine making requires a lot of process, technique and class. Making a truly bad wine seems to be as hard as making a great one, while making a fine wine (quality 5 or 6) is acceptable and common.

At first we might think that bad wine is hard to make or that people hardly ever make it, but it is more likely that many people did make bad wine. Many bad wine makers simply went bankrupt, and only those with enough revenue stayed in the business. On the other hand, high-quality wine makers might be rare, or wine making might be so capital intensive that only those with money can produce great wine. To find out, we would need to inspect the wine making process more closely, reverse engineering from the chemistry, to conclude that capital is required to make great wine. Unfortunately, I do not want to drag this article into the wine making process.
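The distribution behind such a bar chart can be tabulated in a few lines. This is a rough sketch under stated assumptions: `toy_qualities` is made-up data, not the wine dataset, and `quality_distribution` and `text_bar_chart` are hypothetical helper names.

```python
from collections import Counter

def quality_distribution(qualities):
    """Map each quality score to its share of all wines, sorted by score."""
    counts = Counter(qualities)
    total = sum(counts.values())
    return {score: counts[score] / total for score in sorted(counts)}

def text_bar_chart(distribution, width=40):
    """Render the shares as one line of '#' characters per quality score."""
    lines = []
    for score, share in distribution.items():
        bar = "#" * round(share * width)
        lines.append(f"{score} | {bar} {share:.1%}")
    return "\n".join(lines)

# Made-up quality scores, concentrated around 5 and 6 like the article describes.
toy_qualities = [4] * 2 + [5] * 6 + [6] * 8 + [7] * 3 + [8] * 1
distribution = quality_distribution(toy_qualities)
print(text_bar_chart(distribution))
```

Running the same function on the white and red quality columns separately would give the two panels of the chart.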

Now let’s look at the feature most correlated with our target

The boxplot shows the distribution of alcohol for each quality level of white (top) and red (bottom) wine; the line inside each box is the median, not the mean. From this boxplot we can assume that the relationship between quality and alcohol is positive but non-linear. Alcohol seems to have a negative relationship with quality across ratings 3, 4 and 5, and a positive one across 5, 6, 7 and 8. This gives us the intuition that alcohol does a pretty good job for prediction but will have confusing thresholds and overlapping data around quality 3, 4, 5 and 6.
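The per-quality summary that a boxplot visualizes can be computed directly. A minimal sketch, assuming the data is available as `(quality, alcohol)` pairs; `toy_rows` is made-up data and `alcohol_by_quality` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean, median

def alcohol_by_quality(rows):
    """Group alcohol values by quality and summarize each group."""
    groups = defaultdict(list)
    for quality, alcohol in rows:
        groups[quality].append(alcohol)
    return {
        quality: {"n": len(values), "median": median(values), "mean": mean(values)}
        for quality, values in sorted(groups.items())
    }

# Made-up (quality, alcohol) pairs for illustration only.
toy_rows = [
    (5, 9.4), (5, 9.8), (5, 10.0),
    (6, 10.5), (6, 11.0),
    (7, 11.8), (7, 12.4), (7, 12.0),
]
summary = alcohol_by_quality(toy_rows)
for quality, stats in summary.items():
    print(quality, stats)
```

Comparing the medians across quality levels shows where the trend changes direction, and comparing median with mean hints at skew within a level.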

With this knowledge of how the features co-vary with the target, we know at least that machine learning will be able to use this information as predictive power. However, a good predictive model should perform at least better than a null prediction (using the most frequent class as the predicted value for all data) or a random prediction (randomly assigning the predicted value according to the class probability distribution). So let’s say that for a classifier to count as good, its accuracy should exceed the benchmark of 44.87 percent. This number is the accuracy someone with no knowledge of wine would achieve by assigning the most frequent quality they have seen to every wine.
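The most-frequent benchmark is simply the share of the largest class. A minimal sketch, with `toy_labels` as made-up data and `most_frequent_baseline` as a hypothetical helper name:

```python
from collections import Counter

def most_frequent_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Made-up quality labels: 100 wines, with quality 6 the most common.
toy_labels = [4] * 5 + [5] * 30 + [6] * 45 + [7] * 18 + [8] * 2
majority_label, majority_accuracy = most_frequent_baseline(toy_labels)
print(majority_label, f"{majority_accuracy:.2%}")
```

Any classifier whose accuracy does not clear this number has learned nothing beyond the class frequencies.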

A random predictor that draws its guesses from the class distribution has an expected accuracy equal to the sum of the squared class proportions, which cannot exceed the accuracy of the most-frequent predictor. Because it is random, its accuracy also varies from run to run, giving a range around the mean instead of a single value.
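To see how a random predictor behaves, here is a small simulation next to the analytic expectation. It is a sketch on made-up labels, assuming the random predictor samples each guess from the observed label frequencies; the function names and `random_labels` are hypothetical.

```python
import random
from collections import Counter

def expected_random_accuracy(labels):
    """Expected accuracy when guesses follow the class proportions: sum of p_i squared."""
    n = len(labels)
    return sum((count / n) ** 2 for count in Counter(labels).values())

def simulate_random_accuracy(labels, trials=200, seed=0):
    """Return (lowest, mean, highest) accuracy over repeated random guessing."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(trials):
        # Each guess is drawn from the empirical label distribution.
        guesses = rng.choices(labels, k=len(labels))
        hits = sum(guess == truth for guess, truth in zip(guesses, labels))
        accuracies.append(hits / len(labels))
    return min(accuracies), sum(accuracies) / trials, max(accuracies)

# Made-up quality labels for illustration only.
random_labels = [4] * 5 + [5] * 30 + [6] * 45 + [7] * 18 + [8] * 2
expected = expected_random_accuracy(random_labels)
lowest, average, highest = simulate_random_accuracy(random_labels)
print(f"expected={expected:.4f} simulated mean={average:.4f} range=({lowest:.2f}, {highest:.2f})")
```

On these toy labels the most frequent class holds 45 percent of the data, while the expected accuracy of random guessing is lower, and individual runs scatter around that expectation.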

Model

So the model just needs to beat roughly 44 percent; above that threshold we can say the model is doing a reasonably good job at prediction. I fit two models: a decision tree classifier and a multinomial logistic regression.

The logistic regression gives an accuracy of 52.34 percent, about 7.5 percentage points above the most-frequent predictor, and the decision tree gives an accuracy of 61.43 percent, about 16.6 percentage points above it.

According to the decision tree’s feature importances, the most influential variable is alcohol at 16 percent, followed by volatile acidity and free sulfur dioxide at 11 percent.

These are pretty good models. However, a model should also be able to explain the phenomena behind better wine quality. That will be covered next week, with other models and data.

CODE
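What follows is a minimal sketch of the workflow described in the Model section, not the original code. It assumes scikit-learn is installed and uses synthetic stand-in data from `make_classification` (11 features, 3 classes) so that it runs without the wine CSV files. The train/test split, the `StandardScaler` step and every hyperparameter are assumptions for illustration, so the printed accuracies will not match the figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 11 physicochemical features and the quality label.
X, y = make_classification(
    n_samples=1000, n_features=11, n_informative=6, n_redundant=2,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    # Null model: always predicts the most frequent class.
    "most_frequent": DummyClassifier(strategy="most_frequent"),
    # Logistic regression on standardized features (scaling is an assumption).
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.4f}")

# Impurity-based importances from the fitted tree, one value per feature.
importances = models["decision_tree"].feature_importances_
for index, importance in enumerate(importances):
    print(f"feature_{index}: {importance:.3f}")
```

To apply this to the wine data, `X` would hold the eleven feature columns and `y` the quality column; the feature indices printed at the end would then map to names such as alcohol and volatile acidity.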

Reference

Residual sugar: https://www.decanter.com/learn/residual-sugar-46007/

Difference between red and white wine: https://winefolly.com/tips/red-wine-vs-white-wine-the-real-differences/

Sulfur dioxide: https://www.laboratoire-obst.com/so2-total-en.html