A Data Science Approach to Wine Tasting: Exploring the Wine Quality Dataset -Part 1

Sep 13, 2023

This is a classic machine learning project using a dataset that lists the physicochemical features of red and white wines, along with a target variable denoting “Quality”. I have divided this article into two parts: Part 1 covers the exploratory data analysis (EDA) of the dataset, and Part 2 covers the steps of the classification project.

Please also refer to “Machine Learning for Wine Lovers: Building a Classification Model for Wine Quality, Part 2” by Deepa Pandit (Medium, Sep 2023).

The Dataset

The wine quality dataset was created by Paulo Cortez and his colleagues from the University of Minho, Portugal. The wines covered in the dataset are from the Vinho Verde region of Portugal. The goal of creating this dataset was to model wine quality based on the physicochemical features and sensory tests while exploring the potential of machine learning techniques for this task.

The physicochemical tests measured various properties of the wines, such as acidity, alcohol, pH, etc. The sensory tests assigned a quality score to each wine sample, from 0 (very bad) to 10 (very good). The aim of the creators' paper was to use machine learning techniques to predict wine quality based on the physicochemical features. Such a model would benefit wine certification entities, wine producers, and consumers, as well as professional or hobby wine tasters.

Classification or Regression?

According to the creators of the dataset, the data can be used for both classification and regression tasks, depending on how the quality score is treated. If the quality score is considered a discrete variable, it is a classification task. If the quality score is considered a continuous variable, it is a regression task. In this project, I have treated the quality score as a discrete variable and applied classification techniques.

What does the data reveal?

Calling the pandas .info() method on the data reveals its basic structure:

It consists of 13 columns and 6,497 rows.
The ‘type’ column is categorical and denotes the two types of wine, red and white.
The ‘quality’ column is ordinal, meaning the order of its numeric values carries meaning.
The other 11 columns are numerical, consisting of floats and integers.
Some columns contain null values, probably information lost in processing, but too few to meaningfully affect the analysis.
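The inspection described above can be sketched as follows. This is a minimal illustration, not the original notebook: the small CSV sample is made up and includes only a few of the columns, so that the snippet runs on its own. With the real file you would pass its path to pd.read_csv() instead.

```python
import io
import pandas as pd

# Made-up sample that mimics the dataset's layout (values are illustrative only).
csv_text = """type,fixed acidity,volatile acidity,pH,alcohol,quality
white,7.0,0.27,3.00,8.8,6
white,6.3,0.30,3.30,9.5,6
red,7.4,0.70,3.51,9.4,5
red,,0.88,3.20,9.8,5
white,8.1,0.28,,10.1,7
"""
df = pd.read_csv(io.StringIO(csv_text))

df.info()                      # column names, dtypes, and non-null counts
n_rows, n_cols = df.shape      # (number of rows, number of columns)
null_counts = df.isna().sum()  # number of missing values per column

print(n_rows, n_cols)
print(null_counts[null_counts > 0])  # only the columns that have missing values
```

Comparing the non-null counts from .info() with the total row count is a quick way to judge whether the missing values are few enough to drop.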

Data preparation

I dropped the rows containing null values and then one-hot encoded the ‘type’ column, converting it from a categorical to a numerical variable with the pandas get_dummies() function.

As a result, two new columns, ‘red’ and ‘white’, are created, each indicating the presence (1) or absence (0) of that type of wine. The original ‘type’ column is dropped because it is unnecessary for further analysis: the new columns carry the same information. Dropping redundant columns also helps to avoid multicollinearity, a problem that occurs when two or more variables are highly correlated and that can affect the accuracy and stability of a model. Note that the two new columns are themselves perfectly correlated (each is 1 minus the other), so one of them can also be dropped for models that are sensitive to this.
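A minimal sketch of this preparation step is shown below. It assumes the missing entries are removed with dropna() and that the dummy columns are named after the wine types; the exact arguments used in the original project are not shown in the article, and the sample data is made up.

```python
import pandas as pd

# Made-up sample with one missing pH value (illustrative only).
df = pd.DataFrame({
    "type": ["white", "red", "red", "white"],
    "pH": [3.0, 3.5, None, 3.3],
    "quality": [6, 5, 5, 7],
})

# Drop every row that contains at least one missing value.
clean = df.dropna()

# One-hot encode 'type'. get_dummies() removes the original column itself;
# the empty prefix makes the new columns simply 'red' and 'white'.
encoded = pd.get_dummies(
    clean, columns=["type"], prefix="", prefix_sep="", dtype=int
)

print(encoded.columns.tolist())
print(encoded)
```

Passing drop_first=True to get_dummies() would keep only one of the two indicator columns, which is one way to sidestep the redundancy between them.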

Data distribution and bivariate analysis.

Data distribution

Data distribution can help to identify the shape, center, spread, and outliers of a variable. It can also help to determine the type of distribution, such as normal, skewed, or bimodal. Knowing the distribution of a variable can help to choose the appropriate statistical tests and models for analysis or inference.

In this dataset, the variables “fixed acidity”, “volatile acidity”, “citric acid”, “residual sugar”, “chlorides”, “free sulfur dioxide”, “total sulfur dioxide”, “density”, and “sulphates” are all right-skewed, meaning that they have a long tail on the right side and most of the values are concentrated on the left side. This indicates that these variables have some outliers or extreme values that are much higher than the rest of the data.
The variable “pH” is approximately symmetric, meaning that it has a roughly bell-shaped curve and the values are evenly distributed around the center. This suggests that the variable is approximately normally distributed, which is a common assumption for many statistical tests and models.
The variable “alcohol” is left-skewed, meaning that it has a long tail on the left side and most of the values are concentrated on the right side. This indicates that this variable has some outliers or extreme values that are much lower than the rest of the data.
The variable “quality” contains values ranging from 3 to 9, with higher values indicating better quality. Most wines fall in the middle of this range, between 5 and 7, and 6 is the most common rating.

Let's take a look at the distribution of each feature in the data:
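Skewness and rating frequencies like those described above can be checked numerically as well as visually. The sketch below uses made-up values, not the real dataset: a positive skew() value indicates a right-skewed variable, a negative value a left-skewed one, and a value near zero a roughly symmetric one.

```python
import pandas as pd

# Made-up values for illustration only.
sample = pd.DataFrame({
    "residual sugar": [1.2, 1.5, 1.8, 2.0, 2.2, 2.5, 3.0, 14.0],  # one large outlier
    "pH": [3.0, 3.1, 3.2, 3.3, 3.3, 3.4, 3.5, 3.6],               # symmetric
    "quality": [5, 6, 6, 6, 7, 5, 6, 7],
})

# Sample skewness per column: > 0 right-skewed, < 0 left-skewed, about 0 symmetric.
skewness = sample[["residual sugar", "pH"]].skew()
print(skewness)

# Number of wines per quality rating, ordered by rating.
quality_counts = sample["quality"].value_counts().sort_index()
print(quality_counts)
```

On the full dataset, running skew() over all numeric columns gives a quick cross-check of what the histograms show.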

Fixed acidity : This is the sum of the organic acids that are present in wine, mainly tartaric, malic, citric and succinic acids. These acids contribute to the taste, color, and stability of the wine. The average level of fixed acidity in wine ranges from 4 to 12 grams per liter (g/L), depending on the grape variety and the climate. The data is within the average range.
Volatile acidity : This refers to the amount of acetic acid that is produced by bacteria or yeast during fermentation or aging. Acetic acid contributes to the level of pungency and sourness. The average level of volatile acidity in wine should ideally be less than 0.4 g/L. Volatile acidity is considered a wine fault when it exceeds the sensory threshold of 0.6 to 0.9 g/L for red wines and 0.5 to 0.7 g/L for white wines. Our data has wines exceeding these thresholds.
Citric acid: The plot shows some wines with a citric acid content of 0. However, citric acid is not very important to wine quality, as it is only present in small amounts in grapes and is mostly consumed by bacteria during fermentation.
Residual sugar : This is an important component of wine, as it balances out the sourness caused by acids and enhances the taste of wine. The level of sugar relative to the acidity determines whether the wine is dry, off-dry, sweet, and so forth. The visual in “Sugar in Wine” from Grape to Glass (grape-to-glass.com) illustrates this well. Our dataset has wines that range from bone-dry and dry to sweet.
Chlorides : Chlorides are not very important to wine quality, as they are usually present in low levels and do not have a significant impact on the taste, color, or stability of the wine.
Free and total sulfur dioxide : These are important in wine as they add freshness and crispness.
Density : The mean density in the dataset is 0.99, which places the wines in the ideal range of density.
pH : A very important factor contributing to wine quality as it affects the acidity of wine. Highly acidic wines with a pH value of less than 3 need other features to balance it out, while the ideal range for wines remains 3.2 to 3.4 with the right amount of acidity. Beyond this value, wines become less acidic and may need additives to increase robustness. There are some highly acidic and less acidic wines in the dataset.
Sulphates : Wines with lower acidity need more sulphates than wines with higher acidity. At pH 3.6 and above, wines are much less stable, and sulphates are necessary for shelf life.
Alcohol : The mean alcohol level is 10.4% ABV, which is a moderately low alcohol content for wine. The average alcohol content of wine ranges from 11% to 13% ABV, but some wines can be as low as 5.5% ABV (such as Moscato d’Asti) or as high as 23% ABV (such as fortified wines like Port or Sherry). The alcohol content of wine is not a direct indicator of its quality, but rather a result of various factors such as the style of wine, the quality level, the climate where the grapes grow, and the winemaking decisions. Different wines have different optimal alcohol levels depending on their grape variety, acidity, sweetness, tannin, and balance. Some wines may taste better with higher alcohol, while others may benefit from lower alcohol.
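The checks mentioned in this list, such as wines exceeding the volatile acidity thresholds, wines with no citric acid, or wines with very low or high pH, can be expressed as simple boolean filters. The sketch below uses made-up values and takes the lower bound of each quoted threshold range (0.6 g/L for red, 0.5 g/L for white) as an assumption; the cut-offs used in the original analysis are not stated.

```python
import pandas as pd

# Made-up sample (illustrative only).
wines = pd.DataFrame({
    "type": ["red", "red", "white", "white", "white"],
    "volatile acidity": [0.70, 0.40, 0.55, 0.28, 0.30],
    "citric acid": [0.00, 0.30, 0.25, 0.36, 0.00],
    "pH": [3.51, 3.20, 2.95, 3.30, 3.65],
})

# Assumed per-type thresholds: lower bound of the ranges quoted in the text.
va_threshold = wines["type"].map({"red": 0.6, "white": 0.5})

exceeds_va = wines["volatile acidity"] > va_threshold  # possible wine fault
zero_citric = wines["citric acid"] == 0                # no citric acid at all
low_ph = wines["pH"] < 3.0                             # highly acidic
high_ph = wines["pH"] >= 3.6                           # less acidic, less stable

summary = pd.Series({
    "volatile acidity above threshold": int(exceeds_va.sum()),
    "citric acid equal to 0": int(zero_citric.sum()),
    "pH below 3.0": int(low_ph.sum()),
    "pH 3.6 or above": int(high_ph.sum()),
})
print(summary)
```

The same masks can be used to pull out the affected rows, for example wines[exceeds_va], for a closer look.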

Bivariate analysis

It is useful to carry out a bivariate analysis of the data, which examines the relationship between two variables. For example, to understand the relationship between alcohol content and quality, a bivariate analysis can reveal the correlation between them (though not, on its own, causation) and the distribution of the means and medians of a feature across each quality value.

The following boxplots and scatterplots show how the quality of wine varies with each of its physicochemical properties. We can draw some important inferences from these graphs, such as:

7–9 quality wines have a lower volatile acidity which is desirable in a good quality wine.
Mean residual sugar is very low, and most wines lie in a range of up to about 8.1, which indicates that the wines are largely dry.
Density is ideal among more than 75% of the wines.
pH levels are within the ideal range of 3.2 to 3.6, barring outliers. Wines of quality 5, 6, and 7 have outliers with pH values beyond 3.6, and these correspondingly show a higher sulphate content to add acidity.
Wines of quality 7 to 9 show a higher alcohol content, though it is still a low alcohol level for wine in general.
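The numbers behind such boxplots can be sketched with a groupby and a correlation table. This example uses made-up values, and the choice of Spearman rank correlation is an assumption made here because quality is ordinal; the article does not say which correlation measure was used.

```python
import pandas as pd

# Made-up sample (illustrative only).
wines = pd.DataFrame({
    "quality": [5, 5, 6, 6, 7, 7, 8],
    "alcohol": [9.4, 9.8, 10.0, 10.5, 11.2, 11.8, 12.5],
    "volatile acidity": [0.70, 0.66, 0.45, 0.40, 0.30, 0.28, 0.25],
})

# Mean and median of each feature within every quality level
# (the median is the center line a boxplot would draw).
by_quality = wines.groupby("quality")[["alcohol", "volatile acidity"]].agg(
    ["mean", "median"]
)
print(by_quality)

# Rank correlation of each feature with quality.
# This measures association only; it does not establish causation.
corr_with_quality = wines.corr(method="spearman")["quality"].drop("quality")
print(corr_with_quality)
```

In this toy sample, alcohol rises and volatile acidity falls with quality, mirroring the pattern described in the list above.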

My insights from the dataset

It is generally accepted among sommeliers that no single feature or factor determines the quality of a wine. The wine quality dataset is based on the subjective evaluation of wine experts, who rated the wines on a scale from 0 to 10 based on sensory attributes such as appearance, aroma, flavor, and mouthfeel. The quality ratings do not necessarily reflect the objective quality of the wines, as different people may have different preferences and expectations for wine.

The dataset can help us understand how the physicochemical properties of wine affect its sensory attributes, such as appearance, aroma, flavor, and mouthfeel.

It can certainly help wine growers and producers to determine the choice of procedures to use to create a certain type of wine.

It can also enable us to create a classification or predictive model that can recommend wines that match different types of cuisines.

Therefore, the wine quality dataset does not convey the absolute or universal quality of the wines, but rather the relative or perceived quality of the wines according to a specific group of experts.

Written by Deepa Pandit