Classifying the quality of red wine: from gathering data to pipeline creation and deployment

João Lucas Farias
5 min read · May 31, 2022


This project was developed with the collaboration of Júlio Freire.

How can you tell the quality of a wine without tasting it?

Bottle Shock (2008)

Wine is a historic drink, with evidence of its existence dating back to 4000 BC. Since then, humanity has been drinking wine and trying to improve the quality and taste of this ancient beverage. Over time, wine experts have specialized in grading wines by smell, consistency, flavor and many other characteristics.

One way of learning to tell the quality of a wine without a wine expert is to use pre-existing technical information. In our case, we used the Red Wine Quality dataset from Kaggle, which contains several physicochemical features of wines.

Motivation

This project is a case study to exercise and validate the knowledge acquired during the first part of the graduate-level Machine Learning course taught by Professor Ivanovitch Silva at the Federal University of Rio Grande do Norte (UFRN), Brazil. This wine quality dataset was chosen because both members of the team, Júlio Freire and I, share an interest in wine.

About the Data

As mentioned before, the dataset was taken from Kaggle: Red Wine Quality Dataset. The dataset has almost 1.6k instances (or rows) and 12 features (or columns) including the target label, ‘quality’. All features are explained below.

Fixed Acidity — A feature of wine, it is the level of acidity produced by non-volatile acids (mostly tartaric acid).

Volatile Acidity — A feature of wine, it is a measure of acetic acid. High levels can bring an unpleasant vinegar taste.

Citric Acid — A feature of wine, it is the amount of citric acid, which can bring “freshness” to the wine.

Residual Sugar — A feature of wine, it is the amount of sugar remaining after the fermentation process.

Chlorides — A feature of wine, it is the amount of salt in the wine.

Free Sulfur Dioxide — A feature of wine, it is the amount of free SO_2, which prevents microbial growth and oxidation.

Total Sulfur Dioxide — A feature of wine, it is the total amount of free and bound SO_2.

Density — A feature of wine, it is the weight per unit of volume.

pH — A feature of wine, it describes how acidic or basic the wine is.

Sulphates — A feature of wine, it is the amount of sulphates present in the wine.

Alcohol — A feature of wine, it is the level of alcohol in the wine.

Quality — The target variable: the quality grade of the wine, which we want to predict from the features above.

In this dataset, the target ‘quality’ is a discrete grade in the range from 3 to 8.

Extract, Transform and Load (ETL)

In this first stage, we used Google Colab to upload the raw dataset to Weights & Biases, analyzed the overall state of the dataset through Exploratory Data Analysis (EDA) and preprocessed it. The EDA and preprocessing steps go hand in hand. In the EDA, we identified duplicated rows and missing values and searched for patterns in the data. Through statistical tools, we found correlations between features and the existence of outliers. We also noticed that the target feature was imbalanced, with too few rows with quality between 3–4 and 7–8. In the preprocessing step, we removed the duplicated rows and transformed the target feature into two groups of wine: those with quality below 6.5 are labeled ‘bad’ and those above 6.5 are labeled ‘good’.
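A minimal sketch of the preprocessing step, using a hypothetical four-row frame standing in for the Kaggle CSV (column names match the real data, values are made up):

```python
import pandas as pd

# Toy stand-in for the Kaggle CSV (winequality-red.csv)
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 11.2],
    "quality": [5, 5, 5, 7],
})

# Drop the exact duplicate rows found during EDA
df = df.drop_duplicates()

# Binarize the target: quality above 6.5 becomes 'good', everything else 'bad'
df["quality"] = df["quality"].apply(lambda q: "good" if q > 6.5 else "bad")
print(df["quality"].tolist())  # ['bad', 'bad', 'good']
```

The same two operations, applied to the full 12-column dataset, produce the cleaned artifact used downstream.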

Data Check and Segregation

This step is as simple as its name: we used pytest to check whether the data is in accordance with what we expect (for example, whether each feature falls inside a specified range) and segregated the dataset into train and test sets, using a 70/30 ratio. These artifacts (along with all others generated during this project) can be found in our W&B project, under the Artifacts section.
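A sketch of the check-and-split step on a toy frame; the pH bounds are illustrative, and the real project runs one such pytest check per feature:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical mini-frame standing in for the checked dataset
df = pd.DataFrame({
    "pH": [3.2, 3.4, 3.0, 3.5, 3.3, 3.1, 3.6, 3.2, 3.4, 3.3],
    "quality": ["bad"] * 5 + ["good"] * 5,
})

# A pytest-style check: every pH value must fall inside the expected range
def test_ph_range():
    assert df["pH"].between(2.7, 4.0).all()

# Segregate the data with a 70/30 train/test ratio
train, test = train_test_split(df, test_size=0.30, random_state=42)
print(len(train), len(test))  # 7 3
```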

Train and Test

Here we trained the classifier chosen for this project: a Decision Tree. Since this project serves as an educational tool, we are not concerned with raw performance but with creating a full pipeline in which to train and test Machine Learning algorithms.

Before training took place, we split the original train set into train and validation sets using the same ratio as before: we train the model on the train set and validate its performance on the validation set. Next, we removed outliers and encoded the target feature. It is worth stating that the outlier removal must take place during this step (and only on the train set) to avoid data leakage. The encoding transformed the categorical values ‘bad’ and ‘good’ into 0 and 1, respectively.
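The split, outlier removal and encoding can be sketched as follows on toy data. The IQR fence is an assumption on our part, since the article does not name the exact outlier rule the project used:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy train artifact; 25.0 plays the role of an alcohol outlier
train = pd.DataFrame({
    "alcohol": [9.4, 10.2, 9.8, 10.5, 9.6, 25.0],
    "quality": ["bad", "good", "bad", "good", "bad", "good"],
})

# Split the original train set again, 70/30, into train and validation
train_set, val_set = train_test_split(train, test_size=0.30, random_state=42)

# Remove outliers on the train set ONLY, to avoid data leakage
# (an IQR fence here; the project's exact rule is an assumption)
q1, q3 = train_set["alcohol"].quantile([0.25, 0.75])
fence = 1.5 * (q3 - q1)
train_set = train_set[train_set["alcohol"].between(q1 - fence, q3 + fence)]

# Encode the target: 'bad' -> 0, 'good' -> 1 (alphabetical order)
encoder = LabelEncoder().fit(["bad", "good"])
y_train = encoder.transform(train_set["quality"])
print(list(encoder.classes_))  # ['bad', 'good']
```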

Then we created the pipeline with a numerical feature selector to normalize the numerical values of the dataset; the options were MinMax scaling, standard scaling (z-score) or no scaling. Since there are no categorical features in the dataset (excluding the target variable), no categorical processing was needed. Afterwards, we used a W&B sweep to perform hyperparameter tuning, varying the normalization method (MinMax, z-score or none), the criterion of the Decision Tree (Gini or entropy) and its splitter (random or best). W&B ran all of the configurations and presented us with the model of highest accuracy, which we exported as the best model. The label encoder used was also exported to W&B.
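A sketch of one pipeline configuration, alongside a wandb-style grid over the three swept choices; the toy data and step names are our assumptions, not the project's exact code:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the eleven wine features
X, y = make_classification(n_samples=100, n_features=11, random_state=0)

# The three knobs the sweep varied, written as a wandb grid config
sweep_config = {
    "method": "grid",
    "parameters": {
        "scaler": {"values": ["minmax", "zscore", "none"]},
        "criterion": {"values": ["gini", "entropy"]},
        "splitter": {"values": ["best", "random"]},
    },
}

# One configuration out of the 3 * 2 * 2 = 12 the sweep would run
scalers = {"minmax": MinMaxScaler(), "zscore": StandardScaler(), "none": "passthrough"}
pipe = Pipeline([
    ("scaler", scalers["minmax"]),
    ("classifier", DecisionTreeClassifier(criterion="gini", splitter="best", random_state=0)),
])
pipe.fit(X, y)
print(pipe.score(X, y))  # a fully grown tree fits the training data exactly: 1.0
```

In the real sweep, each configuration is scored on the validation set rather than the training data, and W&B reports the best-performing combination.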

FASTAPI and CI/CD with Heroku

FastAPI is a quick and easy way to deploy ML models as a web service, in combination with Heroku. With the Heroku and GitHub CI/CD integration, any modification pushed to our repository is automatically deployed. To access our app, just follow this link.

Conclusion

After doing this project, we were able to develop a deeper understanding of each stage of pipeline creation and deployment, along with the good programming practices taught in class. The model achieved an accuracy of 83% on the test set, a good result.

The GitHub repository with all the files created in the project can be found here.


João Lucas Farias

Lecturer and doctoral student at the Federal University of Rio Grande do Norte, Brazil. Passionate about physics, mathematics, engineering and music.