Machine Learning for Wine Lovers: Building a Classification Model for Wine Quality — Part 2

Sep 14, 2023

This is the second part of the project on wine quality where I take you through the steps to build a machine learning model that can predict the quality of wine using different classification algorithms.

Please also refer to A Data Science Approach to Wine Tasting: Exploring the Wine Quality Dataset — Part 1.

EDA recap

The EDA of the wine quality dataset gave us enough insight into the data to now build our machine learning model. To recap:

There were originally 13 columns and 6497 rows. The ‘type’ column consisted of the 2 types of wine, Red and White. This column was one-hot encoded into two columns, ‘Red’ and ‘White’, using get_dummies() with values set as 0 (absence) and 1 (presence).
The ‘type’ column was dropped to minimise redundancy.
Missing values were found in the dataset, but they were too few to affect the analysis, so the rows containing them were removed.
The resultant dataset consisted of 14 columns and 6463 rows.
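
The recap steps can be sketched roughly as follows. This is a minimal illustration on a tiny made-up DataFrame, not the notebook's actual code: the column values are hypothetical, and the exact get_dummies arguments used in Part 1 are an assumption.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical stand-in for the wine data (the real file has 13 columns and 6497 rows).
df = pd.DataFrame({
    "type": ["Red", "White", "White", "Red"],
    "alcohol": [9.4, 10.1, np.nan, 11.2],
    "pH": [3.51, 3.20, 3.26, 3.16],
    "quality": [5, 6, 6, 7],
})

# One-hot encode 'type' into two 0/1 columns, then drop the original column.
dummies = pd.get_dummies(df["type"], dtype=int)
df = pd.concat([df, dummies], axis=1).drop(columns="type")

# Remove the few rows that contain missing values.
df = df.dropna().reset_index(drop=True)

print(df.shape)
print(df.columns.tolist())
```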

Target Variable

The target variable is the ‘quality’ column, which has values from 3 to 9 in ordinal progression. That is, 3 is the lowest quality in this dataset, while 9 is purportedly the highest.

Most of the wines have a quality of 5 or 6, and very few have a quality of 3 or 9. The data is highly imbalanced, which can make it hard to compare or measure the quality of different wines.

Binarizing the quality column

In a binary classification problem, the goal is to assign one of two labels to each input, for example whether a wine is good or not based on its features. To do this, we need to binarize the quality column, which in this dataset contains numerical ratings from 3 to 9. Binarizing the quality column simplifies the problem so that standard binary classification algorithms can be applied.

However, quality ratings in the dataset are not objective or universal measures of good or bad wine quality. They may depend on various factors such as personal preferences, food pairings, or cultural backgrounds.

Therefore, it is useful to choose a threshold for binarizing the quality column that reflects the specific application and domain knowledge. In this case, I set the threshold at 5: any wine with a quality rating above 5 is labelled “high” quality (1) and any wine rated 5 or below is labelled “low” quality (0). This is purely my own choice and it may not be suitable for other purposes or contexts.

The dataset then looks like this, with a new column called ‘quality_binary’.

With 4091 wines in category 1 and 2372 in category 0, the data is still imbalanced.
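
A minimal sketch of the binarization, on made-up quality scores. It assumes that a wine rated exactly 5 falls into the low class (0), which is consistent with the class counts reported above; the constant name THRESHOLD is hypothetical.

```python
import pandas as pd

# Hypothetical quality scores; the real column ranges from 3 to 9.
df = pd.DataFrame({"quality": [3, 4, 5, 6, 7, 8, 9, 5, 6]})

THRESHOLD = 5  # assumption: ratings above 5 become 1 ("high"), the rest 0 ("low")
df["quality_binary"] = (df["quality"] > THRESHOLD).astype(int)

# Class counts show how balanced the two categories are.
print(df["quality_binary"].value_counts())
```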

Coping with imbalanced data

Imbalanced data is a common problem in machine learning classification, and there are several tactics for dealing with it. One frequently suggested approach is to use tree-based models such as decision trees, random forests, or gradient boosted trees. These models often cope reasonably well with moderate imbalance because they split the data on different criteria and create subgroups that are more balanced. They can also capture complex interactions and non-linear relationships among the features.

X, y, Train Test Split

In machine learning, ‘X’ and ‘y’ are used to represent the input features and target variable, respectively.

X contains the input features of the dataset. Each row represents a sample, and each column represents a feature.

y contains the target variable of the dataset. It represents the output or label that the model will predict from the input features.

To apply classification, the data is split into a training set and a testing set. The training set is used to train the model, while the testing set is used to evaluate its performance.
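
As a rough sketch of this step on synthetic stand-in data: the 80/20 split, the random_state value and the use of stratify are my assumptions, not settings confirmed by the notebook. Note that the original ‘quality’ column is dropped from X along with the binary target, so that the label cannot leak into the features.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned wine data (column values are made up).
rng = np.random.default_rng(0)
quality = rng.integers(3, 10, size=100)
df = pd.DataFrame({
    "alcohol": rng.uniform(8, 14, size=100),
    "pH": rng.uniform(2.8, 3.8, size=100),
    "quality": quality,
    "quality_binary": (quality > 5).astype(int),
})

# X holds the input features; y holds the binary target.
X = df.drop(columns=["quality", "quality_binary"])
y = df["quality_binary"]

# Assumed settings: 20% test set, fixed seed, class proportions preserved in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```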

Preprocessing

Feature scaling

The wine quality dataset contains both numeric and categorical features. The categorical feature has already been one-hot encoded, so only the numeric features remain to be processed. The numeric features include metrics from objective tests such as acidity levels, pH values, and alcohol content, while the target variable is a numerical score based on sensory data.

In this project I use feature scaling as the first step. Since each feature is measured on a different numerical range, feature scaling transforms the input features to a common scale. This can help to improve the performance of some machine learning algorithms, such as classifiers that rely on distance-based metrics (for example K-Nearest Neighbors).

Here I set a variable ‘scaler’ to MinMaxScaler() to scale the features. I use this scaler because the numeric features are not guaranteed to follow a normal distribution, and min-max scaling maps each feature to the range 0 to 1 without changing the shape of its distribution. One can also use RobustScaler, as quite a few columns contain outliers.
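
A small sketch of min-max scaling on a few made-up rows (the numbers are illustrative, not taken from the dataset). The scaler is fitted on the training rows only and then reused on the test rows, which is the usual way to avoid leaking test information.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up rows: three features on very different ranges.
X_train = np.array([[7.4, 0.70, 9.4],
                    [7.8, 0.88, 9.8],
                    [11.2, 0.28, 10.0]])
X_test = np.array([[8.0, 0.50, 9.6]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from the training data
X_test_scaled = scaler.transform(X_test)        # reuse the same min/max on the test data

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```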

Principal Component Analysis

In the next step, I apply PCA. In machine learning, PCA is a technique used to reduce the dimensionality of a dataset by combining the original features into a smaller set of new, uncorrelated features (principal components) that retain most of the variance. This can speed up training and, in some cases, improve the performance of the model.

Here the PCA is set with n_components=0.9. This tells the algorithm to keep enough principal components to explain at least 90% of the variance in the data. In other words, we select the minimum number of principal components that capture most of the information in the data.
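
To illustrate what a fractional n_components does, here is a sketch on synthetic data in which six correlated features are driven by two underlying factors. The data is made up for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 samples, 6 features generated from 2 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(200, 6))

# A float between 0 and 1 asks for the fewest components reaching that share of variance.
pca = PCA(n_components=0.9)
X_reduced = pca.fit_transform(X)

print(pca.n_components_)                    # number of components actually kept
print(pca.explained_variance_ratio_.sum())  # share of variance they explain
```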

Model Training

I will now train a separate model for each classifier, so that we can compare their performance and select the best one.

The make_pipeline function from the sklearn.pipeline module is used to create the pipeline object, which can be used to apply a sequence of transformations to the data, such as scaling, PCA, and the Classifier.

The fit method is used to fit the full pipeline to the training data while the score method is used to check the score of the full pipeline on the training data. The score represents the accuracy of the model on the training data.
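
A minimal sketch of such a pipeline, assuming logistic regression as the classifier and using synthetic data from make_classification in place of the wine dataset. The variable name log_pipe and the max_iter setting are my own choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the wine features and binary target.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling -> PCA -> classifier, chained into a single estimator.
log_pipe = make_pipeline(
    MinMaxScaler(),
    PCA(n_components=0.9),
    LogisticRegression(max_iter=1000),
)

log_pipe.fit(X_train, y_train)                      # fits every step on the training data
train_accuracy = log_pipe.score(X_train, y_train)   # accuracy on the training data
print(train_accuracy)
```

Because the scaler and PCA live inside the pipeline, they are fitted on the training data only and applied unchanged whenever the pipeline later sees new data.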

Target Variable prediction

In this example, X_test represents the input features of the test data, while log_y_pred represents the predicted target variable. You can compare the predicted target variable with the actual target variable to evaluate the performance of the model.
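
A sketch of the prediction step, again on synthetic stand-in data and assuming a logistic regression pipeline (log_pipe is a hypothetical name; log_y_pred follows the naming used above).

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the wine features and binary target.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

log_pipe = make_pipeline(MinMaxScaler(), PCA(n_components=0.9), LogisticRegression(max_iter=1000))
log_pipe.fit(X_train, y_train)

# Predict a class (0 or 1) for every row of the unseen test set.
log_y_pred = log_pipe.predict(X_test)

# Quick side-by-side look at the first few predictions and true labels.
print(list(zip(log_y_pred[:5], y_test[:5])))
```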

Evaluating the model

We can apply various metrics to evaluate the performance of a machine learning model, such as accuracy, precision, recall and F1-score.

Accuracy measures the proportion of correct predictions out of the total number of predictions.
Precision measures the proportion of true positives out of the total number of predicted positives.
Recall measures the proportion of true positives out of the total number of actual positives.
F1-score is the harmonic mean of precision and recall.

We can get all these scores for the machine learning model using classification_report from sklearn.metrics.
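
To make the definitions concrete, here is a small worked example on ten hand-made labels (not results from the wine model). For class 1 there are 5 true positives, 1 false positive and 1 false negative, so precision, recall and F1 all come out at 5/6, and accuracy is 8/10.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    f1_score,
    precision_score,
    recall_score,
)

# Hand-made labels for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # 8 correct out of 10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) for class 1
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) for class 1
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

# Per-class precision, recall and F1 plus overall accuracy in one table.
print(classification_report(y_true, y_pred))
```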

In this case, the model has an overall accuracy of 0.96, which means that it correctly predicts the class of 96% of the instances. The precision and recall for class 0 are 0.96 and 0.93, respectively, while the precision and recall for class 1 are 0.96 and 0.98, respectively. The F1-score for class 0 is 0.95, while the F1-score for class 1 is 0.97.

Other Classifiers

The same pipeline approach can be used to train on the dataset with other classifiers such as Decision Tree, Random Forest, K-Nearest Neighbors (knn), Support Vector Classifier (SVC) and Gradient Boosting. You can create a pipeline object using the make_pipeline function and specify the sequence of transformations and the classifier to be used in the pipeline.
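
One way to do this is to loop over a dictionary of classifiers and build an identical pipeline around each. The sketch below uses synthetic data and default hyperparameters (apart from max_iter and fixed seeds, which are my additions), so its scores say nothing about the wine dataset.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine features and binary target.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

classifiers = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
    "SVC": SVC(),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Same preprocessing for every model; only the final step changes.
scores = {}
for name, clf in classifiers.items():
    pipe = make_pipeline(MinMaxScaler(), PCA(n_components=0.9), clf)
    pipe.fit(X_train, y_train)
    scores[name] = pipe.score(X_test, y_test)  # accuracy on the held-out test set

for name, acc in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.4f}")
```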

Best models

Here is a comparison of the accuracy of different machine learning models on the wine quality dataset:

The random_forest and SVC classifiers achieved the highest accuracy, with a score of 97.58%. The knn model came a close second with 97.16%.

The logistic_regression classifier achieved an accuracy of 95.98%, while the decision_tree and gradient_boosting classifiers achieved accuracies of 94.38% and 96.91%, respectively.

In general, random_forest, knn and SVC classifiers are known for their ability to handle complex datasets and can be effective at capturing non-linear relationships between the features and the target variable.

On the other hand, logistic_regression may not be as effective at handling complex datasets or capturing non-linear relationships between the features and the target variable.

The accuracy of a machine learning classifier depends on several factors, including the quality and quantity of the training data, the choice of algorithm, and the hyperparameters used to train the model. The classifiers here use their default hyperparameters, so the performance of the models could be improved further by tuning them.
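
As a starting point for such tuning, here is a hedged sketch using GridSearchCV on a random forest pipeline. The grid values, the 3-fold cross-validation and the synthetic data are all assumptions chosen to keep the example small; they are not recommendations for the wine dataset.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the wine features and binary target.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = make_pipeline(MinMaxScaler(), PCA(n_components=0.9), RandomForestClassifier(random_state=0))

# make_pipeline names each step after its lowercased class name,
# so classifier parameters are addressed as "randomforestclassifier__<param>".
param_grid = {
    "randomforestclassifier__n_estimators": [25, 50],
    "randomforestclassifier__max_depth": [None, 5],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)            # mean cross-validated accuracy of the best setting
print(search.score(X_test, y_test))  # accuracy of the refitted best pipeline on the test set
```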

Concluding thoughts

In conclusion, this project combined EDA with machine learning model training to gain valuable insights into the wine quality dataset, and identified several classifiers that can predict wine quality with high accuracy.

The whole code is available on machine_learning/winequality_prediction_ML.ipynb at main · deepacoder/machine_learning (github.com)