DATA STORIES | WINE COLOUR CLASSIFICATION | KNIME ANALYTICS PLATFORM
Modelling Wine Color By Data Mining Physico-chemical Properties
A codeless approach to implement machine learning techniques
Abstract
I propose a data mining approach to predicting the color of wine using a range of common machine learning techniques. A large dataset is available, containing white and red Vinho Verde samples from Portugal. The input variables are limited to physico-chemical properties only; however, these inputs are sufficient for use in classification or regression models.
Introduction
Vinho Verde is a wine made exclusively in the Minho region of northwest Portugal. It is medium in alcohol with refreshing characteristics. More details can be found at: http://www.vinhoverde.pt/en/ (Cortez, Wine Quality Datasets, 2021).
Acknowledgements
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modelling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547–553, 2009.
This dataset was originally donated to the UCI Machine Learning Repository, where you can learn more about past research that has used the data.
1. Data Observation
Observing the dataset sample provided, I see there are no missing values for any attribute, and there is a balanced number of records for both colors of wine. However, on closer inspection, several attributes appear to contain skewed data; using skewness and kurtosis statistics, we can compare the shape of each attribute's distribution.
Table 1 presents the statistics of the full dataset, showing the distribution of every attribute. Highlighted are the heavily skewed distributions, all of which are positively skewed. To look more closely at the distribution within each class of the target attribute, I split the dataset into two subsets and then check the statistics to find correlations between attributes, which in turn may help distinguish the color of a wine.
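For readers who want to reproduce this check outside KNIME, a minimal pandas sketch is shown below. It is not part of the codeless workflow itself; the file name wine.csv and the column name Wine Type are assumptions, and in KNIME the same numbers come from the Statistics node.

import pandas as pd

# Load the combined red/white dataset (file name is an assumption).
df = pd.read_csv("wine.csv")

# Skewness (asymmetry) and kurtosis (tailedness) for every numeric attribute.
numeric = df.select_dtypes("number")
print(pd.DataFrame({"skewness": numeric.skew(), "kurtosis": numeric.kurt()}))

# Per-class statistics, mirroring the red/white subsets behind Tables 2 & 3.
print(df.groupby("Wine Type").describe().T)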
Tables 2 & 3 tell us a lot about which attributes play a key role in defining the color of the wine. I highlight the attributes that most clearly distinguish the two classes. This insight helps to determine the overall attribute weightings for the prediction model.
Furthermore, some very interesting correlations can be observed across the entire dataset via a 3D scatter plot. Figure 1 shows density, residual sugar, and alcohol grouped by wine color; a distinct distribution pattern appears, even though on first inspection density and alcohol seemed to carry little significance. Figure 2 shows three violin plots in which the distributions of data points for individual attributes can be compared. Using violin plots to compare the distributions of all attributes, grouped by the target class, may help to identify further significant variables in the dataset.
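As a rough illustration of the Figure 2 approach, the sketch below (continuing the pandas sketch above) draws side-by-side violins for one attribute per class with matplotlib. The attribute chosen here, total sulfur dioxide, is just an example, and the Red/White labels assume the renamed values introduced by the Rule Engine step in section 3.

import matplotlib.pyplot as plt

# Split one attribute's values by class and draw one violin per wine color.
groups = [df.loc[df["Wine Type"] == c, "total sulfur dioxide"] for c in ("Red", "White")]
fig, ax = plt.subplots()
ax.violinplot(groups, showmedians=True)
ax.set_xticks([1, 2])
ax.set_xticklabels(["Red", "White"])
ax.set_ylabel("total sulfur dioxide")
plt.show()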
2. Model Selection and Design
Based on my analysis of the variables in the dataset, there are three data mining (DM) models well suited to handle the input/output vectors.
1. Logistic Regression (LR)
2. Support Vector Machine (SVM)
3. Neural Network (NN)
The three models each have their own advantages. LR is a traditional statistical technique for modelling a categorical outcome from continuous input data. SVMs are well suited for non-linear modelling as well as being highly flexible. Paulo Cortez, Associate Professor at the University of Minho, Portugal, says that “SVMs present theoretical advantages over NNs, such as the absence of local minima in the learning phase. In effect, the SVM was recently considered one of the most influential DM algorithms” (Cortez, Cerdeira, Almeida, Matos, & Reis, 2009).
Considering those advantages, the SVM would be the best-suited model over the NN and LR; however, the LR model is much easier to interpret, hence I will choose LR for this dataset. I can implement this model easily using the Logistic Regression Learner node in KNIME. The node performs a multinomial logistic regression with two solver options: iteratively re-weighted least squares (IRLS) and stochastic average gradient (SAG). I will use the SAG solver, as it is best suited to larger datasets.
Moreover, SAG requires the numeric data to be normalized into z-score form, meaning each numeric column is rescaled to have zero mean and a standard deviation of one: z = (x − μ) / σ. The z-score is useful when the actual minimum and maximum of an attribute are unknown.
The SAG solver also optimizes the problem using maximum a posteriori estimation (Wikipedia, 2021), which allows you to specify a prior distribution for the coefficients of the resulting model (KNIME, 2021). Under the Advanced tab of the node configuration, I will use Laplace as the prior with a variance of 0.1, because it is related to the lasso (a.k.a. L1 regularization). In statistics and machine learning, the lasso (least absolute shrinkage and selection operator) is a regression analysis method that performs both variable selection and regularization to enhance the prediction accuracy and interpretability of the resulting statistical model (Wikipedia, 2021).
The inputs used to build the model will be all available variables within the training subset, and the model will be tested on the test subset, from which the target class has been removed so that the model can predict it.
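For orientation, KNIME's SAG solver with a Laplace prior corresponds closely to L1-regularized (lasso) logistic regression, so a rough scikit-learn analogue of this design might look like the sketch below. It is an approximation, not the KNIME node itself: scikit-learn's C parameter is the inverse of the regularization strength, so it does not map one-to-one onto the 0.1 prior variance.

from sklearn.linear_model import LogisticRegression

# An L1 (lasso) penalty plays the role of the Laplace prior on the coefficients;
# "saga" is a stochastic-average-gradient variant that supports L1.
model = LogisticRegression(penalty="l1", solver="saga", C=1.0, max_iter=712, tol=1e-5)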
3. Data Treatment
I applied several treatments to the raw dataset using KNIME, which makes it simple to prepare the raw data for model building with five nodes (in order of use):
Rule Engine
I renamed the wine colors to their full names rather than the first letter. This was purely a personal choice for readability and aesthetics. The Rule Engine node provides easy logic for data manipulation; I applied the change with two simple rules that replace the values in the same column.
$Wine Type$ MATCHES "R" => "Red"
$Wine Type$ MATCHES "W" => "White"
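For reference, the equivalent substitution in pandas would be a single mapping (a sketch, assuming the same column name):

# Replace the single-letter codes with the full color names in place.
df["Wine Type"] = df["Wine Type"].replace({"R": "Red", "W": "White"})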
Shuffle
The raw dataset is the concatenation of two datasets split by wine type (color), so the top half contains all the red wine physico-chemical samples and the bottom half all the white wine samples. For this reason, it is essential that the dataset is shuffled before partitioning, putting all records into random order. I have set a seed of 123456 in the node configuration; if I ever need to recreate the same random order of the raw dataset, providing the same seed yields the same result. Figures 4 & 5 show the first five records before and after applying the shuffle treatment.
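Outside KNIME, a seeded shuffle is a one-liner in pandas (a sketch using the same seed):

# Shuffle all rows into a random order; the fixed seed makes the order reproducible.
df = df.sample(frac=1, random_state=123456).reset_index(drop=True)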
Normalizer (PMML)
As discussed in section 2 above, the LR model requires all numeric data columns to be z-score (a.k.a. standard score) normalized. Computing the z-score involves the mean and standard deviation of the total population to which a data point belongs: it is calculated by subtracting the population mean from an individual raw score (i.e. an observed value or data point) and then dividing the difference by the population standard deviation (Wikipedia, 2021). The node performs a linear transformation such that the values in each column are Gaussian (0,1) distributed, i.e. the mean is 0.0 and the standard deviation is 1.0 (KNIME, 2021).
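A minimal pandas sketch of what the node computes:

# Z-score each numeric column: subtract its mean, divide by its standard deviation.
numeric_cols = df.select_dtypes("number").columns
df[numeric_cols] = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std()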
Furthermore, figures 7–9 visualize the normalized data points as box plots, where the dots represent mild outliers and the crosses extreme outliers. Figures 8 & 9 can help to visualize the differences in physico-chemical variations within each wine type.
Partitioning
The partitioning phase of building a predictive model is a very important step, as two subsets are required: one for training the model and one for testing it. The split ratio can be somewhat arbitrary, as there is no general rule; however, the full dataset must be split at least once into one training set and one test set, typically at a ratio of 70:30, respectively.
The partitioning process is performed in a row-wise manner, so to get balanced subsets with an even distribution of the target class, I have used stratified sampling, where the distribution of values in the target class column is (approximately) retained in the output subsets. I have also enabled a fixed seed of 123456 to get reproducible results.
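An equivalent stratified 70:30 split with the same fixed seed could be sketched in scikit-learn as:

from sklearn.model_selection import train_test_split

# Stratify on the target so both subsets keep the red/white class balance.
train, test = train_test_split(df, test_size=0.3, stratify=df["Wine Type"], random_state=123456)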
Column Splitter
The test data subset contains the answers for the prediction model; because of this, I must separate the answers from the test data, as we cannot fairly check the performance of a model that already knows the answers. To remove the answers from the test data, I used a Column Splitter. This node is very straightforward to use: just select the column(s) to include or exclude. In this case, I chose to route the Wine Type column to the bottom output, leaving the remaining columns in the top output.
After the test data has been run through the model, these held-out answers will be re-joined to the prediction output so the predicted answers can be compared against them. The comparison results are examined in greater depth in section 4.
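In pandas terms, the Column Splitter step amounts to holding the target column aside (continuing the sketch above):

# Keep the held-out answers for later scoring, and give the model only the features.
answers = test["Wine Type"]
test_features = test.drop(columns="Wine Type")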
4. Model Training and Testing
Upon completing the data treatments described in section 3, the model is ready to be built. First, the training data is fed into the learner node, which in this case is the Logistic Regression Learner node. As discussed in section 2 concerning the design of the model, the solver configured for this node is SAG, with the Laplace prior used for regularization in the advanced settings. The target column is set to Wine Type with the reference category set to Red, and the attributes included in the model are all numeric columns.
Moreover, in the advanced settings I have made further optimizations: the termination conditions are a maximal number of epochs of 712 and an epsilon of 1.0E-5. The maximal number of epochs is the maximum number of learning epochs that will be performed, which is essentially the maximum number of times the solver will iterate over the data. However, the solver will stop early if it reaches convergence, meaning it has found a good solution; hence it is recommended to use a large maximal number of epochs. Epsilon is used to determine whether the model has converged: if the relative change of all coefficients is smaller than epsilon, the training is stopped (KNIME, 2021).
Furthermore, relevant for the SAG solver only, learning rate and step size have been optimized. The learning rate strategy provides the learning rates for the gradient descent. When selecting a learning rate strategy and initial learning rate keep in mind that there is always a trade-off between the size of the learning rate and the number of epochs that are required to converge to a solution. With a smaller learning rate, the solver will take longer to find a solution but if the learning rate is too large it might skip over the optimal solution and diverge in the worst case (KNIME, 2021).
The learning rate for the model is best set as fixed with a step size of 0.01, resulting in 710 iterations to converge. Figure 10 below shows the calculated model coefficients and statistics, including the standard errors, z-score, and P>|z| values for the coefficients.
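Scikit-learn's own sag solver supports neither an L1 penalty nor a user-set step size, so in the sketch below a plain stochastic-gradient logistic regression (SGD rather than SAG, a deliberate swap) stands in to show the same knobs: logistic loss with an L1 penalty, a constant step size of 0.01, an epoch cap of 712, and a 1.0E-5 convergence tolerance. It continues the hypothetical pipeline above.

from sklearn.linear_model import SGDClassifier

# Logistic loss ("log_loss" in scikit-learn >= 1.1) + L1 penalty,
# fixed learning rate, capped epochs, and a convergence tolerance.
sgd = SGDClassifier(loss="log_loss", penalty="l1", learning_rate="constant",
                    eta0=0.01, max_iter=712, tol=1e-5, random_state=123456)
sgd.fit(train.drop(columns="Wine Type"), train["Wine Type"])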
Testing the Model
After the model has been built, some preparation is required before testing is carried out. The built model must be connected to a Logistic Regression Predictor node, with the test data fed in alongside it. The configuration of this node is a set of simple checkboxes; I have checked all of them to obtain three new columns once the node has executed. The prediction column (color), which holds the model's answers, will later be compared with the correct answers that I removed earlier. The other two columns each contain the predicted probability of an individual class (e.g. P(Wine Type = Red)). Later, I will also use these probabilities, compared against the correct target values, to generate a ROC curve. ROC curves help to evaluate the performance of a model by visual inspection: the greater the area under the curve, the better the model's performance.
Moreover, I join the correct answers back to the test set after the model has made its predictions, and connect this final dataset to the performance evaluation nodes. The most significant performance metrics can be found in the Scorer node: the overall accuracy, precision, recall, and the confusion matrix of true and false positives and negatives. Tables 5–7 show the performance results of the LR model on the test data subset of 600 records. Figure 11 shows the ROC curves for both the red and white class predictions.
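A sketch of the same scoring with scikit-learn, continuing the hypothetical pipeline above:

from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Overall accuracy plus the confusion matrix of true/false positives and negatives.
predictions = sgd.predict(test_features)
print(accuracy_score(answers, predictions))
print(confusion_matrix(answers, predictions))

# Area under the ROC curve, using the predicted probability of the White class.
white_col = list(sgd.classes_).index("White")
probs = sgd.predict_proba(test_features)[:, white_col]
print(roc_auc_score(answers == "White", probs))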
Note. Besides the implementation of a correct data science pipeline and model training, the very high accuracy achieved by the model above is very likely also attributable to the quality and informativeness of the features in the dataset.
Conclusion
Positive results were reached, with the LR model classifying the wine color at an overall accuracy of 99.67%, for an overall error rate of 0.33%. White wine classification outperformed red wine, with individual class accuracies of 100% (white) vs. 99.34% (red); the difference was due to two red wines being misclassified. The LR model proved very accurate in this classification scenario, which can be very useful in real-world applications and problem solving.
Furthermore, this dataset shows very interesting differences in the physico-chemical makeup of individual wine types (color). This type of insight can help wine growers optimize their product in various ways, from the stage of growing the grapes, through harvest to fermentation of the wine.
References
Cortez, P. (2021, May 12). Wine Quality Datasets. Retrieved from Universidade do Minho: http://www3.dsi.uminho.pt/pcortez/wine/
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547–553.
KNIME. (2021, May 16). Logistic Regression Learner. Retrieved from KNIME Hub: https://hub.knime.com/knime/extensions/org.knime.features.base/latest/org.knime.base.node.mine.regression.logistic.learner4.LogRegLearnerNodeFactory4
Wikipedia. (2021, May 16). Lasso (statistics). Retrieved from Wikipedia.org: https://en.wikipedia.org/wiki/Lasso_(statistics)
Wikipedia. (2021, May 16). Maximum a posteriori estimation. Retrieved from Wikipedia.org: https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation