Classification of Red and White Wines using Logistic Regression and LDA

8 min readApr 11, 2023

We consider a dataset related to red and white variants of the Portuguese “Vinho Verde” wine. The dataset can be downloaded from https://archive.ics.uci.edu/ml/datasets/wine+quality. Due to privacy and logistic issues, only physicochemical (inputs) and quality (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

We have merged the white and red wine data together with a label of type of wine. In this study, we will use this dataset to illustrate different classification tools. The vaeiables of interest are:

fixed acidity
volatile acidity
citric acid
residual sugar
chlorides
free sulfur dioxide
total sulfur dioxide
density
pH
sulphates
alcohol

All of these 11 variables are continuous variables.

Explore the data

In the first step of our data analysis, we explore about the characteristics of different variables used for classification.

#load the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

winedata = pd.read_csv('./data/winedata.csv')

#check the number of variables and number of observations
winedata.shape

There are 13 columns and 6497 rows of observations in the data. Let us a have a check on the first few rows of observations.

winedata.head()

The last column wine is our response or dependent variables and the first 11 columns are our predictors or independent variables. We ignore the column quality for this study.

We first visualize the distributions of different predictor variables for two types of wines red/white.

features = winedata.columns[:11]

fig, axs = plt.subplots(4,3, figsize=(16,16))

for fcount in range(len(features)):
    i = fcount//3
    j = fcount % 3
    sns.kdeplot(x=features[fcount], hue='wine',fill='wine', data=winedata, ax = axs[i,j])

axs[3,2].set_axis_off()
plt.subplots_adjust(hspace=0.4)
plt.show()

Figure 2: Plots of distributions of the feature variables

Observe that total sulfur dioxide for red and white wines have very different distributions. sulphates, pH, density and alcohol have very similar distribution for both of these wine types. Other variables have some differences for the two types of wines. We can also perform independent sample t-test to compare the mean of each of these variables between the two wine types.

from scipy.stats import ttest_ind

def t_test(group1, group2, **kwargs):
    '''
    Tow independent sample t-test wrapper
    '''

    means1 = np.mean(group1, axis=0)
    sd1    = np.std(group1, axis=0)
    means2 = np.mean(group2, axis=0)
    sd2    = np.std(group2, axis=0)
    n1 = group1.shape[0]
    n2 = group2.shape[0]
    sd = np.sqrt(((n1-1)*(sd1**2)+(n2-1)*(sd2**2))/(n1+n2-2))
    effect = (means1 - means2)/sd
    statistic, pvalue = ttest_ind(group1, group2, **kwargs)
    result = pd.DataFrame({'Mean_Group1':means1, 'SD_Group1': sd1, 'Mean_Group2': means2, 'SD_Group2': sd2, 'T_Stat': statistic,
                          'pvalue':pvalue, 'Effect_Size':effect})
    return result
    
t_test(winedata[features][winedata['wine']=='White'], winedata[features][winedata['wine']=='Red'])

Figure 3: Summary Statistics of the features

Due to the large sample sizes, all variables have significantly different mean for the two types of wines. However, the effect sizes for alcohol and citric acid are small.

We will use all variables in our classification.

First we create a training set and a test set from our data.

#Separate the predictors and the response first

X = winedata[features]
y = winedata['wine']

#Create train-test split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
print(X_train.shape)
print(X_test.shape)

(5197, 11)
(1300, 11)

There are 5197 observations in the trainng data and 1300 observations in the test data.

print('White wines in Training Set:',np.sum(y_train=='White'))
print('Red wines in Training Set:',np.sum(y_train=='Red'))
print('')
print('White wines in Test Set:',np.sum(y_test=='White'))
print('Red wines in Test Set:',np.sum(y_test=='Red'))

White wines in Training Set: 3909
Red wines in Training Set: 1288

White wines in Test Set: 989
Red wines in Test Set: 311

Scale all feature variables between 0 and 1 for numerical stability.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train)

#Scale the training data
X_tr_scaled = scaler.transform(X_train)

#Scale the test data
X_te_scaled = scaler.transform(X_test)

Logistic Regression

Let us first define our dependent variable from the class labels of white and red wines. Let

Then the logistic regression model is given by

We estimate the parameters from the training data using maximum lilekihood estimation.

# Logistic regression using statsmodels
import statsmodels.api as sm

#Add a column of constants
X_train_c = sm.add_constant(pd.DataFrame(X_tr_scaled, columns=features))
y_train_c = pd.get_dummies(y_train, drop_first=True).values.reshape(-1,1)

#Fit a logistics regression model
lgmod = sm.Logit(y_train_c, X_train_c).fit()
print(lgmod.summary())

Optimization terminated successfully.
         Current function value: inf
         Iterations 12
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                 5197
Model:                          Logit   Df Residuals:                     5185
Method:                           MLE   Df Model:                           11
Date:                Tue, 11 Apr 2023   Pseudo R-squ.:                     inf
Time:                        09:03:09   Log-Likelihood:                   -inf
converged:                       True   LL-Null:                        0.0000
Covariance Type:            nonrobust   LLR p-value:                     1.000
========================================================================================
                           coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                   14.3260      1.380     10.377      0.000      11.620      17.032
fixed acidity            4.8891      3.128      1.563      0.118      -1.241      11.019
volatile acidity        -9.1646      1.737     -5.276      0.000     -12.569      -5.760
citric acid              3.8020      1.616      2.352      0.019       0.634       6.970
residual sugar          67.5750      8.287      8.155      0.000      51.333      83.817
chlorides              -14.5171      2.955     -4.913      0.000     -20.308      -8.726
free sulfur dioxide    -17.8094      4.165     -4.276      0.000     -25.972      -9.646
total sulfur dioxide    21.5563      2.252      9.574      0.000      17.143      25.969
density                -94.2842     10.975     -8.591      0.000    -115.794     -72.774
pH                       2.5954      2.110      1.230      0.219      -1.541       6.731
sulphates               -5.4941      2.418     -2.272      0.023     -10.233      -0.755
alcohol                -12.5035      2.115     -5.912      0.000     -16.649      -8.358
========================================================================================

From the above results, fixed acidity and pH are not significant predictors. However, instead of inferential model building with logistic reegression, we use logistic regression as a classification tool.

from sklearn.linear_model import LogisticRegression

lgmodel = LogisticRegression(random_state=21, penalty='none')
lgmodel.fit(X_tr_scaled, y_train)

# Predict for test data

y_te_pr = lgmodel.predict(X_te_scaled)

Visualise the confusion matrix using a heatmap

Let us visualise the results of the model in the form of a confusion matrix using matplotlib and seaborn.

# Create a confusion matrix
from sklearn import metrics

cnf_matrix = metrics.confusion_matrix(y_test, y_te_pr)

#Create a heat map to visualize
labels = np.unique(y_train)
fig, ax = plt.subplots()

# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g',
           xticklabels=labels, yticklabels=labels)
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

plt.show()

Figure 4: Heatmap of the Confusiuon Matrix from Logistic Regression

We observe that only 3 white wines are misclassified as Red and 6 Red wines are misclassified as White. Let us now evaluate the fit using classification_report for accuracy, precision and recall

from sklearn.metrics import classification_report

print(classification_report(y_test, y_te_pr, target_names=labels))

precision    recall  f1-score   support

         Red       0.99      0.98      0.99       311
       White       0.99      1.00      1.00       989

    accuracy                           0.99      1300
   macro avg       0.99      0.99      0.99      1300
weighted avg       0.99      0.99      0.99      1300

The overall accuracy is 99%, which is very good.

ROC Curves

ROC curve is a plot of the true positive rate (TPR) against the false positive rate (FPR). It shows the trade-off between sensitivity and specificity.

To compute the ROC curves, we need the predicted probabilities from the fitted logistic regression models instead of predicted classes.

y_te_prob = lgmodel.predict_proba(X_te_scaled)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_te_prob, pos_label = 'White')
auc = metrics.roc_auc_score(y_test, y_te_prob)

plt.figure(figsize=(8,6))
plt.plot(fpr,tpr,label="AUC="+str(np.round(auc,4)))
plt.plot([0,1], [0,1], 'r--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve from Logistic Regression')
plt.legend(loc=4)
plt.show()

Figure 5: ROC Curve from Logistic Regression

AUC score for this classification model is 0.9976. AUC score of 1 represents a perfect classifier, and 0.5 represents a worthless classifier.

Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) is a multi-class classifier. In this problem we have only two-classes as we have defined earlier. The simple classification rule is: Classify and observation with feature vector x to the class White (or 𝑌=1) if:

where

𝜇_1 , 𝜇_0 are the mean vectors of the features for classes White and Red, respectively.
Σ is the common covariance matrix of the feature vectors.
𝜋1 and 𝜋0 are the prior probabilities of the classes White and Red, respectively.

All of thse parameters of the models are estimated using sample mean vectors and covariance matrices of the training data.

The underlying assumption for LDA to be optimal is that the distribution of the feature vectors in each class are normal (Gaussian). However, LDA can also be viewed as a dimensionality reduction technique, that is, the data is projected to a lower dimensional space using a new coordinate axes. Thus, even if the assumption of normality does not hold, we can use LDA as a classifier when it is not guaranteed to be optimal.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda_model = LinearDiscriminantAnalysis()

#Fit the model
lda_model.fit(X_tr_scaled, y_train)

# Predict for test data
y_te_pr = lda_model.predict(X_te_scaled)

#Visualise the confusion matrix
cnf_matrix = metrics.confusion_matrix(y_test, y_te_pr)

#Create a heat map to visualize
labels = np.unique(y_train)
fig, ax = plt.subplots()

# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g',
           xticklabels=labels, yticklabels=labels)
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

plt.show()

Figure 6: Heatmap of the Confusion Matrix from LDA

The performance of LDA is very similar to the logistic regression, we have observed earlier. Only 3 White wines are misclassified as Red and only 4 Red wines are misclassified as White.

print(classification_report(y_test, y_te_pr, target_names=labels))

precision    recall  f1-score   support

         Red       0.99      0.99      0.99       311
       White       1.00      1.00      1.00       989

    accuracy                           0.99      1300
   macro avg       0.99      0.99      0.99      1300
weighted avg       0.99      0.99      0.99      1300

#ROC Curves
y_te_prob = lda_model.predict_proba(X_te_scaled)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_te_prob, pos_label = 'White')
auc = metrics.roc_auc_score(y_test, y_te_prob)

plt.figure(figsize=(8,6))
plt.plot(fpr,tpr,label="AUC="+str(np.round(auc,4)))
plt.plot([0,1], [0,1], 'r--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve from Linear Discriminant Analysis')
plt.legend(loc=4)
plt.show()

Figure 7: ROC Curve from Linear Discriminant Analysis

We have observed that both logistic regression and linear discrimant analysis perform quite well as a classifier for white and red wines based on 11 physico-chemical characteristics. In our next study, we illustrate the use tree-based methods and support vector mechines to classify the red and white wines.