Classification of Red and White Wines using Logistic Regression and LDA
We consider a dataset related to red and white variants of the Portuguese “Vinho Verde” wine. The dataset can be downloaded from https://archive.ics.uci.edu/ml/datasets/wine+quality. Due to privacy and logistic issues, only physicochemical (inputs) and quality (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
We have merged the white and red wine data together with a label of type of wine. In this study, we will use this dataset to illustrate different classification tools. The vaeiables of interest are:
- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol
All of these 11 variables are continuous variables.
Explore the data
In the first step of our data analysis, we explore about the characteristics of different variables used for classification.
#load the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
winedata = pd.read_csv('./data/winedata.csv')
#check the number of variables and number of observations
winedata.shape
There are 13 columns and 6497 rows of observations in the data. Let us a have a check on the first few rows of observations.
winedata.head()
The last column wine
is our response or dependent variables and the first 11 columns are our predictors or independent variables. We ignore the column quality for this study.
We first visualize the distributions of different predictor variables for two types of wines red/white.
features = winedata.columns[:11]
fig, axs = plt.subplots(4,3, figsize=(16,16))
for fcount in range(len(features)):
i = fcount//3
j = fcount % 3
sns.kdeplot(x=features[fcount], hue='wine',fill='wine', data=winedata, ax = axs[i,j])
axs[3,2].set_axis_off()
plt.subplots_adjust(hspace=0.4)
plt.show()
Observe that total sulfur dioxide
for red and white wines have very different distributions. sulphates
, pH
, density
and alcohol
have very similar distribution for both of these wine types. Other variables have some differences for the two types of wines. We can also perform independent sample t-test to compare the mean of each of these variables between the two wine types.
from scipy.stats import ttest_ind
def t_test(group1, group2, **kwargs):
'''
Tow independent sample t-test wrapper
'''
means1 = np.mean(group1, axis=0)
sd1 = np.std(group1, axis=0)
means2 = np.mean(group2, axis=0)
sd2 = np.std(group2, axis=0)
n1 = group1.shape[0]
n2 = group2.shape[0]
sd = np.sqrt(((n1-1)*(sd1**2)+(n2-1)*(sd2**2))/(n1+n2-2))
effect = (means1 - means2)/sd
statistic, pvalue = ttest_ind(group1, group2, **kwargs)
result = pd.DataFrame({'Mean_Group1':means1, 'SD_Group1': sd1, 'Mean_Group2': means2, 'SD_Group2': sd2, 'T_Stat': statistic,
'pvalue':pvalue, 'Effect_Size':effect})
return result
t_test(winedata[features][winedata['wine']=='White'], winedata[features][winedata['wine']=='Red'])
Due to the large sample sizes, all variables have significantly different mean for the two types of wines. However, the effect sizes for alcohol
and citric acid
are small.
We will use all variables in our classification.
First we create a training set and a test set from our data.
#Separate the predictors and the response first
X = winedata[features]
y = winedata['wine']
#Create train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
print(X_train.shape)
print(X_test.shape)
(5197, 11)
(1300, 11)
There are 5197 observations in the trainng data and 1300 observations in the test data.
print('White wines in Training Set:',np.sum(y_train=='White'))
print('Red wines in Training Set:',np.sum(y_train=='Red'))
print('')
print('White wines in Test Set:',np.sum(y_test=='White'))
print('Red wines in Test Set:',np.sum(y_test=='Red'))
White wines in Training Set: 3909
Red wines in Training Set: 1288
White wines in Test Set: 989
Red wines in Test Set: 311
Scale all feature variables between 0 and 1 for numerical stability.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)
#Scale the training data
X_tr_scaled = scaler.transform(X_train)
#Scale the test data
X_te_scaled = scaler.transform(X_test)
Logistic Regression
Let us first define our dependent variable from the class labels of white and red wines. Let
Then the logistic regression model is given by
We estimate the parameters from the training data using maximum lilekihood estimation.
# Logistic regression using statsmodels
import statsmodels.api as sm
#Add a column of constants
X_train_c = sm.add_constant(pd.DataFrame(X_tr_scaled, columns=features))
y_train_c = pd.get_dummies(y_train, drop_first=True).values.reshape(-1,1)
#Fit a logistics regression model
lgmod = sm.Logit(y_train_c, X_train_c).fit()
print(lgmod.summary())
Optimization terminated successfully.
Current function value: inf
Iterations 12
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 5197
Model: Logit Df Residuals: 5185
Method: MLE Df Model: 11
Date: Tue, 11 Apr 2023 Pseudo R-squ.: inf
Time: 09:03:09 Log-Likelihood: -inf
converged: True LL-Null: 0.0000
Covariance Type: nonrobust LLR p-value: 1.000
========================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------
const 14.3260 1.380 10.377 0.000 11.620 17.032
fixed acidity 4.8891 3.128 1.563 0.118 -1.241 11.019
volatile acidity -9.1646 1.737 -5.276 0.000 -12.569 -5.760
citric acid 3.8020 1.616 2.352 0.019 0.634 6.970
residual sugar 67.5750 8.287 8.155 0.000 51.333 83.817
chlorides -14.5171 2.955 -4.913 0.000 -20.308 -8.726
free sulfur dioxide -17.8094 4.165 -4.276 0.000 -25.972 -9.646
total sulfur dioxide 21.5563 2.252 9.574 0.000 17.143 25.969
density -94.2842 10.975 -8.591 0.000 -115.794 -72.774
pH 2.5954 2.110 1.230 0.219 -1.541 6.731
sulphates -5.4941 2.418 -2.272 0.023 -10.233 -0.755
alcohol -12.5035 2.115 -5.912 0.000 -16.649 -8.358
========================================================================================
From the above results, fixed acidity
and pH
are not significant predictors. However, instead of inferential model building with logistic reegression, we use logistic regression as a classification tool.
from sklearn.linear_model import LogisticRegression
lgmodel = LogisticRegression(random_state=21, penalty='none')
lgmodel.fit(X_tr_scaled, y_train)
# Predict for test data
y_te_pr = lgmodel.predict(X_te_scaled)
Visualise the confusion matrix using a heatmap
Let us visualise the results of the model in the form of a confusion matrix using matplotlib and seaborn.
# Create a confusion matrix
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_te_pr)
#Create a heat map to visualize
labels = np.unique(y_train)
fig, ax = plt.subplots()
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g',
xticklabels=labels, yticklabels=labels)
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
We observe that only 3 white wines are misclassified as Red and 6 Red wines are misclassified as White. Let us now evaluate the fit using classification_report
for accuracy, precision and recall
from sklearn.metrics import classification_report
print(classification_report(y_test, y_te_pr, target_names=labels))
precision recall f1-score support
Red 0.99 0.98 0.99 311
White 0.99 1.00 1.00 989
accuracy 0.99 1300
macro avg 0.99 0.99 0.99 1300
weighted avg 0.99 0.99 0.99 1300
The overall accuracy is 99%, which is very good.
ROC Curves
ROC curve is a plot of the true positive rate (TPR) against the false positive rate (FPR). It shows the trade-off between sensitivity and specificity.
To compute the ROC curves, we need the predicted probabilities from the fitted logistic regression models instead of predicted classes.
y_te_prob = lgmodel.predict_proba(X_te_scaled)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_te_prob, pos_label = 'White')
auc = metrics.roc_auc_score(y_test, y_te_prob)
plt.figure(figsize=(8,6))
plt.plot(fpr,tpr,label="AUC="+str(np.round(auc,4)))
plt.plot([0,1], [0,1], 'r--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve from Logistic Regression')
plt.legend(loc=4)
plt.show()
AUC score for this classification model is 0.9976. AUC score of 1 represents a perfect classifier, and 0.5 represents a worthless classifier.
Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) is a multi-class classifier. In this problem we have only two-classes as we have defined earlier. The simple classification rule is: Classify and observation with feature vector x to the class White (or 𝑌=1) if:
where
- 𝜇_1 , 𝜇_0 are the mean vectors of the features for classes White and Red, respectively.
- Σ is the common covariance matrix of the feature vectors.
- 𝜋1 and 𝜋0 are the prior probabilities of the classes White and Red, respectively.
All of thse parameters of the models are estimated using sample mean vectors and covariance matrices of the training data.
The underlying assumption for LDA to be optimal is that the distribution of the feature vectors in each class are normal (Gaussian). However, LDA can also be viewed as a dimensionality reduction technique, that is, the data is projected to a lower dimensional space using a new coordinate axes. Thus, even if the assumption of normality does not hold, we can use LDA as a classifier when it is not guaranteed to be optimal.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda_model = LinearDiscriminantAnalysis()
#Fit the model
lda_model.fit(X_tr_scaled, y_train)
# Predict for test data
y_te_pr = lda_model.predict(X_te_scaled)
#Visualise the confusion matrix
cnf_matrix = metrics.confusion_matrix(y_test, y_te_pr)
#Create a heat map to visualize
labels = np.unique(y_train)
fig, ax = plt.subplots()
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g',
xticklabels=labels, yticklabels=labels)
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
The performance of LDA is very similar to the logistic regression, we have observed earlier. Only 3 White wines are misclassified as Red and only 4 Red wines are misclassified as White.
print(classification_report(y_test, y_te_pr, target_names=labels))
precision recall f1-score support
Red 0.99 0.99 0.99 311
White 1.00 1.00 1.00 989
accuracy 0.99 1300
macro avg 0.99 0.99 0.99 1300
weighted avg 0.99 0.99 0.99 1300
#ROC Curves
y_te_prob = lda_model.predict_proba(X_te_scaled)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_te_prob, pos_label = 'White')
auc = metrics.roc_auc_score(y_test, y_te_prob)
plt.figure(figsize=(8,6))
plt.plot(fpr,tpr,label="AUC="+str(np.round(auc,4)))
plt.plot([0,1], [0,1], 'r--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve from Linear Discriminant Analysis')
plt.legend(loc=4)
plt.show()
We have observed that both logistic regression and linear discrimant analysis perform quite well as a classifier for white and red wines based on 11 physico-chemical characteristics. In our next study, we illustrate the use tree-based methods and support vector mechines to classify the red and white wines.