Multiple Linear Regression in Red Wine Quality

using IBM SPSS Statistics 26

Photo by Kelsey Knight on Unsplash

Introduction

Recently, the growing popularity of social drinking has pushed beverage manufacturers to improve the quality of their products, in this case red wine. By developing better red wine, manufacturers hope that their products will sell better or become better known than competing red wines. However, producing red wine of the best quality requires extensive research to determine which factors have the most significant influence on quality. Statistical models and machine learning have begun to help beverage manufacturers identify the factors that can improve red wine quality and predict quality values.

In a study conducted by Kumar et al. (2020), three models were implemented to predict the quality value of red wine: random forest, support vector machine, and naïve Bayes. Of the three, the support vector machine had the highest accuracy (training accuracy of 67.25% and test accuracy of 68.64%), while naïve Bayes had the lowest training and test accuracy, at 55.91% and 55.89%, respectively.

Another study conducted by Nguyen (2020) also implemented three models to predict the quality value of red wine, namely multilinear regression with k-fold cross-validation, lasso, and random forest regression. This study concluded that random forest regression had the best results, with an R-squared of 48.50%, RMSE of 0.5843, and MAE of 0.422.

Problem Statement

For a manufacturer or producer of red wine, customers' interest in buying its products is largely determined by quality. The value or rating assigned to a red wine is usually used as the measure of its quality.

However, various factors can influence the value or rating of red wine quality, so red wine manufacturers find it challenging to determine which factors will most impact red wine quality. As a result of these issues, red wine manufacturers need to understand the factors that affect red wine quality to enhance their products’ quality based on those factors. Furthermore, by improving the quality of the red wine produced, the manufacturers can gain more income from the sale of red wine that has been improved in quality.

Research Objectives

In this article, IBM SPSS is used to perform Multiple Linear Regression (MLR) to determine which independent variables most influence red wine quality (the dependent variable). With the stepwise method, the independent variables are entered into the model one at a time, which makes it possible to identify the independent variables that significantly affect the quality value of red wine.

In addition, this article tests several hypotheses. The hypotheses are as follows:

1. Hypothesis 1 — Alcohol Percentage

H0: There is a significant positive correlation between quality and alcohol percentage in red wine

H1: There is a significant negative correlation between quality and alcohol percentage in red wine

2. Hypothesis 2 — Sulphates

H0: The more sulphates added, the higher the quality of the red wine produced (there is a significant positive correlation between red wine quality and sulphates).

H1: The more sulphates added, the lower the quality of the red wine produced (there is a significant negative correlation between red wine quality and sulphates).

3. Hypothesis 3 — Chlorides

H0: The lower the number of chlorides in the wine, the better the quality of the wine (there is a significant negative correlation between the quality of red wine and chlorides)

H1: The higher the number of chlorides in the wine, the better the quality of the wine (there is a significant positive correlation between the quality of red wine and chlorides)

Analysis Results and Interpretations

Dataset Description

The dataset used in this report is the Red Wine Quality dataset by UCI Machine Learning (2017), taken from the Kaggle website. There are 1,599 instances in this dataset, with quality as the dependent variable and 11 independent variables. The following are screenshots of the dataset and its data dictionary in SPSS.

Dataset Preview
Data Dictionary

Stepwise MLR

Multiple Linear Regression Model Summary
ANOVA of Multiple Linear Regression
Independent Variables that Entered into MLR Model
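
As a rough illustration of what the stepwise procedure does, the following is a minimal sketch of forward selection in plain Python. It is not SPSS's implementation: it only adds variables (SPSS's stepwise method also re-checks entered variables for removal), it uses a fixed F-to-enter threshold of 3.84 (roughly the 5% critical value of F with 1 numerator degree of freedom and a large sample) instead of a probability of F, and it runs on synthetic data. The names `solve`, `sse`, `forward_select`, `x1`, `x2`, and `x3` are all hypothetical.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def sse(columns, y):
    """Residual sum of squares of an OLS fit of y on the given columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [c[i] for c in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * y[i] for i, row in enumerate(design)) for p in range(k)]
    beta = solve(xtx, xty)
    return sum((y[i] - sum(b * v for b, v in zip(beta, row))) ** 2
               for i, row in enumerate(design))

def forward_select(data, y, f_to_enter=3.84):
    """Add one variable at a time while the best candidate's partial F exceeds the threshold."""
    selected = []
    current = sse([], y)  # intercept-only model
    while len(selected) < len(data):
        candidates = {
            name: sse([data[s] for s in selected] + [col], y)
            for name, col in data.items() if name not in selected
        }
        best = min(candidates, key=candidates.get)
        df_resid = len(y) - len(selected) - 2  # n - (predictors after adding) - 1
        f_stat = (current - candidates[best]) / (candidates[best] / df_resid)
        if f_stat < f_to_enter:
            break
        selected.append(best)
        current = candidates[best]
    return selected

# Synthetic example (made-up data): y depends on x1 and x2 but not on x3.
random.seed(0)
n = 200
data = {name: [random.gauss(0, 1) for _ in range(n)] for name in ("x1", "x2", "x3")}
y = [2 + 1.5 * a - 1.0 * b + random.gauss(0, 0.5)
     for a, b in zip(data["x1"], data["x2"])]
selected = forward_select(data, y)
print(selected)
```

The order of the returned list is the order in which variables entered the model, which mirrors how the SPSS output lists one model per step.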

MLR Assumptions

In MLR, five assumptions must be met for the model to provide accurate predictions. The assumptions are as follows:

  • Linearity: The dependent variable should have a linear relationship with the independent variables. The figure below shows scatter plots of the dependent variable (quality) against each of the 11 independent variables. As shown in the figure, some independent variables (Fixed Acidity, Citric Acid, Sulphates, Alcohol, and Residual Sugar) are positively correlated with the dependent variable, while the others (Volatile Acidity, Chlorides, Free Sulfur Dioxide, Total Sulfur Dioxide, Density, and pH) are negatively correlated with it. It can be concluded that the linearity assumption is fulfilled.
Scatter Plot between Dependent Variable and Independent Variables
  • Normality: The errors (residuals) should be normally distributed. The figure below shows the P-P plot used for the normality test. The points lie close to the trend line, suggesting that the residuals are normally distributed. It can be concluded that the normality assumption is fulfilled.
Normality Test using P-P Plot
  • Homoscedasticity (constant variance): The variance of the residuals should be constant across the values of the independent variables. If the scatter plot of residuals against predicted values does not show a clear pattern, homoscedasticity is fulfilled. The figure below shows this scatter plot, and no clear pattern is visible. In addition, the slope of the trend line, -4.55 * 10^-15, is effectively zero, which indicates no correlation between the residuals and the predicted values. It can be concluded that the homoscedasticity assumption is fulfilled.
Scatter Plot of Residuals with Predicted Value
  • Minimal multicollinearity: The independent variables should be only minimally correlated with one another. If the VIF (variance inflation factor) of every variable is below 10, it can be said that multicollinearity does not affect the MLR model. The figure below shows the independent variables that entered the MLR model in the last batch, and all of them have a VIF of less than 10. It can be concluded that the multicollinearity assumption is fulfilled.
Variables that Entered MLR in Last Batch
  • Independence of residuals: There should be no noticeable pattern between the residuals and the observation sequence. The Durbin-Watson test was carried out to test this assumption; it evaluates the correlation between the residuals of consecutive observations (known as autocorrelation). The Durbin-Watson statistic ranges between 0 and 4. A value of 2 or close to 2 indicates minimal autocorrelation, a value below 2 indicates positive autocorrelation, and a value above 2 indicates negative autocorrelation. The figure below shows the result of the Durbin-Watson test for the last batch of the MLR model: the value is 1.750, which is less than 2. It can be concluded that there is positive autocorrelation, which means that the independence of residuals assumption is not fulfilled.
Model Summary of MLR in Last Batch
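
The two numeric checks in the list above can be written out in a few lines. The sketch below is plain Python with made-up residuals, not SPSS output, and `durbin_watson` and `vif` are hypothetical helper names. The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals, and the VIF of a predictor is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on the other predictors.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

def vif(r_squared_j):
    """Variance inflation factor, given the R^2 of predictor j regressed on the others."""
    return 1.0 / (1.0 - r_squared_j)

# Made-up residual sequences, for illustration only.
dw_alternating = durbin_watson([1, -1, 1, -1])  # sign flips every step: negative autocorrelation
dw_runs = durbin_watson([1, 1, -1, -1])         # neighbours tend to agree: positive autocorrelation
print(dw_alternating)  # 3.0
print(dw_runs)         # 1.0

print(vif(0.5))  # 2.0
print(vif(0.9))  # about 10, the cutoff used above
```

A VIF cutoff of 10 therefore corresponds to R_j^2 = 0.9. As a rule of thumb, the Durbin-Watson statistic is approximately 2(1 - r), where r is the lag-1 autocorrelation of the residuals, so a value of 1.750 corresponds to r of roughly 0.125.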

Model Summary Interpretation

The model summary in the previous figure shows that R (the multiple correlation coefficient, that is, the correlation between the actual and predicted values) is 0.600, which indicates a moderate positive correlation. In addition, R^2, the coefficient of determination, which represents the proportion of variance in the response variable that the predictor variables can explain, is 0.359. This value is relatively low: only 35.9% of the variance in the response variable is explained by the predictor variables. Furthermore, the adjusted R^2, which is never greater than R^2, is 0.357 in this scenario. Adjusted R^2 takes the sample size and the number of predictors into account to estimate the population R^2 more precisely.
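
The relationship between R^2 and adjusted R^2 can be checked by hand. The sketch below applies the standard formula, adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - p - 1), with n = 1,599 observations and p = 7 predictors in the final model (`adjusted_r2` is a hypothetical helper name). Because it starts from the rounded R^2 of 0.359, the result can differ from the SPSS figure in the third decimal place.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors fitted on n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Rounded values reported in this article: R^2 = 0.359, n = 1599, p = 7.
value = adjusted_r2(0.359, 1599, 7)
print(round(value, 3))
```

With a sample this large relative to the number of predictors, the penalty factor (n - 1) / (n - p - 1) is close to 1, which is why the two values are nearly identical.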

Adequacy

The hypotheses for this ANOVA (F) test can be stated as follows:

H0: The model does not provide an adequate fit (all regression coefficients are equal to zero).

H1: The model provides an adequate fit (at least one regression coefficient is not equal to zero).

ANOVA of MLR in Last Batch

From the figure above, the p-value is reported as 0.000 (that is, less than 0.001), which is less than 0.05. H0 is therefore rejected at the 5% significance level, and it can be concluded that this MLR model is adequate for the data.
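
The F statistic in an ANOVA table is tied to R^2 through F = (R^2 / p) / ((1 - R^2) / (n - p - 1)). As a sanity check, the sketch below recomputes it from the rounded values reported in this article (R^2 = 0.359, n = 1,599, p = 7). `f_statistic` is a hypothetical helper name, and because the inputs are rounded, the result is only approximate and is not expected to match the SPSS table exactly.

```python
def f_statistic(r2, n, p):
    """Overall F statistic of a regression with p predictors and n observations."""
    explained_per_df = r2 / p
    unexplained_per_df = (1.0 - r2) / (n - p - 1)
    return explained_per_df / unexplained_per_df

# Rounded values reported in this article.
f_value = f_statistic(0.359, 1599, 7)
print(round(f_value, 1))
```

An F value this far above 1 on 7 and 1,591 degrees of freedom is consistent with the very small p-value shown in the ANOVA table.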

Hypothesis Test for the Coefficients of Regression

For each independent variable in the model, the hypotheses can be expressed as follows:

H0: There is no linear relationship between quality and the independent variable (its regression coefficient is equal to zero).

H1: There is a linear relationship between quality and the independent variable (its regression coefficient is not equal to zero).

Independent Variables that Entered into the MLR in Last Batch

From the figure above, all seven independent variables have a p-value of less than 0.05. H0 is therefore rejected at the 5% significance level for each of them, and it can be concluded that all seven independent variables are significantly related to the dependent variable.

Conclusions and Recommendations

Referring to the previous figure, the model equation is:

Quality = 4.430 + 0.289 (Alcohol) - 1.013 (Volatile Acidity) + 0.883 (Sulphates) - 0.003 (Total Sulfur Dioxide) - 2.018 (Chlorides) - 0.483 (pH) - 0.005 (Free Sulfur Dioxide)
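
To show how the equation is used, the sketch below plugs one set of measurements into it. The coefficients are copied from the equation above, reading the intercept as 4.430 (an assumption: since quality in this dataset is a score between 0 and 10, an intercept in the thousands cannot be right). The input values describe a made-up wine, and `predict_quality` is a hypothetical helper name.

```python
# Coefficients copied from the model equation; the intercept is assumed to be 4.430.
INTERCEPT = 4.430
COEFFICIENTS = {
    "alcohol": 0.289,
    "volatile_acidity": -1.013,
    "sulphates": 0.883,
    "total_sulfur_dioxide": -0.003,
    "chlorides": -2.018,
    "ph": -0.483,
    "free_sulfur_dioxide": -0.005,
}

def predict_quality(wine):
    """Intercept plus the sum of each coefficient times the matching measurement."""
    return INTERCEPT + sum(COEFFICIENTS[name] * wine[name] for name in COEFFICIENTS)

# A made-up wine, for illustration only (not a row from the dataset).
example_wine = {
    "alcohol": 10.0,
    "volatile_acidity": 0.5,
    "sulphates": 0.6,
    "total_sulfur_dioxide": 40.0,
    "chlorides": 0.08,
    "ph": 3.3,
    "free_sulfur_dioxide": 15.0,
}
predicted = predict_quality(example_wine)
print(round(predicted, 2))
```

Each coefficient is the change in predicted quality per one-unit change in that variable with the others held fixed, so raising alcohol by one percentage point in this example raises the prediction by 0.289.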

Comparing the three hypotheses stated in the research objectives with the coefficients obtained in the previous figure, it can be concluded that H0 is supported for all three hypotheses. Alcohol and sulphates (Hypotheses 1 and 2) have positive coefficients: quality increases with alcohol percentage, and the more sulphates added, the better the quality of the red wine produced. Chlorides (Hypothesis 3) has a negative coefficient, which means that the lower the chlorides, the better the quality of the wine.

It can also be concluded that the final model provides an adequate fit. Four of the five MLR assumptions were fulfilled; the independence of residuals assumption was not. This unmet assumption may contribute to the low values of R, R^2, and adjusted R^2. In further experiments, adding independent variables (for example, dummy variables) or transforming the variables into log or exponential form may help the model meet all of the assumptions.

References

The SPSS file and outputs are available on my GitHub here.
