An intuitive guide to Linear Regression
Frequentist, Lasso and Bayesian Variable Selection using AutoStat®
Our objective is to use linear regression to examine how the alcohol content (%) of wine relates to the cultivar of grape and additional chemical properties.
The data in this case study are measurements of wine characteristics over 3 different cultivars of grape. The wines are produced in the same region of Italy. The data are available at https://archive.ics.uci.edu/ml/datasets/wine and contain measurements on 13 different constituents of the wines, together with the cultivar, namely:
- Alcohol
- Malic acid
- Ash
- Alcalinity of ash
- Magnesium
- Total phenols
- Flavanoids
- Nonflavanoid phenols
- Proanthocyanins
- Color intensity
- Hue
- OD280/OD315 of diluted wines
- Proline
- Cultivar (replacing numeric representation in original data source)
Data Management and Data Splitting
The dataset is uploaded to AutoStat® via the Data Manager module. We then use the options in the Data Manager to check the data types for anomalies and to review the descriptive statistics.
Typically, when classifying with machine learning or predicting with statistical models, the data is split into a training set and a test set. To facilitate our analysis, we stratify the split by Cultivar, selecting 80% (142) of the records to train the models and leaving the remaining 20% (36) to test their predictive ability.
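A stratified 80/20 split of this kind can be sketched in a few lines of plain Python. This is only an illustration, not how AutoStat® performs the split: the function name `stratified_split`, the fixed seed, and the per-stratum rounding rule are assumptions. The class sizes (59, 71 and 48) are those of the UCI wine data, using the numeric cultivar codes from the original source.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_idx, test_idx), sampling train_fraction within each stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for idx, label in enumerate(labels):
        by_stratum[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_stratum):
        members = by_stratum[label][:]
        rng.shuffle(members)
        # Round within each stratum so every cultivar keeps roughly 80% for training.
        n_train = round(train_fraction * len(members))
        train_idx.extend(members[:n_train])
        test_idx.extend(members[n_train:])
    return sorted(train_idx), sorted(test_idx)


# Numeric cultivar codes 1, 2, 3 with the class sizes of the UCI wine data.
labels = [1] * 59 + [2] * 71 + [3] * 48
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 142 36
```

With these class sizes the per-stratum rounding gives 47 + 57 + 38 = 142 training records and 36 test records.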
Although we are using a train and test data set in this example, we note that the data set is both small (178 records) and noisy (see visualisations below). Given this, from a data analytic point of view, we do not expect to get good predictions from any model we use, and are conducting the analysis to gain an understanding of which chemical characteristics and cultivars influence alcohol content in wine.
Visualising the data
Our objective is to examine which chemical measures are related to alcohol content in wine. We begin with a boxplot and add subgroups to compare the cultivars.
A comparison of the subgroups shows that the Grignolino cultivar produces a wine that is lower in alcohol content than Barbera or Barolo. The boxplots below indicate that the data is not symmetrical about the median value. Density plots provide clarity regarding the spread of the population.
The other explanatory variables are all continuous. As such, we construct a scatterplot to examine any potential relationships.
It appears there is some linear relationship (albeit noisy), with alcohol content rising as Proline measures increase. The level of alcohol tends to plateau around 15%. This makes biological sense, as most wines we purchase are below 15%. In a follow-up tutorial, we will examine the use of splines and transformations in linear regression to better model the biological realities of data.
The pair plot below compares three explanatory variables. This plot is useful in understanding the relationship between the outcome (alcohol content) and the measures we hypothesise may be related to the outcome.
From the graph, the relationship between Proline and Colour Intensity appears to form clusters rather than a single linear trend.
From the scatter plot above, we see that Barolo generally has much higher proline levels than the other two wines, whereas Barbera and Grignolino have similar proline levels but vary in colour intensity.
Regression Models
We conduct a series of linear regressions, including:
- Standard Frequentist
- Frequentist with variable selection (Lasso)
- Bayesian variable selection (G-prior spike-slab)
We include all the chemical properties and cultivar as main effects only, with no interactions between the variables. The model can be formally expressed as follows:
$Y = X\beta + \epsilon$,
where $\epsilon \sim N(0, \sigma^2)$.
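Written out for a single wine $i$, the main-effects model might look like the sketch below. It assumes Barbera is the baseline cultivar, with indicator variables for the other two, and it indexes the 12 chemical measurements other than alcohol as $x_{i1}, \dots, x_{i,12}$. The symbol names are illustrative and are not AutoStat® output.

$$
\text{alcohol}_i = \beta_0 + \beta_1\,\mathbb{1}[\text{Barolo}_i] + \beta_2\,\mathbb{1}[\text{Grignolino}_i] + \sum_{j=1}^{12} \beta_{j+2}\,x_{ij} + \epsilon_i,
\qquad \epsilon_i \sim N(0, \sigma^2)
$$

Under this coding there are 15 coefficients in total, counting the constant $\beta_0$.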
The process of specifying the model in AutoStat® is simple and intuitive. Having selected our model framework and dataset, we assign our Forecast (outcome) variable, i.e. alcohol, and our explanatory variables, i.e. all other variables in the dataset.
AutoStat® allows for the selection of either the Frequentist or Bayesian modelling paradigms. We select Frequentist with No Regularisation.
The summary table of coefficients indicates that some of the explanatory variables are related to alcohol content. People often make this assessment based on the p-value of the coefficient, which indicates whether the coefficient is significantly different from zero. The p-value is at the centre of the current debate on reproducibility in science.
In general, it is reasonable to look at the p-value, but not to use it as an absolute bright-line decision-making value. Below we consider more robust methods for variable assessment. Based on p-values, we may consider Cultivar and colour intensity to be the variables most worth considering in terms of changes in alcohol content.
The coefficient for Barolo is 0.7246, indicating that the average alcohol content of this wine is higher than that of the base level, i.e. Barbera, with the other variables held fixed. The coefficient for Colour intensity is positive and small in magnitude. How do we interpret this coefficient? An increase of 1 unit of colour intensity is associated with an increase in alcohol content of 0.1243 percentage points. The variable impact charts provide a quick view of the change in alcohol content attributed to colour intensity. We see that the change in alcohol (%) attributed to a change from 2 units of colour intensity to 13 units is approximately 1.5%.
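The arithmetic behind this kind of interpretation can be sketched as follows. The helper `implied_change` is a hypothetical name, and the sketch reads the quoted 0.1243 as the per-unit colour intensity coefficient in a main-effects model, which is an assumption.

```python
def implied_change(coefficient, start, end):
    """Change in the predicted outcome when one predictor moves from start to end,
    with every other predictor held fixed (main-effects linear model)."""
    return coefficient * (end - start)


per_unit_coef = 0.1243  # quoted coefficient, read here as colour intensity (assumption)
change = implied_change(per_unit_coef, start=2, end=13)
print(round(change, 2))  # 1.37
```

The straight-line calculation gives about 1.37 percentage points, in the same region as the roughly 1.5% read by eye from the variable impact chart.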
Variable selection
It is recognised that including too many variables in a model can reduce its effectiveness, for example by overfitting the training data and making individual coefficients harder to interpret.
In the past, techniques such as stepwise selection, where terms were added to or removed from the model one at a time, were very common. However, these techniques are now widely discouraged due to their frequent misuse and have been replaced by more efficient algorithms.
Lasso regression
One option for reducing the number of variables in the model is known as shrinkage. This method adds a penalty on the size of the coefficients to the minimisation process. The model is fitted with all the explanatory variables, and their coefficients are shrunk towards zero; with the Lasso penalty, some coefficients are set exactly to zero, which removes those variables from the model.
Using the default parameters in AutoStat®, the Lasso model reduces the original 15 variables to 9. The non-zero coefficients are displayed in the table below. This is quite different from the p-value interpretation of 2 variables being “significant”. We can see that Colour intensity and the cultivar Barolo are still included in the model, and that the values of these coefficients have been slightly reduced by the shrinkage the algorithm is designed to achieve. Examining the table, we see that all remaining terms in the model have been reduced in magnitude (except the constant).
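To make the shrinkage idea concrete, here is a minimal coordinate-descent sketch of the Lasso in plain Python. It is not AutoStat®'s implementation: the function names, the toy data, and the penalty value are all assumptions chosen for illustration, and the data are assumed to be centred so that no intercept is needed.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; anything within [-penalty, penalty] becomes 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_sweeps=100):
    """Minimise (1/2n) * sum((y - X b)^2) + penalty * sum(|b|) for centred data."""
    n, k = len(X), len(X[0])
    beta = [0.0] * k
    for _ in range(n_sweeps):
        for j in range(k):
            rho = 0.0
            for i in range(n):
                # Partial residual: the fit from every predictor except j.
                fit_without_j = sum(X[i][m] * beta[m] for m in range(k) if m != j)
                rho += X[i][j] * (y[i] - fit_without_j)
            rho /= n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, penalty) / scale
    return beta


# Toy centred data: y = 2 * x1 + 0.1 * x2, with two orthogonal predictors.
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [2.1, 1.9, -1.9, -2.1]

print(lasso_coordinate_descent(X, y, penalty=0.0))  # no penalty: least squares fit
print(lasso_coordinate_descent(X, y, penalty=0.5))  # weak predictor is dropped
```

In this toy case the penalty pulls the strong coefficient from about 2 down to about 1.5 and sets the weak coefficient to exactly zero, which is the behaviour that lets the Lasso act as a variable selection method.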
Bayesian G-prior spike slab models
For explanatory power, we estimate the most probable models using a G-prior spike slab algorithm. The unknown parameters (β, σ) require a prior distribution to be specified to estimate their respective posterior distributions.
The G-prior provides an estimate of the parameter’s likely location, without requiring the specification of a correlation structure between regressors. This makes it a useful prior for model comparison. The default in AutoStat® is to set g equal to our sample size, so the prior carries roughly the weight of a single observation and the posterior distribution is largely data driven. The prior mean hyper-parameter $\tilde{\beta}$ is set at zero.
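For reference, one common way of writing Zellner's G-prior, in the notation of Marin and Robert (2007), is sketched below. Here $\tilde{\beta}$ denotes the prior mean of $\beta$ and $\hat{\beta}$ the least squares estimate. Whether AutoStat® uses exactly this parameterisation is an assumption.

$$
\beta \mid \sigma^2, X \sim N\!\left(\tilde{\beta},\; g\,\sigma^2 (X^\top X)^{-1}\right),
\qquad \pi(\sigma^2 \mid X) \propto \sigma^{-2}
$$

$$
E[\beta \mid y, X] = \frac{1}{g+1}\left(\tilde{\beta} + g\,\hat{\beta}\right)
= \frac{g}{g+1}\,\hat{\beta} \quad \text{when } \tilde{\beta} = 0
$$

If g is taken to be the 142 training records, the factor $g/(g+1) = 142/143 \approx 0.993$, so within any single model the shrinkage that comes from the prior alone is mild.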
It may be the case that variables in the full model exhibit collinearity (see the pair plot, for example), making assessment of each variable’s contribution to the outcome problematic. Variable selection for dimensionality reduction, followed by averaging over probable models, is a common technique for approaching this issue. The need to assess 2^k competing models requires a strategy to traverse the model space. AutoStat® uses stochastic search variable selection, where each of the 2^k models (Mγ) is associated with a binary vector γ, where γj = 1 when βj is in the model and 0 otherwise.
As the model probability, P(Mγ), is now another parameter to be estimated, we assign a uniform prior distribution over all models. The stochastic search algorithm of Marin and Robert (2007) is then used to determine the posterior probability of each model.
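The model space and its uniform prior can be illustrated with a small enumeration. This sketch only lists the binary vectors γ for three candidate regressors (an illustrative subset, not the full variable set) and is not the stochastic search itself; `enumerate_models` is a hypothetical helper name.

```python
from itertools import product


def enumerate_models(candidates):
    """Yield (gamma, included_names) for every subset of the candidate regressors."""
    for gamma in product((0, 1), repeat=len(candidates)):
        included = [name for name, flag in zip(candidates, gamma) if flag == 1]
        yield gamma, included


candidates = ["Proline", "Colour intensity", "Hue"]  # illustrative subset only
models = list(enumerate_models(candidates))
uniform_prior = 1 / len(models)  # each of the 2^k models gets equal prior mass

for gamma, included in models:
    print(gamma, included, uniform_prior)
```

With k = 3 there are only 8 models, but the count doubles with every extra regressor (k = 14 already gives 16,384), which is why a stochastic search that visits the more probable models is preferred to exhaustive enumeration.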
We can see from the Model Probabilities in the stochastic search panel that, in this case, we do not have a clear “winner” for the most probable model.
Experienced practitioners may expect this, while novice users may be disappointed, but it is often perfectly acceptable. It simply says that, in this case study, several models fit the data about equally well.
However, when we delve into the coefficients, we see that the parameter estimates are generally smaller in magnitude than the standard linear regression estimates, as was the case in the Lasso approach.
Results Analysis & Comparison
After running the three models (Standard Frequentist, Frequentist with Lasso variable selection, and Bayesian variable selection with a G-prior spike slab), which model has the greatest predictive power, and why?
Looking at the prediction metrics of all three models, it is clear that the Standard Frequentist regression model was the most accurate, with Mean Absolute Error, Mean Squared Error and Median Absolute Error values notably lower than those of its Lasso and Bayesian counterparts.
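For readers who want to reproduce these three metrics outside AutoStat®, a minimal sketch using only the Python standard library is shown below. The alcohol values are made up for illustration and are not taken from the wine test set.

```python
from statistics import mean, median


def mean_absolute_error(actual, predicted):
    """Average of the absolute prediction errors."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def mean_squared_error(actual, predicted):
    """Average of the squared prediction errors; penalises large misses more heavily."""
    return mean((a - p) ** 2 for a, p in zip(actual, predicted))


def median_absolute_error(actual, predicted):
    """Median of the absolute prediction errors; robust to a few large misses."""
    return median(abs(a - p) for a, p in zip(actual, predicted))


# Made-up alcohol values (%), for illustration only.
actual = [13.0, 12.5, 14.0, 13.5]
predicted = [12.5, 12.5, 13.0, 14.0]

print(mean_absolute_error(actual, predicted))    # 0.5
print(mean_squared_error(actual, predicted))     # 0.375
print(median_absolute_error(actual, predicted))  # 0.5
```

Lower values are better for all three; comparing them on the held-out test records is what allows the models to be ranked on predictive accuracy.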
Comparison — Lasso vs. Bayesian vs. Frequentist
At this point, a practitioner could fairly ask “Why is this so? Shouldn’t the more sophisticated techniques produce better predictions?”
Looking at the pairplot of actual alcohol content vs the predictions from the 3 models, we notice something very interesting with the Lasso and the G-prior. There is a gap in the predictions that is filled in the standard regression prediction.
This tells the analyst that, although we have determined the variables which have the most influence on alcohol content (namely cultivar and colour intensity), either a) we haven’t accounted for a variable that is important for producing predictions within a certain range, or b) the small size of our dataset and the noise within it have caused our train/test data to be unsatisfactory in its representation of the underlying population.
In terms of gaining insight, combining the modelling approaches enhances our understanding of the different chemical properties and the effect they have on alcohol content. It also helps in making informed decisions about the direction of future research, by embracing uncertainty rather than relying on a bright-line rule.