Bayesian Approach in Logistic Regression

As the field of data science evolves, machine learning and deep learning are drawing most of the attention. Are we forgetting the fundamental knowledge behind these sophisticated methods? We tend to call a complex but powerful model a 'black box' because it loses interpretability as it gains the power to produce the desired output, and along the way we lose track of the development process and the insight hidden in the data. So let's go back to the basics of statistics; their importance should not fade.

Hshan.T · CodeX · Aug 28, 2021

It has been a while since I last worked on a Bayesian analysis, until recently, when I was exposed to its application in logistic regression. Worth mentioning is that, coincidentally, around the same time I came across a paper on an unsupervised natural language task that applies Bayesian concepts to a generative distribution. Let's take a look at the difference between classical and Bayesian analysis.

If you come from a statistics background, whether pure statistics, analytics or actuarial science like me, the terms 'classical statistics' and 'Bayesian statistics' will not be strange to you. In our first statistics lesson in high school we were asked, 'What is probability?'. Today the question is posed in a slightly trickier way: 'What does probability mean in frequentist statistics and in Bayesian statistics?'. A frequentist interprets probability as the long-run frequency of an event over a large number of trials of a repeatable experiment. Bayesian probability, on the other hand, is a way of expressing our degree of belief about an event.

Back during my degree, we were exposed to the theoretical side: treating the parameters of a distribution as random variables, updating a prior belief about the parameters with the data provided, and then drawing conclusions about the predictive model from the updated posterior distribution of its parameters. That overview was restricted to well-known, formally defined distributions, so when I came across the term Bayesian machine learning I kept wondering how it applies in practice. My journey starts here, with a simple example of logistic regression using Bayesian statistics and the Markov chain Monte Carlo (MCMC) method, without going through the full formal pipeline of deduplication and outlier inspection; the dataset is assumed to be free of such data quality issues.

Data Loading and Checking:

The target variable in the given dataset is numeric, but we aim to perform logistic regression, where the target should take only the values 0/1. Let's define a wine with a quality score above 6 as a good wine (y = 1). There are no missing values in the dataset.

R1: Data loading and NA checking.
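A minimal sketch of this step is shown below; the file name and semicolon separator are assumptions based on the UCI red-wine quality data, and may differ from the original script.

```r
# Load the wine quality data (file name and separator are assumptions)
wine <- read.csv("winequality-red.csv", sep = ";")

# Check for missing values in every column
colSums(is.na(wine))

# Binary target: quality score above 6 is a "good" wine (y = 1)
wine$y <- ifelse(wine$quality > 6, 1, 0)
wine$quality <- NULL
```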

Data Summary and Classical Analysis (Generalized linear model, GLM):

Statistics summary:

Table 1: Statistics summary.

Standard Deviation of Variables:

Table 2: Standard deviation of variables.

Frequency Table of Wine Class:

Table 3: Frequency table.
Plot 1: Histogram of wine class.
R2: Statistics summary.
R3: Standard deviation.
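The summaries above can be reproduced along these lines (a sketch, assuming the data frame `wine` with the binary target `y` created earlier):

```r
# Statistics summary of all variables (Table 1)
summary(wine)

# Standard deviation of each predictor (Table 2)
sapply(wine[, names(wine) != "y"], sd)

# Frequency table and histogram of the wine class (Table 3, Plot 1)
table(wine$y)
hist(wine$y, breaks = 2, main = "Wine class", xlab = "y")
```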

Generalized Linear Model (Logistic Regression Model):

R4: GLM model.
R5: Confidence Intervals.
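A sketch of the frequentist fit, assuming all remaining columns of `wine` are used as predictors:

```r
# Full logistic regression model with all predictors (R4)
glm_full <- glm(y ~ ., data = wine, family = binomial(link = "logit"))
summary(glm_full)

# 95% confidence intervals for the coefficients (R5)
confint(glm_full)
```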

The ranges of the variables vary a lot, from a minimum of 0.0136 to a maximum of 283. Total.sulfur.dioxide has a standard deviation of 32.895 and density a standard deviation of 0.002, the maximum and minimum among all variables. There are 217 good wines (quality score above 6) and 1,382 wines not classified as good.

Under the null hypothesis that the coefficient of variable k is zero, at a significance level of 5%, the variables pH, free.sulfur.dioxide and citric.acid have p-values of 0.822, 0.376 and 0.498 respectively. There is thus no statistically significant evidence to reject the null hypothesis for these three coefficients, and they may be removed from the model. In addition, the obvious drop in residual deviance compared with the null deviance, which corresponds to a model with all coefficients set to zero except the intercept, implies that the fitted model is a better fit than the intercept-only model.

Generalized Linear Model (Logistic Regression Model) Without Insignificant Variables:

R6: GLM model with insignificant variables removed.
R7: Confidence intervals with insignificant variables removed.
Plot 2: ROC curve of model in R4
R8: AUC (area under the ROC curve) of the GLM in R4.
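One way to fit the reduced model and compare the ROC curves; the use of the pROC package is an assumption, as the original code for R8 is not reproduced here:

```r
library(pROC)

# Reduced model without pH, free.sulfur.dioxide and citric.acid (R6, R7)
glm_reduced <- glm(y ~ . - pH - free.sulfur.dioxide - citric.acid,
                   data = wine, family = binomial)
summary(glm_reduced)
confint(glm_reduced)

# ROC curves and AUC of both models (Plot 2, R8)
roc_full    <- roc(wine$y, fitted(glm_full))
roc_reduced <- roc(wine$y, fitted(glm_reduced))
plot(roc_full)
lines(roc_reduced, col = "red")
auc(roc_full); auc(roc_reduced)
```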

After removing the insignificant variables, the remaining variables are still statistically significant in this model. The reduced model reports a lower AIC of 890.08; AIC does not identify the best possible model, but it does point to the better-fitting model among those tried. There is only a very slight increase of about 2 in residual deviance, meaning the second model leaves only a tiny amount of additional variation unexplained compared with the first. The ROC curves of the two models almost overlap, with similar areas under the curve, indicating essentially the same classification performance. Nonetheless, the coefficients of the first logistic regression model will be used for the Bayesian analysis.

R9: Probability of a good wine with varying total sulfur dioxide, other variables held fixed.
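A sketch of this calculation; fixing the other predictors at their mean values is an assumption about how the original R9 snippet holds them constant:

```r
# Grid of total.sulfur.dioxide values, other predictors fixed at their means
grid <- data.frame(t(colMeans(wine[, names(wine) != "y"])))
grid <- grid[rep(1, 100), ]
grid$total.sulfur.dioxide <- seq(min(wine$total.sulfur.dioxide),
                                 max(wine$total.sulfur.dioxide),
                                 length.out = 100)

# Predicted probability of a good wine from the full GLM
prob_good <- predict(glm_full, newdata = grid, type = "response")
plot(grid$total.sulfur.dioxide, prob_good, type = "l",
     xlab = "total.sulfur.dioxide", ylab = "P(good wine)")
```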

Bayesian Analysis

Consider the logistic regression equation below:
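In standard form, with $p_i = P(y_i = 1 \mid \mathbf{x}_i)$ and predictors $x_{i1}, \dots, x_{ip}$:

$$\log\!\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} = \mathbf{x}_i^{\top}\boldsymbol{\beta}$$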

We will generate samples for the coefficients of the linear model. Assuming independently distributed coefficients, we take the prior distribution of the coefficients as:
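A common choice, and the one assumed in the code sketches below, is an independent zero-mean normal prior on each coefficient with a fairly large variance (the exact prior used in the original analysis may differ):

$$\beta_k \sim N(0, \sigma_k^2), \qquad k = 0, 1, \dots, p, \ \text{independently}$$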

The log-likelihood in terms of β:
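Substituting $p_i = 1/(1 + e^{-\mathbf{x}_i^{\top}\boldsymbol{\beta}})$ into the Bernoulli likelihood, the standard form for n observations is:

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n}\left[\, y_i\,\mathbf{x}_i^{\top}\boldsymbol{\beta} - \log\!\left(1 + e^{\mathbf{x}_i^{\top}\boldsymbol{\beta}}\right)\right]$$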

We express p in terms of β and X and substitute it into the likelihood function, as above, then obtain the posterior distribution of the vector β by combining it with the prior (the derivation is omitted here). As usual, we work with the log-likelihood to ease computation. The resulting posterior has no closed form, and its maximum cannot be found analytically by setting derivatives to zero. Hence, posterior samples are generated by an MCMC sampling process, trusting that the chain will converge to the posterior regardless of the initialization state if it is run long enough. Multiple initializations are tested to visualize how they influence the MCMC sampling process, here Metropolis-Hastings (MH) sampling.

Table 4: Initializations for MH-algorithm.
R10: MH-algorithm and prediction of classes (step 5: Prediction)
R11: Simulations with different initializations.
R12: Plots of the chains simulated.
Plot 3: Markov chains simulated.
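For reference, a minimal sketch of a random-walk Metropolis-Hastings sampler along the lines of R10 and R11 is given below. The N(0, 100) priors, the proposal covariance c·Ω built from the GLM coefficient covariance, the burn-in length and the 0.5 classification threshold are assumptions made for illustration; the original code may use different choices.

```r
library(MASS)  # for mvrnorm()

X <- cbind(1, as.matrix(wine[, names(wine) != "y"]))  # design matrix with intercept
y <- wine$y

# Log posterior: Bernoulli log-likelihood + independent N(0, 100) log priors
log_post <- function(beta) {
  eta <- X %*% beta
  sum(y * eta - log(1 + exp(eta))) + sum(dnorm(beta, 0, 10, log = TRUE))
}

# Random-walk Metropolis-Hastings sampler with proposal covariance c * omega
mh_sample <- function(init, n_iter, c_scale, omega) {
  chain <- matrix(NA, n_iter, length(init))
  beta <- init
  accepted <- 0
  for (i in 1:n_iter) {
    proposal <- mvrnorm(1, mu = beta, Sigma = c_scale * omega)
    log_ratio <- log_post(proposal) - log_post(beta)
    if (log(runif(1)) < log_ratio) {   # accept with probability min(1, ratio)
      beta <- proposal
      accepted <- accepted + 1
    }
    chain[i, ] <- beta
  }
  list(chain = chain, accept_rate = accepted / n_iter)
}

# Example run initialized at the frequentist estimates
omega  <- vcov(glm_full)
fit_mh <- mh_sample(init = coef(glm_full), n_iter = 10000, c_scale = 1, omega = omega)
fit_mh$accept_rate

# Step 5: prediction with posterior-mean coefficients (first 2000 draws dropped as burn-in)
beta_hat   <- colMeans(fit_mh$chain[-(1:2000), ])
pred_class <- as.integer(plogis(X %*% beta_hat) > 0.5)
```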

The ideal acceptance rate suggested by experts is around 20-30%; it is tuned through the parameter c in the code, which scales the variance of the multivariate normal proposal distribution used in the MCMC process.

All four chains generated with different initializations converged within 10,000 simulations, although there are some fluctuations at the beginning of the chains. The chain initialized with the coefficients from the frequentist analysis converges the fastest, while the third initialization fluctuates the most of the four. Nevertheless, some coefficients obtained from the Bayesian analysis do not converge to the MLE estimates regardless of the initialization applied: the coefficients for the intercept, fixed.acidity, volatile.acidity, residual.sugar, density, pH and alcohol do not converge to the classical estimates. If we set a very strange initial vector, the chain takes a longer run to attain convergence, regardless of whether it eventually converges to the estimates obtained from the frequentist analysis.

Plot 4: Barplots of the predicted target variable for models with different initializations.

Probability of Good wine:

Table 5: Probability of good wine.
R13: Barplot and probability computation.
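A minimal sketch of the comparison in R13, assuming a vector of predicted classes such as `pred_class` from the sampler sketch above (one vector per initialization):

```r
# Predicted proportion of good wines vs. the observed proportion
mean(pred_class)               # proportion predicted as good for one initialization
mean(wine$y)                   # observed proportion in the data
barplot(table(pred_class), main = "Predicted wine class")
```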

There is a difference in the number of samples predicted as good between the Bayesian models formed with different initializations. The third initialization gives a probability (0.0885) that is the most distinct from the original data (0.1570). The remaining three are close to each other, but still slightly away from 0.1570.

MH-Algorithm with the metrop() Function from the mcmc Library:

R14: Sample generation with metrop() in R.
R15: Acceptance probability with metrop().
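A sketch of the same sampler using metrop() from the mcmc package; log_post is the function defined in the earlier sketch, and the scale argument shown (a fraction of the GLM standard errors) is only an assumed starting point for tuning, standing in for the c·Ω scaling discussed above:

```r
library(mcmc)

# metrop() expects a function returning the log unnormalized posterior density
out <- metrop(log_post,
              initial = coef(glm_full),   # start from the frequentist estimates
              nbatch  = 10000,            # number of MCMC iterations
              scale   = 0.5 * summary(glm_full)$coefficients[, "Std. Error"])

out$accept                  # acceptance rate (R15), ideally tuned to around 20-30%
plot(ts(out$batch[, 1]))    # trace plot of the intercept chain
```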

Running the code above, you should expect to obtain the same plots as Plot 3 once the chains attain stationarity, or convergence, although there are some fluctuations at the beginning of the chains compared with the former. This method is readily available and easy to apply in a single line of code, but the challenge is to tune the best scaling parameter for each coefficient, especially when the variables differ widely in their degree of variation. The dataset for this project is a good example: the variation ranges from 0.00005 to 0.1. The situation gets even more complicated when the initialization is set to a vector of odd values, far from the plausible coefficients; we may then fail to reach convergence if the parameter is not properly tuned. An example of such a situation is illustrated below. Instead of reversing the computation of the proposal variance c*omega used in the self-defined MH algorithm to obtain the scale parameter for the metrop() function, the scaling parameter is set to a different matrix. Most of the chains do not converge even after a long run, illustrating the importance of setting the right arguments in the function.

Plot 5: Failed case with an inappropriate scaling argument.

Sometimes a readily available MCMC function saves us from writing lengthy code, but it does not always give a good result, and it can do harm if we do not analyze the output carefully before publishing. The most suitable approach depends on the data available to us. Rather than taking on the risk of defining the function ourselves, where mistakes can creep in along the way, we can keep things simple by normalizing or scaling the data. But note that the model is then built on the transformed data: when we make predictions on new data, the same transformation must be applied before substituting into the model.
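A small sketch of that idea with base R's scale(): standardize the predictors once, keep the centering and scaling constants, and reuse them on any new data before prediction.

```r
# Standardize predictors; keep the centering/scaling constants
X_raw   <- wine[, names(wine) != "y"]
X_std   <- scale(X_raw)
centers <- attr(X_std, "scaled:center")
scales  <- attr(X_std, "scaled:scale")

# New observations must be transformed with the SAME constants
new_data <- X_raw[1:5, ]   # stand-in for genuinely new observations
new_std  <- scale(new_data, center = centers, scale = scales)
```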

Conclusion

Since the aim of this piece of work is to gain hands-on experience with Bayesian analysis and MCMC, the experiment could be improved by splitting the samples into training and testing datasets for a more convincing performance evaluation. A further statistical inference that plays a role similar to hypothesis testing with p-values in frequentist statistics is the computation of the highest density interval: search for the shortest interval containing the specified probability mass, then check whether the value being tested falls outside that interval to draw a conclusion.
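As a sketch of that idea, the shortest interval can be computed directly from posterior draws, for example one column of the MH chain simulated above; the 95% level, the burn-in and the choice of coefficient are assumptions for illustration.

```r
# Shortest interval containing `level` of the posterior draws (HDI for a unimodal posterior)
hdi <- function(draws, level = 0.95) {
  draws <- sort(draws)
  n <- length(draws)
  m <- ceiling(level * n)                    # number of draws the interval must cover
  widths <- draws[m:n] - draws[1:(n - m + 1)]
  i <- which.min(widths)                     # start of the narrowest window
  c(lower = draws[i], upper = draws[i + m - 1])
}

# Example: HDI for the last predictor's coefficient (alcohol in this column order);
# 0 lying outside the interval plays the role of a "significant" effect
hdi(fit_mh$chain[-(1:2000), ncol(X)])
```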

Beyond all the debates between frequentists and Bayesians, I believe the innovation and publication of every new idea serves to overcome the drawbacks of existing methods or to provide a different perspective on the problem. The existing methods should not be completely replaced and wiped out. Choosing the right method for each case boosts performance, and you may discover gems behind the digits.
