Interpretable AI: Logistic Regression

Shruti Misra
Jun 15, 2023


In my previous post, I went over how to interpret linear regression models. However, linear regression doesn’t work well for classification. Linear regression predicts continuous values by fitting a hyperplane that minimizes the distance between the plane and the data points. In classification, though, we care about the probability of a point belonging to one category or another. Linear regression interpolates between points, and the interpolated values cannot be interpreted as probabilities. Further, linear regression can output values below 0 and above 1, which are outside the valid range for probabilities. So there is no principled way of distinguishing one class from another. A good explanation of why linear regression does not work for classification can be found here.
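To see this concretely, here is a minimal toy sketch (not from the original post) that fits ordinary linear regression to binary labels; the predictions it produces fall below 0 and above 1, so they cannot be read as probabilities.

```python
# Toy illustration: linear regression fit to binary labels can produce
# "probabilities" outside [0, 1].
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [20]])  # single feature, with one large value
y = np.array([0, 0, 1, 1, 1])             # binary class labels

lin_reg = LinearRegression().fit(X, y)
print(lin_reg.predict(np.array([[-20], [2.5], [20]])))
# -> roughly [-0.34, 0.47, 1.10]: values below 0 and above 1 are not valid probabilities
```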

So, what does logistic regression do? Instead of fitting a line or hyperplane, logistic regression uses the logistic function to squeeze the output of a linear equation between 0 and 1, thus producing probabilities that can determine the class of an input. The logistic function is defined as:

logistic(η) = 1 / (1 + exp(−η))
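As a quick sanity check, here is a minimal NumPy sketch of that function; note how arbitrarily large or small inputs get squeezed into the interval (0, 1).

```python
# Minimal sketch of the logistic (sigmoid) function described above.
import numpy as np

def logistic(eta):
    """Map any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-eta))

print(logistic(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# -> approximately [0.0000454, 0.2689, 0.5, 0.7311, 0.99995]
```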

So, how does logistic regression work? In classification, we care about probabilities, so instead of taking:

y = w₀ + w₁x₁ + … + wₚxₚ,

we model the probability that y belongs to a particular class:

P(y = 1) = w₀ + w₁x₁ + … + wₚxₚ.

However, in the above equation the left-hand side is a probability while the right-hand side is not bounded between 0 and 1. To remedy this, we introduce the concept of “odds”: the probability of the event divided by the probability of it not happening,

odds = P / (1 − P).

But odds are always positive, and it is difficult to model a variable with the restricted range [0, ∞). Therefore, we take the log of the odds so that the range becomes (−∞, ∞), which is easier to model. The equation then becomes

log(P / (1 − P)) = w₀ + w₁x₁ + … + wₚxₚ,

where log(P / (1 − P)) is the logit function. In this case, we want the probability P, so we exponentiate both sides and solve for P. I won’t go into the algebra, but the end result is the logistic regression equation:

P = 1 / (1 + exp(−(w₀ + w₁x₁ + … + wₚxₚ)))
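To make that final equation concrete, here is a small sketch that computes P(y = 1) for a single input; the intercept, the weights and the input values are made up purely for illustration.

```python
# Sketch of the logistic regression equation with made-up weights.
import numpy as np

def predict_proba(x, w0, w):
    """P(y = 1 | x) = 1 / (1 + exp(-(w0 + w1*x1 + ... + wp*xp)))"""
    eta = w0 + np.dot(w, x)             # the linear part
    return 1.0 / (1.0 + np.exp(-eta))   # squeezed into (0, 1) by the logistic function

w0 = -1.5                    # hypothetical intercept
w = np.array([0.8, 2.0])     # hypothetical feature weights
x = np.array([1.0, 0.5])     # one hypothetical input

print(predict_proba(x, w0, w))  # -> ~0.574, a valid probability
```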

Interpretation

Interpreting log odds isn’t as obvious or intuitive as interpreting the weights of linear regression. Given the coefficient w of an input feature, the odds ratio for that feature is exp(w): a one-unit increase in that feature multiplies the odds by a factor of exp(w). For example, say we want to classify an individual as obese (measured by weight, BMI, etc.) based on how many sodas they drink in a day. If the coefficient for sodas per day is w = 1.70, then the odds ratio is exp(1.70) = 5.47. This can be interpreted as: each additional soda per day increases the odds of the person being classified as obese by a factor of 5.47.
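Reproducing that arithmetic in a couple of lines (1.70 is the made-up coefficient from the example above):

```python
# The odds ratio for a feature is exp(w).
import numpy as np

w_soda = 1.70                        # hypothetical coefficient for sodas per day
print(round(np.exp(w_soda), 2))      # -> 5.47: one extra soda multiplies the odds by ~5.47
print(round(np.exp(2 * w_soda), 2))  # -> 29.96: two extra sodas multiply the odds by exp(2w)
```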

Example Time!

In this example, I used the Diabetes Prediction Dataset from Kaggle to examine which independent variables (age, gender, BMI, hypertension, heart disease, blood glucose level, HbA1c level and smoking history) can help predict diabetes (y). The dataset consists of 100,000 rows, which I split into a training (66%) and a test (33%) set. I used dummy encoding for the categorical variables that were not already encoded (gender and smoking history). I also removed rows for individuals categorized as “Other” gender (only 0.00018% of the dataset) and those with “No Info” or “not current” smoking history. I then fit a logistic regression model on the training set and analyzed its coefficients.
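Below is a rough sketch of that preprocessing and fitting pipeline. The file name and column names (e.g. 'gender', 'smoking_history', 'diabetes') are my assumptions about the Kaggle CSV rather than something taken from the original notebook, so they may need adjusting.

```python
# Sketch of the preprocessing and model fitting described above.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("diabetes_prediction_dataset.csv")  # assumed file name

# Drop the rows excluded in the post
df = df[df["gender"] != "Other"]
df = df[~df["smoking_history"].isin(["No Info", "not current"])]

# Dummy-encode the categorical variables
df = pd.get_dummies(df, columns=["gender", "smoking_history"], drop_first=True)

X = df.drop(columns=["diabetes"])
y = df["diabetes"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# sklearn's defaults: penalty='l2' with C=1.0, i.e. a fairly strong penalty
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(dict(zip(X.columns, model.coef_[0])))
```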

Figure 1: Coefficients for the input features with strongly regularized logistic regression

The plot above shows that the presence of heart disease and high HbA1c levels increase the odds of having diabetes by factors of 5.51 and 12.37 respectively. Being a current smoker slightly increases the odds of having diabetes, by a factor of 1.26. An interesting observation is that the presence of hypertension seems to slightly decrease the odds of diabetes. This is strange because hypertension and diabetes are known to be comorbidities. Additionally, it has been found that males are at a higher risk of developing type 2 diabetes than females (though females end up dealing with more complications from type 2 diabetes; of course, it’s not as cut and dried as males vs. females, since other factors such as family history, diet, etc. also play a role). But what the model is telling us is the opposite of these generally known trends. Digging into this a bit, it turns out that the logistic regression implementation in sklearn applies an L2 regularization penalty by default. I tried turning down the penalty strength by increasing the C parameter of the model (you can also set the penalty to None) and obtained the following result (I set C to 1e9).

Figure 2: Coefficients for the input features with weakly regularized logistic regression

The result is drastically different: in the graph above, heart disease now has a slightly higher coefficient than HbA1c level, and the signs of the hypertension and gender coefficients align with what was expected. In this case, the presence of heart disease and high HbA1c levels increase the odds of having diabetes by factors of 9.58 and 9.03 respectively, while the presence of hypertension increases the odds of having diabetes by a factor of 1.39.
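For reference, here is a sketch of how the two fits might be set up and compared, assuming the X_train and y_train from the earlier snippet; in sklearn, a larger C means a weaker penalty.

```python
# Strongly vs. weakly regularized logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

strong = LogisticRegression(max_iter=1000)       # default: penalty='l2', C=1.0
weak = LogisticRegression(C=1e9, max_iter=1000)  # C=1e9 ~ almost no penalty
# (recent sklearn versions also accept penalty=None)

strong.fit(X_train, y_train)
weak.fit(X_train, y_train)

# Compare the coefficients as odds ratios
for name, ws, ww in zip(X_train.columns, strong.coef_[0], weak.coef_[0]):
    print(f"{name}: strong OR = {np.exp(ws):.2f}, weak OR = {np.exp(ww):.2f}")
```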

I tried using both versions of logistic regression (strongly and weakly regularized) to classify individuals in the test dataset as diabetic or not. The accuracies are not very different.

Mean accuracy for strongly regularized logistic regression: 95.83%
Mean accuracy for weakly regularized logistic regression: 95.78%
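Something along these lines computes that comparison (a sketch, assuming the strong and weak models and the test split from the snippets above):

```python
# Mean accuracy on the held-out test set for both fits.
print(f"Strongly regularized: {strong.score(X_test, y_test):.2%}")
print(f"Weakly regularized:   {weak.score(X_test, y_test):.2%}")
```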

The key takeaway here is that while the accuracies are barely different in the two cases, the interpretations are very different.

Conclusion

Logistic regression is a fundamental method for classification. It is easy to implement, fast, and can be extended to multiple classes. However, like linear regression, it assumes no multicollinearity among the input features, assumes a linear relationship between the input features and the logit of the output, and typically requires a large sample size. In this post, I went over how to interpret logistic regression models to understand why they predict what they predict. Logistic regression models are fairly easy to understand once you build a little intuition about how log odds work.

However, there are some challenges to interpretation, as we saw when changing the regularization of the model. In the case of the strongly and weakly regularized models, which one is “correct”? The strongly regularized model provides slightly higher accuracy, but its interpretation contradicts some commonly known facts about diabetes and its relationship with hypertension and gender. Is this an artifact of the dataset? So, in practice, which model should be used? This is where domain expertise is incredibly important. As a data scientist with no medical training, I can show the results of these models to a medical professional and ask them what they think. They might have probing questions about the dataset that I haven’t thought of, which might guide model selection. When thinking about interpretability in AI, we need to keep in mind who will be interpreting the results of these models and how we can incorporate them into the “design process” of different AI models. After all, interpretability is contextual and subjective, and it requires the input of those who are well versed in the context that the AI models will function within.

Code

The notebook and dataset for this post can be found in my GitHub repository.
