Logistic Regression

Zaki Jefferson
Published in Analytics Vidhya · Mar 10, 2020 · 4 min read

There comes a time when you are given, or have chosen, a dataset with a categorical outcome, and when you reach that point you will quickly find that linear regression isn't as helpful as you might think.

When your dependent variable is categorical, you want to use logistic regression to get the best fit for your model. Logistic regression handles binary outcomes directly, and extensions of it handle more complex cases such as multiple classes.

Dataset Example

Let's take a red wine dataset, which can be found here on Kaggle, with 12 columns. From this, we can define our dependent variable as “quality,” and the remaining 11 columns will be our independent variables, otherwise known as features.

Throughout our dataset the features are continuous values, but that stops when we reach our dependent variable, “quality,” which is categorical.
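A minimal sketch of this setup (the DataFrame below is a tiny stand-in for the Kaggle file, which you would normally load with `pd.read_csv("winequality-red.csv")` — the real data has 12 columns and about 1,600 rows):

```python
import pandas as pd

# Tiny stand-in for the Kaggle red-wine data; only a few of the
# 11 real feature columns are shown here.
df = pd.DataFrame({
    "fixed acidity":    [7.4, 7.8, 11.2, 7.9],
    "volatile acidity": [0.70, 0.88, 0.28, 0.60],
    "alcohol":          [9.4, 9.8, 9.8, 9.9],
    "quality":          [5, 5, 6, 5],
})

X = df.drop(columns="quality")  # independent variables (features)
y = df["quality"]               # dependent, categorical variable
print(X.shape, sorted(y.unique()))
```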

Data Visualization

Seaborn Scatterplot of the independent variables

By plotting your independent variables you are able to see how your data is distributed. Generally you want your data to have a normal distribution, with some linear distributions. Our visualization above shows some normal distributions alongside many more skewed, roughly exponential distributions. We can tame the exponential distributions by applying the inverse operation: the logarithm.

Note: A log transformation can be as easy as calling one line of code.
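A sketch of that one-liner, assuming the features live in a DataFrame of strictly positive values (`np.log` would produce `-inf` on zeros, in which case `np.log1p` is the safer choice; the column values here are stand-ins):

```python
import numpy as np
import pandas as pd

# Stand-in for two right-skewed wine features
X = pd.DataFrame({
    "residual sugar": [1.9, 2.6, 2.3, 1.9],
    "chlorides":      [0.076, 0.098, 0.092, 0.075],
})

# One line: element-wise natural log of every feature column
X_log = np.log(X)

print(X_log.round(3))
```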

After transforming our data using log we should now take a look at how our data is distributed.

Seaborn Scatter plot of Log transformed data

The visualization above shows our newly transformed independent variables. It looks like our data has more normal distributions with some linear distributions which is exactly where we want to be.

The next step is to check the distribution of our dependent variable:
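One way to sketch that check is with simple class counts (the `quality` values below are stand-ins; the picture in the article itself would come from seaborn, e.g. `sns.countplot`):

```python
import pandas as pd

# Stand-in sample of quality labels
y = pd.Series([5, 5, 6, 5, 7, 6, 5, 6], name="quality")

# How many wines fall into each quality class?
counts = y.value_counts().sort_index()
print(counts)

# The seaborn version of the same check (assumes seaborn is installed):
#   import seaborn as sns
#   sns.countplot(x=y)
```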

Seaborn distribution plot of the dependent variable

Checking for multicollinearity

The next step would be to find out if our data has multicollinearity (a situation in which two or more explanatory variables in a multiple regression model are highly linearly related). It is best to avoid multicollinearity at all costs because it can lower the accuracy of your model.

How do we check?

We can use a heat map of the correlation matrix to see multicollinearity. If the correlation between two features is greater than 0.4, for many feature pairs, then you have serious multicollinearity and you might want to think about dropping some features.
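A sketch of that check, using a small stand-in feature matrix and the 0.4 cutoff from above (the real check would run on the 11 wine features):

```python
import pandas as pd

# Stand-in feature matrix: "b" is built to track "a" closely
X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],   # strongly correlated with "a"
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],   # weakly related to the others
})

corr = X.corr().abs()

# Flag feature pairs whose |correlation| exceeds the 0.4 cutoff
pairs = [(i, j) for i in corr.columns for j in corr.columns
         if i < j and corr.loc[i, j] > 0.4]
print(pairs)

# The heat map itself (assumes seaborn is installed):
#   import seaborn as sns
#   sns.heatmap(corr, annot=True)
```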

It looks like our data avoids multicollinearity pretty well, so no features will have to be dropped.

Regression

We looked at our data, transformed it, and checked for multicollinearity. We are now ready to do a regression model on our data.

We are calling the LogisticRegression class and instantiating it. We then fit our model on the training set and predict on that same set of independent variables. We then use the score method to return the accuracy on the training set, and sadly it is not looking too good. We got a score of 58%, which means that our model isn't fitting the data very well.
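A sketch of those steps with scikit-learn (the features and labels below are synthetic stand-ins, since the article's code appeared as images; the real version would use the log-transformed wine features and quality labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine features and a binary quality label
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Instantiate, fit on the training set, then score (mean accuracy)
model = LogisticRegression()
model.fit(X_train, y_train)
train_accuracy = model.score(X_train, y_train)
print(train_accuracy)
```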

How to improve the model?

There are a number of ways to improve your model's score, from polynomial regression to regularization, or even going back and applying a different transformation to your data. For now we will try polynomial regression and see if we can get a better score. Adding polynomial features lets the model fit the data better by bending the decision boundary a little.
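One common way to sketch this is a PolynomialFeatures-then-LogisticRegression pipeline (the data below is a synthetic stand-in with a deliberately non-linear boundary; the article's actual code is not shown):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data with a curved class boundary
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X[:, 0] ** 2 + X[:, 1] > 1).astype(int)

# Degree-2 expansion feeds squared and interaction terms to the model,
# letting the logistic regression bend its decision boundary
poly_model = make_pipeline(PolynomialFeatures(degree=2),
                           LogisticRegression(max_iter=1000))
poly_model.fit(X, y)
poly_score = poly_model.score(X, y)
print(poly_score)
```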

The code above helps us fit a polynomial regression alongside our logistic regression, which improves the accuracy score to 62%. Yes, I know that 62% isn't something you should be bragging about, but let's see how bad it might be by checking our R-squared (a statistical measure that represents the goodness of fit of a regression model).

StatsModel

We can use another library called StatsModels which will list a variety of information about our regression model.

Both our R-Squared and Adj. R-Squared are at a value of 1.0. This means that our model is at its maximum fit and there are no residuals from our actual values to our predicted values.

Is there a better way?

YES! Most likely.

