Preparing Your Data Before Modeling with a Logistic Regression Classifier

Edward Nathan Wibisono
4 min read · Jun 18, 2023


Logistic Regression is a machine learning classification algorithm, despite its “Regression” name, used to predict the probability of an event occurring. It is called “logistic” regression because it uses the logistic function (also known as the sigmoid function) to make its predictions.
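To make that concrete, here is a minimal sketch of the sigmoid function in Python (NumPy is assumed; the linear score it is applied to comes from whatever coefficients the training procedure estimates):

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Logistic regression predicts P(y = 1 | x) by applying sigmoid to a linear score w·x + b
print(sigmoid(0.0))  # 0.5 -- right on the decision boundary
print(sigmoid(3.0))  # ~0.95 -- strongly toward the positive class
```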

For background, the Logistic Regression Classifier has several advantages that make it popular among classification algorithms. Its strong points are:

  • Interpretability: Logistic Regression provides interpretable results. The estimated coefficients represent the effect of each feature on the response variable, so by inspecting them we can see which features have the most impact on the model’s outcome.
  • Low Risk of Overfitting: Logistic Regression has a low risk of overfitting, as it has a limited number of parameters and is simpler than algorithms such as decision trees. With a small number of parameters, the algorithm is less likely to capture noise or irrelevant patterns in the data.
  • Efficiency: Logistic Regression is efficient when handling large datasets with many features. Its computation is relatively quick, which makes it suitable for large-scale data sets, and the time required to tune a logistic regression model is also lower than for algorithms with more complex models.

However, those advantages come with assumptions that have to be met in order to use a logistic regression model. In this article, you will find four things you need to check in your data, to meet some of the algorithm’s assumptions, before modeling with a Logistic Regression Classifier. Let’s get started!

1. The Target is Binary

As Logistic Regression is a classification algorithm, it is mandatory that the Target / Response variable is binary. For example:

  • Yes or No
  • Churn or Not Churn
  • Male or Female

How do I know if my Target Variable is binary? We can simply check by counting how many unique values the response variable has. It is considered binary if there are exactly two unique values, no more, no less, as in the sketch below.
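A rough sketch of this check, assuming your data lives in a pandas DataFrame and the target column is called churn (both names are placeholders for illustration):

```python
import pandas as pd

# Hypothetical data: 'churn' stands in for your own target column
df = pd.DataFrame({"churn": ["Yes", "No", "No", "Yes", "No"]})

n_unique = df["churn"].nunique()
print(df["churn"].value_counts())  # how many rows fall into each class
print("Binary target" if n_unique == 2 else f"Not binary: {n_unique} classes")
```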

2. There are No Outliers

Outliers tend to influence model performance in a negative way. They can degrade performance because the model tends to overfit the data, which calls into question its ability to classify accurately on the test set or once it is deployed.

How do I know if my data set contains outliers? We can check by calculating each point’s Z-score (how many standard deviations a data point is from the mean). Under the common assumption that a data point with a Z-score greater than 3 is an outlier, we can flag outliers easily. Outliers can also be detected manually by plotting a numerical feature as a histogram: if there are data points whose values are far greater or far smaller than the rest of the data, and that appear with a small count in the histogram, they can be considered outliers.
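A minimal sketch of the Z-score check, using a synthetic income feature with two extreme values injected (the column name and data are purely illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic example: a well-behaved feature with two extreme values injected
rng = np.random.default_rng(0)
income = np.concatenate([rng.normal(50, 10, size=500), [250, 300]])
df = pd.DataFrame({"income": income})

z_scores = (df["income"] - df["income"].mean()) / df["income"].std()
outliers = df[np.abs(z_scores) > 3]  # rule of thumb: |z| > 3 flags a potential outlier
print(outliers)

# Manual check: the same extremes show up as isolated bars in a histogram
# df["income"].hist(bins=50)
```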

3. The Features are Independent

Features in the data must be independent, meaning an explanatory variable must not be derived from the measurement of another feature or be closely related to it.

How do I know if my features are independent? We can check by manually verifying whether a feature is derived from another feature, and by plotting a correlation matrix of the features, as in the sketch below. If one or more features have a high correlation with another feature, the features are not independent.
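A sketch of the correlation check, again with made-up features where monthly_spend is deliberately derived from income (the 0.8 threshold is an arbitrary choice, not a fixed rule):

```python
import numpy as np
import pandas as pd

# Synthetic features: 'monthly_spend' is deliberately derived from 'income'
rng = np.random.default_rng(0)
income = rng.normal(50, 10, size=500)
df = pd.DataFrame({
    "age": rng.integers(20, 65, size=500),
    "income": income,
    "monthly_spend": 0.3 * income + rng.normal(0, 1, size=500),
})

corr = df.corr()
print(corr.round(2))                            # pairwise Pearson correlations
print((corr.abs() > 0.8) & (corr.abs() < 1.0))  # flag suspiciously correlated pairs
```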

4. There is No Multi-Collinearity between Features

There must not be multicollinearity between the explanatory variables. Multicollinearity occurs when two or more features are correlated with each other (i.e., have a high correlation value) to the point that each of those features no longer carries unique information.

How do I know if my features exhibit multicollinearity? By checking the VIF (Variance Inflation Factor), which measures how strongly each feature can be explained by the other features. The higher the VIF, the greater the multicollinearity; a sketch of this check follows.
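A sketch of the VIF check with statsmodels, reusing the same kind of made-up features as above (the 5–10 cutoff is a common rule of thumb, not something fixed by the algorithm):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Same hypothetical features as in the correlation example
rng = np.random.default_rng(0)
income = rng.normal(50, 10, size=500)
features = pd.DataFrame({
    "age": rng.integers(20, 65, size=500).astype(float),
    "income": income,
    "monthly_spend": 0.3 * income + rng.normal(0, 1, size=500),
})

X = add_constant(features)  # include an intercept column before computing VIF
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.round(2))  # rule of thumb: VIF above ~5-10 suggests multicollinearity
```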

To conclude, you must fulfill these four assumptions to use a logistic regression classifier for your machine learning model. If one or more of the assumptions above are not met, it is recommended to use another machine learning classification model. Hopefully, this article helps you prepare your data before fitting it to your Logistic Regression Classifier. Thank you for taking the time to read my article!
