Logistic Regression.

Swapnil Bandgar
Published in Analytics Vidhya
13 min read · May 28, 2021

Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression which outputs continuous number values, logistic regression transforms its output using the logistic sigmoid function to return a probability value which can then be mapped to two or more discrete classes.

For example, given input variables such as Temperature, Humidity and Wind, our model can predict what the weather will be.

Comparison to linear regression:

Given data on time spent studying and exam scores, linear regression and logistic regression can predict different things:

Linear regression: On a scale of 0 to 100, Linear Regression may help us predict the student’s test score. The predictions of linear regression are continuous (numbers in a range).

Logistic regression: We may be able to use logistic regression to determine whether a student would pass or fail. The predictions of logistic regression are discrete (only specific values or categories are allowed). The probability scores that underpin the model’s classifications can also be viewed.
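As a rough illustration of this difference, the sketch below fits both models with scikit-learn on made-up study-hours data; the numbers and variable names are hypothetical.

```python
# Hypothetical data: hours studied -> exam score (continuous)
# and hours studied -> pass/fail (binary).
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
scores = np.array([35, 42, 50, 55, 63, 71, 80, 88])   # continuous target
passed = np.array([0, 0, 0, 1, 1, 1, 1, 1])           # binary target

lin = LinearRegression().fit(hours, scores)
clf = LogisticRegression().fit(hours, passed)

print(lin.predict([[4.5]]))        # a continuous score, e.g. around 60
print(clf.predict([[4.5]]))        # a discrete class: 0 or 1
print(clf.predict_proba([[4.5]]))  # the underlying probabilities in [0, 1]
```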

Types of Logistic Regression:

Generally, logistic regression means binary logistic regression with a binary target variable, but the target variable can also have more than two categories. Based on the number of categories, logistic regression can be divided into the following types:

Binary or Binomial

In this kind of classification, the dependent variable has only two possible values, either 1 or 0. For example, these variables may represent success or failure, yes or no, win or loss, etc.

Multinomial

In this kind of classification, the dependent variable can have three or more possible unordered types, i.e., types with no quantitative significance. For example, these variables may represent “Type A”, “Type B” or “Type C”.

Ordinal

In this kind of classification, the dependent variable can have three or more possible ordered types, i.e., types with a quantitative significance. For example, these variables may represent “poor”, “good”, “very good” and “excellent”, and each category can be assigned a score such as 0, 1, 2, 3.

Assumptions in Logistic Regression:

Logistic regression does not make many of the key assumptions of linear regression and general linear models that are based on ordinary least squares algorithms — particularly regarding linearity, normality, homoscedasticity, and measurement level.

1. Logistic regression does not require a linear relationship between the dependent and independent variables.

2. The error terms (residuals) do not need to be normally distributed.

3. Homoscedasticity is not required.

4. The dependent variable in logistic regression is not measured on an interval or ratio scale.

However, some other assumptions still apply.

· Binary logistic regression requires the dependent variable to be binary and ordinal logistic regression requires the dependent variable to be ordinal.

· Logistic regression requires the observations to be independent of each other. In other words, the observations should not come from repeated measurements or matched data.

· Logistic regression requires there to be little or no multicollinearity among the independent variables. This means that the independent variables should not be too highly correlated with each other.

· Logistic regression assumes linearity of independent variables and log odds. Although this analysis does not require the dependent and independent variables to be related linearly, it requires that the independent variables be linearly related to the log odds.

· Logistic regression typically requires a large sample size. A general guideline is that you need a minimum of 10 cases of the least frequent outcome for each independent variable in your model.

For example, if you have 5 independent variables and the expected probability of your least frequent outcome is .10, then you would need a minimum sample size of 500 (10*5 / .10).

Different Regression Algorithms:

There are various regression techniques available for making predictions. These techniques differ along three dimensions: the number of independent variables, the type of dependent variable, and the shape of the regression line.

1. Linear Regression

2. Logistic Regression

3. Ridge Regression

4. Lasso Regression

5. Polynomial Regression

6. Bayesian Linear Regression

7. Stepwise Regression

8. ElasticNet Regression

9. Support vector Regression

10. Random Forest Regression

11. Decision Tree Regression

Linear Regression:

Linear regression is a predictive modeling technique that finds a relationship between independent variable(s) and dependent variable(s) (which is a continuous variable).

When there is a single input variable (x), the method is referred to as simple linear regression. When there are multiple input variables, literature from statistics often refers to the method as multiple linear regression.

Logistic Regression:

Logistic regression is a type of regression analysis used when the dependent variable is discrete. It is one of the most popular machine learning algorithms and comes under the supervised learning technique. It is used to predict a categorical dependent variable from a given set of independent variables.

Logistic regression predicts the output of a categorical dependent variable, so the outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.

The logistic regression model is given by the equation below:
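In its standard binary form, the linear combination of the inputs is passed through the sigmoid function to produce a probability between 0 and 1:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n)}}$$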

Ridge Regression:

Ridge Regression is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large so they may be far from the true value. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors. This method is used when the independent variables are highly correlated.

· Performs L2 regularization, i.e., adds a penalty equivalent to the square of the magnitude of the coefficients.

· Minimization objective = LS Obj + α * (sum of squares of coefficients)
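Written out, with LS Obj denoting the least-squares term, the ridge objective is:

$$\min_{\beta} \sum_{i=1}^{N}\left(y_i - x_i^{\top}\beta\right)^2 + \alpha \sum_{j=1}^{p}\beta_j^2$$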

Ridge is a powerful regression method because the resulting model is less susceptible to overfitting.

Lasso Regression:

LASSO (Least Absolute Shrinkage and Selection Operator) is quite like ridge but reduces the number of features: if the penalty term is large enough, coefficients can be shrunk all the way to zero, which makes feature selection easier. LASSO uses the L1 regularization technique.

LASSO is generally used when we have a greater number of features, because it automatically performs feature selection. As an outcome, some coefficient values are driven exactly to zero, which does not happen in the case of ridge regression.

LASSO applies soft thresholding and picks a subset of the given covariates for use in the final model.

Below is the equation that represents the Lasso Regression method:
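In the same notation as the ridge objective above, the penalty is now the sum of the absolute values of the coefficients:

$$\min_{\beta} \sum_{i=1}^{N}\left(y_i - x_i^{\top}\beta\right)^2 + \alpha \sum_{j=1}^{p}\left|\beta_j\right|$$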

Polynomial Regression:

In Polynomial Regression, the relationship between the independent and dependent variables, X and Y, is modeled as an n-th degree polynomial.

It is still a linear model in its coefficients, and the least squares method is used to fit it. The best fit line in Polynomial Regression is not a straight line but a curve, whose shape depends on the power of X, i.e., the value of n.

Polynomial Regression can be represented by below equation:
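For a single input x and degree n:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n + \epsilon$$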

Bayesian Linear Regression:

Bayesian Regression is one of the types of regression in machine learning that uses the Bayes theorem to find out the value of regression coefficients. In this method of regression, the posterior distribution of the features is determined instead of finding the least-squares. Bayesian Linear Regression is like both Linear Regression and Ridge Regression but is more stable than the simple Linear Regression.

Stepwise Regression:

It is used to fit regression models in which the choice of predictive variables is carried out automatically. At each step, a variable is added to or subtracted from the set of explanatory variables.

Forward selection keeps adding variables and reviewing performance, stopping once no meaningful improvement is achieved.

Backward elimination removes one variable at a time until no further variable can be deleted without a considerable loss. Bidirectional elimination is a blend of the two approaches.
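A minimal sketch of forward selection using scikit-learn's SequentialFeatureSelector; the dataset and the number of features to keep are placeholders.

```python
# Forward stepwise selection: add one feature at a time, keeping the
# addition that most improves cross-validated performance.
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=5,   # placeholder; tune for your problem
    direction="forward",      # "backward" gives backward elimination
    cv=5,
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected features
```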

ElasticNet Regression:

It is a mixture of ridge and lasso regression that brings out a grouping effect: highly correlated predictors tend to enter or leave the model together. It is recommended when the number of predictors is much greater than the number of observations.
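Combining the two penalties above, the elastic net objective can be written (in the same notation as before) as:

$$\min_{\beta} \sum_{i=1}^{N}\left(y_i - x_i^{\top}\beta\right)^2 + \alpha_1 \sum_{j=1}^{p}\left|\beta_j\right| + \alpha_2 \sum_{j=1}^{p}\beta_j^2$$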

Support vector Regression:

Support Vector Regression (SVR) uses the same principle as SVM, but for regression problems.

Instead of separating classes, it tries to find a line/hyperplane (in multidimensional space) that fits the data as closely as possible while keeping most points within a margin (the epsilon tube) around it; only points that fall outside the tube contribute to the loss.
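A minimal SVR sketch with scikit-learn; the data and hyperparameter values are placeholders.

```python
# Epsilon-SVR: fit a function that keeps most points inside an
# epsilon-wide tube; points outside the tube contribute to the loss.
from sklearn.svm import SVR

X = [[1], [2], [3], [4], [5]]     # placeholder feature values
y = [1.2, 1.9, 3.2, 3.8, 5.1]     # placeholder continuous targets

model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X, y)
print(model.predict([[2.5]]))     # a continuous prediction
```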

Random Forest Regression:

Random Forest Regression is a supervised learning algorithm that uses the ensemble learning method for regression. Ensemble learning is a technique that combines predictions from multiple machine learning models to make a more accurate prediction than any single model.

Decision Tree Regression:

Tree-based algorithms are among the best and most widely used supervised learning methods. They empower predictive modeling with higher accuracy, better stability and ease of interpretation.

Metrics for Logistic regression:

After we have finished building our model, there are multiple metrics that help us evaluate its accuracy.

1. Confusion Matrix

2. F1 Score

3. Gain and Lift Charts

4. Kolmogorov Smirnov Chart

5. AUC — ROC

6. Log Loss

7. Gini Coefficient

8. Concordant — Discordant Ratio

9. Root Mean Squared Error

10. Root Mean Squared Logarithmic Error

11. R-Squared/Adjusted R-Squared

1. Confusion Matrix:

A confusion matrix is an N x N matrix, where N is the number of classes being predicted. For the problem at hand, we have N = 2, and hence we get a 2 x 2 matrix. Here are a few definitions you need to remember for a confusion matrix:

· Accuracy: the proportion of the total number of predictions that were correct.

· Positive Predictive Value or Precision: the proportion of predicted positive cases that were actually positive.

· Negative Predictive Value: the proportion of predicted negative cases that were actually negative.

· Sensitivity or Recall: the proportion of actual positive cases which are correctly identified.

· Specificity: the proportion of actual negative cases which are correctly identified.
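In terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), these definitions become:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{NPV} = \frac{TN}{TN + FN}$$

$$\text{Recall (Sensitivity)} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}$$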

2. F1 Score:

This is the harmonic mean of Precision and Recall and gives a better measure of the incorrectly classified cases than the Accuracy Metric.

We use the Harmonic Mean since it penalizes the extreme values.
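Written as a formula:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$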

To summarize the differences between the F1-score and the accuracy,

· Accuracy is used when the True Positives and True negatives are more important while F1-score is used when the False Negatives and False Positives are crucial.

· Accuracy can be used when the class distribution is similar, while F1-score is a better metric when the classes are imbalanced.

· In most real-life classification problems, imbalanced class distribution exists and thus F1-score is a better metric to evaluate our model on.

3. Gain and Lift charts:

Gain and Lift charts are mainly concerned with checking the rank ordering of the predicted probabilities.

Here are the steps to build a Lift/Gain chart:

Step 1: Calculate the predicted probability for each observation.

Step 2: Rank these probabilities in decreasing order.

Step 3: Build deciles, with each group holding roughly 10% of the observations.

Step 4: Calculate the response rate at each decile for Good (responders), Bad (non-responders) and total.

Lift / Gain charts are widely used in campaign targeting problems. They tell us up to which decile we can target customers for a specific campaign, and how much response to expect from the new target base.
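A minimal pandas sketch of the steps above; `y_prob` and `y_true` are placeholder arrays of predicted probabilities and actual 0/1 responses.

```python
# Decile-wise gain: rank observations by predicted probability, cut into
# deciles, and track the cumulative share of responders captured.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
y_prob = rng.uniform(size=1000)                          # placeholder scores
y_true = (rng.uniform(size=1000) < y_prob).astype(int)   # placeholder labels

df = pd.DataFrame({"prob": y_prob, "actual": y_true})
df["decile"] = pd.qcut(df["prob"], 10, labels=False)     # 0 = lowest scores, 9 = highest

gain = (
    df.groupby("decile")["actual"].sum()
      .sort_index(ascending=False)     # start from the top-scoring decile
      .cumsum() / df["actual"].sum()
)
print(gain)   # cumulative fraction of responders captured per decile
```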

4. Kolmogorov-Smirnov chart:

K-S or Kolmogorov-Smirnov chart measures performance of classification models. More accurately, K-S is a measure of the degree of separation between the positive and negative distributions. The K-S is 100, if the scores partition the population into two separate groups in which one group contains all the positives and the other all the negatives.

On the other hand, if the model cannot differentiate between positives and negatives, it is as if the model selects cases randomly from the population, and the K-S would be 0. In most classification models the K-S falls between 0 and 100, and the higher the value, the better the model is at separating the positive from the negative cases.
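A minimal sketch using SciPy's two-sample K-S test on the model scores of actual positives versus actual negatives; the score lists are placeholders.

```python
# K-S statistic: maximum separation between the cumulative score
# distributions of positives and negatives (times 100 for a 0-100 scale).
from scipy.stats import ks_2samp

pos_scores = [0.9, 0.8, 0.75, 0.6, 0.55]   # placeholder scores for actual positives
neg_scores = [0.4, 0.35, 0.3, 0.2, 0.1]    # placeholder scores for actual negatives

result = ks_2samp(pos_scores, neg_scores)
print(result.statistic * 100)   # 100.0 here, since the two groups do not overlap
```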

5. Area Under the ROC curve (AUC — ROC):

The biggest advantage of using ROC curve is that it is independent of the change in proportion of responders.

Let’s understand what the ROC (Receiver Operating Characteristic) curve is. For a probabilistic model, each choice of classification threshold produces a different confusion matrix, and hence different values for each of the metrics above.

The ROC curve is the plot of sensitivity against (1 - specificity). (1 - specificity) is also known as the false positive rate, and sensitivity is also known as the true positive rate. The quantities needed to draw the curve can be computed directly, as sketched below.
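A minimal scikit-learn sketch with placeholder labels and scores:

```python
# ROC: true positive rate vs. false positive rate across all thresholds.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 0, 1, 1, 0, 1]                    # placeholder labels
y_prob = [0.1, 0.3, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9]   # placeholder scores

fpr, tpr, thresholds = roc_curve(y_true, y_prob)     # fpr = 1 - specificity
print(roc_auc_score(y_true, y_prob))                 # area under that curve
```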

6. Log Loss:

AUC-ROC considers the predicted probabilities when determining our model’s performance. However, it only considers the order of the probabilities, not the model’s ability to assign higher probabilities to samples that are more likely to be positive. In that case, we can use log loss, which is the negative average of the log of the corrected predicted probabilities for each instance.
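For N instances with true label $y_i$ and predicted probability $p_i$:

$$\text{Log Loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]$$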

7. Gini Coefficient:

The Gini coefficient is sometimes used in classification problems. It can be derived directly from the AUC-ROC number: Gini is the ratio of the area between the ROC curve and the diagonal line to the area of the upper triangle. The formula used is:

Gini = 2*AUC - 1

8. Concordant-Discordant Ratio:

It is one of the most important metrics for any classification prediction problem. A concordant ratio of more than 60% indicates a good model. This metric is generally not used when deciding how many customers to target; it is mainly used to assess the model’s predictive power.

9. Root Mean Squared Error (RMSE):

It is the most popular evaluation metric used in regression problems. It assumes that errors are unbiased and follow a normal distribution.

There are several key points to consider on RMSE:

· Taking the square root allows this metric to express large deviations on the original scale of the data.

· The squared nature of this metric prevents positive and negative error values from cancelling each other out.

· It avoids the use of absolute error values, which are often undesirable in mathematical calculations.

· RMSE gives higher weight to, and therefore punishes, large errors compared to mean absolute error.

RMSE can be calculated as below:
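$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$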

where N is the total number of observations, $y_i$ the actual value and $\hat{y}_i$ the predicted value.

10. Root Mean Squared Logarithmic Error (RMSLE):

To calculate this, we take the log of the predictions and actual values. RMSLE is usually used when we don’t want to penalize huge differences in the predicted and the actual values when both predicted and true values are huge numbers.
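It is computed like RMSE, but on log-transformed values:

$$RMSLE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\log(\hat{y}_i + 1) - \log(y_i + 1)\right)^2}$$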

· If both predicted and actual values are small, then RMSE and RMSLE are the same.

· If either predicted or the actual value is big: RMSE > RMSLE.

· If both predicted and actual values are big: RMSE > RMSLE.

11. R-Squared:

In a classification problem, if the model has an accuracy of 0.8, we can gauge how good it is against a random model, which has an accuracy of 0.5; the random model serves as a benchmark. With RMSE alone, we don’t have such a benchmark to compare against, and R-squared provides one by comparing the model against a mean-only baseline.

The R-squared can be calculated as below:
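$$R^2 = 1 - \frac{MSE(\text{model})}{MSE(\text{baseline})}$$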

MSE (model): Mean Squared Error of the predictions against the actual values.

MSE (baseline): Mean Squared Error of mean prediction against the actual values.

References: Analyticsvidhya, Upgrad, Statisticssolutions.

Code is like humor. When you have to explain it, it’s bad. Connect with me on LinkedIn : https://www.linkedin.com/in/imswapnilb