Logistic Regression

Published in The Startup, Sep 10, 2020

By Neeta Ganamukhi

Department of Business and Economics

Abstract: This term paper describes the logistic regression algorithm, a supervised model used for classification. It covers logistic regression for machine learning, the types of logistic regression, and the hypothesis; multinomial and ordinal logistic regression are not covered in detail. The paper also covers the sigmoid function, decision boundary, data preparation, cost function, gradient descent, the difference between linear and logistic regression, and the pros and cons of logistic regression.

Keywords: Logistic Regression, Sigmoid function, Cost function, Gradient descent

Introduction

Every machine learning algorithm performs best under a given set of conditions. To ensure good performance, we must know which algorithm to use for the problem at hand; no single algorithm suits all problems. For example, the linear regression algorithm cannot be applied to a categorical dependent variable. This is where logistic regression comes in. Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable. It extends the idea of linear regression to situations where the outcome variable is categorical. In the simplest case, the dependent variable is binary, with data coded as either 1 (success/yes) or 0 (failure/no).

I. WHAT IS LOGISTIC REGRESSION IN MACHINE LEARNING?

Logistic regression is the appropriate alternative to linear regression analysis when the dependent variable is binary. Mathematically, a logistic regression model predicts P(Y=1) as a function of X. Instead of using Y as the outcome variable (as in linear regression), we use a function of Y called the logit, a.k.a. the log odds: logit = log(P(positive)/P(negative)). The logit can be modeled as a linear function of the predictors. It can also be mapped back to a probability, which, in turn, can be mapped to a class. Logistic regression is one of the simplest machine learning algorithms and can be used for various classification problems such as spam detection, diabetes prediction, cancer detection, classifying online transactions as fraud or not fraud, and classifying tumors as malignant or benign.

II. TYPES OF LOGISTIC REGRESSION

Generally, logistic regression means binary logistic regression, which has a binary target variable, but there are two more categories of target variable that it can predict. Based on the number of categories, logistic regression can be divided into the following types:

A. Binary or Binomial

B. Multinomial

C. Ordinal

A. Binary or Binomial Regression: In this kind of classification, the dependent variable has only two possible outcomes, either 1 or 0. For example, these outcomes may represent success or failure, yes or no, win or loss, etc. For detailed information, see reference[1]

B. Multinomial Regression: In this kind of classification, the dependent variable can have 3 or more possible unordered outcomes, i.e. outcomes having no quantitative significance. For example, these outcomes may represent “Type A”, “Type B” or “Type C”. For detailed information, see reference[2]

C. Ordinal Regression: In this kind of classification, the dependent variable can have 3 or more possible ordered outcomes, i.e. outcomes having a quantitative significance. For example, these outcomes may represent “poor”, “good”, “very good” and “excellent”, and each category can have a score such as 0, 1, 2, 3. For detailed information, see reference[3]

III. LOGISTIC FUNCTION

Logistic regression is named for the function used at the core of the method, the logistic function. The logistic function, also called the sigmoid function, was developed by statisticians to describe properties of population growth in ecology, rising quickly and maxing out at the carrying capacity of the environment. It is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.

IV. WHAT IS SIGMOID FUNCTION?

In order to map predicted values to probabilities, we use the sigmoid function. The function maps any real value into another value between 0 and 1. The sigmoid curve can be represented with the help of the following graph. The values on the y-axis lie between 0 and 1, and the curve crosses the y-axis at 0.5.

The sigmoid function is given by the following equation: $g(z) = \frac{1}{1 + e^{-z}}$. The classes can be divided into positive and negative. The output, which lies between 0 and 1, is interpreted as the probability of the positive class. For detailed information, see reference[4]. Now, let us understand the application of the sigmoid function in non-linear classification.
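The sigmoid mapping described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `sigmoid` and the two-branch form (chosen so that `math.exp` does not overflow for large negative inputs) are choices made for this example, not something prescribed by the references.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic (sigmoid) function 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the equivalent form e^z / (1 + e^z),
    # so that exp() is never called with a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Any real input is squashed into the range between 0 and 1.
for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(z, round(sigmoid(z), 4))
```

Note that in floating-point arithmetic the result can round to exactly 0.0 or 1.0 for extreme inputs, even though the mathematical function never reaches those limits.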

V. LOGISTIC REGRESSION HYPOTHESES

When using linear regression, the hypothesis takes the form

$h_\theta(x) = \beta_0 + \beta_1 x$

For logistic regression, the equation is modified slightly by passing the linear term through the sigmoid function:

$\sigma(z) = \sigma(\beta_0 + \beta_1 x)$

We expect the hypothesis to give values between 0 and 1.

Thus, the hypothesis for logistic regression can be represented as

$h_\theta(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$

The simplest form of logistic regression is binary or binomial logistic regression, in which the target or dependent variable can have only 2 possible outcomes, either 1 or 0. It allows us to model a relationship between multiple predictor variables and a binary/binomial target variable. Hence, with θ denoting the vector of coefficients and x the vector of predictors (with a leading 1 for the intercept), the above equation can be represented as

$h_\theta(x) = g(\theta^T x)$

Here, g is the logistic or sigmoid function, which is given by:

$g(z) = \frac{1}{1 + e^{-z}}$

Logistic regression can be seen as a linear model, but instead of using the linear function directly as its output, it passes that function through the sigmoid function. The sigmoid limits the hypothesis of logistic regression to values between 0 and 1. A linear function alone cannot represent the hypothesis, because it can take values greater than 1 or less than 0, which is not possible for a probability. In logistic regression, the linear function is therefore used as the input to another function, 𝑔 in the above equation. For more detailed information, see reference[5]
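To make the contrast concrete, the sketch below compares the raw linear score with the same score passed through the sigmoid g. The coefficient values in `theta` are hypothetical, chosen only for illustration, and the helper names `linear_score` and `hypothesis` are not taken from any library.

```python
import math

def sigmoid(z):
    # Two-branch form so exp() never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def linear_score(theta, x):
    """theta^T x, where theta[0] is the intercept and x holds the predictors."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))

def hypothesis(theta, x):
    """Logistic regression hypothesis: the linear score passed through g."""
    return sigmoid(linear_score(theta, x))

theta = [-4.0, 1.5]  # hypothetical intercept and slope
for x in ([0.0], [2.0], [6.0]):
    print(x, linear_score(theta, x), round(hypothesis(theta, x), 3))
```

For the first and last inputs the linear score falls outside the interval from 0 to 1, while the hypothesis stays inside it, which is why the linear score is fed into g rather than used directly.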

VI. LOGISTIC REGRESSION PREDICTS PROBABILITIES

The idea behind logistic regression is straightforward: instead of using Y directly as the outcome variable, we use a function of it, which is called the logit. The logit, it turns out, can be modeled as a linear function of the predictors. Once the logit has been predicted, it can be mapped back to a probability.

To understand the logit, we first look at p = P(Y = 1), the probability of belonging to class 1 (as opposed to class 0). In contrast to the binary variable Y, which only takes the values 0 and 1, p can take any value in the interval [0, 1]. However, if we express p as a linear function of the q predictors in the form

$p = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q$

it is not guaranteed that the right-hand side will lead to values within the interval [0, 1]. The solution is to use a nonlinear function of the predictors in the form

$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q)}}$

This is called the logistic response function. For any values $x_1, \ldots, x_q$, the right-hand side will always lead to values in the interval [0, 1]. Next, we look at a different measure of belonging to a certain class, known as odds. The odds of Y belonging to class 1 are defined as the ratio of the probability of belonging to class 1 to the probability of belonging to class 0:

$\mathrm{Odds}(Y = 1) = \frac{p}{1 - p}$

This metric is very popular in horse races, sports, gambling, epidemiology, and other areas. Instead of talking about the probability of winning or contracting a disease, people talk about the odds of winning or contracting a disease. If, for example, the probability of winning is 0.5, the odds of winning are 0.5/0.5 = 1. We can also perform the reverse calculation: given the odds of an event, we can compute its probability by rearranging the definition of the odds:

$p = \frac{\mathrm{odds}}{1 + \mathrm{odds}}$

From the above equations, we can write the relationship between the odds and the predictors as:

$\mathrm{Odds}(Y = 1) = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q}$

The equation above describes a multiplicative (proportional) relationship between the predictors and the odds. Such a relationship is interpretable in terms of percentages: a unit increase in predictor Xj multiplies the odds by e^βj, that is, it changes them by (e^βj - 1)×100%, which is approximately βj×100% when βj is small (holding all other predictors constant).

Now, if we take the natural logarithm of both sides, we get the standard formulation of a logistic model:

$\log(\mathrm{odds}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q$

The log(odds), called the logit, takes values from -∞ (very low odds) to ∞ (very high odds). A logit of 0 corresponds to even odds of 1 (probability =0.5). Thus, the final formulation of the relation between the outcome and the predictors uses the logit as the outcome variable and models it as a linear function of the q predictors. For more detailed information, see reference[6]
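The conversions between probability, odds and logit discussed in this section can be sketched as small helper functions. The function names are illustrative only. The last helper reflects the fact that a unit increase in a predictor with coefficient β multiplies the odds by e^β, which is close to a β×100% change only when β is small.

```python
import math

def odds_from_prob(p):
    """Odds of class 1: p / (1 - p). Requires 0 <= p < 1."""
    return p / (1.0 - p)

def prob_from_odds(odds):
    """Reverse calculation: p = odds / (1 + odds)."""
    return odds / (1.0 + odds)

def logit(p):
    """Log odds. Requires 0 < p < 1."""
    return math.log(odds_from_prob(p))

def prob_from_logit(value):
    """Map a logit back to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-value))

def percent_change_in_odds(beta):
    """Percentage change in the odds for a unit increase in one predictor."""
    return (math.exp(beta) - 1.0) * 100.0

print(odds_from_prob(0.5))   # probability 0.5 gives even odds
print(logit(0.5))            # even odds give a logit of 0
print(prob_from_logit(logit(0.8)))
print(percent_change_in_odds(0.1), percent_change_in_odds(1.0))
```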

VII. DECISION BOUNDARY

When we pass the inputs through a prediction function that returns a probability score between 0 and 1, we expect our classifier to give us a set of outputs or classes based on that probability. For example, suppose we have 2 classes, male and female (1 = male, 0 = female). We choose a threshold value: if the predicted probability is above the threshold, we classify the observation into Class 1, and if it is below the threshold, we classify it into Class 0.

As shown in the above graph, we can choose the threshold as 0.5. If the prediction function returned a value of 0.7, we would classify this observation as Class 1 (male). If it returned a value of 0.2, we would classify the observation as Class 0 (female). For more detailed information, see reference[7]
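The thresholding step can be sketched as follows. The function name `classify` is hypothetical, and assigning a probability exactly equal to the threshold to Class 1 is a convention assumed here; the text does not specify how ties are handled.

```python
def classify(probability: float, threshold: float = 0.5) -> int:
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if probability >= threshold else 0

print(classify(0.7))                 # above the 0.5 threshold
print(classify(0.2))                 # below the 0.5 threshold
print(classify(0.7, threshold=0.8))  # a stricter threshold changes the class
```

Raising or lowering the threshold trades one kind of error for the other, so the value 0.5 is a common default rather than a requirement.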

Now that we know how to make predictions using logistic regression, let’s look at how we can prepare our data to get the most from the technique.

VIII. ASSUMPTIONS FOR LOGISTIC REGRESSION

1) Logistic regression assumes that there is minimal or no multicollinearity among the independent variables.

2) Logistic regression assumes that the independent variables are linearly related to the log odds.

3) Logistic regression usually requires a large sample size to predict properly.

4) Binary logistic regression assumes that the dependent variable is binary, and ordinal logistic regression requires the dependent variable to be ordinal.

5) Logistic regression assumes that the observations are independent of each other.
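As a rough illustration of the first assumption, one simple screen for multicollinearity is to look at pairwise Pearson correlations between predictors. The sketch below is only a first check under stated assumptions: the feature values are made up, the 0.9 cut-off is an arbitrary choice, and pairwise correlation will not reveal collinearity that involves three or more variables at once.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def highly_correlated_pairs(features, limit=0.9):
    """List feature-name pairs whose absolute correlation exceeds the limit."""
    names = list(features)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if abs(pearson(features[a], features[b])) > limit]

# Made-up data: x2 is almost exactly twice x1, x3 is unrelated.
features = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.9],
    "x3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(highly_correlated_pairs(features))
```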

IX. COST FUNCTION

Once the model is developed, the question arises: how good is our model? In machine learning, cost functions are used to estimate model performance. In other words, a cost function is a measure of how good or bad the model is at estimating the relationship between X and Y. This is typically expressed as a difference or distance between the predicted value and the actual value.

The cost function J(θ) used in linear regression will not work for logistic regression. If we tried to use the linear regression cost function in logistic regression, it would end up being a non-convex function with many local minima, in which it would be very difficult to minimize the cost and find the global minimum. This happens because the logistic regression hypothesis contains the sigmoid function, which is non-linear (i.e. not a line).

With the J(θ) depicted in the figure above, the gradient descent algorithm might get stuck in a local minimum. That is why we need a neat convex function, just as in linear regression: a bowl-shaped function that makes it easy for gradient descent to converge to the optimal minimum point. For more detailed information, see reference[8].
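The non-convexity claim can be checked numerically on a very small case. The sketch below assumes a single made-up training example (x = 1, y = 0) and one parameter θ, and tests whether the cost at the midpoint of two parameter values lies above the straight line joining them, which a convex function never allows. It demonstrates non-convexity of the squared error with a sigmoid hypothesis for this one example; it does not by itself show how many local minima a full data set would produce.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def squared_error(theta, x=1.0, y=0.0):
    """Linear-regression style cost applied to the sigmoid hypothesis."""
    return 0.5 * (sigmoid(theta * x) - y) ** 2

def log_loss(theta, x=1.0, y=0.0):
    """Logistic regression cost for the same single example."""
    h = sigmoid(theta * x)
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

def violates_convexity(f, a, b):
    """True if f at the midpoint lies above the chord between a and b."""
    return f((a + b) / 2) > (f(a) + f(b)) / 2

print(violates_convexity(squared_error, 0.0, 4.0))  # True
print(violates_convexity(log_loss, 0.0, 4.0))       # False
```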

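The cost function used in linear regression is given by

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

where m is the number of training examples. This can be written in a slightly different way as

$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

Now let’s make it more general by defining a new function:

$\mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)}) = \frac{1}{2} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

We can rewrite the cost function for linear regression as follows:

$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)})$

For logistic regression, the Cost function is defined as:

$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$

We can make it more compact as a one-line expression, which helps avoid if/else statements when converting the formula into an algorithm:

$\mathrm{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$

Replace y with 0 and 1 and you will end up with the two parts of the original function. With the optimization in place, the logistic regression cost function can be rewritten as:

$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log(h_\theta(x^{(i)})) - (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$

The above equation can be compressed into one cost function given by

$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$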

For more detailed information, see reference[8]
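The one-line form of the logistic regression cost translates almost directly into code. The sketch below is a minimal version assuming the predictions are already probabilities; the clipping constant `eps` is an addition for numerical safety (it keeps the logarithm away from 0) and is not part of the formula itself.

```python
import math

def example_cost(h, y, eps=1e-15):
    """-y*log(h) - (1-y)*log(1-h) for one prediction h and label y (0 or 1)."""
    h = min(max(h, eps), 1.0 - eps)  # keep log() away from 0
    return -y * math.log(h) - (1 - y) * math.log(1.0 - h)

def cost(predictions, labels):
    """Average cost J over all training examples."""
    m = len(labels)
    return sum(example_cost(h, y) for h, y in zip(predictions, labels)) / m

# A confident correct prediction costs little, a confident wrong one costs a lot.
print(example_cost(0.9, 1), example_cost(0.1, 1))
print(cost([0.9, 0.2, 0.7], [1, 0, 1]))
```

Because y is either 0 or 1, exactly one of the two terms is active for each example, which is how the compact expression avoids an explicit if/else.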

X. GRADIENT DESCENT

After finding the cost function for logistic regression, our job is to minimize it, i.e. min J(θ). Gradient descent is an optimization algorithm that helps machine learning models find a path to a minimum value using repeated steps. It is used to minimize a function, here the cost function (also called the loss function), which shows how much error the machine learning model produces compared to the actual results. Our aim is to lower the cost function as much as possible. Because the complexity of some equations makes them difficult to minimize directly, gradient descent instead uses the partial derivative of the cost function with respect to each parameter to move step by step toward the optimal value. The general form of gradient descent is:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$

where α is the learning rate.

To minimize J(θ), repeat the above update until convergence, simultaneously updating all θj. For more detailed information, see reference[9].

In simple words, gradient descent has an analogy: imagine being stranded and blindfolded at the top of a mountain, with the objective of reaching the bottom of the valley. Feeling the slope of the terrain around you is what anyone would do. This action is analogous to calculating the gradient, and taking a step is analogous to one iteration of the update to the parameters.
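A minimal sketch of batch gradient descent for logistic regression is given below. It relies on the standard result that the partial derivative of the log-loss cost with respect to θj is the average of (h(x) - y)·xj over the training examples. The data set, the learning rate `alpha` and the iteration count are made-up values for illustration; a fixed number of iterations stands in for a proper convergence test.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(theta, x):
    """Hypothesis g(theta^T x); theta[0] is the intercept."""
    return sigmoid(theta[0] + sum(t * xi for t, xi in zip(theta[1:], x)))

def cost(theta, X, y, eps=1e-15):
    total = 0.0
    for xi, yi in zip(X, y):
        h = min(max(predict(theta, xi), eps), 1.0 - eps)
        total += -yi * math.log(h) - (1 - yi) * math.log(1.0 - h)
    return total / len(y)

def gradient_descent(X, y, alpha=0.3, iterations=5000):
    m, n = len(X), len(X[0])
    theta = [0.0] * (n + 1)
    for _ in range(iterations):
        errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
        # Gradient for the intercept, then for each predictor.
        grad = [sum(errors) / m]
        grad += [sum(e * xi[j] for e, xi in zip(errors, X)) / m for j in range(n)]
        # Simultaneous update of all parameters.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Made-up data: hours studied vs. pass (1) / fail (0), with some overlap.
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

theta = gradient_descent(X, y)
print(theta, cost(theta, X, y))
```

Building the whole gradient from the old `theta` before reassigning it is what makes the update simultaneous, as the text requires.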

XI. LOGISTIC vs LINEAR REGRESSION

1) In logistic regression, the outcome (dependent variable) has only a limited number of possible values, whereas in linear regression the outcome is continuous.

2) Logistic regression is used when the response variable is categorical in nature, whereas linear regression is used when the response variable is continuous.

3) Logistic regression gives an equation of the form Y = 1/(1 + e^-z), whereas linear regression gives an equation of the form Y = mx + C.

4) Logistic regression uses the maximum likelihood method[20] to arrive at the solution, with a cost function (log loss) that penalizes confident wrong predictions heavily, whereas linear regression uses the ordinary least squares method to minimize the errors and arrive at the best possible fit. For more information, see reference[10]

5) Graphical representation of Logistic vs Linear Regression:

XII. ROC CURVE

The Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the performance of a binary classifier as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity, or recall in machine learning. The false positive rate is also known as the fall-out and can be calculated as (1 - specificity). The ROC curve is thus the sensitivity as a function of fall-out. For more detailed information, see reference[11]
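The points of an ROC curve can be computed by hand for a handful of thresholds, as in the sketch below. The scores and labels are made up, the function name `roc_points` is illustrative, and the code assumes both classes are present (otherwise one of the rates would divide by zero).

```python
def roc_points(scores, labels, thresholds):
    """Return (threshold, FPR, TPR) for each threshold.

    An observation is predicted positive when its score >= threshold.
    """
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        tpr = tp / (tp + fn)  # sensitivity / recall
        fpr = fp / (fp + tn)  # fall-out = 1 - specificity
        points.append((t, fpr, tpr))
    return points

scores = [0.1, 0.4, 0.35, 0.8]  # made-up predicted probabilities
labels = [0, 0, 1, 1]

for point in roc_points(scores, labels, [0.0, 0.5, 1.01]):
    print(point)
```

A threshold of 0 labels everything positive (top-right corner of the plot), and a threshold above every score labels everything negative (bottom-left corner); the curve is traced out between those two extremes.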

XIII. APPLICATIONS

Logistic regression can be used in real-world applications such as:

1) Marketing: Logistic regression can be used to predict whether a subsidiary of a company will make a profit, make a loss or just break even, depending on the characteristics of the subsidiary's operations.

2) Human Resources: The HR manager of a company can predict the absenteeism pattern of employees based on their individual characteristics using logistic regression.

3) Finance: A bank can use logistic regression to predict whether its customers will default, based on their previous transactions and history.

For more information, see reference[14]

4) Science: Logistic Regression algorithm can be used to predict earthquakes. For more information, see reference[13]

XIV. ADVANTAGES

1) The logistic regression model not only acts as a classification model, but also gives you probabilities. This is a big advantage over other models where they can only provide the final classification. Knowing that an instance has a 99% probability for a class compared to 51% makes a big difference.

2) Logistic Regression not only gives a measure of how relevant a predictor (coefficient size) is, but also its direction of association (positive or negative). We see that Logistic regression is easier to implement, interpret and very efficient to train.

3) Logistic Regression proves to be very efficient when the dataset has features that are linearly separable.

4) In a low-dimensional dataset with enough training examples, logistic regression is less prone to overfitting.

For more detailed information, see reference[12]

XV. DISADVANTAGES

1) Logistic regression can suffer from complete separation. If there is a feature that would perfectly separate the two classes, the logistic regression model can no longer be trained. This is because the weight for that feature would not converge: the optimal weight would be infinite. This is unfortunate, because such a feature is very useful.

2) Logistic regression is less prone to overfitting but it can overfit in high dimensional datasets.

3) It is difficult to capture complex relationships using logistic regression. More powerful and complex algorithms such as Neural Networks can easily outperform this algorithm.

For more detailed information, see reference[12]
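The complete separation problem from the first point can be observed on a tiny example. The sketch below assumes a one-feature model without an intercept, p = g(w·x), fitted by plain gradient descent on made-up data that is perfectly separated at x = 0. The learning rate and iteration counts are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_weight(xs, ys, iterations, alpha=0.5):
    """Gradient descent on the log-loss for the model p = sigmoid(w * x)."""
    w = 0.0
    for _ in range(iterations):
        grad = sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= alpha * grad
    return w

# Made-up data: every negative x is class 0, every positive x is class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

for n in (100, 1000, 10000):
    print(n, round(fit_weight(xs, ys, n), 2))
```

The fitted weight keeps increasing as more iterations are run instead of settling on a value, because a larger weight always fits perfectly separated data slightly better. In practice this is commonly handled by adding a penalty on the size of the weights (regularization).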

XVI. CONCLUSION

Logistic regression is a widely used supervised machine learning technique. It is one of the best tools used by statisticians, researchers and data scientists in predictive analytics. The assumptions for logistic regression are mostly similar to those of multiple regression, except that the dependent variable should be discrete. Logistic regression provides a useful means for modelling the dependence of a binary response variable on one or more explanatory variables, where the latter can be either categorical or continuous. The fit of the resulting model can be assessed using several methods.

REFERENCES

[1] S. Date, “The Binomial Regression Model: Everything You Need to Know,” Medium, 10-Mar-2020. [Online]. Available: https://towardsdatascience.com/the-binomial-regression-model-everything-you-need-to-know-5216f1a483d3

[2] “Multinomial Logistic Regression: Definition and Examples-Statistics…”. [Online]. Available: https://www.statisticshowto.com/multinomial-logistic-regression/

[3] “Ordinal Logistic Regression and its Assumptions - Full Analysis …”. [Online]. Available: https://medium.com/evangelinelee/ordinal-logistic-regression-on-world-happiness-report-221372709095

[4] “Sigmoid Function - an overview | ScienceDirect Topics”. [Online]. Available: https://www.sciencedirect.com/topics/computer-science/sigmoid-function

[5] “Understanding Logistic Regression - GeeksforGeeks”. [Online]. Available: https://www.geeksforgeeks.org/understanding-logistic-regression/

[6] “Logistic Regression for Machine Learning”. [Online]. Available: https://machinelearningmastery.com/logistic-regression-for-machine-learning/

[7] “Logistic Regression and Decision Boundary - Towards Data Science”. [Online]. Available: https://towardsdatascience.com/logistic-regression-and-decision-boundary-eab6e00c1e8

[8] “The cost function in logistic regression - Internal Pointers”. [Online]. Available: https://www.internalpointers.com/post/cost-function-logistic-regression

[9] “Gradient Descent Training With Logistic Regression -Best Machine …”. [Online]. Available: https://bestofml.com/gradient-descent-training-with-logistic-regression/

[10] “Difference between Linear Regression and Logistic Regression | Pico”. [Online]. Available: https://www.pico.net/kb/difference-between-linear-regression-and-logistic-regression

[11] “How to Use ROC Curves and Precision-Recall Curves for …”. [Online]. Available: https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/

[12] “Advantages and Disadvantages of Logistic Regression”. [Online]. Available: https://iq.opengenus.org/advantages-and-disadvantages-of-logistic-regression/

[13] Abrahamson, N. A. & R. R. Youngs (1992). A stable algorithm for regression analysis using the random effects model. Bulletin of the Seismological Society of America 82(1), 505–510.

[14] “Introduction to Logistic Regression | Analytics Insight”. [Online]. Available: https://www.analyticsinsight.net/introduction-to-logistic-regression/

[15] Murphy, K. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.

[16] ISLR: James, G., Witten, D., Hastie, T., and Tibshirani, R. An Introduction to Statistical Learning: with Applications in R, 7th edition. Springer, 2013.

[17] DL-Book: Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. The MIT Press, 2016.

[18] Kuhn, M., Johnson, K. Applied Predictive Modeling. 2nd edition. 2018

[19] Raschka, S., Mirjalili, V. Python Machine Learning, Second Edition: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow. 2nd edition. September 20, 2017

[20] “A Gentle Introduction to Logistic Regression With Maximum …”. [Online]. Available: https://analyticsweek.com/content/a-gentle-introduction-to-logistic-regression-with-maximum-likelihood-estimation/
