Continuing this series on Data Science and Machine Learning: in the third installment, I briefly covered an introduction to Machine Learning, Regression and, more specifically, Linear Regression. In this fourth installment, I shall cover Logistic Regression.

**Brief on Regression analysis:**

**Regression analysis** is a form of predictive modelling technique which investigates the relationship between a **dependent variable** (*target*) and **independent variable(s)** (*predictors*). This technique is used for forecasting, time series modelling and finding the causal effect relationship between variables. For example, the relationship between rash driving and the number of road accidents by a driver is best studied through regression.

*A statistical analysis, properly conducted, is a delicate dissection of uncertainties, a surgery of suppositions. — M.J. Moroney*

Regression analysis is an important tool for modelling and analyzing data. Here, we fit a curve / line to the data points in such a manner that the sum of the distances of the data points from the curve or line is minimized.

Benefits of using regression analysis:

1. It indicates the **significant relationships** between dependent variable and independent variable.

2. It indicates the **strength of impact** of multiple independent variables on a dependent variable.

Regression analysis also allows us to compare the effects of variables measured on different scales, such as the effect of price changes and the number of promotional activities. These benefits help market researchers / data analysts / data scientists to evaluate and select the best set of variables for building predictive models.

There are various kinds of regression techniques available to make predictions. These techniques are mostly driven by three metrics (number of independent variables, type of dependent variables and shape of regression line).

The most commonly used regressions:

· Linear Regression

· Logistic Regression

· Polynomial Regression

· Ridge Regression

· Lasso Regression

· ElasticNet Regression

**Introduction to Logistic Regression:**

Every machine learning algorithm works best under a given set of conditions; making sure your algorithm fits the assumptions / requirements ensures superior performance. You can't use just any algorithm in any condition. For example, we can't use linear regression on a categorical dependent variable: we would end up with extremely low values of adjusted R² and the F statistic. Instead, in such situations, we should try algorithms such as Logistic Regression, Decision Trees, Support Vector Machines (SVM), Random Forests, etc.

**Logistic Regression** is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) as a function of X.

Logistic Regression is one of the most popular ways to fit models for categorical data, especially for binary response data in Data Modeling. It is the most important (and probably most used) member of a class of models called generalized linear models. Unlike linear regression, logistic regression can directly predict probabilities (values that are restricted to the (0,1) interval); furthermore, those probabilities are well-calibrated when compared to the probabilities predicted by some other classifiers, such as Naive Bayes. Logistic regression preserves the marginal probabilities of the training data. The coefficients of the model also provide some hint of the relative importance of each input variable.

Logistic Regression is used when the dependent variable (target) is categorical.

For example,

- To predict whether an email is spam (1) or not spam (0)
- Whether the tumor is malignant (1) or not (0)

Consider a scenario where we need to classify whether a tumor is malignant or not. If we use linear regression for this problem, we need to set a threshold on which classification can be done. Say the actual class is malignant, the predicted continuous value is 0.4 and the threshold is 0.5: the data point will then be classified as not malignant, which can lead to serious consequences in real time.

From this example, it can be inferred that linear regression is not suitable for classification problems. Linear regression is unbounded, and this brings logistic regression into the picture: its predicted values strictly range from 0 to 1.

Logistic regression is generally used where the dependent variable is binary or dichotomous. That means the dependent variable can take only two possible values such as “Yes or No”, “Default or No Default”, “Living or Dead”, “Responder or Non Responder”, etc. The independent factors or variables can be categorical or numerical.
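As a minimal sketch of fitting such a binary model in Python with scikit-learn (the data below is synthetic and purely illustrative):

```python
# Minimal sketch: binary logistic regression on synthetic data.
# The predictors and target here are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                  # two numeric predictors
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # binary target coded 0/1

model = LogisticRegression().fit(X, y)
probs = model.predict_proba(X)[:, 1]           # P(Y=1 | X), bounded in (0, 1)
print(probs.min() > 0, probs.max() < 1)
```

Note that the model outputs probabilities rather than raw continuous values, which is exactly what the unbounded linear regression above could not guarantee.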

**Logistic Regression Assumptions:**

· Binary logistic regression requires the dependent variable to be binary.

· For a binary regression, the factor level 1 of the dependent variable should represent the desired outcome.

· Only the meaningful variables should be included.

· The independent variables should be independent of each other. That is, the model should have little or no multi-collinearity.

· The independent variables are linearly related to the log odds.

· Logistic regression requires quite large sample sizes.

Even though logistic (**logit**) regression is frequently used for binary variables (2 classes), it can be used for categorical dependent variables with more than 2 classes. In this case it’s called Multinomial Logistic Regression.

**Types of Logistic Regression:**

1. **Binary Logistic Regression**: The categorical response has only two possible outcomes. E.g.: Spam or Not

2. **Multinomial Logistic Regression:** Three or more categories without ordering. E.g.: Predicting which type of food is preferred (Veg, Non-Veg, Vegan)

3. **Ordinal Logistic Regression:** Three or more categories with ordering. E.g.: Movie rating from 1 to 5
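A brief, hedged sketch of the multinomial case with scikit-learn (the three-class data below is synthetic, generated only to illustrate the API):

```python
# Multinomial logistic regression sketch on synthetic three-class data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_classes=3, n_informative=4,
                           random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# predict_proba returns one probability per class; they sum to 1
print(clf.predict_proba(X[:1]).round(3))
```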

**Applications of Logistic Regression:**

Logistic regression is used in various fields, including machine learning, most medical fields, and the social sciences. For example, the Trauma and Injury Severity Score (TRISS), which is widely used to predict mortality in injured patients, was developed using logistic regression. Many other medical scales used to assess the severity of a patient have been developed using logistic regression. Logistic regression may be used to predict the risk of developing a given disease (e.g. diabetes, coronary heart disease), based on observed characteristics of the patient (age, sex, body mass index, results of various blood tests, etc.).

Another example might be to predict whether an Indian voter will vote BJP or TMC or Left Front or Congress, based on age, income, sex, race, state of residence, votes in previous elections, etc. The technique can also be used in engineering, especially for predicting the probability of failure of a given process, system or product.

It is also used in marketing applications such as prediction of a customer’s propensity to purchase a product or halt a subscription, etc. In economics it can be used to predict the likelihood of a person’s choosing to be in the labor force, and a business application would be to predict the likelihood of a homeowner defaulting on a mortgage. Conditional random fields, an extension of logistic regression to sequential data, are used in natural language processing.

Logistic Regression is used for prediction of output which is binary. For e.g., if a credit card company is going to build a model to decide whether to issue a credit card to a customer or not, it will model for whether the customer is going to “Default” or “Not Default” on this credit card. This is called “Default Propensity Modeling” in banking terms.

Similarly, an e-commerce company that is sending out costly advertisement / promotional offer mails to customers will want to know whether a particular customer is likely to respond to the offer or not. In other words, whether a customer will be a “Responder” or “Non Responder”. This is called “Propensity to Respond Modeling”.

Using insights generated from the logistic regression output, companies may optimize their business strategies to achieve their business goals such as minimize expenses or losses, maximize return on investment (ROI) in marketing campaigns etc.

**Logistic Regression Equation:**

The underlying algorithm of Maximum Likelihood Estimation (MLE) determines the regression coefficients for the model that best predict the probability of the binary dependent variable. The algorithm stops when the convergence criterion is met or the maximum number of iterations is reached. Since the probability of any event lies between 0 and 1 (or 0% to 100%), when we plot the probability of the dependent variable against the independent factors, it demonstrates an ‘S’-shaped curve.

Logit Transformation is defined as follows-

**Logit = log(p/(1-p)) = log(probability of event happening / probability of event not happening) = log(Odds)**
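The logit transformation above can be checked numerically; a small NumPy sketch:

```python
import numpy as np

def logit(p):
    """Log odds: log(p / (1 - p))."""
    return np.log(p / (1 - p))

print(logit(0.5))       # 0.0 (even odds: a 50% probability)
print(logit(0.9) > 0)   # True: odds favour the event
print(logit(0.1) < 0)   # True: odds are against the event
```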

Logistic Regression is part of a larger class of algorithms known as Generalized Linear Model (GLM). The fundamental equation of generalized linear model is:

g(E(y)) = α + βx1 + γx2

Here, g() is the link function, E(y) is the expectation of target variable and α + βx1 + γx2 is the linear predictor (α,β,γ to be predicted). The role of link function is to ‘link’ the expectation of y to linear predictor.

Key Points :

- GLM does not assume a linear relationship between the dependent and independent variables. However, it assumes a linear relationship between the link function and the independent variables in the logit model.
- The dependent variable need not be normally distributed.
- It does not use OLS (Ordinary Least Squares) for parameter estimation. Instead, it uses maximum likelihood estimation (MLE).
- Errors need to be independent but not normally distributed.

To understand, consider the following example:

We are provided a sample of 1000 customers and need to predict the probability that a customer will buy (**y**) a particular magazine. Since we have a categorical outcome variable, we’ll use logistic regression.

To start with logistic regression, first write the simple linear regression equation with dependent variable enclosed in a link function:

g(y) = βo + β(Age) ... (a)

For understanding, consider ‘*Age*’ as independent variable.

In logistic regression, we are only concerned with the probability of the outcome of the dependent variable (success or failure). As described above, g() is the link function, which is established using two things: the probability of success (p) and the probability of failure (1-p). p should meet the following criteria:

- It must always be positive (since p >= 0)
- It must always be less than or equal to 1 (since p <= 1)

Now, we simply satisfy these two conditions to get to the core of logistic regression. To establish the link function, we first denote g() with p and eventually derive the function.

Since probability must always be positive, we put the linear equation in exponential form. For any value of the coefficients and the independent variable, the exponent in this equation will never be negative.

p = exp(βo + β(Age)) = e^(βo + β(Age)) ... (b)

To make the probability less than 1, divide p by a number greater than p. This can simply be done by:

p = exp(βo + β(Age)) / (exp(βo + β(Age)) + 1) = e^(βo + β(Age)) / (e^(βo + β(Age)) + 1) ... (c)

Using (a), (b) and (c), we can redefine the probability as:

p = e^y / (1 + e^y) ... (d)

*where* p *is the probability of success. Equation (d) is the logistic (sigmoid) function.*

If p is the probability of success, 1-p will be the probability of failure which can be written as:

q = 1 - p = 1 - (e^y / (1 + e^y)) ... (e)

*where* q is the probability of failure

On dividing (d) by (e), we get:

p/(1-p) = e^y

After taking *log* on both sides, we get:

log(p/(1-p)) = y

log(p/(1-p)) is the link function. This logarithmic transformation on the outcome variable allows us to model a non-linear association in a linear way.

After substituting the value of y, we get:

log(p/(1-p)) = βo + β(Age)

This is the equation used in logistic regression. Here p/(1-p) is the odds of success: whenever the log odds is positive, the probability of success is more than 50%. A typical logistic model plot is an ‘S’-shaped curve; the probability never goes below 0 or above 1.
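The derivation above can be verified numerically: the sigmoid in (d) and the log-odds link invert each other. A small sketch:

```python
import numpy as np

def sigmoid(z):
    """Equation (d): p = e^z / (1 + e^z), equivalently 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_odds(p):
    """The link function: log(p / (1 - p))."""
    return np.log(p / (1 - p))

z = 2.0                            # e.g. the linear predictor βo + β(Age)
p = sigmoid(z)
print(0 < p < 1)                   # probability stays strictly between 0 and 1
print(np.isclose(log_odds(p), z))  # taking log odds recovers the linear predictor
```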

Logistic regression predicts the probability of an outcome that can only have two values (i.e. a dichotomy). The prediction is based on the use of one or several predictors (numerical and categorical). A linear regression is not appropriate for predicting the value of a binary variable for two reasons:

- A linear regression will predict values outside the acceptable range (e.g. predicting probabilities outside the range 0 to 1).
- Since dichotomous experiments can only have one of two possible values for each experiment, the residuals will not be normally distributed about the predicted line.

On the other hand, a logistic regression produces a logistic curve, which is limited to values between 0 and 1. Logistic regression is similar to a linear regression, but the curve is constructed using the natural logarithm of the “odds” of the target variable, rather than the probability. Moreover, the predictors do not have to be normally distributed or have equal variance in each group.

In the logistic regression the constant (*b0*) moves the curve left and right and the slope (*b1*) defines the steepness of the curve. Logistic regression can handle any number of numerical and/or categorical variables.

There are several analogies between linear regression and logistic regression. Just as ordinary least squares regression is the method used to estimate coefficients for the best-fit line in linear regression, logistic regression uses **maximum likelihood estimation (MLE)** to obtain the model coefficients that relate predictors to the target. After an initial function is estimated, the process is repeated until **LL** (Log Likelihood) no longer changes significantly.
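To illustrate this iterative process (a toy sketch only, not any library's actual implementation), MLE can be carried out by gradient ascent on the log likelihood, stopping once LL stops improving:

```python
# Toy maximum likelihood estimation for logistic regression via gradient
# ascent on the log likelihood; stops when LL no longer improves.
# The "true" coefficients below are invented to generate synthetic data.
import numpy as np

rng = np.random.default_rng(1)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + 1 predictor
true_beta = np.array([-0.5, 2.0])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ true_beta))).astype(float)

beta, prev_ll = np.zeros(2), -np.inf
for _ in range(10000):
    p = 1 / (1 + np.exp(-X @ beta))
    ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))  # log likelihood
    if ll - prev_ll < 1e-8:           # convergence criterion on LL change
        break
    prev_ll = ll
    beta += 0.5 * X.T @ (y - p) / n   # gradient of the average log likelihood

print(beta)  # MLE estimates, roughly near the invented true coefficients
```

Real implementations typically use Newton-type updates (iteratively reweighted least squares) rather than plain gradient ascent, but the stopping rule on the log likelihood is the same idea.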

**Performance of Logistic Regression Model (Performance Metrics):**

To evaluate the performance of a logistic regression model, we must consider a few metrics. Irrespective of the tool we work with (SAS, R, Python), we should always look for:

1. **AIC (Akaike Information Criterion)** — The analogous metric of adjusted R² in logistic regression is AIC. AIC is a measure of fit that penalizes the model for the number of coefficients. Therefore, we always prefer the model with the minimum AIC value.

2. **Null Deviance and Residual Deviance** — Null deviance indicates how well the response is predicted by a model with nothing but an intercept; the lower the value, the better the model. Residual deviance indicates how well the response is predicted by the model after adding the independent variables; again, the lower the value, the better the model.
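As a quick illustration of how AIC trades fit against complexity (the log likelihood values below are made up for the example):

```python
def aic(log_likelihood, n_params):
    """AIC = 2k - 2*LL: lower values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical fitted models: (log likelihood, number of coefficients)
print(aic(-120.0, 3))  # 246.0
print(aic(-118.5, 6))  # 249.0: the extra parameters didn't pay for themselves
```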

3. **Confusion Matrix:** A tabular representation of Actual vs Predicted values, which helps us find the accuracy of the model and avoid over-fitting.

Specificity and sensitivity play a crucial role in deriving the ROC curve.
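A small sketch of computing a confusion matrix and the derived sensitivity / specificity with scikit-learn (the labels below are invented):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual classes (invented)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions (invented)

# For binary labels [0, 1], the matrix unpacks as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)        # true positive rate
specificity = tn / (tn + fp)        # true negative rate
print(tn, fp, fn, tp)               # 3 1 1 3
print(sensitivity, specificity)     # 0.75 0.75
```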

4. **ROC Curve:** The Receiver Operating Characteristic (ROC) curve summarizes the model’s performance by evaluating the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) across all possible classification thresholds. The area under the curve (AUC), also referred to as the index of accuracy (A) or concordance index, is a standard performance metric for the ROC curve: the higher the area under the curve, the better the predictive power of the model. The ROC of a perfect predictive model reaches a true positive rate of 1 at a false positive rate of 0; such a curve touches the top-left corner of the graph.
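Computing the ROC curve and AUC with scikit-learn (the scores below are invented predicted probabilities, chosen small enough to check by hand):

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]              # actual classes (invented)
scores = [0.1, 0.4, 0.35, 0.8]     # predicted probabilities (invented)

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one point per threshold
print(roc_auc_score(y_true, scores))              # 0.75
```

Here 3 of the 4 positive/negative score pairs are correctly ordered, which is exactly what the AUC of 0.75 measures.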

For model performance, we can also consider the likelihood function. It is called so because it selects the coefficient values which maximize the likelihood of explaining the observed data. It indicates a good fit as its value approaches one, and a poor fit as its value approaches zero.

**Summary:**

**Logistic Regression** is a classification algorithm. It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables. To represent binary / categorical outcome, we use dummy variables. We can also think of logistic regression as a special case of linear regression when the outcome variable is categorical, where we are using log of odds as dependent variable. In simple words, it predicts the probability of occurrence of an event by fitting data to a **logit** function.

· It is widely used for **classification problems**

· Logistic regression doesn’t require a linear relationship between the dependent and independent variables. It can handle various types of relationships because it applies a non-linear log transformation to the predicted odds

· To avoid over-fitting and under-fitting, we should include all significant variables. A good approach to ensure this is to use a stepwise method to estimate the logistic regression

· It requires **large sample sizes**, because maximum likelihood estimates are less powerful at low sample sizes than ordinary least squares

· The independent variables should not be correlated with each other, i.e. **no multicollinearity**. However, we have the option to include interaction effects of categorical variables in the analysis and in the model.

· If the values of the dependent variable are ordinal, then it is called **Ordinal Logistic Regression**

· If the dependent variable is multi-class, then it is known as **Multinomial Logistic Regression**.

*“Hiding within those mounds of data is knowledge that could change the life of a patient, or change the world.” — Atul Butte, Stanford University*