Logistic Regression in Machine Learning

Krantiwadmare
Analytics Vidhya
Published in
6 min read · May 19, 2021

Linear Regression vs Logistic Regression

  1. Both are supervised learning models and make use of labeled data for making predictions.
  2. Linear regression is used for regression (prediction) problems, whereas logistic regression can be used for both classification and regression problems but is widely used as a classification algorithm.

Logistic regression borrows the predictive-modeling framework of regression, which is why it carries "regression" in its name; but because it is used to classify samples, it falls under the classification algorithms.

(Image: linear vs logistic regression curves. Source: datacamp)

3. Logistic regression is used when the dependent variable is binary, such as whether a user clicks on a given advertisement link, spam detection, diabetes prediction, whether a customer will make a purchase, or whether an employee will leave the company. Linear regression is used when the dependent variable is continuous, such as price, age, salary, etc.

4. Linear regression uses Ordinary Least Squares (OLS), i.e., a distance-minimizing approximation, while logistic regression uses the Maximum Likelihood Estimation (MLE) approach, i.e., it determines the parameters that maximize the likelihood of producing the observed output.

Type of Logistic Regression:

On the basis of the Dependent variable, Logistic Regression can be classified into three types:

  1. Binomial: There can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, Purchased or Not Purchased, Tall or Short, Fat or Slim, Rock or Mine, etc.
  2. Multinomial: There can be 3 or more possible unordered types of the dependent variable, such as apple, banana, orange; cat, dog, goat, sheep; or Delhi, Mumbai, Bangalore, Calcutta.
  3. Ordinal: There can be 3 or more possible ordered types of the dependent variable, such as high, medium, low; ratings of a restaurant from 1 to 5; the intensity of light; or a 5-point Likert scale.

Linear Regression Equation:

y = b0 + b1X1 + b2X2 + … + bnXn

where y is the dependent variable and X1, X2, …, Xn are independent variables.

1. With more outliers, the best-fit line deviates, which reduces the accuracy of the model; hence linear regression is very sensitive to outliers.

2. It isn't suitable for problems where the output must stay within the limits of 0 to 1.

To overcome such shortcomings, Logistic Regression is used.

Logistic Regression in Machine Learning:

  1. Logistic regression uses a sigmoid (logit) function that squashes the best-fit straight line, mapping any value, including those beyond the 0 to 1 limits, into the 0 to 1 range. This forms an "S"-shaped curve.
  2. The sigmoid function dampens the effect of outliers and keeps the output between 0 and 1.

Applying the sigmoid function to the linear regression equation gives the predicted probability:

p = 1 / (1 + e^-(b0 + b1X1 + b2X2 + … + bnXn))
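As a quick sanity check, here is a minimal sketch of the sigmoid function showing how it squashes any real value into the 0 to 1 range (the function name `sigmoid` is my own, not from the article's notebook):

```python
import numpy as np

def sigmoid(z):
    """Squash any real value into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1,
# and z = 0 maps exactly to 0.5 -- producing the "S"-shaped curve.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
```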

Model building in Scikit-learn :

Let's build a model using the SONAR readings of a ship to classify whether the detected object is a ROCK or a MINE. This is a binary classification problem.

You can download the dataset from Kaggle using the following link: https://www.kaggle.com/mattcarter865/mines-vs-rocks

You can follow along with the Jupyter notebooks from my Github repository.

1. Load the necessary libraries and the data

Analyse the dependent variable: the data contains 111 Mines and 97 Rocks.
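The loading step might look like the sketch below. Since the original notebook isn't shown here, I use a tiny toy frame as a stand-in; the real Kaggle file would be read with `pd.read_csv("sonar.all-data.csv", header=None)` (no header row, so column 60 holds the label):

```python
import pandas as pd

# Toy stand-in for the 208-row SONAR frame; the real dataset has 60 numeric
# reading columns (0-59) and a label in column 60: "M" = mine, "R" = rock.
df = pd.DataFrame({
    0: [0.02, 0.04, 0.01, 0.03],
    1: [0.37, 0.11, 0.25, 0.09],
    60: ["M", "R", "M", "R"],
})

# Analyse the dependent variable, as in step 1 of the article.
counts = df[60].value_counts()
print(counts)
```

On the real data this prints roughly 111 M and 97 R.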

2. Feature Selection:

Divide the given 61 columns into two types:

  1. Independent or feature variables
  2. Dependent or target variable

Plot a scatter plot to visualize the distribution of the feature variables for each class, Rock or Mine.

3. Splitting Data

Splitting the dataset into a training set and a test set to understand model performance.

Here, the dataset is split into two parts in an 80:20 ratio: 80% of the data is used for model training and 20% for model testing.

stratify=y splits the data with respect to the dependent variable proportionally, so both parts keep the same class balance and data imbalance is avoided.
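A sketch of the splitting step, using synthetic data with the same shape as the SONAR set (60 features) since the article's notebook isn't shown:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 60 features, 54 mines and 46 rocks.
rng = np.random.default_rng(0)
X = rng.random((100, 60))
y = np.array(["M"] * 54 + ["R"] * 46)

# 80:20 split; stratify=y keeps the M/R ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)
print(X_train.shape, X_test.shape)
```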

4. Deploying the model

Import the LogisticRegression class and instantiate the model with LogisticRegression().

Then, fit your model on the train set using fit() and perform prediction on the test set using predict().

Calculate the accuracy scores of Training and testing datasets

Accuracy scores of Training dataset : 0.8192771084337349
Accuracy scores of Testing dataset : 0.8809523809523809
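The fit/predict/score steps above can be sketched end to end like this. Synthetic data stands in for the SONAR readings, so the exact accuracy values will differ from the article's:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary data with the same shape as the SONAR set.
X, y = make_classification(n_samples=208, n_features=60, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

model = LogisticRegression(max_iter=1000)   # instantiate the model
model.fit(X_train, y_train)                 # fit on the train set
y_pred = model.predict(X_test)              # predict on the test set

print("Train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("Test accuracy:", accuracy_score(y_test, y_pred))
```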

5. Model Evaluation using Confusion Matrix

  • A confusion matrix is a 2×2 table (since this model is a binary classifier).
  • It is used to evaluate the performance of a classification model.
  • It tallies the correct and incorrect predictions class-wise (Rock and Mine).
  • There are two classes: Rock and Mine.
  • Diagonal elements (20, 17) represent accurate predictions, while off-diagonal elements (2, 3) represent inaccurate predictions.

Visualizing Confusion Matrix using Heatmap

Calculate the accuracy score of the matrix

Accuracy of Confusion Matrix: 0.8809523809523809
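The confusion-matrix step can be sketched as below. The labels and predictions here are hypothetical arrays constructed to mirror the article's counts (20 and 17 correct, 2 and 3 incorrect), which reproduce the same accuracy:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical test labels/predictions mirroring the article's counts.
y_test = np.array(["R"] * 22 + ["M"] * 20)
y_pred = np.array(["R"] * 20 + ["M"] * 2 + ["M"] * 17 + ["R"] * 3)

# Rows = true class, columns = predicted class, in the order ["R", "M"].
cm = confusion_matrix(y_test, y_pred, labels=["R", "M"])
print(cm)                              # [[20  2]
                                       #  [ 3 17]]
print(accuracy_score(y_test, y_pred))  # 37/42 = 0.8809523809523809

# A heatmap as in the article could be drawn with seaborn:
# sns.heatmap(cm, annot=True)
```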

6. Model Evaluation using Input Data

Create a function that takes all the features of a single observation, converts the row into an array, reshapes it into one row with its feature columns, and predicts the data class: Rock or Mine.

Check the model for the input data.

The 101st row belongs to the "Mine" class and the 1st row belongs to the "Rock" class.
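A sketch of such a single-row prediction function. The helper name `predict_object` and the toy training data are my own stand-ins, since the article's notebook isn't reproduced here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Train on synthetic stand-in data (60 features, like the SONAR set);
# the labeling rule here is a toy substitute for the real labels.
rng = np.random.default_rng(0)
X = rng.random((200, 60))
y = np.where(X[:, 0] > 0.5, "M", "R")
model = LogisticRegression(max_iter=1000).fit(X, y)

def predict_object(row, model):
    """Convert one row of readings to an array, reshape it to
    (1, n_features), and return 'Rock' or 'Mine'."""
    arr = np.asarray(row, dtype=float).reshape(1, -1)
    label = model.predict(arr)[0]
    return "Mine" if label == "M" else "Rock"

print(predict_object(X[0], model))
```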

We get the correct output for both rows; hence our model classifies the dependent variable accurately.

Advantages

  • Highly efficient
  • Doesn’t require high computation power
  • Easy to implement, easily interpretable
  • It doesn’t require scaling of features.
  • Logistic regression provides a probability score for observations.

Disadvantages

  • Logistic regression cannot handle a large number of categorical features/variables.
  • It is vulnerable to overfitting.
  • It can’t solve non-linear problems on its own, which is why it requires a transformation of non-linear features.
  • Logistic regression will not perform well with independent variables that are not correlated to the target variable and are very similar or correlated to each other.
