Logistic Regression

Published in CodeX · 5 min read · Aug 25, 2022

Logistic regression is a commonly used classification model. In this model, the dependent variable (the target) is a discrete binary value, i.e. 1 or 0, representing outcomes such as pass or fail, win or loss, true or false.

Although it is a classification model, the term regression in its name reflects that it works much like regression-based predictive modelling. Instead of fitting a straight line, as in linear regression, we fit an ‘S’-shaped curve known as the sigmoid curve. The curve outputs a probability between 0 and 1, which is then mapped to one of the two classes, 0 or 1. The parameters of the curve are chosen by maximum likelihood, i.e. so that the observed outcomes are as probable as possible under the model.

LOGISTIC REGRESSION EQUATION

We will derive the logistic equation from the straight-line equation. Suppose there are two features, x1 and x2. The linear relation between the features and the target value is y = Ax1 + Bx2 + C, but here y ranges from negative infinity to infinity.

For logistic regression, y must be a probability between zero and one. We first form the odds y/(1-y), whose range is zero to infinity. Taking the logarithm of the odds gives log(y/(1-y)), whose range is negative infinity to infinity, the same as that of the linear expression. The required logistic equation is therefore log(y/(1-y)) = Ax1 + Bx2 + C.
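The odds and log-odds steps can be checked numerically. The following is a minimal sketch in plain Python; the helper names `odds` and `log_odds` are made up for illustration and are not part of any library.

```python
import math

def odds(p):
    """Odds of an event with probability p (valid for 0 <= p < 1)."""
    return p / (1 - p)

def log_odds(p):
    """Natural logarithm of the odds (valid for 0 < p < 1)."""
    return math.log(odds(p))

# Probabilities below 0.5 give odds below 1 and negative log-odds;
# probabilities above 0.5 give odds above 1 and positive log-odds.
for p in (0.1, 0.5, 0.9):
    print(p, round(odds(p), 3), round(log_odds(p), 3))
```

As p moves toward 0 the log-odds head toward negative infinity, and as p moves toward 1 they head toward positive infinity, which is why the log-odds can be set equal to an unbounded linear expression.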

SIGMOID PROBABILITY

The probability of the target y is restricted to the range 0 to 1; this is called the sigmoid probability. Mathematically,

$$S(t) = \frac{1}{1 + e^{-t}}$$

Here ‘t’ is computed from the data values, i.e. it is the linear combination of the features ‘X’ (Ax1 + Bx2 + C above).
S(t) represents the probability that the dependent variable ‘Y’ is true (1).

This mathematical function gives an ‘S’ curve bounded between 0 and 1: it approaches 0 as ‘t’ approaches -infinity and 1 as ‘t’ approaches +infinity.

A sigmoid function, whose output lies between 0 and 1
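To make the limits concrete, here is a small sketch of the sigmoid in plain Python. The function name `sigmoid` and the sample inputs are chosen for illustration only; the two branches are just a common way of avoiding overflow for large negative inputs.

```python
import math

def sigmoid(t):
    """Logistic sigmoid S(t) = 1 / (1 + e^(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    # For negative t, use the equivalent form e^t / (1 + e^t)
    # so that math.exp never receives a large positive argument.
    e = math.exp(t)
    return e / (1.0 + e)

# The output rises from near 0 to near 1 and equals 0.5 at t = 0.
for t in (-10, -1, 0, 1, 10):
    print(t, round(sigmoid(t), 4))
```

A class label is usually obtained by comparing this probability with a cutoff; 0.5 is a common default, assumed here rather than taken from the text above.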

CONFUSION MATRIX

A confusion matrix is used to evaluate the performance of a classification model. It compares the predicted values of the target variable with its actual values.

TN: True Negative (number of actual negatives that are correctly predicted as negative)
TP: True Positive (number of actual positives that are correctly predicted as positive)
FN: False Negative (number of actual positives that are wrongly predicted as negative)
FP: False Positive (number of actual negatives that are wrongly predicted as positive)
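The four cells can be counted by hand from a list of actual labels and a list of predicted labels. The sketch below uses plain Python; the function name `confusion_counts` and the two toy label lists are invented for illustration and assume that labels are encoded as 0 and 1.

```python
def confusion_counts(actual, predicted):
    """Return (TN, FP, FN, TP) for two equal-length lists of 0/1 labels."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1      # actual positive, predicted positive
        elif a == 0 and p == 0:
            tn += 1      # actual negative, predicted negative
        elif a == 0 and p == 1:
            fp += 1      # actual negative, predicted positive
        else:
            fn += 1      # actual positive, predicted negative
    return tn, fp, fn, tp

# Toy labels, made up for this example.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_counts(actual, predicted))  # (3, 1, 1, 3)
```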

SIGNIFICANCE OF CONFUSION MATRIX

The confusion matrix is used to determine some important measures such as accuracy, precision, recall (sensitivity) and F-1 score.

Accuracy

Accuracy is the fraction of all predictions that the model got right. It is given by the formula:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision

Precision is the fraction of predicted positives that are actually positive. It is given by the formula:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall

Recall, also called sensitivity, is the fraction of actual positives that the model predicts correctly. It is calculated by:

$$\text{Recall} = \frac{TP}{TP + FN}$$

F-1 Score

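The F-1 score combines precision and recall into a single number (their harmonic mean), which makes it convenient for comparing two models. It is calculated by:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$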
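All four measures follow from the confusion matrix counts using their standard definitions. Here is a short sketch in plain Python; the function name `classification_metrics` and the counts passed to it are hypothetical and only serve as a worked example.

```python
def classification_metrics(tn, fp, fn, tp):
    """Return (accuracy, precision, recall, f1) from confusion matrix counts.

    Assumes at least one predicted positive and one actual positive,
    otherwise precision or recall would divide by zero.
    """
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)          # correct among predicted positives
    recall = tp / (tp + fn)             # correct among actual positives
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Made-up counts for 100 predictions.
acc, prec, rec, f1 = classification_metrics(tn=50, fp=10, fn=5, tp=35)
print(round(acc, 4), round(prec, 4), round(rec, 4), round(f1, 4))
```

With these counts, accuracy is 85/100, precision is 35/45, recall is 35/40 and the F-1 score is 70/85.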

LOGISTIC REGRESSION IN PYTHON

We will use the scikit-learn library to implement logistic regression and the confusion matrix on the Titanic dataset taken from kaggle.com. We have used the train.csv dataset in this example.

Importing all the necessary libraries

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,precision_score
from sklearn.metrics import recall_score,confusion_matrix
from sklearn.model_selection import train_test_split as tts

Reading the data

data=pd.read_csv("./titanic.csv")
print(data.shape)

data.head(5)

data.describe()

data.Survived.value_counts()
# count the number of survivors and non-survivors

Data Cleaning (in this example we are not performing extensive data cleaning)

#considering only important fields
columns=['Pclass','Survived','Sex','Fare','Age']
data=data[columns]
data.head()

#returning true at the places where null values are present
print(data.isnull())

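#removing rows that contain null values from the dataset
#(reassigning instead of inplace=True avoids modifying a slice of the original frame)
data=data.dropna(axis=0,how='any')
#the number of rows is reduced because rows with null values are removed
print(data.shape)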

Splitting the data for training and making predictions

#splitting the target column and the features
X=data[['Pclass','Fare','Age']]
Y=data['Survived']

#splitting the data for training the model
x_train,x_test,y_train,y_test=tts(X,Y,test_size=0.2,random_state=42)

print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

Training the model and predicting the target values

model=LogisticRegression()
model.fit(x_train,y_train)

predicted_y = model.predict(x_test)
print(predicted_y.shape)

Measuring the Accuracy, Precision, Recall and F-1 Score using Confusion Matrix

c_matrix=confusion_matrix(y_test,predicted_y)
print(c_matrix)

print("True Negative = ",c_matrix[0][0])
print("False Positive = ",c_matrix[0][1])
print("False Negative = ",c_matrix[1][0])
print("True Positive = ",c_matrix[1][1])

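from sklearn.metrics import f1_score

print("Accuracy = ",accuracy_score(y_test,predicted_y))
print("Precision = ",precision_score(y_test,predicted_y))
print("Recall = ",recall_score(y_test,predicted_y))
print("F-1 Score = ",f1_score(y_test,predicted_y))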

You can verify these values by computing them manually from the confusion matrix with the formulas mentioned above. Whether a value is acceptable depends on the requirement. For example, we can set a threshold for each measure and accept the model only if every measure meets its threshold.
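One way to express such an acceptance rule is sketched below in plain Python. The function name `meets_requirements`, the metric values and the thresholds are all hypothetical; real thresholds depend on the problem at hand.

```python
def meets_requirements(metrics, thresholds):
    """Return True if every metric named in thresholds reaches its minimum."""
    return all(metrics[name] >= minimum for name, minimum in thresholds.items())

# Made-up metric values and minimum requirements.
metrics = {"accuracy": 0.85, "precision": 0.78, "recall": 0.88}
thresholds = {"accuracy": 0.80, "recall": 0.85}

print(meets_requirements(metrics, thresholds))          # True
print(meets_requirements(metrics, {"precision": 0.9}))  # False
```

Only the measures listed in `thresholds` are checked, so a requirement that cares mostly about recall can simply leave precision out.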

LOGISTIC REGRESSION USE CASE

We can use logistic regression in any scenario where the target can be divided into two categories, for example:

Spam detection
Loan Sanction
Exam Result Prediction
Survival of a Disaster

In all these cases the outcome is binary (0 or 1). For example, in a spam detection system a message is either spam or not spam; for loan sanction, the bank either sanctions the loan or does not; for survival prediction, the person either survives or dies.

I hope you now understand what logistic regression is and how we measure the performance of the model.
Stay connected to learn more about Machine Learning.