Logistic Regression

In machine learning, the term "logistic" comes up whenever we need to build a classification model.

Logistic regression is used when the dependent variable (Y) is binary, meaning it can take only the values 1 or 0. The model applies a mathematical formula to the independent variables to predict the probability of event 1.

In this article, I will explain logistic regression with the help of an example.

Contents

  1. Difference between Logistic and Linear regression
  2. Why we cannot use linear regression for classification problems
  3. Building logistic regression in R
  4. Understanding the dataset
  5. Class imbalance problem
  6. Predicting by using logistic regression
  7. The accuracy of the model

Difference between Logistic and Linear Regression

Linear regression models the relationship between a dependent variable and one or more independent variables. In linear regression, the dependent variable Y is continuous.

Logistic regression predicts the probability of an event (1) versus a non-event (0). Here the dependent variable Y is categorical and takes one of two values, 0 or 1.
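
For reference, the formula usually associated with logistic regression can be written as below. The notation is an assumption chosen for illustration and does not come from the article: p is the probability of event 1, x_1 to x_k are the independent variables, and the beta terms are the coefficients the model estimates.

```latex
p = P(Y = 1 \mid x_1, \dots, x_k)
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
\qquad\Longleftrightarrow\qquad
\log\frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
```

The left-hand form always lies strictly between 0 and 1, while the right-hand form shows that the log-odds are a linear function of the features.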

Why we cannot use linear regression for classification problems

When the dependent variable is categorical, such as 0 or 1, we need predictions that are either a class (0 or 1) or a probability in the range 0 to 1. Linear regression cannot guarantee this.

If we use linear regression to predict the two events 0 and 1, its output is not restricted to the range 0 to 1. The predicted values can fall below 0 or rise above 1, so they cannot be read as probabilities. That is why we do not use linear regression for classification problems.
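
A minimal sketch of this behaviour, using simulated data rather than the article's dataset. The variable names (sim, linear_fit, logistic_fit, grid) are made up for the illustration.

```r
# Simulated binary outcome driven by one numeric feature (illustrative only)
set.seed(42)
n <- 200
x <- rnorm(n)
p_true <- 1 / (1 + exp(-3 * x))
y <- rbinom(n, size = 1, prob = p_true)
sim <- data.frame(x = x, y = y)

# Fit both models to the same data
linear_fit <- lm(y ~ x, data = sim)
logistic_fit <- glm(y ~ x, family = binomial, data = sim)

# Predict at a few feature values, including extreme ones
grid <- data.frame(x = c(-4, 0, 4))
linear_pred <- predict(linear_fit, newdata = grid)
logistic_pred <- predict(logistic_fit, newdata = grid, type = "response")

range(linear_pred)    # can extend below 0 and above 1
range(logistic_pred)  # always stays between 0 and 1
```

At the extreme feature values the linear model returns numbers that are not valid probabilities, while the logistic model stays inside the 0 to 1 range.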

Building logistic regression in R

In this article, we are going to use a dataset on social advertising.

The goal here is to predict the purchase event on the basis of age, gender and estimated salary.

In the given dataset we have 5 columns.

First 5 rows of the dataset

In this dataset, we need to predict the purchase event from the other features.

The Purchased column is our dependent variable, coded as 0 and 1.

Let's check the summary of the dataset

Summary of the Dataset

Let's check the structure of the dataset

Structure of the dataset
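
The three inspections above correspond to head(), summary() and str(). The sketch below runs them on a small made-up data frame, not on the article's data. Every column name except Purchased, and every value, is a hypothetical stand-in chosen to mirror the five columns described here.

```r
# Hypothetical stand-in for the advertising dataset (values are invented)
demo <- data.frame(
  User.ID = 1:6,
  Gender = factor(c("Male", "Female", "Female", "Male", "Female", "Male")),
  Age = c(19, 35, 26, 27, 47, 52),
  EstimatedSalary = c(19000, 20000, 43000, 57000, 25000, 90000),
  Purchased = c(0, 0, 0, 0, 1, 1)
)

head(demo, 5)   # first 5 rows
summary(demo)   # per-column summary statistics
str(demo)       # column types and a preview of the values
```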

In logistic regression, all variables other than the dependent variable need to be converted to numeric. We can ignore the ID column, since it only identifies rows and is not needed for modelling.

(Key point: when converting a factor to numeric, first convert it to character and then to numeric.)
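
A short sketch of why the intermediate character step matters. The vector salary is a made-up example, not a column from the dataset.

```r
# A factor whose labels look like numbers (illustrative values)
salary <- factor(c("15000", "43000", "20000"))

# Direct conversion returns the internal level codes, not the values
codes <- as.numeric(salary)                  # 1 3 2

# Converting through character recovers the original numbers
values <- as.numeric(as.character(salary))   # 15000 43000 20000
```

The levels are stored in sorted order, so the direct conversion returns each value's position among the levels instead of the value itself.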

Selecting the relevant columns for our logistic model from the dataset

# keep only the relevant columns (3 to 5) from the dataset
dataset = dataset[3:5]

The dependent variable should always be a factor, and the other feature columns should be numeric.

# Encoding the target feature as a factor
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))

Class imbalance problem

We need to check the dataset for class imbalance before splitting it into training and test sets.

Since the dependent variable is binary, we want the two classes to appear in roughly equal proportions in the training set.

# check the class balance of the dependent variable we need to predict
table(dataset$Purchased)

#0 1 
#257 143

# the dataset is imbalanced: events 0 and 1 are split roughly 2:1 (257 vs 143)
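
The class proportions can also be checked directly. The sketch below rebuilds a factor from the two counts reported above instead of loading the dataset; the names purchased, counts, shares and ratio are made up for the illustration.

```r
# Rebuild the class labels from the counts reported above (257 and 143)
purchased <- factor(rep(c(0, 1), times = c(257, 143)), levels = c(0, 1))

counts <- table(purchased)
shares <- prop.table(counts)               # share of each class
ratio <- counts[["0"]] / counts[["1"]]     # majority-to-minority ratio

shares
ratio
```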

(We need to apply a sampling technique to address the class imbalance.)

We can use sampling techniques such as upsampling, downsampling, and hybrid sampling with SMOTE or ROSE. In this article, we are going to use downsampling.

In downsampling, the majority class is randomly sampled down to the same size as the minority class.

# load caret, which provides downSample()
library(caret)
'%ni%' <- Negate('%in%') # define a 'not in' operator
options(scipen=999) # avoid scientific notation in printed output
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

Above, we have divided the dataset into training and test sets.

table(training_set$Purchased)
#0 1 
#193 107

The output above shows that our training set is not balanced.

# downsample the majority class in the training set
training_set_S <- downSample(x = training_set[, colnames(training_set) %ni% "Purchased"],
 y = training_set$Purchased)

In the code above, we downsample the training set, passing every column except Purchased as x and the Purchased column as y.

%ni% is the negation of the %in% operator, and I have used it here to select all the columns except the dependent column.

The downSample function requires 'y' to be a factor variable.

Training Dataset is balanced now

downSample returns the dependent variable in a column named Class, so the Purchased column is now called Class.
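
To make the idea behind downsampling concrete, here is a base R sketch on a toy data frame. It illustrates the technique only and is not caret's implementation; toy, n_min, rows_by_class, keep and toy_balanced are made-up names.

```r
# Toy imbalanced data: 20 rows of class 0 and 10 rows of class 1
set.seed(123)
toy <- data.frame(
  x = rnorm(30),
  Purchased = factor(rep(c(0, 1), times = c(20, 10)), levels = c(0, 1))
)

# Size of the smallest class
n_min <- min(table(toy$Purchased))

# Row indices grouped by class, then a random sample of n_min rows from each
rows_by_class <- split(seq_len(nrow(toy)), toy$Purchased)
keep <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), n_min)]
}))

toy_balanced <- toy[keep, ]
table(toy_balanced$Purchased)  # both classes now have n_min rows
```

Rows from the minority class are all kept, and the majority class loses rows at random until the two counts match.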

Predicting by using logistic regression

# Feature Scaling
training_set_S[-3] = scale(training_set_S[-3])
summary(training_set_S)

test_set[-3] = scale(test_set[-3])
summary(test_set)

# Fitting Logistic Regression to the Training set
classifier = glm(formula = Class ~ .,
 family = binomial,
 data = training_set_S)

summary(classifier)

We applied feature scaling to bring the features onto the same range, which helps the model fit. The glm function with family = binomial is used to fit the logistic regression.
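
The code above scales the test set with its own mean and standard deviation. A common alternative, sketched below on simulated data, is to reuse the training set's centering and scaling values for the test set so both are transformed identically. The data frames and the column names Age and EstimatedSalary are hypothetical stand-ins, and this variant is not what produced the results reported in this article.

```r
# Simulated feature columns (illustrative values only)
set.seed(123)
train <- data.frame(Age = rnorm(50, 38, 10),
                    EstimatedSalary = rnorm(50, 70000, 30000))
test <- data.frame(Age = rnorm(20, 38, 10),
                   EstimatedSalary = rnorm(20, 70000, 30000))

# Scale the training features and keep the parameters that were used
train_scaled <- scale(train)
center <- attr(train_scaled, "scaled:center")
spread <- attr(train_scaled, "scaled:scale")

# Apply the same parameters to the test features
test_scaled <- scale(test, center = center, scale = spread)
```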

# Predicting the Test set results
prob_pred = predict(classifier, type = 'response', newdata = test_set[-3])

y_pred = ifelse(prob_pred > 0.5, 1, 0)
y_pred_N <- factor(y_pred, levels=c(0, 1))
y_act <- test_set$Purchased

The common practice is to take the probability cutoff as 0.5. If the predicted probability of Y is greater than 0.5, the observation is classified as event 1, otherwise as 0.
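
A small sketch of how the cutoff turns probabilities into classes, using made-up probabilities rather than the model's output. The second cutoff of 0.3 is an arbitrary value chosen only to show the effect of moving the threshold.

```r
# Made-up predicted probabilities
prob_demo <- c(0.10, 0.45, 0.50, 0.55, 0.90)

# Default cutoff: only probabilities strictly above 0.5 become class 1
class_050 <- ifelse(prob_demo > 0.5, 1, 0)   # 0 0 0 1 1

# A lower cutoff labels more observations as class 1
class_030 <- ifelse(prob_demo > 0.3, 1, 0)   # 0 1 1 1 1
```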

Accuracy of the model

Let’s compute the accuracy, which is simply the proportion of y_pred_N values that match y_act.

# accuracy
mean(y_pred_N == y_act)
# 83% accuracy

We achieved 83% accuracy on our predictions for the test set.
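
Accuracy alone does not show which kind of mistakes the model makes. A confusion matrix breaks the predictions down by actual and predicted class. The sketch below uses short made-up vectors, not the article's predictions, and the names y_act_demo, y_pred_demo and cm are hypothetical.

```r
# Made-up actual and predicted labels (8 observations)
y_act_demo <- factor(c(0, 0, 1, 1, 0, 1, 0, 1), levels = c(0, 1))
y_pred_demo <- factor(c(0, 1, 1, 1, 0, 0, 0, 1), levels = c(0, 1))

# Rows are actual classes, columns are predicted classes
cm <- table(Actual = y_act_demo, Predicted = y_pred_demo)
cm

# Accuracy is the share of predictions on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)   # 0.75
```

The same value comes out of mean(y_pred_demo == y_act_demo), which is the calculation used above.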

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — -

Find me on LinkedIn. I love to have conversations about machine learning.

(Edit: This article is still being updated and revised)