In machine learning, the term "logistic regression" comes up whenever we need to build a classification model.
Logistic regression is used when the dependent variable ( Y ) is binary, i.e. a categorical variable that takes only the values 1 or 0. Logistic regression predicts the probability of the event 1 from the given independent variables.
In this article, I will explain logistic regression with the help of an example.
Contents — — — — — — — — — — — — — — — — — — — — — — — — — —
- Difference between Logistic and Linear regression
- Why we cannot use linear regression for classification problems
- Building logistic regression in R
- Understanding the dataset
- Class imbalance problem
- Predicting by using logistic regression
- The accuracy of the model
Difference between Logistic and Linear Regression
Linear regression models the relationship between the dependent and independent variables. In linear regression, the dependent variable Y is continuous.
Logistic regression predicts the probability of the events 1 and 0. Here the dependent variable Y is categorical, taking only the two values 0 and 1.
Why we cannot use linear regression for classification problems
When the dependent variable is categorical, taking values like 0 or 1, and we need to predict either the class itself or a probability in the range 0 to 1, linear regression is unsuitable because its predictions are not constrained to that range.
If we fit a linear regression to predict the two events 0 and 1, the fitted values can fall below 0 or exceed 1, so they cannot be interpreted as probabilities. That is why we cannot use linear regression for classification problems.
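Logistic regression avoids this by passing the linear combination of the features through the logistic (sigmoid) function, which maps any real number into the interval (0, 1). A minimal sketch in R:

```r
# Logistic (sigmoid) function: maps any real value into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

z <- c(-10, -1, 0, 1, 10)  # example linear-predictor values
sigmoid(z)
# Values approach 0 for large negative z, approach 1 for large
# positive z, and equal 0.5 at z = 0 -- always inside (0, 1)
```

This is why the output of logistic regression can always be read as a probability, which a straight line cannot guarantee.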
Building logistic regression in R
In this article, we are going to use a dataset of social network advertising.
The goal is to predict the purchase event on the basis of age, gender and estimated salary.
The given dataset has 5 columns.
We need to predict the event of purchase on the basis of the other given features.
The Purchased column is our dependent variable, given in the form of 0 and 1.
Let's check the summary and the structure of the dataset.
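Assuming the data lives in a CSV file (the file name Social_Network_Ads.csv below is an assumption for illustration), loading and inspecting it looks like this:

```r
# Load the dataset (file name is an assumption for illustration)
dataset <- read.csv("Social_Network_Ads.csv")

summary(dataset)  # per-column min/max/quartiles/mean for numeric columns
str(dataset)      # column types: int, num, chr/factor, plus sample values
```

summary() gives a quick sense of the value ranges, while str() confirms which columns will need type conversion before modelling.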
In logistic regression, you need to convert all the variables to numeric apart from your dependent variable. We can ignore the ID column, as we won't require an ordering column.
( Key point: whenever converting a factor into a numeric, first convert it to character and then to numeric. )
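The reason for this key point is that coercing a factor straight to numeric returns its internal level codes rather than the original values. For example:

```r
# Converting a factor straight to numeric returns the internal
# level codes, not the original values -- go via character instead
f <- factor(c("10", "20", "30"))

as.numeric(f)                # 1 2 3   (level indices, usually not what you want)
as.numeric(as.character(f))  # 10 20 30 (the actual values)
```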
Selecting the relevant column for our logistic model from the dataset
#taking only relevant column from the dataset
dataset = dataset[3:5]
The dependent variable should always be a factor, and the other feature columns should be numeric.
# Encoding the target feature as a factor
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
Class imbalance problem
We need to check for class imbalance before splitting the dataset into training and test sets.
Since the dependent variable is a binary categorical variable, we need to make sure the training dataset has a roughly equal proportion of each class.
#check for class imbalance in the dependent variable you need to predict
#this shows the dataset is highly imbalanced: the events 0 and 1 occur in roughly a 2:1 ratio
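The check described in the comments above can be run with a simple frequency table:

```r
# Count each class of the dependent variable, and its proportion
table(dataset$Purchased)
prop.table(table(dataset$Purchased))
# A roughly 2:1 split between class 0 and class 1 confirms the imbalance
```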
( We need to apply a sampling technique to solve the class imbalance problem. )
We can use sampling techniques such as up-sampling, down-sampling, or hybrid sampling using SMOTE and ROSE. In this article, we are going to use down-sampling.
In down-sampling, the majority class is randomly down-sampled to the same size as the smaller class.
'%ni%' <- Negate('%in%') # define 'not in' operator
# Splitting the dataset into the Training set and Test set
library(caTools) # provides sample.split()
set.seed(123)    # for a reproducible split
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
Above, we have divided the dataset into training and test sets.
The output shows that our training dataset is not balanced.
library(caret) # provides downSample()
training_set_S <- downSample(x = training_set[, colnames(training_set) %ni% "Purchased"],
                             y = training_set$Purchased)
In the code above, we down-sample the training set, excluding the Purchased column from x.
%ni% is the negation of the %in% operator, and I have used it here to select all the columns except the dependent one.
The downSample function requires y to be a factor variable.
Note that downSample renames the dependent column from Purchased to Class.
Predicting by using logistic regression
# Feature Scaling
training_set_S[-3] = scale(training_set_S[-3])
test_set[-3] = scale(test_set[-3])
# Fitting Logistic Regression to the Training set
classifier = glm(formula = Class ~ .,
family = binomial,
data = training_set_S)
We did the feature scaling to put the features on the same range, as this helps the performance of the model. The glm function with family = binomial fits the logistic regression.
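Once the model is fitted, it is worth inspecting it before predicting. A short sketch, assuming the classifier fitted above:

```r
# Inspect the fitted model: coefficients, std. errors, z-values, p-values
summary(classifier)

# Coefficients are on the log-odds scale; exponentiating them gives
# odds ratios, which are often easier to interpret
exp(coef(classifier))
```

A positive coefficient means the predicted probability of class 1 rises as that feature increases, holding the others fixed.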
# Predicting the Test set results
prob_pred = predict(classifier, type = 'response', newdata = test_set[-3])
y_pred = ifelse(prob_pred > 0.5, 1, 0)
y_pred_N <- factor(y_pred, levels=c(0, 1))
y_act <- test_set$Purchased
The common practice is to take the probability cutoff as 0.5: if the predicted probability of Y is > 0.5, the observation is classified as event 1, else 0.
Accuracy of the model
Let's compute the accuracy, which is nothing but the proportion of y_pred_N that matches y_act.
mean(y_pred_N == y_act)
#83 % accuracy
We have achieved 83% accuracy for our prediction.
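Accuracy alone can be misleading on an imbalanced test set, so it is also worth looking at the confusion matrix. A sketch, assuming y_pred_N and y_act from above:

```r
# Cross-tabulate predictions against actual labels
table(Predicted = y_pred_N, Actual = y_act)
# The diagonal holds correct predictions; the off-diagonal cells are
# the false positives and false negatives behind the accuracy figure
```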
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — -
Find me on LinkedIn. I love to have conversations about Machine Learning.
(Edit: This article is still being updated and revised)