ML Algorithm: Logistic Regression for a Base Model

Madhuri Patil
7 min read · Nov 3, 2022



Real-world supervised machine learning problems are often classification problems rather than regression problems: we need to predict qualitative values, usually referred to as categories or classes. For example, predicting whether a customer will churn or not, or whether a tumor is malignant or benign.

Regression algorithms predict numeric output values, whereas classification algorithms predict the particular class to which a given instance belongs. The process of predicting categorical response values is known as classification.

Logistic regression is considered one of the first choices for a base model in classification problems. In this article, you will learn the basics of logistic regression, why it is named regression, and its mathematical formulation, along with a case study as an example.

Before learning about the logistic regression algorithm, let’s first understand why we need another algorithm for this kind of prediction and why we cannot simply use linear regression.

Let’s consider a classic machine learning example, the iris-flower dataset, where we need to predict the species of an iris flower out of three classes (Iris-setosa, Iris-virginica, Iris-versicolor) using the linear regression algorithm.

The linear regression algorithm assumes that the response variable Y is a numeric value. So, let’s convert the categorical Y values into numerical response values: 1 for Iris-setosa, 2 for Iris-virginica, and 3 for Iris-versicolor.

After training a model on this data, the algorithm treats the encoding as meaningful: it assumes the difference between Iris-setosa and Iris-virginica is the same as the difference between Iris-virginica and Iris-versicolor. But in reality there is no such ordering, and one could just as well choose any other numeric values, which would produce a new model with new relationships that might make different predictions on the test data.

If there is a natural ordering in the response values (such as low, medium, and high), then encoding them as ordinal data with equal gaps can be reasonable, but most of the time it is not practical.
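
To make this arbitrariness concrete, here is a minimal sketch (the petal-length values are made up) that fits linear regression under two different, equally valid encodings of the same three species; the two models give different predictions for the same input:

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up petal lengths for six flowers
X = np.array([[1.4], [4.7], [5.9], [1.3], [4.2], [6.1]])

# Two equally arbitrary numeric encodings of the same species labels
y_a = np.array([1, 2, 3, 1, 2, 3])  # setosa=1, virginica=2, versicolor=3
y_b = np.array([1, 3, 2, 1, 3, 2])  # setosa=1, virginica=3, versicolor=2

pred_a = LinearRegression().fit(X, y_a).predict([[5.0]])
pred_b = LinearRegression().fit(X, y_b).predict([[5.0]])
print(pred_a, pred_b)  # different encodings, different predictions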

Now consider binary classification (only two classes, such as True/False or 1/0). For example, if we need to predict whether a customer will churn, we could ignore the fact that y is discrete-valued and use our linear regression algorithm to predict the output y for given X values.

We could predict that the customer will churn if y >= 0.5 and will not churn otherwise. But the output can be hard to interpret, since the predicted value may be greater than 1 or smaller than 0, while in reality y must lie in the range of 0 and 1. The sketch below shows this happening.
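
A minimal sketch, with made-up values, of linear regression on 0/1 labels producing predictions outside the [0, 1] range:

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up single feature and binary churn labels
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1])

model = LinearRegression().fit(X, y)
print(model.predict([[0.0], [10.0]]))  # [-0.5  2.5] -- below 0 and above 1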

Representation

To avoid this problem, we need a function whose output lies between 0 and 1 for all values of X. For that, we apply the sigmoid (logistic) function to the linear regression model function.

The hypothesis function for linear regression is represented as:

h_θ(x) = θᵀx

Here, theta represents the parameters of the model and x is the input vector.

The sigmoid function is represented as follows:

g(z) = 1 / (1 + e^(-z))

Here, z is any real number, and the output g(z) always lies in the range (0, 1).

Applying the sigmoid function to the linear regression hypothesis gives us the representation of the logistic function:

h_θ(x) = g(θᵀx) = 1 / (1 + e^(-θᵀx))

This is the representation of the logistic regression model. By learning the values of theta from the training data, we can compute the value of the hypothesis function for a given input x, and from that value we can decide the class of the given instance.

The sigmoid function does not produce the actual output directly (for binary classification, the actual output is 1 or 0); instead, it computes the probability of each class.

For binary classification, if the probability is less than 0.5, the label is typically 0 (negative); if it is greater than or equal to 0.5, the label is 1 (positive).
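
A minimal NumPy sketch of this decision rule (the parameter values here are made up, not learned):

import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters theta and input vector x (first element is the bias)
theta = np.array([0.5, -1.2, 0.8])
x = np.array([1.0, 0.4, 2.0])

probability = sigmoid(theta @ x)  # P(y = 1 | x), about 0.84 here
label = 1 if probability >= 0.5 else 0
print(probability, label)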

Evaluation

The cost function for logistic regression depends on the data point’s actual class. As said earlier, the model’s output is the probability of each class. The cost used is the log loss (cross-entropy): for a data point of class 1 the loss is -log(h), and for a data point of class 0 it is -log(1 - h), where h is the predicted probability of class 1. Suppose the model outputs 0.75 for a data point whose class is 1; the loss is -log(0.75) ≈ 0.29, a small penalty. But if that data point actually belongs to class 0, the loss is -log(0.25) ≈ 1.39, a much larger one.
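
A minimal sketch of this per-example cost (the helper function name is made up; scikit-learn exposes the averaged version as sklearn.metrics.log_loss):

import numpy as np

def log_loss_single(y_true, p_pred):
    # Cross-entropy for one example: -log(p) if the true class is 1,
    # -log(1 - p) if the true class is 0
    return -np.log(p_pred) if y_true == 1 else -np.log(1.0 - p_pred)

print(log_loss_single(1, 0.75))  # ~0.29: fairly confident and correct
print(log_loss_single(0, 0.75))  # ~1.39: fairly confident but wrong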

In multi-class classification, we calculate the probability of each class by choosing one class at a time as the positive (1) class, treating all remaining classes as negative (0), and computing the hypothesis for each; this scheme is known as one-vs-rest.
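
A minimal sketch of this one-vs-rest scheme on the iris dataset mentioned earlier, using scikit-learn’s OneVsRestClassifier to fit one binary logistic regression per class:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# One binary logistic regression per species, each treating its own
# class as 1 and the other two as 0
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

print(ovr.predict_proba(X[:1]))  # probability of each of the three species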

Assumptions for Logistic Regression

  • Logistic regression assumes that the dependent variable y is categorical.
  • It assumes that there is no multicollinearity between the independent features; one quick check is sketched below.
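
One quick, informal way to check the second assumption is to inspect pairwise feature correlations (a sketch with made-up churn-like features; variance inflation factors are the more formal tool):

import pandas as pd

# Made-up numeric features; very high off-diagonal correlations
# would suggest multicollinearity
df = pd.DataFrame({
    "tenure": [1, 24, 60, 12, 48],
    "monthly_charges": [29.9, 56.1, 99.7, 42.3, 81.5],
    "total_charges": [29.9, 1350.0, 5980.0, 510.0, 3910.0],
})
print(df.corr().abs())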

What is the difference between Linear and Logistic Regression?

Apart from the assumptions about the dependent variable y, the key differences between these two regression models are:

  • With the squared-error cost, linear regression’s cost function is convex and has a single global minimum. Plugging the sigmoid hypothesis into that same squared-error cost would make it non-convex, with multiple local minima, which is why logistic regression is trained with the log loss instead; the log loss is again convex.
  • Logistic regression predicts the probability of a particular class, while linear regression predicts an actual numeric value.

Types of Logistic Regression

  • Binomial Logistic Regression — In binomial logistic regression, there are only two possible outcomes, for example, 1 or 0, True or False, etc.
  • Multinomial Logistic Regression — In multinomial logistic regression, there are more than two possible unordered outcomes, for example, the iris-flower species Iris-setosa, Iris-virginica, and Iris-versicolor.
  • Ordinal Logistic Regression — In ordinal logistic regression, there are more than two possible ordered outcomes, for example, low, medium, or high.

Implementation using Scikit-learn

Let’s implement the logistic regression algorithm using scikit-learn, the popular Python library for machine learning.

For our classification example, we will use churn prediction data; you can download this dataset from Kaggle.

Our objective here is to build a logistic regression model using scikit-learn to predict whether a customer will churn or not. The data used to train the model is assumed to already be preprocessed.

The first step after preparing the dataset is to split the data into a training set and a test set, and optionally a validation set, so that the model can learn from the training dataset, be evaluated and tuned on the validation dataset, and have its final performance checked on the test dataset.
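
For reference, a three-way split could look like the sketch below (this article itself uses only a train/test split; the 60/20/20 ratio is just an example):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")  # same file we load below

# Hold out 20% as the final test set
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
# 25% of the remaining 80% is 20% of the whole, leaving 60% for training
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)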

Let’s first import the necessary libraries and read the data into a pandas DataFrame using the read_csv function.

# Importing libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Load the data
df = pd.read_csv("data.csv")

Now, let’s split the data into training and test sets, holding out 30% of the overall dataset for testing.

# Split the data
df_train, df_test = train_test_split(df, test_size=0.3, random_state=1)

# shape
print("Training shape:: ", df_train.shape)
print("Testing shape:: ", df_test.shape)
## Output
Training shape:: (4930, 21)
Testing shape:: (2113, 21)

From the output, we can see that our training dataset has 4930 examples, or data points, and the test dataset has 2113. In the next step, we will transform the data and train a logistic regression model on the training dataset.
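
Before the full training code, here is a toy illustration (with made-up records) of what DictVectorizer does: it one-hot encodes string values and passes numeric values through unchanged.

from sklearn.feature_extraction import DictVectorizer

records = [
    {"contract": "month-to-month", "tenure": 1},
    {"contract": "two-year", "tenure": 24},
]
dv = DictVectorizer(sparse=False)
print(dv.fit_transform(records))   # numeric feature matrix
print(dv.get_feature_names_out())  # e.g. ['contract=month-to-month', ...]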

# Create the target vectors
y_train = df_train['churn']
y_test = df_test['churn']

# Remove the `churn` attribute from the feature data
del df_train['churn']
del df_test['churn']

# Convert the feature rows into dictionaries for DictVectorizer
train_dict = df_train.to_dict(orient='records')
test_dict = df_test.to_dict(orient='records')

# Data transformation
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dict)
X_test = dv.transform(test_dict)

# Train the model
clf = LogisticRegression(max_iter=1000, solver='liblinear')
clf.fit(X_train, y_train)

# Predicted class probabilities for the first training example
y_preds = clf.predict_proba(X_train)
print(y_preds[0])
## Output - [0.8218741 0.1781259]

These are the predicted probabilities for the first instance in the training set. The predict_proba method of a classifier returns the probability of each class. Since this is a binary classification problem, the above output is an array containing the probabilities of class 0 and class 1, respectively, for the first training example.

The following code selects the probabilities of class 1, since we want the probability that a customer churns.

# Model evaluation
y_preds = clf.predict_proba(X_train)[:, 1]
auc = roc_auc_score(y_train, y_preds)
print("Train score:: %.3f" % auc)
y_preds = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_preds)
print("Test score:: %.3f" % auc)
## Output -
Train score:: 0.845
Test score:: 0.858

Note that the score above is the ROC AUC, not plain accuracy: it measures how well the model ranks churning customers above non-churning ones. With an AUC of about 0.86 on the test set, our logistic regression model separates the two classes reasonably well.
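
If we also want plain accuracy at the default 0.5 threshold, a short sketch continuing from the code above:

from sklearn.metrics import accuracy_score

# predict applies a 0.5 probability threshold internally
y_test_labels = clf.predict(X_test)
print("Test accuracy:: %.3f" % accuracy_score(y_test, y_test_labels))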

In this article, we learned about the logistic regression algorithm for classification, with a brief mathematical explanation and a churn-prediction case study. Even though logistic regression combines the hypothesis function of linear regression with the sigmoid function, the cost function used to evaluate it is quite distinct from linear regression’s.

I hope this brief introduction helps you to understand some of the basic concepts of Logistic Regression algorithms.

Thank you for reading!
