What is Logistic Regression?

Piyushsaluja
9 min read · Aug 9, 2023

Logistic regression is a type of supervised learning algorithm that belongs to the family of linear models. It is similar to linear regression, but instead of predicting a continuous value, it predicts the probability of an event occurring. For example, you can use logistic regression to predict whether a customer will buy a product or not, based on their age, gender, income, and other features.

The basic idea of logistic regression is to find a linear combination of the features that best separates the two classes. However, since the output is a probability, we need to transform the linear combination into a value between 0 and 1. This is done by applying a special function called the sigmoid function or the logistic function. The sigmoid function has an S-shaped curve that maps any real number to a value between 0 and 1. Here is the formula for the sigmoid function:

σ(x) = 1 / (1 + e^(−x))

The graph of the sigmoid function is an S-shaped curve that approaches 0 for large negative values of x, passes through 0.5 at x = 0, and approaches 1 for large positive values of x.

The sigmoid function has some nice properties that make it suitable for logistic regression. For example, it is monotonically increasing, which means that its output always grows as x grows. It also has a clear interpretation: the higher the value of x, the higher the predicted probability of the event occurring.
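
To make this concrete, here is a minimal sketch of the sigmoid function in NumPy; the helper name sigmoid is our own, not part of any library used later:

import numpy as np

def sigmoid(x):
    """Map any real number to a value between 0 and 1."""
    return 1 / (1 + np.exp(-x))

# Large negative inputs approach 0, large positive inputs approach 1
print(sigmoid(-5))  # ~0.0067
print(sigmoid(0))   # 0.5
print(sigmoid(5))   # ~0.9933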

To perform logistic regression, we need to find the optimal values of the coefficients that multiply the features in the linear combination. These coefficients are also called weights or parameters. We can use different methods to find these values, such as gradient descent, Newton’s method, or maximum likelihood estimation. The goal is to minimize a loss function that measures how well the model fits the data. One common loss function for logistic regression is called log loss or cross-entropy loss. It is defined as follows:

Log Loss = −(1/n) Σᵢ [ yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ) ]

where n is the number of observations, yᵢ is the actual label (0 or 1) of the i-th observation, and ŷᵢ is the predicted probability of the i-th observation.

The log loss function penalizes wrong predictions by comparing them with the actual labels. The lower the log loss, the better the model fits the data.
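
To make the formula concrete, here is a hedged sketch that computes log loss by hand and checks it against scikit-learn’s log_loss function; the labels and probabilities are made up for illustration:

import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])          # actual labels
y_prob = np.array([0.9, 0.2, 0.6, 0.8])  # predicted probabilities

# Average of -[y*log(p) + (1-y)*log(1-p)] over all observations
manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(manual)                    # ~0.266
print(log_loss(y_true, y_prob))  # same value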

How to Implement Logistic Regression in Python?

There are several packages in Python that can help us implement logistic regression easily and efficiently. Some of the most popular ones are scikit-learn, statsmodels, and TensorFlow. In this article, I will focus on scikit-learn and statsmodels, as they are more user-friendly and widely used for data analysis and machine learning.

Logistic Regression in Python with scikit-learn

scikit-learn is one of the most popular and comprehensive packages for machine learning in Python. It provides a variety of tools and algorithms for data preprocessing, feature extraction, model selection, evaluation, and more. It also has a consistent and simple interface that makes it easy to use and integrate with other packages.

To perform logistic regression in Python with scikit-learn, you need to follow these steps:

  • Import the LogisticRegression class from the sklearn.linear_model module.
  • Create an instance of the LogisticRegression class and pass some parameters that control how the model is trained and fitted. Some of these parameters are penalty, C, solver, max_iter, and multi_class.
  • Use the fit method to train the model on the data, passing the features and labels as arguments.
  • Use the predict or predict_proba methods to make predictions on new data, passing the features as an argument.
  • Use the sklearn.metrics module to evaluate the performance of the model using different metrics such as accuracy, precision, recall, F1-score, ROC curve, and AUC score.

Here is a simple code snippet that uses the iris dataset, a famous dataset that contains 150 observations of three different species of iris flowers, each with four features: sepal length, sepal width, petal length, and petal width. The code performs the following steps:

  • Import the LogisticRegression class from the sklearn.linear_model module and the load_iris function from the sklearn.datasets module.
  • Load the iris dataset and assign the features and labels to X and y variables, respectively.
  • Create an instance of the LogisticRegression class with default parameters and assign it to a variable called model.
  • Use the fit method to train the model on the data, passing X and y as arguments.
  • Use the predict method to make predictions on the same data, passing X as an argument, and assign the result to a variable called y_pred.
  • Use the accuracy_score function from the sklearn.metrics module to calculate the accuracy of the model on the data, passing y and y_pred as arguments, and print the result.

Here is the code:

# Import the classes and functions we need
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Load the iris dataset
X, y = load_iris(return_X_y=True)

# Create a logistic regression model
# (max_iter is raised from the default of 100 so the lbfgs solver converges)
model = LogisticRegression(max_iter=1000)

# Train the model on the data
model.fit(X, y)

# Make predictions on the same data
y_pred = model.predict(X)

# Calculate and print the accuracy score
accuracy = accuracy_score(y, y_pred)
print(f"The accuracy of the model is {accuracy:.2f}")

The output of this code is:

The accuracy of the model is 0.97

This means that the model correctly predicted 97% of the labels in the iris dataset. This is a very high accuracy score, but it may not be very realistic, since we used the same data for both training and testing. To get a more accurate estimate of how well the model can generalize to new data, we should use a train-test split or cross-validation technique.
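
For example, here is a minimal sketch of the same workflow with a held-out test set; the 80/20 split and the random_state value are arbitrary choices:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on data the model has never seen during training
y_pred = model.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.2f}")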

Logistic Regression in Python with statsmodels

statsmodels is another popular package for statistical modeling and analysis in Python. It provides a variety of tools and algorithms for data exploration, estimation, inference, hypothesis testing, and more. It also takes a more statistical approach than scikit-learn and produces more detailed statistical output.

To perform logistic regression with statsmodels, we need to import the logit function from the statsmodels.formula.api module:

from statsmodels.formula.api import logit

Then we need to call this function and pass a formula and a data frame as arguments. The formula specifies the relationship between the dependent variable and the independent variables using a syntax similar to R. The data frame contains the variables used in the formula. For example, if we have a data frame called df with three columns: y (the dependent variable), x1 and x2 (the independent variables), we can write:

model = logit('y ~ x1 + x2', data=df)

This will create a logistic regression model where y is predicted by x1 and x2.

To train the model on the data, we need to use the fit method without any arguments. This will return a results object that contains various information about the model. For example:

results = model.fit()

To view a summary of the results, we can use the summary method on the results object. This will display a table that contains various statistics about the model, such as coefficients, standard errors, p-values, confidence intervals, log-likelihood, AIC, BIC, and more. For example:

print(results.summary())

This will print a table with the estimated coefficients, their standard errors, z-statistics, p-values, and confidence intervals, along with model-level statistics such as the log-likelihood and pseudo R-squared.

To make predictions on new data, we need to use the predict method on the results object and pass a data frame that contains the features as an argument. The new data frame must have the same column names as the ones used in the formula. For example, if we have a new data frame called df_new with two columns, x1 and x2, we can write:

y_pred = results.predict(df_new)
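
Putting these steps together, here is a minimal end-to-end sketch; the synthetic data frame and the coefficients used to generate it are invented for illustration:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y depends on x1 and x2 through a logistic model
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
log_odds = 0.5 + 1.5 * df["x1"] - 1.0 * df["x2"]
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

# Fit the model and inspect the statistical summary
results = smf.logit("y ~ x1 + x2", data=df).fit()
print(results.summary())

# Predicted probabilities for new observations
# (column names must match the ones used in the formula)
df_new = pd.DataFrame({"x1": [0.0, 1.0], "x2": [0.0, -1.0]})
print(results.predict(df_new))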

What are the Advantages and Disadvantages of Logistic Regression?

Logistic regression has some advantages and disadvantages that we should be aware of before using it for our problems. Some of the advantages are:

  • It is easy to implement and interpret.
  • It can handle both numerical and categorical features.
  • It can provide probabilities for each prediction, which can be useful for decision making and risk assessment.
  • It can perform well with a small number of observations and features.

Some of the disadvantages are:

  • It assumes a linear relationship between the features and the log-odds of the outcome, which may not hold in reality.
  • It can be sensitive to outliers and multicollinearity, which can affect the accuracy and stability of the model.
  • It can suffer from overfitting or underfitting if the regularization parameter is not chosen properly (see the sketch after this list).
  • It can only handle binary classification problems or multiclass problems with one-vs-rest or softmax strategies.
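
In scikit-learn, for example, the regularization strength is controlled by the C parameter (the inverse of the regularization strength), and a reasonable value can be chosen by cross-validation. Here is a hedged sketch using LogisticRegressionCV on the iris data:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegressionCV

X, y = load_iris(return_X_y=True)

# Try 10 values of C on a log scale and pick the best by 5-fold cross-validation
model = LogisticRegressionCV(Cs=10, cv=5, max_iter=1000)
model.fit(X, y)

print("Chosen C per class:", model.C_)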

What are some of the Assumptions and Limitations of Logistic Regression?

Logistic regression makes some assumptions about the data and the model that we should check and validate before using it for our problems. Some of these assumptions are:

  • The outcome variable is binary or dichotomous, which means that it has only two possible values (0 or 1).
  • The features are independent of each other, which means that there is no multicollinearity or correlation among them.
  • The features have a linear relationship with the log-odds of the outcome variable, which means that increasing or decreasing a feature by one unit changes the log-odds by a constant amount.
  • The error terms are independent and identically distributed (i.i.d.), which means that they have the same variance and are not correlated with each other or with the features.

Some of these assumptions can be checked by using different methods and techniques such as correlation matrix, variance inflation factor (VIF), scatter plots, residual plots, goodness-of-fit tests, and more.
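
For instance, the multicollinearity assumption can be checked with a correlation matrix and variance inflation factors; here is a hedged sketch using statsmodels on the iris features, where a VIF above roughly 5 to 10 is commonly read as a warning sign:

import statsmodels.api as sm
from sklearn.datasets import load_iris
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Load the iris features as a data frame
features = load_iris(as_frame=True).data

# Pairwise correlations between the features
print(features.corr())

# VIF for each feature (a constant column is added, as statsmodels expects)
X = sm.add_constant(features)
for i, name in enumerate(X.columns):
    if name != "const":
        print(f"{name}: VIF = {variance_inflation_factor(X.values, i):.1f}")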

If some of these assumptions are violated or not met, then logistic regression may not be appropriate or reliable for our problems. In that case, we may need to use some alternative methods or techniques such as:

  • Transforming or scaling the features to reduce skewness or outliers.
  • Adding or removing features to avoid multicollinearity or underfitting.
  • Using polynomial or interaction terms to capture non-linear relationships.
  • Using regularization techniques to prevent overfitting or underfitting.
  • Using other classification algorithms such as decision trees, random forests, support vector machines (SVM), neural networks, etc.

Metrics for Classification

Metrics for classification are measures of how well a model can predict the correct class labels for a given set of input data. There are many different metrics for classification, depending on the type and purpose of the problem. Most of them are built from the four entries of the confusion matrix: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Using these four values, we can calculate various metrics that measure different aspects of the model’s performance. Here are some of the most common metrics and their formulas:

Confusion Matrix

A confusion matrix summarizes the four possible outcomes of a binary prediction:

                  Predicted positive   Predicted negative
Actual positive   TP (true positive)   FN (false negative)
Actual negative   FP (false positive)  TN (true negative)

  • Accuracy: This is the ratio of correctly predicted labels to the total number of predictions. It is calculated as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

  • Precision: This is the ratio of correctly predicted positive labels to the total number of positive predictions. It is calculated as:

Precision = TP / (TP + FP)

  • Recall: This is the ratio of correctly predicted positive labels to the total number of actual positive labels. It is calculated as:

Recall = TP / (TP + FN)

  • F1-score: This is the harmonic mean of precision and recall. It is calculated as:

F1-score = 2 × (Precision × Recall) / (Precision + Recall)

  • ROC curve: This is a plot of the true positive rate (recall) versus the false positive rate (1 − specificity) for different threshold values. It shows how well a model can discriminate between the positive and negative classes. To plot a ROC curve, we need to calculate the true positive rate (TPR) and the false positive rate (FPR) for each possible threshold value, and then plot them on a graph. The TPR and FPR are calculated as:

TPR = TP / (TP + FN)

FPR = FP / (FP + TN)

  • AUC score: This is the area under the ROC curve. It is a measure of how well a model can rank the predictions from most likely to be positive to least likely. To calculate the AUC score, we need to use a numerical method such as the trapezoidal rule or Simpson’s rule to approximate the area under the curve.
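
As a practical illustration, here is a hedged sketch that computes these metrics with scikit-learn; the labels and scores are made up for the example:

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                    # predicted labels
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]   # predicted probabilities

print(confusion_matrix(y_true, y_pred))  # rows: [TN, FP], [FN, TP]
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1-score: ", f1_score(y_true, y_pred))

# The AUC score uses the predicted scores, not the hard labels
print("AUC:      ", roc_auc_score(y_true, y_score))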

These are some examples of how to calculate the most common classification metrics from their mathematical formulas.
