Logistic Regression: An Introduction

Vincent Favilla
8 min read · May 29, 2023


View the accompanying Colab notebook.

Logistic regression is a popular machine learning technique used to predict the probability of an event occurring based on input data. For example, it can be used to predict whether a customer will make a purchase based on their browsing history and demographic information. In this introductory post of my logistic regression series, we’ll explore the basics of logistic regression, discuss its assumptions, and see some examples with actual data.

Logistic Regression vs. Linear Regression

While linear regression is used to model the relationship between predictor variables and a continuous outcome variable, logistic regression is used for binary classification problems, where the outcome variable has only two possible values. Logistic regression models the probability of the outcome occurring given the predictor variables, and classifies the outcome based on a threshold probability value.

Logistic Regression: A Brief Overview

Logistic regression is a type of regression analysis used for predicting binary outcomes. It models the relationship between a set of predictor variables and a binary outcome variable by estimating the probability of the outcome occurring given the predictor variables.

Logit Function

The logit function is the natural logarithm of the odds of the outcome occurring, which is the ratio of the probability of the outcome occurring to the probability of the outcome not occurring. Mathematically, the logit function is defined as:

logit(p) = log(p / (1 - p))

The logit function helps us transform the probability values (ranging from 0 to 1) into a continuous range of values. This is useful because it allows us to use linear regression techniques to model the relationship between predictor variables and the logit of the probability.
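As a quick illustration, here’s a minimal NumPy sketch (the probabilities are arbitrary example values) showing how the logit stretches the (0, 1) interval onto the whole real line:

import numpy as np

def logit(p):
    # Log-odds: the natural log of p / (1 - p)
    return np.log(p / (1 - p))

# Probabilities below 0.5 map to negative values, 0.5 maps to 0,
# and probabilities above 0.5 map to positive values
for p in [0.01, 0.25, 0.5, 0.75, 0.99]:
    print(f"logit({p}) = {logit(p):.2f}")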

Sigmoid Function

The logistic function, also known as the sigmoid function, is the inverse of the logit function. It is used to transform the linear combination of predictor variables into a probability value between 0 and 1. Mathematically, the sigmoid function is defined as:

sigmoid(x) = 1 / (1 + exp(-x))

The sigmoid function takes the linear combination of predictor variables and maps it to a probability value between 0 and 1. This probability value can then be used to classify the outcome based on a threshold probability value (e.g., if the probability is greater than 0.5, classify the outcome as 1, otherwise classify it as 0).

We can plot the sigmoid function in Python like so:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-10, 10, 100)
y = sigmoid(x)

plt.plot(x, y)
plt.xlabel('x')
plt.ylabel('sigmoid(x)')
plt.title('Sigmoid Function')
plt.grid(True)
plt.show()
Figure: the sigmoid function.

Estimating Coefficients

In logistic regression, coefficients represent the relationship between the predictor variables and the logit of the probability of the outcome occurring. The coefficients are estimated using a technique called maximum likelihood estimation, which aims to find the values that maximize the likelihood of the observed data.

The maximum likelihood estimation process involves iteratively updating the coefficients to find the values that maximize the likelihood of the observed data. This is typically done using an optimization algorithm, such as gradient descent or Newton’s method.
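To make this concrete, here’s a minimal sketch of gradient ascent on the log-likelihood (equivalently, gradient descent on its negative) using NumPy. The toy data, learning rate, and iteration count are illustrative assumptions; real libraries use more sophisticated optimizers:

import numpy as np

# Made-up toy data: an intercept column plus one predictor
X = np.array([[1, -2], [1, -1], [1, 0], [1, 1], [1, 2], [1, 3]], dtype=float)
y = np.array([0, 0, 1, 0, 1, 1], dtype=float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

beta = np.zeros(X.shape[1])  # start all coefficients at zero
learning_rate = 0.1

for _ in range(5000):
    p = sigmoid(X @ beta)             # current predicted probabilities
    gradient = X.T @ (y - p)          # gradient of the log-likelihood
    beta += learning_rate * gradient  # step uphill to increase the likelihood

print(beta)  # estimated intercept and slope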

Interpreting Coefficients

Once we have estimated the coefficients, we can use them to understand how the predictor variables influence the outcome. In logistic regression, the coefficients tell us how the log-odds of the outcome change when the predictor variable increases by one unit, while keeping all other predictor variables constant.

For example, let’s say we have a coefficient of 0.5 for a predictor variable called “age.” This means that when a person’s age increases by one year, the log-odds of the outcome (e.g., making a purchase) increase by 0.5, assuming all other factors (e.g., browsing history, demographic information) remain the same. The log-odds can then be converted back to probabilities using the sigmoid function, which helps us understand the likelihood of the outcome occurring.
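To make that interpretation concrete, exponentiating a coefficient converts it from a change in log-odds to an odds ratio. Using the assumed coefficient of 0.5 from above:

import numpy as np

# exp(coefficient) is the multiplicative change in odds per one-unit increase
odds_ratio = np.exp(0.5)
print(odds_ratio)  # ~1.65: each extra year multiplies the odds of purchase by about 1.65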

Making Predictions with Logistic Regression

To predict outcomes using logistic regression, we follow these steps:

1. Combine predictor variables and coefficients: We multiply each predictor variable by its corresponding coefficient and add them together. This is called the linear combination of predictor variables and coefficients.

2. Calculate the probability: We use the sigmoid function to convert the linear combination from step 1 into a probability value between 0 and 1. This probability represents the likelihood of the outcome occurring (e.g., making a purchase).

3. Classify the outcome: We decide on a threshold probability value, often 0.5, to classify the outcome. If the probability calculated in step 2 is greater than the threshold, we predict the outcome as 1 (e.g., the customer will make a purchase). If the probability is less than or equal to the threshold, we predict the outcome as 0 (e.g., the customer will not make a purchase).

By following these steps, we can use logistic regression to make predictions for binary outcomes based on the predictor variables and their coefficients.
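Here’s a minimal sketch of those three steps carried out by hand with NumPy. The intercept and coefficients below are made-up values for illustration, not fitted to any real data:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up model parameters for illustration
intercept = -10.0
coefficients = np.array([0.2, 0.0001])  # coefficients for age and income

customer = np.array([35, 70000])  # age 35, income 70000

# Step 1: linear combination of predictors and coefficients
z = intercept + customer @ coefficients

# Step 2: convert the linear combination to a probability
probability = sigmoid(z)

# Step 3: classify using a 0.5 threshold
prediction = int(probability > 0.5)

print(probability, prediction)  # ~0.98, so we predict a purchase (1)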

Example with Actual Data

Consider a logistic regression model that predicts whether a customer will make a purchase based on their age and income. The model estimates the probability of a purchase occurring given the customer’s age and income, and classifies the customer as a buyer or non-buyer based on a threshold probability value.

Suppose we have the following data:

| Age | Income | Purchased |
|-----|--------|-----------|
| 25 | 50000 | 0 |
| 30 | 60000 | 0 |
| 35 | 70000 | 1 |
| 40 | 80000 | 1 |

Using logistic regression, we can fit a model to this data and predict the probability of a purchase for new customers based on their age and income.

To do this, the code in Python is as simple as:

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Load data (assume we have this saved in a csv file)
data = pd.read_csv('customer_data.csv')

# Define features and target variable
X = data[['Age', 'Income']]
y = data['Purchased']

# Fit logistic regression model
model = LogisticRegression()
model.fit(X, y)

# Predict probability of purchase for a new customer (age 25, income 50000)
new_customer = pd.DataFrame([[25, 50000]], columns=['Age', 'Income'])

prob_purchase = model.predict_proba(new_customer)[:, 1]

print('Probability of purchase:', prob_purchase)

Assumptions of Logistic Regression

Logistic regression makes several key assumptions:

  • Linear relationship: Assumes a linear relationship between the predictor variables and the logit of the outcome variable.
  • Independence of predictor variables: Assumes that the predictor variables are independent and that there is no multicollinearity — meaning the predictor variables are not highly correlated with each other.
  • Binary outcome variable: Assumes that the outcome variable is binary (i.e., it has two possible values).

If any of these assumptions are violated, the model may not perform well. However, methods do exist for handling multiclass logistic regressions (i.e., when the outcome variable has more than two possible values) and for dealing with non-linear relationships between the predictor variables and the outcome variable. I’ll discuss them in a future post.
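One simple way to eyeball the multicollinearity assumption is to inspect pairwise correlations between predictors. This sketch reuses the data DataFrame from the earlier purchase example; correlations near +1 or -1 are a warning sign worth investigating further:

# Pairwise correlations between predictors; values near +/-1 suggest multicollinearity
print(data[['Age', 'Income']].corr())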

Model Evaluation and Performance Metrics

When evaluating the performance of a logistic regression model, it’s important to consider metrics beyond just accuracy, as accuracy can be misleading in certain situations, such as imbalanced datasets. Some common performance metrics for logistic regression include:

1. Precision. The proportion of true positive predictions among all positive predictions. Optimize for precision when the cost of false positives is high (e.g., convicting someone of a crime; it’s better to let a guilty person go free than to convict an innocent person).

2. Recall. The proportion of true positive predictions among all actual positives. Optimize for recall when the cost of false negatives is high (e.g., diagnosing a disease; it’s better to diagnose a healthy person as sick than to miss a sick person).

3. F1 score. The harmonic mean of precision and recall, providing a balanced measure of both metrics.

4. Confusion Matrix. A table that shows the number of true positives, true negatives, false positives, and false negatives for a classification model. It helps to visualize the performance of the model and identify any potential issues.

5. ROC-AUC Score. The area under the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate (recall) against the false positive rate. A higher ROC-AUC score indicates better model performance.

If you’re not sure where to start, a good starting point is to calculate the F1 score, which provides a balanced measure of both precision and recall. It’s also worth plotting the confusion matrix to help you visualize the performance of the model and identify any potential issues. Once you’re comfortable with these metrics, you can explore other metrics like ROC-AUC score to further evaluate your model.

Here’s some sample Python code using scikit-learn to calculate the F1 score and plot the confusion matrix:

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Load the dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2)

# Train the logistic regression model
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate F1 score
f1 = f1_score(y_test, y_pred)
print(f"F1 Score: {f1}")

# Plot confusion matrix (normalized by true class)
cm = confusion_matrix(y_test, y_pred, normalize='true')
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

By starting with the F1 score and confusion matrix, you can gain a better understanding of your model’s performance and gradually explore other metrics as needed.

ROC-AUC score is particularly useful because it evaluates your model based on the probabilities it assigns to each prediction. Rather than judging your model’s binary predictions alone, it looks at the probability behind each prediction and evaluates how well the model can differentiate between classes. This is especially useful when dealing with imbalanced datasets.
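Continuing the breast cancer example above, here’s a minimal sketch of computing it. Note that roc_auc_score expects the predicted probabilities of the positive class, not the hard 0/1 predictions:

from sklearn.metrics import roc_auc_score

# Probabilities for the positive class, not hard labels
y_prob = model.predict_proba(X_test)[:, 1]
print("ROC-AUC:", roc_auc_score(y_test, y_prob))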

Conclusion

Logistic regression has numerous real-world applications, including:

  • Predicting customer churn: Logistic regression can be used to predict whether a customer will cancel their subscription based on their usage patterns and demographic information.
  • Medical diagnosis: Logistic regression can be used to predict the presence or absence of a disease based on patient symptoms and test results.
  • Spam detection: Logistic regression can be used to classify emails as spam or not spam based on the content and metadata of the email.

You’re certain to build models like these throughout your data science career; logistic regression is favored because it’s relatively simple to implement and interpret, and, perhaps just as importantly, it’s fast to train and make predictions with (especially when compared to more complex models).

In this first part of the series, we’ve explored the basics of logistic regression, discussed its assumptions, and seen a brief example with actual data in Python. In the next part, we’ll delve into regularization in logistic regression, including L1 and L2 regularization, convexity, and choosing the appropriate regularization technique.


