Mastering Logistic Regression in Python with StatsModels

Vincent Favilla
10 min read · Jun 4, 2023


View the accompanying Colab notebook.

In this tutorial, we’ll explore how to perform logistic regression using the StatsModels library in Python. We’ve previously covered logistic regression using scikit-learn, but StatsModels provides more detailed statistics, which can be useful for understanding the model’s performance.

StatsModels is a Python library built specifically for statistics. It provides a wide range of statistical models, tests, and data exploration tools. In this tutorial, we will focus on using StatsModels for logistic regression.

Importing Libraries and Datasets

First, let’s import the necessary libraries and load the dataset. We’ll use the built-in breast cancer dataset from scikit-learn.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the breast cancer dataset and separate the features and target
data = datasets.load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

Preparing the Dataset

Now, let’s prepare the dataset by splitting it into training and testing sets, standardizing the data, and removing highly correlated features. Dropping one feature from each highly correlated pair reduces multicollinearity, which would otherwise make the coefficient estimates unstable.

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)

# Standardize the data, keeping the feature names
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns)

# Remove highly correlated features (|correlation| > 0.9)
corr_matrix = X_train.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.9)]

X_train = X_train.drop(to_drop, axis=1)
X_test = X_test.drop(to_drop, axis=1)

Feature Selection Techniques

In addition to evaluating the model, you can also improve its performance and interpretability by selecting the most relevant features.

The code for recursive feature elimination and regularization is actually quite long using the statsmodels API, but you can find it in my logistic regression Colab notebook. It’s quite neat to see, and the benefit of using statsmodels over scikit-learn is that it provides more detailed statistical information about the model.
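To give you a flavor of the regularization approach, here’s a minimal sketch using statsmodels’ fit_regularized method with an L1 penalty. The alpha value below is an arbitrary placeholder rather than a tuned setting, and I add the constant here only so the sketch mirrors the full model we fit in the next section:

# A minimal sketch of L1-regularized logistic regression with statsmodels
# (alpha is a placeholder; in practice you'd tune it, e.g. via cross-validation)
logit_l1 = sm.Logit(y_train, sm.add_constant(X_train))
result_l1 = logit_l1.fit_regularized(method='l1', alpha=1.0, disp=0)

# Coefficients pushed to (roughly) zero are candidates for removal
print(result_l1.params[result_l1.params.abs() < 1e-6])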

Performing Logistic Regression with StatsModels

Now that our dataset is prepared, we can perform logistic regression using StatsModels. First, we need to add a constant term to our features, as StatsModels does not include it by default.

The constant term, also known as the bias or intercept, gives the model a baseline: it sets the log-odds of the outcome when all the input features are zero, so the baseline predicted probability doesn’t have to be 0.5. Including a constant term in the model can improve its flexibility and ability to fit the data.

# Add a constant term to the train and test data
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

# Fit the logistic regression model
logit_model = sm.Logit(y_train, X_train)
result = logit_model.fit()

Interpreting Results

Once the model is fitted, we can view the summary of the results, which includes various statistics that we can use to understand our model:

# Print the summary of the results
print(result.summary())

This outputs a bunch of useful information about our model. Let’s start with the header:

Optimization terminated successfully.
         Current function value: 0.034995
         Iterations 15
                           Logit Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  455
Model:                          Logit   Df Residuals:                      434
Method:                           MLE   Df Model:                           20
Date:                Sat, 03 Jun 2023   Pseudo R-squ.:                  0.9470
Time:                        20:35:24   Log-Likelihood:                -15.923
converged:                       True   LL-Null:                       -300.17
Covariance Type:            nonrobust   LLR p-value:                1.233e-107
==============================================================================

Lots of math here! Let’s go over the table.

  • Dep. Variable: The name of the dependent variable, which is the target variable “y” in our case.
  • Model: The type of model used, which is logistic regression (Logit) in our case.
  • Method: The method used to fit the model, which is Maximum Likelihood Estimation (MLE) in our case.
  • Date and Time: The date and time when the model was fitted.
  • converged: A boolean value indicating whether the optimization algorithm has converged or not. Convergence means that the algorithm has found the best possible coefficients for our logistic regression model, and further iterations would not significantly improve the model’s performance.
  • Covariance Type: This term refers to the technique employed for estimating the variability of the coefficients, specifically the covariance matrix of these estimated coefficients. The covariance matrix offers insights into the potential changes in the coefficients if the model were to be fitted on various data samples.
    The term “nonrobust” in our output indicates that the default method was used, which assumes that the errors — the differences between predicted probabilities and actual target values — are independently and identically distributed (i.i.d.).
    Note that StatsModels offers other robust covariance matrix estimators for cases where the i.i.d. assumption is not satisfied; however, for this tutorial, we will adhere to the default nonrobust method.
  • No. Observations: The number of observations (samples) used to fit the model.
  • Df Residuals: The degrees of freedom of the residuals, calculated as the number of observations minus the number of parameters (including the constant term). Degrees of freedom represent the number of independent pieces of information that are used to estimate the parameters of the model.
  • Df Model: The number of parameters in the model, excluding the constant term.
  • Pseudo R-squ.: The pseudo R-squared value is an alternative to the traditional R-squared value used in linear regression. It is a measure of the goodness of fit of the model, ranging from 0 to 1, with higher values indicating a better fit. Pseudo R-squared values are used in logistic regression because the traditional R-squared value is not well-defined for this type of model.
  • Log-Likelihood: The log-likelihood measures how well the model explains the observed data. It is calculated by taking the logarithm of the likelihood function, which represents the probability of observing the data given the model’s parameters. A higher log-likelihood value indicates a better fit.
  • LL-Null: The log-likelihood of the null model, which is a model with no predictors (only a constant term). This value is used to calculate the likelihood ratio test, which compares the goodness of fit of our model to the goodness of fit of the null model.
  • LLR p-value: The p-value of the likelihood ratio test, which compares the log-likelihood of the null model that has no predictors to the log-likelihood of our model. A small p-value (typically less than 0.05) indicates that our model is significantly better than the null model, meaning that at least one of the predictors has a significant effect on the target variable.
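All of these headline statistics are also available programmatically on the fitted results object, which is handy if you want to log them or compare several models. Here’s a minimal sketch using the attributes statsmodels exposes on Logit results:

# Pull the headline statistics straight from the fitted results object
print("No. Observations:", int(result.nobs))
print("Df Model:        ", int(result.df_model))
print("Df Residuals:    ", int(result.df_resid))
print("Pseudo R-squ.:   ", result.prsquared)
print("Log-Likelihood:  ", result.llf)
print("LL-Null:         ", result.llnull)
print("LLR p-value:     ", result.llr_pvalue)
print("Converged:       ", result.mle_retvals["converged"])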

Now let’s take a look at the model coefficients and statistics table that appears right below it:

                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
const               2.6317      1.633      1.612      0.107      -0.568       5.832
mean radius       -10.6143      5.213     -2.036      0.042     -20.832      -0.397
mean texture       -6.6298      3.014     -2.200      0.028     -12.537      -0.722
mean smoothness    -5.9281      3.517     -1.686      0.092     -12.820       0.964

This table is truncated; the real output includes a row for every feature in the model. Let’s go over each of the columns:

  • coef: The estimated coefficients for each feature in the model. These values represent the change in the log-odds of the target variable for a one-unit increase in the corresponding feature, holding all other features constant.
  • std err: The standard error of the estimated coefficients is a measure that describes the precision of an estimated value in a regression model and represents the uncertainty in these estimates. It indicates how much the estimated coefficient is likely to fluctuate due to the random sampling of data.
    A smaller standard error suggests that the estimated coefficient is more precise and stable, while a larger standard error indicates more uncertainty about the true effect of a feature on the target variable.
    Essentially, the standard error helps us understand the reliability of the estimated coefficients in a regression model and is used to calculate confidence intervals and p-values.
  • z: The z-score of the estimated coefficients, calculated as the coefficient divided by its standard error. The z-score is used to test the null hypothesis that the true coefficient is zero (i.e., the feature has no effect on the target variable). This column isn’t particularly useful for interpretability, but the values are needed to calculate the next column.
  • P>|z|: The p-value associated with the z-score. A small p-value (typically less than 0.05) indicates that the null hypothesis can be rejected, suggesting that the corresponding feature has a significant effect on the target variable.
  • [0.025 0.975]: The 95% confidence interval for the estimated coefficients. This interval represents the range of values within which we can be 95% confident that the true coefficient lies. If the interval does not contain zero, it suggests that the corresponding feature has a significant effect on the target variable.
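These columns can also be pulled straight from the results object, which makes it easy to filter for statistically significant features. Here’s a minimal sketch (the 0.05 cutoff is just the conventional choice):

# Assemble the coefficient table programmatically
conf = result.conf_int()  # DataFrame of lower and upper 95% bounds
coef_table = pd.DataFrame({
    'coef': result.params,
    'std err': result.bse,
    'p-value': result.pvalues,
    '0.025': conf[0],
    '0.975': conf[1],
})

# Keep only the features significant at the conventional 5% level
print(coef_table[coef_table['p-value'] < 0.05])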

Be Careful When Interpreting the Coefficients

It’s difficult to accurately interpret the coefficients of a logistic regression. First, it’s important to understand that the coefficients in logistic regression represent the log-odds of the outcome variable, which is not as intuitive as the coefficients in linear regression. However, with some additional steps, we can make these coefficients more interpretable:

1. Exponentiate the coefficients: By exponentiating the coefficients, we can obtain the odds ratios, which are more interpretable measures. The odds ratio represents the multiplicative change in the odds of the outcome variable for a one-unit increase in the predictor variable, holding all other variables constant.

For example, if the coefficient for a predictor variable is 0.5, the odds ratio would be exp(0.5) ≈ 1.65. This means that for every one-unit increase in the predictor variable, the odds of the outcome variable occurring increase by a factor of 1.65, assuming all other variables remain constant. (A short code sketch for this appears after the list below.)

2. Consider the context: When interpreting the odds ratios, it’s essential to consider the context of the predictor variables and the outcome variable. The odds ratios should be interpreted in the context of the specific problem you are trying to solve or the research question you are trying to answer.

3. Assess the statistical significance: In addition to interpreting the odds ratios, it’s important to assess the statistical significance of the coefficients. This can be done by looking at the p-values or confidence intervals associated with the coefficients. If a coefficient is not statistically significant, it may not be meaningful to interpret its odds ratio.

4. Be cautious with interactions and multicollinearity: When interpreting the coefficients of a logistic regression, it’s important to be cautious about potential interactions between predictor variables and multicollinearity. Interactions can make the interpretation of individual coefficients more complex, while multicollinearity can lead to unstable estimates of the coefficients.

5. Consider feature scaling: If you have performed feature scaling, especially when using regularization techniques, it’s important to take this into account when interpreting the coefficients. The coefficients will be affected by the scaling, and you may need to rescale them back to their original units to make meaningful interpretations.
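As a concrete example of point 1, here’s a minimal sketch that exponentiates the fitted coefficients and their confidence bounds to get odds ratios:

# Exponentiate coefficients and confidence bounds to obtain odds ratios
conf = result.conf_int()
odds_ratios = pd.DataFrame({
    'odds ratio': np.exp(result.params),
    '2.5%': np.exp(conf[0]),
    '97.5%': np.exp(conf[1]),
})
print(odds_ratios)

Remember point 5 when reading these numbers: because we standardized the features, a "one-unit increase" here means an increase of one standard deviation, not one unit of the original measurement.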

While interpreting the coefficients of a logistic regression can be more challenging than interpreting those of a linear regression, it is still possible and can provide valuable insights into the relationships between predictor variables and the outcome variable. By following the guidelines above, you can effectively interpret the coefficients of a logistic regression.

Model Evaluation

The purpose of a model is to make predictions, so while it’s important to understand the model summary, what we really care about is how well the model is performing.

To determine this, the key metrics to consider are:

  • Pseudo R-squared: This is an analogue of R-squared for logistic regression; statsmodels reports McFadden’s version, computed as 1 minus the ratio of the model’s log-likelihood to the null model’s log-likelihood. A higher value (closer to 1) indicates a better fit of the model.
  • Log-likelihood: This value measures how likely the observed data is under the fitted model. A higher value (closer to 0) indicates a better fit of the model.
  • LLR p-value: This value tests the null hypothesis that all the coefficients of the independent variables in the model are equal to zero. A smaller p-value (typically less than 0.05) indicates that at least one of the independent variables has a significant effect on the dependent variable.
  • Convergence: This indicates whether the model has converged, meaning that the optimization algorithm has found the best-fitting parameters for the model. If the model has not converged, it may not be a reliable representation of the data.

In practice, I often look at the pseudo R-squared first, as I can easily compare it to past models I’ve built or variations of the current one. Nonetheless, it’s important to look at these other metrics as well for additional context.

While these metrics are useful, don’t forget that the true test of a model is how it does on unseen data. We still have test data to predict, so let’s do that next.

To evaluate the model by predicting on X_test, you can follow these steps (if you’ve been following along, steps 1 through 3 were already done when we prepared the data):

  1. Standardize the X_test data using the same StandardScaler object that was used for X_train.
  2. Remove highly correlated features from X_test using the same ‘to_drop’ list that was used for X_train.
  3. Add a constant term to X_test using the sm.add_constant() function.
  4. Make predictions using the fitted model by calling the result.predict() method on the prepared X_test data.
  5. Calculate evaluation metrics such as accuracy, precision, recall, F1-score, and AUC using the predicted values and the true y_test values.

Assuming you’ve run the previous code, here’s how to obtain our evaluation metrics:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Make predictions on the test data
y_pred = result.predict(X_test)

# Convert predicted probabilities to binary class labels
y_pred_labels = np.round(y_pred)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred_labels)
precision = precision_score(y_test, y_pred_labels)
recall = recall_score(y_test, y_pred_labels)
f1 = f1_score(y_test, y_pred_labels)
roc_auc = roc_auc_score(y_test, y_pred)  # AUC is computed from the predicted probabilities, not the rounded labels

print("Accuracy: ", accuracy)
print("Precision:", precision)
print("Recall: ", recall)
print("F1-score: ", f1)
print("ROC-AUC: ", roc_auc)
print()
print(result.summary())

Conclusion

In this tutorial, we’ve explored how to perform logistic regression using the StatsModels library in Python. We covered data preparation, feature selection techniques, model fitting, result interpretation, and model evaluation. By following the steps outlined in this tutorial, you can effectively apply logistic regression to your own datasets and gain valuable insights into the relationships between predictor variables and the outcome variable.

This concludes my series on logistic regression! I hope that you’ve gained a comprehensive understanding of this powerful statistical technique and its applications in various domains. By mastering logistic regression, you can enhance your data analysis skills and make more informed decisions based on the insights derived from your data.
