Visualizing logistic regression results using a forest plot in Python

Xavier Eugenio Asuncion
Oct 22, 2021


The purpose of this article is to demonstrate how to create a forest plot in Python to visualize logistic regression results. This type of visualization is particularly useful when logistic regression is used for inference rather than prediction. Accordingly, in this article we’ll use logistic regression to describe the relationship between the variables rather than to make predictions.

Dataset

The dataset we’ll be using is the Pima Indians Diabetes Dataset from the National Institute of Diabetes and Digestive and Kidney Diseases. This dataset is used to create models that predict the onset of diabetes based on certain diagnostic measurements from 768 observations. It’s a binary classification problem with eight input variables and one output variable.

  1. Number of times pregnant
  2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  3. Diastolic blood pressure (mm Hg)
  4. Triceps skinfold thickness (mm)
  5. 2-Hour serum insulin (mu U/ml)
  6. Body mass index (weight in kg/(height in m)²)
  7. Diabetes pedigree function
  8. Age (years)
  9. Class variable (0 or 1; class value 1 is interpreted as “tested positive for diabetes”)

Note that this dataset is imbalanced: among the 768 observations, only 268 (34.9%) tested positive for diabetes. For the purposes of this article, however, we’ll set aside the issue of class imbalance.

The dataset can be downloaded using the code below.

import pandas as pd

# column names
var_names = ["Number of times pregnant", "Plasma glucose concentration", "Diastolic blood pressure",
             "Triceps skinfold thickness", "2-Hour serum insulin", "Body mass index",
             "Diabetes pedigree function", "Age", "Class"]

# download the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
df = pd.read_csv(url, header=None, names=var_names)
df.head(5)
Screenshot of the Pima Indians Diabetes Dataset
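
As a quick sanity check on the imbalance mentioned earlier, the positive share can be computed with pandas. Here is a minimal sketch using a stand-in Series built from the counts quoted above (268 positives out of 768); with the real data, `df['Class']` would take its place:

```python
import pandas as pd

# toy stand-in for the Class column, built from the quoted counts:
# 268 positives and 768 - 268 = 500 negatives
cls = pd.Series([1] * 268 + [0] * 500, name="Class")

# fraction of observations that tested positive for diabetes
positive_share = (cls == 1).mean()
print(round(positive_share, 3))  # 0.349, i.e. about 34.9% tested positive
```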

Building the logistic regression model

Given that this is an inference task, I built a logistic regression model using Python’s statsmodels library.

import statsmodels.api as sm

# define dependent and independent variables
Xtrain = df.iloc[:, :-1]
ytrain = df.iloc[:, -1]

# build the model and fit the data
model = sm.Logit(ytrain, Xtrain).fit()

Here is the model summary after training:

print(model.summary())
Screenshot of the model summary from statsmodels

The model’s coefficients are reported in the table as log-odds, which can be hard to interpret. I’ve therefore transformed the coefficients into odds ratios to make the results more intuitive. I’ve also used the p-values to flag which variables are statistically significant at the 5% level. Among the eight variables, only the number of times pregnant, plasma glucose concentration, and diastolic blood pressure are statistically significant.
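
For intuition on the log-odds-to-odds-ratio conversion, exponentiation is all that’s involved; the confidence limits convert the same way. A toy example with a made-up coefficient:

```python
import numpy as np

# hypothetical log-odds coefficient and 95% confidence limits, for illustration only
beta, lo, hi = 0.10, 0.02, 0.18

# exponentiating converts log-odds to odds ratios
odds_ratio = np.exp(beta)
ci = (np.exp(lo), np.exp(hi))
print(round(odds_ratio, 3))  # 1.105: a one-unit increase multiplies the odds by ~1.105
```

An odds ratio above 1 means the variable increases the odds of the outcome, and below 1 means it decreases them, which is why the forest plots later in this article draw a reference line at 1.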

import numpy as np

params = model.params
conf = model.conf_int()
conf['Odds Ratio'] = params
conf.columns = ['2.5%', '97.5%', 'Odds Ratio']

# convert log-odds to odds ratios
odds = pd.DataFrame(np.exp(conf))

# check whether the p-values are significant
odds['pvalues'] = model.pvalues
odds['significant?'] = ['significant' if pval <= 0.05 else 'not significant' for pval in model.pvalues]
odds
Screenshot of the new table containing the model information

Although we can interpret the findings from the table alone, it is often preferable to visualize the results with a graph. This aids interpretation, especially when there are many variables, because it reduces the cognitive load of scanning a table of numbers. Visualizing the results also makes it easier to communicate our findings to others.

Visualizing results

To visualize the logistic regression results, I used a forest plot (or blobbogram). Specifically, the forest plot was used to graphically represent the odds ratios and their corresponding confidence intervals. The markers reflect the odds ratios while the whiskers represent the confidence limits. I used the matplotlib library to generate this plot in Python. We can see that by representing the results with a forest plot, the story of our analysis is now much clearer.

import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4), dpi=150)

# whisker lengths: distance from the odds ratio to each confidence limit
ci = [odds.iloc[::-1]['Odds Ratio'] - odds.iloc[::-1]['2.5%'].values,
      odds.iloc[::-1]['97.5%'].values - odds.iloc[::-1]['Odds Ratio']]

plt.errorbar(x=odds.iloc[::-1]['Odds Ratio'], y=odds.iloc[::-1].index.values, xerr=ci,
             color='black', capsize=3, linestyle='None', linewidth=1,
             marker="o", markersize=5, mfc="black", mec="black")
plt.axvline(x=1, linewidth=0.8, linestyle='--', color='black')
plt.tick_params(axis='both', which='major', labelsize=8)
plt.xlabel('Odds Ratio and 95% Confidence Interval', fontsize=8)
plt.tight_layout()
# plt.savefig('raw_forest_plot.png')
plt.show()
Forest plot showing the odds ratios and the associated 95% confidence intervals

We can further enhance the figure above by including information on the statistical significance of the variables. To do this, I used color to distinguish statistically significant variables (red lines) from those that are not statistically significant (gray lines). This visualization, shown in the figure below, now allows us to convey a more compelling story.

Forest plot with lines are colored based on statistical significance of the variable

Here is the code used to generate the plot above:

fig, ax = plt.subplots(nrows=1, sharex=True, sharey=True, figsize=(6, 4), dpi=150)

for idx, row in odds.iloc[::-1].iterrows():
    # whisker lengths for this variable
    ci = [[row['Odds Ratio'] - row['2.5%']], [row['97.5%'] - row['Odds Ratio']]]
    if row['significant?'] == 'significant':
        plt.errorbar(x=[row['Odds Ratio']], y=[row.name], xerr=ci,
                     ecolor='tab:red', capsize=3, linestyle='None', linewidth=1, marker="o",
                     markersize=5, mfc="tab:red", mec="tab:red")
    else:
        plt.errorbar(x=[row['Odds Ratio']], y=[row.name], xerr=ci,
                     ecolor='tab:gray', capsize=3, linestyle='None', linewidth=1, marker="o",
                     markersize=5, mfc="tab:gray", mec="tab:gray")

plt.axvline(x=1, linewidth=0.8, linestyle='--', color='black')
plt.tick_params(axis='both', which='major', labelsize=8)
plt.xlabel('Odds Ratio and 95% Confidence Interval', fontsize=8)
plt.tight_layout()
plt.savefig('forest_plot.png')
plt.show()

All the code in this article can also be found in this Jupyter notebook: link.

Summary

In this article, I showed an alternative to the summary table for presenting logistic regression results in Python. The forest plot graphically shows the odds ratios and the confidence intervals associated with them. I hope this post has demonstrated how a forest plot can aid in the analysis and communication of your logistic regression results.

Reference

  1. Dataset: https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv
  2. Dataset description: https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names
