Best Practices for Debugging Errors in Logistic Regression with Python

Optimizing performance using unstructured, real-world data

Gabe Verzino
Towards Data Science


Vardan Papikyan (Unsplash)

Much has been written about the basics of Logistic Regression (LR): its versatility, time-tested performance, even the underlying math. But knowing how to implement LR successfully and debug the inevitable errors is much more challenging. That information lives deep in Q&A websites and academic papers, or simply comes with experience.

The reality is, not every output will be as clean as the iconic iris dataset, swiftly classifying flower types. On large datasets (the datasets you likely use on the job), LR models are likely to encounter problems. Implausibly high coefficients. NaN standard errors. Failing to converge.

What’s to blame? Bad data? Your model?

To organize the landscape a bit, I conducted some research and compiled a list of common LR errors, reasons, and possible solutions.

LR Model output problems, reasons and solutions (Gabe Verzino)

The above table is not exhaustive by any means, but it puts everything in one place. Below, I will use fabricated data (imbalanced, sparse, categorical… but typical) to artificially re-create these errors and then fix them.

But first.

Quick Background

As you begin to think about LR, there are a few things unique to these models that you should remember:

  1. LR is technically a probability model, not a classifier
  2. LR requires a linear decision boundary; it assumes linearity between the features and the log-odds of the target variable, which you can check with visuals, crosstabs, or ConvexHull analysis from scipy.spatial
  3. You cannot have missing values or major outliers
  4. As the number of features increases, you are more likely to experience multicollinearity and overfitting (fixable with VIF and regularization, respectively)
  5. The most popular LR packages in Python come from sklearn and statsmodels (a minimal sketch of both appears right after this list)
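For orientation, here is a minimal sketch of both entry points on a small, hypothetical DataFrame (the data and column names are made up for illustration). statsmodels gives a full statistical summary; sklearn is geared toward prediction pipelines.

import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression

# hypothetical data: a binary target and two categorical features
df = pd.DataFrame({'target': [0, 1, 0, 1, 1, 0, 1, 0],
                   'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
                   'emp': ['Employed', 'Not Employed', 'Not Employed', 'Employed',
                           'Employed', 'Not Employed', 'Not Employed', 'Employed']})

# statsmodels: formula interface with coefficients, standard errors and p-values
print(smf.logit("target ~ C(gender) + C(emp)", data=df).fit().summary())

# sklearn: estimator interface, coefficients only
X = pd.get_dummies(df[['gender', 'emp']], drop_first=True)
print(LogisticRegression(max_iter=1000).fit(X, df['target']).coef_)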

Now let’s get into some common problems.

Problem 1: Convergence warning, optimization failed

Using the Python statsmodels package, you may run a basic LR and get the warning “ConvergenceWarning: Maximum Likelihood optimization failed to converge.”

import statsmodels.formula.api as smf

model = smf.logit("target_satisfaction ~ C(age,Treatment('0-13')) + \
C(educ,Treatment('Some High School')) + \
C(emp,Treatment('Not Employed')) + \
C(gender,Treatment('Male')) + \
C(hispanic,Treatment('N')) + \
C(dept,Treatment('Medical Oncology')) + \
C(sign_status,Treatment('Activated')) + \
C(mode,Treatment('Internet')) + \
C(inf,Treatment('Missing'))", data=df_a)

results = model.fit(method='bfgs')
print(results.summary())

The coefficients and standard errors may even populate as normal, and coefficients from LogisticRegression in the sklearn.linear_model package will too.

Given those results, you may be tempted to stop here. But don’t. You want to ensure the model actually converges, so that it reaches the best (lowest) value of the cost function and the best model fit [1].

You can solve this in statsmodels or sklearn by changing the solver/method or by increasing the maxiter parameter (max_iter in sklearn). In sklearn, the regularization parameter C can also be lowered to apply stronger L2 regularization, and can be tested iteratively along a logarithmic scale (100, 10, 1, 0.1, 0.01, and so on) [2].
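As a rough sketch of the sklearn side (the solver name and C grid here are illustrative choices, and X_train/y_train/X_test/y_test are assumed to already exist from a train/test split):

from sklearn.linear_model import LogisticRegression

# step C down a log scale (smaller C = stronger L2 regularization)
# and raise max_iter so the solver has more room to converge
for C in [100, 10, 1, 0.1, 0.01]:
    clf = LogisticRegression(solver='lbfgs', penalty='l2', C=C, max_iter=300000)
    clf.fit(X_train, y_train)
    print(C, clf.n_iter_, clf.score(X_test, y_test))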

In my particular case, I increased to maxiter=300000 and the model converged. It worked because I gave the solver more attempts at converging and finding the best parameters [3]. And the resulting estimates, such as the coefficients and p-values, do indeed become more accurate.

method=’bfgs’, maxiter=0
method=’bfgs’, maxiter=30000

Problem 2: Added a feature, but LR outputs didn’t update

This one is easy to miss, but easy to diagnose. If you’re working with many features or didn’t catch it during data cleaning, you may accidentally include a categorical feature in your LR model that is nearly constant or has only one level… bad stuff. Including such a feature will still render coefficients and standard errors without any warnings.

Here’s our model and output without the bad feature.

model = smf.logit("target_satisfaction ~ C(age,Treatment('0-13')) + \
C(educ,Treatment('Some High School')) + \
C(emp,Treatment('Not Employed')) + \
C(gender,Treatment('Male')) + \
C(hispanic,Treatment('N')) + \
C(dept,Treatment('Medical Oncology')) + \
C(sign_status,Treatment('Activated')) + \
C(mode,Treatment('Internet')) + \
C(inf,Treatment('Missing'))", data=df_a)

results = model.fit(method='bfgs',maxiter=30000)
print(results.summary())

Now, let’s add a one-level feature, type, with its reference category set to ‘Missing’.

model = smf.logit("target_satisfaction ~ C(age,Treatment('0-13')) + \
C(educ,Treatment('Some High School')) + \
C(emp,Treatment('Not Employed')) + \
C(gender,Treatment('Male')) + \
C(hispanic,Treatment('N')) + \
C(dept,Treatment('Medical Oncology')) + \
C(sign_status,Treatment('Activated')) + \
C(mode,Treatment('Internet')) + \
C(inf,Treatment('Missing')) + \
C(type,Treatment('Missing'))", data=df_a)

results = model.fit(method='bfgs', maxiter=300000)
print(results.summary())

And here we see the impact of that superfluous feature… which is no impact at all. The pseudo R-squared value stays the same, highlighted above and below.

Logistic Regression cannot produce meaningful estimates for a feature with one level or constant values, and may silently discard it from the model. This does no real detriment to the model itself, but best practice is still to include only features that actually contribute to the model.

A simple value_counts() on the feature type reveals it has only one level, which tells you to drop the feature from your model.
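A quick sketch of that check (column name per the article’s fabricated dataset):

# confirm the feature has only one level (dropna=False also counts missing values)
print(df_a['type'].value_counts(dropna=False))

# a one-level feature adds nothing, so the simplest fix is to leave it out of the formula,
# or drop the column entirely: df_a = df_a.drop(columns=['type'])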

Problem 3: “Inverting Hessian failed” (and NaN standard errors)

This is a big one, and it happens a lot. Let’s create this new error by adding four more features to our LR model: VA, VB, VC and VI. They’re all categorical, each with 3 levels (0, 1 and “Missing”, with “Missing” as the reference).

model = smf.logit("target_satisfaction ~ C(age,Treatment('0-13')) + \
C(educ,Treatment('Some High School')) + \
C(emp,Treatment('Not Employed')) + \
C(gender,Treatment('Male')) + \
C(hispanic,Treatment('N')) + \
C(dept,Treatment('Medical Oncology')) + \
C(sign_status,Treatment('Activated')) + \
C(mode,Treatment('Internet')) + \
C(inf,Treatment('Missing')) + \
C(type,Treatment('Missing')) + \
C(VA,Treatment('Missing')) + \
C(VB,Treatment('Missing')) + \
C(VC,Treatment('Missing')) + \
C(VI,Treatment('Missing'))", data=df_a)

results = model.fit(method='bfgs', maxiter=300000)
print(results.summary())

And here is our new error:

Exploring the outputs a bit, we see coefficients rendered but none of the standard errors or p-values. Sklearn provides coefficients as well.

But why the NaN values? And why do you need to invert Hessian, anyway?

Many optimization algorithms use the inverse of the Hessian (or an estimate of it) to maximize the likelihood function. So when inversion fails, it means the Hessian (technically, the matrix of second derivatives of the log-likelihood) is not positive definite [4]. Intuitively, the curvature of the log-likelihood is enormous in some directions and nearly flat in others, and the result is that the model cannot produce standard errors for the parameters.

More to the point, when a Hessian is not invertible, no computational trick can produce its inverse, because the inverse simply does not exist [5].
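For context, this follows from standard maximum-likelihood theory rather than anything specific to this dataset: the estimated covariance matrix of the coefficients is Cov(β̂) ≈ (−H)⁻¹, the inverse of the negative Hessian evaluated at the fitted coefficients, and the standard errors are the square roots of its diagonal. No invertible Hessian means no covariance matrix, hence the NaN standard errors.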

What causes this? The three most common reasons are:

  1. There are more feature variables than observations
  2. The categorical feature(s) have very low frequency levels
  3. Multicollinearity exists between features

Since our dataset has many rows (~25,000) relative to features (14), we can safely ignore the first possibility (though solutions exist for this exact problem [6]). The third possibility may be at play, and we can check that with the variance inflation factor (VIF). The second possibility is a bit easier to diagnose, so let’s start there.

Note that features VA, VB, VC and VI consist mainly of 1s and 0s.

In particular, the VI feature has a very low (relative) frequency for the ‘Missing’ category, actually on the order of 0.03% (9/24,874).
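A rough sketch of that frequency check (column names per the article’s fabricated dataset):

# relative frequency of each level, including missing values
for col in ['VA', 'VB', 'VC', 'VI']:
    print(col)
    print(df_a[col].value_counts(normalize=True, dropna=False).round(4))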

Let’s say the business context confirms that we can collapse the 1s and “Missing” together. Or, at the very least, that any consequences of refactoring the data this way would be far smaller than accepting a model with known errors (and no standard errors to speak of).

So we created VI_alt, which has 2 levels, and 0 can serve as our reference.
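The recoding step isn’t shown in the article, but a minimal sketch under that assumption (collapsing the sparse ‘Missing’ level into 1, with the exact level encoding assumed) might look like:

import numpy as np

# fold 'Missing' into the 1 level, leaving a two-level feature with 0 as the reference
df_a['VI_alt'] = np.where(df_a['VI'].isin([1, '1', 'Missing']), 1, 0)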

Exchanging VI_alt for VI in the model,

model = smf.logit("target_satisfaction ~ C(age,Treatment('0-13')) + \
C(educ,Treatment('Some High School')) + \
C(emp,Treatment('Not Employed')) + \
C(gender,Treatment('Male')) + \
C(hispanic,Treatment('N')) + \
C(dept,Treatment('Medical Oncology')) + \
C(sign_status,Treatment('Activated')) + \
C(mode,Treatment('Internet')) + \
C(inf,Treatment('Missing')) + \
C(type,Treatment('Missing')) + \
C(VA,Treatment('Missing')) + \
C(VB,Treatment('Missing')) + \
C(VC,Treatment('Missing')) + \
C(VI_alt,Treatment(0))", data=df_a)

results = model.fit(method='bfgs', maxiter=300000)
print(results.summary())

There’s a slight improvement to the overall model: it converges without any errors, coefficients render, and standard errors now appear. It’s still a poorly fitted model, but it’s now a working model, which is our aim here.

Our third and final possibility for the Hessian failing to invert is multicollinearity. Now, multicollinearity is a hot topic in machine learning, I think because (1) it plagues many popular models like LR, linear regression, KNN and Naive Bayes, (2) the VIF heuristic for removing features turns out to be more art than science, and (3) ultimately, practitioners disagree about whether to remove such features at all if doing so comes at the expense of injecting selection bias or losing key model information [5–9].

I won’t go down that rabbit hole. It’s deep.

But I’ll show how to calculate VIFs, and maybe we can make our own conclusions.

First, the VIF really asks, “How well is one of my features jointly explained by all of the other features?” A feature’s VIF estimate gets “inflated” when there is linear dependence with the other features.
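Concretely, and this is the standard definition rather than anything specific to this article: for feature j, regress it on all the other features, take that regression’s R², and compute VIF_j = 1 / (1 − R²_j). The closer R²_j gets to 1, the more the VIF blows up toward infinity.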

Given all the features in our dataset, let’s calculate the VIFs below.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# design matrix of the fitted model (the dummy-encoded features plus the intercept)
variables = results.model.exog
vif = [variance_inflation_factor(variables, i) for i in range(variables.shape[1])]

# pair each column name with its VIF and view the last 14 entries
vifs = pd.DataFrame({'variables': results.model.exog_names,
                     'vif': ['%.2f' % elem for elem in vif]})
vifs.sort_index(ascending=False).head(14)

Note from this partial list that features VA, VB and VC all show VIFs toward infinity, far exceeding the “rule of thumb” threshold of 5. But we need to be careful with heuristics like this, as VIF thresholds come with two caveats:

  1. The 5 cut-off is relative to the other features: e.g., if the large majority of feature VIFs fall below 7 and a smaller subset are above 7, then 7 could be a more reasonable threshold.
  2. Categorical features with a small number of cases in the reference category relative to the other levels will produce high VIFs even when those features are not correlated with other variables in the model [9].

In our case, features VA, VB and VC are all highly collinear. Crosstabs confirm this, and a Pearson correlation matrix would too if they were continuous variables.

The general consensus for solving this is to systematically drop one feature at a time, review all VIFs, note possible improvements, and continue until all VIFs fall below your selected threshold. Pay careful attention to any explanatory power you might be giving up, relative to both the business context and the target variable. Confirmatory statistical tests like chi-square, along with visualization, can help in deciding which of two candidate features to drop.
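One hedged way to compare two candidate features before dropping one (scipy is assumed to be available; column names per the article’s dataset):

import pandas as pd
from scipy.stats import chi2_contingency

# crosstab each candidate against the target and compare chi-square statistics;
# the feature with the weaker association is the safer one to drop
for col in ['VB', 'VC']:
    ct = pd.crosstab(df_a[col], df_a['target_satisfaction'])
    chi2, p, dof, expected = chi2_contingency(ct)
    print(col, round(chi2, 1), p)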

Let’s drop VB and note the VIF changes:

model = smf.logit("target_satisfaction ~ C(age,Treatment('0-13')) + \
C(educ,Treatment('Some High School')) + \
C(emp,Treatment('Not Employed')) + \
C(gender,Treatment('Male')) + \
C(hispanic,Treatment('N')) + \
C(dept,Treatment('Medical Oncology')) + \
C(sign_status,Treatment('Activated')) + \
C(mode,Treatment('Internet')) + \
C(inf,Treatment('Missing')) + \
C(type,Treatment('Missing')) + \
C(VA,Treatment('Missing')) + \
C(VC,Treatment('Missing')) + \
C(VI_alt,Treatment(0))", data=df_a)

results = model.fit(method='bfgs', maxiter=300000)
print(results.summary())

VA and VC still have VIFs at infinity. Even worse, the model is still rendering NaN standard errors (not shown). Let’s drop VC.

model = smf.logit("target_satisfaction ~ C(age,Treatment('0-13')) + \
C(educ,Treatment('Some High School')) + \
C(emp,Treatment('Not Employed')) + \
C(gender,Treatment('Male')) + \
C(hispanic,Treatment('N')) + \
C(dept,Treatment('Medical Oncology')) + \
C(sign_status,Treatment('Activated')) + \
C(mode,Treatment('Internet')) + \
C(inf,Treatment('Missing')) + \
C(type,Treatment('Missing')) + \
C(VA,Treatment('Missing')) + \
C(VI_alt,Treatment(0))", data=df_a)

results = model.fit(method='bfgs', maxiter=300000)
print(results.summary())

Finally, the model produces standard errors. So even though the VIFs for feature VA are still > 5, it’s not a problem, because of the second caveat above: the reference category has a small number of cases relative to the other levels.

Extra Credit: Let’s say you absolutely know VA, VB and VC are critical to explaining the target variable and need them in the model. With the extra features, the optimization is working over a more complex surface, and the “Inverting Hessian failed” error may sometimes be circumvented by choosing a different solver or different starting values (the start_params argument of fit() in statsmodels). Training a different model that does not assume a linear boundary is also an option.
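A sketch of that idea in statsmodels (the solver choice and zero starting values are illustrative assumptions, not a recommendation):

import numpy as np

# try a different optimizer with explicit starting values, one per design-matrix column
start = np.zeros(len(model.exog_names))
results = model.fit(method='newton', start_params=start, maxiter=300000)
print(results.summary())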

Problem 4: “Possibly complete quasi-separation” error (and unreasonably large coefficients and/or standard errors)

We might get excited to see large coefficients and high accuracy scores. But often, these estimates are implausibly large, and are caused by another common issue: perfect separation.

Perfect (or near-perfect) separation occurs when one or more features are strongly associated with the target variable. In fact, they may be almost identical to it.

I can manufacture this error by taking the target variable target_satisfaction and creating a new feature from it called new_target_satisfaction that is 95% similar to it.

# flip the first 1,000 target values from 1 to 0, then re-append the rest,
# yielding a feature that is ~95% identical to the target
df_a['new_target_satisfaction'] = pd.concat(
    [pd.DataFrame(np.where(df_a.target_satisfaction.iloc[:1000] == 1, 0, df_a.target_satisfaction.iloc[:1000]),
                  columns=['target_satisfaction']),
     pd.DataFrame(df_a.target_satisfaction.iloc[1000:], columns=['target_satisfaction'])],
    axis=0)

And put new_target_satisfaction into the model.

model = smf.logit("target_satisfaction ~ C(age,Treatment('0-13')) + \
C(educ,Treatment('Some High School')) + \
C(emp,Treatment('Not Employed')) + \
C(gender,Treatment('Male')) + \
C(hispanic,Treatment('N')) + \
C(dept,Treatment('Medical Oncology')) + \
C(sign_status,Treatment('Activated')) + \
C(mode,Treatment('Internet')) + \
C(inf,Treatment('Missing')) + \
C(type,Treatment('Missing')) + \
C(VA,Treatment('Missing')) + \
C(VI_alt,Treatment(0)) + \
C(new_target_satisfaction,Treatment(0))", data=df_a)

results = model.fit(method='lbfgs', maxiter=1000000)
print(results.summary())

Coefficients and standard errors render, but we get this quasi-separation warning:

The feature has a ridiculously high coefficient, and an odds ratio of about 70,000,000, which is implausible.

Under the hood, this happens because a feature that separates “too well” creates an inflated slope, and thus a huge coefficient and standard error. The LR model may also fail to converge [10].

Perfect Separation (Cornell Statistical Consulting Unit) [10]

Those two red dots, the misclassified cases that were removed, would actually have prevented perfect separation had they been kept, helping the LR model converge and keeping the standard error estimates realistic.

The thing to remember about perfect separation in sklearn is that it can quietly output a model with near-perfect accuracy, 98% in our case, when in reality the feature driving it, new_target_satisfaction, is a near duplicate of the target target_satisfaction.


from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_features = ['educ', 'emp', 'gender', 'hispanic', 'dept', 'sign_status', 'mode', 'inf', 'type',
                        'VA', 'VI_alt', 'new_target_satisfaction']

df1_y = df_a['target_satisfaction']
df1_x = df_a[categorical_features]

# create a pipeline for the categorical features (the column transformer below
# applies the same OneHotEncoder directly)
categorical_transformer = Pipeline(steps=[
    ('ohe', OneHotEncoder(handle_unknown='ignore'))])

# create the overall column transformer
col_transformer = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(handle_unknown='ignore'), categorical_features)],
    # ('num', numeric_transformer, numeric_features)],
    remainder='passthrough')

# preprocessing and the classifier in a single pipeline
# other solvers to try: 'liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga'
lr = Pipeline(steps=[('preprocessor', col_transformer),
                     ('classifier', LogisticRegression(solver='lbfgs', max_iter=200000))])

X_train, X_test, y_train, y_test = train_test_split(df1_x, df1_y, test_size=0.2, random_state=101)

lr_model = lr.fit(X_train, y_train)
y_pred = lr_model.predict(X_test)

# to adjust the classification threshold instead of the 0.5 default:
# THRESHOLD = 0.5
# y_pred = np.where(lr_model.predict_proba(X_test)[:, 1] > THRESHOLD, 1, 0)

print(classification_report(y_test, y_pred))

The most common solution would be to simply drop the feature. But there are a growing number of alternative solutions:

  1. Apply Firth’s correction, which maximizes a “penalized” likelihood function. Currently, statsmodels does not have this functionality available but R does [11]
  2. Penalized regression can also work, specifically by testing combinations of solvers, penalties and learning rates [2]. I kept new_target_satisfaction in the model and tried various combinations, but it made little difference in this particular case
  3. Some practitioners will manually swap a few randomly selected observations in the problematic feature to make it less perfectly separated from the target, like adding back those red circles in the picture above [8, 10]. Running a crosstab of that feature against the target can help you determine what percentage of cases to swap (see the sketch after this list). While doing this you might ask yourself: why am I refactoring one feature like this just so the LR model will accept it? To help you sleep better, some research argues that perfect separation is only a symptom of the model, not of our data [8, 11]
  4. Finally, some contrarian practitioners actually see nothing wrong with coefficients this high [8]. A very high odds ratio simply signals a strong association and near-perfect prediction. Caveat the findings and leave it at that. The basis for this argument is that very large coefficients are an unfortunate consequence of the Wald and likelihood-ratio tests, which require evaluating the information under an alternative hypothesis
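A rough sketch of the crosstab check from item 3 (column names per the article’s dataset):

import pandas as pd

# rows = the problematic feature, columns = the target;
# nearly empty off-diagonal cells are the signature of (quasi-)separation
print(pd.crosstab(df_a['new_target_satisfaction'], df_a['target_satisfaction'], normalize='all'))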

Conclusion

Logistic Regression is most definitely versatile and powerful if you can overcome the challenges from real-world data sets. I hope this overview provides a good foundation of possible solutions. What tip did you find most interesting? What are some others?

Thanks for reading. Share your thoughts in the Comments section, and let me know what other problems you’ve been encountering with Logistic Regression. I’m also happy to connect and talk shop on LinkedIn.

Check out some of my other articles:

Using Bayesian Networks to Forecast Ancillary Service Volume in Hospitals

Why Balancing Classes is Over-Hyped

Feature Engineering CPT Codes

Citations

  1. Allison, Paul. Convergence Failures in Logistic Regression. University of Pennsylvania, Philadelphia, PA. https://www.people.vcu.edu/~dbandyop/BIOS625/Convergence_Logistic.pdf
  2. Geron, Aurelien. Hands-On Machine Learning with Scikit-Learn, Keras & Tensorflow. Second Edition. Published by: O’Reilly, 2019.
  3. Scikit-learn documentation, sklearn.linear_model.LogisticRegression. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
  4. Google Groups discussion. “MLE error: Warning: Inverting hessian failed: Maybe I cant use ‘matrix’ containers?” 2014. https://groups.google.com/g/pystatsmodels/c/aELcgNpg5f8
  5. Gill, Jeff & King, Gary. What to Do When Your Hessian Is Not Invertible: Alternatives to Model Respecification in Nonlinear Estimation. Sociological Methods & Research, Vol. 33, No. 1, August 2004, pp. 54–87. DOI: 10.1177/0049124103262681. https://gking.harvard.edu/files/help.pdf
  6. “Variable Selection for a binary classification problem.” StackExchange, CrossValidated. Online at: https://stats.stackexchange.com/questions/64271/variable-selection-for-a-binary-classification-problem/64637#64637
  7. “In supervised learning, why is it bad to have correlated features.” StackExchange, Data Science. Online at: https://datascience.stackexchange.com/questions/24452/in-supervised-learning-why-is-it-bad-to-have-correlated-features
  8. “How to deal with perfect separation in logistic regression”. StackExchange, CrossValidated. Online at: https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression
  9. Allison, Paul. “When Can You Safely Ignore Multicollinearity”. Statistical Horizons, September 10, 2012. Online at: https://statisticalhorizons.com/multicollinearity/
  10. Cornell Statistical Consulting Unit. Separation and Convergence Issues in Logistic Regression. Statnews #82. Published: February, 2012. Online at: https://cscu.cornell.edu/wp-content/uploads/82_lgsbias.pdf
  11. Logistf: Firth’s Bias-Reduced Logistic Regression. R documentation. Published August 18, 2023. Online at: https://cran.r-project.org/web/packages/logistf/index.html

Note: All images are by the author, unless otherwise stated.
