Linear Regression in Scikit-learn vs Statsmodels

Haree Srinivasan
6 min read · May 19, 2020


Introduction

One of the benefits of programming in Python is the vast community and the universe of libraries it has created. Those attempting to create linear models in Python will find themselves spoiled for choice. Two of the most popular linear-model libraries are scikit-learn’s linear_model and statsmodels.api. While both libraries can successfully fit linear models and calculate predictions, they differ in several ways and are not necessarily means to the same end. In general, scikit-learn is designed for prediction, while statsmodels is better suited for explanatory analysis. In this article, I explore several key differences between the two libraries: how they calculate ordinary least squares (OLS), their efficiency, the features they offer, how they perform logistic regression, and the other linear models they are capable of fitting.

It would not take long for a Python programmer to notice the different syntaxes of the two libraries. In my opinion, scikit-learn’s syntax is more intuitive to someone familiar with object-oriented programming, though statsmodels’ syntax is certainly functional. The purpose of this article is to examine these modules in greater depth. Those interested in syntax should examine the documentation, which can be found here:

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html#statsmodels.regression.linear_model.OLS

In addition, this article from Kaggle provides examples of both:

https://www.kaggle.com/agailloty/linear-regression-with-sklearn-and-statsmodel

Behind the Scenes

A key difference between the two libraries is how they handle constants. Scikit-learn lets the user specify whether or not to add a constant through a parameter (fit_intercept), while statsmodels provides a function (add_constant) that prepends a constant column to a given array.
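As a minimal sketch of the two approaches (with made-up data, not the experiment described below):

import numpy as np
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

X = np.random.rand(100, 5)
y = X @ np.arange(1, 6) + 2 + np.random.randn(100) * 0.1

# scikit-learn: the intercept is a constructor parameter
sk_model = LinearRegression(fit_intercept=True).fit(X, y)

# statsmodels: the constant column is added to the design matrix explicitly
X_const = sm.add_constant(X)
sm_results = sm.OLS(y, X_const).fit()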

Scikit-learn’s LinearRegression class is, per the docstring in the source code, “plain [OLS] (scipy.linalg.lstsq) wrapped as a predictor object.” SciPy computes the solution to the least-squares equation Ax = b. Statsmodels’ OLS class, meanwhile, provides two algorithms, chosen via the method argument of fit(): the Moore-Penrose pseudoinverse (“pinv”), which is the default and similar to SciPy’s approach, and QR factorization (“qr”). In most cases these algorithms return the same result. What is likely more interesting to most programmers is how quickly each class arrives at it.
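Reusing y and X_const from the snippet above, the decomposition can be selected at fit time:

results_pinv = sm.OLS(y, X_const).fit(method="pinv")  # the default
results_qr = sm.OLS(y, X_const).fit(method="qr")

# for well-conditioned data, the two decompositions agree
np.allclose(results_pinv.params, results_qr.params)  # True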

I conducted a brief experiment to test the time taken by each library. For the experiment, I created seven sets of arrays (features, targets, and predictions) with random values. The number of rows in each set increased by powers of ten from 10² to 10⁸, the input arrays all had five columns, and the prediction arrays were 10% the size of the feature and target arrays. I set both linear regression classes to include an intercept, so the time taken to apply a constant is part of the comparison. I then fit and calculated predictions for each set of arrays ten times and measured the time taken. The times for each set of arrays were compared using t-tests via SciPy’s stats library. For seven hypothesis tests at a significance level of 5%, the Bonferroni-corrected significance level is roughly 0.007 (0.05/7).
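The repository holds the real experiment; what follows is only a rough sketch of such a timing harness, using timeit and scipy.stats rather than the repository’s code:

import timeit
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

def compare_fit_times(n_rows, n_trials=10):
    X = np.random.rand(n_rows, 5)
    y = np.random.rand(n_rows)
    X_pred = np.random.rand(n_rows // 10, 5)  # predictions on 10% of the rows

    # time scikit-learn's fit-and-predict, including the intercept
    sk_times = timeit.repeat(
        lambda: LinearRegression(fit_intercept=True).fit(X, y).predict(X_pred),
        number=1, repeat=n_trials)

    # time statsmodels, including the cost of adding the constant
    sm_times = timeit.repeat(
        lambda: sm.OLS(y, sm.add_constant(X)).fit().predict(sm.add_constant(X_pred)),
        number=1, repeat=n_trials)

    # two-sample t-test on the recorded times
    return stats.ttest_ind(sk_times, sm_times)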

While scikit-learn is slightly faster than statsmodels for 1,000 or fewer observations, the difference is not significant per the t-test analysis. Scikit-learn is significantly faster for datasets with more than 1,000 observations, and on my computer, at 10⁸ observations (the largest power of ten I could reasonably test), it is about two and a half times as fast as statsmodels. In performing these tests, I did not make use of scikit-learn’s n_jobs parameter, which lets the user recruit additional cores to potentially fit models and make predictions even faster. Ultimately, I concluded that scikit-learn is faster than statsmodels at fitting ordinary least squares regressions. The code for the experiment is available in the accompanying GitHub repository under time_tests.py, while the experiment itself is carried out in sklearn_statsmodels_time_comp.ipynb.

Visualizations

Anyone who has taken a linear regression course, or who is familiar with statistical programs such as R or Stata, will recognize a table that looks something like this:

[statsmodels OLS regression results table]

For some, this visual might conjure nightmares, but to someone who [gasps] enjoyed those linear regression courses, seeing this table feels like meeting an old friend. It contains several key statistics that help ascertain the validity of the fitted model, and it provides metrics that allow for comparisons with other models. The table is easily obtained through statsmodels’ OLS class: fit a model, then call summary() on the results.
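Producing the table takes little more than a fit (toy data for illustration):

import numpy as np
import statsmodels.api as sm

X = sm.add_constant(np.random.rand(100, 5))
y = np.random.rand(100)

results = sm.OLS(y, X).fit()
print(results.summary())  # prints the familiar regression table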

While much of the information displayed in this summary table can be obtained through scikit-learn’s LinearRegression class, not all of it is readily available through attributes (notably the test statistics and the information criteria), and scikit-learn provides no similar visual. While the summary table is useful for explaining models, it relates information that is extraneous for the purposes of prediction. Since scikit-learn’s library is designed for prediction, such a summary table is unnecessary.
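For comparison, here is roughly what scikit-learn does expose through attributes, reusing X and y from the snippet above; the test statistics and AIC/BIC have no built-in counterpart:

from sklearn.linear_model import LinearRegression

sk_model = LinearRegression().fit(X, y)
sk_model.coef_         # slope coefficients
sk_model.intercept_    # constant term
sk_model.score(X, y)   # R-squared on the training data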

Logistic Regression

Those of us attempting to use linear regression to predict probabilities often use OLS’s evil twin: logistic regression. Fortunately, both scikit-learn and statsmodels provide functionality for logistic regression. However, running both models with just the default parameters returns different results. How can this be?

The answer is regularization: scikit-learn regularizes logistic regressions by default. Regularization helps prevent a model from overfitting its training data, which makes it an appropriate default for a prediction-oriented library. In scikit-learn, the strength is controlled by the parameter C (the inverse of the regularization strength), so setting C to a very large value effectively turns regularization off. Statsmodels does not regularize by default but offers it through fit_regularized().
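A sketch of the difference on synthetic data (setting C very high is just one way to approximate an unregularized fit):

import numpy as np
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm

X = np.random.rand(200, 3)
y = (X @ [1.0, -2.0, 0.5] + np.random.randn(200) * 0.5 > 0.25).astype(int)

# scikit-learn penalizes by default (C=1.0); a huge C effectively disables it
default_fit = LogisticRegression().fit(X, y)
unpenalized_fit = LogisticRegression(C=1e9).fit(X, y)

# statsmodels fits an unpenalized logit by default...
logit_results = sm.Logit(y, sm.add_constant(X)).fit()

# ...and regularization is opt-in
penalized_results = sm.Logit(y, sm.add_constant(X)).fit_regularized()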

Both scikit-learn and statsmodels also allow the user to choose the solver (scikit-learn) or method (statsmodels) by which the coefficients are computed. By default, scikit-learn uses lbfgs (limited-memory Broyden-Fletcher-Goldfarb-Shanno, for those out there interested in algorithm trivia), while statsmodels’ Logit defaults to Newton-Raphson (“newton”). Statsmodels has more choices when it comes to algorithms, though scikit-learn does allow for some options statsmodels does not.
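Reusing X and y from above, the choice looks like this in each library:

# scikit-learn: pick the optimizer through the solver parameter
LogisticRegression(solver="lbfgs").fit(X, y)      # the default
LogisticRegression(solver="liblinear").fit(X, y)

# statsmodels: pick it through the method argument of fit()
sm.Logit(y, sm.add_constant(X)).fit(method="newton")  # the default
sm.Logit(y, sm.add_constant(X)).fit(method="lbfgs")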

I also conducted an efficiency experiment similar to the one done for ordinary least squares. I used the same data sizes and methodology as before; the only difference in the experiments is that I tested each library’s logistic regression class. I set the solver to “lbfgs” on both classes and again used t-tests to compare the times. Again, the Bonferroni-corrected significance level is approximately 0.007.

Again, there does not appear to be a large time difference for small numbers of observations. For 100,000 observations and higher, statsmodels appears to be faster at fitting and predicting logistic regressions. It should be noted that these tests did not change the default value of scikit-learn’s n_jobs parameter, so scikit-learn could perhaps be made to fit logistic regressions faster.

Again, the code for the experiment is available in the accompanying GitHub repository under time_tests.py, and the experiment is carried out in sklearn_statsmodels_time_comp.ipynb.

Other Linear Models

Both modules offer linear models beyond OLS and logistic regression. The offerings of each library can be found in the following links:

Scikit-learn: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model

Statsmodels: https://www.statsmodels.org/stable/regression.html (under “Model Classes”)

In general, scikit-learn’s linear models, such as ridge and lasso regressions, are built for regularization and prediction. They are designed to keep a model from overfitting and thereby to produce more accurate predictions. Scikit-learn also has other modules that assist in calculating prediction metrics and cross-validation metrics, both useful tools for prediction. The linear models offered in statsmodels, on the other hand, are meant to be used when an underlying assumption of OLS is broken; weighted least squares (WLS), for example, helps correct for heteroskedasticity. These models are useful when performing rigorous statistics.
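As a brief sketch of that WLS case, assuming the error variance is known up to a set of weights:

import numpy as np
import statsmodels.api as sm

X = sm.add_constant(np.random.rand(100, 2))
noise_sd = 0.1 + X[:, 1]  # noise grows with the first feature: heteroskedasticity
y = X @ [1.0, 2.0, -1.0] + np.random.randn(100) * noise_sd

# weight each observation by the inverse of its variance
wls_results = sm.WLS(y, X, weights=1.0 / noise_sd**2).fit()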

Conclusion

To summarize some key differences:

· OLS efficiency: scikit-learn is faster at linear regression; the difference is more apparent for larger datasets

· Logistic regression efficiency: employing only a single core, statsmodels is faster at logistic regression

· Visualization: statsmodels provides a summary table

· Solvers/methods: in general, statsmodels provides a greater variety

· Logistic Regression: scikit-learn regularizes by default while statsmodels does not

· Additional linear models: scikit-learn provides more models for regularization, while statsmodels helps correct for broken OLS assumptions

In general, scikit-learn is designed for machine learning, while statsmodels is made for rigorous statistics. Both libraries have their uses. Before selecting one over the other, it is best to consider the purpose of the model: a model built for prediction is best fit using scikit-learn, while statsmodels is best employed for explanatory models. To completely disregard one for the other would do a great disservice to an excellent Python library. I, for one, have room in my heart for more than one linear regression library. Hopefully, all of you do too.
