Emmanuel
5 min read · Jan 17, 2019


Re: 8 ways to do linear regression in python

Thanks a lot for this article. It inspired me to look a bit deeper and investigate other options that were not listed.

Here is my output on an Intel i7 @ 2.8 GHz:

Overall, the results are fairly similar to the original post's, though not identical.
I added methods using PyTorch for simple matrix inversion and the Moore-Penrose pseudo-inverse (pinverse).

  • PyTorch outperforms the numpy method for simple inversion by ~1.5x, but not for the pinverse version.
  • scipy's linregress basically wins.
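For reference, here is a minimal sketch of what the PyTorch methods I added look like, fitting slope and intercept via the normal equations. The data here is a toy noisy line for illustration only, not the benchmark dataset:

```python
import numpy as np
import torch

# Toy data: y = 2x + 1 plus noise (illustrative, not the benchmark dataset)
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=1000)

# Design matrix with an intercept column
A = np.column_stack([x, np.ones_like(x)])

# NumPy: normal equations (A^T A) w = A^T y solved via explicit inversion
w_np = np.linalg.inv(A.T @ A) @ A.T @ y

# PyTorch: the same normal equations, with simple inversion and pseudo-inverse
At = torch.from_numpy(A)
yt = torch.from_numpy(y)
w_inv = torch.inverse(At.T @ At) @ At.T @ yt
w_pinv = torch.pinverse(At) @ yt   # Moore-Penrose pseudo-inverse

print(w_np)  # slope ≈ 2.0, intercept ≈ 1.0; the torch versions agree
```

All three give the same coefficients; the benchmark question is only which path gets there fastest.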

Note a few things:

  • I used the %timeit jupyter magic to time the functions multiple times, to get an average, standard deviation, and a smoother output. The difference with the original post is likely due to noise.
  • I was a little put off by the test data used in the original post: a line, even with some added noise, will always fit well to a line. So I used a more realistic dataset, with X and Y drawn from normal distributions with different scales, which looked like this:
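The exact generator isn't shown above, but a dataset of that shape can be sketched like this (the seed, sizes, and scales are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Independent normal draws with different scales: unlike a noisy line,
# this cloud has no strong linear structure, so the fit is non-trivial.
x = rng.normal(loc=0.0, scale=1.0, size=n)
y = rng.normal(loc=0.0, scale=10.0, size=n)

# In a notebook, each method is then timed with the %timeit magic, e.g.:
#   %timeit np.linalg.lstsq(np.column_stack([x, np.ones_like(x)]), y, rcond=None)
```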

A few other things I discovered while looking into this:

  • sklearn.LinearRegression actually uses np.linalg.lstsq behind the scenes. It’s a simple wrapper, which has some overhead for simple linear regression.
  • similarly, statsmodels.OLS uses the Moore-Penrose pseudo-inverse method, with some overhead from validating the inputs before running the math, including support for np arrays, pandas DataFrames, or plain lists.
  • All in all, every single one of these Python methods ends up calling into the LAPACK library. The Moore-Penrose route uses an SVD (Singular Value Decomposition) via the LAPACK gesdd function, while simple inversion uses an LU factorization via the getrf function, so the difference comes down to the prior formatting/checking of the data, plus the difference between these two LAPACK routines.
  • PyTorch makes better use of the LAPACK library, as it is optimized down to the C++ API level for simple inversion, but surprisingly is not much better for pinverse.
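The two LAPACK-backed routes can be seen side by side in a short sketch: the SVD-based pseudo-inverse and the LU-based explicit inverse of the normal equations give the same least-squares solution on a full-rank problem, so the timing differences above really are about the decomposition, not the answer (the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(500, 3))
b = rng.normal(size=500)

# SVD route (what the Moore-Penrose / OLS path reduces to, via *gesdd)
w_svd = np.linalg.pinv(A) @ b

# LU route (explicit inversion of the normal equations, via *getrf)
w_lu = np.linalg.inv(A.T @ A) @ A.T @ b

# Dedicated least-squares driver, for reference
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(w_svd, w_lu), np.allclose(w_svd, w_lstsq))  # True True
```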

I also wanted to test the PyTorch methods on GPU, so I ran the same benchmark on a dual Xeon E2650 @ 2.6GHz with a GTX 780 (an old GPU, but I don't have better right now).

Here are the results:

First run
Second run

Here we see some interesting things:

  • The GPU code is super slow on the first run, as it needs to be primed (CUDA context initialization). On subsequent runs, we don't see this.
  • The PyTorch pinverse method also fails at 2 million data points: the GPU runs out of memory. This method also seems much slower on GPU than on CPU, which is odd.
  • For this simple linear regression, even at 10 million data points, the speedup from the GPU is overshadowed by the overhead of the CPU-to-GPU data transfer back and forth.
  • PyTorch still outperforms numpy for simple matrix inversion.
  • PyTorch pinverse on CPU performs better than on the i7, and better than the numpy version, as well as OLS.
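Two details matter when timing GPU code like this: the first call pays one-time initialization cost, and CUDA kernels run asynchronously, so you must synchronize before reading the clock. A minimal sketch (the helper name `time_inverse` and sizes are my own):

```python
import time
import torch

def time_inverse(n=1000, device="cpu"):
    """Time one torch.inverse call, excluding the first 'priming' run."""
    A = torch.randn(n, n, device=device)
    torch.inverse(A)                  # warm-up: first call pays context/init cost
    if device == "cuda":
        torch.cuda.synchronize()      # kernels are async; drain them before timing
    t0 = time.perf_counter()
    torch.inverse(A)
    if device == "cuda":
        torch.cuda.synchronize()      # wait for the timed kernel to actually finish
    return time.perf_counter() - t0

print(time_inverse(500, "cpu"))
if torch.cuda.is_available():
    print(time_inverse(500, "cuda"))
```

Skipping the warm-up or the synchronize calls is exactly what produces the misleading "first run" numbers above.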

Now, one caveat of this study is that it doesn’t really demonstrate what would happen with multi-linear regression.

When looking at multi-linear regression, things are a bit different; we’re looking at wide matrices and a more complex problem.

So I benchmarked multi-linear regression with a dataset of 153 columns and a variable number of rows up to 100k (not 10 million as before for the simple linear regression).
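The multi-linear setup is the same math with a wide design matrix. A sketch with the 153-column shape (fewer rows here than the 100k benchmark, so it runs quickly; the data is synthetic):

```python
import numpy as np
import torch

rng = np.random.default_rng(0)
rows, cols = 10_000, 153          # 153 columns as in the benchmark; fewer rows for a quick run
X = rng.normal(size=(rows, cols))
y = rng.normal(size=rows)

Xt = torch.from_numpy(X)
yt = torch.from_numpy(y)

# Simple-inversion route: fails outright if X^T X is singular
w = torch.inverse(Xt.T @ Xt) @ Xt.T @ yt

# Moore-Penrose route: always defined, even for rank-deficient X
w_p = torch.pinverse(Xt) @ yt

print(w.shape)  # 153 coefficients
```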

Here are the graphs:

On the Intel i7 2.8GHz (no GPU here) (100k rows x 153 columns)

A few things to note:

  • When a curve drops to 0, it's because the method errored (this happens with the simple-inversion methods when the matrix was not invertible).
  • PyTorch matrix inversion is almost 3x faster than the equivalent numpy version on this CPU for multi-regression.
  • statsmodels.OLS, which uses the Moore-Penrose matrix-inverse method, is now similar in time to that method: evidence that the overhead seen on the simple regression is amortized once the problem gets more complex.
  • The PyTorch pinverse method looks similar here (more on this in a bit), which is surprising considering the difference seen with simple inversion.
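The zeroed-out points correspond to a `LinAlgError` from the simple-inversion path, which can be reproduced with a rank-deficient design matrix. A sketch (the helper name `fit_simple_inverse` is my own):

```python
import numpy as np

def fit_simple_inverse(X, y):
    """Normal-equation fit; returns None when X^T X is singular
    (these are the points that drop to 0 on the graphs)."""
    try:
        return np.linalg.inv(X.T @ X) @ X.T @ y
    except np.linalg.LinAlgError:
        return None

y = np.arange(10.0)

# Rank-deficient design: two identical columns make X^T X singular
X_bad = np.ones((10, 2))
print(fit_simple_inverse(X_bad, y))   # None

# A full-rank design works fine
X_ok = np.column_stack([np.arange(10.0), np.ones(10)])
print(fit_simple_inverse(X_ok, y))
```

The pinverse-based methods keep going in this situation, which is why only the simple-inversion curves hit zero.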

On the Dual Xeon with GPU, I get:

  • As before, OLS and Moore-Penrose with numpy converge to give about similar results
  • However, the PyTorch pinverse version now outperforms the other two by over 3x, and the GPU version is now faster than the CPU one, for a total of more than 4x over the numpy version.

Conclusions

  • First: simple linear regression and multi-linear regression are different things, so we can't extrapolate the simple results to the multi-regression case that easily.
  • Second, hardware also makes a big difference. Obviously the GPU helps for the larger datasets, but even on CPU: on the i7, OLS and pinverse (whatever the implementation) were timing about the same, while on the higher-end Xeon, PyTorch is about 3x faster.

So, my final conclusion is: do your own benchmark!
