Computational Linear Algebra: How to implement Linear Regression

How do we implement linear regression from a linear algebra point of view?

Monit Sharma
9 min read · Jun 27, 2023

In the previous lesson, we computed the least squares linear regression for a diabetes dataset using scikit-learn's implementation. Today, we will look at how we could write our own implementation.

from sklearn import datasets, linear_model, metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
import math, scipy, numpy as np
from scipy import linalg
np.set_printoptions(precision=6)
data = datasets.load_diabetes()
feature_names=['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
trn,test,y_trn,y_test = train_test_split(data.data, data.target, test_size=0.2)
trn.shape, test.shape

((353, 10), (89, 10))

def regr_metrics(act, pred):
    return (math.sqrt(metrics.mean_squared_error(act, pred)),
            metrics.mean_absolute_error(act, pred))

How did sklearn do it?

How is sklearn doing this? By checking the source code, you can see that in the dense case it calls scipy.linalg.lstsq, which in turn calls one of these LAPACK routines:

  • gelsd: uses SVD and a divide-and-conquer method
  • gelsy: uses a complete orthogonal factorization (QR with column pivoting)
  • gelss: uses SVD

Scipy Sparse Least Squares

We will not get into too much detail about the sparse version of least squares. Here is a bit of info if you are interested:

Scipy sparse lsqr uses an iterative method called Golub and Kahan bidiagonalization.

From the scipy.sparse.linalg.lsqr documentation:

Preconditioning is another way to reduce the number of iterations. If it is possible to solve a related system M*x = b efficiently, where M approximates A in some helpful way (e.g. M - A has low rank or its elements are small relative to those of A), LSQR may converge more rapidly on the system A*M^(-1)*z = b, after which x can be recovered by solving M*x = z.

If A is symmetric, LSQR should not be used! Alternatives are the symmetric conjugate-gradient method (cg) and/or SYMMLQ. SYMMLQ is an implementation of symmetric cg that applies to any symmetric A and will converge more rapidly than LSQR. If A is positive definite, there are other implementations of symmetric cg that require slightly less work per iteration than SYMMLQ (but will take the same number of iterations).
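
For completeness, here is a minimal sketch of calling the sparse solver; the random test system and the names A_sp, b_sp are purely illustrative, not from the original:

from scipy import sparse
from scipy.sparse.linalg import lsqr

# a small random sparse system, only to show the call signature
A_sp = sparse.random(1000, 50, density=0.01, format='csr', random_state=0)
b_sp = np.random.randn(1000)
x_sp = lsqr(A_sp, b_sp)[0]  # the first element of the returned tuple is the solution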

linalg.lstsq

The sklearn implementation handled adding a constant term (since the y-intercept is presumably not 0 for the line we are learning) for us. We will need to do that by hand now:

trn_int = np.c_[trn, np.ones(trn.shape[0])]
test_int = np.c_[test, np.ones(test.shape[0])]

Since linalg.lstsq lets us specify which LAPACK routine we want to use, let's try them all and do some timing comparisons:

%timeit coef, _,_,_ = linalg.lstsq(trn_int, y_trn, lapack_driver="gelsd")

290 µs ± 9.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit coef, _,_,_ = linalg.lstsq(trn_int, y_trn, lapack_driver="gelsy")

140 µs ± 91.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit coef, _,_,_ = linalg.lstsq(trn_int, y_trn, lapack_driver="gelss")

199 µs ± 228 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Naive Solution

The most direct approach solves the normal equations (AᵀA)x = Aᵀb by explicitly forming the inverse of AᵀA:

def ls_naive(A, b):
    return np.linalg.inv(A.T @ A) @ A.T @ b

%timeit coeffs_naive = ls_naive(trn_int, y_trn)

45.8 µs ± 4.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

coeffs_naive = ls_naive(trn_int, y_trn)
regr_metrics(y_test, test_int @ coeffs_naive)

(57.94102134545707, 48.053565198516438)

Normal Equations (Cholesky)

Since AᵀA is symmetric positive definite, we can factor it as AᵀA = RᵀR with R upper triangular, then solve two triangular systems instead of inverting anything:

A = trn_int
b = y_trn
AtA = A.T @ A
Atb = A.T @ b

Warning: NumPy and SciPy default to different triangles for Cholesky: np.linalg.cholesky returns the lower-triangular factor, while scipy.linalg.cholesky returns the upper-triangular one.
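
A quick check of that difference (this snippet is not in the original notebook; it just confirms the two defaults are transposes of each other):

# NumPy returns the lower-triangular factor; SciPy returns the upper-triangular one by default
L_np = np.linalg.cholesky(AtA)     # AtA == L_np @ L_np.T
U_sp = scipy.linalg.cholesky(AtA)  # AtA == U_sp.T @ U_sp
np.allclose(L_np, U_sp.T)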

R = scipy.linalg.cholesky(AtA)
np.set_printoptions(suppress=True, precision=4)
R
np.linalg.norm(AtA - R.T @ R)

4.5140158187158533e-16

w = scipy.linalg.solve_triangular(R, Atb, lower=False, trans='T')

It's always good to check that our result is what we expect it to be, in case we passed the wrong parameters, the function didn't return what we thought, or the docs are outdated:

np.linalg.norm(R.T @ w - Atb)

1.1368683772161603e-13

coeffs_chol = scipy.linalg.solve_triangular(R, w, lower=False)
np.linalg.norm(R @ coeffs_chol - w)

6.861429794408013e-14

def ls_chol(A, b):
    R = scipy.linalg.cholesky(A.T @ A)
    w = scipy.linalg.solve_triangular(R, A.T @ b, trans='T')
    return scipy.linalg.solve_triangular(R, w)

%timeit coeffs_chol = ls_chol(trn_int, y_trn)

111 µs ± 272 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

coeffs_chol = ls_chol(trn_int, y_trn)
regr_metrics(y_test, test_int @ coeffs_chol)

(57.9410213454571, 48.053565198516438)

QR Factorization

With the reduced factorization A = QR, where Q has orthonormal columns and R is upper triangular, the least squares problem reduces to the triangular system Rx = Qᵀb:

def ls_qr(A, b):
    Q, R = scipy.linalg.qr(A, mode='economic')
    return scipy.linalg.solve_triangular(R, Q.T @ b)

%timeit coeffs_qr = ls_qr(trn_int, y_trn)

205 µs ± 264 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

coeffs_qr = ls_qr(trn_int, y_trn)
regr_metrics(y_test, test_int @ coeffs_qr)

(57.94102134545711, 48.053565198516452)

SVD

The SVD gives the pseudo-inverse: with the reduced SVD A = UΣVᵀ, the least squares solution is x = VΣ⁻¹Uᵀb, which is exactly what the code below computes.

def ls_svd(A, b):
    m, n = A.shape
    U, sigma, Vh = scipy.linalg.svd(A, full_matrices=False, lapack_driver='gesdd')
    w = (U.T @ b) / sigma
    return Vh.T @ w

%timeit coeffs_svd = ls_svd(trn_int, y_trn)

1.11 ms ± 320 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Running the same timing a second time gives a noticeably faster result:

%timeit coeffs_svd = ls_svd(trn_int, y_trn)

266 µs ± 8.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

coeffs_svd = ls_svd(trn_int, y_trn)
regr_metrics(y_test, test_int @ coeffs_svd)

(57.941021345457244, 48.053565198516687)
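
As a sanity check (not part of the original comparison), NumPy's pseudo-inverse, which is also computed from the SVD, should give essentially the same coefficients:

coeffs_pinv = np.linalg.pinv(trn_int) @ y_trn
np.allclose(coeffs_pinv, coeffs_svd)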

Random Sketching Technique for Least Squares Regression

Linear Sketching (Woodruff)

  1. Sample an r × m random matrix S, with r ≪ m
  2. Compute S A and S b
  3. Find the exact solution x to the smaller regression (S A) x = S b (a minimal sketch follows below)
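
Here is a minimal sketch of the idea on our data, using a Gaussian sketching matrix; the helper name ls_sketch and the sketch size r = 5*n are illustrative choices, not from the original notebook:

def ls_sketch(A, b, r=None):
    # S is r x m with r much smaller than m; Gaussian entries are one common choice
    m, n = A.shape
    if r is None:
        r = 5 * n  # assumption: a modest oversampling factor
    S = np.random.randn(r, m) / np.sqrt(r)
    return scipy.linalg.lstsq(S @ A, S @ b)[0]

coeffs_sketch = ls_sketch(trn_int, y_trn)
regr_metrics(y_test, test_int @ coeffs_sketch)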

Timing Comparison

import timeit
import pandas as pd

def scipylstq(A, b):
    return scipy.linalg.lstsq(A, b)[0]

row_names = ['Normal Eqns- Naive',
             'Normal Eqns- Cholesky',
             'QR Factorization',
             'SVD',
             'Scipy lstsq']

name2func = {'Normal Eqns- Naive': 'ls_naive',
             'Normal Eqns- Cholesky': 'ls_chol',
             'QR Factorization': 'ls_qr',
             'SVD': 'ls_svd',
             'Scipy lstsq': 'scipylstq'}

m_array = np.array([100, 1000, 10000])
n_array = np.array([20, 100, 1000])
index = pd.MultiIndex.from_product([m_array, n_array], names=['# rows', '# cols'])
pd.options.display.float_format = '{:,.6f}'.format
df = pd.DataFrame(index=row_names, columns=index)
df_error = pd.DataFrame(index=row_names, columns=index)

# %%prun
for m in m_array:
    for n in n_array:
        if m >= n:
            x = np.random.uniform(-10, 10, n)
            A = np.random.uniform(-40, 40, [m, n])  # removed np.asfortranarray
            b = np.matmul(A, x) + np.random.normal(0, 2, m)
            for name in row_names:
                fcn = name2func[name]
                t = timeit.timeit(fcn + '(A,b)', number=5, globals=globals())
                df.loc[name, (m, n)] = t  # .loc replaces the removed DataFrame.set_value
                coeffs = locals()[fcn](A, b)
                reg_met = regr_metrics(b, A @ coeffs)
                df_error.loc[name, (m, n)] = reg_met[0]

df
df_error
store = pd.HDFStore('least_squares_results.h5')
store['df'] = df

Notes

I used the %prun magic to profile my code.

Alternative: least absolute deviation (L1 regression)

  • Less sensitive to outliers than least squares.
  • No closed-form solution, but it can be solved with linear programming (see the sketch below).
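
A sketch of that linear program using scipy.optimize.linprog; the helper ls_l1 and its exact formulation are illustrative, not from the original article:

from scipy.optimize import linprog

def ls_l1(A, b):
    # minimize sum_i t_i  subject to  -t <= A x - b <= t, with variables z = [x, t]
    m, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(m)])
    A_ub = np.vstack([np.hstack([ A, -np.eye(m)]),
                      np.hstack([-A, -np.eye(m)])])
    b_ub = np.concatenate([b, -b])
    bounds = [(None, None)] * n + [(0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method='highs')
    return res.x[:n]

coeffs_l1 = ls_l1(trn_int, y_trn)
regr_metrics(y_test, test_int @ coeffs_l1)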

Conditioning & stability

Condition Number

The condition number measures how much the output of a problem changes in response to small changes in the input.

Question: Why do we care about behavior with small changes to the input in numerical linear algebra?

The relative condition number is defined by κ = sup over small perturbations δx of (‖δf‖ / ‖f(x)‖) / (‖δx‖ / ‖x‖), where δf = f(x + δx) − f(x).

Conditioning: perturbation behavior of a mathematical problem (e.g. least squares)

Stability: perturbation behavior of an algorithm used to solve that problem on a computer (e.g. least squares algorithms, householder, back substitution, gaussian elimination)

Conditioning example

The problem of computing eigenvalues of a non-symmetric matrix is often ill-conditioned

A = [[1, 1000], [0, 1]]
B = [[1, 1000], [0.001, 1]]
wA, vrA = scipy.linalg.eig(A)
wB, vrB = scipy.linalg.eig(B)
wA, wB

(array([ 1.+0.j, 1.+0.j]),
array([ 2.00000000e+00+0.j, -2.22044605e-16+0.j]))

Condition Number of a Matrix

The condition number of a matrix is κ(A) = ‖A‖ ‖A⁻¹‖; for the 2-norm this is the ratio of the largest to the smallest singular value, σ_max / σ_min.
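
A quick numerical check (not in the original) that np.linalg.cond matches the ratio of singular values:

# for the 2-norm, cond(A) = sigma_max / sigma_min
M = np.random.uniform(-1, 1, (5, 5))       # a small random test matrix
s = scipy.linalg.svd(M, compute_uv=False)  # singular values, in decreasing order
np.linalg.cond(M), s[0] / s[-1]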

Loose ends from last time

Full vs Reduced Factorization

SVD

(Diagrams of the full vs. reduced SVD can be found in Trefethen & Bau, Numerical Linear Algebra.)

QR Factorization exists for ALL matrices

Just like with SVD, there are full and reduced versions of the QR factorization.
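
A shape check makes the full vs. reduced distinction concrete (the 10×4 test matrix here is just an example):

M = np.random.randn(10, 4)                                 # a tall example matrix
U_f, s_f, Vh_f = scipy.linalg.svd(M)                       # full SVD: U is 10x10
U_r, s_r, Vh_r = scipy.linalg.svd(M, full_matrices=False)  # reduced SVD: U is 10x4
Q_f, R_f = scipy.linalg.qr(M)                              # full QR: Q is 10x10, R is 10x4
Q_r, R_r = scipy.linalg.qr(M, mode='economic')             # reduced QR: Q is 10x4, R is 4x4
U_f.shape, U_r.shape, Q_f.shape, Q_r.shape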

Matrix Inversion is Unstable

from scipy.linalg import hilbert


n = 14
A = hilbert(n)
x = np.random.uniform(-10,10,n)
b = A @ x
A_inv = np.linalg.inv(A)
np.linalg.norm(np.eye(n) - A @ A_inv)

5.0516495470543212

np.linalg.cond(A)

2.2271635826494112e+17

A @ A_inv
row_names = ['Normal Eqns- Naive',
             'QR Factorization',
             'SVD',
             'Scipy lstsq']

name2func = {'Normal Eqns- Naive': 'ls_naive',
             'QR Factorization': 'ls_qr',
             'SVD': 'ls_svd',
             'Scipy lstsq': 'scipylstq'}

pd.options.display.float_format = '{:,.9f}'.format
df = pd.DataFrame(index=row_names, columns=['Time', 'Error'])

for name in row_names:
    fcn = name2func[name]
    t = timeit.timeit(fcn + '(A,b)', number=5, globals=globals())
    coeffs = locals()[fcn](A, b)
    df.loc[name, 'Time'] = t  # .loc replaces the removed DataFrame.set_value
    df.loc[name, 'Error'] = regr_metrics(b, A @ coeffs)[0]

SVD is best here!

DO NOT RERUN

df

Another reason not to take the inverse

Even if A is incredibly sparse, A^(−1) is generally dense. For large matrices, A^(−1) could be so dense as to not fit in memory.
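
A small illustration (the 8×8 second-difference matrix is chosen here just for the demo): it is tridiagonal, yet its inverse has no zero entries at all.

# tridiagonal (mostly zero) matrix whose inverse is completely dense
T = 2*np.eye(8) - np.eye(8, k=1) - np.eye(8, k=-1)
np.count_nonzero(T), np.count_nonzero(np.linalg.inv(T))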

Why Cholesky Factorization is Fast

Solving the normal equations takes roughly mn² flops to form AᵀA plus about n³/3 for the Cholesky factorization, compared with roughly 2mn² − 2n³/3 for Householder QR, so for m ≫ n it does about half the work of QR.

A Case Where QR is the Best

m = 100
n = 15
t = np.linspace(0, 1, m)

# Vandermonde matrix
A = np.stack([t**i for i in range(n)], 1)
b = np.exp(np.sin(4*t))

# This will turn out to normalize the solution to be 1
b /= 2006.787453080206
from matplotlib import pyplot as plt
%matplotlib inline

plt.plot(t, b)

Check that we get 1:

1 - ls_qr(A, b)[14]

1.4137685733217609e-07

Bad condition number:

kappa = np.linalg.cond(A); kappa

5.827807196683593e+17

row_names = ['Normal Eqns- Naive',
             'QR Factorization',
             'SVD',
             'Scipy lstsq']

name2func = {'Normal Eqns- Naive': 'ls_naive',
             'QR Factorization': 'ls_qr',
             'SVD': 'ls_svd',
             'Scipy lstsq': 'scipylstq'}

pd.options.display.float_format = '{:,.9f}'.format
df = pd.DataFrame(index=row_names, columns=['Time', 'Error'])

for name in row_names:
    fcn = name2func[name]
    t = timeit.timeit(fcn + '(A,b)', number=5, globals=globals())
    coeffs = locals()[fcn](A, b)
    df.loc[name, 'Time'] = t
    df.loc[name, 'Error'] = np.abs(1 - coeffs[-1])

df

The solution for least squares via the normal equations is unstable in general, although stable for problems with small condition numbers.

Low-rank

m = 100
n = 10
x = np.random.uniform(-10,10,n)
A2 = np.random.uniform(-40,40, [m, int(n/2)]) # removed np.asfortranarray
A = np.hstack([A2, A2])
A.shape, A2.shape

((100, 10), (100, 5))

b = A @ x + np.random.normal(0,1,m)

row_names = ['Normal Eqns- Naive',
             'QR Factorization',
             'SVD',
             'Scipy lstsq']

name2func = {'Normal Eqns- Naive': 'ls_naive',
             'QR Factorization': 'ls_qr',
             'SVD': 'ls_svd',
             'Scipy lstsq': 'scipylstq'}

pd.options.display.float_format = '{:,.9f}'.format
df = pd.DataFrame(index=row_names, columns=['Time', 'Error'])

for name in row_names:
    fcn = name2func[name]
    t = timeit.timeit(fcn + '(A,b)', number=5, globals=globals())
    coeffs = locals()[fcn](A, b)
    df.loc[name, 'Time'] = t
    df.loc[name, 'Error'] = regr_metrics(b, A @ coeffs)[0]

df

Comparison

Our results from above:

df

Normal equations/Cholesky is the fastest when it works. However, Cholesky can only be used on symmetric positive definite matrices, and the normal equations approach is unstable for matrices with high condition numbers or low rank.

Linear regression via QR has been recommended by numerical analysts as the standard method for years. It is natural, elegant, and good for “daily use”.

END!
