Computational Linear Algebra: Linear Regression in Health Sector

Health Outcomes with Linear Regression: Predicting Diabetes in a Patient

Monit Sharma
9 min read · Jun 26, 2023

When I tell people that I use Python for computational analysis, they look at me inquisitively: "Isn't Python pretty slow?"

They have a point. Python is an interpreted language, and as such cannot natively perform many operations as quickly as a compiled language such as C or Fortran. There is also the issue of the often misunderstood and much-maligned GIL (global interpreter lock), which calls into question Python's ability to allow true parallel computing.

Many solutions have been proposed:

  1. PyPy, an alternative Python implementation with a just-in-time compiler;
  2. numexpr, which provides optimized performance on certain classes of array expressions from within Python;
  3. weave (now deprecated), which allowed inline inclusion of compiled C/C++ code;
  4. Cython, which provides extra markup so that Python and Python-like code can be compiled to C for fast operations.

But a naysayer might point out: many of these "Python" solutions are in practice not really Python at all, but clever hacks into Fortran or C.

In recent years, new languages like Go and Julia have popped up which try to address some of these issues. Julia in particular has a number of nice properties.

Enter Numba: an attempt to bring JIT compilation cleanly to Python using the LLVM framework.

Let’s see it in action:

Pure Python Version

import numpy as np

def pairwise_python(X, D):
    M = X.shape[0]
    N = X.shape[1]
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)

This is very slow. For an array consisting of 1000 points in three dimensions, execution takes over 12 seconds on my machine.


X = np.random.random((1000, 3))
D = np.empty((1000, 1000))
%timeit pairwise_python(X, D)

1 loops, best of 3: 12.1 s per loop

Numba Version

Once Numba is installed, we add only a single decorator line to our definition above to let Numba compile the function with LLVM:

import numpy as np
from numba import jit

@jit(nopython=True)
def pairwise_numba(X, D):
    M = X.shape[0]
    N = X.shape[1]
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)

I should emphasize that this is the exact same code, except for Numba's jit decorator. The results are pretty astonishing:


X = np.random.random((1000, 3))
D = np.empty((1000, 1000))
%timeit pairwise_numba(X, D)

100 loops, best of 3: 15.5 ms per loop

This is nearly a three-order-of-magnitude speedup, simply by adding a Numba decorator!

Cython Version

For completeness, let’s do the same thing in cython. Cython takes a bit more than just some decorators: there are also type specifiers and other imports required. Additionally, we’ll use the sqrt function from the C math library rather than from numpy. Here's the code:

cimport cython
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def pairwise_cython(double[:, ::1] X, double[:, ::1] D):
    cdef int M = X.shape[0]
    cdef int N = X.shape[1]
    cdef double tmp, d
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = sqrt(d)

Running this shows the Cython version is roughly 1.5x faster than the Numba version:

X = np.random.random((1000, 3))
D = np.empty((1000, 1000))
%timeit pairwise_cython(X, D)

100 loops, best of 3: 9.86 ms per loop

So Numba is nearly 1000 times faster than a pure Python implementation, and only marginally slower than nearly identical Cython code. There are some caveats here: first of all, I have years of experience with Cython, and only an hour's experience with Numba. I've used every optimization I know for the Cython version, and just the basic vanilla syntax for Numba. There are likely ways to tweak the Numba version to make it even faster.

Now let’s get back to our topic.

from sklearn import datasets, linear_model, metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
import math, scipy, numpy as np
from scipy import linalg

Diabetes Dataset

We will use a dataset from patients with diabetes. The data consists of 442 samples and 10 variables (all are physiological characteristics), so it is tall and skinny. The dependent variable is a quantitative measure of disease progression one year after baseline.

This is a classic dataset, famously used by Efron, Hastie, Johnstone, and Tibshirani in their Least Angle Regression paper, and one of the many datasets included with scikit-learn.

data = datasets.load_diabetes()
feature_names=['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
trn,test,y_trn,y_test = train_test_split(data.data, data.target, test_size=0.2)
trn.shape, test.shape

((353, 10), (89, 10))

Linear regression in Scikit Learn

Let’s start by using the sklearn implementation:

regr = linear_model.LinearRegression()
%timeit regr.fit(trn, y_trn)

458 µs ± 62.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

pred = regr.predict(test)

It will be helpful to have some metrics on how good our predictions are. We will look at the root mean squared error (L2 norm) and the mean absolute error (L1 norm).

def regr_metrics(act, pred):
    return (math.sqrt(metrics.mean_squared_error(act, pred)),
            metrics.mean_absolute_error(act, pred))

regr_metrics(y_test, regr.predict(test))

(75.36166834955054, 60.629082113104403)
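For a sense of scale, it can help to compare against a trivial baseline that always predicts the mean of the training targets. This is a hypothetical check, not part of the original analysis; the fitted model should land well below the baseline's errors:

# Hypothetical baseline for context: always predict the mean training target.
baseline_pred = np.full_like(y_test, y_trn.mean())
regr_metrics(y_test, baseline_pred)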

Polynomial Features

trn.shape

(353, 10)

Now, we want to try improving our model’s performance by adding some more features. Currently, our model is linear in each variable, but we can add polynomial features to change this.

poly = PolynomialFeatures(include_bias=False)
trn_feat = poly.fit_transform(trn)
', '.join(poly.get_feature_names_out(feature_names))

‘age, sex, bmi, bp, s1, s2, s3, s4, s5, s6, age², age sex, age bmi, age bp, age s1, age s2, age s3, age s4, age s5, age s6, sex², sex bmi, sex bp, sex s1, sex s2, sex s3, sex s4, sex s5, sex s6, bmi², bmi bp, bmi s1, bmi s2, bmi s3, bmi s4, bmi s5, bmi s6, bp², bp s1, bp s2, bp s3, bp s4, bp s5, bp s6, s1², s1 s2, s1 s3, s1 s4, s1 s5, s1 s6, s2², s2 s3, s2 s4, s2 s5, s2 s6, s3², s3 s4, s3 s5, s3 s6, s4², s4 s5, s4 s6, s5², s5 s6, s6²’

trn_feat.shape

(353, 65)

regr.fit(trn_feat, y_trn)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

regr_metrics(y_test, regr.predict(poly.fit_transform(test)))

(55.747345922929185, 42.836164292252235)

Time is quadratic in the number of features and linear in the number of points, so this will get very slow!

%timeit poly.fit_transform(trn)

635 µs ± 9.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
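To see why, count the output columns: degree-2 PolynomialFeatures with include_bias=False emits the n original features plus all n(n+1)/2 degree-2 terms, so the width of the transformed matrix grows quadratically with n. A quick illustration (not from the original post):

# Number of output features for a degree-2 polynomial expansion (no bias column)
for n in (10, 20, 40, 80):
    print(n, n + n*(n+1)//2)
# 10 65
# 20 230
# 40 860
# 80 3320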

Speeding up feature generation

We would like to speed this up. We will use Numba, a Python library that compiles numerical code into fast machine code at runtime via LLVM.

Numba is a compiler.

Experiments with vectorization and native code

Let’s first get acquainted with Numba, and then we will return to our problem of polynomial features for regression on the diabetes data set.

%matplotlib inline
import math, numpy as np, matplotlib.pyplot as plt
from pandas_summary import DataFrameSummary
from scipy import ndimage
from numba import jit, vectorize, guvectorize, cuda, float32, void, float64

We will show the impact of:

  • Avoiding memory allocations and copies (which are slower than CPU calculations)
  • Better locality
  • Vectorization

If we use numpy on whole arrays at a time, it creates lots of temporaries, and can’t use cache. If we use numba looping through an array item at a time, then we don’t have to allocate large temporary arrays, and can reuse cached data since we’re doing multiple calculations on each array item.

# Untyped and unvectorized
def proc_python(xx, yy):
    zz = np.zeros(nobs, dtype='float32')
    for j in range(nobs):
        x, y = xx[j], yy[j]
        x = x*2 - ( y * 55 )
        y = x + y*2
        z = x + y + 99
        z = z * ( z - .88 )
        zz[j] = z
    return zz

nobs = 10000
x = np.random.randn(nobs).astype('float32')
y = np.random.randn(nobs).astype('float32')
%timeit proc_python(x,y)   # Untyped and unvectorized

49.8 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Numpy

Numpy lets us vectorize this:

# Typed and Vectorized
def proc_numpy(x, y):
    z = np.zeros(nobs, dtype='float32')
    x = x*2 - ( y * 55 )
    y = x + y*2
    z = x + y + 99
    z = z * ( z - .88 )
    return z

np.allclose( proc_numpy(x,y), proc_python(x,y), atol=1e-4 )

True

%timeit proc_numpy(x,y)    # Typed and vectorized

35.9 µs ± 166 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Numba

Numba offers several different decorators. We will try two different ones:

  • @jit: very general
  • @vectorize: no need to write a for loop; useful when operating on vectors of the same size

First, we will use Numba’s jit (just-in-time) compiler decorator, without explicitly vectorizing. This avoids large memory allocations, so we have better locality:

@jit()
def proc_numba(xx, yy, zz):
    for j in range(nobs):
        x, y = xx[j], yy[j]
        x = x*2 - ( y * 55 )
        y = x + y*2
        z = x + y + 99
        z = z * ( z - .88 )
        zz[j] = z
    return zz

z = np.zeros(nobs).astype('float32')
np.allclose( proc_numpy(x,y), proc_numba(x,y,z), atol=1e-4 )

True

%timeit proc_numba(x,y,z)

6.4 µs ± 17.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Now we will use Numba’s vectorize decorator. Numba's compiler optimizes this in a smarter way than what is possible with plain Python and Numpy.

@vectorize
def vec_numba(x, y):
    x = x*2 - ( y * 55 )
    y = x + y*2
    z = x + y + 99
    return z * ( z - .88 )

np.allclose(vec_numba(x,y), proc_numba(x,y,z), atol=1e-4 )

True

%timeit vec_numba(x,y)

5.82 µs ± 14.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Numba is amazing. Look how fast this is!

Numba polynomial features

@jit(nopython=True)
def vec_poly(x, res):
    m, n = x.shape
    feat_idx = 0
    for i in range(n):
        v1 = x[:, i]
        for k in range(m):
            res[k, feat_idx] = v1[k]
        feat_idx += 1
        for j in range(i, n):
            for k in range(m):
                res[k, feat_idx] = v1[k] * x[k, j]
            feat_idx += 1
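Before timing it, it is worth sanity-checking that vec_poly reproduces scikit-learn's features. The two produce the same columns, just in a different order, so a quick (hypothetical, not from the original post) check is to sort each row's values before comparing. The first call also triggers JIT compilation, so it doubles as a warm-up:

# Hypothetical sanity check: same features as PolynomialFeatures, different column order.
x_small = np.asfortranarray(np.random.random((5, 3)))
n_small = x_small.shape[1]
out = np.zeros((5, n_small + n_small*(n_small+1)//2), order='F')
vec_poly(x_small, out)                                    # first call also compiles the function
sk = PolynomialFeatures(include_bias=False).fit_transform(x_small)
np.allclose(np.sort(out, axis=1), np.sort(sk, axis=1))    # True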

Row-Major vs Column-Major Storage

From this blog post by Eli Bendersky:

“The row-major layout of a matrix puts the first row in contiguous memory, then the second row right after it, then the third, and so on. Column-major layout puts the first column in contiguous memory, then the second, etc…. While knowing which layout a particular data set is using is critical for good performance, there’s no single answer to the question which layout ‘is better’ in general.

“It turns out that matching the way your algorithm works with the data layout can make or break the performance of an application.

“The short takeaway is: always traverse the data in the order it was laid out.”

Column-major layout: Fortran, Matlab, R, and Julia

Row-major layout: C, C++, Python, Pascal, Mathematica
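vec_poly reads x one column at a time (v1 = x[:, i]) and fills res column by column, so we want columns to be contiguous in memory, i.e. Fortran order. A quick way to inspect layout in NumPy is via strides and flags; a small illustration (not from the original post):

a = np.arange(12, dtype=np.float64).reshape(3, 4)   # C (row-major) order by default
b = np.asfortranarray(a)                            # same values, column-major order
a.strides, b.strides                                # ((32, 8), (8, 24)): 8-byte steps along the contiguous axis
a.flags['C_CONTIGUOUS'], b.flags['F_CONTIGUOUS']    # (True, True)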

trn = np.asfortranarray(trn)
test = np.asfortranarray(test)
m,n=trn.shape
n_feat = n*(n+1)//2 + n
trn_feat = np.zeros((m,n_feat), order='F')
test_feat = np.zeros((len(y_test), n_feat), order='F')
vec_poly(trn, trn_feat)
vec_poly(test, test_feat)
regr.fit(trn_feat, y_trn)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

regr_metrics(y_test, regr.predict(test_feat))

(55.74734592292935, 42.836164292252306)


%timeit vec_poly(trn, trn_feat)

7.33 µs ± 19.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Recall that this was the time for the scikit-learn PolynomialFeatures implementation, which was written by experts:

%timeit poly.fit_transform(trn)

635 µs ± 9.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

round(635/7.33, 2)

86.63

This is a big deal! Numba is amazing! With a single decorator, we are getting a roughly 87x speed-up over scikit-learn (which was optimized by experts).

Regularization and noise

Regularization is a way to reduce over-fitting and create models that better generalize to new data.

Regularization

Lasso regression uses an L1 penalty, which pushes towards sparse coefficients. The parameter α is used to weight the penalty term. Scikit Learn’s LassoCV performs cross validation with a number of different values for α.
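Concretely, scikit-learn's Lasso minimizes (1 / (2·n_samples)) · ‖y − Xw‖² + α·‖w‖₁ over the coefficients w; larger values of α push more coefficients exactly to zero.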

reg_regr = linear_model.LassoCV(n_alphas=10)
reg_regr.fit(trn_feat, y_trn)

LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
max_iter=1000, n_alphas=10, n_jobs=1, normalize=False, positive=False,
precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
verbose=False)

reg_regr.alpha_

0.0098199431661591518

regr_metrics(y_test, reg_regr.predict(test_feat))

(50.0982471642817, 40.065199085003101)

Noise

Now we will add some noise to the training labels:

idxs = np.random.randint(0, len(trn), 10)
y_trn2 = np.copy(y_trn)
y_trn2[idxs] *= 10 # label noise
regr = linear_model.LinearRegression()
regr.fit(trn, y_trn)
regr_metrics(y_test, regr.predict(test))

(51.1766253181518, 41.415992803872754)

regr.fit(trn, y_trn2)
regr_metrics(y_test, regr.predict(test))

(62.66110319520415, 53.21914420254862)
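Those few corrupted labels noticeably hurt ordinary least squares. A robust alternative is Huber regression, which uses a squared loss for small residuals and a linear loss for large ones, so outliers cannot dominate the fit: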

hregr = linear_model.HuberRegressor()
hregr.fit(trn, y_trn2)
regr_metrics(y_test, hregr.predict(test))

(51.24055602541746, 41.670840571376822)
