Efficient leave-one-out cross-validation

Josh Tankard
7 min read · Jun 10, 2023


I always assumed linear regression models utilised gradient descent to converge on the set of weights and bias that best fits the data at hand. And whilst that is often how it is done in practice, in its purest form it is actually a linear algebra problem that can be solved with matrix multiplication (had I paid closer attention in my undergraduate mathematics courses, I probably would have figured this out sooner).

Whilst in 99% of situations there is no need to leave the safety and comfort of Scikit-Learn, I recently came across a specific work-based case where building some regression tools from scratch could actually be beneficial. There are certainly educational benefits to doing so (as I found out for myself), and other situations where some custom functionality is desired. But my motivation was that I had already built out libraries utilising Numba, and whilst I certainly didn't expect to build more optimised regression tooling than the Scikit-Learn wizards, I did feel that jumping in and out of Numba would be too inefficient.

So in my first attempt, I dutifully built a gradient descent procedure that would converge on the optimal weights and bias. In my testing, performance was "good enough", but the results deviated minutely from the Scikit-Learn benchmark. I was fairly happy with this attempt given the minor time investment, yet I did feel something was afoot. Upon a little more research, I came across the closed-form solution for linear regression (and Ridge regression too, for that matter).

import numpy as np


# X -> matrix of independent variables: n_samples x n_features
# first column is set as ones for intercept
# y -> target values vector length n_samples

identity = np.identity(X.shape[1])
identity[0][0] = 0 # set top corner to zero as we don't penalize intercept
penalty = identity * alpha # alpha is 0 for regular linear regression

weights = np.linalg.inv(X.T @ X + penalty) @ X.T @ y

Implementation with Numpy to calculate weights (including intercept).

I have to admit, I am not sure whether it was hugely satisfying to refresh my basic knowledge of matrix multiplication and derive virtually perfect results against the benchmark, or whether there was a tinge of disappointment that the solution is simpler than I had expected and that I had spent so long blissfully ignorant of the fact.
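
For anyone who wants to reproduce that sanity check, below is a minimal sketch comparing the closed-form weights against Scikit-Learn's Ridge on some synthetic data (the data, alpha value and comparison are my own, not from the original experiment):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge


# synthetic data; prepend a column of ones so the intercept is part of the weights
X_raw, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

alpha = 1.0
identity = np.identity(X.shape[1])
identity[0, 0] = 0  # don't penalize the intercept
weights = np.linalg.inv(X.T @ X + identity * alpha) @ X.T @ y

# Ridge also leaves the intercept unpenalized, so the two should agree (up to numerical precision)
benchmark = Ridge(alpha=alpha).fit(X_raw, y)
print(np.allclose(weights[0], benchmark.intercept_))  # intercept
print(np.allclose(weights[1:], benchmark.coef_))      # coefficients

Sanity-checking the closed-form weights against the Scikit-Learn benchmark.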

Cross-validation

Now, to my surprise, with Numba's support for parallelization, this implementation of Ridge regression with cross-validation was, time-wise, even quicker in many instances than the benchmark. That is, until it comes to "leave-one-out" cross-validation (a special case of k-folds where k is equal to the length of the training data).

[Figure: Leave-one-out cross-validation visualised]
[Figure: K-folds cross-validation (five folds)]

For quite a while, I have been intrigued by the following statement in Scikit-Learn's RidgeCV documentation:

cv : int, cross-validation generator or an iterable, default=None

Determines the cross-validation splitting strategy. Possible inputs for cv are:

- None, to use the efficient Leave-One-Out cross-validation

- integer, to specify the number of folds

Why is leave-one-out cross-validation earmarked as "efficient" but regular k-folds is not? Surely, in most cases, k-folds will use one or two orders of magnitude fewer "folds" and thus be much faster?

The answer, although out there, was not as easy to find as the linear regression solution above (hence this post). And to my surprise, there is also a closed-form solution for Ridge regression with leave-one-out cross-validation.

import numpy as np


# X -> matrix of independent variables: n_samples x n_features
# first column is set as ones for intercept
# y -> target values vector length n_samples

identity = np.identity(X.shape[1])
identity[0][0] = 0 # set top corner to zero as we don't penalize intercept
penalty = identity * alpha

inverse = np.linalg.inv(X.T @ X + penalty)
weights = inverse @ X.T @ y
residuals = X @ weights - y # predictions of fitted model - y

h = np.diag(X @ inverse @ X.T)
squared_errors = (residuals / (1 - h)) ** 2
neg_mean_squared_error = -np.mean(squared_errors)

This example shows how to calculate the score for a given L2 penalty, alpha.

The solution, as well as being highly efficient, is also known as "approximate leave-one-out" cross-validation. I suppose the logic is that, in many circumstances, leave-one-out is preferable to regular k-folds because nearly all the data is utilised in each fold. As such, even with some approximation, a highly efficient leave-one-out can be preferable to five- or ten-fold cross-validation for hyper-parameter optimisation.
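
Wrapped into a function, alpha selection might look something like the sketch below (the function name loo_neg_mse and the loop are mine, purely for illustration, and X is assumed to already contain the intercept column):

import numpy as np


def loo_neg_mse(X, y, alpha):
    identity = np.identity(X.shape[1])
    identity[0, 0] = 0  # don't penalize the intercept
    inverse = np.linalg.inv(X.T @ X + identity * alpha)
    weights = inverse @ X.T @ y
    residuals = X @ weights - y
    h = np.diag(X @ inverse @ X.T)
    return -np.mean((residuals / (1 - h)) ** 2)


alphas = np.array([.1, .5, 1., 5., 10.])
scores = [loo_neg_mse(X, y, alpha) for alpha in alphas]
best_alpha = alphas[int(np.argmax(scores))]  # highest (least negative) score wins

Selecting alpha by scoring each candidate with the closed-form leave-one-out error.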

Understanding the closed-form solution

Not fully understanding how this was derived, I thought a little reverse engineering may help.

The first thing to notice is that the same expression appears in both the derivation of the weights and the efficient leave-one-out:

np.linalg.inv(X.T @ X + penalty) @ X.T

What is this term? In regular linear regression we are trying to solve the following for the weights:

y = w0*1 + w1*x1 + w2*x2 + … + wn*xn + e

this can be summarized in matrix form as:

y = Xw + e

Where y, w & e are vectors and X is a matrix. One way to get at the weights is to set the residuals, e, to zero and move X to the other side. But you cannot simply divide by a matrix, and a matrix needs to be square (equal rows and columns) for it to have an inverse. X itself is generally not square, but the product Xt.X (the transpose of X multiplied by X) always is. So we pre-multiply both sides by Xt and then by the inverse of Xt.X, written (Xt.X)^-1:

(Xt.X)^-1.Xt.y = (Xt.X)^-1.Xt.X.w

Now the (Xt.X)^-1.Xt.X on the right cancels to the identity and we are left with:

w = (Xt.X)^-1.Xt.y

(Xt.X)^-1.Xt in this instance is known as the pseudoinverse (which is kind of self-explanatory). Using Numpy, this is equivalent to:

A = np.linalg.inv(X.T @ X) @ X.T
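
A quick numerical check (my own, on random data) confirms this matches Numpy's built-in pseudoinverse when X has full column rank, and that multiplying it by X gives back the identity:

import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))  # tall matrix with full column rank

A_demo = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T
print(np.allclose(A_demo, np.linalg.pinv(X_demo)))   # same as the pseudoinverse
print(np.allclose(A_demo @ X_demo, np.identity(4)))  # A.X recovers the identity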

Then the penalty term is added inside the inverse function to penalize larger weights.

Next, inside the diagonal operation, we have X.A, which gives us a projection matrix (note: if we ignore the penalty, A.X gives us the identity matrix, but X.A, as above, does not). This is described as:

…the influence each response value has on each fitted value.

And then:

The diagonal elements of the projection matrix are the leverages, which describe the influence each response value has on the fitted value for that same observation.

The next step is calculating the error term. Each residual is divided by one minus the leverage of its data point: if a point has a high influence over its own predicted value, then leaving it out of the model results in a larger error when trying to predict it.

Putting it back together again we now have:

# pseudoinverse (with penalty)
A = np.linalg.inv(X.T @ X + penalty) @ X.T
# fitted weights and residuals
weights = A @ y
residuals = X @ weights - y
# projection/influence matrix
P = X @ A
# leverages: influence of each data point on its own predicted value
h = np.diag(P)
# adjust each residual by the leverage of its data point
errors = residuals / (1 - h)
squared_errors = errors ** 2
neg_mean_squared_error = -np.mean(squared_errors)
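
To convince ourselves that the shortcut really does reproduce leave-one-out errors, we can brute-force it: refit without each data point in turn, predict that point and compare against the adjusted residuals. This is a sketch of my own, reusing X, y, penalty, residuals and h from above:

import numpy as np

loo_errors = np.empty(len(y))
for i in range(len(y)):
    mask = np.arange(len(y)) != i  # drop data point i
    w_i = np.linalg.inv(X[mask].T @ X[mask] + penalty) @ X[mask].T @ y[mask]
    loo_errors[i] = X[i] @ w_i - y[i]  # error on the held-out point

print(np.allclose(loo_errors, residuals / (1 - h)))  # matches the closed form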

Leave-one-out (LOO) vs approximate leave-one-out (ALOO)

So now it is time to compare the two. To do so, we will generate some random data and have both implementations choose the best alpha and the score for that alpha.

from sklearn.datasets import make_regression
import numpy as np


n_samples = np.random.randint(100, 250)
n_features = np.random.randint(10, 25)
noise = np.random.randint(1, 10)

X, y = make_regression(
    n_samples=n_samples,
    n_features=n_features,
    noise=noise,
)
alphas = np.array([.1, .5, 1., 5., 10.])

It should be acknowledged that, although we are creating random data, this hardly covers every regression modelling situation, but it should give us a good enough idea. Times are obviously specific to my aging laptop but serve here to give a comparative indication of efficiency.
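
The actual benchmarking code lives in the repo; the sketch below only illustrates the shape of the comparison (the function names and timing loop are mine, not taken from the repo):

import time
import numpy as np


def best_alpha_aloo(X, y, alphas):
    # closed-form leave-one-out: one matrix inversion per alpha
    identity = np.identity(X.shape[1])
    identity[0, 0] = 0
    scores = []
    for alpha in alphas:
        inverse = np.linalg.inv(X.T @ X + identity * alpha)
        residuals = X @ (inverse @ X.T @ y) - y
        h = np.diag(X @ inverse @ X.T)
        scores.append(-np.mean((residuals / (1 - h)) ** 2))
    best = int(np.argmax(scores))
    return alphas[best], scores[best]


def best_alpha_loo(X, y, alphas):
    # explicit leave-one-out: refit n_samples times per alpha
    identity = np.identity(X.shape[1])
    identity[0, 0] = 0
    scores = []
    for alpha in alphas:
        errors = np.empty(len(y))
        for i in range(len(y)):
            mask = np.arange(len(y)) != i
            w = np.linalg.inv(X[mask].T @ X[mask] + identity * alpha) @ X[mask].T @ y[mask]
            errors[i] = X[i] @ w - y[i]
        scores.append(-np.mean(errors ** 2))
    best = int(np.argmax(scores))
    return alphas[best], scores[best]


X_design = np.hstack([np.ones((n_samples, 1)), X])  # add the intercept column

for fn in (best_alpha_loo, best_alpha_aloo):
    start = time.perf_counter()
    alpha, score = fn(X_design, y, alphas)
    print(fn.__name__, alpha, score, f"{time.perf_counter() - start:.3f}s")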

[Results for LOO vs ALOO]

Conclusion

The regular implementation of leave-one-out is hardly a slouch, but the approximate leave-one-out blows it out of the park by around two orders of magnitude. The difference with larger data sets will be even greater.

When it comes to alpha selection, they match every time. And the difference in error score appears so minute that the moniker “approximate” seems entirely unfair! That being said, perhaps there are certain data sets that make it more approximate than others.

Link to repo:

jcatankard/NumbaML: Regression modelling implemented in Numba


Josh Tankard

Based in Barcelona, working remotely. Into data analytics, engineering and experimentation. Personal page: jcatankard.github.io