Linear regression parameter confidence intervals

Josh Tankard
5 min readJun 14, 2023

Sometimes when we build linear regression models, we would like to know the confidence intervals for our intercepts and coefficients. This could be for the following reasons and more:

  1. Estimation: provide a range of plausible values for the coefficients, helping to quantify uncertainty in the estimates.
  2. Hypothesis testing: Non-zero intervals indicate a significant relationship between variables.
  3. Model evaluation: Interval width reflects the precision of the coefficients, aiding in model selection.
  4. Comparison: Overlapping intervals suggest similar effects, while non-overlapping intervals indicate significant differences between predictors.

Scikit-Learn does not include confidence intervals by default but statsmodels does. So for my own regression model implementation using Numba, I will follow the statsmodel approach as my benchmark.

Following the statsmodel library isn’t super straightforward, to me at least, so I have sifted through, working backwards, and hopefully found the key elements.

Finding the solution

In the documentation it says:

The confidence interval is based on Student’s t-distribution.

In statistics, confidence intervals for the mean (or other statistic) is given as:

mean ± SE * z_alpha

Z_alpha is the critical value based on confidence interval. But is the instances of small sample size (as can sometimes be the case in regression modelling) we use the Student’s t-distribution as an approximation. SE is the standard error of the mean, i.e. the deviation of the sample means.

There is no further details but I come across the source code here and the only addition to this is pretty much:

ci = super(RegressionResults, self).conf_int(alpha=alpha, cols=cols)

Not very helpful! But I have found the library for the RegressionResults class on GitHub which now gives us:

lower = params - q * bse
upper = params + q * bse

This is starting to resemble the definition we gave earlier — so let’s dive a little deeper. We can see that q represents the Student’s t-value coming from:

dist = stats.t
q = dist.ppf(1 - alpha / 2, df_resid)

Where stats.t being the SciPy package for the Student t random variable. Alpha relates to the significance level, so a value of 0.05 equates to a two-tail 95% significance level. Not sure what df_resid is so I go back to the statsmodel site and see that it refers to:

The residual degree of freedom.

The dof is defined as the number of observations minus the rank of the regressor matrix.

Here number of observations refers to rows of data in the training set. For the rank of the matrix, I have found this helpful definition:

Why Find the Rank?

The rank tells us a lot about the matrix.

It is useful in letting us know if we have a chance of solving a system of linear equations: when the rank equals the number of variables we may be able to find a unique solution.

So essentially rank is telling us how many features are “linearly independent” and not just transformations of other features (e.g. one feature is 2x of another). In this case, for most well-formed regression data, rank will simply equal the number of features.

We know that the higher the degrees of freedom the closer the Student t-distribution resembles the normal distribution. So it makes sense that for more observations and fewer features to result in higher degrees of freedom.

Now we should figure out what is “bse”. In the documentation it is referred to as:

The standard errors of the parameter estimates.

This is calculated in the library as follows.

np.sqrt(np.diag(self.cov_params()))

Here cov_params is:

The variance/covariance matrix can be of a linear contrast of the
estimated parameters or all params multiplied by scale which will
usually be an estimate of sigma². Scale is assumed to be a scalar.

In a variance/covariance matrix; the diagonal values (Cjj) represent variance of the jth elements and the rest (Cij), represent the covariance between the ith and jth elements. Here, np.diag only takes the diagonal elements (Cjj) so we are only interested in variance of each feature rather than their covariance with other features for calculating the standard error.

What is scale?:

scale will typically be the mean square error from the estimated model (sigma²)

However, looking a bit more closely at the code, we can see that the residual mean squared error is used. This is calculated like the mean squared error but:

The sum of squared residuals divided by the residual degrees of freedom.

Solution

So now we have all the parts of the puzzle. Let’s put it together:

from scipy import stats
import numpy as np

alpha # the significance level
X # regression matrix with intercept, size (n_samples, n_features)
y # target vector, size n_samples
y_fitted # fitted values
params # fitted intercept + coefficient weights

n_samples = X.shape[0]
rank = np.linalg.matrix_rank(X)
df_resid = n_samples - rank # degrees of freedom

residuals = y_fitted - y
scale = np.sum(residuals ** 2) / df_resid # resid mse

C = np.linalg.inv(X.T @ X) * scale # covar matrix
v = np.diag(C) # parameter variances
bse = np.sqrt(v) # standard errors

q = stats.t.ppf(1 - alpha / 2, df_resid) # t-value
lower = params - q * bse
upper = params + q * bse

Results

Now let’s verify this implementation compared to statsmodels:

from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
import statsmodels.api as sm
from scipy import stats
import pandas as pd
import numpy as np

# create random data
X, y = make_regression(n_samples=150, n_features=15, noise=10)
significance_level = .05

# fit statsmodels
sm_model = sm.OLS(y, sm.add_constant(X))
results = sm_model.fit()
sm_ci = results.conf_int(alpha=significance_level)

# fit scikit-learn
sk_model = LinearRegression(fit_intercept=True)
sk_model.fit(X, y)
y_fitted = sk_model.predict(X)

# add intercept
intercept = np.ones(shape=(X.shape[0], 1), dtype=np.float64)
X = np.concatenate([intercept, X], axis=1)

# residuals degrees of freedom
n_samples = X.shape[0]
rank = np.linalg.matrix_rank(X)
df_resid = n_samples - rank

# t-value
q = stats.t.ppf(1 - significance_level / 2, df_resid)

# scale
residuals = y_fitted - y
scale = np.sum(residuals ** 2) / df_resid

# standard errors
C = np.linalg.inv(X.T @ X) * scale
v = np.diag(C)
bse = np.sqrt(v)

# prepend intercept to coeffs
params = np.r_[m.intercept_, m.coef_]

labels = ['intercept'] + list(map(lambda c: 'coef_' + str(c), range(1, X.shape[1])))

results = pd.DataFrame(
{'label': labels,
'lower_bound': params - q * bse,
'params': params + q * bse,
'upper_bound': upper,
'lower_statsmodels': sm_ci[:, 0],
'upper_statsmodels': sm_ci[:, 1]
}
)
Results match statsmodels to the given level of precision

Link to GitHub: jcatankard/NumbaML: Regression modelling implemented in Numba

--

--

Josh Tankard

Based on Barcelona, working remotely. Into data analytics, engineering and experimentation. Personal page: jcatankard.github.io