Demystifying the Math Behind Regression Using Matrices

Jason Hayes
8 min read · Jan 23, 2023


A Step-by-Step Guide

Image by Dall-E 2

In my last post, I mentioned how matrices and vectors can be used to simplify regression and account for any number of variables for which coefficients need to be found (m1, m2, …, mn for x1, x2, …, xn).

Whereas a vector is a 1-dimensional list of values, a matrix is a multidimensional table of values.

Matrix M is a 2x2 2-dimensional matrix, meaning it has 2 rows and 2 columns.

A matrix can be 3-dimensional, even 4-dimensional. In fact, a matrix can have any number of dimensions. While it is possible to visualize a 3-dimensional matrix, there is no practical way to visualize one with more than 3 dimensions.

Image by author

In this post, we will only be working with 2D matrices. In the future, I will cover Neural Networks that work with 3D or even 4D matrices.
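If it helps to see those shapes concretely, here is a quick NumPy sketch (the numbers themselves are arbitrary):

import numpy as np

vector = np.array([1, 2, 3])                 # 1-dimensional: shape (3,)
matrix = np.array([[1, 2], [3, 4]])          # 2-dimensional: shape (2, 2), 2 rows and 2 columns
cube = np.ones((2, 2, 2))                    # 3-dimensional: shape (2, 2, 2)

print(vector.ndim, matrix.ndim, cube.ndim)   # prints: 1 2 3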

Let’s start by converting our multiple regression problem from part 2 from single variables and vectors to matrices. We’ll combine the age and size variables into one matrix called X, and m1 and m2 into one matrix called theta.
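As a rough sketch of what that looks like in NumPy (using the same age and size values as the code at the end of this post; the theta values here are just placeholders):

import numpy as np

# each row is one house: [age, size]
X = np.array([[25, 2200],
              [65, 2500],
              [21, 3200],
              [ 7, 1750],
              [13, 1100]])

# m1 and m2 stacked into a single column
theta = np.array([[0.5],    # m1 (placeholder)
                  [0.5]])   # m2 (placeholder)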

Matrix Multiplication

Matrices are multiplied by pairing each row of the first matrix with each column of the second: multiply the corresponding elements together, sum the results, and repeat for every row/column pair. I know that is wordy and confusing, so I will demonstrate the process using the animation I put together below. It’s straightforward once you understand the pattern.

Image by author

Easy, right? Take the product of every row/column element pair for a single row and column, then add them together to get your answer. This is called the ‘dot product’. The answer on the right side in the animation is a list of the same equation we have been using (m1x1+m2x2+…+b).

In order for the dot product operation to work, the number of columns in the first matrix must match the number of rows in the second.
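For example, a 1x2 row times a 2x1 column works because the inner dimensions (2 and 2) match: the row [2, 3] times the column [4, 5] gives 2·4 + 3·5 = 23, a single number.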

Technically, the above animation shows a matrix being multiplied by a vector, but once the pattern is understood, it is easy to expand to multiplying two matrices together. By using matrix multiplication, we can compute the output of every input at once for our multiple regression problem.

row 1 · column 1
row 2 · column 1
row 3 · column 1
row 4 · column 1
row 5 · column 1
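In NumPy, that whole row-by-column pattern is a single call; a minimal sketch using the same X and theta as before (theta values are placeholders):

import numpy as np

X = np.array([[25, 2200], [65, 2500], [21, 3200], [7, 1750], [13, 1100]])   # 5 rows, 2 columns
theta = np.array([[0.5], [0.5]])                                            # 2 rows, 1 column (m1, m2)

outputs = X.dot(theta)    # (5x2) dot (2x1) -> (5x1): m1*x1 + m2*x2 for every input at once
print(outputs.shape)      # (5, 1)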

What about the bias term?

We could add the bias term to each result after computing the dot product of X*theta:

But it can also be inserted into the X*theta multiplication by adding it to vector theta:

Since the number of columns in the first matrix (X) has to match the number of rows in the second (theta), a column needs to be added to X so that it has 3 columns to match theta’s 3 rows. Because the bias term is only added and never scaled (multiplied), a column of 1’s can be added to X, since any number multiplied by 1 is itself.
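A minimal sketch of that augmentation in NumPy (mirroring what the full code at the end of the post does; the theta values are placeholders):

import numpy as np

X = np.array([[25, 2200], [65, 2500], [21, 3200]])    # 3 inputs, 2 features
X = np.column_stack((np.ones((X.shape[0], 1)), X))    # prepend a column of 1's -> shape (3, 3)
theta = np.array([[10.0], [0.5], [0.5]])              # [b, m1, m2] stacked into one column
print(X.dot(theta))                                   # each row is b*1 + m1*x1 + m2*x2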

A keen eye will have noticed I switched the variables. Previously, our equation was mx + b. Now it’s X*theta, putting the X variable in front of theta (or m) instead of behind it. Recall that the number of columns in the first matrix has to match the number of rows in the second in order for the dot product pattern to work. The previous order (mx + b) worked because each variable represented only a single number or a vector of numbers. Once matrices are involved, the order matters: X is NxD (one row per input, one column per feature) and theta is Dx1, so X has to come first for the inner dimensions to line up, producing an Nx1 output.

Notice how much simpler the equation becomes when using matrix form:
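Written out in the notation used throughout this post (with the column of 1’s already added to X):

y_hat = X · theta

where X is N x (D+1), theta is (D+1) x 1 (b, m1, …, md stacked into one column), and y_hat is N x 1, one prediction per input.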

The MSE can be written in matrix form:
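One way to write it, consistent with the code at the end of the post (the residual is y - X·theta):

MSE = (1/n) · (y - X·theta)^T · (y - X·theta)

Taking the dot product of the residual column with itself sums the squared errors, and the 1/n turns that sum into a mean.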

All the derivatives can be calculated using matrices as well. Follow the math step-by-step below and you will see how the derivatives with respect to each parameter in theta (m1, m2,…, mn, b) can be calculated using matrices.

First, let’s write out all the parameter update equations we solved for in part one and expanded to using vectors in part two, but extend the system of equations to account for any number of variables that need to be tuned (b, m1, m2,…,md):
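Spelled out in the same shorthand used so far (lambda is the learning rate, n is the number of inputs, and each sum runs over all n inputs):

b  := b  - lambda · (1/n) · Σ -2 · (y_i - (m1·x1_i + m2·x2_i + … + md·xd_i + b))
m1 := m1 - lambda · (1/n) · Σ -2 · x1_i · (y_i - (m1·x1_i + m2·x2_i + … + md·xd_i + b))
m2 := m2 - lambda · (1/n) · Σ -2 · x2_i · (y_i - (m1·x1_i + m2·x2_i + … + md·xd_i + b))
…
md := md - lambda · (1/n) · Σ -2 · xd_i · (y_i - (m1·x1_i + m2·x2_i + … + md·xd_i + b))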

m1x1+m2x2+…+mdxd+b can be rewritten as X multiplied by theta, as was outlined above:

Since we are attempting to write our update function in matrix form, the output (answer) will also be in matrix form (a table). We can work backward starting with our desired output in order to reverse engineer the input:

The parameters that are being updated (b, m1,…, md) are themselves what is contained in matrix theta and can therefore be rewritten as such:

All the like terms (lambda, 1/n, summation, -2, (y-X times theta)) can be placed outside the matrix since they are common to each row (like factoring an equation):

That leaves us with a one-dimensional matrix of vectors of the x terms (the inputs). It’s tempting to rewrite it as matrix X, but that would lead to a dimensional mismatch and the dot product operation would fail. The x vectors are vectors containing every value of that specific feature of x (indexed as d), not every value across all features for a single input of x (indexed as n). Most likely that is a bit confusing to read and understand, so I will demonstrate using mathematical notation below.

Recall what the original matrix X is and compare that to the matrix above:

Matrix X is size NxD, but the matrix above is DxN. If you look closely, you’ll see the first column of matrix X is the first row in the matrix above it. The same goes for the second column, and every column after, all the way up to d. In mathematics, this is called the ‘transpose’ of matrix X, and is notated by placing a superscript ‘T’ after the variable name of the matrix:
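A quick NumPy illustration of the transpose (the numbers are arbitrary):

import numpy as np

X = np.array([[1, 2],
              [3, 4],
              [5, 6]])   # shape (3, 2): N=3 rows, D=2 columns
print(X.T)               # shape (2, 3): each column of X becomes a row of X.T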

Now that we have determined that the one-dimensional matrix of feature vectors of x is the transpose of matrix X, it can be substituted into the equation:

Since the dot product already sums the products across all row/column pairs, the explicit summation can be dropped once the transpose of matrix X is inserted into the equation: X transpose will be multiplied (dot product) with the outcome of y - X*theta. Vector y is rewritten as a one-dimensional matrix Y with shape Nx1 (rows x columns) to match the dimensions of the output of X*theta:

Done! Now the algorithm can account for any number of inputs across any number of features. The complete regression algorithm in matrix form is written out below:
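Summarized in the same notation (this is exactly the computation the code below performs on every iteration):

y_hat = X · theta
residual = y - y_hat
MSE = (1/n) · residual^T · residual
theta := theta - (lambda/n) · (-2 · X^T · residual)

Repeat for the desired number of iterations, and theta should converge toward the parameters (b, m1, …, md) that minimize the MSE, provided the learning rate lambda is small enough.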

Below is the code I wrote to solve the previous multiple regression example using matrices.

import numpy as np

class StandardScaler():

    def __init__(self):
        self.mean = None
        self.standard_deviation = None

    def fit(self, data):
        self.mean = np.mean(data, axis=0)
        self.standard_deviation = np.sqrt(np.mean(np.square(data-self.mean), axis=0))

    def transform(self, data):
        return (data-self.mean)/(self.standard_deviation + 1e-7)

# Turn variables into matrices and solve
X = np.array([[25,2200],[65,2500],[21,3200],[7,1750],[13,1100]])
Y = np.array([275000,245000,350000,245000,195000]).reshape((-1,1))
Theta = np.ones((X.shape[1]+1,1))
lr = 0.01

# scale data
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)

# add a column of ones to X to account for bias term being added to Theta matrix
X = np.column_stack((np.ones((X.shape[0],1)),X))

# run regression algorithm
num_episodes = 500

for i in range(num_episodes):
    Yh = X.dot(Theta)                              # predictions: (Nx3) dot (3x1) -> Nx1
    residual = Y-Yh                                # error for every input at once
    mse = np.sum((np.square(residual)))/len(Y)     # mean squared error (kept for monitoring)
    Theta -= (lr/len(Yh))*(-2*X.T.dot(residual))   # gradient descent update in matrix form

# make predictions
X_new = np.array([[17,2950],[52,1600]]) # data to make predictions on
X_new = scaler.transform(X_new) # scale data using scale of training data
X_new = np.column_stack((np.ones((X_new.shape[0],1)),X_new)) # add column of ones
preds = X_new.dot(Theta) # make predictions

Now that the regression algorithm is written in matrix form and can account for any number of inputs with any number of features, let’s scale up and use a large dataset containing actual real estate metrics to estimate the value of a property. I’ll take you from start to finish, going over important concepts such as data pre-processing, including detecting correlation and collinearity, and determining how confident you can be that your model (algorithm) is correct using statistical significance, the coefficient of determination, and cross-validation!


Jason Hayes

I'm a self-taught programmer and AI enthusiast with a background in Game Design and Game Art and Animation