Demystifying the Math Behind Multiple Regression

Jason Hayes
12 min read · Jan 15, 2023


A Step-by-Step Guide

Image by DALL-E 2

In part one, I covered the math of Linear Regression and how it can be used to predict the output given one input (y= mx + b), as long as the input has a linear relationship with the output. If multiple inputs (features) exist, we add mx terms to the equation.

For example, if we add a second input to our equation for a line, the equation goes from this:

y = mx + b

to this:

y = m1x1 + m2x2 + b

and no longer strictly represents a line (with two inputs it describes a plane), as we will discover when solving the example below.

One more variable was added, which means one more variable needs to be tuned, and one more derivative needs to be calculated. Adding the derivative of the new variable, and of any further variables, is easy because there is a pattern in the calculation that allows us to unlock the power of vector and matrix multiplication.

Let's define our problem: we know the age, size, and sale price of five houses (ages of 25, 65, 21, 7, and 13; sizes of 2200, 2500, 3200, 1750, and 1100; sale prices of $275,000, $245,000, $350,000, $245,000, and $195,000), and we want to estimate the prices of two more houses with ages of 17 and 52 and sizes of 2950 and 1600.

We have to set up the equation we are solving for, the house price, using the age and size of the home:

ŷ = m1x1 + m2x2 + b

where x1 is the age of the house, x2 is its size, and ŷ is the predicted price.

The loss function remains the same: the mean squared error (MSE).
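For reference, the MSE over the n data points, matching the mse calculation in the code at the end of this post, is:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{y}_i\big)^2$$

where $\hat{y}_i = m_1x_{1,i} + m_2x_{2,i} + b$ is the prediction for the i-th data point.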

A vector is an array (list) of numbers and is notated in mathematics using the ‘→’ symbol above a variable. This means the variable no longer represents a single number, but a list of numbers. When solving the derivatives below, I will use the ‘→’ symbol to indicate I am referring to the entire vector of values instead of just a single index of the variable. The math doesn’t change, but it will clean up the notation of equations by replacing the indexing of variables using a subscript with vector notation.

For example, instead of writing out the residual of each data point one index at a time, we can collect all of the residuals into a single vector j = y − (m1x1 + m2x2 + b), where y, x1, and x2 now represent vectors holding every value in the dataset, and the squared error becomes the vector k = j².

Solve for the derivatives of the cost with respect to m1, m2, and b so they can be updated after each iteration.

Use the Power Rule to solve for the derivative of vector k with respect to vector j.

Solve for the derivative of vector j with respect to m1.

Multiply the two derivatives together (chain rule) to solve for the derivative of k w.r.t. m1.

Write out vector j in terms of what it equals, y-(m1x1+m2x2+b).

Move the -x1 to the front of the equation for simplicity.
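The equations for these steps were originally shown as images; here is a reconstruction from the step descriptions above, using k = j² and j = y − (m1x1 + m2x2 + b):

$$\frac{dk}{dj} = \frac{d}{dj}\,j^2 = 2j$$

$$\frac{\partial j}{\partial m_1} = \frac{\partial}{\partial m_1}\Big(y - (m_1x_1 + m_2x_2 + b)\Big) = -x_1$$

$$\frac{\partial k}{\partial m_1} = \frac{dk}{dj}\cdot\frac{\partial j}{\partial m_1} = 2j\cdot(-x_1) = -2x_1\Big(y - (m_1x_1 + m_2x_2 + b)\Big)$$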

Now solve for the derivative with respect to m2:

Use the Power Rule to solve for the derivative of k with respect to j:

Solve for the derivative of j with respect to m2:

Multiply the two derivatives together (chain rule) to solve for the derivative of k w.r.t. m2:

Write out j in terms of what it equals, y-(m1x1+m2x2+b):

Move the -x2 to the front of the equation for simplicity:
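Reconstructing these steps the same way:

$$\frac{dk}{dj} = 2j,\qquad \frac{\partial j}{\partial m_2} = -x_2$$

$$\frac{\partial k}{\partial m_2} = 2j\cdot(-x_2) = -2x_2\Big(y - (m_1x_1 + m_2x_2 + b)\Big)$$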

Now solve with respect to b:
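Only the inner derivative changes; the derivative of j with respect to b is −1:

$$\frac{\partial j}{\partial b} = -1,\qquad \frac{\partial k}{\partial b} = 2j\cdot(-1) = -2\Big(y - (m_1x_1 + m_2x_2 + b)\Big)$$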

Don’t forget the summations we moved outside the derivations.

For m1:

For m2:

And b:
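Putting the 1/n and the summation back in front (these match the np.mean(...) terms in the code at the end of the post):

$$\frac{\partial\,MSE}{\partial m_1} = \frac{1}{n}\sum_{i=1}^{n} -2x_{1,i}\Big(y_i - (m_1x_{1,i} + m_2x_{2,i} + b)\Big)$$

$$\frac{\partial\,MSE}{\partial m_2} = \frac{1}{n}\sum_{i=1}^{n} -2x_{2,i}\Big(y_i - (m_1x_{1,i} + m_2x_{2,i} + b)\Big)$$

$$\frac{\partial\,MSE}{\partial b} = \frac{1}{n}\sum_{i=1}^{n} -2\Big(y_i - (m_1x_{1,i} + m_2x_{2,i} + b)\Big)$$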

Write out the complete regression algorithm:
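In outline, each iteration of gradient descent (with α as the learning rate, lr = 0.01 in the code at the end of the post) computes the predictions and then nudges every coefficient down its gradient:

$$\hat{y} = m_1x_1 + m_2x_2 + b$$

$$m_1 \leftarrow m_1 - \alpha\frac{\partial\,MSE}{\partial m_1},\qquad m_2 \leftarrow m_2 - \alpha\frac{\partial\,MSE}{\partial m_2},\qquad b \leftarrow b - \alpha\frac{\partial\,MSE}{\partial b}$$

This repeats for a fixed number of iterations (500 in the code) or until the MSE stops decreasing.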

Before we begin plugging in numbers, we need to discuss a concept called 'Feature Scaling'. Without it, finding the values of m1, m2, and b that minimize the MSE using gradient descent would be effectively impossible.

Feature Scaling

The scales of variables matter, particularly when multiple variables are on different scales. Take, for example, our house valuation problem. The age of the houses ranges from 7 to 65, and the house size ranges from 1100 to 3200. Since the house size is on a larger scale than the age, it carries more 'weight' and will contribute more to the output (answer/y), which also means it will contribute more to the cost (error). Following the logic trail, if the house size (x2) contributes more to the output and cost, the derivative (rate of change/slope) of the cost with respect to the house size coefficient (m2) will always be larger because it is multiplying a larger number. This imbalance causes gradient descent to focus its corrections on m2 and largely ignore m1 and b. Because gradient descent relies on stepping down the gradient towards a minimum, if the calculation of the gradient is dominated by one coefficient (like m2), the corrections focus almost entirely on that coefficient and no longer point towards the minimum of the entire cost function. The algorithm becomes unstable and the cost increases indefinitely.
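To make that concrete, here is a minimal check using the training data from the code at the end of the post, with the coefficients at their initial value of 1 (the exact numbers aren't important, only their relative size):

import numpy as np

# training data (same as the full code at the end of the post)
x1 = np.array([25, 65, 21, 7, 13])                       # house age
x2 = np.array([2200, 2500, 3200, 1750, 1100])            # house size
y = np.array([275000, 245000, 350000, 245000, 195000])   # sale price

# residuals with the initial coefficients m1 = m2 = b = 1
residual = y - (1*x1 + 1*x2 + 1)

# gradient of the MSE with respect to each coefficient
grad_m1 = np.mean(-2 * x1 * residual)  # roughly -1.4e7
grad_m2 = np.mean(-2 * x2 * residual)  # roughly -1.2e9, almost two orders of magnitude larger
print(grad_m1, grad_m2)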

If variables on different scales have the potential to make gradient descent unstable, it stands to reason that gradient descent is most stable and precise when all of the variables are on the same scale. That is accomplished through feature scaling. There are a number of ways to scale features, but for this solution we are going to use a technique called Standardization.

Standardization, also known as the z-score, rescales a dataset so that it is centered at 0 with a standard deviation of 1. Those are the parameters of the standard normal (Gaussian) distribution, the distribution that much naturally occurring data (height, weight, IQ, etc.) approximately follows.


The formula for standardizing a dataset is:
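$$z = \frac{x - \mu}{\sigma}$$

where μ is the mean of the dataset and σ is its standard deviation; the code at the end of this post uses the population standard deviation, the square root of the average squared deviation from the mean.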

The standard deviation measures how spread out the data is from the mean. The lower the number, the closer the data sits to the center (mean) of the dataset. The higher the number, the further from the center the data spreads.

Let’s walk through standardizing both our house age and size data. We’ll start with standardizing the age data:

Apply the standardization formula to each element in the house age dataset:
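The worked numbers were originally shown as images; recomputing them from the ages (25, 65, 21, 7, 13) using the population standard deviation:

$$\mu = \frac{25 + 65 + 21 + 7 + 13}{5} = 26.2,\qquad \sigma = \sqrt{\frac{(25-26.2)^2 + \dots + (13-26.2)^2}{5}} \approx 20.38$$

$$z_1 = \frac{x_1 - 26.2}{20.38} \approx [-0.059,\ 1.904,\ -0.255,\ -0.942,\ -0.648]$$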

Standardizing the house size data:

Apply the standardization formula to each element in the house size dataset:
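Doing the same for the sizes (2200, 2500, 3200, 1750, 1100):

$$\mu = 2150,\qquad \sigma \approx 705.69$$

$$z_2 = \frac{x_2 - 2150}{705.69} \approx [0.071,\ 0.496,\ 1.488,\ -0.567,\ -1.488]$$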

Now that each input is on the same scale, they can be fed into our algorithm, and gradient descent will effectively minimize the cost.

Continuing with our algorithm, we can now begin plugging in numbers to find a solution. Instead of using our original values of x1 and x2, we have to use the scaled values we calculated above, which will be notated as z1 and z2. Also, for simplicity, I'll set the initial values of m1, m2, and b to 1 and the learning rate to 0.01. As we did previously, I'll walk you through the first few iterations of the math (a short script that reproduces them follows the second iteration), then provide some graphs showing how the algorithm performs over many iterations.

First iteration:

Calculate y^

Calculate MSE (optional)

Update m1

Update m2

Update b

Second iteration:

Calculate y^

Calculate MSE (optional)

Update m1

Update m2

Update b
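The arithmetic for these updates was originally laid out as images; the sketch below is a minimal reproduction of the first two iterations, reusing the scaled data and the update rules derived above (it mirrors the full training loop in the code at the end of the post):

import numpy as np

x1 = np.array([25, 65, 21, 7, 13])
x2 = np.array([2200, 2500, 3200, 1750, 1100])
y = np.array([275000, 245000, 350000, 245000, 195000])

# standardize both inputs (np.std defaults to the population standard deviation)
z1 = (x1 - x1.mean()) / x1.std()
z2 = (x2 - x2.mean()) / x2.std()

m1, m2, b, lr = 1, 1, 1, 0.01

for i in range(2):  # first and second iterations
    yh = m1*z1 + m2*z2 + b               # calculate y^
    mse = np.mean((y - yh)**2)           # calculate MSE (optional)
    m1 -= lr*np.mean(-2*z1*(y - yh))     # update m1
    m2 -= lr*np.mean(-2*z2*(y - yh))     # update m2
    b -= lr*np.mean(-2*(y - yh))         # update b
    print(f'iteration {i+1}: mse={mse:.2f}, m1={m1:.2f}, m2={m2:.2f}, b={b:.2f}')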

Figure 1 below shows the MSE decreasing at a steady rate over the first 3 iterations using the scaled data. Figure 2 shows the MSE exploding upward if we don’t scale the data.

Figure 1: Scaled Data
Figure 2: Unscaled Data

A contour plot is a 2D plot that simulates a third axis (the z-axis) by drawing boundaries between regions of z values as the x-axis and y-axis values change. It is much easier to understand a contour plot by seeing one. Below are two contour plots. As above, Figure 3 shows gradient descent working correctly. Each arrow represents 100 iterations (500 total) and shows corrections in the direction of the minimum of the cost function (MSE), indicated by 'colder' (blue/lower value) regions. Contrast that with the contour plot in Figure 4, which shows arrows pointing both towards and away from the minimum. Notice the scale of the M1 and M2 values: 1e45 and 1e47 are 1 followed by 45 and 47 zeros respectively. Very large corrections were being made in both directions, indicating gradient descent was unable to determine where the minimum was, which caused gradients to explode and corrections to be large and erratic.

Figure 3: Scaled Data
Figure 4: Unscaled Data

After 500 iterations, the MSE plot looks like the figure below. The algorithm appears to have converged to the minimum around iteration 150. Further training runs the risk of 'overfitting' the algorithm to the data, meaning it won't be able to accurately predict unseen (new) data because it has been fit so tightly to the training data. This gets into the bias/variance tradeoff, something I won't cover here. A general rule of thumb for avoiding overfitting is to stop training once the MSE curve flattens.

MSE over 500 Iterations

Final coefficient values after 500 iterations:

Plotting the Age vs. Price and Size vs. Price prediction plots every 100 iterations shows how the prediction line continues to move towards, and take the shape of, the true relationship between each variable (Age/Size) and the target (Price).

Size vs. Price prediction plot
Age vs. Price prediction plot

Having solved for the coefficients m1, m2, and b that accurately predict the sale price, we can answer the original question: what should the two new houses be priced at?

Before plugging the new house ages and sizes into the algorithm, they have to be scaled to the same scale as the original training data, meaning we have to use the mean and standard deviation of the training data. This is because the algorithm was trained on data at that scale.

Standardize the new house age data:
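Using the training mean (26.2) and standard deviation (≈ 20.38) of the house ages calculated earlier:

$$z_{1,new} = \left[\frac{17 - 26.2}{20.38},\ \frac{52 - 26.2}{20.38}\right] \approx [-0.451,\ 1.266]$$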

Standardize the new house size data:
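And using the training mean (2150) and standard deviation (≈ 705.69) of the house sizes:

$$z_{2,new} = \left[\frac{2950 - 2150}{705.69},\ \frac{1600 - 2150}{705.69}\right] \approx [1.134,\ -0.779]$$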

With the data scaled correctly, we can now plug the numbers into the equation:

Solve for the first house price estimate:

Solve for the second house price estimate:

Done! The houses should be priced at $334,693.81 and $190,876.28 respectively. Here is the code used to solve the multiple regression above.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## multiple regression ##

# create training data
x1 = np.array([25,65,21,7,13])
x2 = np.array([2200,2500,3200,1750,1100])
y = np.array([275000,245000,350000,245000,195000])

# set initial values of m1, m2, and b
m1 = 1
m2 = 1
b = 1

# set learning rate
lr = 0.01

# scale the data
x1_mean = np.mean(x1)
x1_std = np.sqrt(np.mean(np.square(x1 - x1_mean)))
z1 = np.array((x1 - x1_mean)/x1_std)

x2_mean = np.mean(x2)
x2_std = np.sqrt(np.mean(np.square(x2 - x2_mean)))
z2 = np.array((x2 - x2_mean)/x2_std)

# initialize variables
num_episodes = 500
yhs = []
m1s = []
m2s = []
bs = []
mses = []

# run training
for i in range(num_episodes):
    # make predictions using current values of m1 & m2 & b
    yh = m1*z1 + m2*z2 + b
    # store predictions for tracking
    yhs.append(yh)
    # subtract prediction from ground truth to get residual
    y_yh = y - yh
    # calculate the mse
    mse = np.sum((np.square(y_yh)))/len(y)
    # store mse for tracking
    mses.append(mse)
    # store coefficients in arrays for tracking
    m1s.append(m1)
    m2s.append(m2)
    bs.append(b)

    # update coefficients using gradient descent
    m1 -= lr*np.mean(-2*z1*(y_yh))
    m2 -= lr*np.mean(-2*z2*(y_yh))
    b -= lr*np.mean(-2*(y_yh))

# print results to log
print(f'mse={mse}')
print(f'm1={m1}')
print(f'm2={m2}')
print(f'b={b}')

# create MSE plot
plt.plot(np.arange(len(mses)), mses)
plt.title('MSE over 500 Iterations')
plt.xlabel('Iteration (counting starts at 0)')
plt.ylabel('MSE')
plt.show()

# create contour plot for m1 and m2
# create 100 equally spaced data points between the first and last entries of each
# coefficient
M1 = np.linspace(m1s[0], m1s[-1], 100)
M2 = np.linspace(m2s[0], m2s[-1], 100)
B = np.linspace(min(bs), max(bs), 100)

# create a matrix shaped len(M2) by len(M1) to serve as a grid of values to
# be used in the contour plot (every value of m1 paired with every value of m2);
# contourf expects rows to match the y-axis (M2) and columns to match the x-axis (M1)
J = np.zeros((M2.size, M1.size))

# calculate the mse for each (m1,m2) pair
for i in range(len(M1)):
    for j in range(len(M2)):
        yh = M1[i]*z1 + M2[j]*z2 + bs[-1]
        y_yh = y-yh
        mse = np.sum((np.square(y_yh)))/len(y)
        J[j,i] = mse

# create a contour plot with color filled slices
plt.contourf(M1, M2, J, cmap='coolwarm', alpha=0.7)
plt.axhline(0, color='black', alpha=0.5, dashes=[2,4], linewidth=1)
plt.axvline(0, color='black', alpha=0.5, dashes=[2,4],linewidth=1)
# create arrows showing update over every 100 iterations
for i in range(0, len(m1s)-100, 100):
    plt.annotate('', xy=[m1s[i+100],m2s[i+100]], xytext=[m1s[i],m2s[i]],
                 arrowprops={'arrowstyle': '->', 'color': 'r', 'lw': 1},
                 va='center', ha='center')

# create a contour plot with annotated slices to lay on top of filled contour
CS = plt.contour(M1, M2, J, linewidths=1,colors='black')
plt.clabel(CS, inline=1, fontsize=8)
plt.title("Contour Plot of Gradient Descent")
plt.xlabel("M1")
plt.ylabel("M2")
plt.show()

## create a plot showing prediction line every 100 iterations
# for m1
for i in range(len(m1s)):
    if i % 100 == 0: # fires for i = 0, 100, 200, 300, 400
        pred = m1s[i]*z1 + m2s[i]*z2 + b # predict at iteration i
        plt.plot(x1, pred, ls='--', alpha=0.5, label=f'iteration {i}') # plot
pred = m1s[-1]*z1 + m2s[-1]*z2 + b # prediction at final iteration (500)
plt.plot(x1, pred, ls='--', alpha=0.5, label=f'iteration {len(m1s)}') # plot
plt.plot(x1, y, label='True') # plot true data
plt.title('Age vs. Price')
plt.xlabel('Age')
plt.ylabel('Price')
plt.legend()
plt.show()

# do the same as above but for m2
for i in range(len(m1s)):
    if i % 100 == 0:
        pred = m1s[i]*z1 + m2s[i]*z2 + b
        plt.plot(x2, pred, ls='--', alpha=0.5, label=f'iteration {i}')
pred = m1s[-1]*z1 + m2s[-1]*z2 + b
plt.plot(x2, pred, ls='--', alpha=0.5, label=f'iteration {len(m1s)}')
plt.plot(x2, y, label='True')
plt.title('Size vs. Price')
plt.xlabel('Size')
plt.ylabel('Price')
plt.legend()
plt.show()

## predict house age=[17,52] & house size=[2950,1600]
# create new x1 and x2 input data to predict on
x1_new = np.array([17,52])
x2_new = np.array([2950,1600])

# scale data to scale of initial training data by using mean and standard deviation
# of training data
z1_new = np.array((x1_new - x1_mean)/x1_std)
z2_new = np.array((x2_new - x2_mean)/x2_std)

# make prediction
y_new = m1*z1_new + m2*z2_new + b

# print predictions to log
print(f'y_new: {y_new}')

Normally, the training data would contain many more data points to learn from, and the accuracy of the algorithm would be tested on data it hasn't seen before but for which the correct answer is known. Now that the mathematics of regression is understood on a small scale, we can move up to larger datasets and introduce the concepts of training and testing sets, along with vector and matrix multiplication, which I'll cover in my next post.

