Linear Regression

Kien Duong
Sep 22, 2023

Linear Regression is a supervised learning algorithm that models the relationship between one or more independent variables and a dependent variable as a linear function, so that the outcome of future events can be predicted. It is one of the most basic algorithms in machine learning.

For example, from a price dataset of more than 4000 houses, we will build a model to predict the house price based on 3 attributes: bedrooms, bathrooms & floors. The input values are defined as a column vector:

x = [x_1, x_2, x_3]^T

where x_1 is the number of bedrooms, x_2 the number of bathrooms and x_3 the number of floors.

So we have the output model that depends on the input values:

y' = f(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_0 = \bar{x}^T w

where w = [w_0, w_1, w_2, w_3]^T is the weight vector and \bar{x} = [1, x_1, x_2, x_3]^T is the input vector extended with a 1 to absorb the bias w_0.
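As a quick illustration, the prediction is just a dot product (the weights below are made-up numbers, not values fitted from the dataset):

```python
import numpy as np

# Extended input vector [1, bedrooms, bathrooms, floors]
x_bar = np.array([1.0, 3.0, 2.0, 1.0])

# Hypothetical weight vector [w0, w1, w2, w3] -- illustrative values only
w = np.array([50000.0, 30000.0, 15000.0, 10000.0])

y_pred = x_bar @ w  # y' = x̄ᵀ w
print(y_pred)       # 180000.0
```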

Our expectation is that the prediction error e between the actual value y and the predicted value y' should be minimum:

e = \frac{1}{2}(y - y')^2

We have to use the square of e because y - y' can be negative. The factor 1/2 is used to simplify the calculation of derivatives (it cancels with the 2 that appears when differentiating the square).

1. Loss function

When we have N training data points, each training data point has its own prediction error. Our problem is to build a model so that the sum of the prediction errors is minimum. It is equivalent to finding the weight vector w so that the following function has the minimum value:

L(w) = \frac{1}{2N} \sum_{i=1}^{N} (y_i - \bar{x}_i^T w)^2

In machine learning, the loss function is usually the average of the error at each point, which is why we have the 1/2N in that formula. As you can see, L(w) is related to the squared 2-norm, so we can transform the above formula to:

L(w) = \frac{1}{2N} \| y - X^T w \|_2^2

where y = [y_1, y_2, ..., y_N]^T is the vector of actual prices and X = [\bar{x}_1, \bar{x}_2, ..., \bar{x}_N] is the matrix whose columns are the extended input vectors.
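To make the formula concrete, here is a minimal NumPy sketch of this loss (the names X, y, w follow the notation above; an illustration, not code from the original dataset):

```python
import numpy as np

def loss(w, X, y):
    """L(w) = 1/(2N) * ||y - X^T w||^2, columns of X are the extended inputs."""
    N = y.shape[0]
    residual = y - X.T @ w                   # prediction errors y_i - x̄_iᵀ w
    return (residual @ residual) / (2 * N)   # rᵀr = ||r||²
```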

2. Solution

In the next step, we need to find the solution for that loss function by solving the derivative equation \nabla_w L(w) = 0. First, rewrite the loss as a product:

L(w) = \frac{1}{2N} \| y - X^T w \|_2^2 = \frac{1}{2N} (y - X^T w)^T (y - X^T w) \quad (2.1)

We have this transformation by using the identity \|x\|_2^2 = x^T x when x is a vector. Let's prove this one. For x = [x_1, x_2, ..., x_n]^T:

x^T x = \sum_{i=1}^{n} x_i^2 = \|x\|_2^2

since the 2-norm is defined as \|x\|_2 = \sqrt{\sum_i x_i^2}.

From (2.1), taking the gradient with respect to w and setting it to zero, we have:

\nabla_w L(w) = \frac{1}{N} X (X^T w - y) = 0 \quad (2.2)

You can read this post to understand more about (2.2); it follows from standard matrix derivative rules.

Equation (2.2) is equivalent to:

X X^T w = X y \quad (2.3)

A = XX^T is a square matrix.

– If A is invertible, then (2.3) has the unique solution:

w = A^{-1} X y = (X X^T)^{-1} X y

– If A is not invertible, we must find the pseudoinverse A^+ of A. We know that every matrix has an SVD (Singular Value Decomposition), so the problem becomes finding the pseudoinverse through the SVD:

A = U \Sigma V^T \implies A^+ = V \Sigma^+ U^T

where Σ+ is formed from Σ by taking the reciprocal of all the non-zero elements, leaving the zeros alone, and transposing the result into the right shape: if Σ is an m x n matrix, then Σ+ must be an n x m matrix. The solution of (2.3) is then w = A^+ X y.
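A minimal end-to-end sketch with NumPy (the dataset here is randomly generated, not the real 4000-house data; np.linalg.pinv computes the Moore-Penrose pseudoinverse via the SVD):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake dataset: N houses with 3 attributes (bedrooms, bathrooms, floors)
N = 4000
features = rng.integers(1, 6, size=(N, 3)).astype(float)
true_w = np.array([50.0, 30.0, 15.0, 10.0])   # [w0, w1, w2, w3], made up
X = np.vstack([np.ones(N), features.T])       # 4 x N, columns are x̄_i
y = X.T @ true_w + rng.normal(0.0, 0.5, N)    # noisy "prices"

A = X @ X.T                    # 4 x 4 square matrix
w = np.linalg.pinv(A) @ X @ y  # w = A⁺ X y
print(w)                       # ≈ [50, 30, 15, 10]
```

When A is invertible, np.linalg.solve(A, X @ y) is the cheaper and more numerically stable way to get the same w.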

In fact, solving the derivative equation in closed form is an expensive calculation: building and inverting (or pseudo-inverting) XX^T costs time and memory when the number of features or data points is large. So we often use another way, Gradient Descent, to find the minimum of the function. It is usually cheaper (faster) to find the solution using gradient descent.
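Continuing from the snippet above (X, y and np already defined), here is a minimal gradient descent sketch for this loss; the learning rate and iteration count are arbitrary illustrative choices:

```python
def gradient_descent(X, y, lr=0.01, n_iters=5000):
    """Minimize L(w) = 1/(2N) ||y - X^T w||^2 by gradient descent."""
    N = y.shape[0]
    w = np.zeros(X.shape[0])           # start from w = 0
    for _ in range(n_iters):
        grad = X @ (X.T @ w - y) / N   # gradient from equation (2.2)
        w -= lr * grad                 # step against the gradient
    return w

print(gradient_descent(X, y))  # should approach the closed-form solution
```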
