ML From Scratch — Linear Regression

Vivian Ouyang
Aug 12, 2021


I will write a series of short articles illustrating how to implement ML algorithms from scratch, along with a brief look at each algorithm’s pros and cons.

Pros:

  • Runs very fast
  • Has a solid statistical foundation
  • Easy to explain to audiences without a technical background, since they can see the coefficient for each feature.

Cons:

  • Requires spending a lot of time on feature engineering
  • Relies on statistical assumptions: a linear relationship, independence of residuals, homoscedasticity of residuals, and normality of residuals.
  • Features need to be scaled to speed up the computation.

Introduction to the Theory

Let’s assume we have m samples and n features.

  • Expression

We can start with the following formula:
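
ŷ = w1*x1 + w2*x2 + … + wn*xn + b

Here ŷ denotes the predicted value.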

Each wi is the coefficient for the corresponding feature, x1, x2, …, xn are the feature values, and b is the bias. We can rewrite b as w0*x0, where w0 = b and x0 is always 1. Thus the formula can be transformed to:
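
ŷ = w0*x0 + w1*x1 + w2*x2 + … + wn*xn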

For each sample i, we can write the vector formula
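
ŷi = w^T * xi

where w = (w0, w1, …, wn)^T and xi is the feature vector of sample i with the constant x0 = 1 included.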

The scalar expansion is written as:
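
ŷi = w0*xi0 + w1*xi1 + w2*xi2 + … + wn*xin

where xij is the value of feature j for sample i and xi0 = 1.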

The matrix multiplication is written as:
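
ŷi = [xi0, xi1, …, xin] * [w0, w1, …, wn]^T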

Then we extend to all m samples.
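
ŷ = X*w

where X is the m×(n+1) matrix whose i-th row holds the feature values of sample i (including x0 = 1), w is the (n+1)×1 coefficient vector, and ŷ is the m×1 vector of predictions.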

  • Loss and cost function

For regression, we use the squared loss function
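
loss(yi, ŷi) = (ŷi − yi)^2

where yi is the true target and ŷi is the prediction for sample i.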

The cost function is
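
J(w) = 1/(2m) * Σ (ŷi − yi)^2, summed over i = 1, …, m

The 1/2 factor is a convention that cancels when we take the derivative.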

We can further write it in matrix form as
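
J(w) = 1/(2m) * (X*w − y)^T * (X*w − y)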

This is a convex function, so gradient descent will converge to the global minimum.

  • Gradient Descent

We can write the gradient descent update in matrix form
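
w := w − (alpha/m) * X^T * (X*w − y)

This update is repeated for every epoch, where alpha is the learning rate.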

Implementation:

We use the famous Boston housing price dataset from sklearn.

  • Step 1: Get the train and test datasets

In this implementation, we need to import MinMaxScaler to scale the features and train_test_split to create the train/test datasets.
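
A sketch of this step (test_size and random_state below are illustrative choices, and load_boston is only available in older scikit-learn releases):

import numpy as np
from sklearn.datasets import load_boston
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# load the data and scale every feature to [0, 1]
boston = load_boston()
X = MinMaxScaler().fit_transform(boston.data)
y = boston.target

# hold out 20% of the 506 samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)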

Let’s look at the dimensions of the train dataset
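
print(X_train.shape)  # (404, 13)
print(y_train.shape)  # (404,) — conceptually a 404×1 column vector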

You can consider X_train’s shape as 404×13, and y_train as a column vector of 404 target values (404×1).

  • Step 2: Define the LinearRegression class and set initial values

To simplify the implementation, we only set three main parameters: alpha is the learning rate, epoch is the number of gradient descent iterations, and fit_bias corresponds to the bias b in the formula above. cost_record is a list that stores the cost after every iteration. Our goal is to find the parameters with the smallest cost.
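
A minimal sketch of the constructor, assuming fit_bias is a flag that controls whether the constant column x0 = 1 is added for the bias (the notebook linked below has the full version):

class LinearRegression:
    def __init__(self, alpha=0.01, epoch=1000, fit_bias=True):
        self.alpha = alpha        # learning rate
        self.epoch = epoch        # number of gradient descent iterations
        self.fit_bias = fit_bias  # whether to include the bias term b (as x0 = 1)
        self.cost_record = []     # cost value recorded after every iteration
        self.w = None             # coefficient vector, learned in fit()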

  • Step 3: Define the predict() function

We write the predict function as the dot product of the feature matrix X and the coefficient vector w.
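
Continuing the class above, a sketch of predict(); here I assume the column of 1s for the bias is added as the first column, matching x0 in the formulas above (it only needs to be consistent with fit()):

    def predict(self, X):
        X = np.asarray(X)
        if self.fit_bias:
            # prepend x0 = 1 so the bias w0 is handled by the same dot product
            X = np.hstack([np.ones((X.shape[0], 1)), X])
        return np.dot(X, self.w)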

  • Step 4: Define the fit() function

This is the most difficult part. We need to use gradient descent to update w. One tricky part is the 1-D array: applying the transpose .T to a 1-D array does not change its shape. One example:
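
a = np.array([1.0, 2.0, 3.0])
print(a.shape)    # (3,)
print(a.T.shape)  # (3,) — transposing a 1-D array leaves its shape unchanged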

Below is the fit function
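
A sketch of fit() consistent with the update line discussed below; the bias column handling mirrors predict() above, and the cost uses the 1/(2m) factor from the theory section:

    def fit(self, X_train, y_train):
        X_train = np.asarray(X_train)
        y_train = np.asarray(y_train).ravel()  # keep y as a 1-D array
        if self.fit_bias:
            X_train = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
        m = X_train.shape[0]
        self.w = np.zeros(X_train.shape[1])    # initialize all coefficients to 0
        for _ in range(self.epoch):
            y_pred = np.dot(X_train, self.w)
            # gradient of J(w) = 1/(2m) * sum((y_pred - y)^2) is (1/m) * X^T (y_pred - y)
            self.w -= (self.alpha/m * np.dot((y_pred-y_train).T,X_train)).T
            self.cost_record.append(np.sum((y_pred - y_train) ** 2) / (2 * m))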

The tricky part happens with self.w. I add .T in

self.w -= (self.alpha/m * np.dot((y_pred-y_train).T,X_train)).T

But you can also write it as

self.w -= self.alpha/m * np.dot((y_pred-y_train).T,X_train)

Both are the same: the result of np.dot here is a 1-D array, so the extra transpose does not change anything.

  • Step 5: Define the save and load model functions (sketched below)
  • Step 6: Show the final output
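
For Step 5, one simple option is to persist the fitted model with pickle (this assumes import pickle at the top of the file; save_model and load_model are illustrative names, not necessarily the ones used in the notebook):

    def save_model(self, path):
        # write the whole fitted estimator to disk
        with open(path, 'wb') as f:
            pickle.dump(self, f)

    @staticmethod
    def load_model(path):
        # read a previously saved estimator back from disk
        with open(path, 'rb') as f:
            return pickle.load(f)

And for Step 6, a possible way to train the model and show the final output (the alpha and epoch values are illustrative):

model = LinearRegression(alpha=0.1, epoch=1000)
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)
print(model.w)                # learned coefficients (w0 = bias first, then one per feature)
print(model.cost_record[-1])  # final training cost
print(y_test_pred[:5])        # predictions for the first few test samples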

We can also plot the cost function over the training epochs.
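
For example, with matplotlib (using the model trained above):

import matplotlib.pyplot as plt

plt.plot(model.cost_record)
plt.xlabel('epoch')
plt.ylabel('cost')
plt.title('Training cost per epoch')
plt.show()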

GitHub Link:

https://github.com/oyww710/ML_Scratch_Practice/blob/main/linear_regression.ipynb
