Linear Regression From Scratch: 3 Methods

Alex
9 min read · Feb 25, 2022


Introduction:

In this two-part article series, we will implement linear regression from scratch (without using any machine learning libraries) with 3 different methods: gradient descent, maximum likelihood estimation, and maximum a posteriori estimation.

Before reading any further, I assume the reader has knowledge of the following: Python, Pandas, NumPy, partial derivatives, matrix multiplication, and Bayesian inference.

If you need a refresher on any of these, no worries: I will link to resources on these topics, as well as to the code, throughout this article.

Learning Outcome:

  • How to estimate parameters for linear regression models using 3 different methods.
  • Gain a better understanding of linear regression and the math behind gradient descent, maximum likelihood estimation, and maximum a posteriori estimation.
  • How to make predictions with linear models.

Article Outline:

  • Linear Regression Review
  • Preparing Our Data
  • Gradient Descent

Linear Regression Review:

A linear model is the best-fit line through a set of scattered data points. If you recall from high school, a simple line is represented as y = mx + b, with slope m and y-intercept b.

Linear regression is an algorithm for finding this best-fit line by estimating the best possible values for our weight "m" and bias "b". The more independent x variables we have, the more weights we will have.

In the graph above we see a linear relationship between our variables x and y: as x increases, y increases. In math, we call this correlation.

No correlation.

If there is no correlation, then the relationship in our data points is too weak for our machine learning model to estimate good parameters for a line of best fit.

Linear Regression Assumptions (rules):

  • Linearity: There is a linear relationship between our independent x and dependent y variables.
  • Homoscedasticity: The residuals (differences between predicted and true output) have roughly constant variance across the range of x. This keeps our machine learning model's errors consistent.
  • Independence: Our independent variables are independent of each other, with little to no multicollinearity.
  • Normality: x and y are approximately normally distributed. This makes it easier for our model to train and to be confident in its estimates.

For a full linear regression review, you can check this awesome article:

Preparing Our Data:

We are going to predict house price based on house area in ft². In machine learning, it is common to call the independent x-variables features and to call the dependent y-variable we are trying to predict the label.

Code snippet of setting up the data and train test splitting.
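Here is a minimal sketch of what that setup could look like. The file name 'house_prices.csv' and the column names 'area' and 'price' are placeholders for whatever dataset you are using; the split itself is done with plain NumPy.

import numpy as np
import pandas as pd

# Load the data (placeholder file and column names)
df = pd.read_csv('house_prices.csv')
X = df[['area']].to_numpy()   # house area in ft^2, shape (n_samples, 1)
y = df['price'].to_numpy()    # price in $, shape (n_samples,)

# Shuffle and split into 80% training and 20% testing
rng = np.random.default_rng(42)
indices = rng.permutation(len(X))
split = int(0.8 * len(X))
X_train, X_test = X[indices[:split]], X[indices[split:]]
y_train, y_test = y[indices[:split]], y[indices[split:]]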

In machine learning, it is common to split our data into 80% training and 20% testing. You can visit this article for more information.

Testing linear assumptions:

We will now test if our data meets the linear assumptions using the Matplotlib library for visualizing our data.

Linearity test:

import matplotlib.pyplot as plt

plt.scatter(X, y)
plt.title('Price vs House area')
plt.xlabel('House area(ft^2)')
plt.ylabel('Price($)')
plt.show()

Our data shows a linear relationship, so it passes this test.

Normality test:

from scipy.stats import norm
import statistics

# Calculating mean and standard deviation of the house area
mean = statistics.mean(X.flatten())
sd = statistics.stdev(X.flatten())
plt.scatter(X.flatten(), norm.pdf(X.flatten(), mean, sd))
plt.xlabel('House area(ft^2)')
plt.show()

# Calculating mean and standard deviation of the price
mean = statistics.mean(y)
sd = statistics.stdev(y)
plt.scatter(y, norm.pdf(y, mean, sd))
plt.xlabel('Price($)')
plt.show()

Our data has a bell-shaped curve, so it passes this test. This is called a normal distribution, where most of the data is centered around the mean. This is important for our model's accuracy.

Independence test:

Since our data only has one feature, it is trivially independent and passes this test. With more features, we would need a correlation heatmap or other statistical analysis to confirm independence. You can visit this article for more information.
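For the multi-feature case, a quick sketch of such a heatmap could look like this (assuming the features live in a pandas DataFrame called df):

import matplotlib.pyplot as plt

corr = df.corr()  # pairwise correlations between features
plt.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar()
plt.xticks(range(len(corr.columns)), corr.columns, rotation=45)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.title('Feature correlation heatmap')
plt.show()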

Homoscedasticity test:

This graph is the end result of our trained model. Plotting the predictions against the testing data shows roughly constant variance in the residuals, so the data passes this test.
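A rough sketch of such a residual plot, assuming the test-set predictions (computed later in this article as prediction) are already available:

import matplotlib.pyplot as plt

residuals = y_test.reshape(-1, 1) - prediction
plt.scatter(prediction, residuals)
plt.axhline(0, color='k', linestyle='--')
plt.xlabel('Predicted price($)')
plt.ylabel('Residual')
plt.title('Residuals vs predictions')
plt.show()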

Review:

Our data satisfies the linear regression assumptions. We are ready to start implementing each of the linear regression methods!

The code for all three methods will work for any number of features. However, as the number of features grows, we generally need much more data to keep the same accuracy. In machine learning, this is called the curse of dimensionality.

Gradient Descent:

Consider an estimated line of best fit with parameters θ = {m, b} for some data points in the graph below.

The differences between our predicted outputs and the true values (the blue points) are called residual errors, as illustrated by the dashed yellow lines in the graph below.

A single residual error can be expressed mathematically by the equation below. Note that we want a positive value, so we either square it or take its absolute value.
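Using the notation from above, the residual for a single point i is:

error_i = y_i − ŷ_i = y_i − (m·x_i + b)

where ŷ_i is the model's prediction for x_i.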

To get an accurate line of best fit, we need to minimize not just a single residual error but the sum of all the residual errors. This total is called the cost or loss.

Here are two common equations to represent the average loss:

Mean square error and Mean absolute error formulas
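For reference, the standard forms of these two losses, for n data points, are:

MSE = (1/n) · Σ (y_i − ŷ_i)²
MAE = (1/n) · Σ |y_i − ŷ_i|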

Mean square error (MSE) :

  • Penalizes large residual errors by squaring the difference between the predicted and true output. In other words, the farther we are from the true value, the larger the penalty, hence the squaring.
  • Is differentiable everywhere, which makes it easy to minimize.
  • Sensitive to outliers.

Mean absolute error (MAE)

  • Similar to MSE but without the squaring; it measures the average absolute residual.
  • It is more robust to outliers (data points outside the distribution or far away from all the other points).

I will be using MSE in this article since it is easier to optimize and we do not have any outliers. For more information on both formulas, you can read this article.

Gradient Descent Intuition:

Gradient descent is an iterative optimization algorithm for updating our parameters (weights and bias) by finding the minimum of our loss function.

Note: Linear Regression is like a car, and different techniques like gradient descent act as the driver.

As we minimize the cost we slowly get the best fit and thus the best parameters as illustrated by the graph below.

The first iteration of the line of best fit.
The second iteration of the line of best fit.
The third iteration of the line of best fit.
The final iteration of the line of best fit.

Math intuition behind Gradient Descent:

To get the best parameters for our machine learning model, we need to minimize our loss function. To minimize (or maximize) a function, we differentiate it.

We will differentiate the MSE using the chain rule for our weight and bias. Here we go:

Partial derivative of bias and weights

It is common in machine learning to add a constant, in this case 1/2, to simplify the math. This is just scaling and does not change where the minimum is.
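With that 1/2 scaling in place, one common way to write the loss and its partial derivatives is:

L = (1/2n) · Σ (y_i − (m·x_i + b))²
∂L/∂m = −(1/n) · Σ (y_i − (m·x_i + b)) · x_i
∂L/∂b = −(1/n) · Σ (y_i − (m·x_i + b))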

However, the summation form of our weight gradient is written for scalar values. Since we are working with 2D arrays (matrices), we can replace the summation with a matrix product:

Vector form for weights
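With X of shape (n, d), weights w, and predictions ŷ = X·w + b, the weight gradient in vector form is:

dW = −(1/n) · Xᵀ(y − ŷ)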

You may have noticed that we took the transpose of X. We need it for the shapes to work out in matrix multiplication. For example, consider our training data with X of shape (36, 1) and y of shape (36, 1):

Matrix multiplication

The result has shape (1, 1) for our weight, which is what we are after for y = mx + b: one weight m. If more weights are needed, the math still works. For example, with two features, X has shape (36, 2), so Xᵀ has shape (2, 36) and the result has shape (2, 1).
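As a quick sanity check of those shapes, here is a small NumPy sketch (the arrays are random placeholders, not the actual housing data):

import numpy as np

X = np.random.rand(36, 1)    # 36 samples, 1 feature
y = np.random.rand(36, 1)    # 36 target values
weights = np.zeros((1, 1))   # one weight m
bias = 0.0

y_pred = X @ weights + bias                 # shape (36, 1)
dW = -(1 / len(X)) * X.T @ (y - y_pred)     # (1, 36) @ (36, 1) -> (1, 1)
print(dW.shape)                             # one gradient value per weight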

Plotting the loss output and our parameters will give the two graphs below.

Gradient Descent

The derivatives we calculated give us the slope of the loss at our current weight and bias. Our goal is to find the weight and bias at which the loss reaches its global minimum.

Learning rate in updating parameters
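Written out, the update rule for each parameter is:

m = m − α · ∂L/∂m
b = b − α · ∂L/∂b

where α is the learning rate.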

Every time we compute the partial derivatives, we update our parameters. We subtract the derivative so that, whether the slope is positive or negative, we always move toward the global minimum.

We multiply the derivative by a learning rate (step size) so that the update does not overshoot the minimum and give inaccurate results, as illustrated below.

Updating parameters without learning rate

For linear regression, our loss will always converge to the global minimum. For more complex models with many local minima, we can use an optimizer such as Adam, which is popular in neural networks.

And that's all the math for gradient descent! For a better understanding, you can visit this awesome article:

Gradient Descent Implementation:

The algorithm for gradient descent that we will be implementing is below:

given: parameters = {weights, bias}, learning rate
for iteration = 0...n:
    dparams = gradient of loss at parameters
    parameters = parameters - learning rate * dparams
return parameters

This is just pseudocode for the gradient descent algorithm, but we will be adding a few more lines for good coding practice.

Linear regression code with gradient descent
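Below is a minimal sketch of what a LinearRegression class implementing gradient descent could look like. Its interface (the constructor arguments, a train() method returning a dict with 'params' and 'costs', and the bias stored as the last row of params) is an assumption based on how the model is used in the snippet that follows.

import numpy as np

class LinearRegression:
    # Minimal sketch of a gradient descent linear regressor (interface assumed).
    def __init__(self, X, y, learning_rate, iteration):
        self.X = X                         # shape (n_samples, n_features)
        self.y = y.reshape(-1, 1)          # shape (n_samples, 1)
        self.learning_rate = learning_rate
        self.iteration = iteration
        # weights stacked on top of the bias: shape (n_features + 1, 1)
        self.params = np.zeros((X.shape[1] + 1, 1))

    def predict(self, X):
        weights = self.params[:-1, :]
        bias = self.params[-1, :]
        return X @ weights + bias

    def train(self):
        n = len(self.X)
        costs = []
        for _ in range(self.iteration):
            y_pred = self.predict(self.X)
            error = self.y - y_pred
            # mean squared error with the 1/2 scaling from the derivation
            costs.append(np.mean(error ** 2) / 2)
            # gradients of the loss with respect to weights and bias
            d_weights = -(1 / n) * self.X.T @ error
            d_bias = -(1 / n) * np.sum(error)
            # gradient descent update step
            self.params[:-1, :] -= self.learning_rate * d_weights
            self.params[-1, :] -= self.learning_rate * d_bias
        return {'params': self.params, 'costs': costs}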
import matplotlib.pyplot as plt

iteration = 300
learning_rate = 0.00000001
model = LinearRegression(X_train, y_train, learning_rate, iteration)
results_dict = model.train()
params = results_dict['params']
costs = results_dict['costs']
iterations = np.arange(iteration)
plt.xlabel('iterations')
plt.title('iteration vs cost')
plt.ylabel('costs')
plt.plot(iterations, costs)
plt.show()

Here is the result of what our costs look like:

Relationship between cost and iterations

We see that our cost is converging toward the global minimum, which gives us good parameters for our line of best fit. If the curve were increasing instead, the training would be diverging, the opposite of converging.

Predicting Test Data:

To predict on new data, we take the estimated parameters from our machine learning model, multiply the weights by the features, and add the bias to get a prediction.

We will also be comparing the correct output with the predicted.

plt.scatter(X_test, y_test)
bias = params[-1, :]
weights = params[0:-1, :]
prediction = np.dot(X_test, weights) + bias
plt.xlabel('House area(ft^2)')
plt.ylabel('Price($)')
plt.title('Price vs House area')
linear_equation = "{:.2f}x+{:.2f}".format(weights[0][0], bias[0])
plt.plot(X_test, prediction, color='k', label=linear_equation)
plt.legend()
plt.show()
Graph of predicted data

The estimated parameters from our machine learning model fit both the training and test data well. This is called low bias (low training error) and low variance (low testing error). For more information, you can read this article.

Summary:

Gradient descent is similar to descending a mountain

In summary, gradient descent is an algorithm for finding the best parameters for our machine learning model by iteratively updating them to minimize the cost function. For more information on the math behind gradient descent, please visit this article.

In the second part of this article series, we will implement the other two methods: Maximum Likelihood Estimation and Maximum a Posteriori.
