Linear Regression and Gradient Descent in NumPy

John ODonnell
7 min read · Apr 6, 2022


Building the linear model from scratch in NumPy with gradient descent!

Overview

The aim of this article is to better understand the mechanics behind Linear Regression and Gradient Descent by building the model using NumPy. Once the model is built we will visualize the process of gradient descent.

Below we have the linear equation, where our output is a function of our input data, weight, and bias terms. These weight and bias terms are referred to as the internal parameters of our model. Our goal in Linear Regression is to find the optimal values for these parameters, the values that minimize prediction error.
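In NumPy, that equation is a single line: predictions are the input matrix times the weights, plus the bias. The values below are made up purely for illustration:

```python
import numpy as np

# y_hat = X @ w + b: each prediction is a weighted sum of the features plus a bias
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # 2 samples, 2 features (made-up values)
w = np.array([0.5, -1.0])    # one weight per feature
b = 2.0                      # bias (intercept) term

y_hat = X @ w + b
print(y_hat)  # [ 0.5 -0.5]
```

With multiple features, `w` holds one weight per feature and broadcasting adds the bias to every prediction.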

Implementing Linear Regression:

There are multiple methods to solve for these parameters and implement Linear Regression. One way is to solve for the parameters directly, which is often referred to as Ordinary Least Squares or the analytical solution. While exact, this method relies on matrix inversion, which is computationally expensive, scales with O(n³) complexity (that's bad), and is not even possible for certain shapes of data.
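For reference, the analytical solution can be sketched in a few lines of NumPy. The data here is synthetic, and in practice `np.linalg.lstsq` is preferred over forming an explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([3.0, -2.0])
y = X @ true_w + 5.0          # noiseless data with a known bias of 5

# Append a column of ones so the bias is solved for alongside the weights
Xb = np.hstack([X, np.ones((len(X), 1))])

# Normal equations: params = (X^T X)^-1 X^T y -- the O(n^3) matrix inversion
params = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y
print(params)  # ≈ [ 3. -2.  5.]
```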

Another popular method is called Gradient Descent, which allows us to take an iterative approach to approximate the optimal parameters. We start with a random guess for the parameters and iteratively adjust the values to be better and better. This process scales more favorably with larger datasets, is more flexible, and can be parallelized! So we use gradient descent to approximate the parameters of the model that minimize prediction error. An overview of the gradient descent process can be seen below:

Gradient Descent

  • Generate predictions given the current weight and bias
  • Calculate the error
  • Adjust the weights and bias
  • Repeat until convergence
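The four steps above can be sketched as a simple loop. This is a minimal one-feature example; the learning rate, iteration count, and data are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))
y = X[:, 0] * 4.0 + 3.0           # true weight 4, true bias 3

w, b = 0.0, 0.0                   # starting guess
alpha = 0.1                       # learning rate

for _ in range(500):
    y_hat = X[:, 0] * w + b                    # 1. generate predictions
    error = y_hat - y                          # 2. calculate the error
    w -= alpha * 2 * np.mean(error * X[:, 0])  # 3. adjust the weight...
    b -= alpha * 2 * np.mean(error)            #    ...and the bias
                                               # 4. repeat until convergence
print(w, b)  # close to 4.0 and 3.0
```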

Sounds simple enough, but how do we adjust our weight and bias terms? To answer that question we will cover two important topics, cost functions and partial derivatives.

Cost Functions

Cost functions are representations of error, the difference between our prediction and the actual value. A popular cost function is Mean Squared Error (MSE), which can be seen below.
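In NumPy, MSE is simply the mean of the squared differences between predictions and actuals:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: the average of the squared residuals."""
    return np.mean((y_true - y_pred) ** 2)

print(mse(np.array([3.0, 5.0]), np.array([2.0, 7.0])))  # ((1)^2 + (2)^2) / 2 = 2.5
```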

Why do we need a cost function? When we have a cost function, we have something to optimize. Our final model will have parameters that minimize this cost function. Keep in mind, minimizing the cost function means minimizing our prediction error. The best model will yield the least amount of prediction error.

One way to think about this is that for every possible combination of weight and bias, there is an associated cost/error. Below is a 3D visualization of a simple convex cost function. There are 3 dimensions: weight, bias, and cost. Let's say that to get started we initialize our model with its weight and bias terms both set to zero. Intuitively, our predictions would be poor with this model, and our model cost (error) would be high. That would place us at the blue point on the cost function below.

Each point on this plane represents a different linear regression model: a different combination of weight and bias and the resulting cost. The parameters with the lowest cost would plot at the minimum of this function, at the red dot. We could just look at the plot and eyeball those optimal values, but that would require us to build the entire plane, meaning a model at every single possible combination of parameters, which isn't feasible. What we need instead is to iteratively move closer and closer to the minimum, or descend down our cost function (green line). In other words, we adjust our parameters to be closer and closer to those optimal values. We "move" across the plane above by changing our weight and bias.

How do we know which way to move, or how to adjust our parameters? Should we increase or decrease the bias term to move to the bottom? We determine the proper direction using the power of derivatives.

Derivatives

Recall, the derivative of a function is the slope of the tangent line at a particular point. The slope of the cost curve at a given point gives us a direction and a step size. Here we refer to the slope as the gradient. Because we want to find the minimum of the cost function:

  • If the gradient is negative, we step forward (move in same direction)
  • If the gradient is positive, we step backward (move in opposite direction)
  • The steeper the gradient, the larger the step (as we approach the optimum, slopes flatten out)

In reality we are in a multidimensional space (weight x bias) so we need to know what direction to move in each dimension. Partial derivatives allow us to determine the direction to move in each dimension. Taking the partial derivative with respect to an input is to ask: “how does the output change as we move only in this dimension?”. This tells us the gradient in that dimension, and therefore which way to move in that direction! This scales to any number of possible dimensions.

We take the partial derivative of the cost function with respect to our weight and then our bias, and use those results to tweak our current weight and bias values. This makes them a little better each iteration. The outline of the process can be seen below:
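For MSE, the two partial derivatives work out to (2/n)·Xᵀ(ŷ − y) for the weights and (2/n)·Σ(ŷ − y) for the bias. A small sketch, with tiny made-up inputs so the numbers are easy to check by hand:

```python
import numpy as np

def gradients(X, y, w, b):
    """Partial derivatives of MSE with respect to the weights and the bias.

    d(MSE)/dw = (2/n) * X^T (y_hat - y)
    d(MSE)/db = (2/n) * sum(y_hat - y)
    """
    n = len(y)
    error = X @ w + b - y
    dw = (2.0 / n) * (X.T @ error)
    db = (2.0 / n) * np.sum(error)
    return dw, db

# With w = 0 and b = 0, error = [-2, -4], so dw = [-10] and db = -6
X = np.array([[1.0], [2.0]])
dw, db = gradients(X, np.array([2.0, 4.0]), np.array([0.0]), 0.0)
print(dw, db)
```

Both gradients are negative here, which tells us to increase the weight and the bias to reduce the cost.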

While our gradients do tell us how much to move via the magnitude of the slope, we need more control over the process. To do so, we multiply our gradient by a scalar known as alpha, which is normally a small value. Alpha is often referred to as the “learning rate”, as it dictates how much we can traverse across our cost function (learn) at each iteration. Alpha is what is known as a hyperparameter, and we set this value when we instantiate our model.

Choosing a poor alpha will result in our model not converging: it won't find the optimal values. If alpha is too large, we will step over the optimal point and miss it completely. If alpha is too small, we either won't arrive at the optimal point or it will take a very long time. In the image below, think of the length of the grey arrows as alpha.
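A tiny demo makes the trade-off concrete: minimizing the one-dimensional cost (w − 3)² starting from w = 0, with three hypothetical alphas:

```python
import numpy as np

def descend(alpha, steps=50):
    """Minimize cost(w) = (w - 3)^2 from w = 0 with a given learning rate."""
    w = 0.0
    for _ in range(steps):
        w -= alpha * 2 * (w - 3)  # the gradient of (w - 3)^2 is 2(w - 3)
    return w

print(descend(0.1))    # converges close to the optimum, w = 3
print(descend(0.001))  # too small: barely moves in 50 steps
print(descend(1.1))    # too large: overshoots back and forth and diverges
```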

Now that we have covered gradient descent, the optimization engine, let's implement the model in Python.

The Code

Define Linear Regression Class:

I did my best to annotate each step in the code, so if you are unfamiliar with Python you should still be able to follow the logic!
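Since the code in the original post appears as images, here is a sketch of what the class might look like; the names, defaults, and structure are my assumptions, not the author's exact code:

```python
import numpy as np

class LinearRegression:
    """Linear regression fit by batch gradient descent on MSE."""

    def __init__(self, alpha=0.01, n_iters=1000):
        self.alpha = alpha      # learning rate (hyperparameter)
        self.n_iters = n_iters  # number of gradient descent iterations
        self.history = []       # (weights, bias, cost) logged each iteration

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)  # initialize parameters at zero
        self.b = 0.0
        for _ in range(self.n_iters):
            y_hat = X @ self.w + self.b            # 1. generate predictions
            error = y_hat - y
            cost = np.mean(error ** 2)             # 2. calculate the error (MSE)
            dw = (2 / n_samples) * (X.T @ error)   # 3. partial derivatives...
            db = (2 / n_samples) * np.sum(error)
            self.w -= self.alpha * dw              #    ...used to adjust the
            self.b -= self.alpha * db              #    weights and bias
            self.history.append((self.w.copy(), self.b, cost))
        return self

    def predict(self, X):
        return X @ self.w + self.b
```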

Define Dummy Data
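A stand-in for the dummy-data cell, generating a three-feature dataset from known "actual" parameters; all of the values here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 1000, 3

X = rng.normal(size=(n_samples, n_features))
true_w = np.array([4.0, -2.0, 7.0])  # the "actual" weights to recover
true_b = 10.0                        # the "actual" bias to recover
noise = rng.normal(scale=0.5, size=n_samples)

y = X @ true_w + true_b + noise
print(X.shape, y.shape)  # (1000, 3) (1000,)
```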

Instantiate and Fit Model
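A self-contained stand-in for this cell: it generates hypothetical dummy data, fits by batch gradient descent with a plain loop, and prints the recovered parameters to compare against the actuals:

```python
import numpy as np

# Hypothetical dummy data with known "actual" parameters
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
true_w, true_b = np.array([4.0, -2.0, 7.0]), 10.0
y = X @ true_w + true_b + rng.normal(scale=0.5, size=1000)

# Fit by batch gradient descent on MSE (equivalent to calling fit on the class)
w, b, alpha = np.zeros(3), 0.0, 0.05
for _ in range(1000):
    error = X @ w + b - y
    w -= alpha * (2 / len(y)) * (X.T @ error)
    b -= alpha * (2 / len(y)) * np.sum(error)

print(w, b)  # approximations close to the actuals [4, -2, 7] and 10
```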

Output of the last cell (model parameters)

As we can see from the output, our approximations for the parameters are very close to the actuals! We can plot the model loss at each iteration to see how our model improved each time. The data points are closer together at later iterations because, as mentioned before, as we approach the optimum the slopes become smaller, leading to a smaller tweak each iteration. We can also see how our predictions compare to our actuals.

Visualizing what just happened

It is much easier to visualize gradient descent if we make this a 3D problem, using only one weight and one bias. Let's run our model on a simpler dataset with the following parameters.

When we run the model, it converges and returns the following parameters.

Our approximated weight and bias terms

I created a plane with all of the possible combinations of weight and bias from 0 to 50, calculated a prediction using our linear equation, then computed the cost against the actuals. This is the actual cost function! I also added functionality to the linear regression class to keep historical logs of the weight, bias, and cost at each iteration, which can be seen as the dots on the plane.
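Building that plane can be sketched with a meshgrid over the 0-to-50 grid: compute the MSE at every weight/bias combination via broadcasting. The one-feature dataset and its true parameters below are hypothetical:

```python
import numpy as np

# Simple one-feature dataset (hypothetical true weight 30, true bias 20)
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=200)
y = 30.0 * x + 20.0 + rng.normal(scale=2.0, size=200)

# Every combination of weight and bias from 0 to 50
weights = np.linspace(0, 50, 101)
biases = np.linspace(0, 50, 101)
W, B = np.meshgrid(weights, biases)

# Cost (MSE) at every point on the plane, via broadcasting
preds = W[..., None] * x + B[..., None]   # shape (101, 101, 200)
cost = np.mean((preds - y) ** 2, axis=-1)

# The grid minimum sits near the true parameters
i, j = np.unravel_index(np.argmin(cost), cost.shape)
print(W[i, j], B[i, j])  # near 30 and 20
```

Plotting `W`, `B`, and `cost` as a surface reproduces the plane, and the logged (weight, bias, cost) history can be scattered on top of it as the dots.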

At the initialized values of weight = 0 and bias = 0, cost was very high. At each iteration, we tweaked the weight and bias parameters using the gradients which moved us incrementally closer to the minimum! The exact same thing happened with our first model that had 3 different weights, just across multiple dimensions!

Wrap up

Hopefully this article helped clarify the foundational concepts of linear regression and gradient descent. Gradient descent and partial derivatives can be explained in a digestible way with the proper approach (and some visualizations). There are other important aspects of linear regression such as coefficient interpretation and model assumptions which I would encourage you to research further if you plan to employ these models. Thank you for reading!
