An Introduction to Machine Learning in Python: Simple Linear Regression

Hunter Phillips
14 min read · May 17, 2023


Simple linear regression offers an elegant introduction to machine learning. It can be used to identify the relationship between an independent variable and a dependent variable. Using gradient descent, a basic model can be trained to fit a set of points for future prediction.

Background

This is the first article of a series covering regression, gradient descent, classification, and other fundamental aspects of machine learning. This article focuses on simple linear regression, which identifies the line of best fit for a set of points, allowing for future predictions to be made.

The Line of Best Fit

Image by Author

The line of best fit is the equation that most accurately represents a set of points. For a given input, the equation’s output should be as close to the expected output as possible.

In the image above, it is clear that the middle line fits the blue points better than the left or right lines. However, is it the line of best fit? Could there be a fourth line that fits these points better? The line could certainly be shifted up or down so that an equal number of points fall above and below it, but a dozen lines could satisfy that criterion. What makes any one of them the best?

Thankfully, there is a way to mathematically identify a line of best fit for a set of points using regression.

Regression

Regression helps identify the relationship between two or more variables, and it takes many forms, including simple linear, multiple linear, polynomial, and more. To demonstrate the usefulness of this approach, simple linear regression will be used.

Simple linear regression attempts to find the line of best fit for a set of points. More specifically, it identifies the relationship between an independent variable and a dependent variable. The line of best fit has the form of y = mx + b.

  • x is the input or independent variable
  • m is the slope, or steepness, of the line
  • b is the y-intercept
  • y is the output or dependent variable

The goal of simple linear regression is to identify the values of m and b that will generate the most accurate y value for a given x. This equation, also known as the model, can be rewritten in machine learning terms, where w represents a “weight”: Ŷ = Xw₁ + w₀

  • X is the input or feature
  • w₁ is the slope
  • w₀ is the bias, or y-intercept
  • Ŷ is the prediction, which is pronounced as “y-hat”
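
As a minimal sketch (with made-up, not learned, values for the weights), the model is just a linear function of the input:

# a minimal sketch of the model with placeholder weights
w1 = 2.0   # slope (made-up value)
w0 = 1.0   # bias / y-intercept (made-up value)

def predict(x):
    # Ŷ = x*w1 + w0
    return x*w1 + w0

print(predict(3))   # 7.0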

While this is useful, the equation needs to be assessed for its accuracy. If its predictions are poor, it is not very useful. To do this, a cost, or loss, function is used.

The Cost or Loss Function

Regression requires some method of tracking the accuracy of the model’s predictions. Given the inputs, are the outputs of the equation as close as possible to the expected output? A cost function, also known as a loss function, is used to identify the accuracy of an equation.

For instance, if the expected output is 5, and the equation outputs 18, the loss function should represent this difference. A simple loss function could output 13, which is the difference between these values. This indicates the model’s performance is poor. On the other hand, if the expected output is 5 and the model predicts 5, the loss function should output 0, which indicates the model’s performance is excellent.

A commonly used loss function that does this is the mean squared error (MSE):

MSE = (1/n) · Σ (Ŷ − Y)², where the sum runs over all n points

This function finds the difference between the model’s prediction (Ŷ) and the expected output (Y), then squares the difference to ensure the result is always positive. It does this across a set of points of size n. By summing the squared differences of all the points and dividing by n, the output is the mean squared difference (error). It is an easy way of assessing the model’s performance on all the points simultaneously. A simple example can be seen below:

Table by Author
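
As a quick numerical sketch of the same idea (with made-up predictions and expected outputs), the MSE can be computed directly:

import torch

# made-up predictions and expected outputs for illustration
yhat_example = torch.tensor([18., 5., 7.])   # model predictions
y_example    = torch.tensor([ 5., 5., 6.])   # expected outputs

squared_errors = (yhat_example - y_example)**2   # tensor([169., 0., 1.])
mse = squared_errors.mean()                      # (169 + 0 + 1) / 3
print(mse)                                       # ≈ 56.67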

While countless other loss functions would work here, the MSE is one of the most popular loss functions in machine learning for regression due to its simplicity, especially when it comes to gradient descent, which will be explained later.

To best understand where gradient descent comes in, an example can be evaluated.

Predicting the Line of Best Fit

To show simple linear regression in action, data to train the model is required. This comes in the form of an X array and a Y array. For this example, the data can be generated manually from a “blueprint” function, with randomness added so the model is forced to learn the underlying function. PyTorch, a standard machine learning library, is used to implement the regression.

Generating the Data

To start, the code below generates an array of input values using a random integer generator. X currently has a shape of (n samples, num features). Remember, a feature is an independent variable, and simple linear regression has one. For this example, n will be 20.

import torch

torch.manual_seed(5)
torch.set_printoptions(precision=2)

# (n samples, features)
X = torch.randint(low=0, high=11, size=(20, 1))
tensor([[ 9],
[10],
[ 0],
[ 3],
[ 8],
[ 8],
[ 0],
[ 4],
[ 1],
[ 0],
[ 7],
[ 9],
[ 3],
[ 7],
[ 9],
[ 7],
[ 3],
[10],
[10],
[ 4]])

These values can then be passed through Y = 1.5X + 2 to generate output values, and some randomness can be added to these values using the normal distribution with a mean of 0 and standard deviation of 1. Y will have a shape of (n samples, 1).

The code below shows the random values, which have the same shape.

torch.manual_seed(5)

# normal distribution with a mean of 0 and std of 1
normal = torch.distributions.Normal(loc=0, scale=1)

normal.sample(X.shape)
tensor([[ 1.84],
[ 0.52],
[-1.71],
[-1.70],
[-0.13],
[-0.60],
[ 0.14],
[-0.15],
[ 2.61],
[-0.43],
[ 0.35],
[-0.06],
[ 1.48],
[ 0.49],
[ 0.25],
[ 1.75],
[ 0.74],
[ 0.03],
[-1.17],
[-1.51]])

Finally, Y can be calculated with the code below.

Y = (1.5*X + 2) + normal.sample(X.shape)

Y
tensor([[15.00],
[15.00],
[-0.36],
[ 6.75],
[13.59],
[15.16],
[ 2.33],
[ 8.72],
[ 2.67],
[ 1.81],
[13.74],
[14.06],
[ 7.15],
[12.81],
[15.91],
[13.15],
[ 6.76],
[18.05],
[18.71],
[ 6.80]])

They can also be plotted together with matplotlib for a better understanding of their relationship:

import matplotlib.pyplot as plt

plt.scatter(X,Y)
plt.xlim(-1,11)
plt.ylim(0,20)
plt.xlabel("$X$")
plt.ylabel("$Y$")
plt.show()
Image by Author

While it may seem counterintuitive to generate data for the example, it is a great way to demonstrate how regression works. The model, which can be seen below, will only be provided X and Y, and it will need to identify w₁ as 1.5 and w₀ as 2.

Image by Author

The weights can be stored in an array, w, which holds one weight for the bias and one for each feature. It has a shape of (num features + 1 bias, 1). For this example, the array will have a shape of (2, 1).

torch.manual_seed(5)
w = torch.rand(size=(2, 1))
w
tensor([[0.83],
[0.13]])

With these values generated, the model can be created.

Creating the Model

The first step for the model is to define a function for the line of best fit and another for the MSE.

As mentioned before, the model has an equation of Ŷ = Xw₁ + w₀. As of now, the bias is added to every sample, which is equivalent to broadcasting the bias to the same size as X and adding the arrays together. The output can be seen below.

w[1]*X + w[0]
tensor([[1.97],
[2.09],
[0.83],
[1.21],
[1.84],
[1.84],
[0.83],
[1.33],
[0.96],
[0.83],
[1.71],
[1.97],
[1.21],
[1.71],
[1.97],
[1.71],
[1.21],
[2.09],
[2.09],
[1.33]])

The function below computes the output.

# line of best fit
def model(w, X):
    """
    Inputs:
    w: array of weights | (num features + 1 bias, 1)
    X: array of inputs | (n samples, num features)

    Output:
    returns the predictions | (n samples, 1)
    """

    return w[1]*X + w[0]

The function for the MSE is straightforward:

# mean squared error (MSE)
def MSE(Yhat, Y):
    """
    Inputs:
    Yhat: array of predictions | (n samples, 1)
    Y: array of expected outputs | (n samples, 1)

    Output:
    returns the loss of the model, which is a scalar
    """

    return torch.mean((Yhat-Y)**2) # mean((error)^2)

Previewing the Line of Best Fit

With the functions created, the line of best fit can be previewed with a plot, and a standard function can be created for future use. It will display the line of best fit in red, the predictions for each input in orange, and the expected outputs in blue.

def plot_lbf():
    """
    Output:
    plots the line of best fit in comparison to the training data
    """

    # plot the points
    plt.scatter(X, Y)

    # predictions for the line of best fit
    Yhat = model(w, X)
    plt.scatter(X, Yhat, zorder=3) # plot the predictions

    # plot the line of best fit
    X_plot = torch.arange(-1, 11+0.1, .1) # generate values with a step of .1
    plt.plot(X_plot, model(w, X_plot), color="red", zorder=0)

    plt.xlim(-1, 11)
    plt.xlabel("$X$")
    plt.ylabel("$Y$")
    plt.title(f"MSE: {MSE(Yhat, Y):.2f}")
    plt.show()

plot_lbf()
Image by Author

The output with the current weights is not ideal since the MSE is 105.29. To get a better MSE, different weights need to be chosen. They could be randomized again, but the chance of acquiring the perfect line would be minimal. This is where the gradient descent algorithm can be used to alter the value of the weights in a defined manner.

Gradient Descent

An explanation of the gradient descent algorithm can be found here: A Simple Introduction to Gradient Descent. The article should be read before moving on to avoid confusion.

To summarize that article: gradient descent uses the gradient of the cost function to reveal the direction and magnitude of each weight’s influence on it. By scaling the gradient with a learning rate and subtracting it from each weight’s current value, the cost function is minimized, forcing the model’s predictions as close to the expected outputs as possible.

For simple linear regression, the function being minimized is the MSE. Each weight has its own partial derivative, which is plugged into the update rule wⱼ := wⱼ − lr · ∂MSE/∂wⱼ:

∂MSE/∂w₀ = (2/n) · Σ (Ŷ − Y)
∂MSE/∂w₁ = (2/n) · Σ X(Ŷ − Y)

The Python implementation can be seen below.

# optimizer
def gradient_descent(w):
    """
    Inputs:
    w: array of weights | (num features + 1 bias, 1)

    Global Variables / Constants:
    X: array of inputs | (n samples, num features)
    Y: array of expected outputs | (n samples, 1)
    lr: learning rate to scale the gradient

    Output:
    returns the updated weights
    """

    n = len(X)

    # update the bias using its partial derivative
    w[0] = w[0] - lr*2/n * torch.sum(model(w,X) - Y)

    # update the weight using its partial derivative
    w[1] = w[1] - lr*2/n * torch.sum(X*(model(w,X) - Y))

    return w
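
As an optional sanity check (a quick sketch, not part of the original walkthrough), the hand-derived partial derivatives can be compared against PyTorch’s autograd at the current weights; the two should agree:

# sketch: compare the hand-derived gradient to autograd at the current weights
w_check = w.clone().detach().requires_grad_(True)

loss = MSE(model(w_check, X), Y)   # same model and loss as above
loss.backward()

n = len(X)
manual_db = 2/n * torch.sum(model(w, X) - Y)       # ∂MSE/∂w0
manual_dw = 2/n * torch.sum(X*(model(w, X) - Y))   # ∂MSE/∂w1

print(w_check.grad.flatten())   # autograd: [∂MSE/∂w0, ∂MSE/∂w1]
print(manual_db, manual_dw)     # should match the values above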

Now, the function can be used to update the weights. The learning rate is selected empirically, but it is normally a small value. The new line of best fit can also be plotted.

lr = 0.01

print("weights before:", w.flatten())
print("MSE before:", MSE(model(w,X), Y))

# update the weights
w = gradient_descent(w)

print("weights after:", w.flatten())
print("MSE after:", MSE(model(w,X), Y))

plot_lbf()
weights before: tensor([0.83, 0.13])
MSE before: tensor(105.29)
weights after: tensor([1.01, 1.46])
MSE after: tensor(2.99)
Image by Author

The MSE decreased by more than 100 on the first try, but the line still doesn’t fit the points perfectly. Remember, the goal is to get w₀ as 2 and w₁ as 1.5. To continue training, gradient descent can be performed 500 more times, and the new result can be examined.

# perform 500 more updates
for i in range(0, 500):
    # update the weights
    w = gradient_descent(w)

    # print the new values every 100 epochs
    if (i+1) % 100 == 0:
        print("epoch:", i+1)
        print("weights:", w.flatten())
        print("MSE:", MSE(model(w,X), Y))
        print("="*10)

plot_lbf()
epoch: 100
weights: tensor([1.44, 1.59])
MSE: tensor(1.31)
==========
epoch: 200
weights: tensor([1.67, 1.56])
MSE: tensor(1.25)
==========
epoch: 300
weights: tensor([1.80, 1.54])
MSE: tensor(1.24)
==========
epoch: 400
weights: tensor([1.87, 1.53])
MSE: tensor(1.23)
==========
epoch: 500
weights: tensor([1.91, 1.52])
MSE: tensor(1.23)
==========
Image by Author

After 500 epochs, the MSE is 1.23. w₀ is 1.91, and w₁ is 1.52. This means the model successfully identified the line of best fit. Additional updates could be performed, but the randomness added to the output values will likely prevent the model from achieving a perfect prediction.

To build additional intuition about how gradient descent works, the impact of w₀ and w₁ can be examined by plotting them against their output, the MSE. The function for plotting gradient descent can be found in the appendix, and its output can be seen below:

torch.manual_seed(5)
w = torch.rand(size=(2, 1))

w0s, w1s, losses = list(), list(), list()

# run gradient descent, recording the weights and loss every 10 epochs
for i in range(0, 500):
    if i == 0 or (i+1) % 10 == 0:
        w0s.append(float(w[0]))
        w1s.append(float(w[1]))
        losses.append(MSE(model(w,X), Y))

    # update the weights
    w = gradient_descent(w)

plot_GD([-2, 5.2], [-2, 5.2])
Image by Author

Each orange point represents a recorded set of weights, and the red line traces the change from one recorded iteration to the next. The largest update is from the first to the second iteration, which appears as the longest red segment. The remaining orange points are close together because their derivatives are small, which makes the updates smaller. The plot shows how the weights update until the optimal MSE is reached.

While the approach is useful, it could be simplified in a few ways. First, it does not take advantage of matrix multiplication, which would simplify the equation for the model. Second, gradient descent is not a closed-form solution to regression since the number of epochs and the learning rate vary for every problem, and the solution is an approximation. The last section of this article will address the first problem, and the next article will address the second.

An Alternative Approach

While this approach is useful, it is not as simple as it could be. It does not take advantage of matrices. As of now, the entire equation, Ŷ = Xw₁+ w₀, is used for the model’s function, and the partial derivative of each weight has to be calculated individually for gradient descent. By using matrix operations and calculus, both functions simplify.

To start, X has a shape of (n samples, num features), and w has a shape of (num features + 1 bias, 1). By adding an additional column to X, matrix multiplication can be used, because X will have a new shape of (n samples, num features + 1 bias). The new column is a column of ones that is multiplied by the bias, producing a column of bias values. This is equivalent to broadcasting the bias, which is how the predictions were previously calculated.

X = torch.hstack((torch.ones(X.shape),X))
X
tensor([[ 1.,  9.],
[ 1., 10.],
[ 1., 0.],
[ 1., 3.],
[ 1., 8.],
[ 1., 8.],
[ 1., 0.],
[ 1., 4.],
[ 1., 1.],
[ 1., 0.],
[ 1., 7.],
[ 1., 9.],
[ 1., 3.],
[ 1., 7.],
[ 1., 9.],
[ 1., 7.],
[ 1., 3.],
[ 1., 10.],
[ 1., 10.],
[ 1., 4.]])

This changes the equation to Ŷ = X₁w₁ + X₀w₀. Moving forward, the bias can be considered a feature, so num features can represent both the independent variable and the bias, and the + 1 bias can be omitted. Therefore, X has a size of (n samples, num features), and w has a size of (num features, 1). When they are multiplied together, the output is the prediction vector, which has a size of (n samples, 1). The output of the matrix multiplication is the same as w[1]*X + w[0].

torch.manual_seed(5)
w = torch.rand(size=(2, 1))

torch.matmul(X, w)
tensor([[1.97],
[2.09],
[0.83],
[1.21],
[1.84],
[1.84],
[0.83],
[1.33],
[0.96],
[0.83],
[1.71],
[1.97],
[1.21],
[1.71],
[1.97],
[1.71],
[1.21],
[2.09],
[2.09],
[1.33]])
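
A quick check (a sketch) confirms the equivalence: column 0 of the new X is the ones column for the bias and column 1 holds the original inputs, so the matrix product should match the earlier broadcast expression up to floating-point tolerance:

# sketch: compare the matrix product to the earlier broadcast version
print(torch.allclose(torch.matmul(X, w), w[1]*X[:, 1:] + w[0]))   # expected: True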

With this in mind, the model’s function can be updated:

# line of best fit
def model(w, X):
    """
    Inputs:
    w: array of weights | (num features, 1)
    X: array of inputs | (n samples, num features)

    Output:
    returns the output of X@w | (n samples, 1)
    """

    return torch.matmul(X, w)

Since each weight no longer needs to be treated as an individual component, the gradient descent algorithm can also be updated. Based on A Simple Introduction to Gradient Descent, the gradient descent update in matrix form is:

w := w − lr · (2/n) · Xᵀ(Xw − Y)

This can easily be implemented with PyTorch. Since the implementation below computes the gradient as a row vector, the result needs to be reshaped to match w before the subtraction.

# optimizer
def gradient_descent(w):
    """
    Inputs:
    w: array of weights | (num features, 1)

    Global Variables / Constants:
    X: array of inputs | (n samples, num features)
    Y: array of expected outputs | (n samples, 1)
    lr: learning rate to scale the gradient

    Output:
    returns the updated weights | (num features, 1)
    """

    n = X.shape[0]

    # gradient as a row vector: (2/n)(wᵀXᵀX − YᵀX), reshaped to match w
    gradient = 2/n * (torch.matmul(-Y.T, X)
                      + torch.matmul(torch.matmul(w.T, X.T), X)).reshape(w.shape)

    return w - lr * gradient
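
As a side note, the same update can be written with the column-vector form of the gradient, Xᵀ(Xw − Y), which avoids the transpose-and-reshape step. The sketch below is an equivalent alternative, not the version used to produce the results that follow:

# sketch: equivalent optimizer using the column-vector form of the gradient
def gradient_descent_alt(w):
    n = X.shape[0]
    grad = 2/n * torch.matmul(X.T, torch.matmul(X, w) - Y)   # (num features, 1)
    return w - lr * grad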

Using 500 epochs, essentially the same output can be generated as before:

lr = 0.01

# perform 500 updates
for i in range(0, 500):
    # update the weights
    w = gradient_descent(w)

    # print the new values every 100 epochs
    if (i+1) % 100 == 0:
        print("epoch:", i+1)
        print("weights:", w.flatten())
        print("MSE:", MSE(model(w,X), Y))
        print("="*10)
epoch: 100
weights: tensor([1.43, 1.59])
MSE: tensor(1.31)
==========
epoch: 200
weights: tensor([1.66, 1.56])
MSE: tensor(1.25)
==========
epoch: 300
weights: tensor([1.79, 1.54])
MSE: tensor(1.24)
==========
epoch: 400
weights: tensor([1.87, 1.53])
MSE: tensor(1.23)
==========
epoch: 500
weights: tensor([1.91, 1.53])
MSE: tensor(1.23)
==========

Since these functions do not require additional variables to be manually added for each feature, they can be used for multiple linear regression and polynomial regression.
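
For instance, here is a rough sketch of how the same model, MSE, and gradient_descent functions could be reused for multiple linear regression with two features. The data, learning rate, and epoch count are hypothetical and the exact results are not verified here; note that the functions read the global X, Y, and lr, so those are reassigned:

# sketch: multiple linear regression with two features (hypothetical data)
torch.manual_seed(5)

features = torch.randint(low=0, high=11, size=(20, 2)).float()   # two independent variables
X = torch.hstack((torch.ones(20, 1), features))                  # (n samples, num features)
Y = 3*features[:, 0:1] - 2*features[:, 1:2] + 1 + normal.sample((20, 1))

w = torch.rand(size=(3, 1))   # one weight per column of X (bias + 2 features)
lr = 0.005                    # a smaller learning rate may be needed for stability

for i in range(500):
    w = gradient_descent(w)

print(w.flatten())            # should move toward roughly [1, 3, -2]
print(MSE(model(w, X), Y))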

Conclusion

The next article will discuss the closed-form solution to regression that does not approximate the weights. Instead, the minimizing values will be computed directly in An Introduction to Machine Learning in Python: The Normal Equation for Regression in Python.

Please don’t forget to like and follow! :)

References

  1. Plotly 3D Plots

Appendix

Plotting Gradient Descent

This function utilizes Plotly to display gradient descent in three dimensions.

import plotly.graph_objects as go
import plotly
import plotly.express as px

def plot_GD(w0_range, w1_range):
    """
    Inputs:
    w0_range: weight range [w0_low, w0_high]
    w1_range: weight range [w1_low, w1_high]

    Global Variables:
    X: array of inputs | (n samples, num features)
    Y: array of expected outputs | (n samples, 1)
    w0s, w1s, losses: weights and MSE values recorded during gradient descent

    Output:
    plots gradient descent on the loss surface
    """

    # generate all the possible weight combinations (w0, w1)
    w0_plot, w1_plot = torch.meshgrid(torch.arange(w0_range[0], w0_range[1], 0.1),
                                      torch.arange(w1_range[0], w1_range[1], 0.1))

    # rearrange into coordinate pairs
    w_plot = torch.hstack((w0_plot.reshape(-1, 1), w1_plot.reshape(-1, 1)))

    # calculate the MSE for each pair
    mse_plot = [MSE(model(w, X), Y) for w in w_plot]

    # plot the loss surface
    fig = go.Figure(data=[go.Mesh3d(x=w_plot[:, 0],
                                    y=w_plot[:, 1],
                                    z=mse_plot)])

    # plot gradient descent on the loss surface
    fig.add_scatter3d(x=w0s,
                      y=w1s,
                      z=losses,
                      marker=dict(size=3, color="orange"),
                      line=dict(color="red", width=5))

    # prepare ranges for plotting
    xaxis_range = [w0 + 0.01 if w0 < 0 else w0 - 0.01 for w0 in w0_range]
    yaxis_range = [w1 + 0.01 if w1 < 0 else w1 - 0.01 for w1 in w1_range]

    fig.update_layout(scene=dict(xaxis_title='w<sub>0</sub>',
                                 yaxis_title='w<sub>1</sub>',
                                 zaxis_title='MSE',
                                 xaxis_range=xaxis_range,
                                 yaxis_range=yaxis_range))
    fig.show()
