An Introduction to Machine Learning in Python: Multiple Linear Regression

Hunter Phillips
7 min read · May 19, 2023


Multiple linear regression is used to assess the relationship between many independent variables and one dependent variable.

Background

This article follows An Introduction to Machine Learning in Python: Simple Linear Regression, which covered simple linear regression, gradient descent, and the MSE. This article covers multiple linear regression and introduces some new machine learning terminology.

Multiple Linear Regression

While simple linear regression has an equation of Ŷ = w₁X₁ + w₀X₀, multiple linear regression has a generic formula for k independent variables:

Ŷ = wₖXₖ + wₖ₋₁Xₖ₋₁ + … + w₁X₁ + w₀X₀

Note that X₀ is a column of ones for the bias; this allows for the generalized formula discussed in the first article. As the formula demonstrates, multiple linear regression helps identify the relationship between many independent variables (X) and a single dependent variable (Ŷ). It does this by learning the values of each weight (w).
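Because the bias is folded into the weight vector, every prediction reduces to a single matrix-vector product. The small sketch below uses hypothetical values, not this article's dataset, purely to show the idea:

import torch

# minimal sketch: with X0 = 1 in every row, the bias w0 is handled by the same
# matrix-vector product as the other weights, so Yhat = X @ w
X = torch.tensor([[1., 4., 7.],    # each row is one sample: (X0=1, X1, X2)
                  [1., 5., 8.]])
w = torch.tensor([[2.],            # w0 (bias)
                  [3.],            # w1
                  [6.]])           # w2

Yhat = torch.matmul(X, w)          # shape (2, 1): one prediction per sample
print(Yhat)                        # tensor([[56.], [65.]])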

To demonstrate this in action, multiple linear regression with 3 weights can be used:

Ŷ = w₂X₂ + w₁X₁ + w₀X₀

This formula will create a “plane of best fit” for three-dimensional data:

Image by Author

The Implementation

Image by Author

As in the previous example, a “blueprint” equation can be used, and randomness can be added. Then, the model can try to learn the weights. When using a model on real data, it is common to split the data into at least two sets: the train set and the test set. The train set is used to train the model and acquire the weights. The test set is used to evaluate the model’s performance on data it has never seen before. If it performs well on both, it is more likely to be useful for real-world applications. A split of 80% train and 20% test is common.
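As a rough sketch, one common way to make such a split is to shuffle the sample indices before slicing. The helper below is hypothetical and is not used later in this article; the example that follows simply slices the generated data, which works here because the samples are already in random order.

import torch

# sketch of a shuffled 80/20 split (hypothetical helper)
def train_test_split(X, Y, train_frac=0.8, seed=0):
    torch.manual_seed(seed)
    idx = torch.randperm(X.shape[0])        # shuffled sample indices
    cutoff = int(train_frac * X.shape[0])   # e.g. 800 of 1000 samples
    return X[idx[:cutoff]], X[idx[cutoff:]], Y[idx[:cutoff]], Y[idx[cutoff:]]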

For this example, 1000 samples can be generated and split into test and train sets. Also, since there are two independent variables and a bias, there will be three columns. Each column will represent an independent variable or bias, and each row will represent a sample of the variables, (X₀, X₁, X₂). The shape of the overall data will be a matrix with a size of (1000, 3); a general shape would be (n samples, num features). Remember, independent variables are also known as features in machine learning. The train set will have a size of (800, 3), and the test set will have a size of (200, 3).

The data for this article will be based around Y = 6X₂ + 3X₁ + 2. This means w₀ is 2, w₁ is 3, and w₂ is 6. In the example below, 1000 values between -250 and 250 are generated for X₁ and X₂, and 1000 ones are generated for X₀. They are reshaped into columns and stacked horizontally to create a matrix with a size of (1000, 3). The output is generated using the aforementioned equation, and noise drawn from a normal distribution with a mean of 0 and a standard deviation of 10 is added.

import torch

torch.manual_seed(5)
torch.set_printoptions(precision=2)

# create ones for the bias | 1000 ones
X0 = torch.ones(1000).reshape(-1,1)

# create values for the first feature | 1000 numbers from -250 to 250
X1 = (500*(torch.rand(1000) - 0.5)).reshape(-1,1)

# create values for the second feature | 1000 numbers from -250 to 250
X2 = (500*(torch.rand(1000) - 0.5)).reshape(-1,1)

# stack data together, X0 = X[:,0], X1 = X[:,1], X2 = X[:,2]
X = torch.hstack((X0, X1, X2))

# normal distribution with a mean of 0 and std of 10
normal = torch.distributions.Normal(loc=0, scale=10)

# output
Y = ((6*X[:,2] + 3*X[:,1] + 2*X[:,0]) + normal.sample(torch.ones(1000).shape)).reshape(-1,1)

This data can be previewed before being split.

import plotly.express as px

fig = px.scatter_3d(x=X[:,1].flatten(),
                    y=X[:,2].flatten(),
                    z=Y.flatten())

fig.update_traces(marker_size=3)
fig.update_layout(scene=dict(xaxis_title='X<sub>1</sub>',
                             yaxis_title='X<sub>2</sub>',
                             zaxis_title='Y'))
fig.show()
Image by Author

Now, the data can be split into test and train data:

# split the data
Xtrain, Xtest = X[:800], X[800:]
Ytrain, Ytest = Y[:800], Y[800:]

The train data can then be fit with a plane. To start, the functions for the model, MSE, and gradient descent need to be defined. The same ones from the first article, An Introduction to Machine Learning in Python: Simple Linear Regression, can be used. The end of the article will use the Normal Equation to verify the answer.

# plane of best fit
def model(w, X):
    """
    Inputs:
      w: array of weights | (num features, 1)
      X: array of inputs | (n samples, num features)

    Output:
      returns the output of X@w | (n samples, 1)
    """

    return torch.matmul(X, w)


# mean squared error (MSE)
def MSE(Yhat, Y):
    """
    Inputs:
      Yhat: array of predictions | (n samples, 1)
      Y: array of expected outputs | (n samples, 1)

    Output:
      returns the loss of the model, which is a scalar
    """

    return torch.mean((Yhat - Y)**2)  # mean((error)^2)


# optimizer
def gradient_descent(w):
    """
    Inputs:
      w: array of weights | (num features, 1)

    Global Variables / Constants:
      X: array of inputs | (n samples, num features)
      Y: array of expected outputs | (n samples, 1)
      lr: learning rate to scale the gradient

    Output:
      returns the updated weights
    """

    n = X.shape[0]

    return w - (lr * 2/n) * (torch.matmul(-Y.T, X) + torch.matmul(torch.matmul(w.T, X.T), X)).reshape(w.shape)
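As a side note, the update inside gradient_descent is algebraically equivalent to the more familiar form of the MSE gradient, (2/n) · Xᵀ(Xw − Y). A sketch of the same optimizer written that way (an alternative, not the version used in the training loop below) would be:

# equivalent optimizer (sketch): the MSE gradient is (2/n) * X.T @ (X @ w - Y),
# which expands to the -Y.T @ X + w.T @ X.T @ X expression used above
def gradient_descent_alt(w):
    n = X.shape[0]
    grad = (2/n) * torch.matmul(X.T, torch.matmul(X, w) - Y)  # (num features, 1)
    return w - lr * grad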

Training the Model

With the functions created, the model can be trained to identify the plane of best fit. To start, three random weights can be generated.

torch.manual_seed(5)
w = torch.rand(size=(3, 1))
w
tensor([[0.83],
[0.13],
[0.91]])

The current plane of best fit and its MSE can be analyzed below. The plane is in orange, the train set is in red, and the test set is in green.

import plotly.graph_objects as go

def plot_model(x1_range, x2_range):
    """
    Inputs:
      x1_range: x1-axis range [low, high]
      x2_range: x2-axis range [low, high]

    Global Variables:
      Xtrain: array of inputs | (n train samples, num features)
      Ytrain: array of expected outputs | (n train samples, 1)
      Xtest: array of inputs | (n test samples, num features)
      Ytest: array of expected outputs | (n test samples, 1)

    Output:
      prints the plane of best fit
    """

    # meshgrid of possible combinations of (X1, X2)
    X1_plot, X2_plot = torch.meshgrid(torch.arange(x1_range[0], x1_range[1], 5),
                                      torch.arange(x2_range[0], x2_range[1], 5))
    X0_plot = torch.ones(X1_plot.shape)

    # stack each point together as (X0, X1, X2)
    X_plot = torch.hstack((X0_plot.reshape(-1,1),
                           X1_plot.reshape(-1,1),
                           X2_plot.reshape(-1,1)))

    # all possible model predictions (Yhat = Z)
    Yhat = model(w, X_plot)

    # model's plane of best fit
    fig = go.Figure(data=[go.Mesh3d(x=X_plot[:,1].flatten(),
                                    y=X_plot[:,2].flatten(),
                                    z=Yhat.flatten(),
                                    color='orange',
                                    opacity=0.50)])

    # training data
    fig.add_scatter3d(x=Xtrain[:,1].flatten(),
                      y=Xtrain[:,2].flatten(),
                      z=Ytrain.flatten(),
                      mode="markers",
                      marker=dict(size=3),
                      name="train")

    # test data
    fig.add_scatter3d(x=Xtest[:,1].flatten(),
                      y=Xtest[:,2].flatten(),
                      z=Ytest.flatten(),
                      mode="markers",
                      marker=dict(size=3),
                      name="test")

    # name axes
    fig.update_layout(scene=dict(xaxis_title='X<sub>1</sub>',
                                 yaxis_title='X<sub>2</sub>',
                                 zaxis_title='Y'))

    fig.show()

plot_model([-250,250], [-250,250])
Image by Author
MSE(model(w,Xtrain), Ytrain)
tensor(653812.81)

Now, a training loop can be created to minimize the MSE. By using 50,000 epochs and a learning rate of 0.00004, the output becomes extremely accurate. These values were chosen empirically. Both the train and test MSE can be seen as well.

torch.manual_seed(5)
w = torch.rand(size=(3, 1))

lr = 0.00004
epochs = 50000

# update the weights "epochs" times
for i in range(0, epochs):
    # update the weights
    w = gradient_descent(w)

    # print the new values every 10,000 iterations
    if (i+1) % 10000 == 0:
        print("epoch:", i+1)
        print("weights:", w)
        print("Train MSE:", MSE(model(w, Xtrain), Ytrain))
        print("Test MSE:", MSE(model(w, Xtest), Ytest))
        print("="*10)

plot_model([-250,250], [-250,250])
epoch: 10000
weights: tensor([[1.51],
[2.98],
[6.00]])
Train MSE: tensor(87.52)
Test MSE: tensor(70.03)
==========
epoch: 20000
weights: tensor([[1.82],
[2.98],
[6.00]])
Train MSE: tensor(87.20)
Test MSE: tensor(70.04)
==========
epoch: 30000
weights: tensor([[1.96],
[2.98],
[6.00]])
Train MSE: tensor(87.12)
Test MSE: tensor(70.11)
==========
epoch: 40000
weights: tensor([[2.02],
[2.98],
[6.00]])
Train MSE: tensor(87.09)
Test MSE: tensor(70.16)
==========
epoch: 50000
weights: tensor([[2.05],
[2.98],
[6.00]])
Train MSE: tensor(87.08)
Test MSE: tensor(70.18)
==========
Image by Author

The model predicted the plane of best fit to be Ŷ = 6X₂ + 2.98X₁ + 2.05 instead of Y = 6X₂ + 3X₁ + 2. The train and test MSEs are within about 17 points of each other, which indicates the model generalizes to the unseen test data. The main limitation of this approach is the number of epochs required to minimize the loss function. An alternative is a closed-form solution, which was covered in the previous article, An Introduction to Machine Learning in Python: The Normal Equation for Regression in Python. A closed-form solution does not require a learning rate or epochs to acquire the weights that minimize the loss.

The Normal Equation

# the Normal Equation
def NormalEquation(X, Y):
    """
    Inputs:
      X: array of input values | (n samples, num features)
      Y: array of expected outputs | (n samples, 1)

    Output:
      returns the optimized weights | (num features, 1)
    """

    return torch.inverse(X.T @ X) @ X.T @ Y

The code above is the Python implementation for the Normal Equation, which was derived in the previous article. The training data from this example can be used in the equation to directly calculate the optimized weights.

w = NormalEquation(Xtrain, Ytrain)
w
tensor([[2.19],
        [2.98],
        [6.00]])

The MSE can also be calculated:

MSE(model(w, Xtrain), Ytrain), MSE(model(w, Xtest), Ytest)
(tensor(87.08), tensor(70.20))

The weights and MSE from the Normal Equation and Gradient Descent approaches are nearly identical. In this case, both are equally valid.
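As a quick sanity check (a sketch: since w now holds the Normal Equation result, the gradient-descent weights are re-entered by hand from the rounded printout above), both weight vectors can be evaluated on the held-out test set:

# sketch: evaluate both solutions on the test set
w_gd = torch.tensor([[2.05], [2.98], [6.00]])   # gradient-descent weights (rounded, from above)
w_ne = NormalEquation(Xtrain, Ytrain)           # closed-form weights

print("GD test MSE:", MSE(model(w_gd, Xtest), Ytest))
print("NE test MSE:", MSE(model(w_ne, Xtest), Ytest))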

Conclusion

Multiple linear regression is useful for identifying the relationship between two or more independent variables, or features, and one dependent variable. It is also the basis of polynomial regression, which will be examined in the next article: An Introduction to Machine Learning in Python: Polynomial Regression.

Please don’t forget to like and follow! :)

