Linear regression in Machine Learning: A mathematical guide

Chamuditha Kekulawala
7 min read · Jun 7, 2024


If you have trained machine learning models before, you may have been surprised by how much you can get done without knowing anything about what’s under the hood. You can treat Machine Learning models and their training algorithms mostly like black boxes. In many situations you don’t really need to know the implementation details.

A black box is a system which produces useful information without revealing any information about its internal workings.

However, having a good understanding of how things work can help you to:

  • Quickly home in on the appropriate model, the right training algorithm to use, and a good set of hyperparameters for your task.
  • Debug issues and perform error analysis more efficiently.
  • Understand, build, and train neural networks, for which these concepts are essential.

Linear Regression model

We will start by looking at the Linear Regression model, one of the simplest models there is. We will discuss two very different ways to train it:

  • Using a closed-form equation that directly computes the model parameters that best fit the training set (the Normal Equation)
  • Using an iterative optimization approach, called Gradient Descent (GD)

Let’s look at a simple regression model of life satisfaction as a function of GDP per capita: life_satisfaction = θ₀ + θ₁ × GDP_per_capita. This model is just a linear function of the input feature GDP_per_capita. θ₀ and θ₁ are the model’s parameters.

More generally, a linear model makes a prediction ŷ (called y-hat) by simply computing a weighted sum of the input features (x₁, x₂, etc.) plus a constant called the bias term (θ₀, also called the intercept term), as shown in the equation below:

ŷ = θ₀ + θ₁x₁ + θ₂x₂ + ⋯ + θₙxₙ

This can be written much more concisely using a vectorized form, as shown below (and illustrated in code right after the following list):

ŷ = hθ(x) = θ · x

  • θ is the model’s parameter vector, containing the bias term θ₀ and the feature weights θ₁ to θₙ.
  • x is the instance’s feature vector, containing x₀ to xₙ, with x₀ always equal to 1.
  • θ · x is the dot product of the vectors θ and x, which is of course equal to θ₀x₀ + θ₁x₁ + ⋯ + θₙxₙ.
  • hθ is the hypothesis function, using the model parameters θ.
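As a minimal sketch (the parameter values and the feature vector below are made-up numbers, not taken from any real dataset), this is what the prediction looks like in NumPy:

import numpy as np

theta = np.array([4.0, 3.0])   # hypothetical parameters: bias θ0 = 4, feature weight θ1 = 3
x = np.array([1.0, 1.5])       # instance feature vector with x0 = 1 prepended; x1 = 1.5
y_hat = theta.dot(x)           # hθ(x) = θ · x = 4·1 + 3·1.5
print(y_hat)                   # 8.5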

Training the model

Okay, that’s the Linear Regression model. So now how do we train it? Well, training a model means setting its parameters so that the model best fits the training set. For this, we first need a measure of how well (or poorly) the model fits the training data.

The most common performance measure of a regression model is the Root Mean Square Error (RMSE). Therefore, to train a Linear Regression model, you need to find the value of θ that minimizes the RMSE. In practice, it is simpler to minimize the Mean Squared Error (MSE), and it leads to the same result (the value of θ that minimizes the MSE also minimizes its square root).

Let’s understand these terms using a dataset that predicts the median house value in a district. The MSE is computed as:

MSE(X, h) = (1/m) Σᵢ₌₁ᵐ (h(x(i)) − y(i))²

  • MSE(X, h) is the cost function measured on the set of examples using your hypothesis h.
  • m is the number of instances in the dataset you are measuring the MSE on.

— For example, if you are evaluating the MSE on a validation set of 2000 districts, then m = 2000.

  • x(i) is a vector of all the feature values (excluding the label) of the iᵗʰ instance in the dataset, and y(i) is its label (the desired output value for that instance).

— For example, if the first district in the dataset is located at longitude −118.29°, latitude 33.91°, and it has 1,416 inhabitants with a median income of $38,372, and the median house value is $156,400, then:

x(1) = (−118.29, 33.91, 1416, 38372)ᵀ and y(1) = 156,400

X is a matrix containing all the feature values (excluding labels) of all instances in the dataset.

  • There is one row per instance
  • The iᵗʰ row is equal to the transpose of x(i)

h is your system’s prediction function, also called a hypothesis. When your system is given an instance’s feature vector x(i), it outputs a predicted value ŷ(i) = h(x(i)) for that instance.

— For example, if your system predicts that the median housing price in the first district is $158,400, then ŷ(1) = h(x(1)) = 158,400.

— The prediction error for this district is ŷ(1) − y(1) = 158,400 − 156,400 = 2,000.

We use lowercase italic font for scalar values (such as m or y(i)) and function names (such as h), lowercase bold font for vectors (such as x(i)), and uppercase bold font for matrices (such as X).
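To make these definitions concrete, here is a small NumPy sketch that evaluates the MSE and RMSE of a hypothesis on a toy dataset (the values of X, y, and theta below are made-up numbers, not the housing data):

import numpy as np

X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0]])            # one row per instance, with x0 = 1 included
y = np.array([5.0, 7.5, 10.0])        # labels y(1) to y(3)
theta = np.array([1.0, 2.0])          # hypothetical model parameters

y_hat = X.dot(theta)                  # predictions ŷ(i) = h(x(i)) for every instance
errors = y_hat - y                    # prediction errors ŷ(i) − y(i)
m = len(y)                            # number of instances
mse = (errors ** 2).sum() / m         # MSE(X, h) = (1/m) Σ (h(x(i)) − y(i))²
rmse = np.sqrt(mse)                   # RMSE is just its square root
print(mse, rmse)                      # ≈ 0.417 and ≈ 0.645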

Norms

Even though the RMSE is generally the preferred performance measure for regression tasks, in some contexts you may prefer to use another function. For example, suppose that there are many outlier districts. In that case, you may consider using the Mean Absolute Error (MAE):

MAE(X, h) = (1/m) Σᵢ₌₁ᵐ |h(x(i)) − y(i)|

Both the RMSE and the MAE are ways to measure the distance between two vectors: the vector of predictions and the vector of target values. Various distance measures, or norms, are possible:

  • Computing the root of a sum of squares (RMSE) corresponds to the Euclidean norm: it is the notion of distance you are familiar with. It is also called the ℓ2 norm, noted ∥ · ∥₂ (or just ∥ · ∥).

When used for matrices instead of vectors, the Euclidean norm is called the Frobenius norm.

  • Computing the sum of absolutes (MAE) corresponds to the ℓ1 norm, noted ∥ · ∥₁. It is sometimes called the Manhattan norm because it measures the distance between two points in a city if you can only travel along orthogonal city blocks.
(Figure: the green line represents the ℓ1 norm and the blue line the ℓ2 norm between the same two points.)
  • More generally, the ℓp norm of a vector v containing n elements is defined as ∥v∥ₚ = (|v₁|ᵖ + |v₂|ᵖ + ⋯ + |vₙ|ᵖ)^(1/p).
  • ℓ0 just gives the number of non-zero elements in the vector, and ℓ∞ gives the maximum absolute value in the vector.
  • The higher the norm index, the more it focuses on large values and neglects small ones. This is why the RMSE is more sensitive to outliers than the MAE.
  • But when outliers are exponentially rare (like in a bell-shaped curve), the RMSE performs very well and is generally preferred.
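As a small illustration with made-up error values, the sketch below compares the different norms of the same error vector and shows how higher norms emphasize large errors:

import numpy as np

errors = np.array([1.0, -2.0, 0.0, 10.0])    # hypothetical prediction errors, with one outlier (10)

l1 = np.abs(errors).sum()                    # ℓ1 norm (gives the MAE when divided by m)
l2 = np.sqrt((errors ** 2).sum())            # ℓ2 / Euclidean norm (gives the RMSE when divided by √m)
l0 = np.count_nonzero(errors)                # ℓ0: number of non-zero elements
linf = np.abs(errors).max()                  # ℓ∞: maximum absolute value

def lp_norm(v, p):
    # General ℓp norm: (Σ |v_i|^p)^(1/p)
    return (np.abs(v) ** p).sum() ** (1 / p)

print(l1, l2, l0, linf, lp_norm(errors, 3))  # the outlier dominates more and more as p grows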

The normal equation

To find the value of θ that minimizes the cost function, there is a closed-form solution, that is, a mathematical equation that gives the result directly. This is called the Normal Equation:

θ̂ = (XᵀX)⁻¹ Xᵀ y

  • θ-hat is the value of θ that minimizes the cost function.
  • y is the vector of target values containing y(1) to y(m).

Let’s generate some linear-looking data to test this equation:

import numpy as np
X = 2 * np.random.rand(100, 1)            # 100 instances with a single feature x1, uniform in [0, 2)
y = 4 + 3 * X + np.random.randn(100, 1)   # targets: y = 4 + 3*x1 + Gaussian noise

As you can see above, the function we used to generate the data is y = 4 + 3x₁ + Gaussian noise. Keep the values 4 and 3 in mind. Plotted, the dataset is a cloud of points scattered around that line.

Now let’s compute θ using the Normal Equation:

X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)  # Normal Equation: (Xᵀ X)⁻¹ Xᵀ y
theta_best

Let’s see what the equation found:

array([[4.21509616], [2.77011339]])

We would have hoped for θ₀ = 4 and θ₁ = 3 instead of 4.215 and 2.770. Close enough, but the noise made it impossible to recover the exact parameters of the original function.
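As a side note, inverting XᵀX explicitly can be numerically fragile when features are nearly collinear. A sketch of two equivalent, more stable ways to compute the same θ̂ (reusing the X_b and y defined above; this is not part of the original example):

theta_lstsq, residuals, rank, sv = np.linalg.lstsq(X_b, y, rcond=None)  # solves min ‖X_b·θ − y‖₂ via SVD
theta_pinv = np.linalg.pinv(X_b).dot(y)                                 # uses the Moore-Penrose pseudoinverse
# Both should match theta_best up to floating-point precision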

Now you can make predictions using θ-hat:

X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new] # add x0 = 1 to each instance
y_predict = X_new_b.dot(theta_best)
y_predict

which gives:

array([[4.21509616], [9.75532293]])

Let’s plot this model’s predictions:

import matplotlib.pyplot as plt

plt.plot(X_new, y_predict, "r-")   # the model's predictions (red line)
plt.plot(X, y, "b.")               # the training data (blue dots)
plt.axis([0, 2, 0, 15])
plt.show()

Thanks for reading! In part 2 we will discuss the second training method: Gradient Descent, which is one of the most widely used optimization methods in Machine Learning 🎉


Chamuditha Kekulawala

Full-stack development | Machine Learning | Computer Architecture