Understanding Linear Regression

Okosa David
9 min read · Jul 15, 2024


ML is easier than you thought.

Machine learning can be a bit difficult, especially for beginners learning without guidance, but grasping the concept of linear regression can be the first step toward mastery.

Regression, explained in its most basic terms, is simply the relationship between a target value and one or more independent values. Seems easy enough, doesn’t it?

This simple definition of regression holds across subjects, mathematics, statistics and so on; the term may grow in complexity, but at its core regression is the same. Machine learning algorithms can be grouped based on a plethora of features, but in this article we’ll focus on whether or not they are trained using human supervision. On this basis, algorithms fall into four classes:

  1. Supervised Learning
  2. Unsupervised Learning
  3. Semi-supervised Learning and lastly
  4. Reinforcement Learning

Now, since linear regression is a supervised learning algorithm, we’ll base our study on this segment.

What is Supervised Learning?

Let’s look at this from our definition of regression. In supervised learning, the data (the independent values) fed into the proposed algorithm comes with labels (the target values). Supervised learning is divided into classification and regression. In classification algorithms, the purpose is to predict labels from a set of discrete values; this simply means the target values are selected from a set of preassigned values or categories, for example {yes/no} or {cat/dog/horse}. Easy enough.

Use cases of Classification

  1. Email spam/ham classification
  2. Handwritten Digit Recognition
  3. Whether or not a customer will purchase an item

In regression, however, the labels are continuous values drawn from the set of real numbers ℝ (any value such as 2.7, -0.15 or 145,000), which means the set of possible outputs is not preassigned.

Use cases of Regression

  1. House price Prediction
  2. Stock price prediction
  3. Market sales prediction

Since our goal is linear regression, we will focus solely on regression.

Types of Regression

There are various kinds of regression techniques. Some of them are listed below:

  1. Linear regression
  2. Logistic regression
  3. Polynomial regression
  4. Support vector regression
  5. Lasso regression
  6. Ridge regression
  7. Elastic net regression
  8. Decision tree regression
  9. Random forest regression

Our main focus is linear regression, so let’s dive in.

LINEAR REGRESSION

As the name implies, linear regression focuses on a linear relationship between the target value and one or more independent variables. By “a linear relationship” I mean finding a linear equation that best relates the target and independent values. This linear equation is also known as the best-fit line equation.

[Figure: graphical representation of the best-fit line of a linear model]

Understood? Grasping this simple concept is crucial, as linear regression is a baseline model for many other regression techniques.

Types of Linear Regression

The forms of linear regression are distinguished by the number of independent variables involved. In this section we start to work with simple mathematical formulas; don’t worry, they are easy to follow with a basic knowledge of linear algebra.

  1. Simple linear regression: This is a one-to-one relationship between the independent and target values. Simply put, each instance of the target value is linked to a single instance of an independent value. For this we have the simple formula for linear regression:
ŷ = β0 + β1x1 
where ŷ is the predicted value
β0 is the intercept (a constant)
β1 is the slope/coefficient/weight of the line
x1 is the independent value/variable

The coefficient (β1) of the best-fit line is not a fixed constant; it is estimated from the observed values of x and y (from the graph). The estimated coefficient is then used to predict the target value for a new independent value, as in the sketch below. This concept is very important for the next section, so remember it.
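Here is a minimal NumPy sketch of that idea, fitting the slope and intercept of a simple linear regression with the closed-form least-squares estimates. The house sizes and prices are made-up numbers purely for illustration:

```python
import numpy as np

# Hypothetical data: house sizes (sq ft) and prices -- made-up numbers for illustration
x = np.array([850.0, 1200.0, 1500.0, 2100.0])
y = np.array([120000.0, 165000.0, 200000.0, 275000.0])

# Closed-form least-squares estimates for one feature:
#   slope     beta1 = cov(x, y) / var(x)
#   intercept beta0 = mean(y) - beta1 * mean(x)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# Predict the price of a new house from the fitted line
new_size = 1750.0
y_hat = beta0 + beta1 * new_size
print(f"beta0={beta0:.2f}, beta1={beta1:.2f}, prediction={y_hat:.2f}")
```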

2. Multiple linear regression: Multiple linear regression links each instance of the target value to multiple independent values (features). The number of coefficients/slopes depends on the number of features; therefore, in the formula for multiple linear regression there are multiple coefficients. Here is the formula for multiple linear regression:

ŷ = β0 + β1x1 + β2x2 + … + βnxn
where ŷ is the predicted value
β0 is the intercept (a constant)
β1, β2, …, βn are the slopes/coefficients/weights of the line
x1, x2, …, xn are the independent values
n is the number of features in a dataset

Note that when there is a single target variable we call this univariate regression, and when there are multiple (two or more) target variables we call it multivariate regression. A small prediction sketch for the multiple regression formula follows below.
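This is a minimal sketch of how a prediction is computed once the coefficients are known; the coefficient values and the house features are hypothetical:

```python
import numpy as np

# Assumed coefficients for a two-feature model (made-up values for illustration):
#   price = beta0 + beta1 * size + beta2 * bedrooms
beta = np.array([50000.0, 90.0, 10000.0])   # [beta0, beta1, beta2]

# One new house: intercept term x0 = 1, size in sq ft, number of bedrooms
x_new = np.array([1.0, 1400.0, 3.0])

# The prediction is just the dot product of the feature vector and the coefficients
y_hat = x_new @ beta
print(y_hat)   # 50000 + 90*1400 + 10000*3 = 206000.0
```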

Core of Linear Regression

In this section, we delve deep into the mathematical reasoning behind linear regression. I suggest having a well-grounded understanding of the following topics:

  • Matrices, vectors and their properties
  • Determinants
  • Scalar and vector multiplication
  • Dot products of matrices and vectors
  • Inverse of matrices
  • Transpose of matrices

With these topics understood, it’s time to dive into the mathematical concepts behind linear regression.

We will recreate the inner workings of a linear regression model used for a house price prediction task. Let us assume a fictitious dataset with 4 rows and 4 columns (we disregard the column titles of the dataset):

[Table: house prices and their features]

Where:

  • x0 = 1 is the intercept term
  • x1 is the house size in square feet
  • x2 is the number of bedrooms for each house
  • y-values are the target variables

Let us begin:

Each column contains one kind of information about the houses. In machine learning, the columns of a dataset are regarded as features, as each one contains a single feature; for example, x2 contains the number of bedrooms for each house.

Each row, however, contains all the information about one house. Rows are regarded as feature vectors, as each one contains every feature of a single house. This concept will be useful many times, so hold onto it. The sketch below shows how such a matrix of feature vectors can be assembled.
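A minimal NumPy sketch of assembling the matrix described above; since the original table is shown as an image, the numbers here are stand-ins chosen purely for illustration:

```python
import numpy as np

# A made-up version of the house dataset described above (the original table
# was shown as an image): one entry per house for each feature.
sizes    = np.array([2104.0, 1416.0, 1534.0, 852.0])          # x1: size in sq ft
bedrooms = np.array([3.0, 2.0, 3.0, 2.0])                      # x2: number of bedrooms
prices   = np.array([400000.0, 232000.0, 315000.0, 178000.0])  # y: target values

# Build the matrix X: a leading column of ones for the intercept term x0,
# then one column per feature. Each row is the feature vector of one house.
X = np.column_stack([np.ones(len(sizes)), sizes, bedrooms])
print(X.shape)   # (4, 3): 4 feature vectors (rows), 3 columns (x0, x1, x2)
```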

Recall the formula for linear regression:

ŷ = β0 + β1x1 + β2x2 + … + βnxn
where ŷ is the predicted value
β0 is the intercept (a constant)
β1, β2, …, βn are the slopes/coefficients/weights of the line
x1, x2, …, xn are the independent values
n is the number of features in a dataset

In matrix form this is written as:

ŷ = Xβ
where:
X is the matrix of data values from the dataset, excluding the target values (and including the leading column of ones for the intercept term),
β is a column vector containing the coefficients β0, β1, …, βn,
ŷ is a column vector containing the predicted values.

Now we have to compute the values of β. This is the formula (known as the normal equation) for finding the coefficient vector β:

β = (Xᵀ⋅X)⁻¹ ⋅ Xᵀ⋅y
where:
X is the matrix of independent values,
Xᵀ is the transpose of the matrix X,
(Xᵀ⋅X)⁻¹ is the inverse of the product of the two matrices,
y is the column vector of actual target values,
⋅ denotes the dot (matrix) product.

This is the reason why I suggested having prerequisite knowledge of the topics listed above. We will now compute the values of Xᵀ and Xᵀ⋅X; here we go.

Now Xᵀ⋅X:

We also need the value of Xᵀ⋅y, therefore:

Solving for the value of (Xᵀ⋅X)⁻¹

OK, we now have all the values needed to solve for the coefficient vector β. Let’s do it, shall we?

Therefore the solution for the coefficient vector β is:

This vector can then be used to compute predicted values of y. We represent these predicted values as ‘ŷ’, pronounced ‘y-hat’, collected into a vector. Let’s test our solution, as in the sketch below:
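Since the figures with the original numbers are not reproduced here, this is a minimal NumPy sketch of the whole procedure using the same made-up house data as the earlier sketch; the values and results are purely illustrative:

```python
import numpy as np

# Illustrative stand-in for the dataset described above:
# x0 = 1 (intercept term), x1 = size in sq ft, x2 = bedrooms; y = price.
X = np.array([
    [1.0, 2104.0, 3.0],
    [1.0, 1416.0, 2.0],
    [1.0, 1534.0, 3.0],
    [1.0,  852.0, 2.0],
])
y = np.array([400000.0, 232000.0, 315000.0, 178000.0])

# Normal equation: beta = (X^T X)^(-1) X^T y
beta = np.linalg.inv(X.T @ X) @ X.T @ y

# Predict on the same rows and compare with the actual targets
y_hat = X @ beta
print("beta  =", np.round(beta, 2))
print("y_hat =", np.round(y_hat, 2))
print("y     =", y)
```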

Huh? But wait, the values of the new vector ŷ don’t match our original vector y, even though we used the exact same values as our matrix X. Was our solution for β incorrect? Don’t worry, nothing was computed or calculated wrongly; this is a common situation faced not just in regression but in all machine learning techniques.

There is always a difference between the predicted and the actual values, and this difference is measured by the loss function.

LOSS FUNCTION AND COST FUNCTION

From our definition above, the loss function measures the difference between the predicted and the actual value of a single data point in a dataset. That is, the loss function equals:

loss function = (ŷ - y)
where:
ŷ is the predicted value,
y is the actual value.

The loss function is useful for calculating the error associated with a prediction; it measures how well a model’s predictions match the actual data. A minimal sketch follows.
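A tiny NumPy sketch of computing this per-point difference (the residual); the predictions here are made-up numbers:

```python
import numpy as np

# Per-point loss: the difference between each prediction and the actual value
y      = np.array([400000.0, 232000.0, 315000.0, 178000.0])   # actual targets
y_hat  = np.array([398000.0, 239000.0, 311000.0, 181000.0])   # made-up predictions
residuals = y_hat - y
print(residuals)   # [-2000.  7000. -4000.  3000.]
```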

Cost Function

The cost function, also known as the objective function, is, simply put, a mathematical function that aggregates the loss across the dataset. So instead of measuring the difference between the predicted and real values of a single data point, we measure this difference across the whole dataset.

Types of cost functions

Cost functions differ depending on the type of machine learning technique and task. Here are some notable examples of cost functions:

  1. Regression Cost Functions
  2. Classification Cost Functions
  3. Binary Classification Cost Functions
  4. Regularization Cost Functions
  5. Specialized Cost Functions

These are some of the main types of cost functions used in machine learning; the choice of cost function typically depends on the specific problem and the type of model being applied. In this article, however, we will focus only on regression cost functions.

Regression Cost Functions

Three common regression cost functions are:

  1. Mean Squared Error (MSE)
  2. Mean Absolute Error (MAE)
  3. Huber Loss

Let’s go through each of them:

1. Mean Squared Error (MSE)

MSE measures the average of the squares of the errors between the predicted and actual values. It is widely used in linear regression. The reason we square the differences is to remove negative values, which would otherwise cancel out positive ones and give a misleading measure of the error.

The formula for MSE is given as:

J(θ) = (1/m) Σ (ŷᵢ - yᵢ)²

Where:

J(θ) is the MSE

m is the number of instances

ŷᵢ is the predicted value for instance i

yᵢ is the actual value for instance i
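Here is a minimal NumPy sketch of MSE; the example values are made up:

```python
import numpy as np

def mse(y_hat, y):
    """Mean Squared Error: average of the squared differences."""
    return np.mean((y_hat - y) ** 2)

y     = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5,  0.0, 2.0, 8.0])
print(mse(y_hat, y))   # (0.25 + 0.25 + 0.0 + 1.0) / 4 = 0.375
```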

2. Mean Absolute Error (MAE)

MAE is the average of the absolute differences between the predicted values and the actual values across the dataset. As with MSE, the absolute value is used so that negative differences do not cancel out positive ones.

The formula for MAE is given as:

MAE = (1/m) Σ |ŷᵢ - yᵢ|
where m is the number of instances, ŷᵢ is the predicted value and yᵢ is the actual value for instance i.
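A minimal NumPy sketch of MAE, using the same made-up values as the MSE example:

```python
import numpy as np

def mae(y_hat, y):
    """Mean Absolute Error: average of the absolute differences."""
    return np.mean(np.abs(y_hat - y))

y     = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5,  0.0, 2.0, 8.0])
print(mae(y_hat, y))   # (0.5 + 0.5 + 0.0 + 1.0) / 4 = 0.5
```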

3. Huber Loss

Huber loss combines the MSE for small differences and the MAE for larger differences (by differences, I mean the differences between the predicted and actual values), which makes it more robust to outliers. Outliers are extreme values in a dataset. This simple image should make the concept easier to understand.

[Image: the black-colored ball is an outlier. Sourced from Adobe Stock]
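A minimal NumPy sketch of the Huber loss; the threshold δ (delta) that decides where the squared branch ends and the linear branch begins, and the example values, are assumptions chosen for illustration:

```python
import numpy as np

def huber_loss(y_hat, y, delta=1.0):
    """Huber loss: squared error for small residuals, linear (MAE-like) error for large ones."""
    residual = y_hat - y
    small = np.abs(residual) <= delta
    squared = 0.5 * residual ** 2                        # MSE-like branch for small residuals
    linear  = delta * (np.abs(residual) - 0.5 * delta)   # MAE-like branch for large residuals
    return np.mean(np.where(small, squared, linear))

y     = np.array([3.0, -0.5, 2.0, 7.0, 100.0])   # last point is an outlier
y_hat = np.array([2.5,  0.0, 2.0, 8.0, 10.0])
print(huber_loss(y_hat, y, delta=1.0))
```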

The goal of any machine learning model is to minimize the cost function: the smaller the cost, the more accurate the model. The cost function therefore provides a quantitative measure of how well a model will perform.

Conclusion

Although reading this will give you a good idea of linear regression, it shouldn’t end here. Gather as much information about linear regression as possible, because continuous learning is the only true path to mastery in machine learning, as in any other subject.

Happy Learning!!
