Demystifying the Mystical: My Foray into the World of AI
Week 6: Multivariate Linear Regression
In the previous week, we worked through a simple linear regression task. A dependent variable guided by a single independent variable is important but of limited use in real-world scenarios, as it is rare to encounter a dataset with just one input feature.
The image above shows a dataset of housing prices with a dependent variable Y (price in $) and a single independent variable X, the size in square feet. Generally speaking, a dependent variable depends on multiple factors. For example, the price of a house depends on many factors, as shown in the image below: its size, the number of bedrooms, the attached facilities, the number of floors, the distance to the nearest shopping area, the neighborhood it is in, etc. It then becomes increasingly daunting and complex to fit a straight line, for all the features must be taken into consideration. How do we approach such a task with linear regression? That’s where Multivariate Linear Regression comes into play.
What is Multivariate Linear Regression?
Linear regression, as we have seen, is a statistical approach to modelling the relationship between a dependent variable and a given set of independent variables. Multivariate linear regression is very similar to the simple linear regression discussed earlier, but with multiple independent variables contributing to the dependent variable (see the image above). Hence, there are multiple coefficients to be obtained and more complex computation to be done because of the added variables.
If h(x) = θ0 + θ1·x is the hypothesis for a regression task with only one feature, then in multivariate regression, where more features are involved, the hypothesis will contain not just two parameters but several, stretching into the nth dimension depending on the number of features.
In a univariate regression task, the single feature is assigned the variable x; in a multivariate task, the various features are grouped into corresponding variables x1, x2, …, xn, from which the hypothesis can be computed.
With this, the hypothesis ceases to be θ0 + θ1·x as it was in the univariate task and becomes a combination of all the variables, as shown below.
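As a quick sketch in Python (using NumPy, with made-up parameter values), the multivariate hypothesis h(x) = θ0 + θ1·x1 + … + θn·xn is just a dot product once a leading 1 is attached to the feature vector:

```python
import numpy as np

# Hypothetical example with n = 3 features.
theta = np.array([1.0, 0.5, 2.0, -1.0])  # [theta0, theta1, theta2, theta3]
x = np.array([1.0, 2.0, 3.0, 4.0])       # leading 1 pairs with theta0

# h(x) = theta0*1 + theta1*x1 + theta2*x2 + theta3*x3
h = theta @ x
print(h)  # 1.0 + 1.0 + 6.0 - 4.0 = 4.0
```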
In the multivariate regression setting, because of the potentially large number of features, it is impractical to use the conventional approach of univariate linear regression, hence it becomes imperative to devise other means of handling the task. It is more efficient to use matrices to define the regression model and the subsequent analyses. Here, we review basic matrix algebra, as well as learn some of the more important multiple regression formulas in matrix form.
What is a Matrix?
In mathematics, a matrix (plural matrices) is a rectangular array of numbers, symbols, or expressions, arranged in rows and columns. A matrix is almost always denoted by a single capital letter in boldface type. For example, the dimension of the matrix below is 2 × 3 (read “two by three”), because there are two rows and three columns:
The matrix B is a 5 × 3 matrix containing numbers:
And, the matrix X is a 4 × 4 matrix containing the four columns of various x variables:
Definition of a vector and a scalar
A column vector is an r × 1 matrix, that is, a matrix with only one column. A vector is almost always denoted by a single lowercase letter in boldface type. The following vector y is a 4 × 1 column vector containing numbers (the house prices from the data):
A row vector is a 1 × c matrix, that is, a matrix with only one row.
A scalar is just a real number or quantity. It is often used in the context of vectors or matrices, to stress that a variable such as a is just a real number and not a vector or matrix.
Matrix Addition
The sum of two m-by-n matrices A and B, written A + B, is calculated entrywise: (A + B)i,j = Ai,j + Bi,j. There is a restriction here: one can’t just add any two matrices together. Two matrices can be added only if they have the same number of rows and columns. Then, to add two matrices, the corresponding elements of the two matrices are simply added together. That is:
When adding two matrices, the rules below should be put into consideration.
- Add the entry in the first row, the first column of the first matrix with the entry in the first row, the first column of the second matrix.
- Add the entry in the first row, the second column of the first matrix with the entry in the first row, the second column of the second matrix.
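The rule above can be sketched in NumPy (illustrative values; two matrices with matching 2 × 3 dimensions):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[6, 5, 4],
              [3, 2, 1]])

# Entrywise sum: (A + B)[i, j] = A[i, j] + B[i, j]
C = A + B
print(C)  # every entry is 7
```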
Scalar Multiplication
Given a scalar c and a matrix A, the product cA is computed by multiplying every element of A by c. The operation is called scalar multiplication, and the output is always a matrix with the same dimensions as the matrix that was multiplied by the scalar. That is:
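A minimal sketch of scalar multiplication in NumPy (made-up values):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
c = 3  # the scalar

# Every entry of A is multiplied by c; the result keeps A's 2 x 2 shape.
cA = c * A
print(cA)  # [[3, 6], [9, 12]]
```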
Matrix Multiplication
Multiplication of two matrices is defined if and only if the number of columns of the left matrix is the same as the number of rows of the right matrix. If A is an m-by-n matrix and B is an n-by-p matrix, then their matrix product AB is the m-by-p matrix whose entries are given by the dot product of the corresponding row of A and the corresponding column of B:
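A short NumPy sketch of the rule (illustrative values; A is 2 × 3, B is 3 × 2, so the product is 2 × 2):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # 2 x 3
B = np.array([[1, 0],
              [0, 1],
              [1, 1]])      # 3 x 2

# Columns of A (3) match rows of B (3), so AB is defined and is 2 x 2.
# Each entry is the dot product of a row of A with a column of B.
AB = A @ B
print(AB)  # [[4, 5], [10, 11]]
```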
Properties of Matrices
Associativity
Matrix multiplication satisfies the rule (AB)C = A(BC). One can multiply matrix A by matrix B and then multiply the result by matrix C, or multiply B by C first and then multiply A by that result. In other words, the grouping in matrix multiplication can be changed, and either arrangement is bound to arrive at the same answer.
Identity
The n × n identity matrix, denoted by I, is a matrix with n rows and n columns. The entries on the diagonal from the upper left to the bottom right all have the value 1, while all other entries are 0. The multiplicative identity property, as expressed in the image below, states that the product of any n × n matrix A and I is always A, irrespective of the order in which the multiplication is performed. In other words, A⋅I = I⋅A = A.
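The identity property is easy to check in NumPy (made-up values for A):

```python
import numpy as np

A = np.array([[2, 7],
              [1, 8]])
I = np.eye(2, dtype=int)  # 2 x 2 identity: ones on the diagonal, zeros elsewhere

# Multiplying by I in either order leaves A unchanged.
print(np.array_equal(A @ I, A))  # True
print(np.array_equal(I @ A, A))  # True
```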
Inverse of Matrix
The inverse of a matrix A is denoted by A⁻¹ (A raised to the power −1). For a matrix to be invertible, it must be an m × m matrix, that is, a square matrix in which the number of rows is the same as the number of columns. It is defined by the property A·A⁻¹ = A⁻¹·A = I, where I is an identity matrix. When A is multiplied by A⁻¹, the result is the identity matrix I.
Not all square matrices have inverses. A square matrix which has an inverse is called invertible or nonsingular, and a square matrix without an inverse is called noninvertible, degenerate or singular.
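Both cases can be sketched in NumPy (illustrative matrices): an invertible matrix yields the identity when multiplied by its inverse, while a singular matrix (here, one whose second row is a multiple of the first) has determinant 0 and no inverse.

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

A_inv = np.linalg.inv(A)    # raises LinAlgError if A is singular
print(np.round(A @ A_inv))  # the identity matrix, up to rounding

# A singular (noninvertible) matrix: second row = 2 * first row.
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])
print(np.linalg.det(S))     # 0.0, so S has no inverse
```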
Transpose
The superscript T, as seen in the image below, means ‘transpose’. The transpose of a matrix is an operation that flips a matrix over its diagonal; that is, it switches the row and column indices of the matrix, producing another matrix denoted AT. Another way to look at the transpose is that the element at row r, column c in the original is placed at row c, column r of the transpose. If A is an m × n matrix, then the transpose is an n × m matrix. Below, the matrix A is a 2 × 3 matrix, while the new matrix B, its transpose, is 3 × 2.
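In NumPy the transpose is the `.T` attribute; a quick sketch with a 2 × 3 matrix:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])  # 2 x 3

B = A.T                    # 3 x 2: rows and columns swap
print(B.shape)             # (3, 2)

# The element at row r, column c of A lands at row c, column r of B.
print(A[0, 2], B[2, 0])    # 3 3
```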
Application of Matrices to the Multivariate Regression Task
Now, how do we apply this knowledge of matrices to a regression task? In a univariate task there is no real need for matrices, but a multivariate task relies on linear algebra and matrix operations. Take for example the image below: the given dataset contains features that all play a part in predicting the dependent variable, so it is only rational to consider all the features before any reasonable and genuine prediction can be made.
As you can see, a pattern emerges. By taking advantage of this pattern, we can instead formulate the linear regression function in matrix notation, where the hypothesis, as opposed to that of the univariate regression task, is a sum not just of two parameters but of many, up to the nth dimension (that is, the parameters will be θ0, θ1, θ2, θ3, …, θn).
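The matrix formulation can be sketched in NumPy (illustrative housing numbers): each training example becomes a row of X with a leading 1 so that θ0 acts as the intercept, and all m predictions come out of a single product Xθ.

```python
import numpy as np

# Hypothetical design matrix: rows are examples [1, size, bedrooms].
X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1416.0, 2.0],
              [1.0,  852.0, 1.0]])
theta = np.array([10.0, 0.1, 5.0])  # [theta0, theta1, theta2], made up

# All predictions at once: h = X @ theta
h = X @ theta
print(h)  # [235.4, 161.6, 100.2]
```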
Now that there is a slight change to the hypothesis from that of univariate regression, the cost function and the gradient descent update differ too. Here:
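The multivariate cost function and gradient descent update can be sketched as follows (a minimal NumPy sketch with a made-up toy dataset, assuming the standard mean-squared-error cost J(θ) = (1/2m)·Σ(h(x) − y)²):

```python
import numpy as np

def cost(X, y, theta):
    """J(theta) = (1 / (2m)) * sum((X @ theta - y)^2)."""
    m = len(y)
    err = X @ theta - y
    return (err @ err) / (2 * m)

def gradient_descent(X, y, theta, alpha, iters):
    """Simultaneously update every theta_j by alpha times dJ/dtheta_j."""
    m = len(y)
    for _ in range(iters):
        theta = theta - (alpha / m) * (X.T @ (X @ theta - y))
    return theta

# Toy data generated from y = 1 + 2*x1 (leading column of ones for theta0).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
theta = gradient_descent(X, y, np.zeros(2), alpha=0.1, iters=2000)
print(np.round(theta, 2))  # close to [1. 2.]
```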
Feature Scaling
Feature scaling is a method used to normalize the range of independent variables or features of data. The range of all features should be normalized so that each feature contributes approximately proportionately to the final prediction.
Feature scaling is also applied so that gradient descent converges and reaches the global optimum much faster than it would without scaling.
There are many methods for feature scaling in machine learning, including rescaling (min-max normalization), standardization, scaling to unit length, and mean normalization.
Mean Normalization
Mean normalization is one of the methods used in feature scaling: each value has the feature’s mean subtracted and is then divided by the feature’s range. The way it is computed is shown below.
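A minimal sketch of mean normalization, x′ = (x − mean(x)) / (max(x) − min(x)), applied to made-up house sizes:

```python
import numpy as np

def mean_normalize(x):
    """x' = (x - mean) / (max - min); the output is centered on 0."""
    return (x - x.mean()) / (x.max() - x.min())

sizes = np.array([2104.0, 1416.0, 1534.0, 852.0])
scaled = mean_normalize(sizes)
print(scaled)          # values roughly in [-0.5, 0.5]
print(scaled.mean())   # 0, since the mean was subtracted
```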
Codelab
Close your eyes, let’s pray in Hinton’s tongue