Multivariate Linear Regression

Swaathi Sundaramurugan
Published in Analytics Vidhya · 7 min read · Oct 17, 2021

A real-world dataset usually has more than one variable or feature. When a regression problem has more than one feature/variable to consider for the outcome, it is called multivariate linear regression.

The hypothesis function for a single variable is,

h_θ(x) = θ_0 + θ_1 x

In the case of n variables, x becomes a column vector containing the n feature values as its elements,

x = [x_1, x_2, …, x_n]ᵀ

For n features, the parameters also become a vector,

θ = [θ_0, θ_1, …, θ_n]ᵀ

The hypothesis function for multiple variables is,

h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + … + θ_n x_n

For convenience, we add a zeroth feature x_0 that is always equal to 1. The hypothesis function then becomes,

h_θ(x) = θ_0 x_0 + θ_1 x_1 + … + θ_n x_n,  with x_0 = 1

In vector form, the hypothesis function can be written as,

h_θ(x) = θᵀ x, where x = [x_0, x_1, …, x_n]ᵀ now includes x_0 = 1

Gradient Descent for Multiple Variables

Gradient descent for multiple variables uses the following update rule,

Repeat until convergence:

θ_j := θ_j − α · (1/m) · Σ_{i=1}^{m} ( h_θ(x^(i)) − y^(i) ) · x_j^(i)

updating every parameter θ_j (for j = 0, 1, …, n) simultaneously.
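Below is a minimal NumPy sketch of this update rule (the function name, the 1/(2m) loss, and the default values are illustrative choices, not code from the course):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for multivariate linear regression.

    X is an (m, n+1) matrix whose first column is all ones (x_0 = 1),
    y is a length-m vector of targets, and alpha is the learning rate.
    Returns the fitted parameters and the loss recorded at each iteration.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    cost_history = []
    for _ in range(num_iters):
        predictions = X @ theta            # h_theta(x) for every example
        errors = predictions - y
        # Simultaneous update of every theta_j using the full batch
        theta -= alpha * (1 / m) * (X.T @ errors)
        cost_history.append((1 / (2 * m)) * np.sum(errors ** 2))
    return theta, cost_history
```

The vectorized update X.T @ errors computes the sum over all m examples for every θ_j at once, which is what makes the update simultaneous.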

How to check if gradient descent is working properly?

The number of iterations gradient descent needs can vary from problem to problem. Some problems reach the global minimum in very few iterations, while others take a large number of iterations.

To check that it is working properly, run gradient descent and plot the value of the loss function against the number of iterations.

If gradient descent is working properly, then the value of the loss function should decrease with every iteration.

If the plot increases with the number of iterations, then gradient descent is not working as expected. We have to choose a smaller learning rate (alpha) in this case.

Sometimes the plot might repeatedly decrease and increase across iterations. This scenario also indicates that the alpha value is too high, and we have to decrease it so that the parameters converge.

To find a good alpha value, we can plot the loss curve for a range of alpha values and pick the value for which the loss function decreases most rapidly while still converging.

Other than this method, we can also use an automatic convergence test to check whether the loss function has converged. A threshold value, say 0.001, is chosen, and if the loss function decreases by less than that threshold in an iteration, we can assume the function has converged. In practice, though, it’s difficult to choose a good threshold value, and the plotting method above works well for judging how gradient descent is doing.
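As a sketch of this diagnostic, the snippet below reuses the gradient_descent helper from the previous sketch on some made-up data and plots one loss curve per candidate learning rate (the alpha values are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data: 100 examples, two features, plus the x_0 = 1 column
rng = np.random.default_rng(0)
features = rng.uniform(0, 1, size=(100, 2))
X = np.column_stack([np.ones(100), features])
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.1, size=100)

# A good alpha gives a curve that decreases steadily toward a flat
# minimum; a bad one produces a curve that increases or oscillates.
for alpha in (0.01, 0.1, 0.5):
    _, cost_history = gradient_descent(X, y, alpha=alpha, num_iters=200)
    plt.plot(cost_history, label=f"alpha = {alpha}")

plt.xlabel("Iteration")
plt.ylabel("Loss J(theta)")
plt.legend()
plt.show()
```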

Feature Scaling

Let’s assume that we have a feature x1 in the range (0–5000) and a feature x2 in the range (1–10). If we run gradient descent on this problem, the cost function would have tall, skinny contours (when represented in a contour plot), and it takes a long time to reach the global minimum from any point on the function.

Dividing each feature’s values by its range (maximum value minus minimum value) is one of the methods of scaling.

Generally, we try to get every feature into a small, comparable range, roughly between −1 and 1 (anything within a few units of that is fine). For example, dividing x1 (range 0–5000) by 5000 and x2 (range 1–10) by 10 brings both features into the range 0–1.

Mean Normalization

One of the methods of scaling, where each x_i is replaced with the difference between x_i and the average value μ_i of that feature, often also divided by the feature’s range,

x_i := (x_i − μ_i) / (max_i − min_i)

This makes the features have approximately zero mean.
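A small NumPy sketch of mean normalization, dividing by the feature range here (dividing by the standard deviation is an equally common variant); the helper name and example values are made up:

```python
import numpy as np

def mean_normalize(X):
    """Scale every feature (column) to roughly [-1, 1] with zero mean.

    Each value becomes (x - mean) / (max - min) for its column. The
    statistics are returned so new examples can be scaled the same way
    at prediction time.
    """
    mu = X.mean(axis=0)
    feature_range = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / feature_range, mu, feature_range

# Example: x1 in (0-5000) and x2 in (1-10) end up on comparable scales
X = np.array([[4500.0, 2.0],
              [1200.0, 9.0],
              [3000.0, 5.0]])
X_scaled, mu, feature_range = mean_normalize(X)
```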

Features and Polynomial Regression

We don’t have to use all the given features. Depending on the problem statement, we can always drop some features or add new features (which can be combinations of other features).

Similarly, we don’t have to fit a straight line to the data, since the actual outputs might not lie on a straight line. We can use polynomial functions to fit the dataset better.

A straight-line fit,

h_θ(x) = θ_0 + θ_1 x_1

We can create new features based on existing features; for example, a new feature can be the product of two existing ones.

As mentioned earlier, we can try different polynomial functions to fit the data better and compare them to see which one fits the best.

A quadratic fit,

h_θ(x) = θ_0 + θ_1 x + θ_2 x²

A cubic fit,

h_θ(x) = θ_0 + θ_1 x + θ_2 x² + θ_3 x³

A square root fit,

h_θ(x) = θ_0 + θ_1 x + θ_2 √x

While applying these polynomial functions, the derived features (x², x³, √x) can span very different ranges, so feature scaling must be done carefully.
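A short sketch of building such polynomial features by hand from a single feature x and then mean-normalizing them (the numbers are made up for illustration):

```python
import numpy as np

x = np.array([50.0, 80.0, 120.0, 200.0])   # a single original feature

# Derived features: quadratic, cubic and square-root terms. Their raw
# ranges differ wildly (about 14 for sqrt(x) versus 8,000,000 for x^3),
# which is exactly why feature scaling matters even more here.
X_poly = np.column_stack([x, x ** 2, x ** 3, np.sqrt(x)])

# Mean-normalize each column so every feature sits on a comparable scale
X_scaled = (X_poly - X_poly.mean(axis=0)) / (
    X_poly.max(axis=0) - X_poly.min(axis=0))
```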

Normal Equation

Gradient descent is one method of finding the parameter values that minimize the loss function. The parameter values change every iteration until the global minimum is identified. The normal equation finds the optimal parameter values analytically, without iterating.

As an analogy, consider a simple quadratic function of one variable. We find its minimum by taking its derivative and setting it equal to zero.

Similarly, with the normal equation, we take the partial derivative of the loss function with respect to every parameter, set each one to zero, and solve for the parameters.

By solving with calculus, we get the minimizing parameter values directly,

θ = (XᵀX)⁻¹ Xᵀ y

where X is the m × (n+1) design matrix whose rows are the training examples (with x_0 = 1 in the first column) and y is the m-dimensional vector of target values.

If we are using the normal equation method, it is not necessary to perform feature scaling to bring the feature values into a small range.
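A minimal NumPy sketch of the normal equation; it uses np.linalg.pinv, the SVD-based pseudo-inverse, which also covers the non-invertible case discussed in the next section (the example matrix and targets are made up):

```python
import numpy as np

def normal_equation(X, y):
    """Solve theta = (X^T X)^-1 X^T y analytically.

    X is the (m, n+1) design matrix with a leading column of ones and
    y is the length-m vector of targets. np.linalg.pinv computes the
    pseudo-inverse via SVD, so this still works when X^T X is singular.
    """
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Example with two features plus the x_0 = 1 column
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 1.0],
              [1.0, 6.0, 5.0],
              [1.0, 8.0, 7.0]])
y = np.array([5.0, 6.0, 11.0, 14.0])
theta = normal_equation(X, y)
```

No learning rate, no iterations, and no feature scaling are needed here, which is the trade-off described in the comparison below.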

While comparing Normal Equation and Gradient Descent,

  • With the normal equation, we don’t have to choose the learning rate alpha or go through a large number of iterations. Hence, it saves a lot of time and complexity.
  • Gradient descent works well even with a very large number of features, while the normal equation becomes slow when n is large: it costs roughly O(n³), because we need to compute the inverse of XᵀX, which is an (n+1) × (n+1) matrix.

In practice, if the number of features exceeds 10,000, we can consider moving to gradient descent.

Normal Equation and Non-invertibility

When we compute the minimum of the loss function by solving for the parameters with the normal equation,

θ = (XᵀX)⁻¹ Xᵀ y

the inverse of XᵀX sometimes does not exist, because the resulting matrix is singular (degenerate). To handle this, we take the pseudo-inverse instead, which can be computed by decomposing the non-invertible matrix with the Singular Value Decomposition (SVD) method.

We can also make the non-invertible matrix invertible by figuring out the reasons behind its singularity and addressing them:

  • Sometimes there are redundant features that make the columns of the matrix linearly dependent; we can remove or change those features (the sketch after this list shows this case).
  • Sometimes there are too many features (the number of training examples is less than or equal to the number of features); we can delete some features in this case.
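As a small illustration of the first cause (with numbers made up purely to trigger the singularity), duplicating a feature makes XᵀX exactly singular, so a plain inverse fails while the SVD-based pseudo-inverse still returns a usable solution:

```python
import numpy as np

# x2 is an exact copy of x1, so the columns of X are linearly dependent
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 4.0, 4.0],
              [1.0, 6.0, 6.0]])
y = np.array([3.0, 5.0, 7.0])

A = X.T @ X
# np.linalg.inv(A) fails here with a "Singular matrix" error
theta = np.linalg.pinv(A) @ X.T @ y   # pseudo-inverse via SVD still works
```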

Note: This article is a part of #30DaysOfData and the contents of the article are my own notes from the Andrew Ng Machine Learning Course.
