Machine Learning — Multivariate Linear Regression

Anar Abiyev
Oct 24, 2020 · 5 min read

Linear Regression is among the most widely used Machine Learning algorithms. Univariate Linear Regression is its simplest form, while Multivariate Linear Regression handles more complex problems.


This is my second article about Linear Regression; the first one is on Univariate Linear Regression. You can read it for background on Linear Regression, the basics of Machine Learning algorithms, and a better understanding of this article.

The article covers the following topics:

  • Introduction to Multivariate Linear Regression;
  • Hypothesis of the algorithm;
  • Matrix multiplication and manipulation of the dataset;
  • Cost function;
  • Gradient descent.

Introduction to Multivariate Linear Regression

In ML problems, datasets differ from one another in their dimensions (number of rows and columns). Columns refer to features, rows refer to samples, and more features mean a more complicated model. In Linear Regression, if the number of feature columns is one, the algorithm is called Univariate Linear Regression; if it is more than one, it is called Multivariate Linear Regression (MLR). The key concepts of the Univariate and Multivariate LR algorithms are similar, but there are some differences in the equations, caused by the different dimensions of the datasets.

Hypothesis of the algorithm

In Univariate LR, the equation of a line is used as the hypothesis. The hypothesis of Multivariate LR is similar, but differs in the number of parameters. For simplicity and better understanding, let's start with an example dataset with three features (x₁, x₂, x₃). In this case the hypothesis is:

    h(x) = θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₃

For every single feature (x) there is a θ value, plus an extra θ₀. The general form of the equation is:

    h(x) = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ

Here,

  • n — number of features.

The goal of every ML algorithm is to find the most appropriate hypothesis. In MLR, the hypothesis contains x-s and θ-s. The x-s are given in the dataset, so hypotheses differ from one another only in their θ-s. The θ-s are initialized with random values, and then the algorithm starts to optimize them. Our aim is to calculate the most suitable θ-s for the given dataset. The Cost function and Gradient descent are the tools for this, and they are discussed in the next sections.

Matrix multiplication and manipulation of the dataset

If the number of features is large (hundreds or thousands), the equation above is impractical, as we would have to write out every single parameter. That is why a more suitable approach, matrix multiplication, is applied. This may sound complex, but we will proceed step by step, and you will see that it is actually simpler than the long equation itself.

  1. Let's review matrix multiplication. For example:

    ⎡1 2⎤   ⎡5⎤   ⎡1·5 + 2·6⎤   ⎡17⎤
    ⎣3 4⎦ · ⎣6⎦ = ⎣3·5 + 4·6⎦ = ⎣39⎦

The key point is that the number of columns in the left matrix must equal the number of rows in the right matrix.

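This shape rule is easy to check in code. A minimal sketch with NumPy (the matrices here are made-up examples, not from the article):

```python
import numpy as np

# A is 2x3 and B is 3x1: columns of A (3) equal rows of B (3),
# so the product A @ B is defined and has shape (2, 1).
A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[1],
              [0],
              [2]])

C = A @ B
print(C)  # [[ 7] [16]]
```

Trying `B @ A` instead would raise an error, because the shapes (3, 1) and (2, 3) do not line up.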

2. Now we will represent the x-s and θ-s as matrices. Let's look at the following dataset and θ vector:

The X matrix (each row is a sample, each column is a feature):

    ⎡ x₁⁽¹⁾  x₂⁽¹⁾  x₃⁽¹⁾ ⎤
    ⎢ x₁⁽²⁾  x₂⁽²⁾  x₃⁽²⁾ ⎥
    ⎣   ⋮      ⋮      ⋮   ⎦

The θ-s form a vector (a one-column matrix):

    θ = [θ₀, θ₁, θ₂, θ₃]ᵀ  (a 4 × 1 vector)

3. The third step is the multiplication of these matrices.

Looking back at the matrices, the X matrix has 3 columns, but the θ vector has 4 rows, so we cannot multiply them, as 3 ≠ 4. This happens because of the extra parameter θ₀ in the equation of the hypothesis. In order to use matrix multiplication, an extra column of ones is added to the X matrix. After adding the column of ones, the number of columns of X equals the number of rows of θ, and we are able to do the multiplication:

    h = X · θ,  where X is now m × 4 (with the column of ones) and θ is 4 × 1

Because of the change in the X matrix, we rewrite the hypothesis as below:

    h(x) = θ₀x₀ + θ₁x₁ + θ₂x₂ + θ₃x₃,  where x₀ = 1
Hypothesis.

To sum up, we made a very small change in the hypothesis: we added x₀, which is equal to 1 and is multiplied by θ₀. The result of the equation doesn't change, since θ₀ is multiplied by 1. As mentioned above, it is added to equalize the number of θ-s and x-s so that matrix multiplication is possible.
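The whole manipulation can be sketched in a few lines of NumPy. The feature values and θ-s below are made-up numbers for illustration:

```python
import numpy as np

# Toy dataset: 4 samples, 3 features (x1, x2, x3).
X = np.array([[2.0, 1.0, 3.0],
              [1.5, 0.5, 2.0],
              [3.0, 2.0, 1.0],
              [0.5, 1.5, 2.5]])

# Add the column of ones (x0 = 1), so X becomes 4 x 4
# and matches the four thetas (theta0..theta3).
X = np.hstack([np.ones((X.shape[0], 1)), X])

theta = np.array([0.5, 1.0, -1.0, 2.0])  # arbitrary theta values

h = X @ theta  # hypothesis for every sample in one multiplication
print(h)       # one prediction per sample, shape (4,)
```

A single matrix product replaces writing out θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₃ for every sample by hand, which is exactly why the column of ones is worth adding.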

Cost function

Having covered the hypothesis and the dataset, we can start implementing the algorithm. The Cost function is used to determine how well the hypothesis fits the dataset. For example, take a pair of an x vector and a y value, both given in the training set. We first calculate h(x) from the x vector and the θ vector, then compare h(x) with y. In the ideal case, h(x) equals y, but in most cases it is not possible to make h(x) equal to y for every sample of the dataset. That is why we need to choose the most optimal solution for the whole dataset.

The fitness of the hypothesis is measured with the Cost function: the smaller its value, the better the hypothesis.

    J(θ) = (1 / 2m) · Σᵢ₌₁ᵐ (h(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
Cost function (m is the number of samples).

As was said, the value of the cost function should be minimal, so from this point our purpose is to minimize it by optimizing the θ-s.

Gradient descent

The Gradient descent algorithm's aim is to optimize the θ-s so that the value of the cost function is minimal. As known from Calculus, the derivative of a function gives its rate of change. So in Gradient descent, we take the derivative of the Cost function with respect to the relevant θ and multiply it by α. The acquired value is subtracted from the previous value of θ:

    θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ = θⱼ − α · (1 / m) · Σᵢ₌₁ᵐ (h(x⁽ⁱ⁾) − y⁽ⁱ⁾) · xⱼ⁽ⁱ⁾
Gradient descent (the update is applied simultaneously for every j).

Here,

  • α — learning rate;

Gradient descent is applied repeatedly until the value of the cost function becomes very small or stops decreasing.
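Putting the pieces together, here is a sketch of batch gradient descent for linear regression in NumPy. The data is synthetic (generated from y = 1 + 2x), and a fixed iteration count stands in for a convergence check:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Batch gradient descent:
    theta_j := theta_j - alpha * (1/m) * sum((h(x_i) - y_i) * x_ij),
    updated simultaneously for all j via one vectorized step.
    """
    m, n = X.shape
    theta = np.zeros(n)  # thetas can also be initialized randomly
    for _ in range(iterations):
        gradient = X.T @ (X @ theta - y) / m
        theta -= alpha * gradient
    return theta

# Data generated from y = 1 + 2*x, so the ideal thetas are [1, 2].
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # add the column of ones
y = 1 + 2 * x

theta = gradient_descent(X, y)
print(theta)  # close to [1, 2]
```

The expression `X.T @ (X @ theta - y) / m` computes the whole gradient vector at once, so every θⱼ is updated simultaneously, as the update rule requires.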

The next section will be about the implementation of Linear Regression on a ML problem.

Thank you.


Anar Abiyev

Written by

Process Automation Engineering Student, Machine Learning Learner

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com
