Machine Learning — Multivariate Linear Regression
Linear Regression is among the most widely used Machine Learning algorithms. Univariate Linear Regression is its simpler form, while Multivariate Linear Regression is used for more complicated problems.

This is my second paper about Linear Regression; the first one is on Univariate Linear Regression. You can look through it to get background knowledge on Linear Regression, the basics of Machine Learning algorithms, and a better understanding of this paper.
The paper covers the following topics:
- Introduction to Multivariate Linear Regression;
- Hypothesis of the algorithm;
- Manipulation of the dataset and matrix multiplication;
- Cost function;
- Gradient Descent.
Introduction to Multivariate Linear Regression
In ML problems there are various datasets, and they differ from one another in their dimensions (number of rows and columns). As known, columns refer to features, rows refer to samples, and more features mean more complicated models. In Linear Regression, if the number of feature columns is one, the algorithm is called Univariate Linear Regression; if it is more than one, it is called Multivariate Linear Regression (MLR). The key concepts of the Univariate and Multivariate LR algorithms are similar, but there are some differences in the equations, caused by the different dimensions of the datasets.
Hypothesis of the algorithm
In Univariate LR, the equation of a line is used as the hypothesis. The hypothesis of Multivariate LR is similar to that, but it differs in the number of parameters. For simplicity and better understanding, let's start with an example dataset that has three features (x₁, x₂, x₃). In this case the hypothesis is the following:
h(x) = θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₃
For every single feature (x) there is a θ value, plus an extra θ₀. The general form of the equation is:
h(x) = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ
Here,
- n — number of features.
The goal of every ML algorithm is to find the most appropriate hypothesis. In MLR the hypothesis contains x-s and θ-s. The x-s are given in the dataset, so hypotheses differ from one another only in their θ-s. The θ-s are initialized with random values, and then the algorithm starts to optimize them. Our aim is to calculate the most suitable θ-s for the given dataset. The Cost function and Gradient descent are the tools for this, and they are discussed in the next sections.
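To make the hypothesis concrete, here is a minimal Python sketch of it for the three-feature case. The function name and all numeric values are made up for illustration; they are not taken from any real dataset.

```python
# A minimal sketch of the three-feature hypothesis: h(x) = θ0 + θ1*x1 + θ2*x2 + θ3*x3.
# All values below are illustrative placeholders.
def hypothesis(x1, x2, x3, theta0, theta1, theta2, theta3):
    return theta0 + theta1 * x1 + theta2 * x2 + theta3 * x3

# Example call with made-up feature values and θ-s:
print(hypothesis(2.0, 3.0, 5.0, theta0=1.0, theta1=0.5, theta2=-0.2, theta3=0.1))  # 1.9
```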
Matrix multiplication and manipulation of the dataset
If the number of features is large, say hundreds or thousands, the equation above is not practical to use, as we would have to write out every single parameter. That is why a more convenient tool, matrix multiplication, is applied. This may sound complex, but we will go step by step, and you will notice that it is actually simpler than the equation itself.
1. Let's review matrix multiplication (a short NumPy sketch after step 3 shows the same operations in code):
[ a  b  c ]   [ x ]   [ a·x + b·y + c·z ]
[ d  e  f ] · [ y ] = [ d·x + e·y + f·z ]
              [ z ]
This is an example of matrix multiplication: a (2 × 3) matrix multiplied by a (3 × 1) vector gives a (2 × 1) vector. The key point is that the number of columns in the left matrix must be equal to the number of rows in the right matrix.
For further reading about matrix multiplication, you can check additional resources.
2. Now we will represent the x-s and θ-s with matrices. Let's look at the following dataset and theta:
X = [ x₁  x₂  x₃ ]   (one such row of feature values for every sample in the dataset)
The θ-s become a vector (a one-column matrix):
    [ θ₀ ]
θ = [ θ₁ ]
    [ θ₂ ]
    [ θ₃ ]
3. The third step is the multiplication of these matrices.
Looking back at the matrices, the X matrix has 3 columns, but the θ vector has 4 rows, so we cannot multiply them, since 3 ≠ 4. This happens because of the extra parameter θ₀ in the hypothesis equation. In order to be able to use matrix multiplication, an extra column of ones is added to the X matrix. After adding the column of ones, the number of columns of X and the number of rows of θ are equal, and we are able to do the matrix multiplication:
h(x) = [ 1  x₁  x₂  x₃ ] · [ θ₀  θ₁  θ₂  θ₃ ]ᵀ = θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₃
Because of this change in the X matrix, we rewrite the hypothesis as below:
h(x) = θ₀x₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ,   where x₀ = 1
To sum up, we made a very small change to the hypothesis: we added x₀, which is equal to 1 and is multiplied by θ₀. The result of the equation does not change, since θ₀ is multiplied by 1. As mentioned above, x₀ is added to equalize the number of θ-s and x-s so that matrix multiplication can be used.
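Putting the three steps together, here is a minimal NumPy sketch. The two-sample dataset, the θ values, and the variable names are all made up for illustration.

```python
import numpy as np

# Made-up dataset: each row is a sample, each column is one of the three features.
X = np.array([[2.0, 3.0, 5.0],
              [1.0, 4.0, 2.0]])            # shape (2, 3)

theta = np.array([1.0, 0.5, -0.2, 0.1])    # θ0, θ1, θ2, θ3 -> shape (4,)

# Step 3: add the extra column of ones (x0 = 1) so the shapes become compatible.
X_with_bias = np.hstack([np.ones((X.shape[0], 1)), X])   # shape (2, 4)

# Matrix multiplication now computes h(x) for every sample at once.
h = X_with_bias @ theta                    # shape (2,)
print(h)                                   # [1.9 0.9]
```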
Cost function
After being done with the hypothesis and the dataset, we can start the implementation of the algorithm. The Cost function is used to determine how well the hypothesis fits the dataset. For example, we have pairs of feature vectors x and target values y, both given in the training set. We first calculate h(x) from the x values and the θ vector, then compare h(x) with y. In the ideal case h(x) is equal to y, but in most cases it is not possible to make h(x) equal to y for every sample of the dataset. That is why we need to choose the solution that is best for the whole dataset.
The fitness of the hypothesis is measured with the Cost function: the smaller its value, the better the hypothesis.
J(θ) = (1 / 2m) · Σᵢ₌₁ᵐ ( h(x⁽ⁱ⁾) − y⁽ⁱ⁾ )²
Here m is the number of training samples, and x⁽ⁱ⁾ and y⁽ⁱ⁾ are the features and target value of the i-th sample.
As was said, the value of the Cost function should be as small as possible, so from this point our purpose is to minimize it by optimizing the θ-s.
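As a reference, here is a minimal NumPy sketch of this Cost function, reusing the made-up X_with_bias and theta arrays from the previous sketch; the y values are also made up.

```python
def cost(X_with_bias, y, theta):
    # J(θ) = (1 / 2m) * Σ (h(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
    m = len(y)
    errors = X_with_bias @ theta - y       # h(x⁽ⁱ⁾) − y⁽ⁱ⁾ for every sample
    return (errors ** 2).sum() / (2 * m)

y = np.array([2.0, 1.0])                   # made-up target values for the two samples
print(cost(X_with_bias, y, theta))         # 0.005
```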
Gradient descent
The aim of the Gradient descent algorithm is to optimize the θ-s so that the value of the Cost function is minimal. As known from Calculus, the derivative of a function is its rate of change. So in Gradient descent we take the derivative of the Cost function with respect to the relevant θ and multiply it by α. The obtained value is subtracted from the previous value of θ:
θⱼ := θⱼ − α · (1 / m) · Σᵢ₌₁ᵐ ( h(x⁽ⁱ⁾) − y⁽ⁱ⁾ ) · xⱼ⁽ⁱ⁾
Here,
- α — learning rate;
- m — number of samples;
- x⁽ⁱ⁾, y⁽ⁱ⁾ — the features and target value of the i-th sample;
- xⱼ⁽ⁱ⁾ — the value of the j-th feature in the i-th sample.
Gradient descent is applied repeatedly until the value of the Cost function becomes very small.
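Below is a minimal NumPy sketch of this update rule, continuing the made-up arrays from the previous sketches. The learning rate and the number of iterations are arbitrary illustrative choices; in practice they have to be tuned.

```python
alpha = 0.01        # learning rate α (arbitrary illustrative value)
num_iters = 1000    # number of update steps (arbitrary illustrative value)
m = len(y)

for _ in range(num_iters):
    errors = X_with_bias @ theta - y           # h(x⁽ⁱ⁾) − y⁽ⁱ⁾ for every sample
    gradient = (X_with_bias.T @ errors) / m    # partial derivatives of J(θ) w.r.t. every θⱼ
    theta = theta - alpha * gradient           # update all θ-s simultaneously

print(theta, cost(X_with_bias, y, theta))      # θ-s after training and the final cost
```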
The next section will be about the implementation of Linear Regression on an ML problem.