Gentle Introduction to Multiple Linear Regression
Detailed mathematical derivation and Python implementation of multiple linear regression
Introduction
When we analyze real-world data, simple linear regression is often not enough because the data generally has multiple variables, so we need multiple linear regression. In Python, we can quickly implement and run it using libraries such as scikit-learn. However, the derivation of multiple linear regression is hard to follow because it is full of linear algebra. So, in this blog, I will first give you an overview of multiple linear regression using Python. After we review the concept, I will gently introduce the mathematical details of multiple linear regression.
Table of Contents
1. Overview of Multiple Linear Regression
Let’s dive into the world of multiple linear regression. As the name suggests, it’s all about using multi-dimensional data (multiple columns) to predict a one-dimensional outcome, whereas simple linear regression uses single-dimensional data (one column) for the same purpose. The variables we use for prediction are called the independent variables, while the outcome is the dependent variable. The key difference between the two methods is the number of independent variables. To help you visualize this, take a look at the diagram below.
In a mathematical formula, simple and multiple linear regression can be described as follows:
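In standard notation, with p independent variables, the two models look like:

```latex
% Simple linear regression: one independent variable
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i

% Multiple linear regression: p independent variables
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i
```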
Algebra can sometimes be challenging for beginners, so let’s implement multiple linear regression with a real-world dataset and Python. We will use the student performance dataset [1]. Here are the first five rows of this dataset.
This dataset contains five independent variables (”Hours Studied”, “Previous Scores”, “Extracurricular Activities”, “Sleep Hours”, and “Sample Question Papers Practiced”) and one dependent variable (”Performance Index”). In this blog, I will not discuss EDA, so I will use all of the independent variables. We can express this dataset setting as a mathematical formula as follows.
Now, we need to find the coefficients from the given data. We can easily find them using the scikit-learn or statsmodels libraries. I will show you both ways of solving multiple linear regression.
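As a minimal sketch of the scikit-learn route (using synthetic data in place of the actual dataset, with coefficients chosen to resemble the fitted values discussed below):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the student performance data:
# 5 independent variables and one dependent variable with known coefficients.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
true_coef = np.array([2.85, 1.02, 0.60, 0.48, 0.19])
y_train = -34.0 + X_train @ true_coef + rng.normal(scale=0.1, size=200)

model = LinearRegression()  # fits the intercept by default
model.fit(X_train, y_train)
print(model.intercept_)     # close to -34.0
print(model.coef_)          # close to true_coef
```

With statsmodels, the same fit would be `sm.OLS(y_train, sm.add_constant(X_train)).fit()`, and calling `.summary()` on the result prints the statistics mentioned below.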
As you can see, Python libraries let us implement multiple linear regression in only a few lines of code. In my opinion, statsmodels is more convenient because it provides a summary of statistics such as the p-value, t-value, and standard error. We can gauge the importance of each independent variable from its coefficient value: the larger the value, the more it tends to affect the dependent variable. In this case, “Hours Studied” and “Previous Scores” have a more significant influence on the “Performance Index” than the other variables. However, multiple linear regression sometimes produces misleading results because of multicollinearity. To understand why, we need to understand the mathematics behind multiple linear regression, which I will explain later.
Now that we have an intuition for multiple linear regression, how does the algorithm estimate the coefficients? There are two ways to derive them analytically: the least-squares method and maximum likelihood estimation. Let’s dive into them in the following sections.
2. Parameter Estimation via the Least-Squares Method
We can estimate the parameters of multiple linear regression using the least-squares method, just as in simple linear regression. Since the data for multiple linear regression has many independent variables, it is convenient to use linear algebra. However, this notation can be difficult for beginners, so I will write everything down in matrix form step by step. As a first step, let’s describe the multiple linear regression formula in matrix form. We assume the settings below.
You can check the dimension of each variable in the multiple linear regression formula. Using the above settings, we can write the multiple linear regression equation as shown below.
It is good practice to keep each matrix’s dimensions in mind to get used to the linear algebra notation.
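In matrix notation, the model can be written compactly as:

```latex
\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon},
\qquad
\mathbf{y} \in \mathbb{R}^{n}, \;
X \in \mathbb{R}^{n \times (p+1)}, \;
\boldsymbol{\beta} \in \mathbb{R}^{p+1}, \;
\boldsymbol{\varepsilon} \in \mathbb{R}^{n}
```

where the first column of X is all ones, accounting for the intercept β₀.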
Next, the least-squares method aims to minimize the difference between the predicted and observed values. I will use the hat symbol for the predicted parameters and values to distinguish them from the observed data. We can describe the least-squares estimation as follows:
I will introduce the necessary linear algebra formulas in the appendix. Since we want to minimize the least-squares loss function, we take its derivative with respect to the parameters.
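Setting the gradient of the squared-error loss to zero yields the normal equations and the least-squares estimator:

```latex
L(\boldsymbol{\beta})
  = \|\mathbf{y} - X\boldsymbol{\beta}\|^{2}
  = (\mathbf{y} - X\boldsymbol{\beta})^{\top}(\mathbf{y} - X\boldsymbol{\beta})

\frac{\partial L}{\partial \boldsymbol{\beta}}
  = -2X^{\top}\mathbf{y} + 2X^{\top}X\boldsymbol{\beta} = \mathbf{0}
\;\Longrightarrow\;
\hat{\boldsymbol{\beta}} = (X^{\top}X)^{-1}X^{\top}\mathbf{y}
```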
So, if XᵗX has an inverse matrix, we can obtain the above parameter estimate. Thus, the determinant of XᵗX must be nonzero, which means all of X’s columns must be linearly independent. Let’s check that we obtain the same values as in the previous example.
import numpy as np

# Prepend a column of ones to X_train for the intercept term
X_ = np.hstack((np.ones((len(X_train), 1)), X_train))
A = X_.T @ X_
# Normal equations: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(A) @ X_.T @ y_train
print(beta_hat)
# Expected output (matching the library results above):
# [-34.02388917   2.85188031   1.01834736   0.60021182   0.47576013
#    0.19210505]
As you can see, we get the same values. For your information, what happens when XᵗX is a singular matrix? To quickly create a linearly dependent design matrix, I add a new column equal to twice the “Previous Scores” column.
# Add a column that is exactly twice the "Previous Scores" column
dummy = (2 * X_train[:, 1]).reshape(len(X_train), 1)
tmp = np.hstack((dummy, X_))
tmp_A = tmp.T @ tmp
inv_A = np.linalg.inv(tmp_A)
print(inv_A)
# np.linalg.inv raises "LinAlgError: Singular matrix"
As the above code shows, we cannot compute the parameters when XᵗX is a singular matrix.
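For completeness, a small sketch (again with synthetic data) of one way around this: `np.linalg.lstsq` solves the least-squares problem via the SVD, so it still returns a (minimum-norm) solution even when XᵗX is singular:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, 2.0])

# Duplicate the first column so that X^T X is singular
X_sing = np.hstack([X, X[:, [0]]])

# lstsq uses the SVD and returns the minimum-norm solution:
# the weight on the duplicated column is split equally
beta, *_ = np.linalg.lstsq(X_sing, y, rcond=None)
print(beta)  # approximately [0.5, 2.0, 0.5]
```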
In the next section, we will derive the parameters of multiple linear regression via the maximum likelihood estimation.
3. Parameter Estimation via Maximum Likelihood Estimation
In this section, we derive the parameters of multiple linear regression using maximum likelihood estimation. First, we assume the error term follows the multivariate Gaussian distribution Nₙ with mean vector 0 and covariance matrix 𝜎²Iₙ.
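The density of the n-dimensional Gaussian distribution Nₙ(μ, Σ) is:

```latex
f(\mathbf{x})
= (2\pi)^{-n/2}\,|\Sigma|^{-1/2}
  \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)
```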
When we substitute ε for x, 𝜎²Iₙ for 𝚺, and 0 for µ in the above equation, the likelihood function can be described as follows:
In the last equation, we take the logarithm to make it easier to work with. Since we want to maximize the log-likelihood function, we take its derivative with respect to the parameter β.
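Writing ε = y − Xβ, the log-likelihood and its maximizer are:

```latex
\ell(\boldsymbol{\beta}, \sigma^2)
= -\frac{n}{2}\log(2\pi\sigma^2)
  - \frac{1}{2\sigma^2}(\mathbf{y}-X\boldsymbol{\beta})^{\top}(\mathbf{y}-X\boldsymbol{\beta})

\frac{\partial \ell}{\partial \boldsymbol{\beta}}
= \frac{1}{\sigma^2}\,X^{\top}(\mathbf{y}-X\boldsymbol{\beta}) = \mathbf{0}
\;\Longrightarrow\;
\hat{\boldsymbol{\beta}} = (X^{\top}X)^{-1}X^{\top}\mathbf{y}
```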
This result is the same as that of the least-squares method. So, when we derive the parameters using the least-squares method, we implicitly assume that the error term follows a multivariate Gaussian distribution.
4. Multicollinearity
In the last section, we will explore a critical phenomenon of multiple linear regression: multicollinearity. Multicollinearity occurs when two or more independent variables are (nearly) linearly dependent, e.g., in the cases below.
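For example, one independent variable may be an exact or near-exact linear combination of the others:

```latex
x_2 = 2x_1,
\qquad
x_3 = x_1 + x_2,
\qquad
x_3 \approx 0.9\,x_1 + 1.1\,x_2
```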
If our data has multicollinearity, what happens? Let’s see its effects. For simplicity, we consider a multiple linear regression with two parameters. Moreover, the independent variables x are normalized to have unit length, and the dependent variable is centered.
So, XᵗX can be written as:
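With two unit-length independent variables whose correlation is r:

```latex
X^{\top}X =
\begin{pmatrix}
1 & r \\
r & 1
\end{pmatrix},
\qquad
(X^{\top}X)^{-1} = \frac{1}{1-r^{2}}
\begin{pmatrix}
1 & -r \\
-r & 1
\end{pmatrix}
```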
The variance of the parameters can be described as:
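Since Var(β̂) = σ²(XᵗX)⁻¹, each coefficient’s variance is:

```latex
\operatorname{Var}(\hat{\beta}_j) = \frac{\sigma^{2}}{1-r^{2}}, \qquad j = 1, 2
```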
So, if the correlation r is close to 1, meaning the two variables are highly correlated, the denominator 1 − r² approaches 0, and the variance blows up toward ∞. Thus, the parameter estimates become unstable because their variance is too large. This is the effect of multicollinearity and the reason why we should remove it.
To detect multicollinearity, we can use the methods below.
- Scatterplot / correlation matrix
- Variance inflation factors (VIFs)
- Condition number of the correlation matrix
I skip the detailed mathematical derivations of these methods in this blog, but you can check reference [2]. In Python, you can implement them quickly; you can see the implementation in the section “Detecting multicollinearity”.
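As a rough sketch with synthetic data: the VIFs are the diagonal elements of the inverse correlation matrix of the independent variables, so both VIFs and the condition number can be computed with NumPy alone:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data where x2 is strongly correlated with x0
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=200)

# Correlation matrix of the independent variables
R = np.corrcoef(X, rowvar=False)

# VIFs: diagonal of the inverse correlation matrix
vifs = np.diag(np.linalg.inv(R))
print(vifs)  # large values for x0 and x2, near 1 for x1

# Condition number of the correlation matrix
print(np.linalg.cond(R))
```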
The correlation matrix does not appear to show any strong correlations. Among the VIFs, the previous scores and the sleep hours have higher values than the others. A rule of thumb is that a VIF larger than 10 indicates multicollinearity with high probability. Also, the condition number of XᵗX is large, and that of the artificially correlated XᵗX is even larger than the original one. A common remedy for multicollinearity is Ridge regression; however, Ridge regression is complex, so I will introduce it in another blog.
Appendix: Linear Algebra Cheat-sheet
These formulas are the linear algebra knowledge necessary for the parameter estimation of multiple linear regression.
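A short sketch of the identities used in the derivations above:

```latex
(AB)^{\top} = B^{\top}A^{\top},
\qquad
\frac{\partial\, \mathbf{a}^{\top}\boldsymbol{\beta}}{\partial \boldsymbol{\beta}} = \mathbf{a},
\qquad
\frac{\partial\, \boldsymbol{\beta}^{\top} A \boldsymbol{\beta}}{\partial \boldsymbol{\beta}} = (A + A^{\top})\boldsymbol{\beta}
```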
The proofs are here:
We have discussed the mathematical derivation of multiple linear regression and its Python implementation. I hope these mathematical derivations help you understand multiple linear regression. Thank you for reading!
References
[1] Student Performance dataset, kaggle
[2] Multicollinearity, San Jose State University lecture