PCA (Principal Component Analysis): A More Detailed Explanation Involving Eigenvalues and Eigenvectors.
Firstly, I want to address a concern close to my heart: my dear coworker has fallen ill. I sincerely hope that she will regain her health as soon as possible. I believe that writing this blog, which is also of interest to her, may uplift her spirits during this challenging time.
Note: This is also my first blog; I didn’t even expect to write my first blog for motivation :)
Let’s first answer this question: “Who cares, brother? Who cares about PCA? Why do we need it?”
Note: “She also likes this question :)”
If we have a high-dimensional dataset, our machine learning model may need more data to cope with the complexity, which leads to overfitting and increased computation time. This is known as the curse of dimensionality, which arises when working with high-dimensional data. As the number of dimensions increases, the possible combinations of feature values grow exponentially, making it difficult to obtain representative results and causing classification and clustering models to perform poorly on that dataset. One technique to solve this problem is Principal Component Analysis (PCA), which is used to reduce dimensionality while retaining as much of the original information as possible. For example, if the retained principal components capture 97 percent of the variance of the original features, the reduced dataset still carries approximately 97 percent of the information that was in the full feature set.
- It works on the condition that when the data in the higher-dimensional space is mapped to the lower-dimensional space, the variance of the data in the lower-dimensional space is maximized.
- PCA converts a set of correlated features to a set of uncorrelated features. Here, we can say that PCA is an unsupervised algorithm used to examine the interrelations between a set of variables without any prior knowledge about the target variable.
- The total variance captured by all the principal components is equal to the total variance in the original dataset. The first principal component captures the most variance in the dataset, the second principal component captures the maximum remaining variance in a direction orthogonal to the first, and so on (the short example after the note below illustrates this).
- PCA assumes that information resides in the variance of the features; thus, features with higher variance carry more information.
Note: Two vectors are orthogonal to each other when their dot product is 0.
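Before going step by step, here is a minimal sketch of two of the points above, assuming NumPy, scikit-learn, and the Iris dataset purely as an example: the explained-variance shares of the principal components sum to the total variance, and the principal axes are orthogonal to each other.

```python
# A quick sanity check of the points above on the Iris dataset (4 features).
# The dataset choice is only an assumption for illustration.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                       # shape (150, 4)
X_std = StandardScaler().fit_transform(X)  # PCA is scale-sensitive (see step 1)

pca = PCA()                                # keep all components for inspection
pca.fit(X_std)

# Each component explains a decreasing share of the total variance,
# and the shares sum to 1 (the total variance is preserved).
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())

# The principal axes are orthogonal: pairwise dot products are ~0,
# so this product is approximately the identity matrix.
print(np.round(pca.components_ @ pca.components_.T, 6))
```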
Let’s go step by step to understand how PCA works.
1. Standardization
First, we need to standardize our dataset to ensure that each variable has a mean of 0 and a standard deviation of 1. This is necessary because PCA is sensitive to the scale of the features.
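For illustration, here is a minimal sketch of standardization with NumPy; the small height/weight array is made up, and scikit-learn’s StandardScaler would do the same job.

```python
# Standardize each column to mean 0 and standard deviation 1.
import numpy as np

X = np.array([[170., 65.],
              [160., 72.],
              [180., 80.],
              [175., 60.]])  # made-up columns: height (cm), weight (kg)

X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # ~0 for each column
print(X_std.std(axis=0))   # 1 for each column
```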
2. Covariance Matrix Computation
Covariance represents the relationship between two variables. If one variable changes positively (negatively) and the second variable changes positively (negatively), we can say that there is a positive relationship between these two variables. Conversely, if one variable changes positively (negatively) and the second variable changes negatively (positively), we can say that there is a negative relationship. This is similar to correlation. However, the difference is that correlation is bounded between +1 and -1, whereas covariance is not necessarily bounded between +1 and -1.
The covariance matrix displays the relationship between features. For example, if we have two columns (A, B) in a dataset, the covariance matrix will be a 2 by 2 matrix. The values of the matrix will represent covariance(A,A), covariance(A,B), covariance(B,A), and covariance(B,B). From this, we can observe that covariance(A,A) is the variance of A, covariance(B,B) is the variance of B, and covariance(A,B) and covariance(B,A) have the same value. Additionally, it’s worth noting that the transpose of the covariance matrix is equal to itself.
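Here is a small sketch of that covariance matrix in code, with made-up numbers for the two columns A and B:

```python
# Build the 2 by 2 covariance matrix for columns A and B.
import numpy as np

data = np.array([[2.0,  8.0],
                 [4.0, 10.0],
                 [6.0, 11.0],
                 [8.0, 15.0]])  # made-up columns: A, B

cov = np.cov(data, rowvar=False)  # rowvar=False: variables are the columns
print(cov)
# cov[0, 0] = variance(A), cov[1, 1] = variance(B)
# cov[0, 1] = cov[1, 0] = covariance(A, B), so the matrix is symmetric
print(np.allclose(cov, cov.T))  # True: the transpose equals the matrix itself
```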
3. Compute Eigenvalues and Eigenvectors of the Covariance matrix to Identify Principal Components.
Take any square matrix A (2 by 2, 3 by 3, 4 by 4, etc.). An eigenvector of A is any vector X that is not equal to zero such that multiplying A by X results in a multiple of X; the multiplier, lambda, is called the eigenvalue and can be any real number.
To find the eigenvalues, we start from this definition. With matrix A, eigenvector X, eigenvalue lambda, and the identity matrix I, we can write:
A*X = lambda*X = lambda*I*X. (Here we use the identity matrix just to help with the calculation; it does not change the result.)
Subtracting lambda*I*X from both sides gives:
A*X - lambda*I*X = 0
which we can factor as (A - lambda*I)*X = 0.
Now take the determinant of (A - lambda*I) and set it equal to zero:
Determinant(A - lambda*I) = 0
We set the determinant equal to zero because (A - lambda*I) cannot be invertible. Why not? If we call the matrix (A - lambda*I) M, then by definition there is a vector X in the null space of M that is not equal to zero, meaning M times X equals zero; a square matrix with a nonzero vector in its null space is not invertible, so its determinant must be zero.
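To make this concrete, here is a small check with a made-up 2 by 2 symmetric matrix (covariance matrices are symmetric, so this mirrors step 2): solving the determinant condition gives the eigenvalues, and NumPy confirms both the determinant condition and A*X = lambda*X.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # made-up symmetric matrix

# Characteristic equation: det(A - lambda*I) = (2 - lambda)^2 - 1 = 0
# => lambda = 1 or lambda = 3
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)  # [3. 1.] (the order may vary)

for lam, v in zip(eigenvalues, eigenvectors.T):  # eigenvectors are the columns
    # det(A - lambda*I) is (numerically) zero for every eigenvalue ...
    print(np.linalg.det(A - lam * np.eye(2)))
    # ... and A*X = lambda*X holds for the corresponding eigenvector X.
    print(np.allclose(A @ v, lam * v))  # True
```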
But the main question here is, who cares, brother? Who cares about these values and vectors?
The main trick behind it is that an eigenvector marks a direction that the matrix only scales. Let’s explain this in more detail. In the equation A*X = lambda*X, X represents an eigenvector: multiplying A by X gives the same vector as multiplying X by the scalar lambda, so the ratio between the components of X (its direction) does not change. For the covariance matrix, each such direction comes with an eigenvalue that measures how much variance lies along its eigenvector. These eigenvectors can then be used to transform a high-dimensional dataset into a low-dimensional one, where the new columns combine the information of the correlated original features. Finally, we select the eigenvectors with the largest eigenvalues.
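A tiny sketch of that idea, reusing the made-up matrix from the previous example: an eigenvector keeps its direction (its component ratio) when multiplied by A, while an arbitrary vector does not.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # same made-up symmetric matrix as above

x = np.array([1.0, 1.0])  # an eigenvector of A (up to scale), lambda = 3
y = np.array([1.0, 0.0])  # not an eigenvector

print(A @ x)  # [3. 3.] -> still along [1, 1], just 3 times longer
print(A @ y)  # [2. 1.] -> the direction has changed
```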
4. Dimensionality Reduction
The eigenvectors of the covariance matrix of the data are referred to as the principal axes of the data, and the projections of the data instances onto these principal axes are called the principal components. When we project data instances onto the principal axes, we essentially find the “shadow” of each data point along these axes. The projection of a data instance onto a principal axis represents how strongly that data instance aligns with that particular direction in the high-dimensional space. Principal components therefore provide a new set of coordinates for each data instance, where each coordinate corresponds to a principal axis. The formula for projecting the data is: ProjectedData = X_standardized * W, where the columns of W are the selected eigenvectors (principal axes).
Dimensionality reduction is then obtained by retaining only those axes (dimensions) that contain most of the variance and discarding all the others.
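Putting the steps together, here is a sketch of the full manual pipeline on the Iris dataset (again, an assumed example): standardize, build the covariance matrix, take its eigenvectors, keep the top two, and project. The result should match scikit-learn’s PCA up to the arbitrary sign of each axis.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

cov = np.cov(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: for symmetric matrices

order = np.argsort(eigenvalues)[::-1]  # sort axes by variance, descending
W = eigenvectors[:, order[:2]]         # keep the top 2 principal axes

Z = X_std @ W                          # the principal components (projections)

Z_sklearn = PCA(n_components=2).fit_transform(X_std)
print(np.allclose(np.abs(Z), np.abs(Z_sklearn)))  # should print True (signs may differ)
```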
Note: An identity matrix is a square matrix with ones on the diagonal and zeros everywhere else; multiplying it by another matrix leaves that matrix unchanged. For example, if A is a 2 by 2 identity matrix and B is another 2 by 2 matrix, multiplying A by B will result in B.
Note: An invertible matrix is a square matrix with a determinant that is not equal to zero. The product of the matrix and its inverse is equal to the identity matrix. For instance, if A is a matrix and A^(-1) is its inverse, then multiplying A by A^(-1) yields the identity matrix.
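A quick numerical check of these two notes, with a made-up 2 by 2 matrix:

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])  # made-up invertible matrix
I = np.eye(2)               # the 2 by 2 identity matrix

print(np.allclose(I @ A, A))                 # True: multiplying by I leaves A unchanged
print(np.linalg.det(A))                      # 10.0, non-zero, so A is invertible
print(np.allclose(A @ np.linalg.inv(A), I))  # True: A * A^(-1) equals I
```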