Principal Component Analysis (PCA)
PCA is a dimensionality reduction technique that is widely used to analyse datasets with a very large number of features. It reduces the number of dimensions of the data by keeping a smaller set of components while preserving as much of the information (variance) as possible. PCA is commonly used for tasks like dimensionality reduction and visualization in data science.
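As a quick, minimal sketch of what this looks like in practice (assuming scikit-learn is available, and using its bundled Iris data purely as a placeholder dataset), PCA can reduce a dataset to two components for visualization:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Placeholder dataset: 150 samples with 4 features
X = load_iris().data

# Reduce 4 dimensions down to 2 while keeping as much variance as possible
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # fraction of variance kept by each component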
What is Dimensionality Reduction :
Dimensionality reduction is a technique to transform a given dataset from a high dimensional space to a low dimensional space, where it can be intuitively visualized by analysts.
Working with high dimensional data can be undesirable for the following reasons :
- Data might be very sparse.
- Most of the columns do not provide any extra information, or are correlated with existing columns in the data.
- Analysing high dimensional data can be computationally expensive.
- It may lead to the curse of dimensionality.
Dimensionality Reduction Techniques :
1. Feature Selection Methods
2. Matrix Factorization
- Principal Component Analysis
3. Manifold Learning
- Kohonen Self Organizing Map (SOM)
- Sammon's Mapping
- Multidimensional scaling
- t-distributed Stochastic Neighbour Embedding (t-SNE)
4. Autoencoder Methods
Who invented PCA ?
PCA was invented in 1901 by Karl Pearson, as an analogue of the principal axis theorem in mechanics. It was later independently developed and named by Harold Hotelling in the 1930s.
Depending on the field of application, it is also named :
- The discrete Karhunen–Loève transform (KLT) in signal processing,
- The Hotelling transform in multivariate quality control
- Proper Orthogonal Decomposition (POD) in mechanical engineering
- Singular Value Decomposition (SVD) of X (invented in the last quarter of the 19th century), or Eigenvalue Decomposition (EVD) of XᵀX in linear algebra,
- Factor Analysis (closely related to, though distinct from, PCA)
- Eckart–Young theorem (Harman, 1960), or Empirical Orthogonal Functions (EOF) in meteorological science
- Empirical Eigenfunction Decomposition (Sirovich, 1987), Empirical Component Analysis (Lorenz, 1956), Quasi-harmonic Modes (Brooks et al., 1988), Spectral Decomposition in noise and vibration, and empirical modal analysis in structural dynamics
How does PCA work ?
We need to learn about a few data pre-processing techniques in order to understand the internal working of PCA.
Column Normalization :
Normalization is a scaling technique where the values of a column are transformed in such a way that the transformed values always lie between 0 and 1. It can be done using a technique called Min-Max Scaling.
Let's say we have a feature column A, with Amin as the minimum of A and Amax as the maximum of A. Then we can transform each individual value a of A as per the equation below:
a' = (a − Amin) / (Amax − Amin)
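A minimal NumPy sketch of Min-Max Scaling (the small matrix X below is just made-up placeholder data):

import numpy as np

# Placeholder data: rows are samples, columns are two features
X = np.array([[2.0, 100.0],
              [4.0, 150.0],
              [6.0, 200.0]])

# Min-Max Scaling applied column-wise: (a - Amin) / (Amax - Amin)
A_min = X.min(axis=0)
A_max = X.max(axis=0)
X_scaled = (X - A_min) / (A_max - A_min)

print(X_scaled)  # every column now lies in the range [0, 1]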
Column Standardization :
Standardization is a scaling technique where the aim is to transform the column vector so that its values have zero mean and unit standard deviation. It can be derived using the formula below, where mu is the mean of the feature and sigma is the standard deviation of the feature values:
x' = (x − mu) / sigma
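Similarly, a minimal sketch of Column Standardization with NumPy (reusing the same kind of placeholder matrix so the snippet is self-contained):

import numpy as np

X = np.array([[2.0, 100.0],
              [4.0, 150.0],
              [6.0, 200.0]])

# Standardization applied column-wise: (x - mu) / sigma
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_std = (X - mu) / sigma

print(X_std.mean(axis=0))  # ~0 for every column
print(X_std.std(axis=0))   # 1 for every column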
Why do we need Standardization before applying PCA ?
Standardization is done prior to PCA because the objective function of PCA is quite sensitive to the initial range of the features. Features with high variance will dominate over low variance features, even though a low variance feature might be more informative. When we standardize all the features, every feature has the same mean and variance, and thus the high variance feature domination problem is resolved.
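In practice this is often done by chaining a scaler and PCA together. Here is a hedged sketch with scikit-learn (the Pipeline, the Iris placeholder data, and the choice of 2 components are just one possible setup, not the only way to do it):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = load_iris().data  # placeholder dataset

# Standardize every feature first, then apply PCA,
# so that no single high-variance feature dominates the components.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
X_2d = pipeline.fit_transform(X)

print(X_2d.shape)  # (150, 2)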
Covariance Matrix :
The covariance matrix is an N * N symmetric matrix, where N is the number of dimensions. Each entry of the covariance matrix measures how much a pair of input variables vary together around their respective means. This lets us understand the relationship and correlation between the two variables.
The covariance of a feature with itself is nothing but the variance of that feature. So the diagonal of the covariance matrix holds the variances of the feature columns.
The equation below calculates the covariance term for a pair of features f1, f2, where mu1 is the mean of f1 and mu2 is the mean of f2:
cov(f1, f2) = (1/n) * Σi (f1i − mu1) * (f2i − mu2)
Now, since we have the feature columns standardized, the mean of both columns becomes 0, and the equation simplifies to:
cov(f1, f2) = (1/n) * Σi f1i * f2i
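A small NumPy sketch of the covariance matrix of standardized data (the matrix X is placeholder data; note that np.cov uses the 1/(n−1) sample version by default, so bias=True is passed to match the 1/n formula above):

import numpy as np

X = np.array([[2.0, 100.0, 1.0],
              [4.0, 150.0, 0.0],
              [6.0, 200.0, 1.0]])

# Standardize the columns so every feature has mean 0 and standard deviation 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix: an N * N symmetric matrix (N = number of features).
# rowvar=False treats columns as variables; bias=True uses the 1/n formula.
S = np.cov(X_std, rowvar=False, bias=True)

print(S)           # symmetric matrix of pairwise covariances
print(np.diag(S))  # the diagonal holds the variances (all 1 after standardization)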
Principal Component Analysis :
The aim of principal component analysis is to find the direction with maximum variance within the feature space.
In the figure we can see 2-D data. To transform it into 1-D, we would need to rotate the axes by a certain angle (theta) and find the spread on the new axes. We drop the axis with less variance and keep the axis with more variance, i.e. f1' in the above picture. After all these steps, we would project the xi's onto f1'.
Objective Function:
We need to find the unit vector (u1) in the direction of maximal spread.
Let the dataset be D = {x1, x2, …, xn}, where each xi is a standardized (zero mean) feature vector.
The projected data will be :
xi' = u1ᵀ xi
Objective Function :
maximize the variance of the projections, i.e. maximize (1/n) * Σi (u1ᵀ xi)² = u1ᵀ S u1, subject to u1ᵀ u1 = 1, where S is the covariance matrix of the standardized data.
The solution of the above mathematical equation can be obtained using Eigenvalue Decomposition of the covariance matrix S. This gives us eigenvalues and eigenvectors.
λ1, λ2, …, λN are the eigenvalues (sorted in decreasing order) and u1, u2, …, uN are the corresponding eigenvectors. The variance ratio is defined as per the equation below, and tells us the percentage of information preserved in the new dimensional space as per the number of components k:
variance ratio (top k components) = (λ1 + λ2 + … + λk) / (λ1 + λ2 + … + λN)
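Putting the pieces together, here is a rough end-to-end sketch of PCA via eigendecomposition in NumPy (the random data and the choice of k = 2 components are arbitrary placeholders):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # placeholder data: 100 samples, 5 features

# 1. Column standardization
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized data
S = np.cov(X_std, rowvar=False, bias=True)

# 3. Eigenvalue decomposition (eigh, since S is symmetric)
eigvals, eigvecs = np.linalg.eigh(S)

# 4. Sort eigenvalues (and their eigenvectors) in decreasing order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. Keep the top k eigenvectors and project the data onto them
k = 2
X_projected = X_std @ eigvecs[:, :k]

# Variance ratio: fraction of the total variance preserved by the top k components
variance_ratio = eigvals[:k].sum() / eigvals.sum()
print(X_projected.shape, variance_ratio)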
Limitations of PCA :
- Linearity: PCA assumes that the principal components are linear combinations of the original features. If this assumption does not hold, PCA will give us misleading results.
- Large variance means more structure: PCA uses variance as the measure of how important a particular dimension is, so it preserves the global structure of the data rather than the local structure. This can result in some information loss when the data is not properly spread.
- Orthogonality: PCA also assumes that the principal components are orthogonal to each other. There is some information loss if the most informative directions in the data are not orthogonal to each other.