Risks and Caution on applying PCA for Supervised Learning Problems

Souradip Chakraborty
Towards Data Science
Sep 10, 2019 · 8 min read

Co-authors: Amlan Jyoti Das, Sai Yaswanth


High Dimensional Space and its Curse

The curse of dimensionality is a crucial problem when dealing with real-life datasets, which are generally high-dimensional. As the dimensionality of the feature space increases, the number of possible configurations grows exponentially, and the fraction of configurations covered by the available observations shrinks accordingly.

In such a scenario, Principal Component Analysis plays a major role in efficiently reducing the dimensionality of the data while retaining as much of the variation present in the data set as possible.

Let us give a very brief introduction to Principal Component Analysis before delving into the actual problem.

Principal Component Analysis: Definition

The central idea of Principal Component Analysis (PCA) is to reduce the dimensionality of a data set consisting of a large number of correlated variables, while retaining the maximum possible variation present in the data set.

Let’s define a symmetric matrix A,

A = XᵀX

where X is an n×m matrix of the independent variables, with the n data points as rows and the m features as columns. The matrix A can be decomposed in the form

A = EDEᵀ

where D is a diagonal matrix of the eigenvalues of A and E is the matrix of eigenvectors of A arranged as columns.

The Principal Components (PCs) of X are given by the eigenvectors of XᵀX, which indicates that the directions of the eigenvectors/Principal Components depend only on the variation of the independent variables (X), not on the response.
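As a quick illustration, here is a minimal sketch of PCA via the eigendecomposition described above, using NumPy (not the author’s code; all names and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))        # n = 200 data points, m = 3 features
Xc = X - X.mean(axis=0)              # centre each column before forming A

A = Xc.T @ Xc                        # symmetric matrix A = XᵀX
evals, evecs = np.linalg.eigh(A)     # D = diag(evals), E = evecs (columns)

order = np.argsort(evals)[::-1]      # sort eigenpairs by decreasing eigenvalue
evals, evecs = evals[order], evecs[:, order]

scores = Xc @ evecs                  # principal component scores
print(evals / evals.sum())           # proportion of variance along each PC
```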

Why applying PCA blindly is a curse in supervised problems

The use of Principal Component Analysis in regression has received a lot of attention in the literature, and it has been used widely as a method to handle multicollinearity.

But along with the use of Principal Component Regression, there have been many misconceptions regarding how well the Principal Components explain the response variable and their respective order of importance.

A common fallacy, repeated several times even in papers and books, is that in the supervised Principal Component Regression framework the Principal Components of the independent variables with low eigenvalues play no part in explaining the response variable. This brings us to the very purpose of this blog: to demonstrate that components with low eigenvalues can be as important as, or even much more important than, the Principal Components with larger eigenvalues when explaining the response variable.

Some such examples, as pointed out in [1], are listed below.

[1]. Mansfield et al. (1977, p. 38) suggest that if the only components deleted are those with small variance, then there is very little loss of predictiveness in the regression.

[2]. In the book by Gunst and Mason (1980), 12 pages are devoted to Principal Component Regression, and most of the discussion assumes that deletion of principal components is based solely on their variances (pp. 327–328).

[3]. Mosteller and Tukey (1977, pp. 397–398) argue similarly that components with small variance are unlikely to be important in regression, apparently on the basis that nature is “tricky” but not “downright mean”.

[4]. Hocking (1976, p. 31) is even firmer in defining a rule for retaining Principal Components in regression based on their variance.

Theoretical Explanation and Understanding

First, let us give a proper mathematical justification of the above hypothesis; then we will explain the intuition using geometric visualisation and simulations.

Let’s say

Y — Response variable

X — Design Matrix — Matrix of feature space

Z — Standardised version of X

Let 𝜆₁ ≥ 𝜆₂ ≥ … ≥ 𝜆ₚ be the eigenvalues of ZᵀZ (the correlation matrix) and V the corresponding matrix of eigenvectors; then the columns of W = ZV are the Principal Components of Z. The standard procedure in Principal Component Regression is to regress Y on the first m PCs (a minimal sketch of this procedure is given below), and the problem with doing so can be seen through the theorem that follows and its explanation [2].
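For concreteness, here is a minimal sketch (assumed names and data, not the author’s code) of this standard Principal Component Regression workflow using scikit-learn: standardise X, keep only the first m PCs by explained variance, then regress Y on them.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def pcr(X, y, m):
    """Regress y on the first m principal components of the standardised X."""
    model = make_pipeline(StandardScaler(), PCA(n_components=m), LinearRegression())
    model.fit(X, y)
    return model

# Purely illustrative usage with random data
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=100)
print(pcr(X, y, m=2).score(X, y))    # R² when only the top-2 PCs are kept
```

The theorem below shows why selecting components by variance alone, as this workflow does, can discard exactly the component that matters for Y.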

Theorem:

Let W = (W₁, …, Wₚ) be the PCs defined above, i.e. the columns of W = ZV. Now consider the regression model

Y = W𝜃 + 𝜀

If the true vector of regression coefficients 𝛽 is in the direction of the jᵗʰ eigenvector of ZᵀZ, then when Y is regressed on W, the jᵗʰ PC Wⱼ alone will contribute everything to the fit while the remaining PCs will contribute nothing.

Proof: Let V = (V₁, …, Vₚ) be the matrix containing the eigenvectors of ZᵀZ. Since V is orthogonal, VVᵀ = I, and therefore

Y = Z𝛽 + 𝜀 = ZVVᵀ𝛽 + 𝜀 = W𝜃 + 𝜀, where 𝜃 = Vᵀ𝛽.

If 𝛽 is in the direction of the jᵗʰ eigenvector Vⱼ, then Vⱼ = a𝛽, where a is a nonzero scalar. Consequently 𝜃ⱼ = Vⱼᵀ𝛽 = a𝛽ᵀ𝛽 ≠ 0 and 𝜃ₖ = Vₖᵀ𝛽 = 0 whenever k ≠ j. Therefore the regression coefficient 𝜃ₖ corresponding to Wₖ is equal to zero for every k ≠ j, and hence the model reduces to

Y = 𝜃ⱼWⱼ + 𝜀.

Because a variable Wₖ produces no reduction in the regression sum of squares if and only if its regression coefficient is zero, Wⱼ alone will contribute everything to the fit while the remaining PCs contribute nothing.

Geometric Significance and Simulation

Let’s now run the simulation and build a geometric understanding of the mathematical intuition. The explanation is illustrated with a simulation using a two-dimensional feature space (X) and a single response variable, so that the hypothesis is easy to understand visually.

Figure 1: Univariate and bivariate plots for the simulated variables X1 and X2

In the first step of the simulation, the feature space is simulated from a bivariate normal distribution with a very high correlation between the variables, and PCA is applied.
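A minimal sketch of this first step (the correlation value, sample size, and seed are assumptions, not the author’s exact settings):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
mean = [0.0, 0.0]
cov = [[1.0, 0.9],                      # correlation of 0.9 between X1 and X2
       [0.9, 1.0]]
X = rng.multivariate_normal(mean, cov, size=1000)

pca = PCA(n_components=2).fit(X)
W = pca.transform(X)                    # PC1 and PC2 scores
print(pca.explained_variance_ratio_)    # roughly [0.95, 0.05] for this covariance
```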

Figure 2: Correlation heat-map for PC1 and PC2

It is very clear from the plot that there is absolutely no correlation between the PCs. The second step is to simulate the response variable Y so that the true coefficient vector of Y on the PCs lies in the direction of the second Principal Component.
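A minimal sketch of this second step, continuing from the previous snippet (the coefficient scale and noise level are assumptions):

```python
v2 = pca.components_[1]                 # direction of the 2nd principal component
beta = 3.0 * v2                         # true coefficient vector aligned with PC2
y = X @ beta + rng.normal(scale=0.5, size=len(X))

corr = np.corrcoef(np.column_stack([y, W]), rowvar=False)
print(np.round(corr, 2))                # |corr(Y, PC2)| is large, corr(Y, PC1) ≈ 0
```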

Once the response variable is simulated, the correlation matrix looks something like this.

Figure 3: Correlation heat-map for the simulated variable Y with PC1 and PC2

It is very clear from the plot that Y is highly correlated with PC2 rather than PC1, which demonstrates our hypothesis.

Figure 4: Variance in the feature space explained by PC1 and PC2

As the figure shows, PC1 explains 95% of the variance in X, so if we go by the logic above we should completely ignore PC2 while doing the regression.

Let’s follow that logic and see what happens!

Figure 5: Regression Summary with Y and PC1

So an R² of 0 indicates that PC1, even though it explains 95% of the variation in X, completely fails to explain the response variable.

Now let’s try the same thing with PC2, which explains only 5% of the variation in X, and see what happens!

Figure 6: Regression Summary with Y and PC2

Whoa! You must be wondering what just happened: the Principal Component that explains only around 5% of the variance in X explains 72% of the variance in Y.
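The two regressions above can be reproduced with a sketch like the following, continuing from the earlier snippets (statsmodels is used here for the regression summaries; the exact R² values will vary with the assumed simulation settings):

```python
import statsmodels.api as sm

pc1, pc2 = W[:, 0], W[:, 1]

fit_pc1 = sm.OLS(y, sm.add_constant(pc1)).fit()
fit_pc2 = sm.OLS(y, sm.add_constant(pc2)).fit()

print(f"R² with PC1 only: {fit_pc1.rsquared:.2f}")   # close to 0
print(f"R² with PC2 only: {fit_pc2.rsquared:.2f}")   # large, despite PC2's tiny eigenvalue
```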

Some real-life examples that validate the hypothesis, as pointed out in [1], are listed below.

[1]. Smith and Campbell (1980) gave an example from chemical engineering with nine regressor variables, in which the eighth Principal Component, accounting for only 0.06% of the total variation, would have been removed under the low-variance criterion despite being important for explaining the response.

[2]. A second example is provided by Kung and Sharif (1980). In a study of the prediction of monsoon onset date from ten meteorological variables, the significant Principal Components were the eighth, second and tenth, in that order. This shows that even the Principal Component with the lowest eigenvalue was the third most significant in explaining the variability of the response variable.

Conclusion: The above examples indicate that it is not advisable to remove Principal Components with low eigenvalues, since those eigenvalues only reflect explainability in the feature space, not in the response variable. Hence we should either keep all the components for the supervised learning task or use supervised dimensionality-reduction methods such as Partial Least Squares regression or Least Angle Regression, which we will explain in upcoming blogs.
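As a pointer to the supervised alternative mentioned above, here is a minimal sketch of Partial Least Squares regression on the simulated data from the earlier snippets (illustrative only; PLS chooses its components using both X and Y rather than the variance of X alone):

```python
from sklearn.cross_decomposition import PLSRegression

pls = PLSRegression(n_components=1).fit(X, y)
print(f"R² with a single PLS component: {pls.score(X, y):.2f}")   # high, unlike PC1 alone
```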

References :

[1] Jolliffe, Ian T. “A Note on the Use of Principal Components in Regression.” Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 31, no. 3, 1982, pp. 300–303. JSTOR, www.jstor.org/stable/2348005.

[2] Hadi, Ali S., and Robert F. Ling. “Some Cautionary Notes on the Use of Principal Components Regression.” The American Statistician, vol. 52, no. 1, 1998, pp. 15–19. JSTOR, www.jstor.org/stable/2685559.

[3] Hawkins, D. M. (1973). On the investigation of alternative regressions by principal component analysis. Applied Statistics, 22, 275–286.

[4] Mansfield, E. R., Webster, J. T. and Gunst, R. F. (1977). An analytic variable selection technique for principal component regression. Applied Statistics, 26, 34–40.

[5] Mosteller, F. and Tukey, J. W. (1977). Data Analysis and Regression: A Second Course in Statistics. Reading, Mass.: Addison-Wesley.

[6] Gunst, R. F. and Mason, R. L. (1980). Regression Analysis and its Application: A Data-oriented Approach. New York: Marcel Dekker.

[7] Jeffers, J. N. R. (1967). Two case studies in the application of principal component analysis. Applied Statistics, 16, 225–236; and Jeffers, J. N. R. (1981). Investigation of alternative regressions: some practical examples. The Statistician, 30, 79–88.

[8] Kendall, M. G. (1957). A Course in Multivariate Analysis. London: Griffin.

If you have any thoughts, comments or questions, please leave a comment below or contact us on LinkedIn.

Stay tuned. Happy reading !!! :)
