This article summarizes and explores the research paper “Which principal components are most sensitive to distributional changes?” by Martin Tveten. The paper studies how the principal components of data respond when the distribution generating the data changes. In particular, it asks which principal components are the most sensitive to such changes. The key results from the paper are:
- For bivariate data, the minor projections of the PCA-rotated data are most sensitive to changes in the distribution of data, especially if the change is sparse.
- For higher dimensional data, Monte Carlo simulation results concur with the observation from bivariate data.
We can now look at how those results were derived, and under what assumptions. The general problem (expressed for bivariate data to ease the notation) is as follows:
Consider n independent observations x_1, ..., x_n in D dimensions. For t = 1, ..., k, where 1 < k < n, the observations come from a distribution with mean mu0 = 0 and covariance Sigma0; for t = k+1, ..., n, they come from a distribution with mean mu1 and covariance Sigma1. For D = 2, each covariance matrix is a 2×2 matrix determined by two variances and one correlation.
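The bivariate setup described above can be written out as a short sketch. Normality is assumed here because sensitivity is later measured between normal distributions, and the symbols \sigma_{i,1}, \sigma_{i,2} and \rho_i are notation chosen for this illustration, not necessarily the paper's.

```latex
x_t \sim
\begin{cases}
N(\mu_0, \Sigma_0), & t = 1, \dots, k, \\
N(\mu_1, \Sigma_1), & t = k+1, \dots, n,
\end{cases}
\qquad \mu_0 = 0,
\qquad
\Sigma_i =
\begin{pmatrix}
\sigma_{i,1}^2 & \rho_i \sigma_{i,1} \sigma_{i,2} \\
\rho_i \sigma_{i,1} \sigma_{i,2} & \sigma_{i,2}^2
\end{pmatrix},
\quad i = 0, 1 .
```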
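Now we look at the ordered, normalized eigensystem of Sigma0, that is, its eigenvalues sorted in decreasing order together with the corresponding unit-length eigenvectors.
We consider the projections to be the observations projected onto these eigenvectors: the projection onto the first eigenvector is the principal projection, and the projection onto the last eigenvector is the minor projection.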
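In symbols, the eigensystem and the projections can be sketched as follows. The names \lambda_j (eigenvalues), v_j (eigenvectors) and y_{j,t} (projections) are assumed notation for this illustration.

```latex
\Sigma_0 v_j = \lambda_j v_j, \qquad
\lVert v_j \rVert = 1, \qquad
\lambda_1 \ge \lambda_2 \ge 0, \qquad
y_{j,t} = v_j^{\top} x_t, \quad j = 1, 2 .
```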
So the general problem is to find out which projections are most sensitive to the distributional changes defined by our means and covariances. We define the sensitivity of a projection as the Hellinger distance between its marginal distributions before and after the change, with both marginals taken to be normal. The squared Hellinger distance between two normal distributions has a closed form that depends only on their means and variances.
In our case, the two distributions being compared are the marginal distributions of a single projection before and after the change.
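For reference, the standard closed form of the squared Hellinger distance between two univariate normal distributions p = N(m_p, s_p^2) and q = N(m_q, s_q^2) is given below. The symbols m and s are generic placeholders chosen for this illustration.

```latex
H^2(p, q) = 1 - \sqrt{\frac{2 s_p s_q}{s_p^2 + s_q^2}}
\exp\!\left( -\frac{(m_p - m_q)^2}{4 \left( s_p^2 + s_q^2 \right)} \right)
```

The distance is 0 when the two distributions coincide and approaches 1 as they separate.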
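Under the normality assumption, the marginals of a projection follow from the fact that a linear function of a normal vector is again normal. Writing v_j for the j-th unit eigenvector of Sigma0 and \lambda_j for its eigenvalue (assumed notation), a sketch of the pre- and post-change marginals of the projection y_{j,t} = v_j^T x_t is:

```latex
y_{j,t} \sim
\begin{cases}
N(0, \lambda_j), & t \le k, \\
N\!\left( v_j^{\top} \mu_1, \; v_j^{\top} \Sigma_1 v_j \right), & t > k .
\end{cases}
```

The sensitivity H_j of projection j is then the Hellinger distance between these two normal distributions.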
Results (2 Dimensions)
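The first scenario is when only the mean changes. Writing H1 and H2 for the Hellinger distances of the principal and the minor projection, respectively:
- If one component of the post-change mean is 0 while the other is not, H2 > H1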
- If the mean changes equally in opposite directions, H2 > H1
- If the mean changes equally in the same direction, H1 > H2
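The three mean-change cases above can be checked numerically with a minimal sketch. It assumes normal marginals, standardized pre-change data with a positive correlation of 0.5, and a shift of size 1. These numbers, and the helper names `hellinger_sq_normal` and `projection_sensitivity`, are illustrative choices and not taken from the paper.

```python
import numpy as np


def hellinger_sq_normal(m_p, var_p, m_q, var_q):
    """Squared Hellinger distance between N(m_p, var_p) and N(m_q, var_q)."""
    bc = np.sqrt(2.0 * np.sqrt(var_p * var_q) / (var_p + var_q))
    bc = bc * np.exp(-((m_p - m_q) ** 2) / (4.0 * (var_p + var_q)))
    return 1.0 - bc


def projection_sensitivity(sigma0, mu1, sigma1):
    """Squared Hellinger distance per projection, principal projection first."""
    eigvals, eigvecs = np.linalg.eigh(sigma0)  # eigh returns ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # make it descending
    post_means = eigvecs.T @ mu1  # v_j^T mu1
    post_vars = np.diag(eigvecs.T @ sigma1 @ eigvecs)  # v_j^T Sigma1 v_j
    return hellinger_sq_normal(0.0, eigvals, post_means, post_vars)


# Assumed example: unit variances, correlation 0.5, covariance unchanged.
sigma0 = np.array([[1.0, 0.5], [0.5, 1.0]])

sparse = projection_sensitivity(sigma0, np.array([1.0, 0.0]), sigma0)
opposite = projection_sensitivity(sigma0, np.array([1.0, -1.0]), sigma0)
same = projection_sensitivity(sigma0, np.array([1.0, 1.0]), sigma0)

for name, h in [("sparse", sparse), ("opposite", opposite), ("same", same)]:
    print(name, "H1^2 =", h[0], "H2^2 =", h[1])
```

With this setup the principal axis points along (1, 1) and the minor axis along (1, -1), which is why a shift in the same direction is picked up by the principal projection and a shift in opposite directions by the minor one.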
The next scenario is when only the variances change. If both variances change equally, then H1 = H2. When only one variance changes, the situation gets more complicated: if the variance increases, then H2 > H1, and if the variance decreases, then H1 > H2 (the principal projection is the most sensitive), unless the pre-change correlation is high (> 0.87).
The final scenario is when only the correlation changes. Here the minor projection is the most sensitive in most cases.
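The variance and correlation scenarios can be explored with a similar sketch. It assumes normal marginals, a mean that stays at 0, unit pre-change variances, and that a change in one variance keeps the correlation fixed (so the covariance changes with it). That last point is an interpretation made for this example, and the helper names and all numbers are illustrative.

```python
import numpy as np


def hellinger_sq_normal(m_p, var_p, m_q, var_q):
    """Squared Hellinger distance between N(m_p, var_p) and N(m_q, var_q)."""
    bc = np.sqrt(2.0 * np.sqrt(var_p * var_q) / (var_p + var_q))
    bc = bc * np.exp(-((m_p - m_q) ** 2) / (4.0 * (var_p + var_q)))
    return 1.0 - bc


def cov2(sd1, sd2, rho):
    """2x2 covariance matrix from two standard deviations and a correlation."""
    return np.array([[sd1**2, rho * sd1 * sd2], [rho * sd1 * sd2, sd2**2]])


def variance_sensitivity(sigma0, sigma1):
    """(H1^2, H2^2) when the mean stays at 0 and only the covariance changes."""
    eigvals, eigvecs = np.linalg.eigh(sigma0)  # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # principal first
    post_vars = np.diag(eigvecs.T @ sigma1 @ eigvecs)
    return hellinger_sq_normal(0.0, eigvals, 0.0, post_vars)


base = cov2(1.0, 1.0, 0.5)  # assumed pre-change covariance

both_scaled = variance_sensitivity(base, 2.0 * base)  # both variances doubled
one_up = variance_sensitivity(base, cov2(2.0, 1.0, 0.5))  # one variance up
one_down = variance_sensitivity(base, cov2(0.5, 1.0, 0.5))  # one variance down
one_down_high_rho = variance_sensitivity(
    cov2(1.0, 1.0, 0.95), cov2(0.5, 1.0, 0.95)  # same decrease, high correlation
)
rho_change = variance_sensitivity(base, cov2(1.0, 1.0, 0.2))  # correlation only

for name, h in [
    ("both variances scaled", both_scaled),
    ("one variance up", one_up),
    ("one variance down", one_down),
    ("one variance down, rho = 0.95", one_down_high_rho),
    ("correlation 0.5 -> 0.2", rho_change),
]:
    print(name, "H1^2 =", h[0], "H2^2 =", h[1])
```

The example with correlation 0.95 only illustrates that the ordering can flip when the pre-change correlation is high; it does not attempt to reproduce the 0.87 threshold.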
Results (Higher Dimensions)
Working out the inequalities becomes intractable in higher dimensions, so Monte Carlo simulations were used to produce results instead. The simulation results show that the conclusions from the two-dimensional case still hold in higher dimensions.
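To give a feel for what such a simulation can look like, here is a minimal sketch. It is not the paper's simulation design: the way random correlation matrices are drawn, the single-coordinate mean shift, and the names `random_correlation` and `count_most_sensitive` are all assumptions made for illustration. For each random pre-change correlation matrix it shifts the mean of one random coordinate and records which projection has the largest Hellinger distance.

```python
import numpy as np


def hellinger_sq_normal(m_p, var_p, m_q, var_q):
    """Squared Hellinger distance between N(m_p, var_p) and N(m_q, var_q)."""
    bc = np.sqrt(2.0 * np.sqrt(var_p * var_q) / (var_p + var_q))
    bc = bc * np.exp(-((m_p - m_q) ** 2) / (4.0 * (var_p + var_q)))
    return 1.0 - bc


def random_correlation(dim, rng):
    """Random correlation matrix (assumed scheme: rescaled Gaussian Gram matrix)."""
    a = rng.standard_normal((dim, 2 * dim))
    cov = a @ a.T
    sd = np.sqrt(np.diag(cov))
    return cov / np.outer(sd, sd)


def count_most_sensitive(dim=10, n_rep=500, shift=1.0, seed=0):
    """Count how often each projection (principal first) is the most sensitive."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(dim, dtype=int)
    for _ in range(n_rep):
        sigma0 = random_correlation(dim, rng)
        eigvals, eigvecs = np.linalg.eigh(sigma0)  # ascending order
        eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # principal first
        mu1 = np.zeros(dim)
        mu1[rng.integers(dim)] = shift  # sparse change: one coordinate shifts
        # Mean-only change, so the projection variances stay at eigvals.
        h = hellinger_sq_normal(0.0, eigvals, eigvecs.T @ mu1, eigvals)
        counts[np.argmax(h)] += 1
    return counts


counts = count_most_sensitive()
print("times most sensitive, principal to minor:", counts)
```

In this sketch the Hellinger distances are computed analytically for each draw rather than estimated from sampled data, which keeps the example short. A fuller study would also vary the type and sparsity of the change.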
This article addresses an important question: how the principal components in PCA respond when the distribution of the data changes. In real-life applications, the distribution generating the data is rarely constant, so knowing how the components change is useful for making corrections effectively. One specific application is change point detection, where the changing behavior of the components can be used to estimate whether the underlying distribution has changed.