Dimensionality Reduction in Supervised Framework and Partial Least Square Regression

Souradip Chakraborty · Analytics Vidhya · Oct 3, 2019

Fig 1: High Dimensional Space and the Curse of Dimensionality

High-Dimensional Space and the Surprising Behaviour of Most Distance Metrics in That Space

I have always been enthralled by the concept of high-dimensional space and its repercussions in the field of Machine Learning. As the dimensionality of the feature space increases, the number of possible configurations grows exponentially, and so the fraction of configurations covered by any single observation shrinks. That explanation belongs to the degrees-of-freedom school of thought; now let us look at the curse of dimensionality from a different perspective, the distance-metric school of thought.

Fig 2 : Euclidean Distance and Supervised Learning

Our intuitions, which are developed mostly from two- and three-dimensional visualisations, often do not carry over to high-dimensional spaces. In higher dimensions, most of the mass of a multivariate Gaussian distribution is not near the mean but in an increasingly distant “shell” around it. Likewise, if a constant number of examples is distributed uniformly in a high-dimensional hypercube, beyond some dimensionality most examples are closer to a face of the hypercube than to their nearest neighbour. Both phenomena are explained aptly in [1].

In high-dimensional space, Euclidean distance loses much of its meaning. What happens is that the pairwise distances between points approach a roughly constant value, so it becomes extremely hard to differentiate or cluster high-dimensional data points. It therefore becomes necessary to project the data onto a lower-dimensional manifold to avoid the curse of dimensionality.
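A quick simulation makes this concrete. The sketch below (illustrative only, not taken from the original post) draws 100 uniform points in a d-dimensional hypercube and prints the relative contrast between the farthest and nearest neighbour of one point; it shrinks steadily as d grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# For a fixed number of points, the relative contrast
# (max distance - min distance) / min distance shrinks as the
# dimension grows: pairwise Euclidean distances concentrate.
for d in [2, 10, 100, 1000, 10000]:
    points = rng.uniform(size=(100, d))
    # Euclidean distances from the first point to all the others
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d = {d:5d}   relative contrast = {contrast:.3f}")
```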

Dimensionality Reduction in Supervised Learning and the Saviour!

In such scenarios, Principal Component Analysis (PCA) is the most commonly used approach for reducing the dimensionality of the data while retaining as much as possible of the variation present in the dataset.

Fig 3: Principal Component Analysis and Feature Space

Having said that, one of the major shortcomings of PCA and the principal component directions (PCs) is their applicability to supervised learning problems, as discussed in our previous blog, ‘Risks and Caution on applying PCA for Supervised Learning Problems’ [2]. A common fallacy in past research is that the PC direction with the maximum eigenvalue will be the most important direction for explaining the response variable, and the one with the least eigenvalue the least important. This is not correct, since the magnitude of an eigenvalue reflects explainability in the feature space, not in the response space. Hence, reducing dimensionality via a variance decomposition of the feature space alone is not the most appropriate way to obtain the projection; instead we should use an approach whose objective function takes care of both components, i.e. explaining the variance in the feature space as well as in the response space. This is where Partial Least Square Regression comes into the picture as a saviour!
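A tiny simulated example (hypothetical data, just to illustrate the fallacy) makes the point: below, almost all of the feature-space variance sits in x1, yet the response depends only on x2, so the top-eigenvalue PC is nearly useless for explaining y.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
n = 1000

# x1 carries almost all of the variance, x2 very little,
# but the response depends only on x2.
x1 = rng.normal(scale=10.0, size=n)
x2 = rng.normal(scale=1.0, size=n)
y = 3.0 * x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([x1, x2])
scores = PCA(n_components=2).fit_transform(X)

# PC1 (largest eigenvalue) is nearly uncorrelated with y, while
# PC2 (smallest eigenvalue) is strongly correlated with it
# (the sign of the correlation may flip from run to run).
print(np.corrcoef(scores[:, 0], y)[0, 1])
print(np.corrcoef(scores[:, 1], y)[0, 1])
```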

Partial Least Square — Intuition and Understanding

As discussed in the section above, the major problem is that a rotation and dimensionality reduction chosen to explain the maximum variation in X is not guaranteed to yield latent features that are good for predicting Y. Hence, the basic objective of Partial Least Square Regression (PLSR) is to project the data into a latent-variable space in such a way that the covariance between the feature space X and the response Y is maximised.

Before starting the process, the response variable Y should be centred and the feature space X standardised, so that there are no magnitude-related effects on the resulting components.
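In code, this preprocessing step is just the following (toy arrays, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # toy feature matrix, n x p
y = rng.normal(size=100)        # toy response

y_centred = y - y.mean()                         # centre the response
X_std = (X - X.mean(axis=0)) / X.std(axis=0)     # standardise each column of X
```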

So, for the first latent vector, we search for a vector t = Xw such that

Fig 4: Objective Function for PLS Regression

So, from the above equation, w is the unit vector that maximises the covariance between Xw and Y.
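Since the objective in Fig 4 is an image, here it is written out (a reconstruction from the surrounding description, so the notation may differ slightly from the figure):

```latex
\hat{w} \;=\; \underset{\lVert w \rVert = 1}{\arg\max}\; \operatorname{Cov}(Xw,\, Y)
```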

Fig 5: Optimal Direction for w

So, to maximise the covariance, w should be in the direction of XᵀY.
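Written out explicitly (again reconstructed from the text rather than copied from the figure), the optimal weight vector is XᵀY rescaled to unit length, and the first latent score follows from it:

```latex
w \;=\; \frac{X^{\top}Y}{\lVert X^{\top}Y \rVert}, \qquad t \;=\; Xw
```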

Fig 6: New Latent Feature u

The intuitive understanding of the above equation is that we project the response variable Y onto each feature vector Xj, observe how much of the variation in Y each Xj alone can explain, and then add up these contributions to form the first latent feature vector t.

So, the basic idea is: take Y, find its projection along X₁ and its projection along X₂, and the resulting direction, the sum of the two, is the first PLS direction.
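Componentwise (with X standardised and Y centred as described above), this reads as follows; each weight is proportional to the covariance of the corresponding feature with Y:

```latex
t \;=\; Xw \;=\; \sum_{j=1}^{p} w_j X_j, \qquad w_j \;\propto\; \langle X_j,\, Y \rangle
```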

Now, the next step is to regress Y on the first PLS component t and obtain the coefficient θ̂.
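For a single regressor this is just the usual least-squares ratio (written out here for completeness):

```latex
\hat{\theta} \;=\; \frac{\langle t,\, Y \rangle}{\langle t,\, t \rangle}
```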

Fig 7: Orthogonalisation and new Feature space for next steps

Then the feature space is orthogonalised with respect to the first PLS component t, and Xj′ is obtained for each j; this becomes the new feature space for the next step of the algorithm. The same process is then repeated on the new feature space to obtain the next component t′.
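The orthogonalisation (deflation) step in Fig 7 amounts to removing from each feature its projection onto t (again a reconstruction from the description):

```latex
X_j' \;=\; X_j \;-\; \frac{\langle t,\, X_j \rangle}{\langle t,\, t \rangle}\, t
```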

The major advantage of this process is that the components are orthogonal to each other, and hence the regression of Y on them can be carried out with separate univariate regressions.
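Putting the steps together, here is a minimal NIPALS-style sketch of the procedure for a single response. It assumes X and y have already been preprocessed as described earlier; the function and variable names are mine, not the post's.

```python
import numpy as np

def pls1(X, y, n_components=2):
    """Minimal NIPALS-style PLS1 for a centred response y and a
    standardised feature matrix X of shape (n, p)."""
    X = X.astype(float).copy()
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    T = np.zeros((n, n_components))    # latent scores t
    W = np.zeros((p, n_components))    # weight vectors w
    theta = np.zeros(n_components)     # coefficients of y regressed on each t

    for k in range(n_components):
        w = X.T @ y                               # direction of X^T y
        w /= np.linalg.norm(w)                    # unit weight vector
        t = X @ w                                 # latent score t = Xw
        theta[k] = (t @ y) / (t @ t)              # univariate regression of y on t
        X = X - np.outer(t, (t @ X) / (t @ t))    # orthogonalise (deflate) X w.r.t. t
        T[:, k] = t
        W[:, k] = w
    return T, W, theta
```

Calling pls1(X_std, y_centred, n_components=1) on the preprocessed toy arrays from the earlier snippet returns the first score vector, weight vector and regression coefficient.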

The above was a very brief mathematical and intuitive explanation of the power of dimensionality reduction in supervised learning using PLS components. Now let’s validate the above hypothesis with data.

Partial Least Square Regression — Validation of Hypothesis

Let’s now run a simulation to get a geometric understanding of the mathematical intuition. The explanation is illustrated with a simulated two-dimensional feature space (X) and a single response variable, so that the hypothesis is easy to understand visually.

Fig 8: Univariate and Bivariate plots for the simulated variables X1 and X2

Our objective is to show that, for supervised problems, the PLS component is the recommended dimensionality reduction technique rather than a principal-component-based method, because it incorporates the response variable space as well; y has been simulated with that in mind. Refer to our blog [2] for a detailed explanation.

So, in this example we will project our feature space onto the 1st PLS component and compare the results with the same exercise on the 1st principal component, to show the former's superiority.
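The figures below come from the author's own simulation; the snippet that follows is a small, self-contained stand-in (hypothetical data, not the original) that reproduces the same comparison with scikit-learn. On data like this, where y is driven by the low-variance direction, the first PLS component recovers most of the variation in y while the first principal component explains almost none of it.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 1000

# Two correlated features; y depends on the minor-variance
# direction X1 - X2 rather than on the major axis X1 + X2.
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[1.0, 0.8], [0.8, 1.0]],
                            size=n)
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=n)

# One-component PLS: the projection uses information about y.
pls = PLSRegression(n_components=1).fit(X, y)
print("R^2 using the 1st PLS component      :", round(pls.score(X, y), 3))

# One-component PCA followed by a univariate regression on the score.
z = PCA(n_components=1).fit_transform(X)
ols = LinearRegression().fit(z, y)
print("R^2 using the 1st principal component:", round(ols.score(z, y), 3))
```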

Fig 9: Regression of Y on the 1st PLS component

As can be seen, PLS1 is able to explain Y with reasonably good accuracy, close to what the full span of X can capture. Let's see the results for the 1st principal component on the same dataset.

Fig 10: Regression of Y on the 1st Principal Component

The results shown in Figs 9 and 10 validate our hypothesis about preferring PLSR in supervised problems over dimensionality reduction techniques based purely on the feature space.

Partial Least Square Regression — Diversity in Application

Apart from the advantages mentioned above, the PLS algorithm has applications in regression, classification, variable selection, and survival analysis, covering genomics, chemometrics, neuroinformatics, process control, computer vision, econometrics, environmental studies, and so on. The paper ‘Application of Partial Least-Squares Regression in Seasonal Streamflow Forecasting’ [4] shows its applicability in that field as well. On a concluding note, PLS components are much more reliable and applicable in supervised learning scenarios.

Final Thoughts

I hope this post helped you understand the curse of dimensionality from a distance-metric point of view, why it is important to reduce the number of dimensions, and the mathematics and intuition behind using a Partial Least Square based approach to reduce the dimensionality of the feature space in the presence of a response variable.

If you have any thoughts, comments or questions, please leave a comment below or contact me on LinkedIn.

Happy reading :)

References

[1] Charu C. Aggarwal, Alexander Hinneburg, and Daniel A. Keim, ‘On the Surprising Behavior of Distance Metrics in High Dimensional Space’.

[2] Souradip Chakraborty, Amlan Jyoti Das, and Sai Yashwanth, ‘Risks and Caution on applying PCA for Supervised Learning Problems’ (https://towardsdatascience.com/risks-and-caution-on-applying-pca-for-supervised-learning-problems-d7fac7820ec3).

[3] Bob Collins, presentation on Partial Least Square Regression (http://vision.cse.psu.edu/seminars/talks/PLSpresentation.pdf).

[4] Shalamu Abudu, J. Phillip King, and Thomas C. Pagano, ‘Application of Partial Least-Squares Regression in Seasonal Streamflow Forecasting’.
