Dimensionality Reduction by Stochastic Gradient Descent

Subodh Lonkar · Published in Analytics Vidhya · 8 min read · Aug 27, 2020


This post demonstrates the use of Stochastic Gradient Descent for Dimensionality Reduction.

What is Dimensionality Reduction?

Dimensionality reduction is the process of reducing a potentially large set of features F to a smaller set of features F’ to be considered in a given machine learning or statistics problem.

In an unsupervised setting, dimensionality reduction is often used for exploratory data analysis, for example to visualize the distribution of high dimensional data in human-digestible two or three dimensions. In a supervised setting, the main use is to reduce the number of parameters a learning machine has to determine. In other words: The goal of dimensionality reduction is to overcome the curse of dimensionality.

A straightforward approach, feature selection, is to select only certain features, i.e., to choose a subset F’ of F according to some optimality criterion, e.g., the information content of the features in F’ with regard to the data. More general approaches find F’ by mapping the feature space Fs defined by F into a lower-dimensional space Fs’ corresponding to F’.

Here, we will focus on the latter category in a supervised setting. In particular, we present a method to learn a linear mapping that approximately maximizes the distance between classes.

What are the Benefits of applying Dimensionality Reduction on our Dataset?

  1. Fewer dimensions mean less computation time.
  2. It helps remove redundant features; removing redundant features takes care of multicollinearity.
  3. Higher dimensions may lead to the problem of the “Curse of Dimensionality”.
  4. Data with many dimensions is difficult to visualize.
  5. Fewer dimensions mean less storage space for the data.
  6. Excessive dimensions may cause a model to pick up noise in the training data and fail to generalize to the test data. That is, it might lead to overfitting.

Related Work

Arguably the two most well known dimensionality reduction techniques are the unsupervised principal component analysis (PCA) and Fisher’s supervised linear discriminant analysis (LDA).

Briefly, PCA finds a linear mapping so that the squared projection error is minimized. The effect is that most of the data’s variance in the original feature space is also represented in the reduced space Fs’. While PCA does preserve variance, it does not necessarily preserve class separability in the projected space.
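As a small illustrative sketch, scikit-learn’s PCA can project toy data onto its directions of maximum variance (the data, shapes, and variable names below are arbitrary placeholders, not part of the method described here):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 200 samples with 10 correlated features (stand-in for a real dataset).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))

# Keep the 2 directions of maximum variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (200, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```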

LDA, on the other hand, finds a projection so that the scatter between samples of different classes is maximized while the scatter between samples of the same class is minimized. The projection does not necessarily represent the high-dimensional variation of the data well, but it does preserve the information necessary to separate the classes. However, LDA makes the implicit assumption that the data of a given class are normally distributed. Coupled with the restriction that the dimensionality of Fs’ is at most one less than the number of classes C, dim(Fs’) ≤ C − 1, this means that too much information may be discarded and the classes may no longer be separable in Fs’.
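A similarly small sketch with scikit-learn’s LinearDiscriminantAnalysis on the Iris dataset shows the C − 1 restriction in practice (with C = 3 classes, at most 2 components are possible):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 4 features, C = 3 classes

# LDA can project to at most C - 1 = 2 dimensions.
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)
print(X_reduced.shape)  # (150, 2)
```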

There are also nonlinear methods that project the data onto a low dimensional manifold embedded in F. Examples of such methods include multidimensional scaling, Isomap, local Fisher discriminant analysis and t-distributed stochastic neighbor embedding.

Dimensionality reduction also has close ties to metric learning, where the goal is to learn a distance function from a given dataset. The metric is then used in a subsequent distance-based classifier such as nearest centroid or k-nearest neighbor. Examples of metric learning methods include large margin nearest neighbor and information-theoretic metric learning.

Some other popular Dimensionality Reduction Techniques

t-distributed Stochastic Neighbor Embedding (t-SNE)

Neural Autoencoder

Forward Feature Selection

Backward Feature Elimination

Ratio of missing values

Low variance in column values

High correlation between two columns

Methods

Our method can be formalized as follows: Given a set of n-dimensional training data D ⊂ R^n and a partition of this set into two disjoint classes X and Y, the goal is to find a linear projection x’ = Ax into an m-dimensional space, where m << n, so that class separation is best preserved in the lower-dimensional space. The method can also easily be extended to more than two classes.

In other words, the goal is to find a matrix A ∈ R^(m×n) so that the squared Euclidean distance of two reduced feature vectors,

d_A(x, y) = ||Ax − Ay||² = (x − y)ᵀ AᵀA (x − y),

is large if x and y belong to different classes, and small otherwise. Note that this goal is very similar to that of metric learning. However, here we are explicitly interested in the dimensionality reduction itself, so that subsequent learning algorithms do not have to be distance-based. One could use large margin nearest neighbor or information-theoretic metric learning and decompose the learned metric matrix M = AᵀA, but there is no guarantee that the resulting dimensionality reduction is optimal in the sense of our goal.
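As a minimal sketch, this distance can be computed directly with NumPy (the matrix A and the samples below are random placeholders; n = 10 original features and m = 2 reduced features are arbitrary choices):

```python
import numpy as np

def projected_sq_distance(A, x, y):
    """Squared Euclidean distance ||Ax - Ay||^2 between two projected samples."""
    diff = A @ (x - y)
    return float(diff @ diff)

# Toy shapes (assumptions): n = 10 original features, m = 2 reduced features.
rng = np.random.default_rng(0)
A = rng.normal(size=(2, 10))
x, y = rng.normal(size=10), rng.normal(size=10)
print(projected_sq_distance(A, x, y))
```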

We instead take inspiration from LDA and find the projection matrix A = argmin_A J(A) that minimizes the distance between feature vectors of the same class while simultaneously maximizing the distance between samples of different classes, measured by the unweighted ratio of intra-class to inter-class distance (like LDA),

J(A) = ( Σ_{x,x’ ∈ X} ||Ax − Ax’||² + Σ_{y,y’ ∈ Y} ||Ay − Ay’||² ) / Σ_{x ∈ X, y ∈ Y} ||Ax − Ay||².    (1)
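A direct, unoptimized reading of equation 1 in NumPy might look as follows (the function and variable names are illustrative; the two classes are passed as row-wise sample matrices):

```python
import numpy as np

def objective(A, X, Y):
    """Ratio of intra-class to inter-class squared distances after projection by A.
    X has shape (n_x, n), Y has shape (n_y, n); rows are samples of the two classes."""
    PX, PY = X @ A.T, Y @ A.T  # projected samples, shape (n_x, m) and (n_y, m)

    def sum_pairwise_sq(P, Q):
        diff = P[:, None, :] - Q[None, :, :]  # all pairwise differences
        return np.sum(diff ** 2)

    intra = sum_pairwise_sq(PX, PX) + sum_pairwise_sq(PY, PY)
    inter = sum_pairwise_sq(PX, PY)
    return intra / inter
```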

Unfortunately, a closed-form solution of the above minimization problem is not readily available. Therefore, we choose to optimize it iteratively instead. In particular, we employ the well-known method of gradient descent to find a (local) minimum,

A_{t+1} = A_t − η_t G_t,

where A_t and A_{t+1} denote the solution candidate and its update at time step t, G_t := ∇_A J(A_t) denotes the gradient of the target function with respect to A, and the hyper-parameter η_t is the learning rate at step t.
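A bare-bones sketch of this update loop, assuming a helper grad_J that returns the gradient of equation 1 at the current candidate (obtained, e.g., analytically or via automatic differentiation):

```python
def gradient_descent(A0, grad_J, learning_rate=1e-3, n_steps=1000):
    """Plain gradient descent; grad_J(A) is assumed to return the gradient of J at A."""
    A = A0.copy()
    for _ in range(n_steps):
        G = grad_J(A)               # G_t = gradient at the current candidate A_t
        A = A - learning_rate * G   # A_{t+1} = A_t - eta_t * G_t
    return A
```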

Stochastic Gradient Descent

Stochastic gradient descent (SGD) has become a popular tool to speed up the learning process of deep neural networks. The main insight of SGD is that, due to the linearity of the differential operator ∇_A, the gradient of an objective function that sums over several sub-goals, J(A) = Σ_{i=1..N} J_i(A), can be written as

∇_A J(A) = Σ_{i=1..N} ∇_A J_i(A).

In empirical risk minimization, one of the main pillars in machine learning, the objective function typically takes this form, where each sub-goal corresponds to minimizing the loss on one single training example.

SGD utilizes this structure by approximating the gradient with a random subsample of the training data at each step of the gradient descent,

G_t ≈ Σ_{k=1..K} ∇_A J_{σ_t(k)}(A_t),

where K << N is the number of samples to consider at each step and σ_t is a random permutation of the set {1, . . . , N}. The intuition is that, while the individual approximations of the gradient do not point in the same direction as the true gradient, the approximation error averages out over a large number of iterations. SGD requires more iterations to converge than gradient descent, but each iteration is much less computationally expensive. Thus, overall convergence is generally faster, especially with large datasets or when the J_i are expensive to compute.

Unfortunately, the objective function in equation 1 does not have the required structure. Both the numerator and the denominator, however, do. This suggests that the general idea is applicable regardless: At each step of the gradient descent, we sample a small subset of the data such that an equal number of samples is drawn from both X and Y. The gradient G_t is approximated using only these samples.
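A sketch of one such stochastic step, assuming a hypothetical helper grad_J_batch that evaluates the gradient of equation 1 on the mini-batch only:

```python
import numpy as np

def sgd_step(A, X, Y, grad_J_batch, k=16, learning_rate=1e-3, rng=None):
    """One SGD step on a balanced mini-batch: k samples from X and k from Y.
    grad_J_batch(A, X_batch, Y_batch) is assumed to return the gradient of the
    objective evaluated on the sub-sampled data only."""
    rng = rng or np.random.default_rng()
    xb = X[rng.choice(len(X), size=k, replace=False)]  # k samples from class X
    yb = Y[rng.choice(len(Y), size=k, replace=False)]  # k samples from class Y
    G = grad_J_batch(A, xb, yb)                        # approximate gradient G_t
    return A - learning_rate * G
```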

AdaGrad

Stochastic gradient descent, like regular gradient descent, is very susceptible to the choice of learning rate. If the learning rate is too large, good solutions may be missed; if it is too small, the algorithm will converge very slowly. Ideally, the learning rate should be large in the beginning, but small when the intermediate solution is near an optimum, in order not to oscillate around it.

AdaGrad defines a learning rate schedule that takes previous parameter updates into account. The learning rate for the (ik)-th component of the gradient is chosen as

η_{t,ik} = η / sqrt( Σ_{τ<t} g_{τ,ik}² ),

where η is a global learning rate, τ ranges over the time steps before t, and g_{τ,ik} denotes the (ik)-th component of the historical gradient G_τ. The intuition is that larger updates result in smaller learning rates in the future, which effectively smooths out extreme parameter updates due to locally steep gradients. Much like with the momentum method or the Newton-Raphson method, each parameter a_ik is associated with a different learning rate schedule, which means that the progress along each dimension of the parameter space evens out over time. A drawback is that, depending on the global learning rate, the learning rates eventually approach zero and the optimization may stop before a minimum is reached.
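A minimal sketch of the AdaGrad update (the small constant eps is a common addition to avoid division by zero and is not part of the schedule above):

```python
import numpy as np

def adagrad_step(A, G, hist, eta=0.1, eps=1e-8):
    """One AdaGrad update: the per-parameter learning rate shrinks as squared
    gradients accumulate in `hist`. `eps` guards against division by zero."""
    hist = hist + G ** 2                       # accumulate g_{tau,ik}^2
    A = A - eta / (np.sqrt(hist) + eps) * G    # per-component learning rate
    return A, hist
```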

Conclusion

We have presented a method for linear dimensionality reduction. The method takes inspiration from both PCA and Fisher LDA and tries to minimize the pairwise distances between samples of the same class while simultaneously maximizing the distance between samples of different classes. Instead of providing a closed form solution, we chose to employ the well known gradient descent optimization algorithm. In order to speed up computation, we used stochastic gradient descent with an AdaGrad learning rate schedule.

The biggest issue with our approach is its bias toward effectively reducing the features to a one-dimensional feature space. We plan to tackle this issue by re-weighting the terms of the objective function and by introducing non-linearities into the projection. As this will result in a considerably more complicated gradient, we will evaluate the use of automatic differentiation tools.

Another interesting research question concerns the use of step-wise dimensionality reduction: Instead of a direct reduction from n to m dimensions, the algorithm would first reduce the n features to n − 1 features, then the intermediate n − 1 features to n − 2 features, etc., until the desired number of features is reached. The hope is that each of the intermediate reductions is easier to solve well than the full dimensionality reduction — especially if non-linearities are introduced as well.
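As a rough sketch of this idea, assuming a hypothetical helper learn_step(d) that learns a single (d−1) × d reduction, the intermediate projections could simply be composed:

```python
import numpy as np

def stepwise_reduce(learn_step, n, m):
    """Compose one-dimension-at-a-time reductions: n -> n-1 -> ... -> m.
    learn_step(d) is assumed to return a learned (d-1) x d projection matrix."""
    A_total = np.eye(n)
    for d in range(n, m, -1):
        A_step = learn_step(d)      # (d-1) x d
        A_total = A_step @ A_total  # overall projection so far: (d-1) x n
    return A_total                  # final shape: m x n
```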

Thank you!

References

J. B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, 1964.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

Kilian Q Weinberger, John Blitzer, and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in neural information processing systems, pages 1473–1480, 2005.

Matthias Richter. Guided linear dimensionality reduction by stochastic gradient descent.

Joshua B Tenenbaum, Vin de Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

If you like this article, don’t forget to leave a “clap”!

Thank you for your time.

https://www.linkedin.com/in/learner-subodhlonkar13/

http://learner-subodh.github.io
