Dimension Reduction using Isomap
Something you need for nonlinear data
Real-world data is never easy. Till now, I used to apply PCA for dimension reduction on every sort of problem without paying attention to the basic assumption it makes, i.e. that features have a linear relationship (something like X = a + bY).
But is it true for every problem?
Not at all. We are very likely to observe situations like
X = a + bY²
X = a + bY³, and many other versions…
There exist 2 families of dimension reduction techniques, based on the way data is projected:
- When the data is projected linearly after dimension reduction (as we do in PCA).
- When the data is projected nonlinearly after dimension reduction (as in Isomap). This class of algorithms is also called manifold learning algorithms.
Now, if the data is nonlinear (i.e. the correlation between variables/features is nonlinear), representing it in a nonlinear way preserves much more information than a linear representation, & hence the 2nd class of algorithms is important.
Before moving on, let us understand a few basic concepts:
Geodesic distance
Do look at the images below. As you must have gathered, geodesic distance is the distance between 2 points following the path actually available between them, whereas Euclidean distance has no path constraint to follow: it is simply the length of the straight line from point 'a' to point 'b'.
Double-centering a matrix
It means to transform a matrix such that
- mean for any row = 0
- mean for any column = 0
How can this be done?
For a given matrix Z of dimension 3x3, prepare
- Matrix X, whose i-th row is filled with the mean of Z's i-th row, & Matrix Y, whose j-th column is filled with the mean of Z's j-th column
- Double-centered matrix for Z = Z - X - Y + Mean(Z), where Mean(Z) is the grand mean over all entries of Z
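Here's a minimal NumPy sketch of the same recipe (the 3x3 matrix Z is made up for illustration):

```python
import numpy as np

# Made-up 3x3 matrix for illustration
Z = np.array([[1.0, 2.0, 3.0],
              [4.0, 6.0, 8.0],
              [7.0, 10.0, 13.0]])

X = Z.mean(axis=1, keepdims=True)  # row means ("Matrix X"), broadcast across columns
Y = Z.mean(axis=0, keepdims=True)  # column means ("Matrix Y"), broadcast across rows

Z_dc = Z - X - Y + Z.mean()        # Z - X - Y + Mean(Z)

print(Z_dc.mean(axis=1))  # ~0 for every row
print(Z_dc.mean(axis=0))  # ~0 for every column
```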
Dissimilarity matrix
It is a matrix that represents the dissimilarity between points in a dataset. This dissimilarity can be calculated using any measure, though the most common one is the distance between points: the greater the distance, the more dissimilar the samples are. The image below can be taken as an example of a dissimilarity matrix (using distance as the basis for dissimilarity).
Here, the distance between A→A = 0, A→B = 16, C→D = 40 & likewise for the other pairs.
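A quick sketch of building such a matrix with SciPy; the 2-D coordinates for A-D are invented, so the values won't match the numbers above:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Invented 2-D coordinates for points A, B, C, D
points = np.array([[0.0, 0.0],    # A
                   [3.0, 4.0],    # B
                   [6.0, 8.0],    # C
                   [9.0, 12.0]])  # D

D = squareform(pdist(points, metric="euclidean"))
print(D)  # diagonal is 0; the larger the entry, the more dissimilar the pair
```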
Let's move to
Isomap/Isometric mapping
It is a manifold learning algorithm that tries to preserve the geodesic distance between samples while reducing the dimension. But first, let's have some data which might look like the image below.
Here, the data forms a spiral shape (nonlinear). The red line traces the geodesic distance between x1 & x2 while the blue line represents the Euclidean distance.
This must have given an idea of why Euclidean distance isn't preferred: it doesn't really tell us how far apart x1 & x2 are in a nonlinear space, while the geodesic distance does.
But how to calculate the geodesic distance between x1 & x2?
For this, we follow the first 2 steps of the full Isomap pipeline:
1. Run KNN (using Euclidean distance) to form a neighborhood graph/adjacency matrix.
Why?
If you look at the image closely: if I determine n (say n = 3-4) neighbors for point x1 using Euclidean distance, most/all neighbors for x1 will look something like the red circles below, i.e. they follow the spiral (the nonlinearity of the data shows up in the neighbors easily),
& hence if we figure out 'n' neighbors for all points & form a 'neighborhood' graph/adjacency matrix, this graph can be used for measuring the geodesic distance between 2 points easily.
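Here's a sketch of this step with scikit-learn, on a made-up spiral dataset (the sample size & n_neighbors=5 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Made-up 2-D spiral data
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(1, 4 * np.pi, 300))
X = np.column_stack([t * np.cos(t), t * np.sin(t)])

# mode="distance" keeps Euclidean edge weights, which step 2 needs
graph = kneighbors_graph(X, n_neighbors=5, mode="distance")
print(graph.shape)  # (300, 300) sparse adjacency matrix
```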
2. Calculate geodesic distance using the above graph.
If I apply any shortest-path algorithm (like Dijkstra) to the above weighted graph, it gives us the geodesic distance between any two points.
Consider x1 & x2 above. The shortest path between them passes through the points falling on the red spiral between x1 & x2, and its length is the sum of the weights of the edges lying on that path, hence giving us the (approximate) geodesic distance between x1 & x2.
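A sketch of this step using SciPy's Dijkstra implementation (the spiral data & graph from the previous snippet are rebuilt here so it runs on its own):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

# Same made-up spiral & neighborhood graph as in step 1
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(1, 4 * np.pi, 300))
X = np.column_stack([t * np.cos(t), t * np.sin(t)])
graph = kneighbors_graph(X, n_neighbors=5, mode="distance")

# Dijkstra over the weighted graph approximates geodesic distances
geodesic = shortest_path(graph, method="D", directed=False)
print(geodesic[0, -1])  # approximate geodesic distance between the two far ends
```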
3. Form a dissimilarity matrix using the above-calculated geodesic distance between points.
4. Square the dissimilarity matrix, double-center it & multiply by -1/2 (the classical MDS recipe).
5. Run eigendecomposition on the result & choose the top 'k' eigenvectors. This is similar to what we do in PCA after calculating the correlation matrix.
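Here's a sketch of steps 3-5 in NumPy, assuming `geodesic` is the distance matrix from the previous snippet & k is the target dimension:

```python
import numpy as np

def embed(geodesic, k=2):
    n = geodesic.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    B = -0.5 * J @ (geodesic ** 2) @ J   # squared, double-centered, scaled by -1/2
    vals, vecs = np.linalg.eigh(B)       # eigendecomposition (B is symmetric)
    top = np.argsort(vals)[::-1][:k]     # keep the k largest eigenvalues
    # scale eigenvectors by sqrt of eigenvalues to get the k-dimensional embedding
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0))
```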
Done !!!
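In practice, you rarely hand-code these steps; scikit-learn's sklearn.manifold.Isomap wraps the whole pipeline. A quick sketch on the classic swiss-roll dataset:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# All 5 steps (KNN graph, shortest paths, double-centering, eigendecomposition) in one call
X_2d = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(X_2d.shape)  # (1000, 2)
```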
Are we left with anything?
Yes
A few drawbacks always exist:
- This version of Isomap is computationally heavy, though other, comparatively light versions exist.
- Parameter tuning for KNN is important, as a wrong selection of 'n' can be devastating. We can also use radius nearest neighbors for forming the neighborhood graph.
& with this, it's a wrap !!