Random Projection for Dimension Reduction

Moving beyond PCA

Mehul Gupta
Data Science in your pocket
Aug 10, 2020


courtesy: https://www.groundai.com/project/random-projections-of-mel-spectrograms-as-low-level-features-for-automatic-music-genre-classification/1

In my last post, I discussed some issues with PCA & how they can be resolved. Still, there are many problems where most dimension reduction techniques struggle. Let me point out a few:

  • Time complexity matters. For example, PCA on a matrix of size n×k takes 𝑂(k²×n + k³) time. If k is huge (say 10,000), PCA can be as slow as a snail. This problem exists with many other dimension reduction methods as well.
  • Many techniques can get misled when outliers are present. Did you know PCA itself is quite sensitive to outliers? Check this out.

On digging deeper, I read about a simple yet powerful algorithm, i.e. Random Projection.

Have you ever heard of it?

If not, this post is for you

But before moving on, why should one pick it up?

Very fast. Time complexity = 𝑂(n×k×d), where k & d are the original & reduced dimensions respectively, with k>d (pretty obvious). This is way lower than that of PCA.

Robust to outliers.

Now that we know why Random Projection is worth learning, let's understand a major concept first.

Johnson-Lindenstrauss Lemma

It states that we can transform high-dimensional data (the dimension has to be very high) into lower-dimensional data while nearly preserving the distance between any two points.

Consider this

We have a space S of dimension k (say 2600).

Now, let

point1 = (a1, b1, c1, d1, …, z100), point2 = (aa1, bb1, cc1, dd1, …, zz100),

& let the distance between point1 & point2 be ‘p’ units in space S.

Now,

These points can be reduced to a new space S1 of dimension ‘m’ (say 1300), where m<k, such that

point1 = (a’1, b’1, c’1, …, z’50) & point2 = (aa’1, bb’1, cc’1, …, zz’50)

in the new space S1.

Now, by the Johnson-Lindenstrauss lemma,

the distance between point1 & point2 in S1 remains approximately ‘p’ units & is hence nearly preserved.

Can this lemma be useful for dimension reduction?

Definitely yes!

If the distances between samples are preserved, the relative distinctiveness between samples is preserved. This makes the lemma very useful for dimension reduction, & especially powerful when using discriminative models (logistic regression, SVM, etc.).
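Just to see the lemma in action, here is a tiny sanity check with NumPy (an illustrative sketch of my own, using the 2600 → 1300 dimensions from the example above; the 1/√m scaling is the standard trick that keeps distances preserved on average):

import numpy as np

rng = np.random.default_rng(0)

k, m = 2600, 1300                      # original & reduced dimensions (space S & S1)
point1 = rng.normal(size=k)
point2 = rng.normal(size=k)

# random projection matrix, scaled by 1/sqrt(m) so distances are preserved on average
R = rng.normal(size=(k, m)) / np.sqrt(m)

print(np.linalg.norm(point1 - point2))            # distance 'p' in S
print(np.linalg.norm(point1 @ R - point2 @ R))    # nearly the same distance in S1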

The entire idea of Random Projection is based on the above lemma. The implementation is as simple as a “Hello World” program:

Random Projection Algorithm

  1. Take a dataset K of dimension M x N (M = samples, N = original dimensions/features).
  2. Initialize a random 2D matrix R of size N x D, where D = the new, reduced dimension.
  3. Normalize the columns of R, making them unit-length vectors.
  4. Matrix multiplication: J = K × R. J is the final output, with dimension M x D.

As easy as that !!
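If you want to see those 4 steps in code, here is a minimal NumPy sketch (the dataset K here is just random toy data I made up for illustration):

import numpy as np

rng = np.random.default_rng(42)

M, N, D = 500, 2000, 300            # samples, original dimension, new dimension
K = rng.normal(size=(M, N))         # step 1: dataset K of size M x N (toy data)

R = rng.normal(size=(N, D))         # step 2: random matrix R of size N x D
R = R / np.linalg.norm(R, axis=0)   # step 3: normalize columns to unit length

J = K @ R                           # step 4: project — J has dimension M x D
print(J.shape)                      # (500, 300)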

I missed some important points

  1. Can Random Projection be used for any dimension reduction problem?

Ideally, Random Projection should be used when we have a huge number of features (say 1k–2k) & hence need a speed boost. Otherwise, PCA is still the king.

2. Can D (the new dimension) be any number from 1 to N (the original dimension)?

No. It must be kept in mind that Random Projection works best when we reduce a high-dimensional space to a ‘medium’-dimensional space, & it will lead to a complete disaster when the chosen D is very small (like 1, 2, 3, etc.) while N is very high, say 1000.

3. How to determine the size of the new dimension D?

Multiple formulae have been published in different papers. You can pick any one of them !!

D > 9 × (ε² − 2ε³/3)⁻¹ × log(M) + 1

Or

D ≥ 4 × (ε²/2 − ε³/3)⁻¹ × log(M)

where M = total samples, ε = error rate, D = new dimension. The above inequalities state that D has to be at least as large as the value on the right-hand side.

A few facts we can draw from any of these equations:

  1. Interestingly, the new dimension D depends on the number of samples M rather than on the original dimension N.
  2. These equations can help us know whether Random Projection can be used efficiently for a given dataset. If D (calculated using any of the above formulae) > N (the original dimension), we should avoid using Random Projection.
  3. The error rate ε is a number between 0 & 1 (set by the user) which determines how much distortion is tolerable in the transformed data. Keep in mind that as ε goes up, D goes down. Hence, if we are OK with more error in the transformed data, D can be very low.
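By the way, sklearn ships the second bound as johnson_lindenstrauss_min_dim, so you don't have to compute it by hand. A quick illustrative sketch (the sample count & error rates below are just example values):

from sklearn.random_projection import johnson_lindenstrauss_min_dim

# minimum safe D for 10,000 samples at two different error rates
print(johnson_lindenstrauss_min_dim(n_samples=10_000, eps=0.1))   # several thousand dimensions
print(johnson_lindenstrauss_min_dim(n_samples=10_000, eps=0.5))   # far fewer, since more error is tolerated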

Random Projection is of 2 types, depending upon the way the random matrix R (as mentioned in step 2 of the algorithm) is initialized:

Gaussian Random Projection

Here, all elements of R (the random matrix of size N x D) are initialized using a Normal distribution with mean = 0 & variance = 1/D (i.e. standard deviation = 1/√D). Everything else remains the same.
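If you'd rather not rely on a library, drawing such an R yourself is a one-liner (an illustrative sketch; N & D are made-up numbers):

import numpy as np

rng = np.random.default_rng(0)
N, D = 2000, 300
# entries drawn from N(0, 1/D), i.e. standard deviation 1/sqrt(D)
R = rng.normal(loc=0.0, scale=1.0 / np.sqrt(D), size=(N, D))
# J = K @ R    # then project exactly as in the algorithm above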

Sparse Random Projection

The random matrix R is initialized as a sparse matrix using the below rule — each element of R takes one of three values:

  • −sqrt(s)/sqrt(n_components) with probability 1/(2s)
  • 0 with probability 1 − 1/s
  • +sqrt(s)/sqrt(n_components) with probability 1/(2s)

where s = sqrt(N) (N = the original dimension) & n_components = D (the new dimension).

It must be noted that Sparse Random Projection is faster than Gaussian Random Projection due to the sparsity of R.

Also, you don't need to worry about the implementation, as sklearn has prebuilt functions.
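For example (an illustrative sketch with made-up sizes), both variants live in sklearn.random_projection:

import numpy as np
from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection

X = np.random.default_rng(0).normal(size=(500, 2000))   # toy data: 500 samples, 2000 features

gauss = GaussianRandomProjection(n_components=300, random_state=0)
X_g = gauss.fit_transform(X)                             # shape (500, 300)

# density='auto' uses 1/sqrt(n_features), i.e. s = sqrt(N) as described above
sparse = SparseRandomProjection(n_components=300, density='auto', random_state=0)
X_s = sparse.fit_transform(X)                            # shape (500, 300)

print(X_g.shape, X_s.shape)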

With this, it's a wrap !!
