DAY 10: Dimensionality Reduction with PCA and t-SNE in R

SaiGayatri Vadali
6 min read · Jan 5, 2018


This article is the tenth one in the series Getting started with Data Science in 30 days using R programming. To find the other articles, and to know more about the series, refer to the series overview article.

Once we have obtained data, cleaned it, transformed it and placed it in a clear tabular format, we might find it sparse (having many columns filled with more zeros than meaningful numeric values) or with a large number of dimensions (features, columns or attributes), ranging from perhaps a few hundred to thousands. How can we visualise such huge data in three-dimensional or two-dimensional space? How can we use it for further computation? This is where dimensionality reduction comes into the picture.

Dimensionality reduction is the process of reducing the dimensions of data without losing much information. But is this even possible? We are reducing dimensions, that is, dropping or de-emphasising certain attributes, and yet we don't want any loss in the integrity of the data. Yes, it is possible.

Let’s see how we can do this.

Have a glance at the following image

(image source: Google)

Here you can see that the first image shows three-dimensional data with X, Y and Z axes. The second image shows the same data in a two-dimensional space with PC1 and PC2 as axes.

Note that PC1 and PC2 are not our regular dimensions, and we cannot name them after any of the original attributes. They are orthogonal directions along which the variance of the data is highest. We will understand more about them while dealing with PCA below.

We will discuss two main dimensionality reduction techniques today.

  1. Principal Component Analysis
  2. t-SNE

PRINCIPAL COMPONENT ANALYSIS:

Principal Component Analysis is an unsupervised learning method in which the data is transformed from a high-dimensional space into another space with a lower (or equal) number of dimensions. In this lower-dimensional space, the new dimensions are the directions along which the data varies most. Note that PCA can be applied only to numeric data.

To understand this better, consider a data set of employee data. The attributes of this data set could be employee id, salary, age and incentive, so the data set has 4 dimensions. Now, we find the direction along which this data set has the highest variance; this becomes the most important, or first principal, component in our lower-dimensional space. The direction with the next highest variance gives the second principal component, and so on. Note that these directions are orthogonal to each other.

In order to understand these directions, an understanding of eigenvectors and eigenvalues will be of use. An eigenvector of a matrix is a direction that the matrix only stretches without rotating, and the eigenvalue is the scalar factor by which it is stretched. For the covariance matrix of the data, the eigenvector with the highest eigenvalue gives the first principal component, the one with the next highest eigenvalue gives the second principal component, and so on. To know more about eigenvectors, there is an awesome YouTube lecture; refer to it (linked at the end of this article).
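Here is a minimal sketch of that idea (the employee values below are made up purely for illustration; the id column is left out since it carries no meaningful variance). Base R's eigen() function decomposes the covariance matrix, and the eigenvector with the largest eigenvalue points along the first principal component.

> emp <- data.frame(salary = c(30, 45, 52, 61, 75), age = c(25, 32, 35, 41, 50), incentive = c(3, 5, 5, 7, 9))
> e <- eigen(cov(scale(emp)))   # eigen-decompose the covariance of the standardised data
> e$values                      # variances along each principal direction, largest first
> e$vectors[, 1]                # direction of the first principal component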

Now that we have come to know about principal components, let's see a little of the math behind principal component analysis and then code it in R.

Singular Value Decomposition

Before moving to SVD itself, recall the closely related eigendecomposition: a diagonalizable square matrix can be represented in terms of its eigenvalues and eigenvectors.

If M is such a square matrix, it can be written as

M = U V Ui

where

U = the matrix whose columns are the eigenvectors of M

V = the diagonal matrix of the eigenvalues of M

Ui = the inverse of U

But our data frame is not square: it can have n rows and m columns, i.e. it is rectangular. How can we apply SVD to it? Let's assume the data frame has by now been converted into a matrix, that is, it contains homogeneous numeric data.

Singular Value Decomposition of a matrix with n rows and m columns:

When we apply SVD to a matrix A of dimensions n X m, it factorizes as

A = U Z Vt

where U is the matrix of orthogonal eigenvectors of A At,

V is the matrix of orthogonal eigenvectors of At A,

the diagonal entries of Z are the singular values of A, i.e. the square roots of the eigenvalues of At A, and

At (and similarly Vt) denotes the transpose of matrix A (and of V).
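Base R exposes this factorization directly through the svd() function. A minimal sketch, using a few numeric columns of the built-in mtcars data set as a stand-in rectangular matrix (the column choice is purely illustrative):

> A <- scale(as.matrix(mtcars[, c("mpg", "disp", "hp", "wt")]))   # an n x m centred, scaled matrix
> s <- svd(A)     # returns u, d and v with A = u %*% diag(d) %*% t(v)
> s$d             # singular values: the square roots of the eigenvalues of t(A) %*% A
> max(abs(A - s$u %*% diag(s$d) %*% t(s$v)))   # practically zero, confirming the reconstruction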

PCA using R:

Base R provides the prcomp() function to compute PCA. It centres the data so that each column has mean 0. Setting the parameter scale. to T additionally scales each column to standard deviation 1.

> data_pca <- prcomp(data, scale.=T)

Now, let's apply PCA to an actual data set, the built-in crimtab table.

> data_PCA <- prcomp(crimtab, scale. = T)
> class(data_PCA)
[1] "prcomp"

As mentioned earlier, the first component is the one with the highest variance and the second has the next highest variance. Let's view this in a plot.

> plot(data_PCA)

The first bar corresponds to the first principal component, along which the variance is highest. The decrease in variance across the remaining components can be clearly seen in the plot.
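The same information can also be read off numerically. A small usage sketch, continuing with the data_PCA object created above:

> summary(data_PCA)          # standard deviation and proportion of variance of each component
> data_PCA$rotation[, 1:2]   # loadings: how the original columns combine into PC1 and PC2
> biplot(data_PCA)           # observations and variables plotted on the first two components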

Applications of PCA:

  1. Neuroscience
  2. Business analytics
  3. Any data analytics project involving (approximately) linear data

Disadvantages of PCA:

  1. PCA applies only a rotation, i.e. a linear transformation, of the data
  2. It is not suitable for non-linear data, which can be understood from the following figure

Figure 1: spiral-shaped data; the dotted line joining its two end points is the direction PCA captures, while the curve following the spiral is what t-SNE preserves

PCA captures information only along the dotted line joining the two end points of this spiral data, which shows why it is not suitable for such data.
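A minimal sketch makes this concrete (the spiral below is generated just for illustration; it is not the article's data set). PCA on a two-dimensional spiral merely rotates the point cloud, and projecting onto the first component flattens the winding structure onto a line.

> t <- seq(0, 4 * pi, length.out = 300)
> spiral <- cbind(x = t * cos(t), y = t * sin(t))   # spiral-shaped, clearly non-linear data
> spiral_pca <- prcomp(spiral)
> summary(spiral_pca)     # variance is spread over both components; no single direction dominates
> plot(spiral_pca$x[, 1], rep(0, 300))   # the one-dimensional projection loses the spiral shape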

t-Distributed Stochastic Neighbour Embedding

Now, observe figure 1 once again. The dotted line represents PCA; it loses significant detail of the data by considering only that projection. What about the spiral line running through the data? It captures the details of the data much better. That line is what we obtain by applying the t-SNE dimensionality reduction method.

t-SNE is a dimensionality reduction technique which, like PCA, maps data from a higher-dimensional space to a lower-dimensional one, but it works from a similarity measure based on Euclidean distance between pairs of data points. In doing so, the local structure of the data is preserved, which is not the case with PCA.

Algorithmic details of t-SNE (optional):

Step 1: Let xi and xj be two points in the high-dimensional space. The similarity of data point xj to data point xi is given by the conditional probability p(j|i). For nearby points, p(j|i) is very high, and for far-apart points it is very small.

Step 2: Similarly, in the lower-dimensional space we have counterparts of xi and xj, called yi and yj respectively. We calculate their similarity probability as q(j|i).

Step 3: For the two spaces to be similar, we would need the difference between p(j|i) and q(j|i) to be zero. That is not achievable in practice, but it is the goal: t-SNE minimizes the Kullback-Leibler divergence between the conditional probabilities.

Step 4: The bandwidth sigma used in p(j|i) varies across data points; it is tuned in this step to match a user-chosen perplexity value. Perplexity is roughly the effective number of nearest neighbours of each point on the manifold. Larger datasets generally need larger perplexity values.
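As a toy sketch of step 1 (my own illustration, not code from the article), the conditional similarities p(j|i) for a handful of random points can be computed directly. For simplicity a single fixed sigma is used here, whereas real t-SNE tunes a separate sigma per point via the perplexity.

> set.seed(1)
> X <- matrix(rnorm(5 * 3), nrow = 5)   # five toy points in three dimensions
> D2 <- as.matrix(dist(X))^2            # squared Euclidean distances between all pairs of points
> sigma <- 1                            # real t-SNE tunes a separate sigma_i per point
> P <- exp(-D2 / (2 * sigma^2))
> diag(P) <- 0                          # a point is never counted as its own neighbour
> P <- P / rowSums(P)                   # row i now holds p(j|i): nearby points get higher values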

t-SNE in R:

R has a package called Rtsne. We need to install it and load it into the workspace first.

> install.packages("Rtsne")   # install once
> library(Rtsne)
> # crimtab has only 42 rows, so the perplexity must be small; check_duplicates = FALSE tolerates repeated rows
> tsne_out <- Rtsne(as.matrix(crimtab), dims = 2, perplexity = 10, check_duplicates = FALSE)
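The returned object stores the low-dimensional coordinates in its Y component, so the embedding can be inspected with a quick plot (a small usage sketch):

> plot(tsne_out$Y, xlab = "t-SNE dimension 1", ylab = "t-SNE dimension 2")   # one point per row of crimtab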

Applications of t-SNE:

  1. Facial expression recognition
  2. Medical imaging
  3. Word vectors

Disadvantages of t-SNE:

As this algorithm computes similarities between all pairs of points, its time and space complexity grow quadratically with the size of the data.

Thus, we apply dimensionality reduction to data. Apart from these, there are also clustering techniques. Following are some references which will help you better understand the various concepts mentioned in this article.

Visualizing data using t-SNE

Official website of t-SNE: don't forget to refer to it

YouTube lecture explaining eigenvectors and eigenvalues

How did you feel about this article? Add a response or clap to let me know!!!

It takes a little more effort to understand all the math behind these methods, but it is definitely worth it, so feel free to google to know more. I feel this article gives you a good start to dive in!!

Happy (Unsupervised) Learning R !!

SaiGayatri Vadali

An inquisitive Machine Learning Engineer, yoga trainer, fitness freak and a passionate writer!