Curse of dimensionality

Introduction

Paritosh Pantola
8 min read · May 10, 2018

In machine learning, high-dimensional datasets are common. Visualization is not possible when the dataset lives in a high-dimensional space, so reducing the high-dimensional space to a lower-dimensional space becomes an important part of machine learning. We can reduce the dimensionality in the following two ways:

Feature selection — selecting the important features that are relevant to the model (this helps avoid the curse of dimensionality)

Feature extraction — transforming the high-dimensional space into a lower-dimensional space using methods such as PCA, TSVD, t-SNE, etc.

Curse of dimensionality

It is a phenomenon that occurs in high-dimensional space and hardly occurs in lower-dimensional space. As the number of dimensions grows, the data becomes sparse. A high-dimensional space causes problems in clustering (it becomes very difficult to separate one cluster from another), the search space increases, and the complexity of the model increases.

So, what basically is the curse of dimensionality? Let us understand this phenomenon with a simple ‘Paritosh and his dog farm’ example.

As in figure 1, Paritosh’s only task is to find all the dogs (feature 1), which is very easy as he just has to walk in a straight line to pick up all his dogs. Here the data density is also very high, so he will not face any difficulty tracing all the dogs.

Suppose in a second scenario we add another feature, so that now his task is to find only those dogs (feature 1) which belong to his farm (feature 2). Here he will take a little more time compared to figure 1.

Now in a third scenario his task is to find only those dogs (feature 1) which belong to his farm (feature 2) and whose breed is Labrador (adding one more feature). Here the number of dimensions has increased to three and the data density has decreased, so his task becomes more difficult.

So, if we add further dimensions, say colour, region, diet, and health, it becomes more and more difficult for him to find his dogs. We can therefore say that as the number of features increases, the data density decreases and the complexity increases, and it becomes very difficult for a machine learning model to work efficiently.
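To make the sparsity point concrete, here is a minimal sketch (my own illustration, not part of the original example) that samples a fixed number of points in the unit hypercube and measures how the average distance to the nearest neighbour grows as dimensions are added. It assumes numpy and scikit-learn are available.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n_points = 1000

for d in [1, 2, 3, 10, 50, 100]:
    X = rng.random((n_points, d))            # points in the unit hypercube [0, 1]^d
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    dists, _ = nn.kneighbors(X)              # column 0 is the point itself (distance 0)
    print(f"d={d:3d}  mean nearest-neighbour distance = {dists[:, 1].mean():.3f}")
```

With the same 1000 points, the nearest neighbour keeps drifting further away as d grows, which is exactly the loss of data density described above.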

To overcome the curse of dimensionality, dimensionality reduction comes into the picture. It is the reduction of a high-dimensional space to a lower-dimensional space such that it becomes easy to visualize the dataset. There are various dimensionality reduction methods; some of them are:

· PCA

· t-SNE (non-parametric/ nonlinear)

· Sammon mapping (nonlinear)

· SNE (nonlinear)

· MVU (nonlinear)

· Laplacian Eigenmaps (nonlinear)

Two important and effective approaches for dimensionality reduction are PCA and t-SNE, which are discussed in this chapter.

PCA — Principal Component Analysis

PCA is a dimensionality reduction and feature extraction technique that converts a set of possibly correlated variables into linearly uncorrelated variables (called principal components) using an orthogonal transformation. PCA is used on datasets whose features have been scaled.
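As a minimal sketch of this workflow (assuming scikit-learn is available; the dataset here is a random placeholder), the features are standardized first and then projected onto the principal components:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                  # placeholder dataset: 200 samples, 5 features

X_scaled = StandardScaler().fit_transform(X)   # PCA expects scaled (standardized) features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)        # principal components are uncorrelated

print(X_reduced.shape)                         # (200, 2)
print(pca.explained_variance_ratio_)           # variance retained by each component
```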

Objective Function: PCA

Suppose we have a two-dimensional dataset spread between the x axis and the y axis as shown in figure 4. In figure 4 the data distribution between the two features is very wide, so it becomes difficult to reduce the dimensions of this dataset.

If we rotate our axes d1 and d2 to, say, d1' and d2' such that the data variance or spread along the new axis is maximum, then it becomes easy to reduce the dimension of the dataset, because now most of the spread is along d1' as shown in figure 4. So our main objective is this: if we can find d1', the axis along which the dataset has maximum spread, then we can reduce our two-dimensional dataset to a one-dimensional dataset (we can drop d2' later because there is almost no data distribution along d2').

So our objective here is to find the value of theta, i.e. the rotation of axes d1 and d2 such that most of the data lies along the final rotated axis (here d1' and d2', rotated by theta as shown in figure 5).

Suppose our data is column standardized. Let us say we have a direction u such that u is a unit vector and the spread of the data along u is maximum, and we have a data point x1 whose projection onto u is x1'. The projection is given by

x1' = (u^T x1 / ||u||^2) u

and since u is a unit vector (||u|| = 1), this simplifies to x1' = (u^T x1) u. Hence, given any point x1, we can convert it into x1' as long as we have the unit vector u. By the same approach, we can find the mean vector of the projected points x1' given the mean of the original points x1. The objective is to find the unit vector that points in the direction of maximum variance. The variance of the projected points is defined as:

var(u) = (1/n) Σ_i (u^T x_i − u^T x̄)^2

As our data is column standardized, the mean x̄ is zero, so the second term vanishes. Hence the variance simplifies to:

var(u) = (1/n) Σ_i (u^T x_i)^2

So the objective is to find the u that maximizes this variance, subject to u being a unit vector. The expression above is the objective function, and to find the direction of maximum variance we can use the eigenvalues (λ) and eigenvectors (V) of the covariance matrix, where the covariance matrix is defined as X^T X (X transpose times X, a square matrix) and X is the n-dimensional dataset. We compute and arrange the eigenvalues so that λ_1 > λ_2 > λ_3 > …; corresponding to each λ we have an eigenvector V_1, V_2, V_3, …, where V_1 is the direction of maximum variance, V_2 the direction of the second-largest variance, V_3 the direction of the third-largest variance, and so on. If we have to find the percentage of variance explained, we can use the eigenvalues. How do we use them? Suppose we have a 30-dimensional dataset and we have to keep only enough components so that more than 90% of the variance is explained; the fraction of variance explained by the i-th dimension can be calculated as:

variance explained by dimension i = λ_i / (λ_1 + λ_2 + … + λ_d)
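A minimal numpy sketch of this eigen-decomposition recipe, assuming X is already column standardized and using a random 30-dimensional placeholder dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # column standardization

cov = X.T @ X / X.shape[0]                 # covariance matrix (30 x 30)
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh because the covariance matrix is symmetric
order = np.argsort(eigvals)[::-1]          # arrange so that lambda_1 > lambda_2 > ...
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()        # fraction of variance explained per component
k = np.searchsorted(np.cumsum(explained), 0.90) + 1
print(f"{k} components explain at least 90% of the variance")
```

The columns of eigvecs are the directions V_1, V_2, … of decreasing variance; projecting X onto the first k of them gives the reduced dataset.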

Hence we can say that PCA maximizes the variance of the projected points, so it retains as much variance as possible; therefore PCA tries to preserve the global shape of the data.

PCA: Limitations

PCA works well if the data is linearly distributed as shown in figure 4. If the data is not linearly distributed, as shown in figure 6, we may face the problem of information loss: if we project the data points onto a single unit vector, we lose a lot of the information carried by the other points.
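A small sketch of this limitation on a synthetic circular dataset (an illustration of mine, not the figure from the article): no single principal direction captures most of the variance, so a one-dimensional projection loses a lot of information.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 500)
# noisy points on a circle: a clearly non-linear structure
X_circle = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(scale=0.05, size=(500, 2))

pca = PCA(n_components=2).fit(X_circle)
print(pca.explained_variance_ratio_)   # roughly [0.5, 0.5]: neither direction dominates
```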

T-SNE: An introduction

t-SNE (t-distributed Stochastic Neighbour Embedding) is a powerful technique used for dimensionality reduction and visualization. As discussed, PCA only preserves the global shape of the data, but t-SNE preserves the local as well as the global shape, so by using t-SNE we can reduce the information loss problem. How t-SNE preserves the data can be understood with the 2-d scatter plot shown in figure 7.

t-SNE basically finds the similarity, or close neighbours, of every data point with every other, and points that are close to each other are placed into one cluster, as shown in figure 7.

SNE: Stochastic Neighbourhood Embedding

Before understanding t-SNE, let us understand how SNE works. SNE matches the distributions of distances between the high- and low-dimensional space using conditional probabilities, assuming the distances are Gaussian distributed. The similarity between two points in the high-dimensional space is given by:

p(j|i) = exp(−||x_i − x_j||^2 / 2σ_i^2) / Σ_{k≠i} exp(−||x_i − x_k||^2 / 2σ_i^2)

Here p(j|i) is the conditional probability, x_i and x_j are the two data points whose similarity we want, and σ_i is the bandwidth of the Gaussian centred on x_i. If x_i and x_j are close to each other, then p(j|i) is large.

Similarly, for the low-dimensional space the conditional probability is:

q(j|i) = exp(−||y_i − y_j||^2) / Σ_{k≠i} exp(−||y_i − y_k||^2)

where y_i and y_j are the low-dimensional counterparts of x_i and x_j.

After constructing the low-dimensional and high-dimensional probabilities, the next question is how close p(j|i) and q(j|i) are, i.e. when we convert high-dimensional data into low-dimensional data, are we losing any characteristics of the dataset? Here we use a cost function based on the Kullback-Leibler divergence:

C = Σ_i KL(P_i || Q_i) = Σ_i Σ_j p(j|i) log( p(j|i) / q(j|i) )

Here P_i and Q_i are the conditional probability distributions in the high-dimensional and low-dimensional space respectively, C is the cost function, and KL is the Kullback-Leibler divergence. Because the Kullback-Leibler divergence is asymmetric, representing nearby points (large p(j|i)) by faraway map points (small q(j|i)) is penalized heavily, so the cost pushes SNE to preserve local structure.
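The following is a minimal numpy sketch (not a reference implementation) of the quantities above: the Gaussian conditional probabilities p(j|i) in the high-dimensional space, q(j|i) in the low-dimensional space, and the Kullback-Leibler cost. The fixed sigma values are a simplifying assumption of mine; real SNE tunes each σ_i through a perplexity search.

```python
import numpy as np

def squared_distances(X):
    """Pairwise squared Euclidean distances."""
    sq = (X ** 2).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * X @ X.T

def conditional_probs(D_sq, sigmas):
    """Row-wise conditional probabilities p(j|i) from squared distances."""
    P = np.exp(-D_sq / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(P, 0.0)               # a point is never its own neighbour
    return P / P.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))             # high-dimensional points x_i
Y = rng.normal(size=(100, 2))              # a candidate low-dimensional embedding y_i

# sigma_i fixed at 1 for simplicity; the low-dimensional Gaussian uses
# variance 1/2 so the exponent is simply -||y_i - y_j||^2.
P = conditional_probs(squared_distances(X), sigmas=np.full(100, 1.0))
Q = conditional_probs(squared_distances(Y), sigmas=np.full(100, 1.0 / np.sqrt(2)))

# Kullback-Leibler cost C = sum_i KL(P_i || Q_i)
mask = P > 0
cost = np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], 1e-12)))
print("KL cost:", cost)
```

Minimizing this cost with respect to the y_i (by gradient descent) is what produces the embedding.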

Although SNE is an effective algorithm, it faces two major drawbacks:

· Cost function — the cost function used here is difficult to optimize

· Crowding problem — sometimes it becomes very difficult to preserve the distances for all neighbourhoods, as shown in figure 8

T-SNE: t-distributed Stochastic Neighbourhood Embedding

t-SNE is a non-linear dimensionality reduction algorithm used for visualizing and exploring high-dimensional datasets. To overcome the problems of SNE, t-SNE models the low-dimensional similarities with a Student t-distribution (whose heavy tails make it robust to outliers) instead of a Gaussian. The low-dimensional similarity in t-SNE is:

q(i,j) = (1 + ||y_i − y_j||^2)^−1 / Σ_{k≠l} (1 + ||y_k − y_l||^2)^−1

t-SNE basically consists of two steps, sketched in code after the list:

Step 1 — construct a probability distribution over pairs of points in the high-dimensional space such that similar objects have a high probability of being picked

Step 2 — define a similar probability distribution over the points in the low-dimensional space and minimize the divergence between the two distributions
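A minimal sketch of these two steps using scikit-learn's built-in t-SNE on the small digits dataset (the parameter values here are illustrative, not the article's settings):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)          # 1797 samples, 64 features

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_embedded = tsne.fit_transform(X)           # builds P in high-D, fits Q in low-D

print(X_embedded.shape)                      # (1797, 2), ready for a scatter plot
```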

T-SNE: Limitations

Although t-SNE is a very powerful algorithm for visualization and exploration of high-dimensional datasets, I faced some limitations when I applied t-SNE to the MNIST dataset:

· t-SNE will not work well on a low-configuration laptop, i.e. you need at least 8 GB of RAM if you want to run t-SNE efficiently

· t-SNE may crash even with as much as 16 GB of RAM if you do not first apply another dimensionality reduction technique (for example, keeping only the top eigenvalue directions with PCA; see the sketch after this list)

· It is a time-consuming process; sometimes it takes 5–6 hours for a complete run (although execution time depends entirely on the system configuration and the size of the dataset)

· If someone is looking at t-SNE as a general dimensionality reduction technique, then it is not the right choice, as it can only reduce a high-dimensional dataset to at most 3 dimensions
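As a workaround for the memory and run-time issues above, a common recipe is to reduce the data with PCA first and then run t-SNE on the compressed representation. A minimal sketch (the parameter values are illustrative assumptions):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)

X_pca = PCA(n_components=30).fit_transform(X)    # keep the top eigen directions first
X_embedded = TSNE(n_components=2, random_state=0).fit_transform(X_pca)

print(X_embedded.shape)                          # (1797, 2)
```

Running t-SNE on 30 PCA components instead of the full feature space keeps the distance computations small, which is exactly the kind of pre-reduction the second limitation refers to.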

References

[1] t-distributed stochastic neighbor embedding, Wikipedia

[2] Principal component analysis, Wikipedia

[3] Beginners Guide To Learn Dimension Reduction Techniques, Analytics Vidhya

[4] Principal Component Analysis, Applied AI Course

[5] L. van der Maaten and G. Hinton, Visualizing Data using t-SNE, 2008
