Feature Extraction Techniques: PCA, LDA and t-SNE

Ashwin N
Published in Analytics Vidhya
8 min read · Jan 6, 2020

Transforming data using unsupervised/supervised learning can have many motivations. The most common are visualization, compressing the data, and finding a representation that is more informative for further processing. One of the simplest and most widely used algorithms for all of these is principal component analysis. We’ll also look at two other algorithms: Linear Discriminant Analysis, commonly used for feature extraction in supervised learning, and t-SNE, which is commonly used for visualization with 2-dimensional scatter plots.

What is Feature Extraction?

Feature extraction is a process of dimensionality reduction by which an initial set of raw data is reduced to more manageable groups for processing. A characteristic of these large data sets is a large number of variables that require a lot of computing resources to process. Feature extraction is the name for methods that select and/or combine variables into features, effectively reducing the amount of data that must be processed while still accurately and completely describing the original data set.

Why is this Useful?

The process of feature extraction is useful when you need to reduce the number of resources needed for processing without losing important or relevant information. Feature extraction can also reduce the amount of redundant data for a given analysis. Moreover, reducing the data and the machine’s effort in building variable combinations (features) speeds up the learning and generalization steps in the machine learning process.

Practical Uses of Feature Extraction

Autoencoders: The purpose of autoencoders is unsupervised learning of efficient data codings. Feature extraction is used here to identify the key features in the data that the coding should preserve, learned from the original data set and used to derive new representations.

Bag-of-Words: A technique for natural language processing that extracts the words (features) used in a sentence, document, website, etc. and classifies them by frequency of use. This technique can also be applied to image processing.

Image Processing: Algorithms are used to detect features such as shapes, edges, or motion in a digital image or video.

Implementation

In this article, we will apply a few feature extraction techniques to the Image Segmentation Dataset from the UCI Machine Learning Repository. The dataset already comes with separate training and test sets, so there is no need for an explicit train-test split in our implementation. The training set has 2100 instances and 19 attributes, making it a good candidate for applying different feature extraction techniques. My complete code can be found on GitHub.

First of all, we need to import all the necessary libraries.

Import Libraries required for Feature Extraction
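The original code gist is not reproduced here, so below is a minimal sketch of the imports the rest of the walkthrough relies on (the exact set in the original notebook may differ):

```python
# Core data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing, model, metric, and the three feature extraction techniques
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE
```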

Both the training and the test set are then extracted from the downloaded zip file and converted into corresponding pandas DataFrames.

Input training set in dataframe with LABELS column as target column
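A rough sketch of the loading step; the file names, the number of skipped header lines, and the generated attribute names are assumptions about how the extracted UCI files are laid out, not the exact code from the notebook:

```python
# Assumed layout: first field on each row is the class label, followed by
# the 19 numeric attributes; the files start with a few header lines.
col_names = ["LABELS"] + [f"F{i}" for i in range(1, 20)]  # hypothetical attribute names

train_df = pd.read_csv("segmentation.data", skiprows=5, header=None, names=col_names)
test_df = pd.read_csv("segmentation.test", skiprows=5, header=None, names=col_names)

print(train_df.shape, test_df.shape)
```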

Next, we preprocess the training and test sets. The class column LABELS has to be converted to a label-encoded format, which can easily be done with LabelEncoder from scikit-learn. We also use StandardScaler to bring all attributes to the same scale, with mean 0 and variance 1.
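A minimal sketch of this preprocessing step, assuming the train_df and test_df DataFrames from above:

```python
# Encode the string labels as integers (fit on the training set only)
le = LabelEncoder()
y_train = le.fit_transform(train_df["LABELS"])
y_test = le.transform(test_df["LABELS"])

# Standardize the 19 numeric attributes to mean 0 and variance 1
scaler = StandardScaler()
X_train = scaler.fit_transform(train_df.drop(columns=["LABELS"]))
X_test = scaler.transform(test_df.drop(columns=["LABELS"]))
```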

Before applying any feature extraction technique such as PCA or LDA, we will check the performance of a Random Forest model on the raw input dataset.
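A sketch of this baseline; the Random Forest hyperparameters are assumptions and not necessarily those used in the original notebook:

```python
# Baseline: Random Forest on all 19 standardized attributes
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test), target_names=le.classes_))
```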

The classification report is very impressive: the model achieves almost 99% accuracy on the test set of 210 samples.

Apply Principal Component Analysis (PCA)

PCA is a method that rotates the dataset in a way such that the rotated features are statistically uncorrelated. This rotation is often followed by selecting only a subset of the new features, according to how important they are for explaining the data. In this way, PCA can be used for dimensionality reduction by retaining only some of the principal components. It is important to note that PCA is an unsupervised method, and does not use any class information when finding the rotation. It simply looks at the correlations in the data.

Transformation of data with PCA

Let’s apply PCA to our input dataset. First, we need to determine the number of principal components to generate from the given set of attributes. This can be done by looking at the cumulative explained variance ratio as a function of the number of components:
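One way to produce such a curve, as a sketch using the standardized X_train from above:

```python
# Fit PCA with all 19 components and plot the cumulative explained variance
pca_full = PCA().fit(X_train)
plt.plot(np.cumsum(pca_full.explained_variance_ratio_))
plt.xlabel("number of components")
plt.ylabel("cumulative explained variance")
plt.show()
```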

This curve quantifies how much of the total, 19-dimensional variance is contained within the first N components. For example, we see that with the segmentation dataset the first 5 components contain approximately 75% of the variance, while you need around 12 components to describe close to 100% of the variance.

Here we see that our two-dimensional projection loses a lot of information (as measured by the explained variance) and that we’d need about 12 components to retain 90% of the variance. Looking at this plot for a high-dimensional dataset can help you understand the level of redundancy present in multiple observations.

Hence, we will keep n_components = 12, extract the principal components, and feed them to the same Random Forest setup to evaluate performance.
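A sketch of this step, re-using the same (assumed) Random Forest settings as the baseline:

```python
# Keep 12 principal components and evaluate the Random Forest on them
pca = PCA(n_components=12)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

rf_pca = RandomForestClassifier(n_estimators=100, random_state=42)
rf_pca.fit(X_train_pca, y_train)
print(classification_report(y_test, rf_pca.predict(X_test_pca), target_names=le.classes_))
```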

After applying PCA, the performance remains excellent even though the number of features has been reduced from 19 to 12.

Heat map of the first 12 principal components on the Image Segmentation Dataset

You can see that in the first 3 components, all features have almost the same sign (values close to 0.0). That means there is a general correlation between all features: when one measurement is high, the others are likely to be high as well. From the 3rd component onward the signs are mixed, and the later components involve all 19 features of the Image Segmentation Dataset.
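A heat map like the one described above can be reproduced from pca.components_; a minimal sketch, with plot styling that is my own choice:

```python
# Loadings of the 12 retained components on the 19 original features
plt.matshow(pca.components_, cmap="viridis")
plt.yticks(range(12), [f"PC{i + 1}" for i in range(12)])
plt.xticks(range(X_train.shape[1]),
           train_df.drop(columns=["LABELS"]).columns, rotation=90)
plt.colorbar()
plt.xlabel("feature")
plt.ylabel("principal component")
plt.show()
```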

Apply Linear Discriminant Analysis

LDA is a supervised dimensionality reduction technique that aims to maximize the distance between the means of the classes while minimizing the spread within each class. LDA therefore uses the within-class and between-class scatter as its measures. This is a good choice because maximizing the distance between the class means when projecting the data into a lower-dimensional space can lead to better classification results.

When using LDA, it is assumed that the input data follows a Gaussian distribution (as in this case); applying LDA to non-Gaussian data can therefore lead to poor classification results.

For our dataset, it is again important to determine the number of components needed for LDA. We will again use the cumulative explained variance ratio as a function of the number of components.
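A sketch of this check; with 7 classes, scikit-learn's LinearDiscriminantAnalysis keeps at most n_classes - 1 = 6 discriminants:

```python
# Fit LDA with the maximum number of discriminants and inspect the
# cumulative explained variance ratio
lda_full = LinearDiscriminantAnalysis()
lda_full.fit(X_train, y_train)
plt.plot(np.cumsum(lda_full.explained_variance_ratio_))
plt.xlabel("number of components")
plt.ylabel("cumulative explained variance")
plt.show()
```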

The first 5 components (0 to 4) are enough to explain 100% of the variance in the dataset. Hence, after applying LDA and the Random Forest, below are the evaluation results:
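A sketch of this step, again with assumed Random Forest hyperparameters:

```python
# Project onto the first 5 linear discriminants and re-run the Random Forest
lda = LinearDiscriminantAnalysis(n_components=5)
X_train_lda = lda.fit_transform(X_train, y_train)
X_test_lda = lda.transform(X_test)

rf_lda = RandomForestClassifier(n_estimators=100, random_state=42)
rf_lda.fit(X_train_lda, y_train)
print(classification_report(y_test, rf_lda.predict(X_test_lda), target_names=le.classes_))
```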

The evaluation performance is better than with PCA in the sense that LDA achieves similar accuracy with an even smaller number of components.

Pairwise plot relationships between 5 LDA components explaining variance between different labels.
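Such a pair plot could be drawn with seaborn, for example; a sketch assuming the X_train_lda projection from the previous step:

```python
# Pairwise relationships between the 5 LDA components, colored by class
lda_df = pd.DataFrame(X_train_lda, columns=[f"LD{i + 1}" for i in range(5)])
lda_df["LABELS"] = le.inverse_transform(y_train)
sns.pairplot(lda_df, hue="LABELS")
plt.show()
```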

PCA vs LDA

Both Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) are linear transformation techniques that are commonly used for dimensionality reduction. PCA can be described as an “unsupervised” algorithm, since it “ignores” class labels and its goal is to find the directions (the so-called principal components) that maximize the variance in a dataset. In contrast to PCA, LDA is “supervised” and computes the directions (“linear discriminants”) that represent the axes that maximize the separation between multiple classes.

Although it might sound intuitive that LDA is superior to PCA for a multi-class classification task where the class labels are known, this is not always the case.
For example, comparisons between classification accuracies for image recognition after using PCA or LDA show that PCA tends to outperform LDA if the number of samples per class is relatively small (PCA vs. LDA, A.M. Martinez et al., 2001). In practice, it is also not uncommon to use both LDA and PCA in combination: e.g., PCA for dimensionality reduction followed by LDA.

Source: Introduction to LDA

Apply t-SNE for better visualisation

While LDA is often a good first approach for transforming your data so that you might be able to visualize it with a scatter plot, the nature of the method (it only considers the variance between and within classes) limits its usefulness.

Hence, there is a class of algorithms for visualization called manifold learning algorithms that allow for much more complex mappings and often provide better visualizations. A particularly useful one is the t-distributed Stochastic Neighbor Embedding (t-SNE).

For our Image Segmentation Dataset, it is difficult to represent the original string labels as data points on a scatter plot, hence we will map them to integer labels:

{BRICKFACE = 0, CEMENT = 1, FOLIAGE = 2, GRASS = 3, PATH =4, SKY = 5, WINDOW = 6}

For n_components we will use 2, because it is easier to draw a scatter plot of only two LDA components.
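A sketch of how this two-component projection and scatter plot might be produced:

```python
# 2-component LDA projection of the training set, colored by integer label
lda2 = LinearDiscriminantAnalysis(n_components=2)
X_train_lda2 = lda2.fit_transform(X_train, y_train)

plt.scatter(X_train_lda2[:, 0], X_train_lda2[:, 1], c=y_train, cmap="tab10")
plt.xlabel("first LDA component")
plt.ylabel("second LDA component")
plt.colorbar(label="class label")
plt.show()
```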

Scatter plot of Image Segmentation dataset using first two LDA components.

Data points with labels 5 and 3 are well separated from the rest, but most of the other data points still overlap significantly.

Now let us apply t-SNE. But before that, what is t-SNE? The idea behind t-SNE is to find a two-dimensional representation of the data that preserves the distances between data points as well as possible. t-SNE starts with a random two-dimensional representation for each data point, and then tries to make points that are close in the original feature space closer, and points that are far apart in the original feature space farther apart. t-SNE puts more emphasis on points that are close by, rather than preserving distances between far-apart points. In other words, it tries to preserve the information indicating which points are neighbors of each other.

We will apply t-SNE, which is available in scikit-learn.
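A minimal sketch of this step; random_state is my own choice for reproducibility:

```python
# 2-D t-SNE embedding of the standardized training data (unsupervised:
# the labels are only used to color the points afterwards)
tsne = TSNE(n_components=2, random_state=42)
X_train_tsne = tsne.fit_transform(X_train)

plt.scatter(X_train_tsne[:, 0], X_train_tsne[:, 1], c=y_train, cmap="tab10")
plt.xlabel("t-SNE feature 0")
plt.ylabel("t-SNE feature 1")
plt.colorbar(label="class label")
plt.show()
```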

Scatter plot of Image Segmentation dataset using first two components found by t-SNE

The result of t-SNE is quite remarkable. All the classes are quite clearly separated. Classes 3 (GRASS) and 5 (SKY) are somewhat split up, but most of the classes each form a single dense group.

Note: the t-SNE method has no knowledge of the class labels; it is completely unsupervised. Still, it can find a representation of the data in 2 dimensions that clearly separates the classes, based solely on how close points are in the original space.

The t-SNE algorithm has some tuning parameters, though it often works well with default settings. You can try playing with perplexity and early_exaggeration, but the effects are usually minor.

I hope you enjoyed this article, thank you for reading!

Bibliography

[1] Introduction to Linear Discriminant Analysis by Sebastian Raschka. Accessed at: https://sebastianraschka.com/Articles/2014_python_lda.html

[2] In Depth: Principal Component Analysis. Accessed at: https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html

[3] Visualizing Data using t-SNE by L. van der Maaten and G. Hinton, JMLR 2008. Accessed at: http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

[4] Image Segmentation Dataset. Accessed at: https://archive.ics.uci.edu/ml/datasets.php
