Dimensionality Reduction using Principal Component Analysis (PCA)

Rajvi Shah
Published in Analytics Vidhya
5 min read · Sep 13, 2020

An essential technique for handling datasets with a large number of features/dimensions.


Data keeps growing every second, and it has become crucial to extract insights from it to solve problems. As the number of features in the data increases, so does the dimensionality of the dataset, and a Machine Learning model eventually has to handle increasingly complex data. On the other hand, many features are useless to the model or are correlated with others. Principal Component Analysis (PCA) is a way out: it reduces dimensions and removes correlated features from the dataset.

The article is divided into the following sections:

  1. What is PCA?
  2. Need & advantages of PCA
  3. Real-time applications
  4. Steps to perform PCA —
  • Data standardization
  • Computing covariance matrix
  • Determining eigenvalues and eigenvectors
  • Computing principal components
  5. Implementing PCA on the MNIST dataset using Python
  6. Conclusion

What is PCA?

Principal Component Analysis (PCA) is a dimensionality reduction technique that identifies correlations and patterns in a dataset so that it can be transformed into a dataset of significantly fewer dimensions with minimal loss of important information.

Need of PCA

A dataset with a large number of features takes more time to train a model and makes data processing and exploratory data analysis (EDA) more convoluted.

Advantages of PCA

  • Reduces training time.
  • Removes correlated features (removes noise).
  • Eases data exploration (EDA).
  • Makes data easier to visualize (at most 3 dimensions can be plotted).

Real-time applications

PCA is used for dimensionality reduction in domains such as face recognition, computer vision, image compression, image detection, object detection, image classification, etc.

Steps to perform PCA:

Data Standardization

  • Standardization is all about scaling the data in such a way that all the values/variables are in a similar range. Standardization means rescaling data to have a mean of 0 and a standard deviation of 1 (unit variance).
  • Why? PCA is driven by variance: without standardization, features measured on larger scales would dominate the principal components. The formula below shows the rescaling applied to each value.
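In symbols, each value x of a feature is rescaled using that feature's mean μ and standard deviation σ:

$$ z = \frac{x - \mu}{\sigma} $$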

Computing covariance matrix

  • The covariance matrix is computed after data standardization and it is used to find correlated features in the dataset. Each element of the covariance matrix represents the relation between two features.
  • Why? It is used to determine the correlation between features, as the formula below shows.
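For a standardized data matrix X with n rows (samples) and d columns (features), the covariance matrix S is the d × d matrix

$$ S = \frac{1}{n-1} X^{\top} X $$

whose entry S_ij is the covariance between feature i and feature j.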

Determining eigenvalues and eigenvectors

  • Eigenvalues and eigenvectors are the mathematical constructs that must be computed from the covariance matrix in order to determine the principal components of the dataset.
  • Principal components are the new set of variables/features that are obtained from the initial set of features. They compress and possess most of the useful information that was scattered among the initial features.
  • Eigenvectors are vectors whose direction does not change when a linear transformation (here, the covariance matrix) is applied to them.
  • Eigenvalues are simply the scalars by which the respective eigenvectors are scaled.
  • Why? To find the directions of maximum spread of the data; the defining equation is shown below.
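An eigenvector v of the covariance matrix S and its eigenvalue λ satisfy

$$ S\,v = \lambda\,v $$

that is, multiplying by S only stretches v by the factor λ without changing its direction.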

Computing Principal Components

  • Suppose there are 5 features in a dataset. After computing the eigenvectors and their respective eigenvalues, there will be 5 principal components, each with its own eigenvalue and eigenvector.
  • The components with the highest eigenvalues capture most of the detail/spread of the data compared to the others.
  • So the components with the highest eigenvalues are kept as the principal components, and they are the output of PCA.
  • Now, these principal components are used as input to train the model and for data visualization; the projection that produces them is shown below.
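Concretely, if W is the d × k matrix whose columns are the top-k eigenvectors, the reduced dataset is the projection

$$ Y = X\,W $$

which maps each d-dimensional sample to k dimensions (here, k = 2 for visualization).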

Implementing PCA on the MNIST dataset using Python

  • Problem statement — perform PCA step by step on the MNIST dataset in order to reduce its dimensions.
  • The MNIST dataset contains images of the digits 0 to 9 and is a standard beginner dataset for image/digit recognition. Each image is 28 × 28 pixels, so in vector form it becomes 28 × 28 = 784 features; it is tough to visualize 784 features on a screen, and training on them is slow. So, here I will reduce the dimensions of the dataset and visualize the data in 2D.
  • The dataset is already prepared and converted to matrix form (stored in CSV format), so it is easy to use for the tasks that follow (how the data is prepared is not explained in this article, as the subject of the article is PCA). To download the data, visit — https://www.kaggle.com/c/digit-recognizer/data
  • Let’s dive into the code. First, load the data and read it using pandas:
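A minimal sketch of this step, assuming the Kaggle download is saved locally as train.csv (adjust the path as needed; the variable names throughout these snippets are illustrative):

```python
import pandas as pd

# Load the digit-recognizer CSV downloaded from Kaggle.
df = pd.read_csv("train.csv")
print(df.shape)  # one 'label' column plus 784 pixel columns
```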

Then separate the label data from the feature data:
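The Kaggle CSV stores the digit in a label column, so the split looks like this:

```python
# Separate the target digit from the 784 pixel features.
labels = df["label"]
features = df.drop("label", axis=1)
print(features.shape)  # (n_samples, 784)
```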

Apply column standardization using the sklearn library:
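One way to do this is with scikit-learn's StandardScaler:

```python
from sklearn.preprocessing import StandardScaler

# Rescale every pixel column to mean 0 and standard deviation 1.
standardized_data = StandardScaler().fit_transform(features)
```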

Now, compute the covariance matrix from the standardized data as follows:
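A sketch using NumPy, equivalent to the S = XᵀX / (n − 1) formula above:

```python
import numpy as np

# rowvar=False treats each column (pixel) as a variable,
# giving the 784 x 784 covariance matrix.
cov_matrix = np.cov(standardized_data, rowvar=False)
print(cov_matrix.shape)  # (784, 784)
```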

Following that, find the eigenvalues and eigenvectors; here I have computed only the top 2 eigenvectors:
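A sketch using NumPy's symmetric eigensolver:

```python
import numpy as np

# np.linalg.eigh returns eigenvalues of a symmetric matrix in
# ascending order, so the last two columns of eigen_vectors
# correspond to the two largest eigenvalues.
eigen_values, eigen_vectors = np.linalg.eigh(cov_matrix)
top2 = eigen_vectors[:, -2:].T  # shape (2, 784)
```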

Compute the principal components by projecting the data onto those eigenvectors:
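This is the projection Y = XW from the steps above:

```python
# (n_samples, 784) @ (784, 2) -> (n_samples, 2)
new_coordinates = standardized_data @ top2.T
```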

Plot the data with the two principal features using seaborn:
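A sketch of the 2D visualization (the column names are illustrative):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Attach the digit labels so points can be colored by class.
pca_df = pd.DataFrame(new_coordinates,
                      columns=["1st_principal", "2nd_principal"])
pca_df["label"] = labels.values

sns.scatterplot(data=pca_df, x="1st_principal", y="2nd_principal",
                hue="label", palette="deep", s=10)
plt.show()
```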

Finally, implement the same reduction with the PCA class from the sklearn library:
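The whole pipeline above collapses to a few lines with scikit-learn:

```python
from sklearn.decomposition import PCA

# Reduce 784 dimensions to 2 in one step.
pca = PCA(n_components=2)
pca_data = pca.fit_transform(standardized_data)
print(pca.explained_variance_ratio_)  # variance captured by each component
```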

Conclusion

  • Dimensions are reduced from 784 to 2, and the main information is preserved in those 2 features, which can be useful for determining the digit.

You can find the source code on GitHub.

If any function or class from these libraries is unclear, I encourage you to check its documentation.

If there are any corrections, suggestions for improvement, or queries, let me know at rajvishah2309@gmail.com
