Principal Component Analysis (Feature Extraction Technique)

Mayureshrpalav
5 min read · Jul 17, 2020

In Machine Learning, our whole process works on the features provided to us in a dataset to predict, analyze, classify, or cluster our results.

For example, in a binary classification task to predict whether a person has lung cancer or not, the features may be: ‘Weight’, ‘Blood_Group’, ‘Smokes_or_Not’, ‘Age’, ‘Symptoms’, etc.

In some cases there may be more than 40 or 50 features, and in real-world scenarios the count is usually even higher. When we try to apply any kind of algorithm to such data, we may run into overfitting. Why?

We know that more features mean more dimensions. Training with so many features results in high variance, and the model overfits. That is why having high dimensionality is called a curse.

So we need to find a way to reduce the dimensions without simply throwing features away one by one.

Here comes the concept of Feature Selection vs. Feature Extraction!

If we have a high number of features, the ‘Feature Selection’ technique keeps only a subset of those features which are important, or eliminates those features which do not help classify our target.

With ‘Feature Extraction’, on the other hand, our goal is to create a new, smaller set of features that still captures most of the useful information. Remember: feature selection keeps a subset of the original features, while feature extraction creates new ones.

“Feature extraction fills this requirement: it builds valuable information from raw data — the features — by reformatting, combining, transforming primary features into new ones… until it yields a new set of data that can be consumed by the Machine Learning models to achieve their goals.”

In this way, we do not completely eliminate the least important features as we do in feature selection, and we do not lose the information contained in our original dataset.
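As a rough illustration of the difference, here is a minimal sketch using sklearn and its breast cancer data purely as a stand-in (the choice of 10 features/components is arbitrary, just for the shapes):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_breast_cancer(return_X_y=True)   # 30 original features

# Feature selection: keeps 10 of the original 30 columns, unchanged
selected = SelectKBest(f_classif, k=10).fit_transform(X, y)
print(selected.shape)   # (569, 10)

# Feature extraction: builds 10 brand-new features, each a combination
# of all 30 originals (scaling skipped here for brevity)
extracted = PCA(n_components=10).fit_transform(X)
print(extracted.shape)  # (569, 10)
```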

So Principal Component Analysis (PCA) is a feature extraction technique meant to reduce the dimensions of our dataset. Note: we won't be going into the details of the eigenvalues and eigenvectors involved in PCA; we will just be showing how it works.

The main idea of principal component analysis (PCA) is to reduce the dimensionality of a data set consisting of many variables correlated with each other, while retaining the variation present in the dataset to the maximum extent. This is done by transforming the variables into a new set of variables, which are known as the principal components (or simply, the PCs).

Suppose we take a dataset with two features, ‘Feature 1’ and ‘Feature 2’, and plot them. Say we need to convert this to a single one-dimensional feature. How can we do it?

First, we can fit a line through these data points and project the data points onto that line. We can call this the first principal component, or PC1.

Then we fit another line to our data, orthogonal to the first one.

[Image: Fitting Components]

In the above image we see that Component 1 and Component 2 are two lines fitted to our data points. The first principal component (Component 1) captures the variation of our data very well, while Component 2 loses a lot of information and variance, which is not what we want.

Here we see that in the above plots the dimension is reduced to 1, both features are still represented, and little information is lost.

As we can see, we have transformed two-dimensional data points into one-dimensional data points by projecting them onto a one-dimensional space, i.e. a straight line. If we have 1000 features and need to bring them down to 100 dimensions, we simply create principal component directions and select the best 100, those along which the variance loss is the least. A minimal sketch of this projection idea follows.
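Here is a small sketch of the 2D-to-1D projection using sklearn's PCA; the data is randomly generated, purely as an assumption for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 correlated 2-D points, made up just for this example
x1 = rng.normal(size=200)
X = np.column_stack([x1, 0.8 * x1 + 0.2 * rng.normal(size=200)])

# Ask PCA for a single component: the line of maximum variance (PC1)
pca = PCA(n_components=1)
X_1d = pca.fit_transform(X)              # each 2-D point becomes one number

print(X.shape, "->", X_1d.shape)         # (200, 2) -> (200, 1)
print(pca.explained_variance_ratio_)     # share of the variance kept on PC1
```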

PCA helps to identify the correlations and dependencies among the features in a data set. A covariance matrix expresses the correlation between the different variables in the data set. It is essential to identify heavily dependent variables because they contain redundant information, which can reduce the overall performance of the model.

Mathematically, a covariance matrix is a p × p matrix, where p represents the dimensions of the data set. Each entry in the matrix represents the covariance of the corresponding variables.

Consider a case where we have a two-dimensional data set with variables a and b; the covariance matrix is then the 2×2 matrix shown below:
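Cov = | cov(a, a)   cov(a, b) |
      | cov(b, a)   cov(b, b) |

Here cov(a, a) is simply the variance of a, and cov(a, b) = cov(b, a).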

  • The covariance value denotes how co-dependent two variables are with respect to each other
  • If the covariance value is negative, the respective variables are inversely related: when one increases, the other tends to decrease
  • A positive covariance denotes that the respective variables tend to increase and decrease together (see the small numerical example below)
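A minimal numerical sketch with numpy (the two small arrays are made up purely for illustration):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b_up = 2 * a + 1          # moves together with a  -> positive covariance
b_down = -2 * a + 1       # moves opposite to a    -> negative covariance

print(np.cov(a, b_up))    # off-diagonal entries are positive
print(np.cov(a, b_down))  # off-diagonal entries are negative
```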

Let us see the working of PCA below:

We will work on the breast cancer dataset, which is available in sklearn's datasets module.

  • Importing datasets and necessary libraries.
[Image: Libraries]
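A minimal, typical set of imports for this walkthrough would be something like the sketch below (the DataFrame name df is just an assumption carried through the later steps):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

# Load the breast cancer dataset (30 numeric features)
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
print(df.shape)  # (569, 30)
```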

It is very important to scale our dataset before performing PCA because each feature may be on a different scale. Let's say we have two variables in our data set, one with values ranging between 10 and 100 and the other with values between 1000 and 5000. In such a scenario, the output calculated from these predictor variables is going to be biased, since the variable with the larger range will dominate the outcome.

It can be calculated with the formula below:

[Image: Standardization formula]
[Image: scale]
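The standardization (z-score) formula is z = (x − mean) / standard deviation, computed per feature. A minimal sketch of the scaling step with sklearn's StandardScaler, assuming the df from the import step above:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df)                       # learn each feature's mean and std
scaled_data = scaler.transform(df)   # z = (x - mean) / std, per feature

print(scaled_data.mean(axis=0).round(2))  # ~0 for every feature
print(scaled_data.std(axis=0).round(2))   # ~1 for every feature
```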

Now, after scaling the data, all our features are on the same scale.

We now need to import PCA from sklearn's decomposition module.

[Image: pca]
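A minimal sketch of this step, assuming the scaled_data from above:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)        # keep only the 2 strongest components
pca.fit(scaled_data)
x_pca = pca.transform(scaled_data)

print(scaled_data.shape)  # (569, 30)
print(x_pca.shape)        # (569, 2) -- 30 features reduced to 2
```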

By setting the parameter n_components=2, we reduce the features, or dimensions, from 30 to 2.

We can now plot this on a graph, which will make things clearer.

[Image: plot]
[Image: components]
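A sketch of what the plotting and components steps typically look like, again assuming x_pca, pca and cancer from the earlier steps:

```python
# Scatter plot of the two principal components, coloured by the target class
plt.figure(figsize=(8, 6))
plt.scatter(x_pca[:, 0], x_pca[:, 1], c=cancer.target, cmap='plasma')
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.show()

# Each row is a principal component, each column an original feature
print(pca.components_)    # shape (2, 30)
```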

In the above numpy array, each row represents a principal component and each column relates back to one of the original features. We can visualize this relationship with a heatmap:

[Image: heatmap]
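A minimal sketch of such a heatmap using seaborn (an assumption; matplotlib's imshow would work just as well), reusing pd, plt, pca and cancer from the earlier steps:

```python
import seaborn as sns

# Rows: the 2 principal components; columns: the 30 original features
components_df = pd.DataFrame(pca.components_, columns=cancer.feature_names)

plt.figure(figsize=(12, 4))
sns.heatmap(components_df, cmap='plasma')
plt.show()
```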

Thank You!
