Dimensionality Reduction and PCA

Ana Belén Manjavacas
Aug 8, 2023


As a Data Scientist, one of the most significant challenges in a project is dealing with large datasets that contain numerous dimensions or features. This complexity significantly hampers the data interpretation process.

Obviously, a minimum number of features is needed to build a good model; however, having too many of them can hurt model performance and cause overfitting.

Therefore, the idea is to use only a carefully selected subset of attributes, the ones carrying the most relevant information about the problem. This approach results in dimensionality reduction, a process that offers several advantages. In particular, it reduces the computation time required by the model, mitigates the risk of overfitting, and effectively enhances the overall performance of the model.

In the context of dimensionality reduction, two primary strategies emerge: feature selection, which involves choosing a subset of the original features through various algorithms, and feature extraction or construction, which entails creating new features that encapsulate the most pertinent information from the original attributes. It is in the latter that Principal Component Analysis (PCA) finds its place.

The idea is to find a transformation that preserves the information of the problem while minimising the number of components. The optimal transformation would generally not be linear; however, that would make the problem very complicated, so linear transformations are used in practice.

PCA operates on the principle of signal representation: the primary aim of the feature extraction transformation is to represent the attribute vectors accurately in a lower-dimensional space.

These components represent the data points in a more concise manner, leading to a compressed and efficient representation that retains the essential patterns and structures of the original data. By achieving this, PCA facilitates insightful data analysis, visualisation and model building in situations where high-dimensional data can be computationally burdensome and challenging to interpret.

Maths behind Principal Component Analysis

The objective of PCA is to reduce dimensionality while maximising the information kept from the original data, without taking the class labels into consideration.

PCA reduces the dimension by projecting the original data onto M vectors chosen so as to minimise the projection error. M is the final dimension, so if you want your final space to be a 2D space, then M = 2.

To find the M vectors, or the Principal Components, we are going to look at the covariance matrix of the data. In essence, our objective is to analyse how the variables vary around their means and with each other, quantified as covariance, which represents the degree of deviation from the mean. By constructing a covariance matrix, we seek to identify the most informative directions in the data. These directions are characterised by the highest eigenvalues, which emerge from the diagonalisation of the covariance matrix, performed after the features have been standardised.

Standardising the data

PCA is quite sensitive to the variances of the initial variables, so we start by standardising the initial features. Otherwise, the variables with larger ranges will dominate over those with smaller ranges, creating biased results. Standardising the data prevents this problem. Mathematically:
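z = (x - μ) / σ

where μ is the mean and σ the standard deviation of each feature, so that every feature ends up with mean 0 and variance 1.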

Getting the covariance matrix and diagonalization process

The covariance matrix of the standardised data is given by the formula:
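C = (1 / (n - 1)) · Σᵢ (xᵢ - x̄)(xᵢ - x̄)ᵀ

where xᵢ are the (standardised) sample vectors, x̄ is their mean, and n is the number of samples; the entry Cⱼₖ is the covariance between features j and k.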

This is the matrix whose eigenvectors and eigenvalues we need to find: the eigenvectors are the directions (the principal components), and the eigenvalues represent the amount of variance along each principal component.

Then we need to reorder the eigenvectors so that those with higher eigenvalues come before those with lower ones, and select the k eigenvectors corresponding to the k highest eigenvalues.
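As a minimal sketch of these two steps with NumPy (the data matrix here is a random placeholder, and the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X_std = rng.standard_normal((100, 5))   # placeholder for the standardised data

cov_matrix = np.cov(X_std, rowvar=False)                 # feature covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)   # eigh is suited to symmetric matrices

# Sort by decreasing eigenvalue and keep the k leading eigenvectors
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

k = 2
W = eigenvectors[:, :k]        # projection matrix (n_features x k)
X_projected = X_std @ W        # data expressed in the first k principal components
```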

Dimensionality reduction

To reduce dimensionality, the idea is to keep the k eigenvectors that carry the most information about our data.

If we want to keep a certain percentage of the total variance, there is a simple criterion to apply. For example, to keep 95% of the variance we need:
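(λ₁ + λ₂ + … + λₖ) / (λ₁ + λ₂ + … + λₙ) ≥ 0.95

where λ₁ ≥ λ₂ ≥ … ≥ λₙ are the eigenvalues of the covariance matrix and k is the number of components we keep: the smallest k for which the ratio reaches 0.95 retains at least 95% of the total variance.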

Python Example

We are going to use the wine dataset as an example, applying PCA and clustering to solve the problem. First, we import the libraries we will need and load the data.
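A sketch of the setup, assuming the wine data is loaded from scikit-learn's load_wine (the original source file may differ):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

# Load the wine dataset as a DataFrame
wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
y_true = wine.target   # the 3 real wine types, kept aside and not used for clustering
```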

Now we will apply clustering to the wine dataset. The goal is to check whether the clustering recovers the different real wine types.

The dataset describes the parameters of different wine instances. There are 3 types of wine and 13 features with the levels of the most important indicators:

  • Alcohol
  • Malic acid
  • Ash
  • Ash alkalinity
  • Magnesium
  • Total phenols
  • Flavanoids
  • Nonflavanoid phenols
  • Proanthocyanins
  • Color intensity
  • Hue
  • OD280/OD315 of diluted wines
  • Proline

Data Description

Data Exploration
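A quick look at the data, as a minimal sketch, usually covers its size, summary statistics and the first few rows:

```python
print(df.shape)        # number of samples and features
print(df.describe())   # per-feature summary statistics
print(df.head())       # first rows of the dataset
```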

After loading the data we need to do some basic processing: standardisation and PCA.
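A minimal sketch of both steps with scikit-learn, keeping two components here so the result can be plotted later (the exact number of components kept is an assumption):

```python
# Standardise the features so that each has mean 0 and variance 1
scaler = StandardScaler()
X_std = scaler.fit_transform(df)

# Project the standardised data onto the first two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)   # share of variance captured by each component
```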

Clustering and K-means

We are going to run k-means, looking for the optimal number of clusters according to the Calinski-Harabasz score:
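One way to do this, sketched here, is to fit k-means for a range of k values and keep the one with the highest score (the range of k and the random seed are assumptions):

```python
best_k, best_score = None, -np.inf
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_pca)
    score = calinski_harabasz_score(X_pca, km.labels_)
    print(f"k={k}: Calinski-Harabasz score = {score:.2f}")
    if score > best_score:
        best_k, best_score = k, score

# Refit k-means with the best number of clusters
kmeans = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(X_pca)
```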

The resulting k-means cluster centres can then be inspected, along with other things from our data, such as the k-means labels and the shape of our dataframe:
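For example:

```python
print(kmeans.cluster_centers_)   # coordinates of the cluster centres in PCA space
print(kmeans.labels_)            # cluster assigned to each wine sample
print(df.shape)                  # (number of samples, number of features)
```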

Now we are going to plot two of the principal components against the 3 clusters to see how it looks.

The code aims to visualise the clustering results in a 2D space using PCA and to provide statistics about the distribution of data points in each cluster, as well as the distribution of the different real classes within each cluster.
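A sketch of such a visualisation and of the class counts per cluster (pd.crosstab is one way to build that table; the colour map is arbitrary):

```python
# Scatter plot of the first two principal components, coloured by k-means cluster
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=kmeans.labels_, cmap="viridis")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("K-means clusters in PCA space")
plt.show()

# Distribution of the real wine types within each cluster
print(pd.crosstab(kmeans.labels_, y_true, rownames=["cluster"], colnames=["real class"]))
```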

We can also get the cluster centres. Finally, we create a dataframe with our data and the predicted clusters, so that we can interpret the results.
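As a minimal sketch:

```python
df_clusters = df.copy()
df_clusters["cluster"] = kmeans.labels_   # attach the predicted cluster to each sample
```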

Interpretation of the obtained clusters

We are going to check the mean of each feature per cluster. The heatmap represents the mean standardised attribute values grouped by cluster.
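A sketch of how such a heatmap can be produced (seaborn is one option for the plot):

```python
# Mean standardised attribute value per cluster
df_std = pd.DataFrame(X_std, columns=df.columns)
df_std["cluster"] = kmeans.labels_
cluster_means = df_std.groupby("cluster").mean()

plt.figure(figsize=(12, 4))
sns.heatmap(cluster_means, cmap="coolwarm", annot=True, fmt=".2f")
plt.title("Mean standardised attribute value per cluster")
plt.show()
```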

Interpretation of the clusters using a decision tree

To finish, we are going to use a decision tree to show what our clustered data looks like as a tree.
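One way to do this, sketched below, is to train a shallow decision tree that predicts the cluster label from the original features and then plot it (the depth limit is an assumption to keep the tree readable):

```python
from sklearn.tree import DecisionTreeClassifier, plot_tree

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(df, kmeans.labels_)   # learn rules that separate the clusters

plt.figure(figsize=(14, 8))
plot_tree(tree, feature_names=list(df.columns),
          class_names=[f"cluster {c}" for c in sorted(set(kmeans.labels_))], filled=True)
plt.show()
```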

