Dimensionality Reduction with PCA

M. Husni Nur Fadillah
8 min read · Jan 29, 2022



Hello everyone, I hope you are still enthusiastic about learning. As usual, I'm back with an article that I hope will help you. This one covers a fairly complex topic, Principal Component Analysis, but don't worry: I will do my best to explain the technique so that you can apply it yourself in the future.

Introduction

Principal component analysis (PCA) is a pattern recognition technique, and one of its applications is analysing high-dimensional data that is hard to understand just by looking at the raw values¹. Mathematically, PCA is a feature extraction technique that generates new features which are linear combinations of the initial features². Because PCA reduces the data to a simpler, lower-dimensional representation, it effectively extracts the most informative features from a given dataset.

Objectives

  • Understand the theory behind PCA
  • Calculate PCA with two different approaches
  • See the impact of PCA on a classification problem

Why we use PCA?

The motivation for reducing the dimensionality of our data is to avoid overfitting and redundancy, and to lower the computational cost.

The goals of PCA³ are to:

  1. Extract the most important information from the data table
  2. Compress the data set by keeping only this important information
  3. Simplify the description of the data set
  4. Analyze the structure of the observations and the variables
  5. Compress the data, by reducing the number of dimensions, without much loss of information
  6. Support applications such as image compression

Besides that, several studies have analyzed the impact of PCA. For example, “Comparative Study of Dimensionality Reduction Techniques for Spectral-Temporal Data” by Shingchern D. You and Min-Jen Hung studies three different approaches (including PCA) to reducing the dimensionality of a type of spectral-temporal feature. Another study on the impact of PCA is “Dimensionality Reduction using Principal Component Analysis for Network Intrusion Detection” by K. Keerthi Vasan and B. Surendiran.

Algorithm

Figure: flow chart of the PCA algorithm

The following is the PCA algorithm⁴ (a compact code sketch follows the list):

  • Calculate the mean vector of all the data
  • Subtract the mean vector from every data point
  • Calculate the covariance matrix
  • Calculate the eigenvalues and eigenvectors of the covariance (or correlation) matrix, and arrange them in descending order of eigenvalue
  • Select the K eigenvectors corresponding to the K largest eigenvalues to construct the matrix U_K, whose columns form an orthogonal system. These K vectors, also known as the principal components, span a subspace close to the normalized data
  • Project the normalized data matrix onto that subspace
  • The new data are the coordinates of the data points in the new space
  • The original data can be approximated from the new data: a projected point z maps back to approximately x̄ + U_K·z, where x̄ is the mean vector
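To make these steps concrete, here is a minimal NumPy sketch of the whole procedure. This is only an illustration of the algorithm above, not the notebook code from this article, and the names (pca, X, n_components) are mine:

import numpy as np

def pca(X, n_components):
    # Steps 1-2: compute the mean vector and center the data around it
    mean = X.mean(axis=0)
    X_centered = X - mean
    # Step 3: covariance matrix of the centered data (columns = variables)
    cov = np.cov(X_centered, rowvar=False)
    # Step 4: eigendecomposition; eigh suits symmetric matrices and
    # returns eigenvalues in ascending order, so flip to descending
    values, vectors = np.linalg.eigh(cov)
    order = np.argsort(values)[::-1]
    values, vectors = values[order], vectors[:, order]
    # Step 5: keep the K eigenvectors with the largest eigenvalues
    U_k = vectors[:, :n_components]
    # Steps 6-7: project the centered data onto the subspace
    Z = X_centered @ U_k
    # Step 8: approximate the original data from the projection
    X_approx = mean + Z @ U_k.T
    return Z, X_approx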

Data and Code

The code (notebook) that I use for this article is on my Kaggle account, and here's the link.

The data that I use in this article is Employee Future Prediction. It contains the following variables:

  • Education: the education level the employee has
  • Joining year: the year the employee joined
  • City: the city of the office where the employee is posted
  • Payment tier: highest, mid level, or lowest
  • Age: the employee's age
  • Gender: the employee's gender
  • EverBenched: whether the employee was ever kept out of projects for 1 month or more
  • ExperienceInCurrentField: the employee's experience in the current field
  • LeaveOrNot: whether the employee leaves the company in the next 2 years
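The feature matrix X_scaled used in the code below comes from the notebook's preprocessing, which this article does not show. A plausible version of that step, assuming the categorical columns are one-hot encoded and all features are then standardized (the file name and column names here are my assumptions, based on the Kaggle dataset), looks like this:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the Employee Future Prediction data (file name is an assumption)
df = pd.read_csv('Employee.csv')

X = df.drop(columns=['LeaveOrNot'])  # features
y = df['LeaveOrNot']                 # target

# One-hot encode the categorical columns; drop_first avoids redundant dummies
X = pd.get_dummies(X, columns=['Education', 'City', 'Gender', 'EverBenched'],
                   drop_first=True)

# Standardize every feature to zero mean and unit variance before PCA
X_scaled = StandardScaler().fit_transform(X)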

Calculate PCA

There are two approaches that I will describe for computing PCA in this article, namely NumPy and scikit-learn.

NumPy

First we calculate the mean vector, using the mean function from NumPy.

import numpy as np

mean_vector = np.mean(X_scaled, axis=0)
mean_vector

Then calculate the covariance matrix:

cov_mat = (X_scaled - mean_vector).T.dot((X_scaled - mean_vector)) / (X_scaled.shape[0] - 1)
print('Covariance matrix \n%s' % cov_mat)
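By the way, if you prefer not to write the formula by hand, NumPy's np.cov computes the same matrix; rowvar=False tells it that columns, not rows, are the variables:

cov_mat = np.cov(X_scaled, rowvar=False)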

Calculate the eigenvalues and eigenvectors:

eigen_values, eigen_vectors = np.linalg.eig(cov_mat)
print('Eigenvectors \n%s' % eigen_vectors)
print('\nEigenvalues \n%s' % eigen_values)
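A small aside: since a covariance matrix is always symmetric, np.linalg.eigh is a good alternative to np.linalg.eig here. It guarantees real eigenvalues (eig can return complex ones due to rounding) and gives them back in ascending order, so you would sort them the other way around:

eigen_values, eigen_vectors = np.linalg.eigh(cov_mat)  # ascending eigenvalues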

Selecting the principal components:

# Make a list of (eigenvalue, eigenvector) tuples
eigen_pairs = [(np.abs(eigen_values[i]), eigen_vectors[:, i]) for i in range(len(eigen_values))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eigen_pairs.sort(key=lambda x: x[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalue
print('Eigenvalues in descending order:')
for i in eigen_pairs:
    print(i[0])

import matplotlib.pyplot as plt

# Explained variance ratio of each component: eigenvalue / total variance
total = sum(eigen_values)
var_exp = [value / total for value in sorted(eigen_values, reverse=True)]

plt.figure(figsize=(6, 4))
plt.bar(range(10), var_exp, alpha=0.5, align='center', label='individual explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
Figure: explained variance of each principal component

We can see from the plot above that the first principal component (PC) explains close to 25% of the variance, while the second, third and fourth PCs each explain above 10%, and so on. From this plot we can decide which principal components, and how many of them, to keep.

Next I want to create a projection matrix. For this calculation I will use just two principal components.

matrix_w = np.hstack((eigen_pairs[0][1].reshape(10, 1),
                      eigen_pairs[1][1].reshape(10, 1)))
print('Matrix W: \n', matrix_w)

Then we get the new data, i.e. the coordinates of the data points in the new space, based on my decision to use just two principal components:

X_pca = X_scaled.dot(matrix_w)
X_pca

Congrats, you just calculated PCA by hand and applied it to your data. Next we try the simpler approach using scikit-learn.
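Before we switch over, here is an optional sanity check (my own aside, not part of the notebook): the manual projection should match scikit-learn's output up to sign, because the direction of each eigenvector is arbitrary.

from sklearn.decomposition import PCA

X_sklearn = PCA(n_components=2).fit_transform(X_scaled)
# Compare magnitudes, since individual columns may be flipped in sign
print(np.allclose(np.abs(X_pca), np.abs(X_sklearn)))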

Scikit-Learn

With scikit-learn we don't have to write as many lines of code as before, because it provides a PCA class that handles the whole technique. But first I want to get some insight into the explained variance.

from sklearn.decomposition import PCA

pca = PCA().fit(X_scaled)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlim(0, 10)
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.show()

Different from before, now we can see the explained variance across the principal components cumulatively. The plot shows that the first 7 principal components capture almost 90% of the variance. Therefore we can remove the rest.

# create a PCA instance
pca = PCA(n_components=7)
pca_data = pca.fit_transform(X_scaled)
pca_data

To see the explained variation per principal component, you can run this code:

print('Explained variation per principal component: {}'.format(pca.explained_variance_ratio_))
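As a side note, scikit-learn can also choose the number of components for you: if you pass a float between 0 and 1 as n_components, PCA keeps however many components are needed to reach that fraction of explained variance.

# Keep enough components to explain 90% of the variance
pca = PCA(n_components=0.9)
pca_data = pca.fit_transform(X_scaled)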

Results

After applying PCA, I build classification models that predict whether an employee will leave the company in the next 2 years. In this article I use the Logistic Regression, Decision Tree and Random Forest algorithms, and for evaluation metrics I use the confusion matrix, precision, recall and F1-score.

First I want to tell you that the data I use is imbalanced, and I do not apply an oversampling or undersampling technique, because that problem is outside our main goal here.

Figure: class distribution of the target (imbalanced data)

So don't be surprised when you see that the F1-score for class 0 is higher than for class 1.
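The modelling code itself lives in the notebook. For readers who want to reproduce the comparison, here is a minimal sketch of one model pair, plain features versus PCA features; the split parameters (test_size, random_state) are my assumptions, not necessarily the notebook's.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Use the same split settings for both feature sets so the comparison is fair
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y)
Xp_train, Xp_test, _, _ = train_test_split(
    pca_data, y, test_size=0.2, random_state=42, stratify=y)

for name, (train, test) in {'plain': (X_train, X_test),
                            'PCA': (Xp_train, Xp_test)}.items():
    model = LogisticRegression(max_iter=1000).fit(train, y_train)
    pred = model.predict(test)
    print(name)
    print(confusion_matrix(y_test, pred))
    print(classification_report(y_test, pred))

Swapping LogisticRegression for DecisionTreeClassifier or RandomForestClassifier gives the other two comparisons below.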

Logistic Regression

  • Logistic Regression
Figure: confusion matrix, Logistic Regression
  • Logistic Regression + PCA

Decision Tree

  • Decision Tree
  • Decision Tree + PCA

Random Forest

  • Random Forest
  • Random Forest + PCA

Conclusions

There was very little difference between the results of the models that used PCA and those that did not, even though, as we know, some information is lost when data is reduced with PCA. The impact of PCA is also hard to see here because the data does not have many features and the algorithms we use are not complex, so the risk of overfitting, which is one of the main reasons to use PCA in the first place, was already small.

References

[1] S. Sehgal et al., “Data Analysis Using Principal Component Analysis”, International Conference on Medical Imaging, m-Health and Emerging Communication Systems (MedCom), 2014.

[2] K. Keerthi Vasan and B. Surendiran, “Dimensionality Reduction using Principal Component Analysis for Network Intrusion Detection”, Perspectives in Science, Vol. 8, pp. 510–512, 2016.

[3] S. P. Mishra et al., “Multivariate Statistical Data Analysis — Principal Component Analysis (PCA)”, International Journal of Livestock Research, Vol. 7, 2017.

[4] M. R. Mahmoudi et al., “Principal component analysis to study the relations between the spread rates of COVID-19 in high risks countries”, Alexandria Engineering Journal, Vol. 60, pp. 457–464, 2021.
