Analyzing data using Principal Component Analysis (PCA)
A guide to knowing and understanding PCA using Python.
What is Principal component analysis?
Principal component analysis (PCA) is an unsupervised learning technique, also known as general factor analysis. It is used to study the interrelations among a set of variables in order to figure out the underlying structure of those variables, and it is widely used to explore and analyze data.
How does it work?
PCA produces a set of orthogonal axes that fit the data well. Orthogonal lines are lines perpendicular to each other in n-dimensional space; if a regression line is drawn through the data, a line perpendicular to it is orthogonal to it. This is where the concept of components comes into the picture. The components are a linear transformation of the original variables, chosen so that the direction of greatest variance in the dataset lies along the first axis, the second greatest variance along the second axis, and so on. Keeping only the first few components reduces the number of variables used during analysis.
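The procedure above can be sketched directly with NumPy: center the data, eigendecompose its covariance matrix, and sort the eigenvectors by decreasing variance. This is a minimal illustration of the idea (the toy data and variable names here are made up for the example), not the implementation scikit-learn uses internally.

```python
import numpy as np

# Toy 2-D data with most of its variance along one diagonal direction
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])

# Center the data, then eigendecompose the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort components by decreasing variance (eigenvalue)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The components are orthogonal: their dot product is ~0
print(np.dot(eigvecs[:, 0], eigvecs[:, 1]))

# Project the data onto the first principal component (dimensionality 2 -> 1)
projected = X_centered @ eigvecs[:, :1]
print(projected.shape)
```

Each row of `eigvecs.T` plays the same role as a row of scikit-learn's `components_` attribute shown later in this article.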
How to implement using Python?
The dataset used here is the built-in scikit-learn breast cancer dataset. Using PCA, the data will be transformed to find out which features explain the most variance.
→ Import libraries
The basic data-handling libraries, pandas and NumPy, are imported, along with matplotlib and seaborn for visualization.
>>> import pandas as pd
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> %matplotlib inline
→ Read data
The breast cancer data is imported from sklearn.
>>> from sklearn.datasets import load_breast_cancer
>>> data = load_breast_cancer()
>>> data.keys()
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
>>> df = pd.DataFrame(data['data'],columns=data['feature_names'])
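As a quick sanity check, the resulting DataFrame can be inspected; per the scikit-learn documentation, this dataset has 569 samples and 30 numeric features.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data['data'], columns=data['feature_names'])

# 569 samples, 30 numeric features
print(df.shape)
print(df.columns[:3].tolist())
```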
→ PCA visualization
PCA will be used to find the first 2 principal components and visualize the data in the new 2-D space. To do this, the data is first scaled so that each feature has unit variance. From sklearn, the StandardScaler class is imported and its object is fit on the data.
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()
>>> scaler.fit(df)
Now, this data is transformed.
>>> scaled_data = scaler.transform(df)
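After the transform, each column should have approximately zero mean and unit variance, which can be verified directly:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
df = pd.DataFrame(data['data'], columns=data['feature_names'])

scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)

# Each feature now has mean ~0 and standard deviation 1
print(np.allclose(scaled_data.mean(axis=0), 0))
print(np.allclose(scaled_data.std(axis=0), 1))
```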
Now from sklearn, PCA is imported and its object is created. The number of components is specified. Then the principal components are found using the fit() method. Then using the transform() function, rotation and dimensionality reduction are done.
>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=2)
>>> pca.fit(scaled_data)
The data is transformed into its first 2 principal components.
>>> x_pca = pca.transform(scaled_data)
So now the original data and the transformed data can be compared to see that the transformed data has 2 components.
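The comparison amounts to checking the shapes of the two arrays:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

data = load_breast_cancer()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
scaled_data = StandardScaler().fit_transform(df)

pca = PCA(n_components=2)
x_pca = pca.fit_transform(scaled_data)

# Original: 30 features; transformed: 2 principal components
print(scaled_data.shape)
print(x_pca.shape)
```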
So now from 30 dimensions, it is reduced to 2. The two components can be plotted.
>>> plt.figure(figsize=(8,6))
>>> plt.scatter(x_pca[:,0], x_pca[:,1], c=data['target'])
>>> plt.xlabel('First principal component')
>>> plt.ylabel('Second principal component')
So using the two components, classification can be done easily.
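As a sketch of that claim, a simple classifier can be trained on just the two components. The choice of LogisticRegression and the train/test split here are illustrative assumptions, not part of the original walkthrough.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
scaled = StandardScaler().fit_transform(data['data'])
x_pca = PCA(n_components=2).fit_transform(scaled)

# Split the 2-D representation into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    x_pca, data['target'], random_state=42)

# A linear classifier on just 2 components already separates the classes well
clf = LogisticRegression().fit(X_train, y_train)
print(clf.score(X_test, y_test))
```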
→ Understanding the components
The components are linear combinations of the original features. They are stored in the components_ attribute of the fitted PCA object. Looking at pca.components_ below, which is a NumPy array, each row represents a principal component and each column corresponds to one of the original features.
array([[ 0.21890244, 0.10372458, 0.22753729, 0.22099499, 0.14258969,
0.23928535, 0.25840048, 0.26085376, 0.13816696, 0.06436335,
0.20597878, 0.01742803, 0.21132592, 0.20286964, 0.01453145,
0.17039345, 0.15358979, 0.1834174 , 0.04249842, 0.10256832,
0.22799663, 0.10446933, 0.23663968, 0.22487053, 0.12795256,
0.21009588, 0.22876753, 0.25088597, 0.12290456, 0.13178394],
[-0.23385713, -0.05970609, -0.21518136, -0.23107671, 0.18611302,
0.15189161, 0.06016536, -0.0347675 , 0.19034877, 0.36657547,
-0.10555215, 0.08997968, -0.08945723, -0.15229263, 0.20443045,
0.2327159 , 0.19720728, 0.13032156, 0.183848 , 0.28009203,
-0.21986638, -0.0454673 , -0.19987843, -0.21935186, 0.17230435,
0.14359317, 0.09796411, -0.00825724, 0.14188335, 0.27533947]])
A heatmap can be drawn to show the relationship between the features and the principal components.
>>> comp = pd.DataFrame(pca.components_, columns=data['feature_names'])
>>> sns.heatmap(comp, cmap='plasma')
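It is also worth checking how much of the total variance the two components capture. The fitted PCA object exposes this through its explained_variance_ratio_ attribute; on this dataset the first two components together account for roughly 60% of the variance.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

data = load_breast_cancer()
scaled = StandardScaler().fit_transform(data['data'])

pca = PCA(n_components=2)
pca.fit(scaled)

# Fraction of total variance explained by each component, in decreasing order
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```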
Refer to the notebook here.