In the previous post, Understanding Principle Component Analysis, Part-1, we saw that univariate or bivariate analysis alone is not enough to extract information from a data set with many variables. Multivariate analysis overcomes this limitation, and Principal Component Analysis (PCA) is one such multivariate method. We also covered the important mathematical terms you need to know before studying PCA.
In this post we look at PCA in more depth: how to implement it in Python, how it works, and how it helps extract information from a data set with many variables.
Principal Component Analysis:
When data consists of many variables, it is hard to find patterns, because we cannot easily plot more than two or three dimensions at once. In such cases Principal Component Analysis helps us analyse the data. Its main merit is that it lets you find patterns by reducing the number of variables, without much loss of information from the data set.
This post takes you through all the steps needed to perform Principal Component Analysis on a data set.
We will use the Wine data set from the UCI repository, loaded with the pandas library, which makes much of the work easy and simple to code. After loading, we split the data into X (the feature variables) and y (the target variable).
Our Wine data is a 178 x 13 matrix, where the columns are the different features and each row represents one wine sample.
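The loading step might look like the sketch below. It is only an illustration: it assumes scikit-learn's bundled copy of the Wine data (`load_wine`) rather than a file downloaded from the UCI repository, and the names `df`, `X` and `y` are chosen here for convenience.

```python
import pandas as pd
from sklearn.datasets import load_wine

# Assumption: use scikit-learn's bundled copy of the UCI Wine data
# instead of downloading the raw file from the UCI repository.
wine = load_wine(as_frame=True)
df = wine.frame  # pandas DataFrame: 13 feature columns plus a "target" column

# X holds the feature variables, y the class label of each wine sample.
X = df.drop(columns="target").values
y = df["target"].values

print(X.shape)  # (178, 13)
print(pd.Series(y).value_counts().sort_index())  # number of samples per class
```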
Whether to standardise the data before running PCA on the covariance matrix depends on the measurement scales of the features. PCA builds a feature subspace that maximises the variance along its axes, so features measured on larger scales would dominate the result. It is therefore preferable to standardise every feature to unit scale (mean = 0 and variance = 1). Standardising can be done with sklearn.
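As a minimal sketch of the standardisation step, assuming scikit-learn's `StandardScaler` and the bundled `load_wine` data (the original post may have loaded the data differently):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Assumption: bundled Wine data, returned as (features, labels).
X, y = load_wine(return_X_y=True)

# Rescale every feature to mean 0 and variance 1.
X_std = StandardScaler().fit_transform(X)

# Sanity check: each column now has mean ~0 and standard deviation ~1.
print(np.allclose(X_std.mean(axis=0), 0))  # True
print(np.allclose(X_std.std(axis=0), 1))   # True
```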
The covariance matrix is described in the previous post. We can compute it in two ways: either manually, using the mean of each standardised feature in X, or with numpy's covariance function.
In the second approach we compute the covariance matrix directly with numpy on the standardised features. With numpy's covariance function we do not need to compute the mean vector ourselves, which makes our work easier.
Both methods give the same output, but the second is preferred because it is less complex.
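The two approaches can be compared side by side. This is a sketch under the same assumptions as before (bundled `load_wine` data, `StandardScaler` for standardising). The names `cov_manual` and `cov_numpy` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Method 1: by hand, from the mean vector of the standardised features.
mean_vec = X_std.mean(axis=0)
n_samples = X_std.shape[0]
centered = X_std - mean_vec
cov_manual = centered.T @ centered / (n_samples - 1)

# Method 2: numpy does the centring internally.
# np.cov expects variables in rows, hence the transpose.
cov_numpy = np.cov(X_std.T)

print(cov_numpy.shape)                     # (13, 13)
print(np.allclose(cov_manual, cov_numpy))  # True
```

Note that `np.cov` divides by n - 1 by default, so the manual version uses the same divisor to match it.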
4) Eigenvectors and Eigenvalues:
Here we find the eigenvalues and eigenvectors together, using just numpy and a ‘for’ loop.
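One way this step could be written is sketched below. It uses `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix. The original post may have used a different numpy call. The pairing and sorting logic, and the names `eig_pairs` and `explained`, are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)
cov_mat = np.cov(X_std.T)

# Eigendecomposition of the symmetric covariance matrix.
# Column i of eig_vecs is the eigenvector for eig_vals[i].
eig_vals, eig_vecs = np.linalg.eigh(cov_mat)

# Pair each eigenvalue with its eigenvector, then sort by eigenvalue, largest first.
eig_pairs = []
for i in range(len(eig_vals)):
    eig_pairs.append((eig_vals[i], eig_vecs[:, i]))
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)

# Share of the total variance explained by each principal component.
total = eig_vals.sum()
explained = [val / total for val, _ in eig_pairs]
for i, share in enumerate(explained, start=1):
    print(f"PC{i}: {share:.3f}")
```

The components with the largest eigenvalues carry the most variance, so they are the ones worth keeping when reducing the number of variables.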
Plotting the result of all the code above shows how the three target classes are spread across the graph: each class forms its own cluster. To get there, we take the dot product of X_std (the standardised data computed above) with the matrix of leading eigenvectors, and plot the projected points, which shows the three clusters.
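The projection behind that plot can be sketched as follows, assuming the two leading eigenvectors are used as the projection matrix (called `W` here, a name chosen for illustration) and the same bundled data as in the earlier sketches.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

eig_vals, eig_vecs = np.linalg.eigh(np.cov(X_std.T))

# Indices of the eigenvalues from largest to smallest.
order = np.argsort(eig_vals)[::-1]

# Projection matrix: the two eigenvectors with the largest eigenvalues (13 x 2).
W = eig_vecs[:, order[:2]]

# Dot product of the standardised data with W gives the 2-D coordinates (178 x 2).
Y = X_std @ W

# Centre of each class in the new 2-D space.
for label in np.unique(y):
    print(label, Y[y == label].mean(axis=0))
```

Passing the two columns of `Y` to a scatter plot, coloured by `y`, would reproduce the kind of cluster picture described above.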
We implemented PCA above using plain Python and numpy, but there is a simpler and shorter alternative: the PCA class in sklearn. It can be fitted on the training set and then applied to the test set.
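A minimal sketch of the sklearn route is shown below. The 80/20 split, `random_state=0`, stratified sampling and `n_components=2` are assumptions made for the example, not settings taken from the original post.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Assumed split: 80% train, 20% test, stratified by class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit the scaler on the training set only, then reuse it on the test set.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# Fit PCA on the training set only, then project both sets onto 2 components.
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)

print(X_train_pca.shape, X_test_pca.shape)
print(pca.explained_variance_ratio_)  # variance share of each kept component
```

Fitting the scaler and the PCA on the training set alone keeps information from the test set out of the transformation.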
Unlike the previous post, which covered only the theoretical definitions of the mathematical terms, this post covered a practical implementation using Python and numpy, followed by a graphical representation of the data set. Although that implementation looks simple, sklearn provides a more straightforward way to implement PCA. I hope this post has helped you understand multivariate analysis and Principal Component Analysis.