PCA, LDA and PLS exposed with python — part 1: Principal Component Analysis
In this post I want to look at the main differences between the PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis) and PLS (Partial Least Squares) algorithms and their use in a typical classification/regression problem. I will be using Python and the implementations available in sklearn. The second part of this tutorial will focus on LDA and PLS.
PCA is an unsupervised dimensionality reduction technique, now ubiquitous in machine learning and long established in the field of chemometrics and multivariate analysis.
We will then see LDA, which brings a PCA-like approach to classification, and finally PLS, which can be used both for classification and for regression.
Let’s start!
We start by importing the usual libraries:
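The original import cell is not shown here, so the following is a sketch of the imports this tutorial relies on (numpy, pandas, matplotlib and seaborn for plotting, plus the sklearn classes used later):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA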
The dataset we are going to work with is a generated one: we are interested in the height and weight of people, with the features Sex, Age, Height and Weight. Our dataset is very simple; here are the statistics:
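As a rough sketch, a dataset with this structure could be generated as below. The distributions, the sample size and the one-hot encoding of Sex into a Sex_male column are assumptions, since the original generation code is not shown:

rng = np.random.default_rng(42)
n = 500
sex = rng.choice(["male", "female"], size=n)
age = rng.integers(18, 95, size=n)  # ages drawn over a wide range, hence some very old people
height = np.where(sex == "male", rng.normal(178, 7, n), rng.normal(165, 6, n))
weight = np.where(sex == "male", rng.normal(82, 10, n), rng.normal(68, 9, n))
df = pd.DataFrame({"Age": age, "Height": height, "Weight": weight, "Sex": sex})
df = pd.get_dummies(df, columns=["Sex"], drop_first=True, dtype=int)  # keeps a single Sex_male dummy column
print(df.describe())  # the summary statistics referred to above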
There are some very old people in there!
Plotted on two axes, the dataset looks like the graph below.
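A minimal sketch of that scatter plot, assuming Height on the x axis, Weight on the y axis and the classes coloured red (male) and blue (female):

plt.figure(figsize=(7, 5))
colours = np.where(df["Sex_male"] == 1, "red", "blue")  # red = male, blue = female
plt.scatter(df["Height"], df["Weight"], c=colours, alpha=0.6)
plt.xlabel("Height")
plt.ylabel("Weight")
plt.show()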
As we can see, it seems easy to establish a classification between male and female from this dataset. The distribution plots shown below, however, reveal a significant overlap between the two classes, more pronounced in the height distribution.
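The distribution plots can be drawn, for example, with seaborn; again this is a sketch, since the original plotting code is not shown:

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for col, ax in zip(["Height", "Weight"], axes):
    sns.kdeplot(data=df, x=col, hue="Sex_male", fill=True, ax=ax)  # one density per class
plt.show()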
PCA
We are now interested in the following question: as it is difficult to plot more than two variables together, can PCA help us?
PCA works by projecting the dataset onto the directions of maximum variance of our variables. For our dataset this mostly means following the relationship between height and weight. We have 4 variables, so we can start by working with 3 principal components.
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df) #fit the scaling to our dataframe and transform the data
pca = PCA(n_components=3)  # PCA is computed from the covariance matrix of the scaled data
x_pca = pca.fit_transform(scaled_data)  # project the data onto the principal components
variance = pca.explained_variance_ratio_  # variance ratio explained by each component
var = np.cumsum(np.round(pca.explained_variance_ratio_, decimals=3)*100)  # cumulative explained variance in percent
print(var)
The variable “var” holds the cumulative percentage of variance explained by the principal components. In our case, starting from 4 variables, keeping 3 components retains 99.4% of the information; keeping 2 retains almost 82% of the explained variance.
What PCA does is rotate the axes towards the directions of maximum variance, so that each principal component is a linear combination of the original variables. We are interested in the loadings and the scores for our dataset: the loadings plot is a plot of the direction vectors that define the model.
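The loadings can be read from pca.components_, which has one row per component and one column per original variable. A sketch of a loadings plot for the first two components (the plotting style is an assumption):

loadings = pd.DataFrame(pca.components_.T, index=df.columns, columns=["PC1", "PC2", "PC3"])
plt.figure(figsize=(6, 6))
plt.scatter(loadings["PC1"], loadings["PC2"])
for name, row in loadings.iterrows():
    plt.annotate(name, (row["PC1"], row["PC2"]))  # label each variable with its name
plt.axhline(0, color="grey", lw=0.5)
plt.axvline(0, color="grey", lw=0.5)
plt.xlabel("PC1 loadings")
plt.ylabel("PC2 loadings")
plt.show()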
From the picture below we can see that the first principal component is dominated by Height, Weight and Sex_male, which means these 3 variables are highly correlated with each other. On the second principal component we find Age. Since this variable sits at approximately 0 on the x axis, it is largely independent of the others.
Let’s look at the scores plot:
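A sketch of the scores plot, colouring each observation by its true class:

plt.figure(figsize=(7, 5))
colours = np.where(df["Sex_male"] == 1, "red", "blue")  # red = male, blue = female
plt.scatter(x_pca[:, 0], x_pca[:, 1], c=colours, alpha=0.6)
plt.xlabel("PC1 scores")
plt.ylabel("PC2 scores")
plt.show()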
We encoded all the dimensions of the dataset in this plot to make a few points. The first thing to look at is the colour of the points: the red points (male) sit mostly on the positive x axis, while the blue ones (female) sit on the negative side. We saw from the loadings plot that Sex_male points along the positive x axis, so as that variable increases a point shifts to the right.
With the same logic, moving from the bottom to the top of the scores plot corresponds to an increase in Age: younger people sit at the bottom of the plot and older people at the top.
If we want to see how the components depend on the variables, we can plot the loadings for each component:
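One simple way to do this is a bar plot of the loadings DataFrame built above (a sketch, assuming that DataFrame is available):

loadings.plot(kind="bar", figsize=(8, 4))  # one group of bars per variable, one bar per component
plt.ylabel("Loading")
plt.show()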
As we can see, the third component is strongly related to the sex of the person, while the second basically describes the person's age. So let's plot the scores again, this time picking the 1st and 3rd components.
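The same scores plot as before, this time using the 1st and 3rd components (a sketch):

plt.figure(figsize=(7, 5))
plt.scatter(x_pca[:, 0], x_pca[:, 2], c=colours, alpha=0.6)  # colours as defined for the previous scores plot
plt.xlabel("PC1 scores")
plt.ylabel("PC3 scores")
plt.show()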
As we can see, by selecting the appropriate components we can separate the classes well.
However, the sex information was among our dataframe variables. What happens if it is not available for a new observation? Is it possible to run the PCA model on a dataframe without that information? Will PCA be as informative as it was with all the information available?
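A sketch of how this can be checked: drop the sex column before scaling, re-fit the PCA, and keep the original labels only to colour the plot (the column name Sex_male and the plotting details are assumptions):

df_no_sex = df.drop(columns=["Sex_male"])
scaled_no_sex = StandardScaler().fit_transform(df_no_sex)
pca_no_sex = PCA(n_components=2)
x_pca_no_sex = pca_no_sex.fit_transform(scaled_no_sex)
plt.figure(figsize=(7, 5))
plt.scatter(x_pca_no_sex[:, 0], x_pca_no_sex[:, 1], c=colours, alpha=0.6)  # colours = true classes, used only for display
plt.xlabel("PC1 scores")
plt.ylabel("PC2 scores")
plt.show()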
If we remove this information from the dataset, our model of course discriminates the classes less well; however, we can still see a reasonably good separation. Note, though, that this time the plot is very similar to the original scatter plot we made.
In the next section we will explore additional ways to separate the classes of a dataset.