PCA, LDA and PLS exposed with python — part 1: Principal Component Analysis

Andrea Castiglioni
Published in Analytics Vidhya
Mar 9, 2020 · 5 min read

In this post I want to consider the main differences between the PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis) and PLS (Partial Least Squares) algorithms and their use in a typical classification/regression problem. I will be using Python and the implementations available in sklearn. The second part of this tutorial will focus on LDA and PLS.

PCA is an unsupervised dimensionality reduction technique, now widely used in machine learning as well as in the field of chemometrics and multivariate analysis.

We will then look at LDA, a supervised counterpart of PCA that is useful for classification, and finally at PLS, which can be used both for classification and for regression.

Let’s start!

“sunset” by Zbynek Burival on Unsplash

We start by importing the classical libraries:

Basic libraries we start working with.
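A minimal set of imports that covers everything used below might look like this (the exact list is an assumption, since the original snippet is shown as a figure):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns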

The dataset we are going to work with is a generated one: we are interested in the height and weight of people, with features Sex, Age, Height and Weight. Our dataset is very simple; here are the statistics:

Description of the age/height/weight of an invented population.

There are some very old people in there!
Plotted on two axes, the data looks like the graph below.

Height and Weight of fake dataset.

As we can see, it seems easy to establish a classification between male and female from this dataset. From the distribution plots shown below we can see that there is a significant overlap between the two classes, more pronounced in the height distribution.

Left graph: weight distribution for the two classes. Right graph: height distribution for the two classes.
Head of the dataframe.
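If you want to follow along, a toy dataframe with similar characteristics can be generated as sketched below; the distributions, means and standard deviations are assumptions, not the exact parameters behind the figures above.

rng = np.random.default_rng(0)
n = 500  # assumed sample size

sex = rng.choice(["male", "female"], size=n)
age = rng.integers(18, 90, size=n)          # assumed range, including some very old people
height = np.where(sex == "male",
                  rng.normal(178, 7, n),    # assumed male height distribution (cm)
                  rng.normal(165, 6, n))    # assumed female height distribution (cm)
weight = np.where(sex == "male",
                  rng.normal(80, 9, n),     # assumed male weight distribution (kg)
                  rng.normal(65, 8, n))     # assumed female weight distribution (kg)

df = pd.DataFrame({"Age": age, "Height": height, "Weight": weight, "Sex": sex})
df = pd.get_dummies(df, columns=["Sex"], drop_first=True)  # leaves a single Sex_male dummy
df.head()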

PCA

We are now interested in the following question: as it is difficult to plot more than two variables together, can PCA help us?

PCA works by projecting the dataset onto the directions of maximum variance of our variables. For our dataset this mostly means following the relationship between height and weight. We have 4 variables, so we can start by computing 3 principal components.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np

scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)  # fit the scaler to our dataframe and transform the data
pca = PCA(n_components=3)  # keep 3 principal components
x_pca = pca.fit_transform(scaled_data)  # project the scaled data onto the principal components
variance = pca.explained_variance_ratio_  # variance ratio explained by each component
var = np.cumsum(np.round(variance, decimals=3) * 100)  # cumulative explained variance in percent
print(var)
Code to scale our data to mean = 0 and standard deviation = 1 and apply PCA.

The variable “var” holds the cumulative variance explained by the principal components. In our case, starting with 4 variables, keeping 3 components retains 99.4% of the information; keeping 2 explains almost 82% of the variance.

What PCA does is rotate the axes towards the directions of maximum variance: each principal component is a linear combination of the original variables. We are interested in the loadings and the scores of our dataset: the loadings plot shows the direction vectors that define the model.
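The directions found by PCA are stored in pca.components_, one row per component. A sketch of how they could be turned into a loadings plot for the first two components (the column names come from the dataframe built earlier):

loadings = pd.DataFrame(pca.components_.T,
                        columns=["PC1", "PC2", "PC3"],
                        index=df.columns)

fig, ax = plt.subplots()
ax.scatter(loadings["PC1"], loadings["PC2"])
for name, row in loadings.iterrows():
    ax.annotate(name, (row["PC1"], row["PC2"]))  # label each original variable
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
plt.show()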

From the picture below we can see that Height, Weight and Sex_male all load on the first principal component. This means that these 3 variables are highly correlated with each other. On the second principal component we find Age; since this variable sits at approximately 0 on the x axis, it is largely independent of the others.

Loading plot for our dataset for the first 2 components.

Let’s look at the scores plot:
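A hypothetical helper along these lines could produce a similar plot; colouring the points by Sex_male is an assumption about how the original figure was drawn.

def plot_scores(scores, hue, pc_x=0, pc_y=1):
    # scatter two columns of the scores matrix, coloured by a numeric label
    fig, ax = plt.subplots()
    sc = ax.scatter(scores[:, pc_x], scores[:, pc_y], c=hue, cmap="coolwarm", alpha=0.7)
    ax.set_xlabel(f"PC{pc_x + 1}")
    ax.set_ylabel(f"PC{pc_y + 1}")
    plt.colorbar(sc)
    plt.show()

plot_scores(x_pca, df["Sex_male"].astype(int))  # first vs second component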

Scores plot for the first two components of our dataset.

We added all the dimensions here to make some points. The first thing to look at is the colour of the points: the red ones sit mostly on the positive x axis while the blue ones sit on the negative side. We saw from the loadings plot that Sex_male points along the positive x axis, so as that variable increases the points shift to the right.
With the same logic, moving from the bottom to the top of the scores plot corresponds to an increase in Age: younger people sit at the bottom and older people at the top of the plot.

If we want to see how the components depend on the variables, we can plot the loadings for each component:
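A quick way to get that view is a bar chart of the loadings dataframe built earlier (a sketch, using pandas plotting):

loadings.plot(kind="bar")  # one group of bars per variable, one bar per component
plt.ylabel("Loading")
plt.show()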

Loadings of the 3 components of our dataset.

As we can see, the third component is strongly related to the sex of the person, while the second basically describes the age of that person. So let's plot the scores again, but this time picking the 1st and 3rd components.
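With the hypothetical plot_scores helper defined above, this is just a matter of changing the component indices:

plot_scores(x_pca, df["Sex_male"].astype(int), pc_x=0, pc_y=2)  # first vs third component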

loadings plot of the first and third component.
scores plot of the first and third component.

As we can see, by selecting the appropriate components we can separate the classes well.

However, the sex information was one of our dataframe variables. What happens if it is not available for a new observation? Is it possible to run the PCA model on a dataframe without that information? Will PCA be as informative as it was with all the information?
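A sketch of the same pipeline without the sex information (the column name Sex_male comes from the toy dataframe built earlier):

df_nosex = df.drop(columns=["Sex_male"])

scaled_nosex = StandardScaler().fit_transform(df_nosex)
pca_nosex = PCA(n_components=2)
x_pca_nosex = pca_nosex.fit_transform(scaled_nosex)

plot_scores(x_pca_nosex, df["Sex_male"].astype(int))  # colour by the true sex for comparison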

If we remove this information from the dataset, our model of course discriminates the classes less well, but we can still see a reasonable separation. Note, however, that this time the plot is very similar to the original scatter plot we made.

In the next section we will explore additional ways to separate the classes of a dataset.
