Principal Component Analysis (PCA) is a technique widely used in machine learning and artificial intelligence. It reduces the dimensionality of data and, in some cases, also helps remove noise.
PCA reduces the size of your data (from d dimensions to k, where k < d), which in turn speeds up the training of models that consume it. PCA is popular because it reduces dimensionality while keeping as much of the variance in the data as possible, so comparatively little information is lost.
In this tutorial, I replicate the results of another PCA blog post (reference: https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60). That post does the task in Python. Let’s take a look at how we can do PCA and visualize it in Pharo.
Loading the Dataset
The Iris Dataset can be downloaded from here.
To load the dataset we will utilize a Pharo library called DataFrame. DataFrame gives us a tabular data structure to do data analysis in Pharo.
| dataset f Xmatrix scale X y pca scaled_X reduced_X graph setosa versicolor virginica a b c |
f := <path_to_file> asFileReference.
dataset := DataFrame readFromCsv: f.
dataset columnNames: #( 'slength' 'swidth' 'plength' 'pwidth' 'target' ).
dataset removeRowAt: 150.
"Every cell is read as a string; convert the four measurement columns to numbers."
dataset do: [ :row |
	row at: #slength transform: [ :element | element asNumber ].
	row at: #swidth transform: [ :element | element asNumber ].
	row at: #plength transform: [ :element | element asNumber ].
	row at: #pwidth transform: [ :element | element asNumber ] ].
The loading code is fairly simple. The first line declares the variables we will use throughout the tutorial. In the second line, we create a file reference for the CSV file. DataFrame can read CSV files directly (line 3); if your file is formatted slightly differently, refer to this blog by Atharva Khare, which covers how to approach writing a file reader. In the 4th line, we name the columns, and we then remove the last row from the dataframe because it is a row of null values from the file. Finally, everything is read in as strings. The four measurement columns hold floating-point values that we want to do math on, so we convert them into numbers.
It should look something like this,
Applying PCA to our Data
Now we want to apply PCA on our dataset. So first let’s separate out our data from our labels.
X := dataset columnsFrom: 1 to: 4.
y := dataset columnsFrom: 5 to: 5.
We now have two data frames, X and y. Let’s now convert the X dataframe into a matrix. The matrix and numerical library in Pharo is called PolyMath. It’s a great library with regular updates and some really passionate people working behind it.
Converting our DataFrame into a matrix:
Xmatrix := PMMatrix rows: ( X asArrayOfRows ).
Now we want to standardize our data so that each column has mean 0 and variance 1. The importance of standardizing is covered here. PolyMath offers a convenient way to fit a scaler and transform our matrix, so we simply do the following to standardize our data.
scale := PMStandardizationScaler new.
scale fit: Xmatrix.
scaled_X := DataFrame withRows: ( (scale fitAndTransform: Xmatrix) rows ) columnNames: #( 'slength' 'swidth' 'plength' 'pwidth' ).
Xmatrix := (PMMatrix rows: scaled_X).
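To make the standardization step concrete, here is a minimal sketch in plain Pharo that standardizes a single column by hand, using z = (x - mean) / std. It does not use PolyMath at all. The five sample values are only illustrative, and the sketch assumes the population standard deviation (dividing by n), which may or may not match what PMStandardizationScaler does internally.

```smalltalk
| column mean variance std standardized |
"A handful of illustrative measurements standing in for one column."
column := #(5.1 4.9 4.7 4.6 5.0).

"Mean of the column."
mean := (column inject: 0 into: [ :sum :each | sum + each ]) / column size.

"Population variance (divide by n). Assumption: PolyMath may divide by n - 1 instead."
variance := (column inject: 0 into: [ :sum :each | sum + (each - mean) squared ]) / column size.
std := variance sqrt.

"Each value becomes its distance from the mean, measured in standard deviations."
standardized := column collect: [ :each | (each - mean) / std ].
standardized
```

Inspecting the result in a Playground should show values centered around 0. Without this step, a column with a large numeric range would dominate the principal components simply because of its scale.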
All we need to do now is apply PCA on our data.
pca := PMPrincipalComponentAnalyserJacobiTransformation new componentsNumber: 2.
pca fit: Xmatrix.
Xmatrix := (pca transform: Xmatrix).
reduced_X := DataFrame withRows: ( Xmatrix rows ).
reduced_X addColumn: (y column: 'target') named: 'target' atPosition: 3.
PolyMath offers two ways of applying PCA: PMPrincipalComponentAnalyserJacobiTransformation, which we use above, and PMPrincipalComponentAnalyserSVD. At the time of writing, the SVD method is undergoing some fixes, so we use the Jacobi transformation. Our data should now look like this,
Our data has now been projected into a 2-dimensional space from a 4-dimensional space.
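For readers who want the math behind this projection, the standard textbook formulation of PCA is sketched below. This is the general method, not a description of PolyMath's internals. It assumes X is the standardized data matrix with n rows and d columns (d = 4 here) and that we keep k components (k = 2 here); V_k denotes the d × k matrix made of the first k columns of V.

```latex
\begin{aligned}
C &= \frac{1}{n-1}\, X^{\top} X
  && \text{(covariance matrix of the centered data, } d \times d\text{)} \\
C &= V \Lambda V^{\top}, \qquad
  \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_d), \quad
  \lambda_1 \ge \dots \ge \lambda_d \ge 0
  && \text{(eigendecomposition)} \\
Z &= X V_k
  && \text{(projected data, } n \times k\text{)} \\
\text{retained variance} &= \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i}
\end{aligned}
```

The eigenvectors with the largest eigenvalues are the directions along which the data varies the most, which is why keeping only the first k of them loses comparatively little information. The Jacobi method is one classical way of computing the eigendecomposition of a symmetric matrix such as C, which is presumably what the analyser's class name refers to.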
Visualizing the Data
Now that we have the data in 2-D, we can plot it on the XY plane and see how it looks. For visualization in Pharo, we use a library called Roassal.
We need to first split the data into the 3 classes of Iris Setosa, Versicolor and Virginica.
a := OrderedCollection new.
b := OrderedCollection new.
c := OrderedCollection new.
(reduced_X) do: [ :row | ( (row at: 'target') = 'Iris-setosa') ifTrue: [ a add: (row asArray )] ].
(reduced_X) do: [ :row | ( (row at: 'target') = 'Iris-versicolor') ifTrue: [ b add: (row asArray )] ].
(reduced_X) do: [ :row | ( (row at: 'target') = 'Iris-virginica') ifTrue: [ c add: (row asArray )] ].
The above is an unoptimized but simple way to split our data into the 3 classes: it walks over the whole dataset once per class.
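If the three passes bother you, one possible alternative is to group the rows in a single pass with groupedBy:, which is part of Pharo's standard collections. The sketch below runs on a plain Array of made-up rows (two components followed by the label) rather than on the DataFrame itself. Whether DataFrame answers groupedBy: in the same way is not something this sketch checks, so it assumes you first turn each row into an array, as the code above does with asArray.

```smalltalk
| rows groups |
"Made-up rows in the same shape as reduced_X: component 1, component 2, target."
rows := {
	#(-2.3 0.5 'Iris-setosa') .
	#(0.9 -0.3 'Iris-versicolor') .
	#(1.8 0.2 'Iris-virginica') .
	#(-2.1 -0.7 'Iris-setosa') }.

"One pass over the rows; the block's result (the label) becomes the dictionary key."
groups := rows groupedBy: [ :row | row last ].

"Each class is then a single lookup away."
groups at: 'Iris-setosa'
```

With the real data, the collections a, b and c would correspond to the entries stored under 'Iris-setosa', 'Iris-versicolor' and 'Iris-virginica'.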
Now to plot the data,
graph := RTGrapher new.

setosa := RTData new.
versicolor := RTData new.
virginica := RTData new.

setosa dotShape color: Color red.
versicolor dotShape color: Color blue.
virginica dotShape color: Color green.

setosa points: (a).
setosa x: [:vect | vect at: 1].
setosa y: [:vect | vect at: 2].
setosa label: 'setosa'.

versicolor points: (b).
versicolor x: [:vect | vect at: 1].
versicolor y: [:vect | vect at: 2].
versicolor label: 'versicolor'.

virginica points: (c).
virginica x: [:vect | vect at: 1].
virginica y: [:vect | vect at: 2].
virginica label: 'virginica'.

graph add: setosa.
graph add: versicolor.
graph add: virginica.

graph axisX title: 'Principal Comp 1'.
graph axisY title: 'Principal Comp 2'.
graph legend below.
We define a graph using RTGrapher, along with 3 groups of data, one for each of our classes. Each group gets a dot color of our choice (for example, “setosa dotShape color: Color red.”). We then tell each group how to read the x and y values from a point, add the groups to the graph, title the axes, and place a legend below the plot, and we finally get our plot.
Hopefully, this will serve as a reference for anyone interested in doing data analysis in Pharo and a guide on how to get started.