Using clustering for feature engineering on the Iris Dataset
The Jupyter Notebook containing all the code is available on GitHub.
Introduction
Clustering is a technique used to explore the underlying structure of a dataset. Different clustering algorithms define clusters in different ways.
We will use a clustering technique known as k-means in our analysis. k-means is a partitioning algorithm that divides the data space into k clusters using the following steps:
- Randomly choose k centres, one for each of the k clusters to be formed. We can call these points pseudo-centres.
- Assign each data point to the nearest pseudo-centre. By doing so, we have just formed clusters, with each cluster comprising all data points associated with its pseudo-centre.
- Recalculate the centre of each cluster and move its pseudo-centre to this new location.
- Repeat the previous two steps until the pseudo-centres stop moving, i.e., they converge to the actual cluster centres.
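The steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the algorithm, not an optimized implementation (for instance, it does not handle the edge case of a cluster becoming empty):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch following the steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k data points as the initial pseudo-centres
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest pseudo-centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate each cluster's centre
        new_centres = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 4: stop once the pseudo-centres no longer move
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres
```

In practice we will rely on scikit-learn's `KMeans` rather than hand-rolling this, but the loop above is essentially what it does under the hood.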
What is the Iris Dataset about?
The Iris Dataset, as the name suggests, is a dataset about the Iris flower and its classes. Each sample is described by 4 features (columns): sepal length, sepal width, petal length and petal width, which are used to predict the class of the iris flower. There are 3 classes of iris flowers: Setosa, Versicolor and Virginica.
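The dataset ships with scikit-learn, so it can be inspected directly:

```python
from sklearn.datasets import load_iris

iris = load_iris()
# Four feature columns: sepal length/width and petal length/width (in cm)
print(iris.feature_names)
# Three target classes: setosa, versicolor, virginica
print(iris.target_names)
# 150 samples, 4 features
print(iris.data.shape)
```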
The Approach
The above code creates a class called Iris for easily performing clustering and classification.
The function KMeans performs clustering using scikit-learn's base KMeans algorithm. It also provides feature engineering capability through the ‘output’ parameter: if output is ‘all’, the cluster labels are added to the existing features, and if it is ‘one’, only the labels are used as the training data.
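The notebook itself is not reproduced here, but a minimal sketch of such a class might look like the following (the internal details are assumptions; only the class name, method name and the ‘output’ parameter come from the description above):

```python
import numpy as np
from sklearn import cluster, datasets

class Iris:
    """Hypothetical sketch of the helper class described above."""

    def __init__(self):
        data = datasets.load_iris()
        self.X, self.y = data.data, data.target

    def KMeans(self, n_clusters=3, output='all'):
        # Cluster with sklearn's base KMeans and use the labels as a feature
        labels = cluster.KMeans(n_clusters=n_clusters, n_init=10,
                                random_state=0).fit_predict(self.X)
        if output == 'all':
            # Add the cluster labels to the existing 4 features
            return np.column_stack([self.X, labels])
        if output == 'one':
            # Use only the cluster labels as the training data
            return labels.reshape(-1, 1)
        raise ValueError("output must be 'all' or 'one'")
```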
Results
The first iteration uses just Logistic Regression and measures accuracy; it acts as a control group to see whether clustering adds value. The last 2 iterations set the output parameter to ‘all’ and ‘one’ respectively. A summary of the results is below.
As we can see from the accuracy scores obtained in the first 3 iterations, k-means actually does a poor job here compared to plain Logistic Regression.
Let’s see how k-means performs when our classifier is a Support Vector Machine. The results are shown below:
When the output parameter is set to ‘all’, k-means does a good job, and the accuracy is comparable to using Support Vector Machines without clustering.
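The comparison can be reproduced with a sketch like this. The exact splits, seeds and hyperparameters used in the notebook are assumptions, so the numbers printed here will not necessarily match the results reported above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# Cluster labels from k-means, reshaped into a single feature column
labels = KMeans(n_clusters=3, n_init=10,
                random_state=0).fit_predict(X).reshape(-1, 1)

variants = {
    'plain': X,                      # control: original features only
    'all': np.hstack([X, labels]),   # original features + cluster label
    'one': labels,                   # cluster label alone
}

results = {}
for clf_name, clf in [('logreg', LogisticRegression(max_iter=1000)),
                      ('svm', SVC())]:
    for name, features in variants.items():
        X_tr, X_te, y_tr, y_te = train_test_split(
            features, y, random_state=42)
        acc = clf.fit(X_tr, y_tr).score(X_te, y_te)
        results[(clf_name, name)] = acc
        print(f'{clf_name} / {name}: {acc:.3f}')
```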
Learnings and next steps
Let us analyze the results obtained above. k-means performs better when its labels are added to the existing features than when the labels are used on their own.
Clustering techniques can be used to improve the performance of your classifiers, but they will not do much on their own.
Things to try:
- Change your train and test dataset sizes.
- Change your seed to obtain different metrics.
- Try the above on various datasets.
- Write a blog about it!