Using clustering for feature engineering on the Iris Dataset

Arunabh · Published in GreyAtom · Aug 1, 2017

Here is the link to the Jupyter Notebook containing all the code on GitHub.

Introduction

Clustering is an unsupervised technique used to explore the underlying structure of a dataset. Different clustering algorithms define clusters in different ways.

We will be using a clustering technique known as k-means in our analysis. k-means is a partitioning algorithm that divides the data space into k clusters using the following steps (a minimal sketch of the loop follows the list):

  1. Randomly choose k centres for the k clusters to be formed. We can call these points pseudo-centres.
  2. Assign each data point to the nearest pseudo-centre. By doing so, we have just formed clusters, with each cluster comprising all data points associated with its pseudo-centre.
  3. Recalculate the centre of each cluster and update the location of its pseudo-centre.
  4. Repeat steps 2 and 3 until the pseudo-centres stop moving, i.e. until they have converged to the actual cluster centres.
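Here is a minimal sketch of that loop in NumPy. The function name and parameters are illustrative, not taken from the original notebook:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Illustrative k-means loop (assumes no cluster goes empty)."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k data points as the initial pseudo-centres.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest pseudo-centre.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate each cluster's centre.
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the pseudo-centres no longer move.
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres
```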

What is the Iris Dataset about?

[Image source: Kaggle]

The Iris Dataset, as the name suggests, is a dataset about the Iris flower and its classes. Here is what the dataset looks like:

[Table: first 5 rows of the Iris Dataset]

The Iris dataset uses 4 features (columns): sepal length, sepal width, petal length and petal width, to predict the class of the iris flower. There are 3 classes of iris flowers: Setosa, Versicolor and Virginica.
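For reference, the dataset can be loaded directly from scikit-learn. This is a quick sketch; the original notebook may load it differently:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["class"] = pd.Categorical.from_codes(iris.target, iris.target_names)
print(df.head())  # first 5 rows: the 4 measurements plus the class label
```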

The Approach

[Embedded gist: code used to perform clustering and classification]

The above code creates a class called Iris for easily performing clustering and classification.

The KMeans function performs clustering using scikit-learn’s base KMeans algorithm. It also provides feature engineering capability through the ‘output’ parameter: if output is ‘all’, the cluster labels are added to our existing features; if it is ‘one’, only the labels are used as the training data.
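The embedded gist is not reproduced here, so below is a minimal sketch of what such a class might look like. The class name Iris and the ‘output’ parameter come from the description above; the method names, split size and other details are assumptions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

class Iris:
    """Hypothetical reconstruction of the class described in the post."""

    def __init__(self, test_size=0.3, seed=42):
        data = load_iris()
        (self.X_train, self.X_test,
         self.y_train, self.y_test) = train_test_split(
            data.data, data.target, test_size=test_size, random_state=seed)

    def kmeans(self, n_clusters=3, output="all", seed=42):
        """Cluster the features and use the labels for feature engineering.

        output='all' -> append the cluster label to the existing features
        output='one' -> use the cluster label as the only feature
        """
        km = KMeans(n_clusters=n_clusters, random_state=seed)
        train_labels = km.fit_predict(self.X_train)
        test_labels = km.predict(self.X_test)
        if output == "all":
            self.X_train = np.column_stack([self.X_train, train_labels])
            self.X_test = np.column_stack([self.X_test, test_labels])
        elif output == "one":
            self.X_train = train_labels.reshape(-1, 1)
            self.X_test = test_labels.reshape(-1, 1)

    def classify(self, clf):
        """Fit any sklearn classifier and return its test accuracy."""
        clf.fit(self.X_train, self.y_train)
        return accuracy_score(self.y_test, clf.predict(self.X_test))
```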

[Example of setting the output parameter to ‘all’]
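The embedded example is also missing; a hypothetical equivalent, using the sketch above:

```python
from sklearn.linear_model import LogisticRegression

model = Iris()
model.kmeans(n_clusters=3, output="all")  # append cluster labels to the features
print(model.classify(LogisticRegression(max_iter=200)))
```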

Results

The first iteration uses just Logistic Regression and measures accuracy; it acts as a control group to see whether clustering adds value. The last two iterations use the output parameter set to ‘all’ and ‘one’ respectively. The following is a summary of the results obtained.

[Results table: 3 iterations of k-means]

As we can see from the accuracy scores of the first 3 iterations, k-means is actually doing a worse job here than plain Logistic Regression.

Let’s see how k-means performs when our classifier is a Support Vector Machine. The results are shown below:

[Results after changing the classifier to SVC]

When the output parameter is set to ‘all’, k-means does a good job and is comparable to using Support Vector Machines without clustering.
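For reference, the classifier swap amounts to a one-line change in the hypothetical usage above:

```python
from sklearn.svm import SVC

model = Iris()
model.kmeans(n_clusters=3, output="all")
print(model.classify(SVC()))  # SVC replaces LogisticRegression as the classifier
```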

Learnings and next steps

Let us analyze the results obtained above: k-means performs better when its cluster labels are added to the existing features than when the labels are used on their own.

Clustering techniques can be used to improve the performance of your classifiers, but they will not do much on their own.

Things to try:

  1. Change your train and test dataset sizes.
  2. Change your seed to obtain different metrics.
  3. Try the above on various datasets.
  4. Write a blog about it!
