Iris Data set Analysis using KNN

Mayank Tiwari
Published in Analytics Vidhya
Jul 18, 2020

So you’ve done all the reading: you know how an algorithm works, what you can do with a given data set, and how to handle data. But you don’t know where to start your data science journey.

Today I’ll guide you through the “Hello World” program of data science. This will give you a clear idea of how to start with a given data set. Before we start, I’d like to mention that the Iris data set poses a classification problem, i.e., given certain measurements as input, our model has to decide whether a flower belongs to the Setosa, Versicolor or Virginica class.

In the analysis and prediction of the iris class, we will use the K-nearest neighbors (KNN) algorithm. So let’s START!

1. We need to import the necessary libraries, and in order to work on the Iris data set, we need to load it from the sklearn library.
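As a minimal sketch of this step, the imports and loading could look like the following. It assumes the data set is loaded with scikit-learn's load_iris(); the variable name iris matches the one used in the text.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the bundled Iris data set; the result behaves like a dictionary
# with keys such as 'data', 'target', 'feature_names' and 'target_names'.
iris = load_iris()

print(iris.data.shape)    # 150 samples, 4 features
print(iris.target.shape)  # one class label per sample
```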

2. Now we will see how our data looks.

Here, iris.feature_names lists the feature names and iris.target_names lists the target names, which we have already discussed above. There are three target names, i.e., Setosa, Versicolor and Virginica. Lastly, iris.data holds the actual data present in our data set: the values of sepal length (cm), sepal width (cm), petal length (cm) and petal width (cm).
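A short sketch of inspecting these three attributes is given here. Printing only the first five rows of iris.data is a choice made for brevity, not something the original specifies.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Names of the four measured features.
print(iris.feature_names)

# Names of the three classes.
print(iris.target_names)

# The measurements themselves; only the first five rows are shown here.
print(iris.data[:5])
```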

3. Before working with the data, we need to convert it into a DataFrame. This can easily be done with the help of the pandas library.

We can use the .head() function to see the first 5 rows of the data, and the .tail() function to see the last 5 rows. Now we will look at our target values.
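The conversion and the first look at the data might be sketched as follows. The name df is an assumption, and the pandas DataFrame constructor is one common way to do the conversion.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Build a DataFrame with one column per feature.
df = pd.DataFrame(iris.data, columns=iris.feature_names)

print(df.head())  # first 5 rows
print(df.tail())  # last 5 rows

# The target values are integer class labels (0, 1 or 2).
print(iris.target)
```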

4. Now we will add the target column to our data frame and see the first and last 5 rows of the data.
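Adding the target column could look like this. The column name "target" is an assumption, since the original code is not shown.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Append the integer class labels as a new column.
df["target"] = iris.target

print(df.head())
print(df.tail())
```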

5. Now we will use a function called pairplot from the seaborn library. Pairplot plots each feature against every other feature, and the resulting chart helps us select only the relevant features that will give better results.

In the pairplot, we can see how the target values are distributed and which features help distinguish them. If we look closely, we find that petal length (cm) and petal width (cm) alone let us easily distinguish all the iris classes. Now we will draw a separate plot for petal length (cm) and petal width (cm). This can easily be done with the matplotlib library.
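One way to draw this petal length versus petal width plot is sketched below. Drawing one scatter call per class is a design choice made here so that each class gets its own legend entry.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]  # third feature column
petal_width = iris.data[:, 3]   # fourth feature column

fig, ax = plt.subplots()
for class_index, class_name in enumerate(iris.target_names):
    # Select only the samples that belong to the current class.
    mask = iris.target == class_index
    ax.scatter(petal_length[mask], petal_width[mask], label=class_name)

ax.set_xlabel(iris.feature_names[2])
ax.set_ylabel(iris.feature_names[3])
ax.legend()
```

In a script, plt.show() displays the figure.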

The resulting graph clearly shows the distinction between the three iris target classes.

6. Now we will split the data into training data and test data. This can easily be done with the sklearn library.
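The split might be sketched as follows. The original does not state the split ratio, the random seed, or which features were used, so test_size=0.2, random_state=42 and the use of all four features are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Hold out 20% of the samples for testing (ratio and seed are assumptions).
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```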

After importing KNeighborsClassifier, we will create an object (knn in this case). We will train the model in a loop so that we can find the most suitable value for ‘n_neighbors’. In each iteration we fit the model and make predictions using ‘knn.predict()’, then calculate the training and testing scores and append them to their respective lists. After this, we will plot the training and test scores using plot() from the matplotlib library. The plot gives us an idea of which value of ‘n_neighbors’ to use for the final training.
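The loop described above can be sketched like this. The range of k values (1 to 15), the split parameters and the list names are assumptions, since the original code is not shown.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42  # assumed split
)

k_values = list(range(1, 16))  # candidate values for n_neighbors (assumed range)
train_scores = []
test_scores = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    # Accuracy on the data the model was fitted on, and on unseen data.
    train_scores.append(accuracy_score(y_train, knn.predict(X_train)))
    test_scores.append(accuracy_score(y_test, knn.predict(X_test)))

fig, ax = plt.subplots()
ax.plot(k_values, train_scores, label="training accuracy")
ax.plot(k_values, test_scores, label="test accuracy")
ax.set_xlabel("n_neighbors")
ax.set_ylabel("accuracy")
ax.legend()
```

A k where the test accuracy is high and close to the training accuracy is usually a reasonable pick. The exact curve depends on the random split.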

The resulting visualization shows the accuracy with respect to the k-value. From it we can conclude that k=3 is a suitable value. So we will take k=3 for this experiment (you can also try k=4 and see how much the results differ from those for k=3).

So we have obtained an accuracy of 96.66%, which is a good score. Since this is a balanced data set, accuracy is an appropriate metric. We will also look at the confusion matrix of the predictions.
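Training the final model with k=3 and evaluating it might look like the sketch below. The split parameters are assumptions, so the accuracy you get may differ from the figure quoted above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42  # assumed split
)

# Final model with the chosen number of neighbors.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Fraction of test samples classified correctly.
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

The diagonal of the confusion matrix counts the correct predictions per class, so accuracy equals the sum of the diagonal divided by the total number of test samples.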

So, that is all for this part. I hope you all liked it.
