Start learning Machine Learning with the Iris flower classification challenge

Implementing a solution to classify species of iris flowers using machine learning and Python

Felipe Trindade
gft-engineering
9 min read · Oct 18, 2019


My colleagues at GFT have been doing amazing projects applying the state of the art in machine learning and deep learning. From regression and classification to reinforcement learning, and from computer vision to natural language processing, they have applied all available technologies and techniques to solve problems for the finance and industry sectors. Watching what they were doing gave me the urge to start reading, learning and practising in this field. In this article, I will cover one of the first steps I took to learn about machine learning: implementing a solution to one of the most iconic problems in the field, the Iris flower classification problem.

All the related code can be found in the following gist on GitHub:

https://gist.github.com/felipextrindade/a476a590ffac2c9021656a2d0ab2e8ad

Introduction

Machine learning is about extracting knowledge from data. It is a research field at the intersection of statistics, artificial intelligence and computer science, and is also known as predictive analytics or statistical learning. Machine learning methods have become part of everyday life, from recommending which movies to watch and which products to buy, to recognising your friends on social media. Algorithms that learn from input/output pairs are called supervised learning algorithms, because a “teacher” provides supervision in the form of the desired output for each example they learn from. Although creating a dataset of inputs and outputs is often a manual process, supervised learning algorithms are well understood and their performance is easy to measure. As stated before, we will be covering the Iris species classification problem, a typical test case for many statistical classification techniques in machine learning.

Requirements

In this project, we will need the following requirements:

Anaconda: a Python 3.6 distribution that already includes the language’s best and most widely used libraries for data science (such as SciPy, Matplotlib, NumPy and pandas). Anaconda Navigator also comes with the Jupyter Notebook, Spyder and VS Code editors.

Visual Studio Code: a versatile and powerful text editor and all-purpose IDE.

Download Anaconda (Python 3.6 distribution):
https://repo.anaconda.com/archive/Anaconda3-5.2.0-Windows-x86_64.exe

Download standalone VS Code (if you have any problems with Anaconda’s installation):
https://code.visualstudio.com/

Creating our machine learning model

Configuring and using Visual Studio Code

After installing Anaconda successfully, open Visual Studio Code and hit Ctrl + Shift + P. In the command palette that appears at the top of the editor, search for “Python: Select Interpreter”.

Select the Python interpreter

In the list that appears, make sure to select the Anaconda interpreter. That way, we will already have all the dependencies needed for our coding in the base Python installation.

To run your code inside VSCode, you can do one of the following:

Right-click in the code and hit “Run Python File in Terminal”.

Or open the integrated terminal (using Ctrl + `) and type python followed by your script’s file name.

Understanding the scenario

Let’s assume that a hobby botanist is interested in distinguishing the species of some iris flowers that she has found. She has collected some measurements associated with each iris, which are:

  • the length and width of the petals
  • the length and width of the sepals, all measured in centimetres.

She also has the measurements of some irises that have been previously identified by an expert botanist as belonging to the species setosa, versicolor, or virginica. For these measurements, she can be certain of which species each iris belongs to. We will consider that these are the only species our botanist will encounter.
The goal is to create a machine learning model that can learn from the measurements of these irises whose species are already known, so that we can predict the species for the new irises that she has found.

An example of an iris flower and the features of our model
Images of the three classes of iris considered in this model

Importing the libraries

First of all, let’s import the modules listed below:

  • scikit-learn (sklearn) is a package of Python modules built for data science applications (which includes machine learning). Here, we’ll be using three particular modules:
      • load_iris: the classic dataset for the iris classification problem, provided as NumPy arrays.
      • train_test_split: a function for splitting our dataset into training and test sets.
      • KNeighborsClassifier: a class for classifying with the k-nearest neighbors approach.
  • NumPy is a Python library that makes it easier to work with N-dimensional arrays and has a large collection of mathematical functions at its disposal. Its base data type is the numpy.ndarray.
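
A minimal version of these imports might look like this:

    # scikit-learn: dataset, splitting utility and classifier
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    # NumPy for array handling
    import numpy as np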

Building our model

As we have measurements for which we know the correct species of iris, this is a supervised learning problem. We want to predict one of several options (the species of iris), making it an example of a classification problem. The possible outputs (different species of irises) are called classes. Every iris in the dataset belongs to one of the three classes considered in the model, so this problem is a three-class classification problem. The desired output for a single data point (an iris) is the species of the flower given its features. For a particular data point, the class/species it belongs to is called its label.
As stated earlier, we will use the Iris dataset already included in scikit-learn.
Now, let’s print some interesting data about our dataset:

Output:

The individual items are called samples in machine learning, while their properties are called features. The shape of the data array is the number of samples times the number of features. In this case, our data has 150 samples with 4 features each: sepal length (cm), sepal width (cm), petal length (cm) and petal width (cm).
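
For instance, the shape can be inspected like this:

    # data is a NumPy array of shape (n_samples, n_features)
    print("Shape of data:", iris_dataset['data'].shape)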

Output:

The target array contains the species of each of the flowers that were measured, encoded as numbers from 0 to 2.
Those numbers map directly to our target names (classes):

  • setosa (0)
  • versicolor (1)
  • virginica (2)
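
A sketch of how to inspect the targets and their names:

    # target holds the class index (0, 1 or 2) for each of the 150 samples
    print("Target names:", iris_dataset['target_names'])
    print("Target:", iris_dataset['target'])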

Output:

To test the model’s performance, we show it new data for which we have labels (data that is already classified).
This is usually done by splitting the labelled data we have collected (in this example, our 150 flower measurements) into two parts. One part of the data is used to build the machine learning model and is called the training data or training set (which we will call X_train and y_train). The rest of the data will be used to test how well the model works; this is called the test set, test data or hold-out set (which we will call X_test and y_test).
scikit-learn has a function that shuffles and splits the dataset: the train_test_split function.
This function extracts 75% of the rows in the data as the training set, together with their labels. The remaining 25% of the rows, with their labels, become the test set.
The arguments are the data, the labels and a random seed, i.e. train_test_split(data, labels, random_state), and the function returns four arrays.
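
A sketch of the split, assuming random_state=0 as the seed (any fixed value makes the shuffle reproducible):

    # Shuffle and split: 75% training, 25% test (scikit-learn's default)
    X_train, X_test, y_train, y_test = train_test_split(
        iris_dataset['data'], iris_dataset['target'], random_state=0)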

Printing the shape of the train samples, along with their respective targets:
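
For example, using the variables defined above:

    print("X_train shape:", X_train.shape)
    print("y_train shape:", y_train.shape)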

Output:

Same for the test samples:
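
And similarly:

    print("X_test shape:", X_test.shape)
    print("y_test shape:", y_test.shape)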

Output:

Now we can start building the actual model. We will use a k-nearest neighbors classifier.
To make a prediction for a new data point, the algorithm finds the closest point in the training set, then assigns the label of that training point to the new data point.
The k in k-nearest neighbors signifies that instead of using only the single closest neighbor to the new data point, we can consider any fixed number k of neighbors in the training set (like one or three neighbors, as the following image shows) and make a prediction using the majority class among them. For our example, we will use one neighbor (k=1).

Illustration of the k-nearest neighbors approach for k=1 and k=3

Models in scikit-learn are implemented in their own classes. The k-nearest neighbors classification algorithm is implemented in the KNeighborsClassifier class in the sklearn.neighbors module.
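
Instantiating it might look like this (knn is simply a variable name choice):

    # k=1: use only the single nearest neighbor
    knn = KNeighborsClassifier(n_neighbors=1)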

Now we can call the fit method of the knn object, which takes as arguments the array X_train (containing the training data) and the array y_train (containing the corresponding training labels). This way, we build our model on the training set.
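
For example:

    # Build the model on the training set
    knn.fit(X_train, y_train)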

We can now make predictions using this model on any new data for which we might not know the correct labels.
Let’s say we found an iris with the following measures:

  • sepal length of 5 cm
  • sepal width of 2.9 cm
  • petal length of 1 cm
  • petal width of 0.2 cm.

What species would this flower be? We can put these measurements into a NumPy array whose shape is the number of samples (1, as we are looking at one flower) times the number of features (4, the sepal and petal measurements):
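
A sketch of this step (X_new is an illustrative name):

    # A single sample must be a 2D array of shape (1, 4)
    X_new = np.array([[5, 2.9, 1, 0.2]])
    print("X_new.shape:", X_new.shape)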

Output:

We then call the predict method of the knn object:
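
For example:

    prediction = knn.predict(X_new)
    print("Prediction:", prediction)
    print("Predicted target name:", iris_dataset['target_names'][prediction])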

Output:

Our model predicts that this new iris belongs to class 0, meaning it is classified as a setosa. But how do we know whether we can trust the results of our model?

Measuring the model

The test set that was created was not used to build the model, but we do know the correct species for each iris in it. Therefore, we can make a prediction for each iris in the test data and compare it against its label, which tells us whether the model correctly predicts the label for a given flower.
To measure how well the model works, we can compute the accuracy: the fraction of flowers for which the right species was predicted (a number we can calculate with NumPy’s mean function, comparing the predicted labels against the true test labels):
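
A sketch of this comparison (y_pred is an illustrative name):

    # Predict labels for the test set and compare with the true labels
    y_pred = knn.predict(X_test)
    print("Test set score:", np.mean(y_pred == y_test))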

Output:

We can also use the score method of the knn object, which will compute the test set accuracy:
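
For example:

    # score() predicts on X_test internally and compares against y_test
    print("Test set score:", knn.score(X_test, y_test))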

Output:

For this model, the accuracy on the test set is 0.97, which means the model made the right prediction for 97% of the irises in the test set. We can therefore expect the model to be correct about 97% of the time when predicting the species of new irises.
For a hobby botanist’s application, this is a high level of accuracy, and it means that our model may be trustworthy enough to use.

Summary

Albeit simple, the iris flower classification problem (and our implementation) is a perfect example to illustrate how a machine learning problem should be approached and how useful the outcome can be to a potential user.

References

For more resources on the topic, I recommend the book Introduction to Machine Learning with Python: A Guide for Data Scientists, by Andreas C. Müller and Sarah Guido, which contains many hands-on machine learning tutorials, explains the iris classification problem with the scikit-learn dataset in more detail, and is the book on which this article is based.
