Simple classification problem with sklearn and iris flower data set

Victor Bona
Published in Café & Tech
5 min read · Sep 17, 2020

A simple approach to implementing classification models with sklearn, using the Iris flower data set as an example.

TL;DR: OK, OK, just give me the code: https://gist.github.com/vicotrbb/1375aca1b5a64363caec9cc8c65eca12

Some useful links to important information will be provided at the end of the article.

Photo by Pietro Jeng on Unsplash

Classification problems

Let’s say your boss asks you to code a way to classify the customers of a store into classes based on customer attributes (shopping rate, income, age, family size, etc.), or a school hires you to code something that helps teachers understand what kinds of students they have based on the students’ school attributes. All of these problems can be solved using machine learning and neural networks.

Basically, we can create a machine learning model to classify something based on data.

As an example, we will try to identify the species of a plant based on its flower attributes (sepal length, sepal width, petal length and petal width). Using this data, we will create a model that predicts which species the plant belongs to.

The data

The Iris flower data set is usually used as the “Hello world” data set for starting machine learning studies.

“The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.” (Wikipedia)

Summary Statistics

It’s important to remember that this data set will be loaded from the samples bundled with sklearn.datasets, so we don’t need to clean or preprocess the data in any way.
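As a minimal sketch of that loading step, the snippet below reads the bundled data with `load_iris` and prints a few summary statistics per feature. The choice of statistics (min, max, mean, standard deviation) is an assumption made for illustration and may differ from the original summary.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the bundled data set: X holds the four measurements (in cm),
# y holds the species encoded as integers.
iris = load_iris()
X, y = iris.data, iris.target

print("shape:", X.shape)
print("samples per class:", np.bincount(y))

# Per-feature summary statistics, one line per measurement.
for name, column in zip(iris.feature_names, X.T):
    print(
        f"{name:<20} min={column.min():.1f} max={column.max():.1f} "
        f"mean={column.mean():.2f} std={column.std():.2f}"
    )
```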

The target column, which represents the plant species, is stored as numbers, so we will use this dict as a reference:
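A reconstruction of such a reference dict could look like the sketch below. The variable name `species_reference` is hypothetical, and the labels are assumed to follow the order of sklearn's own `target_names`, which the loop cross-checks.

```python
from sklearn.datasets import load_iris

# Hypothetical reference dict: numeric target -> species label.
# Assumed to follow the order used by sklearn's target_names.
species_reference = {
    0: "Iris-Setosa",
    1: "Iris-Versicolour",
    2: "Iris-Virginica",
}

# Cross-check the mapping against the names shipped with the data set.
iris = load_iris()
for code, name in enumerate(iris.target_names):
    print(code, name, "->", species_reference[code])
```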

The model

There are several kinds of classification models we could use for this problem, and almost all of them would probably perform very well. After some tests, though, I found that the MLP classifier suits this case very well, so we will use it as our example.

But what actually is the MLP classifier?

According to Wikipedia, “A multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN). The term MLP is used ambiguously, sometimes loosely to any feedforward ANN, sometimes strictly to refer to networks composed of multiple layers of perceptrons (with threshold activation); see § Terminology. Multilayer perceptrons are sometimes colloquially referred to as “vanilla” neural networks, especially when they have a single hidden layer.”

Okay, okay. If you read something like this without ever having studied the subject, it will probably be a little confusing, but it is not that hard when explained in simpler words: a multilayer perceptron is a supervised neural network that uses nonlinear activation functions and is trained with a technique called backpropagation. It is often used to solve classification problems.

See? Not that hard! Take a look at the links for a better understanding.
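To make that description a little more concrete, here is a minimal sketch of a single forward pass through a tiny MLP written with NumPy. The layer sizes, the ReLU and softmax functions and the random weights are all assumptions chosen for illustration. The network is untrained, so its output is meaningless until backpropagation adjusts the weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # Nonlinear activation: negative values become zero.
    return np.maximum(z, 0.0)

def softmax(z):
    # Turns raw scores into probabilities that sum to 1 per row.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Assumed architecture: 4 inputs -> 8 hidden units -> 3 output classes.
W1, b1 = rng.normal(scale=0.1, size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.1, size=(8, 3)), np.zeros(3)

# One flower: sepal length, sepal width, petal length, petal width.
x = np.array([[5.1, 3.5, 1.4, 0.2]])

hidden = relu(x @ W1 + b1)         # hidden layer
probs = softmax(hidden @ W2 + b2)  # one probability per species
print(probs)
```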

Ok, let’s code it

Coding it is very easy with sklearn, but remember that the data we are using is already clean and ready to go. In real-world problems, the data will need to be processed and the hyperparameters tuned. Data processing is one of the most important parts of creating an ML model.
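A minimal training sketch could look like the following. The 80/20 stratified split, the hidden layer size, `max_iter=2000` and the fixed `random_state` are assumptions made here so that the example is reproducible. They are not necessarily the settings used in the linked gist.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)

# Assumed split: 80% for training, 20% for testing, keeping class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# One hidden layer of 100 units; a higher max_iter gives the optimizer
# more room to converge than the default.
model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=2000, random_state=42)
model.fit(X_train, y_train)

print("iterations run:", model.n_iter_)
```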

As you can see, the code is very easy to write and to understand. Now we can predict the plant species by inputting the attributes we saw. Let’s get a sample and try it:

Sample

Now we predict it using our model; we expect the result 1.
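The sketch below shows one way to pick such a sample and predict it. It assumes the same illustrative split and model settings as a plain `train_test_split` plus `MLPClassifier` setup, and it takes the first test-set flower labelled 1 as the sample, which is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = MLPClassifier(max_iter=2000, random_state=42).fit(X_train, y_train)

# Take the first test-set flower whose true label is 1.
idx = int(np.argmax(y_test == 1))
sample = X_test[idx]

# predict expects a 2D array: one row per sample.
predicted = model.predict(sample.reshape(1, -1))

print("sample:", sample)
print("expected:", y_test[idx], "predicted:", predicted[0])
```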

Sample predict

As we can see, we got the correct result. Checking our reference, we see that the sample is of the Iris-Versicolour species. Another thing we can try is to check our model’s score using the test data we created previously. Check it out:
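As a sketch, the score can be read with `model.score`, which returns the mean accuracy on the data it is given. The split and model settings below are the same illustrative assumptions as before, and printing the training accuracy next to the test accuracy is an addition for comparison.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = MLPClassifier(max_iter=2000, random_state=42).fit(X_train, y_train)

# score returns the fraction of correctly classified samples.
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
```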

Our model has an impressive accuracy score, but the problem is that we only used data the model already knows, which doesn’t tell us much. We need to use unknown data to really validate the model’s accuracy, and that’s exactly why we did this previously:

This part of the code took 5 random samples from the data set to use as validation data. We do this to see how the model performs with unknown data and to get a real idea of its accuracy. So, let’s try it out:
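One possible way to set aside 5 random samples and validate against them is sketched below. The use of NumPy's random generator, the seed and the model settings are assumptions for illustration. The key point is that the 5 validation rows are removed before training, so the model never sees them.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)

# Pick 5 distinct random row indices to hold out as validation data.
rng = np.random.default_rng(42)
val_idx = rng.choice(len(X), size=5, replace=False)

# Boolean mask that keeps every row except the held-out ones.
mask = np.ones(len(X), dtype=bool)
mask[val_idx] = False

X_val, y_val = X[val_idx], y[val_idx]
X_rest, y_rest = X[mask], y[mask]

# Train only on the remaining rows, then predict the unseen samples.
model = MLPClassifier(max_iter=2000, random_state=42).fit(X_rest, y_rest)
val_pred = model.predict(X_val)

print("expected: ", y_val)
print("predicted:", val_pred)
print("validation accuracy:", (val_pred == y_val).mean())
```

With only 5 samples, this check is a quick sanity test rather than a reliable accuracy estimate.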

Validation test

There we are: we got an excellent result with our validation data.

Conclusion

We saw a very condensed implementation of a neural network using sklearn, with data pre-formatted and ready to go. Real problems are usually not this easy to solve and demand hours of analysis and hard work.

The purpose of this article is just to give a simple example of implementing your first neural network and to present some concepts and links to guide and focus your studies.

Victor Bona
Café & Tech

I am a Software Engineer working at Valari and a Machine learning enthusiast 🧠 aiming to create breakthrough innovations.