Getting started with Machine Learning with Python: Classification and Identification of Iris Flowers.

7 min read · Aug 11, 2021

Hello folks, today I will take you through machine learning. Machine learning is a branch of artificial intelligence (AI) and computer science that uses data and algorithms to imitate the way humans learn, gradually improving in accuracy. It is a method of data analysis that automates analytical model building. This is an exciting subject that will open your thoughts to more than just flowers: you will learn how to feed lots of information to your machine, let it learn by itself, and have it make predictions when presented with data it has not seen.

In this tutorial, you will learn how to classify flowers based on sample data I collected. We will train our model on this data so that it can later predict the kind of flower when we supply the flower's attributes. The classifier works in that direction only: it maps measurements to a flower name, not a flower name back to measurements. This type of data is structured: each row is one flower and each column is one attribute.

Let's Dive into It

Before you get started, you need to have some basics in:

Python Programming Language
Data structures and Algorithms

Our IDE for this tutorial will be Google Colaboratory. I like it because it gives you cool features straight out of the box.

1. Head to Google Colab. This creates a new Colab notebook that is ready for our lesson.

How Colab appears when you first open it.

2. I have prepared some data for us to train on, in CSV form. It looks as shown below. I have hosted it at https://modcom.co.ke/datasets/iris.csv or https://drive.google.com/file/d/1V3LcwJ1n9p4s3IYBFrMFGwCcpqRx7LU-/view?usp=sharing

The data set we are going to train on: 150 records of flowers

3. We will import pandas, a library that will help us to read the above CSV file. Then create a variable called data, which will hold the results we get after reading the .csv file. We print out the data to check if we are actually getting the desired results.

import pandas
data = pandas.read_csv("https://modcom.co.ke/datasets/iris.csv")
print(data)

All the data in our file will be printed as shown below (after clicking the black play button in Colab).
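Printing the whole table works, but a quick look at the shape, the first rows, and the column types is often enough. Here is a minimal sketch of that idea. It reads a small in-memory excerpt instead of the hosted file, and the header names and the three rows are my assumptions for illustration only.

```python
import io
import pandas

# Hypothetical three-row excerpt; the header names are assumed, not copied from the hosted CSV
csv_text = """sepallength,sepalwidth,petallength,petalwidth,class
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

sample = pandas.read_csv(io.StringIO(csv_text))
print(sample.shape)      # (rows, columns)
print(sample.head(2))    # first two rows only
print(sample.dtypes)     # type of each column
```

The same three calls work on the data variable from the tutorial once the real file is loaded.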

When working with machine learning models, you need to clean your data by checking for nulls before you start training. So we will check whether any null data exists in our data set. We will write code that counts the nulls in each column, in case there are any. Add this to a new cell in Colab by clicking Insert -> Code Cell

data.isnull().sum()  # count empty (null) values per column

The output shows there are no nulls in any column of our data set, which is a great indication.
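If the check had reported nulls, we would have to deal with them before training. Below is a minimal sketch of two common options, dropping incomplete rows or filling gaps with the column mean. The toy DataFrame is made up for illustration and is not part of our data set.

```python
import pandas

# Hypothetical toy data with one missing measurement (not the real data set)
toy = pandas.DataFrame({
    "sepallength": [5.1, 4.9, None, 6.3],
    "sepalwidth": [3.5, 3.0, 3.2, 3.3],
})

null_counts = toy.isnull().sum()   # nulls per column
dropped = toy.dropna()             # option 1: drop rows that contain any null
filled = toy.fillna(toy.mean())    # option 2: fill nulls with the column mean

print(null_counts)
print(len(dropped), "rows left after dropna")
print(filled)
```

Which option is better depends on how much data you can afford to lose. With only 150 records, filling is usually the gentler choice.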

4. Let's analyze our data in basic terms, i.e. count, mean, std, min, max, and quartiles. This just gives us a rough overview of the data we have. Insert a new code cell and type the code below.

# basic stats
data.describe()
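Note that describe() reports the standard deviation but not the variance. If you want the variance, or a count of rows per flower class, a sketch like the one below may help. The toy values and the column names are assumptions for illustration.

```python
import pandas

# Hypothetical toy sample; column names are assumed to match the tutorial's layout
toy = pandas.DataFrame({
    "petallength": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

summary = toy.describe()                    # count, mean, std, min, quartiles, max
variance = toy["petallength"].var()         # sample variance, i.e. std squared
class_counts = toy["class"].value_counts()  # number of rows per class

print(summary)
print("variance:", variance)
print(class_counts)
```

Checking the class counts is worth doing early, because a heavily unbalanced data set can make the accuracy score misleading.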

5. The next step is to split the data into two portions: the predictors and the predicted value (target). They will be represented by X and Y.

# Step 1: Split to X, Y
# X represents the predictors/features/inputs
# Y is the predicted value - target/output
array = data.values
X = array[:, 0:4]  # sepallength, sepalwidth, petallength, petalwidth
Y = array[:, 4]    # class

X values(Predictors, the actual values of flower dimensions)
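Slicing by position works as long as the column order never changes. A possible alternative is to select by column name, sketched below on a made-up two-row frame. I am assuming the target column is called class, as in the code comment above. Adjust the name if your CSV header differs.

```python
import pandas

# Hypothetical two-row frame; the column names are assumptions
toy = pandas.DataFrame({
    "sepallength": [5.1, 6.3],
    "sepalwidth": [3.5, 3.3],
    "petallength": [1.4, 6.0],
    "petalwidth": [0.2, 2.5],
    "class": ["Iris-setosa", "Iris-virginica"],
})

X = toy.drop(columns=["class"]).to_numpy()  # every column except the target
Y = toy["class"].to_numpy()                 # the target column only

print(X.shape, Y.shape)
```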

Let's Train the Model!

6. To train our model, we are going to use one of the best libraries, scikit-learn. It will be fundamental in helping us perform classification, regression, clustering, and model selection. We will add a new cell and import sklearn, specifically the model selection module. For more on the library, check https://scikit-learn.org/stable/

Based on the information we have above (X, Y), we will split our data into two portions, 70% and 30%. We will train on the 70% and then test on the remaining 30%, which our model hasn't seen, to check how accurately it predicts. The split gives us X_train, X_test, Y_train, and Y_test, and we use the model selection module to do the splitting. We will use a random state of 42 so that the split is reproducible: the same rows land in the training and test portions on every run.

from sklearn import model_selection
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(
    X, Y, test_size=0.30, random_state=42)
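To see what the random state actually does, here is a small sketch that splits the same data twice and compares the results. It assumes scikit-learn's bundled copy of the iris data (load_iris) as a stand-in for our hosted CSV, so it runs without a download. The stratify option at the end is an optional extra that is not used in this tutorial.

```python
import numpy
from sklearn import model_selection
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled iris copy stands in for the hosted CSV
X, Y = load_iris(return_X_y=True)

# Same random_state -> the same rows land in each portion on every run
split_a = model_selection.train_test_split(X, Y, test_size=0.30, random_state=42)
split_b = model_selection.train_test_split(X, Y, test_size=0.30, random_state=42)
X_train, X_test, Y_train, Y_test = split_a
print(X_train.shape, X_test.shape)
print("identical test sets:", numpy.array_equal(split_a[1], split_b[1]))

# Optional: stratify=Y keeps the class proportions equal in both portions
_, _, _, Y_test_strat = model_selection.train_test_split(
    X, Y, test_size=0.30, random_state=42, stratify=Y)
print("test rows per class:", numpy.bincount(Y_test_strat))
```

The number 42 itself has no special meaning. Any fixed integer gives a repeatable split.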

We are going to perform classification, because the target is a categorical variable. By this I mean, we will supply sample data for any given flower, and the program should be able to tell us which flower matches the information. I will list several classifiers you can use to analyze the above data. Add a new cell and import the following modules from sklearn. We will use just one of these, but at least be aware of the rest.

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# It's a classification, the target is a categorical variable

We will use GaussianNB because it is a simple, fast classifier that is relatively accurate on this kind of data. No classifier is free of errors, though, so we will measure its accuracy on the test set.
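If you are curious how a few of the listed classifiers compare, one way to check is 5-fold cross-validation. This is a rough sketch rather than a benchmark. It assumes scikit-learn's bundled iris data (load_iris) instead of our CSV, and it tries only three of the classifiers with their default settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled iris copy used as a stand-in for the hosted CSV
X, Y = load_iris(return_X_y=True)

candidates = {
    "GaussianNB": GaussianNB(),
    "KNeighbors": KNeighborsClassifier(),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
}

mean_scores = {}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, Y, cv=5)  # accuracy on each of 5 folds
    mean_scores[name] = scores.mean()
    print(f"{name}: {mean_scores[name]:.3f}")
```

On a data set this small the differences between classifiers tend to be minor, so simplicity is a reasonable tie-breaker.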

We might also want to back up our model into a .sav file for future predictions, so we will import the pickle library, which is responsible for this. We will then fit the model classifier (GaussianNB) with X_train and Y_train for classification. The code below creates a new file called finalized_model.sav inside a folder called sample_data (sample_data/finalized_model.sav), saves the model in there, and loads it back.

import pickle
model = GaussianNB()
model.fit(X_train, Y_train)
# save the model to disk
filename = 'sample_data/finalized_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)
# load the saved model back from disk
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
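A quick way to convince yourself that the saved model is the same as the original is to compare their predictions. The sketch below does this in a temporary folder, so it does not depend on Colab's sample_data folder. It again assumes scikit-learn's bundled iris data as a stand-in for our CSV.

```python
import os
import pickle
import tempfile

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Assumption: bundled iris copy used as a stand-in for the hosted CSV
X, Y = load_iris(return_X_y=True)
model = GaussianNB().fit(X, Y)

# Save to a temporary folder, then load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "finalized_model.sav")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

same = bool((model.predict(X) == restored.predict(X)).all())
print("restored model matches original:", same)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code. It is also safest to load the model with the same scikit-learn version that saved it.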

Time to test the saved model!!

We then test the saved model by comparing its predictions for the sample (X_test) against the expected Y_test.

# Step 4: test the model
predictions = loaded_model.predict(X_test)
print('Model Predicted ', predictions)
print('Expected ', Y_test)

This is what your prediction will look like.

How do we check for prediction Accuracy?

To check whether our model got it right, we use the accuracy score from sklearn. For this tutorial, our model should be at least 70% accurate for us to consider it useful. Let's print the accuracy, the classification report, and the confusion matrix.

# Step 5: check accuracy
from sklearn.metrics import accuracy_score
print('Accuracy ', accuracy_score(Y_test, predictions))
from sklearn.metrics import classification_report
print(classification_report(Y_test, predictions))
from sklearn.metrics import confusion_matrix
print(confusion_matrix(Y_test, predictions))

Boom!! Our accuracy is 98%, which is a good sign, so we can use our model to predict the class of a flower from its measurements.
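The accuracy score and the confusion matrix are closely related: accuracy is simply the correct predictions (the diagonal of the matrix) divided by all predictions. Here is a small sketch with made-up labels, not our real test results, that shows the connection.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, made up for illustration
expected = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

# Rows are the expected classes, columns are the predicted classes
matrix = confusion_matrix(expected, predicted)
manual_accuracy = matrix.trace() / matrix.sum()  # diagonal sum / total count

print(matrix)
print("manual:", manual_accuracy)
print("sklearn:", accuracy_score(expected, predicted))
```

Any number off the diagonal is a mistake, and its position tells you which class was confused with which.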

Time to predict the name of an unknown flower from its dimensions. Our flower will have the following:

sepallength, sepalwidth, petallength, petalwidth and our task is to accurately determine its class.


flower = [[1.3, 1.4, 1.4, 1.5], [3.4, 2.3, 3.4, 3.4]]
predicted = model.predict(flower)
print('The flowers are likely to be ', predicted)

Our model detected the flower class as Iris-virginica.
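A predicted name alone does not tell you how sure the model is. GaussianNB also offers predict_proba, which returns one probability per class. The sketch below assumes scikit-learn's bundled iris data, whose class names are setosa, versicolor, and virginica rather than the labels in our CSV. The flower measurements are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Assumption: bundled iris copy used as a stand-in for the hosted CSV
iris = load_iris()
model = GaussianNB().fit(iris.data, iris.target)

# Hypothetical flower: sepal length, sepal width, petal length, petal width
flower = [[5.9, 3.0, 5.1, 1.8]]
probabilities = model.predict_proba(flower)[0]  # one probability per class

for name, p in zip(iris.target_names, probabilities):
    print(f"{name}: {p:.3f}")
```

If the top probability is not clearly ahead of the rest, treat the prediction with caution, especially for measurements far outside the range seen in training.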

Try this sample model with your own data by allowing console input, the same way you would in Python, by using input().

flower = [[float(input('Sepal le')), float(input('Sepal wi')),
           float(input('Petal le')), float(input('Petal win'))]]
predicted = model.predict(flower)
print('The flower is likely to be ', predicted)
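Calling float(input(...)) directly will crash if someone types something that is not a number. One possible way to make the input friendlier is a small helper that keeps asking until it gets a positive number. The name read_measurement and its reader parameter are my own, introduced for this sketch. The demonstration feeds canned answers instead of reading from the console.

```python
def read_measurement(prompt, reader=input):
    """Ask until the user enters a positive number, then return it as a float."""
    while True:
        raw = reader(prompt)
        try:
            value = float(raw)
        except ValueError:
            print("Please enter a number, for example 5.1")
            continue
        if value > 0:
            return value
        print("Measurements must be greater than zero")


# Demonstration with canned answers instead of a live console
answers = iter(["abc", "-1", "5.1"])
demo_value = read_measurement("Sepal length: ", reader=lambda _prompt: next(answers))
print(demo_value)
```

In Colab you would call read_measurement('Sepal length: ') four times, once per attribute, and pass the results to model.predict as before.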