Frequently, in computer science has the (hello world) starting point in the new programming language, technique or framework. In Data Science and Machine Learning (mostly in Machine Learning), the hello world is a task classification using the dataset Iris (link).
In Machine Learning, the classification task consist in train a machine learning model able for recognize different patterns in datasets and predict a label class correctly. This learning/training process for classification problem is called “supervised learning”, for that is necessary know a priori the label for each pattern. Through relation between pattern and class label, the machine learning model is adjusted with the goal to reduce the classification error.
In dataset Iris, the classification problem consists on classify each by type flowers: Iris Setosa, Iris Versicolour or Iris Virginica. A pattern example is the vector: [5.1, 3.5, 1.4, 0.2] with four elements/features. The first feature is sepal length in cm, the second feature is sepal width in cm, the third is petal length in cm, and the fourth feature is petal width in cm.
[ 1ª Step ] Data Processing
In this step, we need to load the dataset Iris for train and test a machine learning model. For this tutorial, we will use the Iris dataset provide by Scikitlearn. That dataset has your data ready for use in machine learning model.
The python scripts for that is:
After executing this command lines, we have all patterns and labels of dataset Iris. For know which dataset size, you execute this scripts:
The first output is (150, 4), are respective 150 patterns (50 for class Iris setosa, 50 for class Iris versicolour, and 50 for class Iris virginica), each pattern with four features. When this division dataset is equally for each class, is called balanced dataset. This, makes easier (not always) the training process. The second output is (150,), this is length of label vector and composed by:  for representing the label Iris setosa,  for representing the label Iris setosa, and  for representing the label Iris versicolour.
For train and test model is a necessary use different sample. One simple way is divide dataset in two sample: sample train and sample teste. The python script for that is:
The parameter ‘test_size’ is responsible for defining the proportion of patterns for each sample (train and test sample). In this case, we use the value 0.3 for test_size, that is 30% of all patterns for make sample test.
Now we have the samples for train and test a machine learning model.
[2ª Step] Train Machine Learning Model
In this step, we do train a model using Sckitlearn. The machine learning model very popular is the KNN (k-nearest neighbors) (link). This model is nontrainable and nonparametric (no existing parameters for being fit), also is instance-based learning, there is no process for training the model, your prediction is based on distance (for example Euclidian distance) for each instance. The main idea of KNN is to classify a new pattern for the label more frequent in K neighbors. Because your simplicity the KNN can be considered o best model for initiate in Machine Learning and Data Science.
The python script for use KNN through Scikitlean is:
These scripts create a model through KNeighborsClassifier. The parameter ‘n_neighbors’ is responsible for defining how many instances in train sample with small distance for the new instance (pattern) is used for classifying.
For train machine learning model using Scikitlearn, the following python script is used:
The output shows all parameters of the model. For more details this parameter you can get here.
Now we have a model, and we can do predict new patterns.
[3ª Step ] Model Test
In this moment, we go evaluate the model for new patterns. These patterns are test samples. For that, we use the following script for obtaining all labels predictions:
Now we use a metric of accuracy for calculating the scoring of a model for new patterns. For that, we use the following script:
In output, we have the value score: 0.97777. In other words, the KNN model predicts correctly 97% of test samples and is a very good model for the task of predict class in Iris dataset.
# The process for making a solution for the task of classification must have at least three steps: data pre-processing, model train, and model test.
# Through of Scikitlearn for all steps we need only a few lines scripts;
# The KNN model is simple and no existing parameters for being fit
You can visualize script used in this post through github
This is my ‘hello world’ in Medium Blog