The “Hello world” in Data Science and Machine Learning

5 min read · Oct 3, 2017

In computer science, “Hello world” is the customary starting point when learning a new programming language, technique, or framework. In Data Science and Machine Learning (mostly in Machine Learning), the hello world is a classification task using the Iris dataset (link).

In Machine Learning, a classification task consists of training a model that can recognize patterns in a dataset and predict the correct class label for each one. This training process is called “supervised learning”, because the label of each training pattern must be known a priori. Using the relation between each pattern and its class label, the model is adjusted with the goal of reducing the classification error.

In the Iris dataset, the classification problem consists of classifying each flower by type: Iris Setosa, Iris Versicolour, or Iris Virginica. An example pattern is the vector [5.1, 3.5, 1.4, 0.2], which has four elements (features). The first feature is sepal length in cm, the second is sepal width in cm, the third is petal length in cm, and the fourth is petal width in cm.

To write our ‘hello world’ we will use Jupyter Notebook, Python 3, and scikit-learn. The process can be divided into three steps: data processing, model training, and model testing.

[ Step 1 ] Data Processing

In this step, we load the Iris dataset so that we can train and test a machine learning model. For this tutorial, we will use the copy of the Iris dataset provided by scikit-learn, which is already in a form ready for use in a machine learning model.
The Python script for that is:
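A minimal sketch of this loading step, assuming scikit-learn is installed. The variable names `iris`, `X`, and `y` are chosen here for illustration and may differ from the original notebook:

```python
from sklearn.datasets import load_iris

# Load the copy of the Iris dataset bundled with scikit-learn.
iris = load_iris()

X = iris.data    # patterns: one row per flower, one column per feature
y = iris.target  # labels: one integer class label per flower
```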

After executing these lines, we have all the patterns and labels of the Iris dataset. To check the size of the dataset, execute this script:
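One way to check the size, sketched under the assumption that the patterns and labels are held in arrays named `X` and `y` (names chosen here for illustration):

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# .shape reports the dimensions of each NumPy array.
print(X.shape)  # (150, 4)
print(y.shape)  # (150,)
```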

The first output is (150, 4): 150 patterns (50 for class Iris setosa, 50 for class Iris versicolour, and 50 for class Iris virginica), each with four features. When the patterns are divided equally among the classes like this, the dataset is called balanced. This often (though not always) makes the training process easier. The second output is (150,): the length of the label vector, which is composed of [0] representing the label Iris setosa, [1] representing the label Iris versicolour, and [2] representing the label Iris virginica.
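The class balance can be checked directly. The sketch below is not part of the original notebook; it counts how many patterns carry each label with NumPy's `bincount` and pairs the counts with the class names that scikit-learn ships with the dataset (note that scikit-learn spells the second class `versicolor`):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# bincount returns, for each integer label 0, 1, 2, how often it occurs.
counts = np.bincount(iris.target)

for label, (name, count) in enumerate(zip(iris.target_names, counts)):
    print(label, name, count)
```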

Training and testing a model requires different samples. One simple way is to divide the dataset into two samples: a train sample and a test sample. The Python script for that is:
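A sketch of the split using scikit-learn's `train_test_split`. The value `random_state=42` is an assumption added here only to make the split reproducible; the original notebook may have used a different seed or none at all:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Hold out 30% of the patterns for testing; the rest are used for training.
# random_state is an assumed value, used only for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
```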

Dividing the dataset into two samples: train and test

The parameter ‘test_size’ defines the proportion of patterns that goes to the test sample; the rest go to the train sample. In this case, we use the value 0.3 for test_size, so 30% of all patterns (45 of 150) make up the test sample.

Now we have the samples to train and test a machine learning model.

[ Step 2 ] Train Machine Learning Model

In this step, we train a model using scikit-learn. A very popular machine learning model is KNN (k-nearest neighbors) (link). This model is nonparametric (it has no parameters to be fitted) and instance-based: there is no real training process, the model simply stores the training instances, and its prediction is based on the distance (for example, Euclidean distance) between the new pattern and each stored instance. The main idea of KNN is to assign a new pattern the label that is most frequent among its K nearest neighbors. Because of its simplicity, KNN can be considered one of the best models for getting started in Machine Learning and Data Science.
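To make the idea concrete, here is a from-scratch sketch of KNN prediction with Euclidean distance and a majority vote. It is an illustration only, not scikit-learn's implementation; the function name `knn_predict` and the toy data are made up for this example:

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training patterns."""
    # Euclidean distance from x_new to every training pattern.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k training patterns with the smallest distances.
    nearest = np.argsort(distances)[:k]
    # The most frequent label among those k neighbors wins.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Made-up toy data: two well-separated clusters labeled 0 and 1.
X_toy = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0])))  # 0
print(knn_predict(X_toy, y_toy, np.array([5.0, 5.1])))  # 1
```

Note that nothing is fitted here: all the work happens at prediction time, which is why the stored training sample is the whole "model".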

The Python script for using KNN through scikit-learn is:
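A sketch of creating the classifier. The value `n_neighbors=3` and the variable name `knn` are assumptions made for illustration; the value used in the original notebook is not shown in this post:

```python
from sklearn.neighbors import KNeighborsClassifier

# n_neighbors is K, the number of nearest neighbors that vote on the label.
# The value 3 is an assumed example, not necessarily the post's setting.
knn = KNeighborsClassifier(n_neighbors=3)
```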

This script creates a model with KNeighborsClassifier. The parameter ‘n_neighbors’ defines how many instances of the train sample, those with the smallest distance to the new instance (pattern), are used to classify it.

To train the machine learning model using scikit-learn, the following Python script is used:
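A self-contained sketch of the training call. As before, `n_neighbors=3`, `random_state=42`, and the variable names are assumptions chosen for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42  # assumed seed
)

knn = KNeighborsClassifier(n_neighbors=3)  # assumed K

# For KNN, fit() essentially stores the train sample for later lookups.
knn.fit(X_train, y_train)
```

In a notebook, `fit` returns the estimator itself, so the cell displays the model's representation; how many parameters appear in that representation depends on the scikit-learn version.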

The output shows the parameters of the model. For more details on these parameters, see here.

Now we have a model, and we can predict labels for new patterns.

[ Step 3 ] Model Test

Now we evaluate the model on new patterns. These patterns are the test sample. For that, we use the following script to obtain the predicted labels:
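A self-contained sketch of the prediction step, under the same assumed settings as elsewhere (`n_neighbors=3`, `random_state=42`, illustrative variable names):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42  # assumed seed
)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)  # assumed K

# One predicted label (0, 1, or 2) for every pattern in the test sample.
y_pred = knn.predict(X_test)
print(y_pred.shape)  # (45,)
```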

Predict label for new patterns

Now we use the accuracy metric to score the model on the new patterns. For that, we use the following script:
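A sketch of the scoring step with scikit-learn's `accuracy_score`, which returns the fraction of test patterns whose predicted label matches the true label. The settings `n_neighbors=3` and `random_state=42` are assumptions, so the number printed here may differ from the score reported in this post:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42  # assumed seed
)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)  # assumed K
y_pred = knn.predict(X_test)

# Accuracy = number of correct predictions / number of test patterns.
score = accuracy_score(y_test, y_pred)
print(score)
```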

The output is the score 0.97777. In other words, the KNN model correctly predicts about 97.8% of the test samples (44 of 45), which makes it a very good model for the task of predicting the class in the Iris dataset.

Summary

# A solution for a classification task must have at least three steps: data pre-processing, model training, and model testing.

# With scikit-learn, all of these steps need only a few lines of script.

# The KNN model is simple and has no parameters to be fitted.

You can view the scripts used in this post on GitHub.

P. S.

This is my ‘hello world’ on Medium.