Decision Tree Learning — Intro To Machine Learning #2

Jose Fumo · Published in Simple AI · 6 min read · Jan 21, 2017
Iris Flower with a touch of Deep Dream

Hi, this is the second article of the series Introduction To Machine Learning. If you missed the first part, you can find it here (depending on your background, it may not be required for reading this article). Here is the Github Repo for the code used here.

Agenda:

  • Learning Algorithms
  • Function Approximation
  • Iris Dataset
  • Classification Problem
  • Testing Data
  • Decision Tree Learning
  • Model Training
  • Model Evaluation
  • Resources

Learning Algorithms

When we think of machine learning algorithms, we describe them as learning a target function that best maps input variables (X), also called features, to an output variable (Y).

Y = f(X)

Function Approximation

In a learning task we would like to make predictions (Y) given new input variables (X). We don't know what the function (f) really looks like; logically, if we knew it, we would not need to train machine learning algorithms.

Problem Setting

(1). Set of possible instances X

(2). Unknown target function f: X -> Y

(3). Set of function hypotheses H={ h | h : X -> Y }

Input:

  • Training examples

Output:

  • Hypothesis h ∈ H that best approximates target function f

You can think of training a machine learning algorithm as a search/optimization problem: searching for the function that predicts our data well. We often choose a loss function to evaluate how well or how badly the algorithm is learning (later in this series we'll see the different ways we can evaluate the performance of ML algorithms).

Iris Dataset

We are going to use the Iris Dataset. This dataset contains Sepal Length, Sepal Width, Petal Length, Petal Width and Species. The sepal and petal lengths and widths are real-valued, and the species (categorical) can assume three different values: Iris setosa, Iris versicolor and Iris virginica. Our goal here is to predict the species using a classifier.

Left: Iris virginica, Center: Iris versicolor, Right: Iris setosa

Classification Problem

In a classification problem we are trying to predict a discrete label (Y) given our features (X).

Extracted Features X = {Sepal Length, Sepal Width, Petal Length, Petal Width}

Labels Y = {Iris Setosa, Iris Versicolor, Iris Virginica}

some instances taken from Wikipedia

There are 150 training examples, 50 for each species. To train a model we must transform all categorical values into numbers.

e.g. Iris Setosa = 0, Iris Versicolor = 1, Iris Virginica = 2.
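As a quick sketch, this label encoding can be done with a plain dictionary (the variable names here are my own):

```python
# map each species name to an integer label
species = ["Iris Setosa", "Iris Versicolor", "Iris Virginica", "Iris Setosa"]
mapping = {"Iris Setosa": 0, "Iris Versicolor": 1, "Iris Virginica": 2}

encoded = [mapping[s] for s in species]
print(encoded)  # [0, 1, 2, 0]
```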

Fortunately scikit-learn comes with this dataset already prepared for us, but don't expect it will always be like this; most of the time we'll have to deal with datasets coming in all kinds of formats (e.g. .csv, .json, .data, etc.).

In the first line I import the Iris dataset from sklearn's datasets module, and in the second I load it.

(Code: print the iris feature names, the feature matrix X, and three training examples from X.)
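The original code screenshot is not reproduced here, but it can be sketched roughly as follows (variable names are my own):

```python
from sklearn.datasets import load_iris

# load the prepared Iris dataset that ships with scikit-learn
iris = load_iris()

print(iris.feature_names)  # sepal/petal length and width, in cm
X = iris.data              # feature matrix, shape (150, 4)
print(X[:3])               # display 3 training examples from X
y = iris.target            # label vector: 0, 1 or 2
```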

y is a vector (which is why it's written in lowercase) containing the output labels; there are three possible values: 0, 1 or 2.

We are going to train a Decision Tree Classifier on this data.

Testing Data

Whenever we train an algorithm we want to evaluate model performance, and we do that by using the trained model to classify unseen examples. In this case, for brevity, I'm going to leave out six examples to be used for testing, but in upcoming articles I will show you a better and simpler way of doing that.

train and test data
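The split can be sketched like this; which six indices are held out is my own assumption (two per species):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# hold out six examples for testing (indices are an assumption: two per species)
test_idx = [0, 1, 50, 51, 100, 101]

# everything else becomes the training set
train_data = np.delete(iris.data, test_idx, axis=0)
train_target = np.delete(iris.target, test_idx)
test_data = iris.data[test_idx]
test_target = iris.target[test_idx]

print(train_data.shape)  # (144, 4)
print(test_target)       # [0 0 1 1 2 2]
```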

Decision Tree Learning

In this case, each hypothesis h in H is a decision tree, and we want to find the tree that best maps the input variables (X) to their labels (Y). The goal of a Decision Tree is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

Model Training

Here is the code for training the decision tree classifier:

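A sketch of that training code, assuming the six held-out examples from the split above:

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
test_idx = [0, 1, 50, 51, 100, 101]  # assumed held-out indices
train_data = np.delete(iris.data, test_idx, axis=0)
train_target = np.delete(iris.target, test_idx)

# initialize the classifier with its default parameters...
clf = tree.DecisionTreeClassifier()
# ...then fit() is where the actual training takes place
clf.fit(train_data, train_target)
```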

First we initialize the Decision Tree Classifier with the default parameters; second, we use the fit() method to train our model, passing the features (X, in this case train_data) and the labels (y). In scikit-learn, the fit() method is where the training process of the machine learning algorithm actually takes place.

Model Evaluation

Predict new Instances

With our trained model we now want to predict unseen data (the test data), which in this case is just the six examples we left out. Now let's compare the predicted labels with the true ones.

True labels
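The prediction step can be sketched end to end like this (again assuming the held-out indices chosen above):

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
test_idx = [0, 1, 50, 51, 100, 101]  # assumed held-out indices
train_data = np.delete(iris.data, test_idx, axis=0)
train_target = np.delete(iris.target, test_idx)
clf = tree.DecisionTreeClassifier().fit(train_data, train_target)

# classify the six unseen examples and compare with the true labels
predicted = clf.predict(iris.data[test_idx])
print(predicted)
print(iris.target[test_idx])  # true labels: [0 0 1 1 2 2]
```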

From this you can see our classifier did a good job. But it would be better to evaluate our model using a specific metric to measure performance; suppose, for example, your goal is to get your model scoring at least 70%. In the next article we are going to learn some practical ways to calculate model accuracy.

If you want to visualize the final Decision Tree created from this data, you can find it in the Github Repo as a .pdf or by opening the IPython Notebook. If you hit a bug or other issues with the code, please let me know; I would love to help you.
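If you'd rather generate that visualization yourself, scikit-learn's export_graphviz can emit the tree in Graphviz dot format; here is a minimal sketch (the output filename in the comment is my own):

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
clf = tree.DecisionTreeClassifier().fit(iris.data, iris.target)

# export_graphviz returns the dot source as a string when out_file=None
dot = tree.export_graphviz(clf, out_file=None,
                           feature_names=iris.feature_names,
                           class_names=iris.target_names)
# save the string to e.g. iris_tree.dot, then render with Graphviz:
#   dot -Tpdf iris_tree.dot -o iris_tree.pdf
```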

Resources:

Further Readings:

Videos:

I tried to make this article short, but I think I had trouble doing that. Just let me know what you think; if you have any questions or anything you want to say, feel free to get in touch.

Next:

In the next article we are going to talk about evaluation metrics and the famous pandas library, and use them on the Titanic dataset. By the end of the next article you will have a working solution you can use to join the Titanic Kaggle competition.

Let me know what you think about this. If you enjoyed the writing, please use the ❤ heart below to recommend this article so that others can see it.

Happy learning.


Jose Fumo
Simple AI

Passionate about technology, financial markets and above all, Humanity. I share my journey to Self-Discovery and Personal Growth.