Machine Learning Basics with Examples — Part 4 Decision Trees

Canburak Tümer
7 min read · Aug 25, 2018


Introduction

Decision Tree produced

Wikipedia defines a decision tree as:

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.

In my own words, a decision tree is a classification system that is essentially one large if-else block. A decision tree consists of nodes: it starts from the root node and branches at every level until the leaf nodes are reached. That is why it is called a tree.

Algorithms

There are several decision tree algorithms, and several branching techniques they can use.

First, let's talk about the two main branching metrics: Gini and entropy.

Gini is a metric used for branching: it measures how likely a randomly chosen data point in a node would be misclassified if it were labeled according to the node's class distribution. Gini is used in the CART algorithm, and it is calculated as Gini = 1 − Σᵢ pᵢ², where pᵢ is the proportion of class i in the node.
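
As a quick illustration (not code from the original notebook), here is a minimal sketch of how Gini impurity could be computed for a list of class labels:

import numpy as np

def gini_impurity(labels):
    # Gini = 1 - sum(p_i^2), where p_i is the proportion of class i in the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([1, 1, 1, 1]))  # a pure node: 0.0
print(gini_impurity([0, 1, 0, 1]))  # a 50/50 node: 0.5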

The entropy method is also known as information gain in decision trees. It tries to minimize the entropy (lack of predictability) of the data. If all the samples that follow a branch belong to a single class, the entropy is 0; if a node splits into two outcomes with equal probability, the entropy is 1.

Information gain chooses the best split for a given node: for every candidate feature, the information gain of the resulting split is calculated, and the feature with the highest gain is selected for branching.

Information gain formula: IG = Entropy(parent) − Σₖ (nₖ / n) · Entropy(childₖ), that is, the entropy of the parent node minus the weighted average entropy of its child nodes.
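
And a matching sketch for entropy and information gain (again illustrative, with my own helper names):

import numpy as np

def entropy(labels):
    # Shannon entropy of a label array, in bits
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent_labels, child_label_groups):
    # Entropy of the parent minus the weighted entropy of its children
    n = len(parent_labels)
    weighted_child_entropy = sum(
        len(child) / n * entropy(child) for child in child_label_groups
    )
    return entropy(parent_labels) - weighted_child_entropy

# Splitting a mixed node into two pure children gives the maximum gain (1 bit here)
parent = [0, 0, 1, 1]
print(information_gain(parent, [[0, 0], [1, 1]]))  # 1.0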

Let's look into the algorithms. We are going to cover two of them: ID3 and C4.5.

The ID3 algorithm uses information gain for branching: it starts from the labeled training set and, at every split point, calculates the information gain for each feature and selects the feature with the best gain for the split.

The C4.5 algorithm uses normalized information gain for branching. It takes the labeled training data and, at each step, evaluates the features and selects the best one as the branching feature. Its difference from ID3 is that it can handle missing values as well as both discrete and continuous attributes. For more information you can check the Wikipedia page.
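
To make "select the feature with the best information gain" concrete, here is a rough sketch of ID3-style split selection for categorical features (a simplified illustration with my own helper names; real implementations, such as scikit-learn's optimized CART, are considerably more involved):

import numpy as np
import pandas as pd

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split_feature(X, y):
    # Return the categorical feature whose split gives the highest information gain
    parent_entropy = entropy(y)
    gains = {}
    for feature in X.columns:
        child_entropy = 0.0
        for value in X[feature].unique():
            mask = X[feature] == value
            child_entropy += mask.mean() * entropy(y[mask])
        gains[feature] = parent_entropy - child_entropy
    return max(gains, key=gains.get)

# e.g. best_split_feature(df[['Pclass', 'Sex', 'Embarked']], df['Survived'])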

Example

For the decision tree example, we are going to use the Titanic dataset. Our aim with this dataset is to predict the survival status of the passengers. We take the data from Kaggle, which has already split it into training and test files. For the final evaluation we will submit our output back to Kaggle.

Data

As mentioned above, our data is about Titanic passengers. Let's have a look at the data.


The first thing to do is to get some info about the dataset and see some sample data from it.

Column information for dataset

By using pandas we can see a summary of the dataset and its column information. As we can see in the image, we have a total of 891 rows and 12 columns; some of the columns are integers, some are floating point and some are objects. We can also see how many non-null values each column has. For example, Cabin only has 204 non-null values, which probably makes this column unusable for us.
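
For reference, the call behind this summary is a plain pandas one (assuming the Kaggle training file was downloaded locally as train.csv):

import pandas as pd

df = pd.read_csv('train.csv')  # Kaggle's Titanic training file
df.info()                      # row count, dtypes and non-null counts per column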

Statistical information for numerical columns

Pandas also has a nice method for seeing statistical information about the numerical columns. Seeing the min, max, mean and quartiles of a numerical column can give some hints about the data.
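
With the DataFrame loaded above, that is simply:

df.describe()  # count, mean, std, min, quartiles and max for the numeric columns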

Finally some sample data

And finally, I look at some sample data. Looking at the data can help us decide which columns are usable and which are not. In this data, PassengerId, Name, Ticket and Cabin seem useless at first sight. If we had more domain knowledge about the Titanic we might engineer some features from Ticket and Cabin, but I do not have that knowledge, so I will leave them out.
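
The sample rows come from something like:

df.head()      # the first few rows
df.sample(5)   # or a random sample of rows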

Now let's have a look at the columns that we think are meaningful, to see if they are really helpful.

Upper: survived / Lower: deceased passenger counts by class

As can be seen in the graphic above, although the survival counts are almost equal across the classes, the number of deceased passengers increases as the class number increases (that is, as the class gets lower). And if we look at the survived/deceased ratio for each class, we can see that first-class passengers had a better survival rate. So I will take this column as a meaningful one.
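
A plot like the one described above could be drawn along these lines (an illustrative sketch; the notebook may build it differently):

import seaborn as sns
import matplotlib.pyplot as plt

# Passenger counts per class, one panel for survived (1) and one for deceased (0)
sns.catplot(data=df, x='Pclass', kind='count', row='Survived')
plt.show()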

Survived / Fare scatterplot

It is really hard to get hints from the points on this graph alone, but the Seaborn library also provides a trend line. We can see the trend going higher as the fare rises, and we can just about see that there are more survived points than deceased points at the high-fare end. That is why I will add the Fare feature to the model as a meaningful feature.
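
The scatterplot with a trend line is essentially Seaborn's regplot (again a sketch, not necessarily the exact call from the notebook):

import seaborn as sns
import matplotlib.pyplot as plt

# Fare vs. survival status (0/1) with a fitted trend line
sns.regplot(data=df, x='Fare', y='Survived')
plt.show()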

I will not write about all the features here, but you can find a visualization for each feature in the Jupyter notebook.

Notebook

The code for these examples can be found as a Jupyter notebook at https://github.com/CanburakTumer/medium-ml-basics-examples/blob/master/1%20Decision%20Tree.ipynb.

I have some comments in the notebook, but in case you need more clarification, you are welcome to comment on this post and ask about any unclear part.

Data Preparation

# df is the training DataFrame loaded from Kaggle's train.csv earlier in the notebook
from sklearn import preprocessing

cleaned = df.drop(['PassengerId', 'Parch', 'SibSp', 'Name', 'Ticket', 'Cabin'], axis=1)  # drop unrelated columns
cleaned['Age'] = cleaned['Age'].fillna(df['Age'].mean())  # fill nulls in the Age column with the mean age
cleaned = cleaned.dropna()  # drop the rows with an empty Embarked value

# Encode the categorical columns as integers, since scikit-learn's trees cannot handle strings
le = preprocessing.LabelEncoder()
cleaned['Embarked'] = le.fit_transform(cleaned['Embarked'])
cleaned['Sex'] = le.fit_transform(cleaned['Sex'])
cleaned.info()

In the data phase we decided which columns to use. Based on this decision, I dropped the unused columns. Then I filled the nulls in the Age column with the average age.

In the second part, I had to encode the categorical columns as numerical ones, because scikit-learn's decision tree method cannot handle non-numerical features.

Finally, I split my data into two sets to train and evaluate my models. Although Kaggle provides a test dataset, it is not usable for local evaluation since it does not contain the class labels.
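
The split itself is not shown in the snippet above; it could look roughly like this (the 80/20 ratio and the random_state are my own choices, not necessarily the notebook's):

from sklearn.model_selection import train_test_split

X = cleaned.drop(['Survived'], axis=1)  # features
y = cleaned['Survived']                 # class label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)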

Model Selection

from sklearn import tree

giniClassifier = tree.DecisionTreeClassifier(criterion = 'gini', max_depth=5)
entropyClassifier = tree.DecisionTreeClassifier(criterion = 'entropy', max_depth=5)
giniClassifier.fit(X_train, y_train)
entropyClassifier.fit(X_train, y_train)

In a real-life scenario you would train different types of models and compare them to select the best one. For this series, however, we are going to train only one type of model with different hyperparameters.

We are training both a Gini tree and an entropy tree, and I am setting max_depth to 5 to avoid overfitting.

yPredGini = giniClassifier.predict(X_test)
yPredEntropy = entropyClassifier.predict(X_test)

After fitting the trees to the training data set, we feed them with the test dataset to see how well they score.

from sklearn.metrics import f1_score

giniScore = f1_score(y_test, yPredGini)
entropyScore = f1_score(y_test, yPredEntropy)
print("Gini Score : " + str(giniScore) + "\nEntropy Score : " + str(entropyScore))
Scores of the different trees

After checking the F1 scores of both decision trees, I am choosing the tree that uses Gini as the branching criterion.

Final Evaluation

After selecting the model, I fed Kaggle's test data into the selected model, wrote the outputs to a file and submitted it to the Kaggle system. The final score from Kaggle can be found below.
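
This last step is not shown above either; a rough sketch of what it could look like, assuming Kaggle's test.csv file and its PassengerId/Survived submission format, and reusing the giniClassifier trained earlier:

import pandas as pd
from sklearn import preprocessing

# Load Kaggle's test file and apply the same preparation as the training data
test_df = pd.read_csv('test.csv')
test_clean = test_df.drop(['Parch', 'SibSp', 'Name', 'Ticket', 'Cabin'], axis=1)
test_clean['Age'] = test_clean['Age'].fillna(test_clean['Age'].mean())
test_clean['Fare'] = test_clean['Fare'].fillna(test_clean['Fare'].mean())  # fill any missing fares too

# Simplified encoding: refitting works here because the categories match the training data
test_clean['Embarked'] = preprocessing.LabelEncoder().fit_transform(test_clean['Embarked'])
test_clean['Sex'] = preprocessing.LabelEncoder().fit_transform(test_clean['Sex'])

predictions = giniClassifier.predict(test_clean.drop(['PassengerId'], axis=1))

# Kaggle expects a CSV with PassengerId and Survived columns
submission = pd.DataFrame({'PassengerId': test_clean['PassengerId'], 'Survived': predictions})
submission.to_csv('submission.csv', index=False)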

Final Score

Roadmap

My blog post series will cover the topics below, and this roadmap will be updated as each post is published.

  • Introduction
  • Supervised Learning
  • Classification
  • Decision Trees (this post)
  • Random Forests
  • SVM
  • Naive Bayes
  • Regression
  • Unsupervised Learning
  • Clustering
  • Feature Selection and PCA
  • Send Models to Production

Conclusion

We have completed the post about decision trees, one of the most used and most explainable learning algorithms.

Thanks for reading, and if you have any questions or comments please do not hesitate to comment on the post.

If you liked the post and found it useful, share it or give some claps. Thank you!


Canburak Tümer

Cloud Data Engineer @ Google Cloud | Data, Coding and Travel enthusiast