Learn Decision Trees with Kaggle Example

Lalit Kishore Vyas
9 min read · Apr 23, 2019


Easily Digestible Theory + Kaggle Example = Become a Kaggler

Let’s start the fun learning with a fun example available on the Internet called Akinator (I would highly recommend playing with it).

Now, let’s first learn a key concept behind the Decision Tree algorithm: Entropy.

Entropy comes from physics & to explain it we will use the example of the three states of water.

Entropy measures how much freedom a particle has to move around, so we can describe the Entropy of solid, liquid & gaseous water as low, medium & high respectively.

The notion of Entropy can also be looked at with the help of probabilities, for example, the different configurations of balls in a set of containers. Remember: the more ways there are to arrange the balls, the higher the Entropy.

Entropy can also be understood with the help of a concept called Knowledge. Taking this example of balls: if we have to pick a random ball, how much do we know about the colour of that ball? In the first bucket we know for sure that the ball is red, so we have high knowledge. In the second bucket it is likely to be red and unlikely to be blue, so if we bet that the colour of a randomly picked ball is red we will be right most of the time; we can say we have medium knowledge of the colour of the ball. In the third bucket it is equally likely to be blue or red, so we have low knowledge about the colour. And it turns out that Knowledge & Entropy are opposites.

So, in order to cook up the formula for Entropy, we will consider the following game. We take the configuration red, red, red & blue and put the four balls inside a bucket. Now we draw four balls from the bucket with replacement and try to reproduce the initial configuration (red, red, red & blue, in this order); if we get this configuration we win, else we lose.

Now the question is: what is the probability that we win this game?

So, let’s find the probabilities one by one: each red ball is drawn with probability 3/4 and the blue ball with probability 1/4, and the probability of winning is the product of these four probabilities.
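To make this concrete, here is a tiny calculation in plain Python (no libraries needed); the numbers come straight from the red, red, red & blue bucket:

# Probability of each draw in the winning sequence (drawing with replacement)
p_red = 3 / 4   # three of the four balls in the bucket are red
p_blue = 1 / 4  # one of the four balls is blue
# Probability of winning: draw red, red, red, blue in exactly that order
p_win = p_red * p_red * p_red * p_blue
print(p_win)  # 0.10546875, i.e. 27/256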

Now, products of probabilities are problematic mainly for two reasons:

  1. Let’s say we have a thousand balls; if we multiply their probabilities (which are always between 0 & 1), the result will be a very, very tiny number.
  2. The other reason is that a small change in just one of the factors can change the product drastically.

So, we need something better than a product, namely a sum, and we can get one by taking the logarithm, because, as we know, log(ab) = log(a) + log(b), i.e. the log of a product is the sum of the logs.

By definition, we get the value of Entropy as the average of the negatives of the logarithms of the probabilities of picking each ball in a way that we win the game.

So, in the slightly more general case of m red balls and n blue balls, we get:

Entropy = -(m/(m+n)) log2(m/(m+n)) - (n/(m+n)) log2(n/(m+n))

and, for a set whose classes appear with proportions p_1, p_2, …, p_k, this becomes:

Entropy = -Σ p_i log2(p_i)
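As a quick sanity check, here is a minimal sketch of this formula applied to the three buckets from earlier (all red; three red & one blue; two red & two blue). The entropy helper below is written just for this article, not taken from any library:

import numpy as np
def entropy(proportions):
    # Entropy = -sum(p * log2(p)) over the class proportions
    proportions = np.array([p for p in proportions if p > 0])  # skip p = 0, since log2(0) is undefined
    return -np.sum(proportions * np.log2(proportions))
print(entropy([1.0]))         # -0.0 (i.e. zero): all red, high knowledge, low Entropy
print(entropy([0.75, 0.25]))  # ~0.811: mostly red, medium knowledge
print(entropy([0.5, 0.5]))    # 1.0: half & half, low knowledge, high Entropy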

Another key concept is Information Gain, which is derived from Entropy as:

Information Gain = Entropy(parent) - weighted average of Entropy(children)

where each child’s Entropy is weighted by the fraction of the parent’s samples that ends up in that child.
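Here is a small sketch of that definition for a two-way split, reusing the entropy helper above; the label lists are made up purely to show the arithmetic:

from collections import Counter
def entropy_of_labels(labels):
    # Turn a list of labels into class proportions, then apply the Entropy formula
    counts = Counter(labels)
    return entropy([c / len(labels) for c in counts.values()])
def information_gain(parent, left, right):
    # Entropy(parent) minus the size-weighted average of the children's entropies
    weighted = (len(left) / len(parent)) * entropy_of_labels(left) \
             + (len(right) / len(parent)) * entropy_of_labels(right)
    return entropy_of_labels(parent) - weighted
parent = ['A', 'A', 'A', 'B', 'B', 'B']
print(information_gain(parent, ['A', 'A', 'A'], ['B', 'B', 'B']))  # 1.0: a perfect split
print(information_gain(parent, ['A', 'B', 'A'], ['B', 'A', 'B']))  # ~0.08: a poor split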

Now, let’s build a Decision Tree —

Our algorithm will be very simple: look at the possible splits that each column gives, calculate the Information Gain for each, and pick the split with the largest one.

Again: we choose the split that gives the largest amount of Information Gain.

Let’s suppose we have a problem of recommending apps based on the given play store data.

The Entropy of the parent node can be computed with the formula above. Now, suppose we first split the data on the basis of Gender and compute the Information Gain for that split.

Then we do the same when we split on the basis of Occupation.

As the Information Gain for Occupation is greater, we pick that split first, so our tree starts with Occupation at the root.
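Here is a small sketch of this comparison on a made-up toy table (the column names mirror the example above, but the rows are invented), reusing the helpers defined earlier:

import pandas as pd
# A made-up toy table: two candidate feature columns and the label we want to predict
data = pd.DataFrame({
    'Gender':     ['F', 'F', 'M', 'M', 'F', 'M'],
    'Occupation': ['study', 'work', 'study', 'work', 'work', 'study'],
    'App':        ['A', 'B', 'A', 'B', 'B', 'A'],
})
def best_split(df, feature_columns, label_column):
    gains = {}
    for col in feature_columns:
        # Size-weighted average Entropy of the groups produced by splitting on this column
        groups = [g[label_column].tolist() for _, g in df.groupby(col)]
        weighted = sum(len(g) / len(df) * entropy_of_labels(g) for g in groups)
        gains[col] = entropy_of_labels(df[label_column].tolist()) - weighted
    # Pick the column with the largest Information Gain
    return max(gains, key=gains.get), gains
print(best_split(data, ['Gender', 'Occupation'], 'App'))  # Occupation wins with the larger gain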

Hyperparameters for Decision Trees

In order to create decision trees that will generalize well to new problems, we can tune a number of different aspects of the trees. We call these different aspects of a decision tree “hyperparameters”. These are some of the most important hyperparameters used in decision trees:

Maximum Depth

The maximum depth of a decision tree is simply the largest possible length of a path from the root to a leaf. A tree of maximum depth k can have at most 2^k leaves.


Minimum number of samples to split

A node must have at least min_samples_split samples in order to be large enough to split. If a node has fewer than min_samples_split samples, it will not be split, and the splitting process stops there.


However, min_samples_split doesn't control the minimum size of leaves. For example, suppose min_samples_split = 11 and a parent node has 20 samples: since 20 is greater than 11, the node can be split, but the split may create a child node that holds only 5 samples, fewer than min_samples_split = 11.
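A minimal sketch of this behaviour, on a small synthetic dataset (not the data used later in this article), could look like this:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
# 20 samples, so the root node is large enough to split when min_samples_split = 11
X, y = make_classification(n_samples=20, random_state=0)
model = DecisionTreeClassifier(min_samples_split=11, random_state=0)
model.fit(X, y)
# Leaves are the nodes with no children; their sample counts may still fall below 11
tree = model.tree_
leaf_mask = tree.children_left == -1
print(tree.n_node_samples[leaf_mask])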

Minimum number of samples per leaf

When splitting a node, we could run into the problem of having 99 samples in one child and only 1 in the other. This would not take us very far in our process and would be a waste of resources and time. If we want to avoid this, we can set a minimum for the number of samples we allow on each leaf.


This number can be specified as an integer or as a float. If it’s an integer, it’s the minimum number of samples allowed in a leaf. If it’s a float, it’s the minimum percentage of samples allowed in a leaf. For example, 0.1, or 10%, implies that a particular split will not be allowed if one of the leaves that results contains less than 10% of the samples in the dataset.
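For instance, both of these are valid (a quick illustrative sketch, not tied to any particular dataset):

from sklearn.tree import DecisionTreeClassifier
# An integer: every leaf must contain at least 10 samples
model_int = DecisionTreeClassifier(min_samples_leaf=10)
# A float: every leaf must contain at least 10% of the training samples
model_frac = DecisionTreeClassifier(min_samples_leaf=0.1)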

Decision Trees in sklearn

In this section, you’ll use decision trees to fit a given sample dataset.

Before you do that, let’s go over the tools required to build this model.

For your decision tree model, you’ll be using scikit-learn’s DecisionTreeClassifier class. This class provides the functions to define and fit the model to your data.

>>> from sklearn.tree import DecisionTreeClassifier
>>> model = DecisionTreeClassifier()
>>> model.fit(x_values, y_values)

In the example above, the model variable is a decision tree model that has been fitted to the data x_values and y_values. Fitting the model means finding the best tree that fits the training data. Let's make two predictions using the model's predict() function.

>>> print(model.predict([ [0.2, 0.8], [0.5, 0.4] ]))
[ 0.  1.]

The model returned an array of predictions, one prediction for each input array. The first input, [0.2, 0.8], got a prediction of 0, and the second input, [0.5, 0.4], got a prediction of 1.

Hyperparameters

When we define the model, we can specify the hyperparameters. In practice, the most common ones are

  • max_depth: The maximum number of levels in the tree.
  • min_samples_leaf: The minimum number of samples allowed in a leaf.
  • min_samples_split: The minimum number of samples required to split an internal node.

For example, here we define a model where the maximum depth of the tree, max_depth, is 7, and the minimum number of elements in each leaf, min_samples_leaf, is 10.

>>> model = DecisionTreeClassifier(max_depth=7, min_samples_leaf=10)

Now, let’s take a Kaggle problem (Titanic Survival Model with Decision Trees)

# Import libraries necessary for this project
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames
# Pretty display for notebooks
%matplotlib inline
# Set a random seed
import random
random.seed(42)
# Load the dataset
in_file = 'titanic_data.csv'
full_data = pd.read_csv(in_file)
# Print the first few entries of the RMS Titanic data
display(full_data.head())

The dataset contains the following columns:

  • Survived: Outcome of survival (0 = No; 1 = Yes)
  • Pclass: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
  • Name: Name of passenger
  • Sex: Sex of the passenger
  • Age: Age of the passenger (Some entries contain NaN)
  • SibSp: Number of siblings and spouses of the passenger aboard
  • Parch: Number of parents and children of the passenger aboard
  • Ticket: Ticket number of the passenger
  • Fare: Fare paid by the passenger
  • Cabin: Cabin number of the passenger (Some entries contain NaN)
  • Embarked: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)

Since we’re interested in the outcome of survival for each passenger, we can remove the Survived feature from this dataset and store it as its own separate variable, outcomes. We will use these outcomes as our prediction targets.
Run the code cell below to remove Survived as a feature of the dataset and store it in outcomes.

# Store the 'Survived' feature in a new variable and remove it from the dataset
outcomes = full_data['Survived']
features_raw = full_data.drop('Survived', axis = 1)
# Show the new dataset with 'Survived' removed
display(features_raw.head())

Preprocessing the data

# Removing the names
features_no_name = features_raw.drop(['Name'], axis=1)
# One-hot encoding
features = pd.get_dummies(features_no_name)
features = features.fillna(0.0)
display(features.head())

Training the model

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, outcomes, test_size=0.2, random_state=42)
# Import the classifier from sklearn
from sklearn.tree import DecisionTreeClassifier
# Define the classifier, and fit it to the data
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

Testing the model

# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# Calculate the accuracy
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)
The training accuracy is 1.0
The test accuracy is 0.815642458101

Improving the model — by playing with the hyperparameters

Try specifying some hyperparameters in order to improve the testing accuracy, such as:

  • max_depth
  • min_samples_leaf
  • min_samples_split

You can use your intuition, trial and error, or even better, feel free to use Grid Search (a sketch follows the results below)!

# Training the model
model = DecisionTreeClassifier(max_depth=6, min_samples_leaf=6, min_samples_split=10)
model.fit(X_train, y_train)
# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# Calculating accuracies
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)
The training accuracy is 0.870786516854
The test accuracy is 0.854748603352
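If you would rather not tune by hand, here is a minimal Grid Search sketch that reuses the X_train/y_train split created above; the parameter grid itself is just an illustrative choice:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
param_grid = {
    'max_depth': [4, 6, 8, 10],
    'min_samples_leaf': [2, 6, 10],
    'min_samples_split': [2, 10, 20],
}
# 5-fold cross-validated search over every combination in the grid
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_)
print('The test accuracy is', grid.best_estimator_.score(X_test, y_test))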

Congratulations!! Now you too can become a Kaggler.

Hit the Clap button if you like the work!!

Happy Learning!! I will be back with more fun tutorials :)
