Decision Trees - An Introduction to Theory for Beginners (Part 1)

Kulvinder Singh
8 min read · Mar 22, 2022


Suppose you are a writer for a popular fiction blog and are asked to write a new story for this month’s post.

Starting from scratch would be tough, right?

You would probably check whatever user analytics data you could find. Going through the stats, you learn that the most-read posts are stories about a protagonist's adventures in an unknown land. You also find that readers aged 13 to 19 read more love stories than other groups, and that the most-commented posts involve high fantasy, probably because people like to discuss them.

You conclude that a story combining adventure, a love interest and a high-fantasy backdrop has a good chance of becoming a popular post.

What you did above is central to what decision trees actually do: they partition the problem at hand based on different inputs, with the goal of maximizing some metric. In your case you wanted to maximize the popularity of your post, so you partitioned your readers, and now you're off to write a great story with a high probability of becoming popular (well, a lot depends on the writer too!).

Decision Tree

Decision Trees are non-parametric (they assume no specific distribution for the data) supervised learning methods that help in solving both classification and regression problems.

A decision tree uses a recursive partitioning scheme to break the feature space into smaller regions (axis-parallel rectangles). Within each region it applies a simple rule, majority voting (for classification) or averaging (for regression), and any data point falling in that region is predicted to have that value.
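For a concrete feel, here is a minimal scikit-learn sketch; the dataset and parameter values are arbitrary choices for illustration, not part of the original discussion:

```python
# A minimal sketch using scikit-learn; the toy data and parameter
# values are arbitrary, only the overall workflow matters.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)  # classification
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))

# DecisionTreeRegressor works the same way for continuous targets.
```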

So how does a Decision Tree learn?

A decision tree follows these basic steps while learning:

  1. Partition the feature space if possible; otherwise stop
  2. If partitions are created, partition those partitions further
  3. Go back to Step 1

You might be thinking, "So partitioning is the central idea, got it", right? Well, it isn't as easy as it seems; there is a lot going on under the hood, as we will see. A rough sketch of the recursion follows.
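Here is a very rough sketch of those steps in Python. This is not any library's actual implementation: the stopping test and the split choice below are deliberately naive placeholders that we refine over the rest of the post.

```python
# Conceptual sketch of recursive partitioning; the split choice here is a
# naive placeholder (median of the first feature) just to show the recursion.
from collections import Counter

def make_leaf(labels):
    # majority vote for classification (averaging would be used for regression)
    return {"leaf": Counter(labels).most_common(1)[0][0]}

def grow_tree(points, labels, depth=0, max_depth=2):
    # stop if the node is pure or a depth limit is reached
    if len(set(labels)) == 1 or depth >= max_depth:
        return make_leaf(labels)
    feature = 0                                              # placeholder choice
    threshold = sorted(p[feature] for p in points)[len(points) // 2]
    left = [i for i, p in enumerate(points) if p[feature] <= threshold]
    right = [i for i, p in enumerate(points) if p[feature] > threshold]
    if not left or not right:                                # cannot partition further
        return make_leaf(labels)
    return {"feature": feature, "threshold": threshold,
            "left": grow_tree([points[i] for i in left], [labels[i] for i in left], depth + 1, max_depth),
            "right": grow_tree([points[i] for i in right], [labels[i] for i in right], depth + 1, max_depth)}

print(grow_tree([[1.0], [2.0], [3.0], [4.0]], ["a", "a", "b", "b"]))
```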

Partitioning

The learning algorithm above would be computationally hard as stated, because finding the optimal partitioning (the optimal tree) is NP-complete, i.e. computationally intractable in general. So we use a greedy approach to recursive partitioning, which makes it computationally feasible: at each step we pick the split that most improves some measure, rather than checking all possible combinations of splits.

Another common problem is that the tree created by the steps above can become very complicated and overfit the data, i.e. the tree has learnt the training data very well but has not captured the general pattern, which makes it perform badly on the test set (or real data). To reduce this problem we can use pruning, where we cut off (prune) complex parts of the tree so that it keeps only the important patterns. There are different methods that help us do this, which we will see later.
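As one concrete (hedged) example, scikit-learn supports cost-complexity pruning through the `ccp_alpha` parameter; the dataset and the value of alpha below are arbitrary, and this is only one of several pruning approaches:

```python
# Minimal sketch of post-pruning via cost-complexity pruning in scikit-learn;
# the dataset and the choice of alpha here are arbitrary.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)

print("unpruned leaves:", full.get_n_leaves(), "test acc:", full.score(X_te, y_te))
print("pruned   leaves:", pruned.get_n_leaves(), "test acc:", pruned.score(X_te, y_te))
```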

How are partitions created?

As we saw above, a greedy approach is used for creating the partitions.

The newly created partitions need to have lower disorder than the original region. To achieve this, we need to find two things:

  1. The input variable on which to split (e.g. var1 or var2)
  2. The split point value of that variable (e.g. var1 > 6, so the split point is 6)

In the general case, we follow a procedure that goes through all pairs (k, s) of input variables k and candidate split points s, and picks the best pair.

Generally we create two partitions at each step, though creating more than two is also possible. So why keep the splits binary? Because it keeps the model simple and scalable (and any multi-way split can be expressed as a sequence of binary splits).
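Concretely, a single binary, axis-parallel split is just a mask on one feature at one threshold. A small illustrative sketch with made-up data (taking candidate split points as midpoints between sorted feature values is a common convention, not the only one):

```python
# Illustrative sketch: one binary, axis-parallel split (var1 > 6) on toy data.
import numpy as np

X = np.array([[2.0, 1.0], [5.0, 3.0], [7.0, 2.0], [9.0, 4.0]])  # columns: var1, var2
feature, split_point = 0, 6.0               # split on var1 at 6

left = X[X[:, feature] <= split_point]      # region var1 <= 6
right = X[X[:, feature] > split_point]      # region var1 > 6

# candidate split points for a feature are often taken as midpoints
# between consecutive sorted unique values of that feature
values = np.unique(X[:, feature])
candidates = (values[:-1] + values[1:]) / 2
print(left, right, candidates, sep="\n")
```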

Ok, but where is the tree?

You're probably wondering: we have seen how decision trees learn, but we haven't seen any trees anywhere, right?

Well, recursive partitioning divides the feature space step by step: new partitions are created and then partitioned further, until no further split is possible or a stopping criterion is met.

If we visualize this process, it resembles the tree data structure: the final, unsplittable regions are the leaves of the tree, the very first partitioning step is the root, and all intermediate splitting points are internal nodes.

Visualizing sklearn decision tree classifier
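A figure like the one captioned above can be produced with scikit-learn's `plot_tree`; a minimal sketch (the dataset and depth are arbitrary):

```python
# Minimal sketch: visualizing a fitted sklearn decision tree classifier.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

plt.figure(figsize=(10, 6))
plot_tree(clf, feature_names=iris.feature_names,
          class_names=list(iris.target_names), filled=True)
plt.show()
```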

So a decision tree is basically an upside-down tree!

Decision Tree Algorithm Requirements

  1. A criterion to choose the best split (feature and split value)
  2. A criterion to stop splitting, i.e. to stop growing the tree
  3. A rule for predicting the class (classification) or value (regression) of an observation

The second requirement is the trickiest:

  1. If the tree is too deep: overfitting, high variance
  2. If the tree is not deep enough: underfitting, high bias

Check more on bias-variance tradeoff here.

To handle this we have many parameters, and by tweaking them we can control the depth; we'll look at them later in the post.
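A quick, hedged illustration of the trade-off: fit trees of different depths on an arbitrary dataset and compare training and test accuracy (very deep trees tend to score near-perfectly on training data but not on test data):

```python
# Sketch: how tree depth relates to over/underfitting (toy illustration).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (1, 3, 5, 10, None):          # None lets the tree grow fully
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={clf.score(X_tr, y_tr):.3f}, test={clf.score(X_te, y_te):.3f}")
```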

How to choose the best split/partition?

We consider newly created partitions good if they:

  1. reduce disorder (heterogeneity) compared with the original space
  2. increase homogeneity inside the split regions (most data points in a region should be similar: they belong to the same class, or have similar values)

Now we need a function for measuring disorder or impurity. With it, we can compare the impurity of the original feature space with that of the split feature space, and find the split that maximizes the difference between the two. This difference is also called information gain.

Information Gain

It's a measure which helps us find the most "informative" split.

The most informative split is the one that achieves the largest reduction in impurity/disorder from the parent node to the child nodes.

IG = disorder in the parent node - weighted average of the disorder in the child nodes (weighted by the fraction of the parent's samples that end up in each child)
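In code, this is just the parent impurity minus the sample-weighted average of the child impurities. A small sketch (the Gini impurity used here as the disorder measure is covered properly in the next sections; it is included only to make the snippet self-contained):

```python
# Sketch: information gain = impurity(parent) - weighted average impurity of children.
from collections import Counter

def gini(labels):
    # Gini impurity of a list of class labels (covered in detail below)
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def information_gain(parent, children, impurity=gini):
    n = len(parent)
    return impurity(parent) - sum(len(c) / n * impurity(c) for c in children)

parent = ["red"] * 5 + ["blue"] * 5
print(information_gain(parent, [["red"] * 5, ["blue"] * 5]))  # 0.5, a perfect split
```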

Now, let's see the required properties of an impurity/disorder function:

  1. It should have a high value if the classes present in a partition appear in equal proportions
  2. It should have a low value if the partition is homogeneous (dominated by a single class)

The metrics below do just this:

Misclassification Rate

It is the fraction of observations not belonging to the most common class:

Misclassification rate: E = 1 - max_i(p_i)

where p_i is the proportion of observations in the region under consideration that belong to the i-th class.
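A small sketch of this in code (a hand-rolled helper, not a library function):

```python
# Sketch: misclassification rate = 1 - proportion of the most common class.
from collections import Counter

def misclassification_rate(labels):
    proportions = [count / len(labels) for count in Counter(labels).values()]
    return 1.0 - max(proportions)

print(misclassification_rate(["a", "a", "a", "b"]))  # 0.25
```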

Gini Index or Impurity

It is a measure of how often a randomly chosen element from the set would be incorrectly labelled if it were labelled randomly according to the distribution of labels in the subset. It takes on a small value if all of the p_i are close to zero or one.

Gini index: G = sum_i p_i * (1 - p_i) = 1 - sum_i p_i^2

where m is the number of classes, the sums run over i = 1, ..., m, and p_i is the proportion of observations in the region belonging to class i.

Example: Suppose we have 2 classes, red and blue, and 10 points, where 5 belong to the red class and 5 to the blue class.

Iterating over both classes, the probability of picking a point from a class and then incorrectly labelling it is 0.5 × 0.5; summing over the two classes gives G = 0.25 + 0.25 = 0.5.

Hence, at maximum impurity (equal numbers of both classes), Gini returns its maximum value of 0.5.

Maximum value: 0.5 (the two classes in equal proportions)

Minimum value: 0 (only a single class is present)

Entropy or Deviance

Entropy: D = -sum_i p_i * log2(p_i)

where m is the number of classes, the sum runs over i = 1, ..., m, and p_i is the proportion of observations belonging to class i. Entropy is the expected number of bits needed to encode the class of a randomly drawn observation.

Information theory interpretation: an optimal-length code assigns -log2(p) bits to a message having probability p.

Example: Suppose you have a dataset with 16 observations, of which 9 belong to class A and 7 belong to class B. Then D = -(9/16)·log2(9/16) - (7/16)·log2(7/16) ≈ 0.99 bits.

Maximum value: 1 (the two classes in equal proportions; with m classes the maximum is log2(m))

Minimum value: 0 (only a single class is present)
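The 9-vs-7 example above in code (again a hand-rolled sketch):

```python
# Sketch: entropy of a node, reproducing the 9-vs-7 example above.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

print(entropy(["A"] * 9 + ["B"] * 7))   # ~0.99 bits
print(entropy(["A"] * 8 + ["B"] * 8))   # 1.0, the maximum for two classes
```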

Check out this nice post by Sebastian Raschka on why the misclassification rate is not preferred nowadays.

So, every time we have to partition a node, we compute its current impurity with one of the metrics above, enumerate all candidate (feature, split value) combinations, and choose the combination whose weighted child impurity, subtracted from the parent node's impurity, gives the maximum value, i.e. the split with the maximum information gain.
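Putting the pieces together, a hedged sketch of that greedy search over all (feature, split value) candidates might look like this (toy data; Gini impurity is redefined here only to keep the snippet self-contained):

```python
# Sketch: greedy search for the best (feature, threshold) split, scored by
# information gain with Gini impurity.  Toy data, illustrative only.
import numpy as np
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(X, y):
    parent_impurity = gini(y)
    best = (None, None, 0.0)                              # (feature, threshold, gain)
    n = len(y)
    for feature in range(X.shape[1]):
        values = np.unique(X[:, feature])
        for threshold in (values[:-1] + values[1:]) / 2:  # midpoints as candidates
            mask = X[:, feature] <= threshold
            left = [y[i] for i in range(n) if mask[i]]
            right = [y[i] for i in range(n) if not mask[i]]
            gain = parent_impurity - len(left) / n * gini(left) - len(right) / n * gini(right)
            if gain > best[2]:
                best = (feature, threshold, gain)
    return best

X = np.array([[2.0, 7.0], [3.0, 8.0], [8.0, 1.0], [9.0, 2.0]])
y = ["blue", "blue", "red", "red"]
print(best_split(X, y))   # either feature separates the classes perfectly here
```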

So, when to stop growing the tree?


Moving to the second requirement, we need to know when to stop splitting nodes further, so that we end up with a tree that neither overfits nor underfits the data.

But Why?

With no stopping criteria, a decision tree will fit the training set 100% (assuming there is no noise in the labels), but that doesn't mean it will perform the same on an unseen test set. In other words, what the tree has learnt doesn't generalize well, or simply put, isn't useful for predicting future observations.

Hence, we try to keep the tree complex enough not to underfit, but simple enough not to overfit the training data.

Below are a few ways to control the size of the tree (a scikit-learn sketch of the corresponding hyperparameters follows the list):

  • Fixed depth

Fix a maximum depth in advance and don't grow the tree beyond it.

  • Minimum number of samples per leaf

Given a node, we only split it if the number of samples it contains exceeds some value k.

  • Minimum increase in information gain

Given a node, we only split it if the best split increases the information gain (IG) by at least some value k, the minimum IG required to split.
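In scikit-learn these rules correspond (roughly) to the max_depth, min_samples_split / min_samples_leaf and min_impurity_decrease hyperparameters; a minimal sketch with arbitrarily chosen values:

```python
# Sketch: the stopping rules above expressed as scikit-learn hyperparameters
# (values chosen arbitrarily for illustration).
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=4,                 # fixed depth
    min_samples_split=20,        # only split nodes with at least 20 samples
    min_samples_leaf=5,          # every leaf must keep at least 5 samples
    min_impurity_decrease=0.01,  # require a minimum impurity decrease to split
    random_state=0,
)
```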

Note:

It should be noted that the above stopping criteria for controlling the size of the tree are not fully reliable or satisfactory, because the tree-growing method is greedy. Every split is near-sighted, i.e. we don't check all possible combinations of future splits. A split that looks poor at a given step might lead to very good splits at later steps, so stopping early based on these criteria can cut off promising branches.

References:

  1. Very informative lecture slides by Julie Scholler here.

That’s all for part 1.

In the next part we'll see how prediction rules let our trees predict discrete as well as continuous outputs, and we'll review the pros and cons of decision trees and compare them with other ML algorithms.

This post is the first part of a series on decision trees.
