Understanding Decision Trees!!

Abhigyan · Published in Analytics Vidhya · 9 min read · Sep 6, 2020

What is a Decision Tree?

  • A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.
  • Decision tree learning is one of the predictive modelling approaches used in statistics and machine learning. It uses a decision tree to go from observations about an item (represented in the branches) to conclusions about the item’s target value (represented in the leaves).
  • Decision Trees are a non-parametric supervised learning method used for both classification and regression tasks.
  • The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

Types of Decision Trees:

Regression tree

A Decision Tree that has a continuous target variable (e.g., the price of a house).

  • In order to build a regression tree, we use recursive binary splitting: starting from the full training set, the data at a node is split into two parts, and each part becomes a new node that can be split again.
  • The split point with the minimum residual sum of squares (equivalently, mean squared error) across the two resulting groups is chosen as the split for that node (a minimal sketch of scoring such a split is given after this list).
  • A node stops splitting when it reaches a limit, e.g., when any further split would leave a child node with fewer than n observations.
  • To get an intuitive understanding of the split, I highly recommend this video from StatQuest.
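As a rough illustration of the idea (not code from the article), the snippet below scores every candidate threshold on a single continuous feature by the residual sum of squares and keeps the best one; the function name `best_split_sse` and the toy house-price numbers are made up for this example.

```python
import numpy as np

def best_split_sse(x, y):
    """Find the threshold on one continuous feature that minimises the
    total residual sum of squares (SSE) of the two resulting groups."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_t, best_sse = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue
        # candidate threshold: midpoint between two consecutive distinct values
        t = (x_sorted[i] + x_sorted[i - 1]) / 2
        left, right = y_sorted[:i], y_sorted[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_t, best_sse = t, sse
    return best_t, best_sse

# toy example: house area vs. price
area = np.array([50, 60, 80, 120, 150, 200], dtype=float)
price = np.array([100, 110, 150, 300, 330, 400], dtype=float)
print(best_split_sse(area, price))  # splits the cheap houses from the expensive ones
```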

Classification tree

A Decision Tree that has a categorical target variable (e.g., in the Titanic data, whether a passenger survived or not).

  • It is very similar to regression trees; however, the residual sum of squares is not used to split the nodes.
  • It uses a few other criteria, which we will discuss below (a minimal scikit-learn usage sketch follows this list).
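For concreteness, here is a minimal, hedged sketch of fitting a classification tree with scikit-learn. The iris dataset is used only because it ships with the library (the Titanic example mentioned above does not); the `criterion` argument corresponds to the splitting criteria discussed later in the article.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 'gini' and 'entropy' are the splitting criteria discussed further below
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```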

Assumptions of Decision Tree:


Since the Decision Tree is a non-statistical approach, it makes no assumptions about the training data or the prediction residuals; e.g., there are no distributional, independence, or constant-variance assumptions.

However, there are a few non-statistical assumptions of the decision tree:

  • In the beginning, the whole training set is considered as the root.
  • Feature values are preferred to be categorical. If the values are continuous, they are discretized prior to building the model (a small discretization sketch is given after this list).
  • Records are distributed recursively on the basis of attribute values.
  • The order in which attributes are placed as the root or as internal nodes of the tree is decided using a statistical measure.
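As a minimal, hedged sketch of that discretization step (the binning strategy and the toy ages are arbitrary choices for the example), scikit-learn's KBinsDiscretizer can turn a continuous feature into a small number of ordered categories before the tree is built:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[22], [25], [31], [38], [47], [52], [63]], dtype=float)

# bin the continuous 'age' feature into 3 ordinal categories
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
age_bins = disc.fit_transform(ages)
print(age_bins.ravel())    # each age mapped to a bin index 0, 1 or 2
print(disc.bin_edges_[0])  # the learned cut points
```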

How Does a Decision Tree work?

  • It works for both categorical and continuous input and output variables.
  • We split the population or sample into two or more homogeneous sets (sub-populations) based on the most significant splitter/differentiator among the input variables.
  • The decision of how the splits are made heavily affects a tree’s accuracy. The decision criterion is different for classification and regression trees.
  • Decision trees use multiple algorithms to decide how to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resulting sub-nodes. In other words, the purity of the node increases with respect to the target variable.
  • However, the problem is the greedy nature of the algorithm: the decision tree evaluates splits on all available variables and then selects the split that results in the most homogeneous sub-nodes (a sketch of this greedy step follows this list).
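The following is a hedged sketch of that greedy step, not any library's actual implementation: scan every feature and every candidate threshold, score each split with some impurity function, and keep the single best one. The names `weighted_impurity` and `best_greedy_split` are illustrative.

```python
import numpy as np

def weighted_impurity(y_left, y_right, impurity):
    """Size-weighted impurity of the two children of a candidate split."""
    n = len(y_left) + len(y_right)
    return (len(y_left) / n) * impurity(y_left) + (len(y_right) / n) * impurity(y_right)

def best_greedy_split(X, y, impurity):
    """Greedy step: try every feature and threshold, keep the single best split."""
    best = (None, None, np.inf)  # (feature index, threshold, score)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            mask = X[:, j] <= t
            if mask.all() or (~mask).all():
                continue  # a split must produce two non-empty children
            score = weighted_impurity(y[mask], y[~mask], impurity)
            if score < best[2]:
                best = (j, t, score)
    return best  # applied recursively to each child, this grows the whole tree
```

The `impurity` callback could be, for example, the Gini or entropy functions sketched later in this article; note that the procedure only looks one step ahead, which is exactly the greediness described above.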

Different Types of Algorithms for Decision Tree

→ ID3

  • ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan[1] used to generate a decision tree from a dataset.
  • ID3 does not guarantee an optimal solution.
  • It can converge upon local optima.
  • It uses a greedy strategy by selecting the locally best attribute to split the dataset on each iteration.
  • The algorithm’s optimality can be improved by using backtracking during the search for the optimal decision tree at the cost of possibly taking longer.
  • ID3 can overfit the training data. To avoid overfitting, smaller decision trees should be preferred over larger ones. This algorithm usually produces small trees, but it does not always produce the smallest possible decision tree.
  • ID3 is harder to use on continuous data than on factored data (factored data has a discrete number of possible values, thus reducing the possible branch points). If the values of any given attribute are continuous, then there are many more places to split the data on this attribute, and searching for the best value to split by can be time consuming.
  • It mainly works on calculating Entropy and Information Gain (a compact sketch of the recursion is given after the link below).

https://en.wikipedia.org/wiki/ID3_algorithm
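Below is a compact, hedged sketch of an ID3-style recursion over purely categorical attributes (multiway splits chosen by information gain). It is only an illustration of the idea, not Quinlan's original code; the helper names are made up.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def id3(rows, labels, attributes):
    """rows: list of dicts of categorical attribute values; labels: class labels."""
    if len(set(labels)) == 1:
        return labels[0]                             # pure node -> leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]  # no attributes left -> majority leaf
    base = entropy(labels)

    def info_gain(attr):
        gain = base
        for value in set(r[attr] for r in rows):
            subset = [l for r, l in zip(rows, labels) if r[attr] == value]
            gain -= (len(subset) / len(labels)) * entropy(subset)
        return gain

    best = max(attributes, key=info_gain)            # greedy: highest information gain
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [a for a in attributes if a != best])
    return tree
```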

→ C4.5 Algorithm

  • C4.5 is an extension of Quinlan’s earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason, C4.5 is often referred to as a statistical classifier.
  • C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy.
  • Handling both continuous and discrete attributes: in order to handle continuous attributes, C4.5 creates a threshold and then splits the records into those whose attribute value is above the threshold and those that are less than or equal to it (a minimal partition sketch is given after the link below).

https://en.wikipedia.org/wiki/C4.5_algorithm
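A tiny, hedged sketch of that thresholding step (the function name, the petal-length values, and the threshold are made up for illustration):

```python
def threshold_partition(values, labels, threshold):
    """Split records into those above the threshold and those <= it,
    as C4.5 does for a continuous attribute."""
    above = [(v, l) for v, l in zip(values, labels) if v > threshold]
    below_or_equal = [(v, l) for v, l in zip(values, labels) if v <= threshold]
    return above, below_or_equal

petal_len = [1.4, 1.5, 4.7, 4.5, 5.1, 5.9]
species = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
print(threshold_partition(petal_len, species, threshold=3.0))
```

In practice, C4.5 evaluates many such candidate thresholds and keeps the one that yields the best split, which is the time-consuming search mentioned in the ID3 section above.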

→ Classification and regression trees (CART)

  • Classification and regression trees (CART) are a non-parametric decision tree learning technique that produces either classification or regression trees, depending on whether the dependent variable is categorical or numeric.
  • Decision trees are formed by a collection of rules based on variables in the modeling data set:
    1. Rules based on variables’ values are selected to get the best split to differentiate observations based on the dependent variable.
    2. Once a rule is selected and splits a node into two, the same process is applied to each “child” node (i.e. it is a recursive procedure).
    3. Splitting stops when CART detects no further gain can be made, or some pre-set stopping rules are met. (Alternatively, the data are split as much as possible and then the tree is later pruned).
  • Each branch of the tree ends in a terminal node. Each observation falls into one and exactly one terminal node, and each terminal node is uniquely defined by a set of rules (a printed example of such rules is shown after this list).
  • It produces binary splits, i.e., each node is split into exactly two child nodes.
  • It uses least squares as the metric to select splits in the case of a regression tree.
  • Random forests, a very popular method for predictive analytics, are built as ensembles of such trees.
  • They can even help detect anomalies/outliers.
    Read more about them in this article from the Neptune blog.
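To see the "collection of rules" and the terminal nodes concretely, scikit-learn can print the rules of a fitted tree with export_text. This is just an illustrative sketch on the bundled iris data, not an example from the article itself:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
cart = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# each printed root-to-leaf path is one rule; every observation
# falls into exactly one of the terminal (leaf) nodes
print(export_text(cart, feature_names=iris.feature_names))
```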

→Chi-square automatic interaction detection(CHAID)

  • Chi-square automatic interaction detection (CHAID) is a decision tree technique based on adjusted significance testing (a minimal chi-square test sketch follows this list).
  • CHAID is often used in the context of direct marketing to select groups of consumers and predict how their responses to some variables affect other variables, although other early applications were in the field of medical and psychiatric research.
  • Like other decision trees, CHAID’s advantages are that its output is highly visual and easy to interpret. Because it uses multiway splits by default, it needs rather large sample sizes to work effectively, since with small sample sizes the respondent groups can quickly become too small for reliable analysis.
  • One important advantage of CHAID over alternatives such as multiple regression is that it is non-parametric.

Different Splitting Criteria of Decision trees

→ Entropy:

  • A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogenous).
  • ID3 algorithm uses entropy to calculate the homogeneity of a sample.
  • Entropy ranges between 0 and 1 (for a binary target): a value of 0 means the node is completely pure (all samples belong to one class), while a value closer to 1 means the classes are highly mixed.

→ Information Gain:

  • The information gain is based on the decrease in entropy after a data-set is split on an attribute.
  • Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).
  • Entropy is the criterion for calculating information gain.
  • It is the main parameter used to construct a Decision Tree: the attribute with the highest information gain is tested/split first.
  • Information gain = base entropy − new entropy, where the new entropy is the size-weighted average entropy of the child nodes (a small numeric sketch follows this list).
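A tiny numeric sketch of that formula on an invented 10-sample binary node (the labels and the split are made up for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

parent = ["yes"] * 5 + ["no"] * 5   # base entropy = 1.0 (perfect 50/50 mix)
left   = ["yes"] * 4 + ["no"] * 1   # the two children after splitting on some attribute
right  = ["yes"] * 1 + ["no"] * 4

new_entropy = (len(left) / len(parent)) * entropy(left) + \
              (len(right) / len(parent)) * entropy(right)
info_gain = entropy(parent) - new_entropy
print(round(entropy(parent), 3), round(new_entropy, 3), round(info_gain, 3))  # 1.0 0.722 0.278
```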

→ Gini Index:

  • Gini Index is a metric that measures how often a randomly chosen element would be incorrectly classified. This means an attribute with a lower Gini index should be preferred (a small computation sketch is given below).
  • The Gini index is the splitting criterion used by the CART algorithm.

To understand the maths behind the Gini index, please refer to this link.
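A minimal, hedged sketch of the computation (1 minus the sum of squared class proportions), with made-up labels:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

print(gini(["yes"] * 5 + ["no"] * 5))  # 0.5  -> maximally mixed for two classes
print(gini(["yes"] * 9 + ["no"] * 1))  # 0.18 -> nearly pure; lower is better
```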

→ Variance:

  • This is the splitting metric used by decision trees when the target variable is continuous.
  • It computes the variance for every candidate split and takes the split whose child nodes have the lowest (size-weighted) variance, i.e., the split that achieves the largest variance reduction (a minimal sketch follows the notation below).

For a candidate split of a node:
S is the set of pre-split sample indices,
S_t is the set of sample indices for which the split test is true, and
S_f is the set of sample indices for which the split test is false.

To get a better understanding of the splits done for continuous features, you can refer to this YouTube video from StatQuest.
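Using that notation, here is a minimal, hedged sketch of scoring one candidate split by its variance reduction (the function name, the prices, and the split test are made up for illustration):

```python
import numpy as np

def variance_reduction(y, split_true):
    """Drop in variance from the parent node S to its children S_t and S_f.

    y          : target values of the samples in S
    split_true : boolean mask that is True for the samples in S_t
    """
    y = np.asarray(y, dtype=float)
    y_t, y_f = y[split_true], y[~split_true]
    weighted_child_var = (len(y_t) / len(y)) * y_t.var() + \
                         (len(y_f) / len(y)) * y_f.var()
    return y.var() - weighted_child_var

prices = np.array([100, 110, 150, 300, 330, 400], dtype=float)
mask = prices <= 150                     # hypothetical split test
print(variance_reduction(prices, mask))  # a larger reduction means a better split
```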

Greedy nature of Decision Trees

  • A decision tree considers many features and allows a sequence of splits based on these features.
  • It is important to note that decision trees are a very general class of models and can attain high accuracy, but finding the globally optimal decision tree is not computationally feasible in practice; therefore, we use what is called a greedy approach to grow the tree.
  • The important thing to remember is that this process will not necessarily give us the optimal decision tree; however, it will give us a good decision tree for predicting our target.
  • Another thing to note is that we could keep splitting until every observation sits alone in its own leaf node with the correct prediction.
  • This would give us 100% accuracy on our training dataset, but this is called over-fitting, and we always want to avoid it.
  • This is because such a model is not a generalized one and will not predict well on new, unseen examples (a small sketch of restricting tree growth is given after this list).
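As a hedged, minimal illustration of avoiding that failure mode with scikit-learn (the dataset and the specific hyper-parameter values are arbitrary choices for the example), growth can be limited with parameters such as max_depth and min_samples_leaf:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)    # grown until leaves are pure
pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                random_state=0).fit(X_tr, y_tr)  # growth restricted

print("full   train/test:", full.score(X_tr, y_tr), round(full.score(X_te, y_te), 3))
print("pruned train/test:", round(pruned.score(X_tr, y_tr), 3), round(pruned.score(X_te, y_te), 3))
# the fully grown tree fits the training data perfectly but usually generalizes worse
```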

Advantages and Disadvantages:

Advantages:

  1. Easy to use and understand.
  2. Can handle both categorical and numerical data.
  3. Resistant to outliers, hence require little data pre-processing.
  4. New features can be easily added.
  5. Can be used to build larger classifiers by using ensemble methods.

Disadvantages:

  1. Prone to over-fitting.
  2. Require some kind of measurement as to how well they are doing.
  3. Need to be careful with parameter tuning.
  4. Can create biased learned trees if some classes dominate.

HAPPY LEARNING!!!!!

Like my article? Do give me a clap and share it, as that will boost my confidence. Also, I post new articles every Sunday so stay connected for future articles of the basics of data science and machine learning series.

Also, do connect with me on LinkedIn.

