Decision Tree Intuition

Biraj Parikh · Published in GreyAtom · Jul 24, 2017
What is a Decision Tree?

I am sure you are using decision trees in your day-to-day life without knowing it. For example:

Imagine you only do four things at the weekend: go shopping, watch a movie, play tennis or just stay in. What you do depends on three things: the weather (windy, rainy or sunny), how much money you have (rich or poor), and whether your parents are visiting.

Decision Tree example

Interpretation: You say to yourself: if my parents are visiting, we’ll go to the cinema. If they’re not visiting and it’s sunny, then I’ll play tennis, but if it’s windy and I’m rich, then I’ll go shopping. If they’re not visiting, it’s windy and I’m poor, then I will go to the cinema. If they’re not visiting and it’s rainy, then I’ll stay in.
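As a quick illustration (my own, not from the original article), these rules can be written as plain nested conditionals; a decision tree is essentially a learned hierarchy of exactly this kind of if/else check:

    # The weekend example written out as nested conditionals.
    # A decision tree learns a rule hierarchy of this shape from data.
    def weekend_activity(parents_visiting, weather, money):
        if parents_visiting:
            return "cinema"
        if weather == "sunny":
            return "play tennis"
        if weather == "windy":
            return "shopping" if money == "rich" else "cinema"
        return "stay in"  # rainy

    print(weekend_activity(False, "windy", "poor"))  # cinema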

Thus, a decision tree is a type of supervised learning algorithm. It works for both categorical and continuous input and output variables (classification and regression). In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter / differentiator among the input variables.

  • Decision trees use a criterion (there are multiple criteria available) to decide how to split a node into two or more sub-nodes.
  • The creation of sub-nodes increases the homogeneity of the resulting sub-nodes. In other words, the purity of each node increases with respect to the target variable.
  • The decision tree considers splits on all available variables and then selects the split which results in the most homogeneous sub-nodes (see the impurity sketch after this list).
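To make “homogeneity” concrete, here is a minimal sketch (my own illustration; it uses Gini impurity, the criterion CART uses by default, though other criteria such as entropy behave the same way) showing how a good split produces purer sub-nodes:

    from collections import Counter

    def gini(labels):
        """Gini impurity: 0 for a perfectly pure node, larger for mixed nodes."""
        n = len(labels)
        return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

    parent = ["cinema", "cinema", "cinema", "tennis", "tennis", "shopping"]
    # Hypothetical split on "parents visiting": yes -> left node, no -> right node
    left = ["cinema", "cinema", "cinema"]
    right = ["tennis", "tennis", "shopping"]

    print(round(gini(parent), 2))              # 0.61 -> mixed parent node
    print(gini(left), round(gini(right), 2))   # 0.0 and 0.44 -> purer sub-nodes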

Why Decision Trees?

  • Easy to Understand and Interpret.
  • Requires little data preparation. Other techniques often require data normalization, the creation of dummy variables and the removal of blank values.
  • Able to handle both numerical and categorical data.
  • Able to handle multi-class problems.
  • Uses a white box model. If a given situation is observable in a model, the condition can easily be explained by Boolean logic. By contrast, results from a black box model (e.g., an artificial neural network) may be more difficult to interpret.
  • Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.
  • Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.
  • The final decision tree can explain exactly why a specific prediction was made, making it very attractive for operational use.

Important Terminology related to Decision Trees

  • Root Node: It represents the entire population or sample, and it is further divided into two or more homogeneous sets.
  • Splitting: The process of dividing a node into two or more sub-nodes.
  • Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
  • Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.
  • Pruning: When we remove sub-nodes of a decision node, the process is called pruning.
  • Branch / Sub-Tree: A sub-section of the entire tree is called a branch or sub-tree.
  • Parent and Child Node: A node which is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are its children.

How does a Decision Tree work?

There are multiple algorithms for building a decision tree, which can be used according to the characteristics of the problem you are trying to solve. Regression trees are used when the dependent variable is continuous, and classification trees are used when the dependent variable is categorical.

A few of the commonly used algorithms are listed below:

  • ID3
  • C4.5
  • CART
  • CHAID (Chi-squared Automatic Interaction Detector)

Though the methods differ across decision tree building algorithms, all of them work on the principle of greediness: at each node, the algorithm searches for the variable which gives the maximum information gain, i.e. divides the data in the most homogeneous way.
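As a rough sketch of the classification / regression distinction (scikit-learn is my own choice here and is not named in the article; its trees are an optimised version of CART and are also grown greedily):

    from sklearn.datasets import load_iris, load_diabetes
    from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_text

    # Classification tree: the target is categorical (iris species)
    X_c, y_c = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(X_c, y_c)
    print(export_text(clf, feature_names=load_iris().feature_names))

    # Regression tree: the target is continuous (disease progression score)
    X_r, y_r = load_diabetes(return_X_y=True)
    reg = DecisionTreeRegressor(max_depth=2).fit(X_r, y_r)
    print(reg.predict(X_r[:3]))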

If you want to dig deeper into the above-mentioned algorithms, check out this link: http://www.shogun-toolbox.org/static/notebook/current/DecisionTrees.html

In short…

Decision trees use multiple metrics to find the best variables to split on.

Entropy

A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided between two classes, the entropy is one.

Mathematically, for a sample S containing c classes:

Entropy(S) = − Σ pᵢ log₂(pᵢ), summed over i = 1 … c,

where pᵢ is the proportion of examples in S belonging to class i.
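A tiny sketch of that formula in Python (my own illustration, not from the original article):

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy (log base 2) of a list of class labels."""
        n = len(labels)
        probs = [count / n for count in Counter(labels).values()]
        return sum(-p * math.log2(p) for p in probs) if len(probs) > 1 else 0.0

    print(entropy(["yes"] * 8))                         # 0.0  -> completely homogeneous
    print(entropy(["yes"] * 4 + ["no"] * 4))            # 1.0  -> equally divided
    print(round(entropy(["yes"] * 6 + ["no"] * 2), 2))  # 0.81 -> somewhere in between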

Information Gain

Entropy gives a measure of the impurity of a node. In the decision tree building process, two important decisions have to be made: what is the best split, and which variable to split a node on.

The information gain criterion helps in making these decisions. Child nodes are created using the value(s) of an independent variable. To calculate the information gain due to a split, we take the entropy of the parent node and subtract the weighted average of the entropies of the child nodes (each child weighted by its share of the parent's examples). The variable with the highest information gain is selected for the split.
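Here is a self-contained sketch of that calculation (again my own illustration, using a hypothetical two-way split of ten examples):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        probs = [count / n for count in Counter(labels).values()]
        return sum(-p * math.log2(p) for p in probs) if len(probs) > 1 else 0.0

    def information_gain(parent, children):
        """Entropy of the parent minus the weighted entropy of the child nodes."""
        n = len(parent)
        weighted = sum(len(child) / n * entropy(child) for child in children)
        return entropy(parent) - weighted

    parent = ["yes"] * 5 + ["no"] * 5
    left = ["yes"] * 4 + ["no"] * 1   # child node 1 after a hypothetical split
    right = ["yes"] * 1 + ["no"] * 4  # child node 2
    print(round(information_gain(parent, [left, right]), 2))  # 0.28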

Check out this link for more information on entropy and information gain: http://www.saedsayad.com/decision_tree.htm

Limitations:

  1. Overfitting: Overfitting is one of the most practical difficulties for decision tree models. This problem is addressed by setting constraints on the model parameters and by pruning (discussed in detail below).
  2. Not ideal for continuous variables: While working with continuous numerical variables, the decision tree loses information when it discretizes the variables into different categories.

How to avoid “Overfitting”?

Pruning is a technique in machine learning that reduces the size of a decision tree by removing sections of the tree that provide little power to classify instances. Pruning reduces the complexity of the final classifier and hence improves predictive accuracy by reducing over-fitting.

In other words, the tree is first grown greedily, checking for the best split at each step and moving forward until one of the specified stopping conditions is reached, and is then cut back:

  • We first grow the decision tree to a large depth.
  • Then we start at the bottom and remove the splits that give us negative returns when evaluated from the top.

Suppose a split gives us a gain of, say, -10 (a loss of 10) and the next split on that node gives a gain of 20. A simple decision tree with a greedy stopping rule would stop at the first step, but with pruning we see that the overall gain is +10 and keep both splits.
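As a hedged sketch of how this looks in practice (scikit-learn is my own choice here; it implements cost-complexity post-pruning through the ccp_alpha parameter rather than the exact gain-based procedure described above):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # 1. Grow a full, deliberately deep tree (it tends to overfit the training data)
    full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    # 2. Prune it back: a larger ccp_alpha removes sub-trees that add little value
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

    print("full  :", full.get_depth(), "levels, test accuracy", full.score(X_test, y_test))
    print("pruned:", pruned.get_depth(), "levels, test accuracy", pruned.score(X_test, y_test))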

I hope this post gives you a quick introduction to decision tree intuition in machine learning.

Thanks for reading.
