Decision Tree = A Light Intro to Theory + Math + Code

Souman Roy
Published in MetaInsights
Jan 28, 2018 · 11 min read


After learning how decision trees work in machine learning, I came to realize that if we try hard to understand something, even something beyond our own area of expertise, we will definitely succeed as long as we keep up persistence and perseverance. There is nothing in the world that you can’t learn. You just need strong determination to get to the core knowledge of any domain.

The ability to learn is a hallmark of intelligence

Well, after such a cliché intro, here I present the decision tree in the simplest way possible: we shall discuss the theory, the math, and the code. This article is divided into three sections. In the first section you will learn the fundamental and visual concepts of decision trees. In the second, we will understand entropy, Gini impurity, and information gain, and look at pseudo code to understand them better. Finally, we will meet a very famous library called scikit-learn and get to know the Python code itself.

What is a Decision Tree? How does it work?

#1. A decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It works for both categorical and continuous input and output variables. In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter/differentiator among the input variables.

#2. It is a tree-structured classifier: a decision tree can be used to visually and explicitly represent decisions and decision making. As the name suggests, it uses a tree-like model of decisions. Though it is a commonly used tool in data mining and statistics for deriving a strategy to reach a particular goal, it is also widely used in machine learning, which will be the main focus of this article.

#3. A decision tree can be used for both classification (discrete values such as yes/no, 0 or 1) and regression (continuous values) problems.

The classical decision tree has been around for decades, yet its most powerful modern variations, such as random forests and gradient boosting, remain among the most useful machine learning techniques. (Since the goal of this article is not to teach you about random forests or gradient boosting, we will keep our focus on the decision tree.)

Example of Decision Tree

At this point, you should have a visual idea of what a decision tree looks like.

Now let’s take a deep dive and understand the fundamental concepts. To grasp them, you need to know at least the few terms mentioned below.

  1. Root Node: It represents the entire population or sample, which then gets divided into two or more homogeneous sets.
  2. Splitting: It is the process of dividing a node into two or more sub-nodes.
  3. Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
  4. Leaf/Terminal Node: Nodes that do not split further are called leaf or terminal nodes.

So now you can see that a decision tree can be very intuitive. When solving a real-life problem with a decision tree, you only need to ask two questions.

Question 1: Which attribute should be chosen as the root/parent node?
Question 2: After deciding on the root node, when should I start or stop splitting further nodes?

That’s it! That was the theoretical introduction to decision trees. We have successfully covered the basic aspects of the decision tree. Now we shall discuss how we can build a computational decision tree model based on a mathematical understanding.

In the following sections we define a few mathematical terms related to decision trees and then work through the calculations with a sample example.

Most algorithms that have been developed for learning decision trees are variations of a core algorithm that employs a top-down, greedy search through the space of possible decision trees.

The major Decision Tree implementations:

ID3, or Iterative Dichotomiser 3, was the first of three decision tree implementations developed by Ross Quinlan.

CART, or Classification And Regression Trees is often used as a generic acronym for the term Decision Tree, though it apparently has a more specific meaning. In sum, the CART implementation is very similar to C4.5; the one notable difference is that CART constructs the tree based on a numerical splitting criterion recursively applied to the data, whereas C4.5 includes the intermediate step of constructing rule sets.

Later, in another article, we will discuss CART (Classification And Regression Trees).

C4.5 was Quinlan’s next iteration. The new features (versus ID3) are: (i) it accepts both continuous and discrete features; (ii) it handles incomplete data points; (iii) it tackles the over-fitting problem with a (very clever) bottom-up technique usually known as “pruning”; and (iv) different weights can be applied to the features that comprise the training data. Of these, the first three are very important, and I would suggest that any decision tree implementation you choose have all three. The fourth (differential weighting) is much less important.

C5.0 is the most recent Quinlan iteration. This implementation is covered by patent and, probably as a result, is rarely implemented (outside of commercial software packages). I have always been skeptical about the improvements claimed by its inventor (Ross Quinlan); for instance, he claims it is “several orders of magnitude” faster than C4.5. Other claims are similarly broad (“significantly more memory efficient”) and so forth.

CHAID (chi-square automatic interaction detector) actually predates the original ID3 implementation by about six years (it was published in a Ph.D. thesis by Gordon Kass in 1980). The R platform has a package called CHAID, which includes excellent documentation.

In this article, we will focus on Iterative Dichotomiser 3, commonly known as the ID3 algorithm, because we need ground-up knowledge of the algorithm. Variants and extensions of the ID3 algorithm, such as C4.5 and CART, are very much in practical use today.

The ID3 algorithm builds the tree top-down, starting from the root and meticulously choosing which attribute will be tested at each node. Each attribute is evaluated through statistical means to see which one splits the dataset best. The best attribute is made the root, with its attribute values branching out. The process then continues with the remaining attributes. Once an attribute is selected, it is not possible to backtrack.

Knowing The Attribute

Pruning

The performance of a tree can be further increased by pruning. It involves removing the branches that make use of features with low importance. This way, we reduce the complexity of the tree and increase its predictive power by reducing overfitting.

Pruning can start at either the root or the leaves. The simplest method starts at the leaves and replaces each node with the most popular class in that leaf; the change is kept if it does not deteriorate accuracy. This is also called reduced error pruning. More sophisticated methods, such as cost complexity pruning, use a learning parameter (alpha) to weigh whether nodes can be removed based on the size of the sub-tree. This is also known as weakest link pruning.
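If you want to see pruning in practice, recent versions of scikit-learn expose cost complexity pruning through the ccp_alpha parameter. Below is a minimal sketch, assuming the library’s built-in breast cancer dataset purely for illustration (it is not tied to this article’s weather example):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Any labelled dataset would do; breast_cancer is assumed here for illustration only
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Effective alphas along the cost complexity (weakest link) pruning path
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

    # Fit one tree per alpha and keep the one that generalises best on held-out data
    trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
             for a in path.ccp_alphas]
    best = max(trees, key=lambda tree: tree.score(X_test, y_test))
    print("Best test accuracy:", best.score(X_test, y_test), "| leaves:", best.get_n_leaves())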

Entropy

Entropy is a measure of unpredictability of the state, or equivalently, of its average information content.

Entropy is a statistical metric that measures impurity. Given a collection S containing positive and negative examples of some arbitrary target concept, the entropy of S with respect to this boolean classification is:

Entropy(S) = -p(positive) log2 p(positive) - p(negative) log2 p(negative)

where p(positive) is the proportion (probability) of positive examples in S and p(negative) is the proportion of negative examples in S. Entropy is 1 if the collection S contains an equal number of examples from both classes; entropy is 0 if all the examples in S belong to the same class.

Plotted against the proportion of positive examples in S, entropy follows a concave, inverted-U-shaped curve: it is 0 when all examples belong to one class and peaks at 1 when the two classes are evenly split.

One interpretation of entropy is that it specifies the minimum number of bits required to encode the classification of an arbitrary member of the collection S.

In general terms, when the target classification can take on c different values (not just boolean), entropy is defined as:

Entropy(S) = - Σ (i = 1 to c) p(i) log2 p(i)

where p(i) is the proportion of examples in S belonging to class i.
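Here is a small Python sketch of this definition (a helper of my own, not from any library) that we will reuse in later snippets:

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy (in bits) of a collection of class labels."""
        total = len(labels)
        probabilities = [count / total for count in Counter(labels).values()]
        return sum(-p * math.log2(p) for p in probabilities)

    print(entropy(["yes"] * 7 + ["no"] * 7))   # 1.0 -> evenly split collection
    print(entropy(["yes"] * 14))               # 0.0 -> pure collection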

What is Information Gain (IG)?

Now that we know what entropy is, let’s look at a quantity that is more directly tied to building the decision tree: information gain. Information gain is a metric that measures the expected reduction in the impurity of the collection S caused by splitting the data according to a given attribute. While building the decision tree, the ID3 algorithm uses the information gain metric to select the best attribute, the one that provides the “best split”, at each level.

It measures the expected reduction in entropy. It decides which attribute goes into a decision node. To minimise the decision tree depth, the attribute with the most entropy reduction is the best choice!

More precisely, the information gain Gain(S, A) of an attribute A relative to a collection of examples S is defined as:

Gain(S, A) = Entropy(S) - Σ (v ∈ Values(A)) (|Sv| / |S|) * Entropy(Sv)

where:
  • Values(A) = the set of all possible values of attribute A
  • Sv = the subset of S for which attribute A has value v
  • |Sv| = the number of elements in Sv
  • |S| = the number of elements in S
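Building on the entropy helper sketched above, information gain can be written in a few lines of Python (the row-as-dictionary representation and the function name are my own assumptions):

    def information_gain(rows, labels, attribute):
        """Expected reduction in entropy from splitting the examples on one attribute.

        rows is a list of dicts mapping attribute names to values; labels holds the classes."""
        total = len(labels)
        remainder = 0.0
        for value in set(row[attribute] for row in rows):
            subset = [label for row, label in zip(rows, labels) if row[attribute] == value]
            remainder += (len(subset) / total) * entropy(subset)
        return entropy(labels) - remainder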

Let’s see how these measures work!

Suppose we want ID3 to decide whether the weather is good for playing baseball. Over the course of two weeks, data is collected to help ID3 build a decision tree. The target classification is “should we play baseball?” which can be yes or no.

See the following table of 14 observations (days D1 through D14):

Day  Outlook   Temperature  Humidity  Wind    Play baseball?
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

The weather attributes are outlook, temperature, humidity, and wind speed. They can have the following values:

outlook = { sunny, overcast, rain }
temperature = { hot, mild, cool }
humidity = { high, normal }
wind = { weak, strong }

We need to find which attribute will be the root node in our decision tree.

Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Entropy(Sweak) = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.811
Entropy(Sstrong) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1.000

Gain(S, Wind) = Entropy(S) - (8/14) * Entropy(Sweak) - (6/14) * Entropy(Sstrong)
             = 0.940 - (8/14) * 0.811 - (6/14) * 1.000 = 0.048
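If you would like to double-check these numbers, a few lines of Python using only the class counts quoted above reproduce them:

    import math

    def entropy_from_counts(positive, negative):
        """Binary entropy computed directly from the two class counts."""
        total = positive + negative
        return sum(-(c / total) * math.log2(c / total) for c in (positive, negative) if c)

    e_s = entropy_from_counts(9, 5)         # 0.940
    e_weak = entropy_from_counts(6, 2)      # 0.811
    e_strong = entropy_from_counts(3, 3)    # 1.000
    gain_wind = e_s - (8 / 14) * e_weak - (6 / 14) * e_strong
    print(round(e_s, 3), round(e_weak, 3), round(e_strong, 3), round(gain_wind, 3))
    # prints: 0.94 0.811 1.0 0.048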

For each attribute, the gain is calculated and the highest gain is used in the decision.

Gain(S, Outlook) = 0.246

Gain(S, Temperature) = 0.029

Gain(S, Humidity) = 0.151

Clearly, the outlook attribute has the highest gain. Therefore, it is used as the decision attribute in the root node.

Since outlook has three possible values, the root node has three branches (sunny, overcast, rain). The next question is: which attribute should be tested at the sunny branch node? Since we’ve used outlook at the root, we only need to decide among the remaining three attributes: humidity, temperature, or wind.

Ssunny = {D1, D2, D8, D9, D11} = the 5 examples from the table with outlook = sunny

Gain(Ssunny, Humidity) = 0.970
Gain(Ssunny, Temperature) = 0.570
Gain(Ssunny, Wind) = 0.019

Humidity has the highest gain; therefore, it is used as the decision node. This process goes on until all data is classified perfectly or we run out of attributes.
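Putting entropy and information gain together, the whole ID3 procedure can be sketched in a few lines of Python. This is a rough sketch under my own naming, reusing the entropy and information_gain helpers defined earlier, not a production implementation:

    from collections import Counter

    def id3(rows, labels, attributes):
        """Grow a tree as nested dicts of the form {attribute: {value: subtree_or_label}}."""
        if len(set(labels)) == 1:          # all examples agree -> leaf node
            return labels[0]
        if not attributes:                 # nothing left to split on -> majority vote
            return Counter(labels).most_common(1)[0][0]
        # Greedily pick the attribute with the highest information gain (no backtracking)
        best = max(attributes, key=lambda a: information_gain(rows, labels, a))
        tree = {best: {}}
        for value in set(row[best] for row in rows):
            indices = [i for i, row in enumerate(rows) if row[best] == value]
            tree[best][value] = id3([rows[i] for i in indices],
                                    [labels[i] for i in indices],
                                    [a for a in attributes if a != best])
        return tree

Called on the 14 weather examples with the attributes outlook, temperature, humidity, and wind, this sketch would reproduce the tree described above, with outlook at the root.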

The decision tree can also be expressed in rule format:

IF outlook = sunny AND humidity = high THEN play baseball = no
IF outlook = sunny AND humidity = normal THEN play baseball = yes
IF outlook = overcast THEN play baseball = yes
IF outlook = rain AND wind = strong THEN play baseball = no
IF outlook = rain AND wind = weak THEN play baseball = yes
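Translated directly into code, that rule set is just a chain of conditionals (a toy helper of my own, with attribute values assumed to be lowercase strings):

    def play_baseball(outlook, humidity, wind):
        """Walk the learned rule set and return 'yes' or 'no'."""
        if outlook == "overcast":
            return "yes"
        if outlook == "sunny":
            return "yes" if humidity == "normal" else "no"
        # outlook == "rain": the decision depends on the wind
        return "yes" if wind == "weak" else "no"

    print(play_baseball("sunny", "high", "weak"))   # no
    print(play_baseball("rain", "high", "weak"))    # yes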

That’s all the mathematical understanding you need for decision trees. If these concepts feel overwhelming, don’t worry. Treat it as a learning process: like any other skill, it just needs some time and constant motivation.

Now we have reached the final part, yes, the coding part. Take a moment to congratulate yourself for coming this far. Here I’ve used scikit-learn (a machine learning library). Please don’t be disappointed: if I was going to use some black-box API anyway, what was the purpose of learning the theory at such depth and length? Well, we will answer that at the end of the code. First, let’s see the code.
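Here is a minimal scikit-learn sketch of that code. The iris dataset and the hyperparameter values are assumptions for illustration; the exact snippet may differ from what was originally embedded:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Load a small sample dataset (iris is assumed here purely for illustration)
    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.3, random_state=42)

    # criterion="entropy" mirrors the ID3-style information gain discussed above
    clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
    clf.fit(X_train, y_train)

    print("Test accuracy:", clf.score(X_test, y_test))
    print(export_text(clf, feature_names=iris.feature_names))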

Results

It might be astonishing to you that by using some black-box API we can automatically generate a decision tree. Well, actually, it is quite fascinating, but hold your fascination and consider this the most basic way to get started with a decision tree right away.

At this point, you should have a foundational understanding of the decision tree classifier. In the coding section we used the most popular Python library, a.k.a. scikit-learn. But you might be thinking: we’ve learned a lot of theory, so what was the use of all that?
Well, it was to prepare for hyperparameter optimization. Optimization is core to machine learning, and we will discuss it in a later article. Apart from that, machine learning keeps expanding into new areas, so your learning strategy should be dynamic and keep up with the times.
On a side note, you should ask yourself these questions before getting into machine learning:
Why am I doing this? What problem do I want to solve? Answering them will help you stay steady and constant on your path of learning anything. I would say, “Do what you want to do and love to do; otherwise you will end up living a life of have to do.” It might sound cliché, but be clear about your path in whatever you choose to do.

Just Follow your Heart. Never Stop Learning. Never Stop Growing.

Extravaganza (Bonus)

Decision Tree Use Cases

  • Building knowledge management platforms for customer service that improve first call resolution, average handling time, and customer satisfaction rates
  • In finance, forecasting future outcomes and assigning probabilities to those outcomes
  • Binomial option pricing predictions and real option analysis
  • Assessing a customer’s willingness to purchase a given product in a given setting, both offline and online
  • Product planning; for example, Gerber Products, Inc. used decision trees to decide whether to continue using PVC for manufacturing toys or not
  • General business decision-making
  • Loan approval
