Working behind DECISION TREES — Easy Explanation

Sukriti Macker · Analytics Vidhya · Sep 2, 2021
Everything you need to know about DECISION TREES!

A Decision Tree is a Classification Algorithm used in Machine Learning. It is versatile enough to solve Regression problems as well. It is a tree or graph-like structure with a root node, non-leaf nodes, leaf nodes, and branches.

Here is an example of a Decision Tree using the AND Operation:-

Decision Tree created using AND Operation
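
As a rough sketch (my own illustration, assuming two binary inputs A and B, not taken from the original figure), the decision logic such a tree encodes looks like this in code:

```python
# A hand-built "decision tree" for the AND operation:
# the root splits on input A, and the A = 1 branch splits again on input B.
def and_tree(a: int, b: int) -> int:
    if a == 0:            # root node: split on feature A
        return 0          # leaf node: pure, every record here is 0
    if b == 0:            # non-leaf node: split on feature B
        return 0          # leaf node
    return 1              # leaf node: reached only when A = 1 and B = 1

# Truth-table check: should print 0, 0, 0, 1
for a in (0, 1):
    for b in (0, 1):
        print(a, b, and_tree(a, b))
```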

There are some important terms that one must know while working around Decision Trees:-

  1. Root Node: The root node is the starting point of the decision tree. It represents the population of the decision tree, i.e. the data we are using to solve the problem. The root node gets split on a feature's (column/attribute) values.
  2. Non-Leaf Node: When a node splits into sub-parts (on a feature value), the nodes formed by the split are called Non-Leaf Nodes.
  3. Leaf Node: A node which cannot be split further (on any feature value) is called the Leaf Node.
  4. Splitting: Splitting is dividing the nodes into two or more sub-nodes based on the outcome. In the “Yes” or “No” case, a node will split into two sub-nodes, one pointing towards the outcome that came from YES & the other towards the outcome that came from NO.
  5. Impurity: When a node doesn’t have a clear outcome, i.e. its records don’t all belong to a single target/output class (Yes or No, 0 or 1), the node is considered impure. If the node has all the values of a single class, it is called a pure node, as the small sketch after this list illustrates.
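
To make the pure vs. impure distinction concrete, here is a tiny sketch (a toy example of mine, not drawn from the dataset used later):

```python
# A node is "pure" if every record in it has the same target class.
def is_pure(node_labels):
    return len(set(node_labels)) == 1

print(is_pure(["Yes", "Yes", "Yes"]))  # True  -> pure node
print(is_pure(["Yes", "No", "Yes"]))   # False -> impure node
```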

HOW DOES A NODE SPLIT & HOW DO WE KNOW WHICH FEATURE TO SPLIT IT ON?

To select the feature for the split, we calculate the impurity of the node. We want to select the feature which gives us the lowest impurity.

Lowest impurity means that the split will produce “less” impure nodes.

The aim is to reach pure nodes, where each node belongs to a single output class.

How do we calculate impurity?

The most common method is the GINI INDEX. It is used for a binary split, that is, for an output which has two classes only. Other measures include:-

  1. Accuracy
  2. Information Gain
  3. Gain Ratio

Let’s focus on GINI INDEX to understand the concept of splitting the node using a feature.

How do we calculate Gini Index?

Gini Index = 1 - (Probability of “Yes”)² - (Probability of “No”)²

Formula for Gini Index
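
As a quick sanity check, this formula can be written as a short helper function (a minimal sketch of my own; the Yes/No counts are just the binary case discussed above):

```python
# Gini Index of a binary node: 1 - P(Yes)^2 - P(No)^2
def gini_index(num_yes, num_no):
    total = num_yes + num_no
    p_yes, p_no = num_yes / total, num_no / total
    return 1 - p_yes ** 2 - p_no ** 2

print(gini_index(10, 0))  # 0.0 -> a perfectly pure node
print(gini_index(5, 5))   # 0.5 -> a 50/50 node, the most impure a binary node can be
```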

Let us now take a dataset to understand how to choose the feature to split the node on.

We have a dataset with Independent Variables or Features, namely Chest Pain, Good Blood Circulation & Blocked Arteries, which determine whether a person has Heart Disease or not. Heart Disease is our target class or dependent variable. Therefore, the dataset has Yes and No as its target classes.

Sample Dataset

In the dataset above, we have three features: Chest Pain, Good Blood Circulation & Blocked Arteries. Any one of the three features can be made the root node. To decide which feature to make the root, we calculate the Gini Index for each and pick the feature with the lowest impurity.

Assume we pick Chest Pain as the root node feature:-

  • If Chest Pain is YES -> 105 People have Heart Disease, and 39 don’t have Heart Disease.
  • If Chest Pain is NO -> 34 People have Heart Disease, and 125 don’t have Heart Disease.
Possible candidates (features) for the position of ROOT NODE.

Now, calculating Gini Index for Chest Pain Feature:-

Gini Index for when Chest Pain is YES = 1 - (Probability of Heart Disease Yes)² - (Probability of Heart Disease No)²

= 1 - (105/(105+39))² - (39/(105+39))²

= 0.395

Using the same formula, Gini Index for when Chest Pain is No = 0.336

Now, we calculate the Total Gini Index for Yes & No Chest Pain.

Total Gini Index = Weighted average of Gini Impurities for leaf nodes

Total G.I. for the Chest Pain node = [(Patients with Chest Pain Yes / Total Patients) x G.I. of Chest Pain Yes] + [(Patients with Chest Pain No / Total Patients) x G.I. of Chest Pain No]

= [(144/(144+159)) x 0.395] + [(159/(144+159)) x 0.336]

= 0.364

The impurity for the Chest Pain feature, calculated using the Gini Index, is 0.364.
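
A short snippet (repeating the helper sketched earlier, with the counts from this worked example) reproduces these numbers:

```python
# Same helper as in the earlier sketch, repeated so this block runs on its own.
def gini_index(num_yes, num_no):
    total = num_yes + num_no
    return 1 - (num_yes / total) ** 2 - (num_no / total) ** 2

gini_yes = gini_index(105, 39)   # Chest Pain = YES: 105 with disease, 39 without
gini_no = gini_index(34, 125)    # Chest Pain = NO : 34 with disease, 125 without

# Total Gini Index = weighted average of the two leaf impurities.
n_yes, n_no = 105 + 39, 34 + 125                 # 144 and 159 patients
total = (n_yes / (n_yes + n_no)) * gini_yes + (n_no / (n_yes + n_no)) * gini_no

print(round(gini_yes, 3))  # 0.395
print(round(gini_no, 3))   # 0.336
print(round(total, 3))     # 0.364
```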

After calculating the Gini Index for each feature, as we did for the Chest Pain feature, we would find that the Good Blood Circulation feature has the lowest impurity. Thus, we select Good Blood Circulation to be the root node.

Similarly, we continue the process to find the next node and complete our Decision Tree.
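
In practice, this greedy, Gini-based splitting is exactly what a library tree learner does for you. Here is a hedged sketch using scikit-learn’s DecisionTreeClassifier, assuming the dataset were loaded as a pandas DataFrame with the column names used above (the tiny DataFrame below is made up purely for illustration, not the article’s actual data):

```python
# Sketch only: a made-up stand-in for the article's dataset, with binary (0/1)
# columns standing in for the Yes/No values of each feature and the target.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "Chest Pain":             [1, 1, 0, 0, 1, 0],
    "Good Blood Circulation": [0, 1, 1, 1, 0, 0],
    "Blocked Arteries":       [1, 0, 0, 1, 1, 0],
    "Heart Disease":          [1, 0, 0, 0, 1, 1],
})

X = df.drop(columns=["Heart Disease"])
y = df["Heart Disease"]

# criterion="gini" makes the tree choose every split by minimising Gini impurity,
# which is the procedure walked through by hand above.
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(tree.feature_importances_)
```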
