Decision Trees — Primer (Part 1)

Udit Nagar · Published in The StereoDataType · 5 min read · Mar 3, 2022

The gist of this article is to help you understand the basics of Decision Tree Learning. It is a very common human tendency to get enthusiastic about various data science problems while failing to fully grasp the basics needed to solve them. Generally, once we are stuck, most of us go straight to websites like Stack Overflow and Stack Exchange to find out whether someone else has tackled a similar situation. Rather than thinking through and applying our own solution, we use makeshift methods and shortcut tactics to handle things.

This is Part 1 of a five-article series in which I will cover the basic jargon and the splitting of decision trees in depth. Part 2 of this series will cover the CART and ID3 methods for determining impurity and uncertainty in a decision tree. The third part will discuss how to deal with the different categories of variables used in decision trees. The last two parts will work through examples in Python and R respectively.

I have two aims in writing this series. First, I’d like to become confident in my hold on this topic. Second, I’d like to help others who are stuck in a similar situation. Enough jibber-jabber now. Let’s get started.

What is a Decision Tree?

A decision tree is a decision-support tool that uses a tree-like graph over a set of observations to arrive at a particular conclusion. A decision tree looks something like this:

A sample decision tree

In the context of machine learning, decision tree learning uses decision trees as a predictive modelling approach, where a set of observations about an item is used to draw conclusions about the item’s target value.

In the example above, if a person’s age is greater than 40, they are allowed to have a cheat day today; otherwise, they have to go for a workout.
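A tree this small is just a nested if/else rule, so it can be written out directly in code. Here is a minimal Python sketch of the tree in the figure; the function name and return strings are my own, purely for illustration:

```python
def cheat_day_or_workout(age: int) -> str:
    if age > 40:                # root node: the condition being tested
        return "Cheat day"      # terminal (leaf) node
    return "Workout"            # terminal (leaf) node

print(cheat_day_or_workout(45))  # Cheat day
print(cheat_day_or_workout(30))  # Workout
```

Now you know the formal definition and structure of a decision tree. Moving further, in order to crack an interview or a viva exam, you need to be fully aware of the jargon behind the topics you’re covering. Let’s dive deep into the jargon of decision trees: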

Decision Tree Nodes and Branches (Source: Analytics Vidhya)

Node: A node represents an observation about a particular item and contains the population or sample for that item.

Splitting Branches: Splitting branches are drawn based on the conclusion made at a node; each branch corresponds to one possible outcome.

Root Node: The root node is the topmost node of the decision tree; it represents the most important item, on the basis of which the first split is made.

Decision Node: Decision nodes are the internal nodes, which can be split into further nodes. They have branches pointing both towards them and away from them.

Terminal Node: The final node, which draws the conclusion for a particular sequence of observations. Terminal nodes are also called leaf nodes. Note that, unlike decision nodes, terminal nodes have a branch pointing towards them but cannot have branches pointing away from them.

Sub-Tree Branches: A sub-tree consists of an entire sub-section of a decision tree, which can be analysed on its own.
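To tie these terms together, here is a minimal sketch of how such a node might be represented in Python; the class and field names are my own, purely for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    label: str                    # the observation tested, or the conclusion drawn
    yes: Optional["Node"] = None  # splitting branch when the condition holds
    no: Optional["Node"] = None   # splitting branch otherwise

    @property
    def is_terminal(self) -> bool:
        # a terminal (leaf) node has no branches pointing away from it
        return self.yes is None and self.no is None

# The cheat-day tree from earlier: the root is the only decision node,
# and its two children are terminal (leaf) nodes.
root = Node("Age > 40?", yes=Node("Cheat day"), no=Node("Workout"))
print(root.is_terminal)      # False: the root is a decision node
print(root.yes.is_terminal)  # True: a leaf node
```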

With the jargon in place, let’s move on to building decision trees.

Splitting Decision Trees

While handling different datasets for machine learning, we need to identify the type of variable we are dealing with. The most common categories of variables in supervised machine learning are listed below (a short sketch of how each might be represented follows the list):

  1. Binary Categories (Yes/No, 0/1 etc.)
  2. Ranked Categories (A/B/C/D, 1/2/3/4 etc.)
  3. Multiple-Choice Categories (Red/Blue/Green, May/June/July etc.)
  4. Numerical Categories (100000, 557, 223 etc.)
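As promised, here is a minimal pandas sketch of one way these four categories might be represented; the column names and values are hypothetical, and the dtype choices are a common convention rather than the only option:

```python
import pandas as pd

df = pd.DataFrame({
    "smoker": ["Yes", "No", "No"],                        # 1. binary
    "grade":  pd.Categorical(["B", "A", "D"],
                             categories=["D", "C", "B", "A"],
                             ordered=True),                # 2. ranked (ordered)
    "colour": pd.Categorical(["Red", "Blue", "Green"]),   # 3. multiple-choice
    "salary": [100000, 557, 223],                          # 4. numerical
})
print(df.dtypes)
```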

For our understanding, we will deal with Binary Categories. Suppose we have a sample dataset as follows:

Note: This table is only a sample from a larger dataset; the numbers for the decision trees below will be hypothetical.

There are three independent variables (Cough, Fever and Sore Throat) which determine the target variable, i.e. whether the person has CoVid. In order to build a complete decision tree, we need to determine the best independent variable with which to separate the data, and use it as the root node.

To start, let’s group each independent variable individually with the target variable. For example, the first patient has both Cough and CoVid; evaluating the rest of the patients similarly, we can draw something like:

Cough and CoVid distribution grouped in a tree. Note: The numbers are hypothetical

Here, the root node is the item Cough, which performs a binary split to determine how many patients have both Cough and CoVid. By the looks of it, most of the patients with Cough have CoVid, as denoted by the left decision node. From the right decision node, we can conclude that most of the patients who do not have Cough do not have CoVid either. Similarly, drawing sub-trees by grouping each of the remaining independent items with the target variable, we get:

Fever and CoVid distribution grouped in a tree

Here, we can observe a similar pattern, where the majority of patients with Fever suffer from CoVid too. The left decision node denotes the patients who have Fever and the right decision node denotes the patients who do not.

Sore Throat and CoVid distribution grouped in a tree

Similarly, the distribution of Sore Throat and CoVid is shown above. The left decision node denotes the patients who have a sore throat and the right decision node denotes the patients who do not.
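In code, each of the groupings above is simply a cross-tabulation of one symptom against the target. A minimal pandas sketch with made-up patient records (the column names follow the article; the data, like the numbers in the figures, is hypothetical):

```python
import pandas as pd

# Made-up patient records standing in for the sample table above.
df = pd.DataFrame({
    "Cough":       ["Yes", "No",  "Yes", "Yes", "No"],
    "Fever":       ["Yes", "Yes", "No",  "Yes", "No"],
    "Sore Throat": ["No",  "Yes", "Yes", "No",  "No"],
    "CoVid":       ["Yes", "No",  "Yes", "Yes", "No"],
})

# One cross-tabulation per independent variable: the rows are the two
# sides of the binary split, and the columns are the CoVid counts that
# land in each decision node of the corresponding sub-tree.
for symptom in ["Cough", "Fever", "Sore Throat"]:
    print(pd.crosstab(df[symptom], df["CoVid"]), end="\n\n")
```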

From the decision trees above, we cannot find a single clean flow of separation, i.e. none of the decision nodes is 100% ‘Yes’ CoVid or 100% ‘No’ CoVid. This is a case of impurity. The data is said to be pure when there is a single flow of separation while splitting the data; impurity measures how mixed (non-homogeneous) the labels at a node are. Unfortunately, in the real world, almost all data is impure.
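A purity check itself is easy to sketch: a node is pure exactly when every label that reaches it is identical. (How to quantify the degree of impurity of a mixed node is the subject of Part 2.)

```python
# A node is pure when every label that reaches it is identical.
def is_pure(labels) -> bool:
    return len(set(labels)) <= 1

print(is_pure(["Yes", "Yes", "Yes"]))  # True: 100% 'Yes' CoVid
print(is_pure(["Yes", "No", "Yes"]))   # False: impure, like the nodes above
```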

To determine the best root node among the three independent variables, we need to check which independent variable is the least impure, or the least uncertain. There are multiple ways to determine this, and they will be covered in Part 2 of this module: Decision Tree — CART & ID3 (Coming Soon)

Bonus Article: Decision Tree CheatSheet and Additional Jargons (Coming Soon)

To see more of my projects, follow me on GitHub or connect with me on LinkedIn.
