Decision Tree Classifiers.

David K
9 min read · Feb 24, 2023


What are Decision Trees?

Consider this scenario: you’re a medical researcher compiling data for a study. You’ve already collected data on a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of three medications: drug A, drug B, or drug C. Part of your job is to build a model that predicts which drug might be appropriate for a future patient with the same illness. In Data Science terms, this is a classification problem, because we need to classify each patient by the drug they responded to. A Decision Tree is a supervised learning algorithm that can be used for both classification and regression tasks. It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes.

The image above shows a simple example of a Decision Tree that classifies an individual as fit or unfit based on a few questions. The process it follows to arrive at a decision is stated below:

First, check whether the person’s age is less than 30. If it is, check whether the individual eats pizza: if yes, the individual is unfit; if no, he or she is fit. If the person’s age is 30 or more, check instead whether they exercise: if yes, the person is fit; if no, the person is unfit. Congratulations! You now have a clear picture of what Decision Trees are and how they work.
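To make the traversal concrete, here is the same logic written as plain Python conditionals. This is a minimal sketch; the function name and arguments are illustrative, not taken from the original figure.

```python
def classify_fitness(age: int, eats_pizza: bool, exercises: bool) -> str:
    """Walk the fitness Decision Tree described above."""
    if age < 30:
        # Young individuals are judged on diet.
        return "unfit" if eats_pizza else "fit"
    # Individuals aged 30 or more are judged on exercise.
    return "fit" if exercises else "unfit"

print(classify_fitness(25, eats_pizza=True, exercises=False))  # unfit
print(classify_fitness(45, eats_pizza=False, exercises=True))  # fit
```

With that intuition in hand, let’s revisit the medical researcher scenario.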

Patient Data

The picture above depicts sample data on patients: their medical measurements and the drug each responded to during treatment. To build our model, let’s start with a quick feature selection. The Age, Sex, BP, and Cholesterol features are all valid candidates for predicting our target variable, which is the type of drug that should be administered. In essence, this model will help us choose the optimal drug to prescribe for a new patient based on their age, sex, blood pressure, and Cholesterol level.

How is a Decision Tree Classifier Built?

A Decision Tree is built by repeatedly splitting a training data set into distinct nodes. In the diagram above, we first consider a given patient’s age. If the patient is middle-aged, we prescribe drug X right away. If the patient is young, we check the patient’s sex: if male, drug C is prescribed; if female, drug Y. If the patient is old, we check the patient’s Cholesterol level instead: if it is high, we prescribe drug Y; if low, drug C. This flow of logic is the underlying concept of Decision Tree algorithms.
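A Decision Tree is also just a data structure. As a minimal sketch (the dictionary layout, attribute values, and function name are illustrative assumptions, not code from the article), the tree above can be encoded as nested dictionaries and walked to a prescription:

```python
# Internal nodes map an attribute name to its branches; strings are leaf decisions.
tree = {
    "Age": {
        "middle-aged": "drug X",
        "young": {"Sex": {"male": "drug C", "female": "drug Y"}},
        "old": {"Cholesterol": {"high": "drug Y", "low": "drug C"}},
    }
}

def predict(node, patient: dict) -> str:
    """Walk the tree until a leaf (a drug name) is reached."""
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[patient[attribute]]
    return node

print(predict(tree, {"Age": "old", "Cholesterol": "high"}))  # drug Y
```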

Leaf Node: Assigns a classification.

Branch: Each branch corresponds to the results of a test.

Internal Node: A test on a particular attribute or condition.

Root Node: The point of initiation of the Decision Tree. It represents the first test we conduct.

The process flow of a Decision Tree Algorithm:

1. Choose an attribute from your dataset.

2. Calculate how significant the attribute is in splitting the data, to see how effectively it separates the classes.

3. Split the data based on the value of the best attribute.

4. Go back to step one and repeat the process for the remaining attributes.

NB: A very important factor in Decision Tree algorithms is determining the optimal attribute to split the data on. The sketch below shows the whole loop end to end.
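As a rough illustration of steps 1 to 4, here is a compact sketch of recursive partitioning in plain Python. It uses Entropy-based Information Gain as the significance measure from step 2 (both terms are defined later in this article), and all names are illustrative rather than a definitive implementation:

```python
import math
from collections import Counter

def entropy(labels):
    """Uncertainty of a list of class labels (defined later in the article)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attribute, target):
    """How much splitting on `attribute` reduces Entropy (step 2)."""
    before = entropy([r[target] for r in rows])
    after = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after

def build_tree(rows, attributes, target):
    """Recursively partition `rows`, mirroring steps 1 to 4 above."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    best = max(attributes, key=lambda a: gain(rows, a, target))  # steps 1-2
    node = {best: {}}
    for value in {r[best] for r in rows}:  # step 3: split on the best attribute
        subset = [r for r in rows if r[best] == value]
        # Step 4: recurse on each partition with the remaining attributes.
        node[best][value] = build_tree(subset, [a for a in attributes if a != best], target)
    return node
```

Calling build_tree on the 14-patient table with attributes like Sex and Cholesterol would produce nested dictionaries much like the hand-drawn tree above.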

Scenarios for attribute selection: both bad and optimal choices.

Let’s consider Cholesterol as the attribute chosen to start the Decision Tree. For this scenario, we have a data set of 14 cases, in which drug B was administered in 9 and drug A in 5. Splitting on Cholesterol results in two categories (High and Normal Cholesterol), as shown below.

After splitting, we can’t tell with certainty which drug should be administered in either case. Let me explain why. Of the 6 patients with high Cholesterol, drug B was administered in 3 cases and drug A in the other 3, an even split that gives us no evidence at all for preferring one drug over the other. The normal-Cholesterol node is only slightly better: drug B was administered in 6 of its 8 cases and drug A in 2. We cannot simply state that since drug B was administered in most of those cases we should go with it; we need to be certain, because we’re dealing with human lives here. Understand now? Great! Let’s consider another attribute: Sex.

We’ll stick with the same 14 cases. The Sex attribute can be split into two classes: female and male. Let’s do that split quickly and have a look at the result.

After the split, we have 6 cases in which female patients were administered drug B and just 1 case in which a female was administered drug A. For females, we have obtained purer results from this split than we did with the Cholesterol attribute: we can state with a higher level of confidence that drug B should be administered if the patient is female. For males, however, we still do not have enough evidence to support whether drug A or drug B should be administered. An interesting property of Decision Trees is that, starting with the right attributes, we can keep splitting further down to achieve purer results that support a hypothesis with certainty. Let’s split the male category further and see whether we find something interesting. I’m curious. Curiosity killed the cat, but in this case, it won’t. Trust me!

We now test the male patients on the Cholesterol attribute. Notice that we now have much purer results? Great. We can now state with a high level of confidence that for males with high Cholesterol, drug A should be administered, and for those with normal Cholesterol, drug B should be administered. A node in the tree is considered pure if, in 100 percent of its cases, the records fall into a single category of the target field. The Decision Tree algorithm uses recursive partitioning to split the training records into segments, minimizing the impurity at each step. The impurity of a node is calculated from the Entropy of the data in that node. Well, that definitely is a new word in this article. Yes! So, without wasting time, let’s define what Entropy is.

Entropy can be defined as a measure of uncertainty or randomness.

From the diagram above, if all of the data in a given node belongs to either drug A or drug B, Entropy is low: in that case we can be certain about which drug is best, so uncertainty, and hence Entropy, is low. However, if a node contains equal numbers of drug A and drug B cases, Entropy is high, because we become maximally uncertain about which drug to administer. Note that we always want a situation in which, based on the data in a node, we can be certain about the decision reached.

Calculating Entropy:

Entropy = −p(A) log₂(p(A)) − p(B) log₂(p(B))

where p is the proportion of a category, such as drug A or drug B. We will take an example of how to calculate the Entropy of a given node, so hang on tight as I make the concepts clearer.

Low values of Entropy mean purer nodes, and high values correspond to impure nodes. In Decision Trees, we always look out for trees that have the lowest Entropy in their nodes. However, take note that we do not have to manually calculate Entropy, as the library used to implement the Decision Tree algorithm will automatically calculate Entropy for us.

A Quick Example Of Calculating Entropy In A Node

In the diagram above, we have chosen to start splitting our features on the Cholesterol attribute. Note that, just like in the examples above, we are still sticking with 14 cases. The Entropy of the node before the split is calculated as follows:

p(Drug A) = 5/14

p(Drug B) = 9/14

Entropy = −(9/14) log₂(9/14) − (5/14) log₂(5/14) ≈ 0.940

After splitting further, the normal-Cholesterol node contains 8 patient cases in total: drug B was administered in 6 of them and drug A in 2. Let’s calculate the Entropy for that node as well.

Entropy = −(6/8) log₂(6/8) − (2/8) log₂(2/8) ≈ 0.811
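If you want to double-check these numbers yourself, here is a tiny sketch (entropy2 is an illustrative helper, not a library function) that reproduces both values with Python’s math module:

```python
import math

def entropy2(p_a: float, p_b: float) -> float:
    """Two-class Entropy: -p(A)·log2(p(A)) - p(B)·log2(p(B))."""
    return -p_a * math.log2(p_a) - p_b * math.log2(p_b)

print(f"{entropy2(5 / 14, 9 / 14):.3f}")  # 0.940 (node before the split)
print(f"{entropy2(2 / 8, 6 / 8):.3f}")    # 0.811 (normal-Cholesterol node)
```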

The Entropy after the split is lower, and therefore better, than before the split; but an Entropy of 0.811 is still high and undesirable. Let’s investigate another attribute to see if we get a better level of Entropy after one split.

From the diagram above, the results of the first split on Sex show a much lower level of Entropy for the female node. This is information that can be very useful in selecting the optimal attribute to start our Decision Tree with. Now, the million-dollar question: which attribute is the better choice for initiating our Decision Tree?

The answer: the attribute whose split produces the higher Information Gain. The term Information Gain probably sounds foreign, since we haven’t touched on it yet. Don’t worry, I’ve got you covered! Information Gain is the reduction in uncertainty (Entropy) achieved by splitting on an attribute; in other words, it measures how much more certain we become after the split. It is calculated as shown below:

Information Gain = (Entropy before split) − (weighted Entropy after split)

Information Gain and Entropy move in opposite directions: the lower the weighted Entropy left after a split, the higher the Information Gain, and vice versa. Let’s calculate the Information Gain for both attributes (Sex and Cholesterol).

For the Sex attribute:

Entropy before the split = 0.940

Weighted Entropy after the split = (7/14)(0.985) + (7/14)(0.592), since 7 of the 14 patients are male (node Entropy 0.985) and 7 are female (node Entropy 0.592)

Information Gain = 0.940 − [(7/14)(0.985) + (7/14)(0.592)] ≈ 0.151

For the Cholesterol attribute:

Following the same process, with 8 of the 14 patients having normal Cholesterol (node Entropy 0.811) and 6 having high Cholesterol (node Entropy 1), the Information Gain for Cholesterol is:

0.940 − [(8/14)(0.811) + (6/14)(1)] ≈ 0.048
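As a quick sanity check, here is a small sketch that reproduces both Information Gain values (entropy2 is the same illustrative helper as before; the class proportions in each node come from the splits discussed above):

```python
import math

def entropy2(p_a: float, p_b: float) -> float:
    return -p_a * math.log2(p_a) - p_b * math.log2(p_b)

before = entropy2(5 / 14, 9 / 14)  # 0.940: Entropy of all 14 cases

# Sex: 7 males (4 drug A, 3 drug B) and 7 females (1 drug A, 6 drug B).
gain_sex = before - (7 / 14 * entropy2(3 / 7, 4 / 7) + 7 / 14 * entropy2(1 / 7, 6 / 7))

# Cholesterol: 8 normal (2 drug A, 6 drug B) and 6 high (3 of each).
gain_chol = before - (8 / 14 * entropy2(2 / 8, 6 / 8) + 6 / 14 * entropy2(3 / 6, 3 / 6))

print(f"{gain_sex:.3f}")   # 0.152 (the article's 0.151 uses rounded node Entropies)
print(f"{gain_chol:.3f}")  # 0.048
```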

From our results, the tree with the higher Information Gain is the one that starts with the Sex attribute. This tells us that Sex is a better choice than Cholesterol for initiating our Decision Tree.

Time to get our hands dirty. Below is Python code to implement a Decision Tree that recommends drug prescriptions.
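Here is a minimal sketch using scikit-learn’s DecisionTreeClassifier. The file name drug.csv and the exact column names are assumptions; adapt them to match the data set linked below.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Assumed file and column names; adjust to the actual data set.
df = pd.read_csv("drug.csv")

X = df[["Age", "Sex", "BP", "Cholesterol"]].copy()
y = df["Drug"]

# Encode the categorical features as integers.
for col in ["Sex", "BP", "Cholesterol"]:
    X[col] = LabelEncoder().fit_transform(X[col])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=3)

# criterion="entropy" makes scikit-learn split on Information Gain,
# the same measure we calculated by hand above.
model = DecisionTreeClassifier(criterion="entropy", max_depth=4)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
```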

Link to the data set: here

Congratulations on making it this far. Connect with me on LinkedIn here and feel free to leave any questions in my inbox.

Thank you!
