Airton Kamdem
5 min read · Mar 7, 2022

DECISION TREES: “You can get with this, or you can get with that”

Black Sheep, the 1990s hip hop group behind the hit single “The Choice Is Yours”

Much like the acclaimed 1991 hip hop hit “The Choice Is Yours” by Black Sheep, decision trees are all about making decisions. Decision trees, however, help us make them at scale from data inputs, with the help of computers. This blog will explore the high-level processes that enable the magic of decision trees, but before we go any deeper, here is some helpful terminology to follow along with.

Terminology

Root node: The node at the very top of the tree; every other node and leaf descends from it.

Internal nodes: Nodes that have arrows pointing to and away from them.

Leaf nodes: Nodes that have arrows pointing to them but no arrows pointing away from them.

A decision tree asks a true-or-false question of the data and, depending on the answer, categorizes the data appropriately and moves on to the next relevant question until it reaches a point where it can't go any further. With each question or layer, the model keeps track of how well the node in question, together with the nodes before it, predicts the classification of a target. Unless the data is already perfectly separable, this leads to impure nodes along the way. This impurity is measured with a common calculation known as Gini impurity:
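
The calculation itself is straightforward: for a node whose samples fall into classes with proportions p_i, the Gini impurity is 1 minus the sum of the squared proportions. Here is a minimal Python sketch (the disease/no-disease labels are made up for illustration):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())

# A perfectly pure node scores 0; an even 50/50 node scores 0.5.
print(gini_impurity(["disease"] * 4))                # 0.0
print(gini_impurity(["disease", "no disease"] * 2))  # 0.5
```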

Each candidate feature receives such a score. In a heart disease prediction model, for example, we could build a split on a feature that records whether a patient has a family history of heart disease. The features that produce the lowest Gini impurity naturally get assigned the root or internal node positions.
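
To make “lowest” concrete, the score of a candidate split is usually taken as the size-weighted average of the Gini impurity on each side. A small sketch of that for a hypothetical family-history split, reusing the gini_impurity function above:

```python
def split_impurity(left_labels, right_labels):
    """Size-weighted average of the Gini impurity of the two groups a split produces."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) * gini_impurity(left_labels)
            + len(right_labels) * gini_impurity(right_labels)) / n

# Hypothetical patients, split on 'family history of heart disease?'
has_history = ["disease", "disease", "disease", "no disease"]
no_history  = ["no disease", "no disease", "no disease", "disease"]
print(split_impurity(has_history, no_history))  # 0.375 here; lower is better
```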

In the case of numeric data such as Body Mass Index, the model sorts the patients from lowest to highest BMI, calculates the average BMI of each pair of adjacent patients, and treats each of those averages as a candidate threshold. For each candidate, it computes the impurity of the resulting split, which is the weighted average of the Gini impurity on either side of the threshold, such as the impurity produced by a cutoff near the “obesity” threshold. The candidate with the lowest Gini impurity ultimately becomes the cutoff used in the decision tree, as a root or internal node.
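
Here is a rough sketch of that threshold search, reusing the split_impurity helper above; the BMI values and labels are invented toy data:

```python
def best_numeric_split(values, labels):
    """Sort by the numeric feature and try the midpoint of each adjacent pair as a threshold."""
    pairs = sorted(zip(values, labels))
    best = None
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        threshold = (v1 + v2) / 2  # average of two adjacent BMIs
        left  = [y for v, y in pairs if v <= threshold]
        right = [y for v, y in pairs if v > threshold]
        score = split_impurity(left, right)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best  # (BMI cutoff, weighted Gini impurity)

bmi    = [21.4, 24.0, 27.3, 30.1, 33.8, 36.5]
labels = ["no disease", "no disease", "no disease", "disease", "disease", "disease"]
print(best_numeric_split(bmi, labels))  # roughly (28.7, 0.0) for this toy data
```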

For ranking and multiple-choice data types, there is a similar process. Ranked data, for example, is treated much like numeric data, but instead of averaging adjacent values to form candidate thresholds, the model calculates an impurity score at each rank. Practically speaking, this breakdown could look like the following (a short sketch follows the list):

Node 1: Customer orders with one or fewer stars

Node 2: Customer orders with two or fewer stars

Node 3: Customer orders with three or fewer stars

Node 4: Customer orders with four or fewer stars

A fifth node, for orders with five or fewer stars, is not needed, as it would include every order.
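
A minimal sketch of that enumeration, again reusing the split_impurity helper from earlier and assuming each order comes with a star rating and a satisfaction label:

```python
def best_rank_split(stars, labels):
    """Try each 'rating <= k' split; the top rank is skipped because it includes everyone."""
    best = None
    for k in sorted(set(stars))[:-1]:
        left  = [y for s, y in zip(stars, labels) if s <= k]
        right = [y for s, y in zip(stars, labels) if s > k]
        score = split_impurity(left, right)
        if best is None or score < best[1]:
            best = (k, score)
    return best  # (star threshold, weighted Gini impurity)
```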

In the case of multiple-choice data, such as a customer ordering seafood, American food, or TexMex, the model calculates an impurity score for each individual option as well as for each potential combination of options (again, a short sketch follows the list):

Node 1: Food choice: Seafood

Node 2: Food choice: American

Node 3: Food choice: TexMex

Node 4: Food choice: Seafood or American

Node 5: Food choice: Seafood or TexMex

Node 6: Food choice: American or TexMex

A seventh node, for Seafood or American or TexMex, is not needed, as it would include all potential orders.
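
For the multiple-choice case, the candidate splits are the non-empty proper subsets of the options. A sketch using the same hypothetical helpers as before:

```python
from itertools import combinations

def best_category_split(choices, labels):
    """Try every non-empty proper subset of the options as a 'choice in subset?' split."""
    options = sorted(set(choices))
    best = None
    for size in range(1, len(options)):          # the full set is skipped: it includes everyone
        for subset in combinations(options, size):
            left  = [y for c, y in zip(choices, labels) if c in subset]
            right = [y for c, y in zip(choices, labels) if c not in subset]
            score = split_impurity(left, right)
            if best is None or score < best[1]:
                best = (subset, score)
    return best  # (winning combination of options, weighted Gini impurity)
```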

RANDOM FOREST:

Random forests are the natural next step after decision trees, so an exploration of random forests has to be built on decision trees. Overall, decision trees tend to be easier to build, interpret, and deploy, but they can struggle to generalize to unseen data because they often lack flexibility. This is where random forests provide unique added value: they combine the simplicity of decision trees with flexibility, leading to a significant improvement in accuracy.

The first step in creating a random forest is to generate a bootstrapped dataset. To keep the same size and roughly the same distribution as the original, we randomly select samples from the source data frame with replacement, so one sample can be selected and added to the bootstrapped dataset more than once. The second step is to build a decision tree on this bootstrapped data while considering only a random subset of variables/columns at each step. For example, in a dataset for predicting customer satisfaction, we could start by considering only ‘order total’ and ‘party size’, out of dozens of other features, as candidates for the root node. Once the root is selected, we follow a similar random selection process for the ensuing child nodes, and we keep going until the tree is fully grown.
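
Putting those two steps together, here is a minimal sketch using NumPy and scikit-learn's DecisionTreeClassifier. It assumes X and y are NumPy arrays; max_features="sqrt" is what restricts each split to a random subset of columns, and the build_forest name and out-of-bag bookkeeping are just for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_forest(X, y, n_trees=100, seed=0):
    """Grow many trees, each on a bootstrap sample, with a random subset of columns per split."""
    rng = np.random.default_rng(seed)
    trees, oob_indices = [], []
    n_samples = len(X)
    for _ in range(n_trees):
        # Bootstrap: draw row indices with replacement, same size as the original data.
        idx = rng.integers(0, n_samples, size=n_samples)
        # max_features="sqrt" makes the tree consider a random subset of features at each split.
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(1_000_000)))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
        # Rows that never appeared in this bootstrap are "out-of-bag" for this tree.
        oob_indices.append(np.setdiff1d(np.arange(n_samples), idx))
    return trees, oob_indices
```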

After completing this tree on the bootstrapped dataset, the model has one tree structure, and it returns to the beginning to create another randomized tree using the same process. Because this buildout relies on random selection at every step, no two individual trees should map out the exact same set of decisions. The model repeats this hundreds of times, introducing a broad amount of variety. As you can imagine, this exposure to many randomized trees makes the model far more robust than a single, deterministically built decision tree.

The above process results in what we effectively call a random forest. When a new data sample is introduced, the model runs it through the first tree, records its prediction, then runs the sample through the second tree, records that prediction, and so on for every tree in the forest. From there, the model sees which label received the most votes and uses that classification as its ultimate prediction. Bootstrapping the data and using the aggregate of those votes to predict or classify is known as bagging (bootstrap aggregating).
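
A sketch of that voting step, assuming trees is the list returned by the hypothetical build_forest above and x_new is a single sample stored as a 1-D NumPy array:

```python
from collections import Counter

def forest_predict(trees, x_new):
    """Run one new sample through every tree and return the label with the most votes."""
    votes = [tree.predict(x_new.reshape(1, -1))[0] for tree in trees]
    return Counter(votes).most_common(1)[0][0]
```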

EVALUATING

You may have noted that the bootstrapped dataset created earlier includes duplicates from the original; this typically means about a third of the original samples never make it into the bootstrapped data. Entries that don't make it in are appropriately called “out-of-bag” samples. Since each of these samples wasn't used to train the trees that left it out, the model can run it through those trees and evaluate with reasonable certainty how well the forest generalizes to unseen data. The label with the most votes wins, and every out-of-bag sample is pushed through the model in this fashion. The proportion of these samples that are incorrectly classified, known as the out-of-bag error, gives us an estimate of overall accuracy. At this stage, we could even return to the initial steps, build a new set of trees using a different number of randomly selected variables per split, and evaluate the out-of-bag samples against them. Running several permutations of this process is part of what gives random forest models their flexibility and accuracy, at least compared with single decision trees.
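
Finally, a sketch of that out-of-bag check, reusing the trees and oob_indices returned by the hypothetical build_forest sketch and the Counter import from earlier:

```python
def oob_error(trees, oob_indices, X, y):
    """Estimate accuracy on 'unseen' data: each sample is judged only by trees that never saw it."""
    wrong, counted = 0, 0
    for i in range(len(X)):
        # Collect votes only from trees whose bootstrap sample left row i out.
        votes = [tree.predict(X[i].reshape(1, -1))[0]
                 for tree, oob in zip(trees, oob_indices) if i in oob]
        if not votes:
            continue  # every single tree happened to include this sample
        counted += 1
        if Counter(votes).most_common(1)[0][0] != y[i]:
            wrong += 1
    return wrong / counted  # the out-of-bag error rate
```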