What is a Decision Tree?

For a bank to consider whether or not to offer someone a loan, it often works through a sequential list of questions to figure out if it is safe to lend to that individual. Those questions can start as simply as: what kind of income does the person have? If it is between $30–70k, they move on to the next question: how long have they held their current job? If 1–5 years, that leads to the next question: do they make their credit card payments? If yes, they offer the loan; if no, they do not. This process, at its most basic form, is a Decision Tree.
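To make that sequence concrete, here is a minimal Python sketch of the loan protocol above. The function name, income band, and tenure range are just the illustrative values from the example, not a real bank's policy:

```python
def approve_loan(income, years_in_job, pays_credit_card):
    """Toy version of the sequential loan questions described above."""
    if not (30_000 <= income <= 70_000):   # first question: income band
        return False
    if not (1 <= years_in_job <= 5):       # second question: job tenure
        return False
    return pays_credit_card                # final question: payment history

print(approve_loan(45_000, 3, True))    # True  -> offer the loan
print(approve_loan(45_000, 3, False))   # False -> decline
```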

A decision tree is a widely used, non-parametric, and effective machine learning technique for regression and classification problems. To find a solution, a decision tree makes sequential, hierarchical decisions about the outcome variable based on the predictor data.

So what does all that mean?

Hierarchical means the model is defined by a series of questions that lead to a class label or a value when applied to any observation. Once set up, the model acts like a protocol: a series of “if this occurs, then that occurs” conditions that produce a specific result from the input data.

A non-parametric method means that there are no underlying assumptions about the distribution of the errors or the data. It basically means that the model is constructed purely from the observed data.

Decision tree models where the target variable takes a discrete set of values are called Classification Trees. In these trees the leaves represent class labels, while the branches represent the conjunctions of features that lead to those labels. Decision trees where the target variable takes a continuous value, usually a real number, are called Regression Trees. The two types are commonly referred to together as CART (Classification and Regression Trees).
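As a quick illustration using scikit-learn (which the end of this article points to), the two halves of CART are exposed as separate estimators; the tiny dataset below is made up purely for the sketch:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# A single predictor, e.g. income in $k (toy, made-up values)
X = [[30], [45], [60], [80]]

# Classification tree: the target is a discrete set of labels
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, ["decline", "approve", "approve", "decline"])
print(clf.predict([[50]]))   # -> a class label, e.g. ['approve']

# Regression tree: the target is a continuous value
reg = DecisionTreeRegressor(random_state=0)
reg.fit(X, [2.0, 5.0, 8.0, 3.0])
print(reg.predict([[50]]))   # -> a numeric prediction
```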

Each CART model is a case of a Directed Acyclic Graph (DAG). In these graphs, nodes represent decision points about the target variable given the predictors, and edges are the connections between the nodes. In the loan scenario above, the $30–70k income band would be an edge and “Years in current job” would be a node.

As the goal of a decision tree is to make the optimal choice at each node, it needs an algorithm that is capable of doing just that. That algorithm is known as Hunt’s algorithm, which is both greedy and recursive: greedy meaning that at each step it makes the locally optimal decision, and recursive meaning it splits the larger question into smaller questions and resolves them the same way. The decision to split at each node is made according to a metric called purity. A node is 100% impure when its data is split evenly 50/50 between classes, and 100% pure when all of its data belongs to a single class.
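Here is a minimal sketch of that greedy, recursive idea in plain Python. The helper names (purity, grow) are hypothetical, and candidate splits are scored by the weighted purity of the two children; this is only an illustration of the idea, not scikit-learn’s actual implementation:

```python
from collections import Counter

def purity(labels):
    """Fraction of a node's data in its majority class: 1.0 = pure, 0.5 = a 50/50 split."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

def grow(rows, labels):
    """Greedy, recursive splitting in the spirit of Hunt's algorithm."""
    if purity(labels) == 1.0:                          # all one class: make a leaf
        return labels[0]
    best_score, best_split = -1, None
    for f in range(len(rows[0])):                      # greedy: try every feature...
        for t in sorted({r[f] for r in rows})[:-1]:    # ...and every candidate threshold
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            # score the split by the size-weighted purity of the two children
            score = (len(left) * purity([labels[i] for i in left]) +
                     len(right) * purity([labels[i] for i in right])) / len(labels)
            if score > best_score:
                best_score, best_split = score, (f, t, left, right)
    if best_split is None:                             # nothing left to split on
        return Counter(labels).most_common(1)[0][0]
    f, t, left, right = best_split
    return {"question": (f, t),
            "yes": grow([rows[i] for i in left], [labels[i] for i in left]),
            "no": grow([rows[i] for i in right], [labels[i] for i in right])}

rows = [[30_000, 1], [45_000, 3], [60_000, 4], [80_000, 10]]   # [income, years in job]
labels = ["no", "yes", "yes", "no"]
print(grow(rows, labels))   # a nested dict of questions ending in class labels
```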

In order to optimize our model we need to reach maximum purity and avoid impurity. To measure this we use the Gini impurity, which measures how often a randomly chosen element would be labeled incorrectly if it were labeled randomly according to the distribution of labels in the node. It is calculated by summing, over each class i, the probability pi of an item with label i being chosen multiplied by the probability (1–pi) of miscategorizing that item. Our goal is for it to reach 0, where the node is minimally impure and maximally pure, with all of its data falling into one category.
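A minimal sketch of that calculation in Python (the formula being the sum of pi * (1 – pi) over the classes, which is the same as 1 minus the sum of pi squared); the example labels are made up:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: sum over classes of p_i * (1 - p_i)  ==  1 - sum(p_i ** 2)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

print(gini_impurity(["yes", "yes", "yes", "yes"]))   # 0.0 -> perfectly pure node
print(gini_impurity(["yes", "yes", "no", "no"]))     # 0.5 -> even 50/50 split, maximally impure for two classes
```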

The other metric used is information gain, which is used to decide which feature to split on at each step in the tree. It is calculated as follows, in a nicely laid out equation from Wikipedia:

Information Gain = Entropy(parent) - Weighted Sum of Entropy(Children).
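To make the equation concrete, here is a minimal Python sketch of it, using Shannon entropy and a made-up parent node split perfectly into two children (the helper names entropy and information_gain are just illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of the class distribution at a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy(parent) minus the size-weighted sum of the children's entropies."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes", "yes", "yes", "no", "no", "no"]
left, right = ["yes", "yes", "yes"], ["no", "no", "no"]   # a perfect split
print(information_gain(parent, [left, right]))            # 1.0: the split removes all uncertainty
```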

While this is a great model, it presents a large problem: it only stops splitting once every node contains a single class or attribute. At the expense of bias, the variance of this model is massive and will definitely lead to overfitting; “Decision-tree learners can create over-complex trees that do not generalize well from the training data.” So how do we combat this? We can set a maximum depth for the decision tree (i.e. how many nodes deep it will go; the loan tree above has a depth of 3), and/or specify the minimum number of data points needed to make a split at each decision, as the sketch below shows.
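In scikit-learn those two guards correspond to the max_depth and min_samples_split parameters of DecisionTreeClassifier. A minimal sketch, using the bundled iris dataset rather than real loan data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Cap how deep the tree may grow and how small a node may be before it splits,
# trading a little bias for much lower variance (less overfitting).
tree = DecisionTreeClassifier(max_depth=3, min_samples_split=10, random_state=0)
tree.fit(X_train, y_train)

print("train accuracy:", tree.score(X_train, y_train))
print("test accuracy: ", tree.score(X_test, y_test))
```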

What other disadvantages does a Decision Tree have? It is locally optimized using a greedy algorithm, so we cannot guarantee it returns the globally optimal decision tree. It is also an incredibly biased model if a single class dominates, unless the dataset is balanced before putting it in a tree.

While there are disadvantages there are many advantages to Decision Trees.

They are incredibly simple to understand due to their visual representation, they require very little data, they can handle qualitative and quantitative data, they can be validated using statistical tests, they can handle large amounts of data, and they are quite computationally inexpensive.

I hope this article has helped you better understand Decision Trees. For the coding and more knowledge about them, I highly encourage you to check out Scikit-Learn’s documentation on Decision Trees.