Optimal Classification Trees Paper Summary & Analysis
- The authors exploit the enormous speed-up (a factor of roughly 800 billion) in mixed-integer optimization, driven by advances in both hardware and solver algorithms. Although these methods were long deemed too expensive to be practical, the paper shows they are now solvable on real-world datasets and can even outperform classification and regression trees (CART).
A decision tree is a flowchart-like structure where every node represents a “test” on an attribute, each branch represents the outcome of a test, and each leaf node represents a class label, or the decision taken after considering all attributes.
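The flowchart-like structure described above can be sketched in a few lines. This is a minimal illustration with a hypothetical depth-2 tree, not a tree from the paper:

```python
# Minimal sketch of decision-tree prediction: each internal node tests one
# attribute against a threshold, each leaf stores a class label.

def predict(node, x):
    """Follow the test at each internal node until a leaf label is reached."""
    while isinstance(node, dict):          # internal node: a test on one attribute
        feature, threshold = node["feature"], node["threshold"]
        node = node["left"] if x[feature] < threshold else node["right"]
    return node                            # leaf: the class label

# A hypothetical depth-2 tree: first split on feature 0, then on feature 1.
tree = {
    "feature": 0, "threshold": 5.0,
    "left": {"feature": 1, "threshold": 2.0, "left": "A", "right": "B"},
    "right": "C",
}

print(predict(tree, [3.0, 1.0]))  # feature 0 < 5, feature 1 < 2 -> "A"
print(predict(tree, [7.0, 0.0]))  # feature 0 >= 5 -> "C"
```

Each root-to-leaf path reads as a conjunction of simple tests, which is what makes these models interpretable.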
Because they are created using a top-down approach, the first splits are made with no regard to future splits, meaning the resulting trees are rarely optimal. To address this, a pruning step is required after the tree is built. Constructing an optimal decision tree is NP-hard, which is why exact optimization was long considered impractical and the problem remained largely unexplored.
This paper formulates tree construction as a mixed-integer optimization (MIO) problem: a linear program in which one or more variables are constrained to take integer values.
MIO was computationally expensive in the past, but solvers such as Gurobi and CPLEX have improved dramatically over the last two decades, making MIO more practical than ever.
Motivated by the rapid development of optimization theory and by hardware improvements, the authors present an algorithm built on this MIO formulation.
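To make the "integer-constrained" idea concrete, here is a toy MIO solved by brute-force enumeration. This is purely illustrative: real solvers such as Gurobi and CPLEX use branch-and-bound and cutting planes rather than enumeration, and the objective and constraints below are made up.

```python
from itertools import product

# Toy integer program: maximize 3x + 2y subject to x + y <= 4,
# with x and y integers in {0, ..., 4}. Enumerating all integer
# candidates shows what "variables constrained to be integers" means.

def solve_toy_mio():
    best, best_val = None, float("-inf")
    for x, y in product(range(5), repeat=2):   # all integer candidates
        if x + y <= 4:                         # feasibility check
            val = 3 * x + 2 * y
            if val > best_val:
                best, best_val = (x, y), val
    return best, best_val

print(solve_toy_mio())  # ((4, 0), 12)
```

The combinatorial explosion of this enumeration is exactly what made MIO expensive historically; modern solvers prune most of the search space instead of visiting it.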
The researchers revisited the classical optimal decision tree creation problem using the state-of-the-art MIO formulation approach.
Previously, the most common approaches to building decision trees were heuristics such as top-down induction followed by pruning; no effective algorithm for finding a truly optimal tree within practical time limits had been proposed.
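For contrast with the MIO approach, here is a sketch of one step of greedy top-down induction (in the style of CART): pick the axis-aligned split that minimizes weighted Gini impurity, looking only one level ahead. The data and helper names are illustrative, not from the paper.

```python
# One greedy step of top-down induction: choose the single best
# axis-aligned split by weighted Gini impurity. Because the choice
# ignores all future splits, a poor early split cannot be repaired
# later except by pruning.

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(X, y):
    """Return (feature, threshold) minimizing weighted child impurity."""
    best = (None, None, float("inf"))
    for f in range(len(X[0])):
        for t in sorted(set(row[f] for row in X)):
            left = [y[i] for i, row in enumerate(X) if row[f] < t]
            right = [y[i] for i, row in enumerate(X) if row[f] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (f, t, score)
    return best[0], best[1]

X = [[1.0, 0.0], [2.0, 1.0], [8.0, 1.0], [9.0, 0.0]]
y = ["neg", "neg", "pos", "pos"]
print(best_split(X, y))  # (0, 8.0): split on feature 0 at threshold 8.0
```

The MIO formulation instead chooses all splits in the tree jointly, which is what "optimal" means in the paper's title.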
By relaxing the axis-aligned constraint on splits, the resulting classification method, Optimal Classification Trees with hyperplanes (OCT-H), allows splits on linear combinations of features and remains easy to train.
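The difference between the two split types is small in code. A minimal sketch, with made-up weights and thresholds rather than learned ones:

```python
# An axis-aligned split tests a single attribute (x[j] < b); a hyperplane
# split, as allowed in OCT-H, tests a linear combination (w . x < b).
# The weights and thresholds below are illustrative only.

def axis_aligned_split(x, feature, threshold):
    return x[feature] < threshold

def hyperplane_split(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) < b

x = [2.0, 3.0]
print(axis_aligned_split(x, 0, 2.5))        # True: 2.0 < 2.5
print(hyperplane_split(x, [1.0, -1.0], 0))  # True: 2.0 - 3.0 < 0
```

Hyperplane splits are more expressive (one split can separate data that would need many axis-aligned splits), at some cost to the per-node interpretability that makes axis-aligned trees attractive.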
The paper assessed its results on a variety of synthetically generated datasets, comparing optimal axis-aligned trees against standard decision trees, and linear-split (hyperplane) optimal trees against XGBoost and random forests, using accuracy as the metric of comparison. Below is a plot showing the performance of the methods averaged across 60 different tasks:
Potential Further Research and Applications
This paper could add further discussion of the computational limitations of the method. The paper states that the approach is feasible for datasets with thousands of observations, but offers little analysis of where that figure comes from or what the hard boundaries are.
Further research could also look into improving the accuracy of these models — many applications would benefit from a model that can clearly delineate the reasoning behind its diagnosis, but they may require a greater degree of performance across the board than what the model currently provides. Additionally, future research could look into improving its performance in lower-dimensional problems, as these problems could also benefit from interpretable results.
One other potential direction is to explore alternatives to top-down induction for choosing splits, so that the influence of future splits is taken into consideration. Though likely far more computationally expensive, such methods could avoid the greedy trap of top-down induction, which often yields trees that fail to capture the underlying characteristics of the data.
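The greedy trap can be demonstrated on XOR-like data, a standard illustration (not an example from the paper): no single axis-aligned split improves on predicting the majority class, yet a depth-2 tree separates the classes perfectly.

```python
# XOR-like data where greedy induction stalls: every candidate first split
# misclassifies 2 of the 4 points, no better than making no split at all,
# even though a depth-2 tree classifies the data perfectly.

def misclassified(X, y, feature, threshold):
    """Errors if each side of the split predicts its majority class."""
    sides = {True: [], False: []}
    for row, label in zip(X, y):
        sides[row[feature] < threshold].append(label)
    errors = 0
    for labels in sides.values():
        if labels:
            majority = max(set(labels), key=labels.count)
            errors += sum(1 for lbl in labels if lbl != majority)
    return errors

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]  # XOR labels

# Best single split over both features and all useful thresholds:
single = min(misclassified(X, y, f, t) for f in (0, 1) for t in (0.5, 1.5))
print(single)  # 2 -- greedy one-step lookahead sees no benefit to splitting

# A joint search (as in the MIO formulation) would still pick a first split,
# because it evaluates whole trees: split on feature 0, then on feature 1
# in each branch, and all four points are classified correctly.
```

This is exactly the failure mode joint optimization avoids: the value of a split here only appears two levels deep.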
Decision trees are useful in that they provide transparency into the learned rule, which is desirable for many real-world applications such as finance and healthcare. Decision trees with axis-aligned splits excel at this because we can follow the path and read off an easily interpretable boundary for each decision (e.g., price/earnings < 5 and revenue > 100,000 implies buy the stock). With the improvements set forth by this paper, we can create optimal decision trees more efficiently with MIO, capturing the patterns of the underlying dataset more readily than algorithms that may prioritize weaker splits and require significant pruning. The results of their experiments show that creating optimal decision trees is tractable and usable in practice.
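The stock example above can be written out as an explicit decision path. The thresholds come from the example in this summary, and the "hold" fallback label is a hypothetical addition, not output from any trained model:

```python
# The interpretable rule from the text as an explicit decision path:
# price/earnings < 5 AND revenue > 100,000 -> "buy". The alternative
# "hold" label is a made-up placeholder for the other leaves.

def stock_decision(price, earnings, revenue):
    if price / earnings < 5:        # first axis-aligned test
        if revenue > 100_000:       # second axis-aligned test
            return "buy"
        return "hold"
    return "hold"

print(stock_decision(price=20, earnings=5, revenue=150_000))  # "buy"
print(stock_decision(price=50, earnings=5, revenue=150_000))  # "hold"
```

Every prediction comes with a human-readable justification: the sequence of threshold tests on the path to its leaf.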