How to Improve Decision Tree Performance with PCA

Sam Erickson
3 min read · Feb 18, 2024

Classification and regression trees (CART), and methods derived from them such as random forests and gradient boosted trees, do very well on a variety of problem types. However, CART-based models have a major weakness: they fit inefficiently to data whose structure does not line up with the coordinate axes.

What does this mean? First, let’s be clearer about what it means for data to align, or fail to align, with its axes. The following are 2D plots of data that aligns perfectly with the x-axis or the y-axis:

Note how we can subtract 2 from the line in the left plot to turn it into the x-axis, and subtract 1 from the line in the right plot to turn it into the y-axis. In that sense the data is aligned with an axis. A more complex case that still counts as aligned is data from a step function, like the following:

Data that does not perfectly align with an axis is illustrated in the following:

In the above examples, the data does not perfectly align with either the x-axis or the y-axis, because we cannot translate all of the curves, or even parts of them, onto an axis.
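To make the distinction concrete, here is a minimal sketch (not a reproduction of the plots above) that fits a regression tree to a step function and to a sloped line. It assumes scikit-learn is installed; the data, the step levels 0, 1 and 3, and all variable names are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical 1D data: 300 evenly spaced points on [0, 3].
x = np.linspace(0.0, 3.0, 300).reshape(-1, 1)

# Step function: made of flat pieces, so it is "aligned" in the sense above.
y_step = np.where(x[:, 0] < 1.0, 0.0, np.where(x[:, 0] < 2.0, 1.0, 3.0))

# Ramp: a sloped line that no flat piece can match exactly.
y_ramp = x[:, 0].copy()

# Fully grown trees (no depth limit) reproduce both training targets exactly.
step_tree = DecisionTreeRegressor(random_state=0).fit(x, y_step)
ramp_tree = DecisionTreeRegressor(random_state=0).fit(x, y_ramp)

step_leaves = step_tree.get_n_leaves()
ramp_leaves = ramp_tree.get_n_leaves()
print("leaves needed for the step function:", step_leaves)
print("leaves needed for the ramp:", ramp_leaves)
```

Each leaf of a regression tree predicts a single constant, so the step function needs only one leaf per flat piece, while the ramp needs far more leaves to approximate the slope.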

CART-based models do best on data that aligns perfectly with some of the axes, because the rule at each internal node splits the space along a single axis, as in the following:

Credit: https://towardsdatascience.com/decision-tree-models-934474910aec
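The same idea can be checked in code. The sketch below assumes scikit-learn and uses synthetic 2D data (the labels, the 0.5 threshold, and the variable names are illustrative choices, not taken from the figure). It fits one tree to a class boundary that is parallel to an axis and another to a diagonal boundary, then inspects the root split and the number of leaves.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 2))  # hypothetical features x0, x1

# Axis-aligned boundary: the class depends on x0 alone.
y_aligned = (X[:, 0] > 0.5).astype(int)
# Diagonal boundary: the class depends on x0 + x1.
y_diagonal = (X[:, 0] + X[:, 1] > 1.0).astype(int)

tree_aligned = DecisionTreeClassifier(random_state=0).fit(X, y_aligned)
tree_diagonal = DecisionTreeClassifier(random_state=0).fit(X, y_diagonal)

# Every internal node tests one feature against one threshold.
root_feature = tree_aligned.tree_.feature[0]
root_threshold = tree_aligned.tree_.threshold[0]
print("root split: feature", root_feature, "<=", round(root_threshold, 3))

aligned_leaves = tree_aligned.get_n_leaves()
diagonal_leaves = tree_diagonal.get_n_leaves()
print("leaves, axis-aligned boundary:", aligned_leaves)
print("leaves, diagonal boundary:", diagonal_leaves)
```

A single split on x0 is enough for the axis-aligned boundary, whereas the diagonal boundary has to be approximated by a staircase of many axis-parallel splits.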

So what happens when the data does not align well with an axis? If you are using gradient boosting or random forests, the model will probably still perform well, but the splitting decisions become unnecessarily complex, and this may cause the trees to grow too large. When the data is linear but not aligned with an axis, it is desirable to rotate it so that it does align. It may also be desirable to reduce the dimensionality with PCA, since PCA re-expresses the data in a rotated coordinate system whose axes are the principal components, as the following figure illustrates:

Credit: https://www.baeldung.com/cs/principal-component-analysis
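As a rough illustration of the rotation idea, the sketch below builds synthetic data that is stretched along a 45-degree diagonal and fits a decision tree before and after applying PCA. It assumes scikit-learn; the spreads (3.0 and 0.3), the sample size, and the variable names are arbitrary choices. It also assumes, favorably, that the class boundary is perpendicular to the direction of largest variance. PCA ignores the labels, so it will not help in every case.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000

# Hypothetical latent coordinates: s runs along the diagonal, t across it.
s = rng.normal(0.0, 3.0, size=n)  # large spread along the diagonal
t = rng.normal(0.0, 0.3, size=n)  # small spread across it

# Rotate by 45 degrees so that neither observed feature lines up with s.
X = np.column_stack([(s - t) / np.sqrt(2.0), (s + t) / np.sqrt(2.0)])
y = (s > 0).astype(int)  # the class boundary is the line x0 + x1 = 0

# PCA rotates (and centers) the data onto its principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

raw_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
pca_tree = DecisionTreeClassifier(random_state=0).fit(X_pca, y)

raw_leaves = raw_tree.get_n_leaves()
pca_leaves = pca_tree.get_n_leaves()
print("first principal component:", pca.components_[0])
print("leaves on raw features:", raw_leaves)
print("leaves on PCA features:", pca_leaves)
```

In this setup the first principal component points roughly along the diagonal, so after the rotation the boundary becomes (almost) a single threshold on one feature and the tree needs far fewer leaves.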
