The Best of Both Worlds: Linear Model Trees

Logan Dillard
Convoy Tech
Published Mar 14, 2018 · 5 min read

The linear model tree (LMT) is one of my favorite ML models, and for good reason. Linear model trees combine linear models and decision trees to create a hybrid model that produces better predictions and leads to better insights than either model alone. A linear model tree is simply a decision tree with linear models at its nodes, which can be seen as a piecewise linear model whose knots are learned via a decision tree algorithm. LMTs can be used for regression problems (e.g. with linear regression models in place of population means) or classification problems (e.g. with logistic regression in place of population modes).
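To make that structure concrete, here is a minimal sketch of a regression LMT in scikit-learn. Note this is a simplification of my own, not the Convoy implementation: it uses a standard CART tree fit on the target to choose the splits, whereas a full LMT chooses splits to minimize the error of the leaf linear models. The class name and hyperparameter defaults are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

class SimpleLinearModelTree:
    """Sketch of a regression LMT: a decision tree routes each
    observation to a leaf, and a separate linear model is fit per leaf."""

    def __init__(self, max_depth=2, min_samples_leaf=30):
        self.tree = DecisionTreeRegressor(max_depth=max_depth,
                                          min_samples_leaf=min_samples_leaf)
        self.leaf_models = {}

    def fit(self, X, y):
        self.tree.fit(X, y)          # learn the knots (split points)
        leaves = self.tree.apply(X)  # leaf id for each training row
        for leaf in np.unique(leaves):
            mask = leaves == leaf
            # Fit an independent linear model on each leaf's subpopulation.
            self.leaf_models[leaf] = LinearRegression().fit(X[mask], y[mask])
        return self

    def predict(self, X):
        leaves = self.tree.apply(X)
        preds = np.empty(len(X))
        for leaf, model in self.leaf_models.items():
            mask = leaves == leaf
            if mask.any():
                preds[mask] = model.predict(X[mask])
        return preds
```

Even this naive version captures the key behavior: within each leaf the prediction is a linear function of the features, so a piecewise linear target can be fit exactly.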

Above is a heuristic chart of machine learning models along the axes of accuracy and interpretability. The upper-right quadrant is the best, with both high performance and high interpretability. This chart shows that LMTs are both highly interpretable and highly performant.

My usual tools of the trade are Python, scikit-learn, and pandas. However, scikit-learn does not include an LMT implementation, and I could not find an open source version, so I implemented one myself for our use at Convoy. The implementation is linked at the bottom of this post. In the remainder of the post we’ll compare LMTs to the other models shown in the chart above.

LMT vs. Others

Below we will demonstrate LMTs with the open source auto-mpg dataset, which describes the fuel consumption of 398 vehicles from the 1970s and early 1980s. We will predict fuel consumption (mpg) from vehicle weight, model year, horsepower, acceleration, engine displacement, and number of cylinders. The Jupyter notebook linked at the bottom of this post contains the full data exploration and model building; the results are summarized here.

The above table shows the performance of four different algorithms at the task of predicting mpg on this dataset. It is no surprise that Gradient Boosting Trees (GBT) performs best, as this algorithm often produces the best predictive performance. However, LMT performs very nearly as well, and as we will see below, it has other benefits. Linear regression and a single decision tree perform poorly compared to the other two models.
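A comparison harness behind a table like this can be sketched in a few lines. The post’s actual numbers come from the auto-mpg data; the sketch below instead uses scikit-learn’s synthetic `make_friedman1` problem so it runs standalone, which is my assumption, not the post’s setup. (The LMT itself is omitted here since it is a custom implementation.)

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem with nonlinear structure (stand-in data).
X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)

models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
}

# Cross-validated MSE per model (negate sklearn's neg_mean_squared_error).
mse = {
    name: -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    for name, model in models.items()
}
print(mse)
```

On a nonlinear problem like this, the same ordering the post reports tends to emerge: GBT in front, a single tree and a single linear model behind.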

LMT vs. GBT

GBT achieved the best predictive performance as measured by MSE. The next question is: what drives the gas mileage of these cars? We dig into this with variable importance on the GBT model and get the following:

GBT’s variable importance attribute tells us that weight is the most important feature, followed by horsepower, acceleration, displacement, and model_year, which are all similar. Unfortunately, GBT tells us nothing about the numerical magnitude or sign of each feature’s impact, nor about the relationships among these features.
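Extracting those importances is a one-liner in scikit-learn, sketched here on a synthetic stand-in problem rather than the auto-mpg data (my assumption). The limitation is visible in the numbers themselves: impurity-based importances are non-negative and sum to one, so they carry no sign or magnitude information.

```python
import pandas as pd
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

# Fit GBT on a synthetic stand-in for the auto-mpg features.
X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)
gbt = GradientBoostingRegressor(random_state=0).fit(X, y)

# Impurity-based importances: non-negative, sum to 1, and silent
# about the sign or functional form of each feature's effect.
importance = pd.Series(gbt.feature_importances_,
                       index=[f"x{i}" for i in range(X.shape[1])])
print(importance.sort_values(ascending=False))
```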

The LMT produces just 2 splits, for a total of 3 leaf nodes. It splits first at horsepower = 78, and for horsepower >= 78 it splits at horsepower = 97. We will call the three subpopulations low power, medium power, and high power.
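The routing implied by those two splits is tiny enough to write down directly. The split points and leaf labels below are the ones reported above; the per-leaf regression coefficients are omitted since the post does not list them numerically.

```python
def lmt_leaf(horsepower):
    """Route a vehicle to one of the LMT's three leaves using the two
    splits described in the post: horsepower = 78, then (for >= 78) 97."""
    if horsepower < 78:
        return "low power"
    if horsepower < 97:
        return "medium power"
    return "high power"
```

A full prediction would then apply the matching leaf’s linear model to the vehicle’s features.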

Inspecting the weights from the linear model tree gives us a very different understanding of what affects fuel efficiency than we got from the other models. While there are some commonalities across the different subpopulations that our LMT has identified, we also see some significant differences.

It’s important to note that the features differ in range and distribution between the low, medium, and high power groups. Refer to the graphic below and compare the feature distributions against the per-group weights above. The x-axis for each column is fixed to make the distributions easy to compare.

For all vehicles, weight has a large negative impact, which makes sense: fuel economy worsens as the mass the vehicle has to move increases. Model year has a large positive impact for all vehicles; presumably engine technology improved significantly over this period. These findings are similar to what we see in a single linear model. However, the magnitude of those impacts changes across the subpopulations our LMT has identified, and engine size and power have effects of different magnitudes in different subpopulations.

For low-power vehicles, model year has a huge positive impact, and we see that in this population fuel economy is very sensitive to engine displacement.

In the medium-power category, weight again has a huge negative impact but fuel economy only increases moderately with model year.

For high-power vehicles, weight has a much less significant impact, and the same can be said for model year. By comparison, the engine size and power features are more relevant in this population. Engine sizes here are also far more variable than in the other populations, so engine size ends up having a larger impact on predictions than the coefficients alone suggest.

Summary of LMT benefits

For my final words on Linear Model Trees, here is a summary of their benefits:

  • LMTs are powerfully interpretable. Get insights into linear and non-linear relationships in your data. This can lead to other modeling hypotheses or product ideas.
  • LMTs identify subpopulations with different behavior.
  • LMTs can easily identify and utilize linear relationships. Tree-based models (including Random Forests and Gradient Boosting Trees) take a lot of effort to learn a line because they fit a piecewise constant model, predicting the average of all observations in each leaf node; they therefore require many splits to approximate a linear relationship. Common linear relationships include: a customer’s spend this month is probably a function of their spend last month; sales this month a function of sales last month; cost a function of size; and, in trucking, $/mile.
  • Overfitting (high variance) can be avoided by using cross-validation to optimize the minimum node size and maximum tree depth.
  • LMTs can work well with a modest amount of data (compared to many nonlinear models).
  • LMTs often produce simple models that are easy to implement in a production system, even if that system is not written in the same language that you use for modeling.
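The point about trees struggling with lines is easy to demonstrate, assuming a noise-free linear target of my own construction: a depth-3 tree (8 leaves) can only approximate a straight line with a staircase of constants, while linear regression fits it exactly with two parameters.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# A perfectly linear, noise-free relationship.
X = np.linspace(0, 10, 300).reshape(-1, 1)
y = 3.0 * X.ravel()

lin = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)  # 8 leaves

# Evaluate on fresh points between the training knots.
X_new = np.linspace(0.05, 9.95, 100).reshape(-1, 1)
y_new = 3.0 * X_new.ravel()

lin_mse = np.mean((lin.predict(X_new) - y_new) ** 2)    # essentially zero
tree_mse = np.mean((tree.predict(X_new) - y_new) ** 2)  # staircase error
print(lin_mse, tree_mse)
```

Doubling the tree depth halves the step width but only quarters the error; the linear model (and hence an LMT leaf) gets it right immediately.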
