Parallels between Linear Regression, Random Forests, and Scientific Theory

Brayton Hall
5 min read · Mar 2, 2020


If you google ‘linear regression vs. random forests’, you will find an enormous amount of information, as well as a variety of explanations, concerning these two overarching methods of prediction. Most often, these comparisons cite the common advantages and disadvantages of each.

For linear regression, some advantages are:

  1. Interpretability. Coefficients represent the contribution of a feature to the target variable in a way that is often explanatory and in line with common sense (such as an extra inch of height adding two pounds to a person’s weight, on average).
  2. Extrapolation. A linear regression is a formula that can predict a continuous target variable for values far outside of the training set, since its coefficients can be multiplied by any input. In other words, it can produce predictions even for anomalous inputs (see the sketch after this list).
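
To make those two points concrete, here is a minimal sketch of my own (the height/weight numbers are made up purely for illustration), using scikit-learn: the fitted coefficient reads directly as ‘pounds per inch’, and the model will happily produce a prediction for a height far outside the training range.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Made-up data: height in inches vs. weight in pounds,
# roughly "two pounds per extra inch" plus noise.
height = rng.uniform(60, 75, size=(200, 1))
weight = 2.0 * height.ravel() - 20 + rng.normal(0, 5, size=200)

model = LinearRegression().fit(height, weight)

# Interpretability: the coefficient reads as "pounds added per extra inch".
print(f"pounds per inch: {model.coef_[0]:.2f}")

# Extrapolation: a prediction for a height never seen in training (90 inches).
print(model.predict([[90.0]]))
```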

Some disadvantages are:

  1. Non-linearity. Linear regression can’t easily pick up on feature interactions or non-linear relationships, and it struggles with issues such as collinearity or skewed, non-normal data, without transformations or feature engineering.
  2. Scaling and transformation. Data that is not normally distributed, or not homoscedastic with respect to the target variable, must be transformed before building the model, and the features often need to be scaled (a sketch of this preprocessing follows the list).
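
As a rough illustration of that preprocessing (synthetic data and arbitrary coefficients, purely to sketch the idea), one common pattern is to standardize the features and fit on a log-transformed target so the residuals look closer to normal and homoscedastic:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))                               # three arbitrary features
noise = rng.normal(0, 0.1, size=300)
y = np.exp(1.0 + X @ np.array([0.5, -0.3, 0.2]) + noise)    # right-skewed target

# Standardize the features and fit on log(y), so the relationship is linear
# again and the residuals look closer to normal / homoscedastic.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X, np.log(y))

print(pipe.named_steps["linearregression"].coef_)
```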

These advantages and disadvantages of linear regression stem largely from the fact that it is a parametric kind of modeling. Linear regression makes the crucial assumption that there exist parameters which actually do represent a population, and even if our model’s parameters are likely to be imperfect, parameters still exist. It is a fundamentally different kind of model from random forests and decision trees, and it expresses a kind of faith in some explanatory connection between the training data and the test data (‘domain knowledge’ is important precisely because it justifies that faith via deeper understanding of features).

Michelangelo — ‘Creation of the Linear Model’

On the other hand, some advantages of random forests are:

  1. Outperformance, plain and simple. Trees generally outperform linear regression on prediction, especially when the data has important underlying interactions and non-linear relationships.
  2. No need to scale or transform. Trees do NOT require transformations or scaling, since they pick up on non-linear relationships in the data automatically: not through parametric assumptions, but by aggregating many trees built on bootstrapped samples, with a random subset of features considered at each split (see the comparison sketch after this list).
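
Here is a small, illustrative comparison on synthetic data with an interaction and a non-linear term. The data and the resulting scores are made up and will vary from run to run; this is a sketch, not a benchmark.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(1000, 2))
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(0, 0.1, size=1000)  # interaction + non-linearity

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# No scaling or feature engineering for either model.
lr = LinearRegression().fit(X_tr, y_tr)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("linear regression R^2:", lr.score(X_te, y_te))
print("random forest R^2:    ", rf.score(X_te, y_te))
```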

Some disadvantages are:

  1. Complexity. A random forest is a black box. We can’t see why it attaches importance to one feature rather than another, so we largely sacrifice interpretability for the sake of predictive accuracy (there are no individual coefficients to read, as in linear regression, or individual branches to trace, as in a single decision tree).
  2. Overfitting and expense. Random forests tend to overfit, especially with greater tree depth, and training them on large amounts of data can be computationally expensive. Some common mitigations are sketched below.

Dark Random Forests — Sacrificing Interpretability for Accuracy
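
For what it’s worth, the usual partial mitigations look something like this sketch: inspect feature importances (a coarse proxy for interpretability, not an explanation) and cap tree depth and tree count to rein in overfitting and training cost. The parameter values here are illustrative guesses, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

rf = RandomForestRegressor(
    n_estimators=100,   # fewer trees -> cheaper training
    max_depth=6,        # shallower trees -> less variance / less overfitting
    random_state=0,
).fit(X, y)

# Importances sum to 1, but they do not explain *why* a feature matters.
print(rf.feature_importances_)
```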

So then, what is the point of this comparison? Does it actually provide actionable advice for when to use LR or RF? Overall, which model is ‘better’? When should you use which? The correct answer to those questions is, undoubtedly, it depends. It depends on what you’re looking for: predictive accuracy, intelligibility, minimizing computational expense, and so on.

But that’s not really the point of this blog post. I just started learning these modeling techniques, and there are many professionals around the world, in both business and research, who are far more experienced at using them effectively across a variety of fields.

What interests me is a connection between these modeling techniques and the scientific method. Model building is at least as old as pre-Socratic Greek philosophy, when Thales of Miletus posited that the underlying element of the universe was water.

Thales of Miletus

It turns out that this water-model of the universe wasn’t very good at making predictions. On the other hand, our most rigorous modern physical model, the Standard Model, is extremely good at making predictions, but it is nearly incomprehensible to anyone without a PhD in physics:

The Standard Model

Even so, at a very high level of abstraction, we can connect this kind of model to linear regression, complete with constants and coefficients representing the contribution of various fundamental forces and particle properties to an overall prediction:

Parameters of the Standard Model

Once again, I don’t understand anything specific in the above model, but there is a connection I’m starting to make here: the types of models we use in data science are simply a very specific subset of models of the world. The idea that a model is in fact capturing some true law or aspect of the world (whether in physics, biology, or the stock market) is, in the philosophy of science, called structural isomorphism.

This is not an immediately actionable insight; it won’t help us build better predictions of stock prices, and such philosophical observations often get dismissed as time-wasting rabbit holes. But the one point I want to drive home is that when we say an advantage of linear regression is ‘interpretability’, that doesn’t exclusively mean ‘easy to intuitively understand’ or ‘less complex’. It also means that, unlike with random forests, we are implicitly expressing faith in a connection between our model and some true structure in the world. This, to me, is a phenomenal virtue of linear regression: unlike a black box that may achieve better predictions, linear regression models at least aspire to be explanatory theories of the world. They are, in other words, beautiful.
