Can’t Decide Between a Linear Regression or a Random Forest? Here, Let Me Help.

A Brief Guide for Choosing the Right Model for Your Business Problem

Harpreet Sahota
Published in Artificialis
5 min read · Jan 27, 2022


Photo by Geran de Klerk on Unsplash

It’s the first week on the job and you finally get to do machine learning in the real world!

Your stakeholders need your help with a problem they’ve been trying to solve with little success. You ask a lot of questions, making sure that you have translated the business problem into a machine learning problem. You conclude that this is a classic regression problem. Now you must find the right model to represent the data generating process.

Assuming you’ve already built a baseline model, choosing the right model depends on a number of factors.

I’ve been a machine learning practitioner since 2017, and since then I’ve made many bad model choices. I’ve used neural nets when XGBoost would have been good enough, used ensemble methods when a linear model would have worked just fine, and employed linear regression when simple business heuristics could have done a better job.

I don’t want you to make the same mistakes I’ve made.

In this blog post I’ll shed light on TWO considerations so that you can quickly arrive at the best decision for your use case: your data and the assumptions you’re willing to make — all within the context of linear regression and random forests.

What’s your data like?

Photo by Luke Chesser on Unsplash

You need data to build a machine learning model.

Without data you won’t be able to discover the relationships or patterns needed to build a machine learning model. Once you’ve got your data, you should start to question and examine the nature of your input (features) and what you’re trying to model (your target), as well as the relationship between them.

You should consider a few important aspects of the data as you explore it, specifically as it relates to the modeling process.

Number of examples

Linear models are a lot faster to train than random forest models.

I was once working on a data set that had 10 million rows. It was my first industrial application of machine learning and I had never worked on a data set that large before. The first model I tried to train: a random forest.

Although the default implementation fit in a reasonable time, the time to train quickly escalated when I started to tune hyperparameters.

I let RandomizedSearchCV run overnight only to see that four models had fully fitted — and that was with an i9 processor. Linear regression took a full 3 minutes. Lasso regression and tuning the alpha hyperparameter was also quick work.

Be mindful of the size of your data set.
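To make this concrete, here’s a minimal sketch of the kind of timing gap I’m describing. The dataset is synthetic and the model settings are illustrative, not the ones from my project:

```python
# Sketch: compare fit times of a linear model vs. a random forest on
# a synthetic regression problem. Sizes here are illustrative.
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=20_000, n_features=20, noise=10.0, random_state=0)

start = time.perf_counter()
LinearRegression().fit(X, y)
linear_time = time.perf_counter() - start

start = time.perf_counter()
RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0).fit(X, y)
forest_time = time.perf_counter() - start

print(f"linear regression: {linear_time:.2f}s, random forest: {forest_time:.2f}s")
```

The gap only widens once a hyperparameter search multiplies the forest’s fit time by the number of candidate configurations.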

Number of features

With linear regression your results can be compromised if you include too many variables.

If you put enough junk variables in a regression equation, one of them is bound to meet the threshold for statistical significance just by chance. The thing with junk variables is that they aren’t always recognized as junky.

Random forests tend to be a bit indifferent to the number of features you pass. If you create a large enough forest, on average, the junky combinations of features will be washed out, and the better-performing features will be called into action and improve accuracy.
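You can see this indifference to junk features directly in the forest’s importance scores. In this sketch (synthetic data, made-up feature counts), the target depends on only 3 of 10 features, and the forest assigns almost all of its importance to those 3:

```python
# Sketch: a random forest largely ignores pure-noise features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_informative = rng.normal(size=(2000, 3))   # features the target depends on
X_junk = rng.normal(size=(2000, 7))          # pure noise columns
y = 3 * X_informative[:, 0] - 2 * X_informative[:, 1] + X_informative[:, 2]
X = np.hstack([X_informative, X_junk])

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
importances = forest.feature_importances_
print(importances.round(3))  # first 3 dominate; the noise columns are near zero
```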

Other data considerations:

  • How many categorical features do you have? (Don’t fall into the dummy variable trap; it won’t play well with linear regression.)
  • How correlated are your features? (Linear regression can blow up if you have multicollinearity; random forest doesn’t mind as much.)
  • Check whether your features need to be scaled (random forest is scale invariant; linear regression is not).
  • Check for outliers in the target (linear regression is more sensitive to them than random forest).
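A few of these checks take only a couple of lines of pandas. Here’s a rough sketch on a toy DataFrame; the column names and the outlier threshold are purely illustrative:

```python
# Sketch: quick pre-modeling checks on a toy DataFrame with a numeric target.
import pandas as pd

df = pd.DataFrame({
    "sqft": [800, 950, 1200, 2000, 3100],
    "rooms": [2, 2, 3, 4, 6],
    "target": [100, 120, 150, 240, 900],
})
features = df.drop(columns="target")

# 1. Correlated features: high pairwise correlation hints at multicollinearity.
corr = features.corr().abs()
print(corr)

# 2. Feature scales: very different ranges matter for (regularized) linear
#    models, but not for random forests.
print(features.agg(["min", "max"]))

# 3. Outliers in the target: flag values far from the mean
#    (the z-score threshold here is illustrative).
z = (df["target"] - df["target"].mean()) / df["target"].std()
print(df.loc[z.abs() > 1.5])
```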

In general, if the relationship between your target and features is clear and easy to understand, opt for a linear regression. If you see a complex non-linear relationship, then opt for a random forest.

What assumptions are you willing to make?


Before we talk about assumptions, we need to distinguish between parametric and non-parametric models.

In a nutshell:

  • Parametric models make assumptions about the data distribution and have parameters that are inferred from the data
  • Non-parametric models have parameters associated with the algorithm itself, but they don’t require assumptions about your data distribution

Linear regression is a parametric model.

And parametric models require you to make assumptions. If you aren’t willing to make those assumptions, then don’t bother using that methodology. Listed below are the six fairly strict requirements for linear regression:

  • Linearity: the relationship between the features and the target is linear
  • Independence: the observations (and their errors) are independent of one another
  • No multicollinearity: the features are not highly correlated with each other
  • Exogeneity: the errors have mean zero and are uncorrelated with the features
  • Homoscedasticity: the errors have constant variance
  • Normality: the errors are normally distributed

If these assumptions aren’t satisfied or you’re not willing to make them, then it’s a sign that you should move to a random forest.
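You can sanity-check some of these assumptions from the residuals of a fitted model. Here’s a crude sketch on synthetic data; on a real project you’d plot residuals or reach for formal tests (e.g. Breusch–Pagan for homoscedasticity):

```python
# Sketch: crude residual checks for linearity and homoscedasticity.
# Data is synthetic and deliberately satisfies the assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 1, size=1000)  # linear signal, constant noise

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# With an intercept, OLS residuals average to zero by construction; a clear
# pattern in residuals vs. fitted values would signal non-linearity.
print(f"mean residual: {residuals.mean():.4f}")

# Rough homoscedasticity check: residual spread should be similar in the
# lower and upper halves of the feature range.
low = residuals[X[:, 0] < 5].std()
high = residuals[X[:, 0] >= 5].std()
print(f"residual std (low X): {low:.2f}, (high X): {high:.2f}")
```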

Conclusion

We could go on for another 1000 words talking about when to use what models.

Something to consider further is explainability/interpretability. You can interpret what linear regression is doing just by looking at the equation’s coefficients. For random forest you’d have to use something like SHAP, which isn’t that intuitive and could be hard to explain.
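Here’s what that interpretability looks like in practice. In this sketch (synthetic data, made-up feature names), each fitted coefficient reads directly as “holding the other feature fixed, one more unit of this feature changes the prediction by this much”:

```python
# Sketch: interpreting a linear regression straight from its coefficients.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
ad_spend = rng.uniform(0, 100, n)          # hypothetical feature
n_stores = rng.integers(1, 20, n)          # hypothetical feature
revenue = 4.0 * ad_spend + 12.0 * n_stores + rng.normal(0, 5, n)

X = np.column_stack([ad_spend, n_stores])
model = LinearRegression().fit(X, revenue)

# The fitted coefficients recover the true effects (~4 and ~12) and need
# no extra tooling to explain.
for name, coef in zip(["ad_spend", "n_stores"], model.coef_):
    print(f"{name}: {coef:.2f}")
```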

Another important consideration is generalizability. Random forests fail to generalize well when the problem space isn’t stable and simple. They don’t respond well to drift and are sensitive to small changes in the training data, which can lead to a trained model that doesn’t perform well in production.

There are a number of tradeoffs that you have to be willing to make, and I hope this post has helped you understand some of them.

Congratulations! You’ve made it to the end of the post.

Here are a couple of handy flash cards describing linear regression and random forest, as well as their advantages and disadvantages.

Feel free to keep, share, or ignore them!

Source: Author
Source: Author


Harpreet Sahota
Artificialis

🤖 Generative AI Hacker | 👨🏽‍💻 AI Engineer | Hacker-in-Residence at Voxel51