Random Forest vs Logistic Regression

Comparison of the algorithms

Bemali Wickramanayake
5 min read · Apr 17, 2020

Among classification problems in machine learning, Random Forest and Logistic Regression are two beginner-friendly and very popular algorithms.

First, let’s understand what a classification problem is.

A classification problem is simply when you need to classify an observation into one of the pre-defined categories, depending on the features of that observation.

E.g: Predicting if a customer will take up your next offer

  • Observation — The customer
  • Pre-defined categories — Will take up the offer (1) / Will not take up the offer (0)
  • Features — What we already know about the customer: their past purchase patterns, demographic information, etc.

Overview of the algorithms

Random Forest

Random Forest is an extension of a simple decision tree; the only difference is that this algorithm returns the combined result of many such trees, hence the word ‘Forest’.

A single decision tree looks at all the features by itself to classify the observation. But each tree in a Random Forest model will only look at a randomly selected subset of the complete feature set to reach its decision, hence the word ‘Random’.

What improves the performance of a Random Forest model over a single decision tree is that randomly selecting feature subsets makes the trees less correlated with one another, so combining their votes reduces overfitting and improves the overall accuracy of the result.
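As a rough sketch of the idea (toy data generated for illustration), scikit-learn’s max_features parameter controls how many features each split in each tree is allowed to consider, which is what makes the forest “random”:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy dataset: 200 observations, 10 features.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# max_features="sqrt": each split considers only sqrt(10) ~ 3 randomly
# chosen features instead of all 10.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=42)
forest.fit(X, y)

print(forest.score(X, y))  # training accuracy
```

Each of the 100 trees votes, and the forest returns the majority class.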

Logistic Regression

Logistic Regression does not directly predict the exact category your observation should be in; instead, it gives you the probability that each observation falls into the category ‘1’.

The probability is predicted via a simple mathematical calculation, which looks as follows:

P (probability of being ‘1’) = 1 / (1 + e^(−Z)), where

Z = C + a1X1 + a2X2 + …. + anXn

Here X1, X2, …, Xn are the features of the observation, C is the intercept, and a1, a2, …, an are the ‘weights’ of each feature. The higher the weight of a feature, the more prominent that feature is in making the final decision.
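The calculation above can be sketched in a few lines of plain Python (the function name is ours, for illustration):

```python
import math

def logistic_probability(features, weights, intercept):
    """P(y=1) = 1 / (1 + e^(-Z)), with Z = C + a1*X1 + ... + an*Xn."""
    z = intercept + sum(a * x for a, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Z = 0 gives exactly 0.5; a large positive Z pushes P towards 1.
print(logistic_probability([0.0], [1.0], 0.0))  # 0.5
```

Note how the weighted sum Z can be any real number, but the result is always squeezed into the (0, 1) range, which is what lets us read it as a probability.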

Let’s review how each of the models behave in different contexts

Availability of the algorithms to use

If you’re using Python, both algorithms are readily available in the scikit-learn library (https://scikit-learn.org/stable/).
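Both classifiers share the same fit/predict interface, so swapping one for the other is a one-line change. A minimal sketch (toy data invented for illustration; max_iter raised only to ensure convergence):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Toy binary classification dataset.
X, y = make_classification(n_samples=100, random_state=0)

for model in (RandomForestClassifier(random_state=0),
              LogisticRegression(max_iter=1000)):
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))
```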

Handling Categorical features

Categorical features are those that place each observation into one of a finite set of categories.

E.g: Gender, Country of origin

Most of these features appear as ‘text’ in raw observations, but both of the above models accept only numerical data.

Random Forest — Encoding each category with a numerical value allows the model to work with categorical features.

Logistic Regression — Since Logistic Regression depends on a calculation based on ‘weights’, numerical encoding of categorical variables can lead the algorithm to treat certain categories as more important than others, depending on the number assigned.

E.g: Let’s say we need to classify whether a fruit is poisonous based on a set of features, which include numerical features such as ‘diameter of the stone (seed)’, ‘thickness of the skin’, ‘time taken to fully ripen’ and categorical features such as ‘colour of the skin’.

Let’s take:

Target = 1 if poisonous, 0 if not

Colour of the skin ∈ {Red, Green, Yellow}

If we encode them as

  • Red = 1
  • Green = 2
  • Yellow = 3

Then the model will assume that the colour being Yellow gives the fruit a higher chance of being poisonous, solely due to the values we assigned.

To avoid this, in logistic regression (and other ‘weight’-based algorithms) we use a method called ‘one-hot encoding’. The process involves creating a new column for each category, where the column value is ‘1’ if the observation falls into that category.

Below is how the raw data looks (illustrative values):

Fruit | Colour of the skin
1 | Red
2 | Green
3 | Yellow

And below is the one-hot-encoded data:

Fruit | Red | Green | Yellow
1 | 1 | 0 | 0
2 | 0 | 1 | 0
3 | 0 | 0 | 1

Ability to extrapolate

Random Forest performs well if the values of the numerical features in the test data are within or close to the range of the training data. However, it can fail to classify correctly when the test data lies far outside the training range, because a tree can only reproduce decisions it learned within that range.

In contrast, Logistic Regression can perform well even when the numerical features of the test data lie well outside the range of the training data, because it is built on an arithmetic function that extrapolates naturally.

Flexibility in classifying the result

The output of the Random Forest model is a classified result, as 1 or 0. The output of the Logistic regression is a probability of the observation falling into the category.

Therefore, the latter gives us greater flexibility in deciding how to classify the output: we can change the threshold probability depending on the application (the default threshold is generally 0.5).

Eg:

Application: Predicting if a patient has a particular disease depending on symptoms.

Context: We have enough funds to treat the patients, and the treatments have very few side effects. But if left untreated, the disease can be fatal. Therefore it is acceptable to predict erroneously that a patient has the disease, but we cannot misclassify a patient who has the disease as healthy.

What to do: Reduce the decision threshold of the output until the false-negative rate (% of patients misclassified as healthy) approaches 0.
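A sketch of this thresholding with scikit-learn (toy data and the 0.2 threshold are invented for illustration; a real application would pick the threshold from a validation set):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy "patients": class 1 = has the disease.
X, y = make_classification(n_samples=300, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Probability of class 1 for each patient.
proba = model.predict_proba(X)[:, 1]

default_preds = (proba >= 0.5).astype(int)   # default threshold
cautious_preds = (proba >= 0.2).astype(int)  # lower threshold: fewer missed cases

print(default_preds.sum(), cautious_preds.sum())
```

Lowering the threshold can only flag more (or equally many) patients as positive, which is exactly the trade-off described above: fewer false negatives at the cost of more false positives.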

Multiclass classifications

If your target has more than 2 classes, then:

Random Forest can classify your data into each of them with just one model.

Logistic Regression — the algorithm is inherently binary. For n classes, the common one-vs-rest approach trains n binary models, one per class (scikit-learn can also fit a single multinomial model that handles all classes at once).
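Either way, scikit-learn hides this behind the same API, so from the user’s side both look like a single model. A sketch on the 3-class iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # 3 target classes

# Random Forest handles multiclass natively; LogisticRegression fits
# one-vs-rest / multinomial models under the hood.
rf = RandomForestClassifier(random_state=0).fit(X, y)
lr = LogisticRegression(max_iter=1000).fit(X, y)

print(set(rf.predict(X)), set(lr.predict(X)))
```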

Deployment of the model

Random Forest — You will need to invoke the trained model itself from your client application to run predictions on new data. If the application is written in a different language, you will need to find a way to call your Python-based model from the app.

Logistic Regression — You can either invoke the model itself, or export the model coefficients and deploy the mathematical expression directly within your client application. This makes deploying the model outside Python environments easier and more intuitive.
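A sketch of exporting the coefficients: after fitting, the weights and intercept are plain numbers, and the prediction can be reproduced with the P = 1 / (1 + e^(−Z)) formula from earlier, with no scikit-learn needed at serving time (toy data and the helper function name are ours, for illustration):

```python
import math
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

weights = model.coef_[0]         # a1 ... an
intercept = model.intercept_[0]  # C

def predict_proba_manual(x):
    """Reproduce the model's probability from exported coefficients."""
    z = intercept + float(np.dot(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

# Matches the library's own probability for the same observation.
print(round(predict_proba_manual(X[0]), 6))
print(round(model.predict_proba(X[:1])[0, 1], 6))
```

The same handful of numbers can be pasted into Java, SQL, a spreadsheet, or any other environment, which is the portability advantage described above.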

The ultimate question: which model performs better?

It depends entirely on your data set. The only way to know is to test, iterate, and test again!


Bemali Wickramanayake

A business strategist and self-taught data visualization expert who runs a business helping other businesses make better decisions with data. And a reader.