Supervised Learning in R: Random Forest

Fatih Emre Ozturk, MSc
5 min read · Dec 29, 2023

In the previous post we talked about Bagging. As a reminder, and to motivate the basic idea behind Random Forest, suppose we have a dataset in which one predictor is very dominant. In Bagging, this strong predictor will be used in the top split of most or all of the trees, so the resulting trees will look quite similar to each other. The predictions from the bagged trees will therefore be highly correlated, and averaging many highly correlated predictions does not lead to a large reduction in variance. In such a situation bagging will not give very good results, and the better option is to use a random forest [1].

Random Forest (RF) is an ensemble learning method that constructs many decision trees and combines their predictions to improve overall accuracy and reduce overfitting. Its distinguishing feature is that, at each split, only a randomly selected subset of m of the p predictors is considered as split candidates. In regression, m is typically chosen as approximately p/3, while for classification sqrt(p) is preferred [1], [2], [3].
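As a quick worked example, with the nine predictors used in the Hitters example later in this post, both rules of thumb give m = 3:

p <- 9            # number of predictors in the Hitters example below
floor(p / 3)      # regression rule of thumb: m = 3
floor(sqrt(p))    # classification rule of thumb: m = 3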

How Random Forest Works: Step by Step

1. Data selection

The algorithm starts by drawing a bootstrap sample, i.e. a random sample of the training data taken with replacement, and grows a decision tree on it. This process is repeated many times, and the resulting trees are combined to make the final prediction.

2. Feature selection

At each node of each tree, the algorithm considers only the randomly selected subset of features and chooses the best one to split on, based on a metric such as entropy or Gini impurity (or, for regression trees, the reduction in the residual sum of squares). Restricting each split to a random subset of features decorrelates the trees, which helps reduce overfitting and improves the generalization of the model.

Entropy measures the impurity or randomness of a set of examples, while Gini impurity measures the probability of misclassifying a randomly chosen example if it were labeled according to the class distribution of the set. Both metrics are used to determine the best splitting feature at each node of the decision tree: the feature with the highest information gain, or the largest reduction in impurity, is selected (a small numerical example follows this list).

3. Tree construction

The algorithm grows the decision tree by recursively splitting the data based on the selected features. This process continues until a stopping condition is met, such as when all instances in a node belong to the same class (the node is pure) or when a node contains fewer observations than a minimum node size. In a random forest the individual trees are typically grown deep and left unpruned.

4. Combining predictions

Once all decision trees are grown, the algorithm combines their predictions to produce the final output: for regression the predictions of all trees are averaged, while for classification a majority vote over the trees is taken. (A toy sketch of the whole procedure is given below.)
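To make the splitting metrics from step 2 concrete, here is a small, self-contained R calculation on a made-up vector of five class labels; the gini and entropy helpers are written here purely for illustration:

labels <- c("A", "A", "A", "B", "B")     # toy class labels

gini <- function(y) {
  p <- table(y) / length(y)              # class proportions
  1 - sum(p^2)                           # Gini impurity
}

entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log2(p))                      # entropy in bits
}

gini(labels)      # 1 - (0.6^2 + 0.4^2) = 0.48
entropy(labels)   # -(0.6*log2(0.6) + 0.4*log2(0.4)), about 0.971

The split that reduces such an impurity measure the most (weighted by the sizes of the resulting child nodes) is the one the tree chooses.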

Random Forest performs well on a wide range of datasets, including those with imbalanced classes, high dimensionality, and complex patterns. It can handle both numerical and categorical features, which makes it suitable for a wide variety of machine learning tasks [1], [3].
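Before moving to the randomForest package, the following toy sketch shows the mechanics of steps 1, 3 and 4 by hand: it draws bootstrap samples, grows one tree per sample with the rpart package, and averages the trees' predictions. The built-in mtcars data and the unrestricted rpart trees are used purely for illustration; this is not how randomForest is implemented internally, and it skips the per-split feature sampling of step 2.

library(rpart)   # single decision trees, used only for this illustration

set.seed(1)
n_trees <- 25
preds <- matrix(NA, nrow = nrow(mtcars), ncol = n_trees)

for (b in 1:n_trees) {
  boot_idx <- sample(nrow(mtcars), replace = TRUE)   # step 1: bootstrap sample
  tree <- rpart(mpg ~ wt + hp + disp,                # step 3: grow a tree on the sample
                data = mtcars[boot_idx, ],
                control = rpart.control(minsplit = 5))
  preds[, b] <- predict(tree, newdata = mtcars)      # this tree's predictions
}

ensemble_pred <- rowMeans(preds)                     # step 4: average across all trees

Averaging over the 25 trees smooths out the instability of any single tree, which is exactly the variance reduction that bagging and random forests rely on.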

Random Forest in R

Random Forest can be applied in R using the randomForest package, which provides functions for building and inspecting Random Forest models. We will again use the Hitters dataset from the ISLR package. Using this dataset, which contains various information about baseball players, we will try to predict players’ salaries (on the log scale) using nine predictors.

Data preparation:

library(ISLR)
df <- Hitters
df <- na.omit(df)            # drop rows with missing values (Salary has NAs)
df$Salary <- log(df$Salary)  # log-transform the response

library(dplyr)
df <- df %>% select(Salary, Years, Hits, Runs, RBI, Assists, Errors, AtBat, HmRun, Walks)  # response plus nine predictors

smp_size <- floor(0.75 * nrow(df))  # 75% of the rows for training
set.seed(2021900444)                # for a reproducible split
train_ind <- sample(nrow(df), size = smp_size, replace = FALSE)
train <- df[train_ind, ]
test <- df[-train_ind, ]

Building the model:

library(randomForest)
rfmodel <- randomForest(
  Salary ~ Years + Hits + Runs + RBI + Assists + Errors + AtBat + HmRun + Walks,  # regression formula
  data = train,         # training set
  mtry = 3,             # m = p/3 predictors tried at each split
  importance = TRUE     # compute variable importance
)
rfmodel
Call:
randomForest(formula = Salary ~ Years + Hits + Runs + RBI + Assists + Errors + AtBat + HmRun + Walks, data = train, mtry = 3, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 3

Mean of squared residuals: 0.2696054
% Var explained: 63.75

The printed output can be read as follows:

  1. Call shows the parameters used when the randomForest function was called: the formula (Salary ~ Years + Hits + Runs + RBI + Assists + Errors + AtBat + HmRun + Walks), the data used (train), the number of variables tried at each split (mtry = 3), and that importance measures should be calculated (importance = TRUE).
  2. Type of random forest indicates that this is a regression random forest, used for predicting a continuous outcome (here, the log of players’ salaries).
  3. Number of trees is the number of decision trees grown in the forest (500, the default).
  4. No. of variables tried at each split is the number of predictors randomly selected as split candidates at each node; here 3 variables are tried at each split.
  5. Mean of squared residuals, 0.2696054, is the average squared difference between predicted and actual log-salaries, computed on the out-of-bag (OOB) observations, i.e. the training rows not included in each tree’s bootstrap sample. Lower values indicate a better fit.
  6. % Var explained is the percentage of variance in the response (Salary) explained by the model, again based on the OOB predictions; here the model explains 63.75% of the variance.
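Since we set aside a test set earlier, we can also check how the model performs on data it has never seen; a minimal sketch:

pred_test <- predict(rfmodel, newdata = test)   # predictions for the held-out players
mean((test$Salary - pred_test)^2)               # test-set MSE on the log-salary scale

A test MSE close to the out-of-bag mean of squared residuals above suggests the model generalizes about as well as the OOB estimate indicates.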

We can inspect variable importance as follows. %IncMSE is the mean increase in out-of-bag MSE when the values of that variable are randomly permuted (higher means more important), and IncNodePurity is the total decrease in node impurity (residual sum of squares, for regression) from splits on that variable, averaged over all trees:

importance(rfmodel)
          %IncMSE IncNodePurity
Years   62.033210     58.850562
Hits    13.952948     15.117001
Runs     9.754573      9.708710
RBI     10.996043     17.717318
Assists  2.907466      4.718365
Errors   2.952730      5.294955
AtBat   12.208755     10.584976
HmRun    3.701014      7.399147
Walks   13.619357     12.841019

It is also possible to visualize variable importance with varImpPlot, which draws one panel per importance measure. Both the table and the plot show that Years is by far the most important predictor of (log) salary:

varImpPlot(rfmodel)

Just like always:

“In case I don’t see ya, good afternoon, good evening, and good night!”

References

[1] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112, p. 344). New York: Springer.
[2] Alduailij, M., Khan, Q. W., Tahir, M., Sardaraz, M., Alduailij, M., & Malik, F. (2022). Machine-learning-based DDoS attack detection using mutual information and random forest feature importance method. Symmetry, 14(6), 1095.
[3] Rimal, Y. (2019, March). Machine learning random forest cluster analysis for large overfitting data: Using R programming. In 2019 6th International Conference on Computing for Sustainable Global Development (INDIACom) (pp. 1265–1271). IEEE.
[4] Bakdi, A., Kristensen, N. B., & Stakkeland, M. (2022). Multiple instance learning with random forest for event logs analysis and predictive maintenance in ship electric propulsion system. IEEE Transactions on Industrial Informatics, 18(11), 7718–7728.
