Supervised Learning in R: Bagging

Fatih Emre Ozturk, MSc
5 min read · Nov 14, 2023

--

In the previous post we talked about regression trees. As is well known, regression trees suffer from high variance: if we randomly split the training data into two halves and build a separate regression tree on each half, the two trees can give quite different results. Linear regression, by contrast, tends to have low variance when the ratio of the number of observations to the number of independent variables is reasonably large. Bagging, also known as Bootstrap Aggregation, aims to reduce the variance of a statistical learning method. It is especially useful for decision trees (both regression and classification trees).

Averaging a set of observations reduces the variance. Therefore, the most natural way to reduce the variance of a statistical learning algorithm, and ultimately improve its performance on test data, would be to take many training sets from the population, build a separate model on each of them, and average the resulting predictions. In other words, by building our model on X different training sets and averaging the predictions, we could obtain a single low-variance model.
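This is just the familiar fact that, for independent observations with variance sigma^2, their mean has variance sigma^2 / n. A quick simulation (illustrative only, not from the original post) shows the effect:

# averaging reduces variance: single draws vs. means of 25 draws
set.seed(1)
single <- rnorm(10000)                          # single observations, variance about 1
means  <- replicate(10000, mean(rnorm(25)))     # means of 25 observations, variance about 1/25
var(single)
var(means)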

However, we usually do not have multiple training sets, so this approach is not very practical. Instead, we can draw repeated samples from the single training set we have. In other words, we can bootstrap. Applying this idea to model building is what we call bagging.

Now, let’s take it step by step for clarity:

1. Select B random samples of the same size as the original sample, with replacement.
2. A regression tree is constructed for each of the B samples.
3. The predictions of B trees are averaged.

As you can see, in the second step we build a regression tree for each of the B samples. By nature, these trees can grow large. In the previous post we pruned our tree to reduce variance, but bagging does not involve any pruning, so each individual tree has high variance and low bias. When we average all of these trees, however, the variance drops. Bagging combines hundreds or even thousands of trees in this way to achieve impressive improvements in accuracy.
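To make the three steps concrete, here is a minimal hand-rolled sketch using rpart and the built-in mtcars data. It is illustrative only (the post's own Hitters example with the ipred package comes later), and all object names here are my own:

library(rpart)

set.seed(1)
B <- 100                        # number of bootstrap samples
n <- nrow(mtcars)
trees <- vector("list", B)

for (b in 1:B) {
  idx <- sample(n, size = n, replace = TRUE)            # step 1: bootstrap sample
  trees[[b]] <- rpart(mpg ~ hp + wt, data = mtcars[idx, ],
                      control = rpart.control(cp = 0, minsplit = 2))  # step 2: unpruned tree
}

# step 3: average the B predictions for each observation
pred_matrix <- sapply(trees, predict, newdata = mtcars)  # n x B matrix of predictions
bagged_pred <- rowMeans(pred_matrix)
head(bagged_pred)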

Out-of-Bag Error Estimation

Recall step one in this section: we draw repeated samples of the same size as the original sample, with replacement. Each time this is done, on average about two thirds of the original training observations end up in the bootstrap sample. In other words, roughly one third of the observations are not used to fit that tree. These left-out observations are called out-of-bag (OOB) observations.

For observation i, we can make a prediction with each of the trees for which that observation was out of bag, which gives us roughly B/3 predictions for observation i. Averaging these yields a single OOB prediction for observation i, and doing this for every observation lets us compute the OOB error. With the OOB error estimate, we can evaluate the performance of the model without a separate validation set.
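A rough sketch of this estimate, continuing the illustrative mtcars example from above (again, the names and setup are my own, not the post's):

library(rpart)

set.seed(1)
B <- 100
n <- nrow(mtcars)
oob_pred <- matrix(NA, nrow = n, ncol = B)    # per-tree predictions, filled only for OOB rows

for (b in 1:B) {
  idx  <- sample(n, size = n, replace = TRUE)
  tree <- rpart(mpg ~ hp + wt, data = mtcars[idx, ],
                control = rpart.control(cp = 0, minsplit = 2))
  oob  <- setdiff(1:n, idx)                   # observations left out of this bootstrap sample
  oob_pred[oob, b] <- predict(tree, newdata = mtcars[oob, ])
}

# average the roughly B/3 OOB predictions available for each observation
oob_avg  <- rowMeans(oob_pred, na.rm = TRUE)
oob_rmse <- sqrt(mean((mtcars$mpg - oob_avg)^2))
oob_rmse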

Variable Importance

Although bagging improves accuracy by building multiple trees and averaging their predictions, working with many trees has a drawback: there is no longer a single tree to display, which can make the resulting model difficult to interpret.

To address this problem, we can calculate variable importance. Each tree measures the importance of the variables within itself; averaging these importances over all trees gives the overall variable importance, which shows how much each variable contributes to the predictive performance of the model.
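One way to make the averaging idea concrete is the sketch below (my own illustration, not the post's method): rpart reports a variable.importance vector for each fitted tree, and averaging it over the bagged trees gives an overall ranking.

library(rpart)

set.seed(1)
B <- 100
n <- nrow(mtcars)
imp <- matrix(0, nrow = B, ncol = 2, dimnames = list(NULL, c("hp", "wt")))

for (b in 1:B) {
  idx  <- sample(n, size = n, replace = TRUE)
  tree <- rpart(mpg ~ hp + wt, data = mtcars[idx, ],
                control = rpart.control(cp = 0, minsplit = 2))
  vi <- tree$variable.importance
  if (!is.null(vi)) imp[b, names(vi)] <- vi   # variables never used in a split keep importance 0
}

colMeans(imp)                                 # importance averaged across the B trees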

Bagging in R

For bagging in R, we will use the Hitters dataset from the ISLR package, as in the regression tree post. Using this dataset, which contains various information about baseball players, we will try to predict players' salaries from two independent variables: Years and Hits. Years is the number of years a player has played in the major leagues, while Hits is the number of hits in the previous season. Since there are NAs in the dataset, we will simply remove them. To make the distribution of salary more bell-shaped, we will also log-transform it.

library(ISLR)
df <- Hitters
df <- na.omit(df)
df$Salary <- log(df$Salary)

df <- as.data.frame(df)

Train — test split:

smp_size <- floor(0.75 * nrow(df)) 
set.seed(2021900444)
train_ind <- sample(nrow(df), size = smp_size, replace = FALSE)
train <- df[train_ind, ]
test <- df[-train_ind, ]

In R, you can use the bagging function from the ipred package to build bagging models.

library(ipred)

bag <- bagging(
  formula = Salary ~ Years + Hits, # model formula
  data = train,                    # training data
  nbagg = 150,                     # number of bagged trees to create
  coob = TRUE                      # whether to compute the OOB error estimate
)

#display fitted bagged model
bag
Bagging regression trees with 150 bootstrap replications 

Call: bagging.data.frame(formula = Salary ~ Years + Hits, data = train,
nbagg = 150, coob = TRUE)

Out-of-bag estimate of root mean squared error: 0.5244

This output shows the configuration and the performance of the model. It can be interpreted as follows:

  • The model applies bagging to regression trees: multiple regression trees are created and combined into a single overall model.
  • It uses 150 bootstrap replications, meaning the bagging procedure draws 150 random samples (with replacement) from the training data, fits a tree to each, and combines them for the overall prediction.
  • The formula is Salary ~ Years + Hits, so the dependent variable is “Salary” and the independent variables are “Years” and “Hits”.
  • coob = TRUE means that out-of-bag predictions are calculated and used to measure the performance of the model.

The output also gives an indication of the model’s performance:

  • “Out-of-bag estimate of root mean squared error: 0.5244” means that, based on the out-of-bag predictions, the model’s root mean squared error is 0.5244. This tells us roughly how far off the model’s predictions are, on average, for observations it was not trained on. Note that this error is on the log(Salary) scale, since we log-transformed the response. A lower RMSE indicates better performance, so a value of 0.5244 suggests the model predicts reasonably well.
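Since we already set aside a test set, we can also compare the OOB estimate with the held-out test error. This is a quick sketch, not part of the original output; predict() on the fitted bagging object returns the averaged predictions:

test_pred <- predict(bag, newdata = test)
test_rmse <- sqrt(mean((test$Salary - test_pred)^2))
test_rmse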

You can also check variable importance with the varImp function from the caret package.

library(caret)

varImp(bag)
       Overall
Hits 1.505447
Years 1.312232

In this example, interpretation is straightforward because we have only two independent variables. With more variables, you can follow these steps to plot the variable importance:

#calculate variable importance
vi <- varImp(bag)
VI <- data.frame(var = rownames(vi), Overall = vi$Overall)

#sort variable importance descending
VI_plot <- VI[order(VI$Overall, decreasing = TRUE), ]

#visualize variable importance with horizontal bar plot
barplot(VI_plot$Overall,
        names.arg = VI_plot$var,
        horiz = TRUE,
        col = 'steelblue',
        xlab = 'Variable Importance')

Just like always:

“In case I don’t see ya, good afternoon, good evening, and good night!”

Reference and Further Reading:

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. New York: Springer.
