Ensemble Methods: Bagging and Pasting in Scikit-Learn

Silvan
Oct 15, 2019


Lots of trees (by kjpargeter)

In machine learning, multiple predictors grouped together sometimes have better predictive performance than any one of them alone. Techniques that exploit this are very popular in competitions and in production, and they are collectively called Ensemble Learning.

There are several ways to group models. They differ in the training algorithm and data used for each member, and in how the members are combined. In this article we’ll talk about two methods called Bagging and Pasting and how to implement them in scikit-learn.

But before we begin talking about Bagging and Pasting, we have to know what Bootstrapping is.

Bootstrapping

In statistics, bootstrapping refers to a resampling method that consists of repeatedly drawing samples, with replacement, from a dataset to form other, smaller datasets called bootstrap samples. It’s as if the bootstrapping method were running a bunch of simulations on our original dataset, so in some cases we can estimate statistics such as the mean and the standard deviation from them.

For example, let’s say we have a set of observations: [2, 4, 32, 8, 16]. If we want each bootstrap sample to contain n observations, the following are valid samples:

  • n=3: [32, 4, 4], [8, 16, 2], [2, 2, 2]…
  • n=4: [2, 32, 4, 16], [2, 4, 2, 8], [8, 32, 4, 2]…

Since we draw data with replacement, an observation can appear more than once in a single sample.
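
A minimal sketch of how to draw such samples with NumPy (the sample size and the number of samples drawn here are arbitrary choices):

import numpy as np

np.random.seed(42)
data = np.array([2, 4, 32, 8, 16])

# Draw three bootstrap samples of size n=3: sampling is done with
# replacement, so the same observation can appear more than once
for _ in range(3):
    print(np.random.choice(data, size=3, replace=True))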

Bagging & Pasting

Bagging stands for bootstrap aggregating. It is an ensemble method in which we first bootstrap our data and train one model on each bootstrap sample. After that, we aggregate the models’ predictions with equal weights. When the sampling is done without replacement, the method is called pasting.
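
In scikit-learn both variants are handled by the same BaggingClassifier class; only the bootstrap flag changes. A minimal sketch, assuming a toy dataset from make_classification and an 80% subset size for pasting (both arbitrary choices):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)

# Bagging: each tree is trained on a bootstrap sample (drawn with replacement)
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            bootstrap=True, random_state=42)

# Pasting: each tree is trained on a subset drawn without replacement
pasting = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            bootstrap=False, max_samples=0.8, random_state=42)

bagging.fit(X, y)
pasting.fit(X, y)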

Out-of-Bag Scoring

If we are using bagging, there’s a chance that a given sample is never selected, while others may be selected multiple times. The probability of not selecting a specific sample in one draw is (1 - 1/n), where n is the number of samples, so the probability of never picking it in n draws is (1 - 1/n)^n. When n is large, this probability approaches 1/e, which is approximately 0.368. This means that when the dataset is big enough, about 37% of its samples are never seen by a given predictor, and we can use them to test that predictor. This is called Out-of-Bag scoring, or OOB Scoring.
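
A quick numeric check of that limit (the values of n are arbitrary):

import math

# Probability that a specific sample is never drawn in n draws with replacement
for n in (10, 100, 1000, 100000):
    print(n, (1 - 1 / n) ** n)

print('1/e =', 1 / math.e)  # ~0.368

In scikit-learn, BaggingClassifier performs this evaluation automatically when it is created with oob_score=True (bootstrap must also be True), and exposes the result through its oob_score_ attribute.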

Random Forests

As the name suggests, a random forest is an ensemble of decision trees that can be used for classification or regression. In most cases it is trained with bagging. Each tree in the forest outputs a prediction, and the majority vote becomes the output of the model. This makes the model more accurate and stable, helping to prevent overfitting.

Another very useful property of random forests is the ability to measure the relative importance of each feature by calculating how much each one reduces the impurity of the model. This is called feature importance.
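
As a quick illustration (using scikit-learn’s built-in iris dataset rather than the census data we’ll use below), these values are exposed through the feature_importances_ attribute:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(iris.data, iris.target)

# One impurity-based importance per feature; they sum to 1
for name, imp in zip(iris.feature_names, forest.feature_importances_):
    print('{}: {:.3f}'.format(name, imp))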

A scikit-learn Example

To see how bagging works in scikit-learn, we will train a single model on its own and then an ensemble of aggregated models, so we can see whether the ensemble really helps.

In this example we’ll be using the 1994 US census income dataset. It contains information such as marital status, age, type of work and more. The target column is categorical and indicates whether a salary is less than or equal to 50k a year (0) or above it (1). Let’s explore the DataFrame with Pandas’ info method:

RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age 32561 non-null int64
workclass 32561 non-null object
fnlwgt 32561 non-null int64
education 32561 non-null object
education_num 32561 non-null int64
marital_status 32561 non-null object
occupation 32561 non-null object
relationship 32561 non-null object
race 32561 non-null object
sex 32561 non-null object
capital_gain 32561 non-null int64
capital_loss 32561 non-null int64
hours_per_week 32561 non-null int64
native_country 32561 non-null object
high_income 32561 non-null int8
dtypes: int64(6), int8(1), object(8)

As we can see, there are numerical (int64 and int8) and categorical (object) data types. We have to deal with each type separately before sending the data to the predictor.

Data Preparation

First we load the CSV file and convert the target column to integer category codes, so that when we pass the columns to the pipeline we don’t have to worry about the target column.

import numpy as np
import pandas as pd
# Load CSV
df = pd.read_csv('data/income.csv')
# Convert target to categorical
col = pd.Categorical(df.high_income)
df["high_income"] = col.codes

There are numerical and categorical columns in our dataset, and each needs different preprocessing: the numerical features need to be normalized and the categorical features need to be converted to integers. To do this, we define a transformer that preprocesses each column depending on its type.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MinMaxScaler
class PreprocessTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, cat_features, num_features):
        self.cat_features = cat_features
        self.num_features = num_features

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        df = X.copy()
        # Treat ? workclass as unknown
        df.loc[df['workclass'] == '?', 'workclass'] = 'Unknown'
        # Too many categories, just convert to US and Non-US
        df.loc[df['native_country'] != 'United-States', 'native_country'] = 'non_usa'
        # Convert categorical columns to integer codes
        for name in self.cat_features:
            col = pd.Categorical(df[name])
            df[name] = col.codes
        # Normalize numerical features
        scaler = MinMaxScaler()
        df[self.num_features] = scaler.fit_transform(df[self.num_features])
        return df
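
The transformer (and the pipeline further down) expects two lists of column names. The original post doesn’t show them, but based on the dtypes printed by info above they would look something like this:

# Column lists assumed from the dtypes shown by df.info() above
categorical_features = ['workclass', 'education', 'marital_status', 'occupation',
                        'relationship', 'race', 'sex', 'native_country']
numerical_features = ['age', 'fnlwgt', 'education_num', 'capital_gain',
                      'capital_loss', 'hours_per_week']

# Quick sanity check of the transformer on its own
preproc = PreprocessTransformer(categorical_features, numerical_features)
print(preproc.fit_transform(df.drop('high_income', axis=1)).head())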

The data is then split into training and test sets, so we can later check whether our model generalizes to unseen data.

from sklearn.model_selection import train_test_split

# Split the dataset into training and testing
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('high_income', axis=1),
    df['high_income'],
    test_size=0.2,
    random_state=42,
    shuffle=True,
    stratify=df['high_income']
)

Build the Model

Finally, we build our models. First we create a pipeline that preprocesses the data with our custom transformer, selects the best features with SelectKBest and trains a predictor.

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
random_state = 42
leaf_nodes = 5
num_features = 10
num_estimators = 100

# Decision tree for bagging
tree_clf = DecisionTreeClassifier(
    splitter='random',
    max_leaf_nodes=leaf_nodes,
    random_state=random_state
)

# Initialize the bagging classifier
bag_clf = BaggingClassifier(
    tree_clf,
    n_estimators=num_estimators,
    max_samples=1.0,
    max_features=1.0,
    random_state=random_state,
    n_jobs=-1
)

# Create a pipeline
pipe = Pipeline([
    ('preproc', PreprocessTransformer(categorical_features, numerical_features)),
    ('fs', SelectKBest()),
    ('clf', DecisionTreeClassifier())
])
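
Before running the full comparison, we can sanity-check this baseline pipeline with the cross_val_score helper imported above; a minimal sketch (10 folds and ROC AUC, mirroring the grid search below):

# Baseline: a single decision tree inside the preprocessing pipeline
scores = cross_val_score(pipe, X_train, y_train, cv=10,
                         scoring='roc_auc', n_jobs=-1)
print('Baseline AUC: {:.3f} +/- {:.3f}'.format(scores.mean(), scores.std()))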

Since what we are trying to do is compare a single decision tree with ensembles of trees, we can use scikit-learn’s GridSearchCV to train all the predictors (including the bagging classifier defined above) with a single fit call. We use AUC and accuracy as scoring metrics and a KFold with 10 splits for cross-validation.

from sklearn.model_selection import KFold, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, make_scorer
# Define our search space for grid search
search_space = [
    {
        'clf': [DecisionTreeClassifier()],
        'clf__max_leaf_nodes': [128],
        'fs__score_func': [chi2],
        'fs__k': [10],
    },
    {
        # Bagging (bootstrap=True) and pasting (bootstrap=False) of the tree above
        'clf': [bag_clf],
        'clf__bootstrap': [True, False],
        'fs__score_func': [chi2],
        'fs__k': [10],
    },
    {
        'clf': [RandomForestClassifier()],
        'clf__n_estimators': [200],
        'clf__max_leaf_nodes': [128],
        'clf__bootstrap': [False, True],
        'fs__score_func': [chi2],
        'fs__k': [10],
    }
]

# Define scoring
scoring = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score)}

# Define cross validation (shuffle=True so random_state takes effect)
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

# Define grid search
grid = GridSearchCV(
    pipe,
    param_grid=search_space,
    cv=kfold,
    scoring=scoring,
    refit='AUC',
    verbose=1,
    n_jobs=-1
)

# Fit grid search
model = grid.fit(X_train, y_train)

The mean AUC and accuracy for each model evaluated by GridSearchCV (see the cv_results_ sketch after this list) are:

  • Single model: AUC = 0.791, Accuracy: 0.798
  • Bagging: AUC = 0.869, Accuracy = 0.816
  • Pasting: AUC = 0.870, Accuracy = 0.815
  • Native random forest: AUC = 0.887, Accuracy = 0.838
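
These numbers come from the grid search’s cv_results_ attribute; a minimal sketch of pulling them out (the key names follow the scoring dictionary defined above):

# One row per candidate, with the mean cross-validated score for each metric
results = pd.DataFrame(model.cv_results_)
print(results[['param_clf', 'mean_test_AUC', 'mean_test_Accuracy']])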

As expected, we got better results with the ensemble methods, even though their constituent parts are trained with the same algorithm and similar parameters as the single model.

Since the best estimator was the random forest, we can inspect its OOB score and feature importances:

# Grab the classifier from the last step of the winning pipeline
best_estimator = grid.best_estimator_.steps[-1][1]
columns = X_test.columns.tolist()

# Note: oob_score_ is only available if the forest was fit with oob_score=True
print('OOB Score: {}'.format(best_estimator.oob_score_))
print('Feature Importances')
# The importances refer to the k features kept by SelectKBest
for i, imp in enumerate(best_estimator.feature_importances_):
    print('{}: {:.3f}'.format(columns[i], imp))

Which prints:

OOB Score: 0.8396805896805897

Feature Importances:
age: 0.048
workclass: 0.012
fnlwgt: 0.167
education: 0.138
education_num: 0.001
marital_status: 0.329
occupation: 0.009
relationship: 0.259
race: 0.012
sex: 0.025

Conclusion

We saw that ensemble methods can increase the performance of a model. The bagging method uses bootstrapping to produce subsets of the original dataset that are used to train each member of an ensemble. This helps our model generalize better, reducing overfitting and increasing accuracy. We also saw that a random forest is an ensemble of decision trees and that, through it, we can calculate the importance of each feature. All of this can be easily implemented with the scikit-learn library.

I hope you liked it. See you next time! :)
