Bagging Classifier

Pedro Meira · Published in Time to Work · Oct 10, 2019

Instead of running various models on a single dataset, you can run a single model over various random subsets of the dataset. When the subsets are drawn by random sampling with replacement, the method is called bagging, short for bootstrap aggregating. If that is difficult to visualize, just imagine disregarding several random entries in the dataset and training on the rest. The pasting algorithm applies the same process, except that sampling is done without replacement, so the same training instance cannot be sampled several times for a given predictor.
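In scikit-learn, both methods use the same `BaggingClassifier` class; the `bootstrap` flag is what switches between bagging and pasting. A minimal sketch on synthetic data (the hyperparameter values here are illustrative, not the ones used later in the case study):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)

# Bagging: each tree sees a bootstrap sample (with replacement).
bagging = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    bootstrap=True, random_state=42)

# Pasting: same idea, but each subset is sampled without replacement.
pasting = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    bootstrap=False, max_samples=0.8, random_state=42)

bagging.fit(X, y)
pasting.fit(X, y)
```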

Case Study: 1994 census US Income

Professor Ivanovitch has implemented a code using the 1994 census data set on U.S. income. It contains information on marital status, age, type of work, and more. The target column, high_income, records whether a salary is less than or equal to 50k a year (0) or more than 50k a year (1).

You can download the data from the University of California, Irvine’s website.

My job is to complement the initial code with a Bagging Classifier and to compare it with the Random Forest Classifier and the Decision Tree Classifier.

Are you with me? So, hey ho! Let’s go!

1. Load Libraries

2. Get Data

Importing
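The import step is shown in the post only as an image. A sketch of what it could look like, using the UCI column names; two inline sample rows stand in here for the real adult.data file you would download from the UCI repository:

```python
import io

import pandas as pd

# Column names from the UCI "Adult" data set description. The StringIO
# below is a stand-in for pd.read_csv("adult.data", ...) on the real file.
cols = ["age", "workclass", "fnlwgt", "education", "education_num",
        "marital_status", "occupation", "relationship", "race", "sex",
        "capital_gain", "capital_loss", "hours_per_week",
        "native_country", "high_income"]
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical,"
    " Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse,"
    " Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K\n")
income = pd.read_csv(sample, header=None, names=cols, skipinitialspace=True)

# Encode the target: <=50K -> 0, >50K -> 1.
income["high_income"] = (income["high_income"] == ">50K").astype(int)
```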

Now we are going to verify if there are missing values:

Income Info Output
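The verification itself appears only as an image. It can be sketched like this, on a tiny stand-in frame; note that this census file marks missing values with "?" rather than leaving cells empty:

```python
import pandas as pd

# Tiny stand-in frame; in the real data set missing values appear as "?".
income = pd.DataFrame({
    "age": [39, 52],
    "workclass": ["State-gov", "?"],
    "high_income": [0, 1],
})

# Treat "?" as missing, then count missing values per column.
income = income.replace("?", pd.NA)
print(income.isna().sum())
```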

3. Cleaning, preparing and manipulating the Dataset (Feature Engineering)

This dataset contains a mix of categorical (9 columns) and numerical (6 columns) independent variables which, as we know, will need to be pre-processed in different ways and separately.

This means that initially they’ll have to go through separate pipelines to be pre-processed appropriately, and then we’ll combine them together. So the first step in both pipelines is to extract the appropriate columns that need to be pushed down for pre-processing.

Categorical Pipeline
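The pipeline code is shown only as images. A hedged sketch of the two-pipeline idea with scikit-learn's `Pipeline` and `ColumnTransformer`; the column lists and imputation strategies here are illustrative assumptions, not the post's exact choices:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical subsets of the 9 categorical and 6 numerical columns.
categorical = ["workclass", "education", "marital_status"]
numerical = ["age", "hours_per_week"]

# Categorical pipeline: impute the most frequent value, then one-hot encode.
cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
# Numerical pipeline: impute the median, then standardize.
num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Each pipeline extracts its own columns; the outputs are combined.
preprocess = ColumnTransformer([
    ("cat", cat_pipe, categorical),
    ("num", num_pipe, numerical),
])

demo = pd.DataFrame({
    "workclass": ["State-gov", "Private"],
    "education": ["Bachelors", "HS-grad"],
    "marital_status": ["Never-married", "Married-civ-spouse"],
    "age": [39, 52],
    "hours_per_week": [40, 45],
})
Xt = preprocess.fit_transform(demo)
```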

4. Modeling (train and test)
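The training code appears in the post only as images. The comparison of the three classifiers could be sketched like this, with synthetic data standing in for the pre-processed census features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the pre-processed census features.
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "bagging": BaggingClassifier(n_estimators=100, random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100,
                                            random_state=42),
}

# Fit each model on the training split and score it on the test split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
print(scores)
```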

5. Algorithm Tuning

Algorithm Tuning and the best result
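The tuning code itself is shown only as an image. A hedged sketch of what such a search could look like, assuming `GridSearchCV` with two scorers named AUC and Accuracy (consistent with the rank_test_AUC and rank_test_Accuracy columns discussed below); the parameter grid is illustrative:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the real search ran on the census features.
X, y = make_classification(n_samples=300, random_state=42)

# Hypothetical grid. The scorer keys produce the column names used later:
# mean_test_AUC, rank_test_AUC, mean_train_Accuracy, and so on.
param_grid = {"n_estimators": [50, 100], "max_depth": [5, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid,
    scoring={"AUC": "roc_auc", "Accuracy": "accuracy"},
    refit="AUC", return_train_score=True, cv=3)
search.fit(X, y)

# One row per parameter combination, one column per recorded statistic.
results = pd.DataFrame(search.cv_results_)
```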

The Random Forest Classifier presents the best result.

Results DataFrame

The results DataFrame has 134 rows and 70 columns! :0

Finding the best results:

We will see the best result using the AUC scorer:

AUC results

We want to see the best result. So, we will filter rank_test_AUC for the value one (the top rank):
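That filter can be sketched like this; a tiny stand-in frame replaces the real cv_results_ DataFrame:

```python
import pandas as pd

# Stand-in for the results DataFrame built from cv_results_.
results = pd.DataFrame({
    "params": [{"n_estimators": 50}, {"n_estimators": 100}],
    "mean_test_AUC": [0.88, 0.91],
    "rank_test_AUC": [2, 1],
    "mean_test_Accuracy": [0.84, 0.86],
    "rank_test_Accuracy": [2, 1],
})

# Keep only the row(s) ranked first by AUC; the same pattern works
# for rank_test_Accuracy.
best_auc = results[results["rank_test_AUC"] == 1]
print(best_auc)
```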

Best result — AUC rank

And now we will see the best result using the Accuracy score:

Accuracy Score Results

And we need to see the best result:

6. Finalizing the model
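The finalizing code is shown only as an image. With `GridSearchCV`, the refit best estimator is ready to use directly; a sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the census features.
X, y = make_classification(n_samples=300, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [50, 100]},
    scoring="roc_auc", cv=3)
search.fit(X_train, y_train)

# best_estimator_ is already refit on the full training split,
# so it can serve as the final model on held-out data.
final_model = search.best_estimator_
auc = roc_auc_score(y_test, final_model.predict_proba(X_test)[:, 1])
```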

Discussion

AUC: the mean_train_AUC is close to the mean_test_AUC, although the std_train_AUC is two orders of magnitude smaller than the std_test_AUC, indicating overfitting.

Accuracy: the mean_train_Accuracy is close to the mean_test_Accuracy, although the std_train_Accuracy is two orders of magnitude smaller than the std_test_Accuracy, indicating overfitting.
