Bagging Classifier
Instead of running various models on a single dataset, you can run a single model over various random subsets of the dataset. Training predictors on random samples drawn with replacement is called bagging, short for bootstrap aggregating. If that is hard to visualize, just imagine disregarding several random entries in the dataset and training on the rest. Pasting follows the same process, but without replacement: it does not allow a training instance to be sampled more than once for the same predictor.
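In scikit-learn, the only difference between the two is the `bootstrap` flag of `BaggingClassifier`. A minimal sketch, using synthetic data and illustrative hyperparameters (not the case-study settings):

```python
# Bagging vs. pasting: same ensemble class, different sampling strategy.
# Data here is synthetic, only to show the API.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bagging = BaggingClassifier(
    n_estimators=100,
    bootstrap=True,   # sample WITH replacement -> bagging
    random_state=42,
)
pasting = BaggingClassifier(
    n_estimators=100,
    bootstrap=False,  # sample WITHOUT replacement -> pasting
    max_samples=0.8,  # each predictor sees a random 80% subset
    random_state=42,
)
for name, model in [("bagging", bagging), ("pasting", pasting)]:
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))
```

The default base estimator is a decision tree, which is what we will compare against later anyway.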
Case Study: 1994 census US Income
Professor Ivanovitch implemented code using the 1994 census dataset on U.S. income. It contains information on marital status, age, type of work, and more. The target column, high_income, records whether salary is less than or equal to 50k a year (0) or more than 50k a year (1).
You can download the data from the University of California, Irvine’s website.
My job is to complement the initial code with the Bagging Classifier and to compare it to the Random Forest Classifier and the Decision Tree Classifier.
Are you with me? So, hey ho! Let’s go!
1. Load Libraries
2. Get Data
Now we are going to verify if there are missing values:
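A sketch of that check, using a tiny frame in the shape of the census ("adult") data rather than the downloaded file. In the real dataset, missing values appear as "?", so they must be converted to NaN before `.isnull()` can see them:

```python
# Normalize the dataset's missing-value marker and count missing per column.
# The frame below is a small illustrative stand-in for the real data.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "?", "Private"],   # "?" marks a missing value
    "marital_status": ["Never-married", "Married-civ-spouse", "Divorced"],
    "high_income": [0, 0, 1],
})

df = df.replace("?", np.nan)   # turn the "?" marker into a real NaN
print(df.isnull().sum())       # per-column count of missing values
```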
3. Cleaning, preparing and manipulating the Dataset (Feature Engineering)
This dataset contains a mix of categorical (9 columns) and numerical (6 columns) independent variables, which, as we know, need to be pre-processed in different ways and separately.
This means that initially they’ll have to go through separate pipelines to be pre-processed appropriately and then we’ll combine them together. So the first step in both pipelines would have to be to extract the appropriate columns that need to be pushed down for pre-processing.
Categorical Pipeline
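A sketch of the two pipelines combined with a `ColumnTransformer`. The column names are a small subset of the census columns, and the exact steps (imputation strategy, scaler) are assumptions; the original notebook may differ:

```python
# Separate pre-processing pipelines for categorical and numerical columns,
# combined into one transformer. Column lists are illustrative subsets.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_cols = ["workclass", "marital_status"]
numerical_cols = ["age", "hours_per_week"]

categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
numerical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer([
    ("cat", categorical_pipe, categorical_cols),
    ("num", numerical_pipe, numerical_cols),
])

df = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private"],
    "marital_status": ["Divorced", "Never-married", "Divorced"],
    "age": [38, 39, 45],
    "hours_per_week": [40, 40, 60],
})
X = preprocess.fit_transform(df)
print(X.shape)  # 2 + 2 one-hot columns plus 2 scaled numerical columns
```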
4. Modeling (train and test)
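A sketch of this step: fit the three classifiers on the same train/test split and compare hold-out accuracy. Synthetic data stands in for the pre-processed census features:

```python
# Fit Decision Tree, Bagging, and Random Forest on one split and compare.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "Bagging": BaggingClassifier(n_estimators=100, random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.3f}")
```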
5. Algorithm Tuning
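The results columns mentioned below (mean_train_AUC, rank_test_AUC, etc.) suggest a `GridSearchCV` with two scorers and train scores enabled. A sketch of that setup, with a small illustrative grid rather than the original 134-candidate one:

```python
# Multi-metric grid search: one rank_test_<scorer> column per scorer.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring={"AUC": "roc_auc", "Accuracy": "accuracy"},
    refit="AUC",              # pick the final model by AUC
    return_train_score=True,  # needed for the mean_train_* columns
    cv=5,
)
search.fit(X, y)

results = pd.DataFrame(search.cv_results_)
print(results.shape)  # one row per candidate, one column per statistic
```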
The Random Forest Classifier presents the best result.
The results DataFrame has 134 rows and 70 columns! :0
Finding the best results:
We will see the best result using the AUC scorer:
We want to see the best result, so we will filter rank_test_AUC for rank number one:
And now we will see the best result using the Accuracy score:
And we need to see the best result:
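The filtering for both scorers can be sketched like this. A small mock results frame stands in for the real 134-row one; with multi-metric scoring, rank 1 in a rank_test_&lt;scorer&gt; column marks the best candidate for that scorer:

```python
# Filter cv_results_ by rank: rank 1 = best candidate for that scorer.
# Values below are made up, only to show the filtering pattern.
import pandas as pd

results = pd.DataFrame({
    "params": [{"n_estimators": 50}, {"n_estimators": 100}, {"n_estimators": 200}],
    "mean_test_AUC": [0.88, 0.91, 0.90],
    "rank_test_AUC": [3, 1, 2],
    "mean_test_Accuracy": [0.82, 0.84, 0.85],
    "rank_test_Accuracy": [3, 2, 1],
})

best_by_auc = results[results["rank_test_AUC"] == 1]
best_by_acc = results[results["rank_test_Accuracy"] == 1]
print(best_by_auc[["params", "mean_test_AUC"]])
print(best_by_acc[["params", "mean_test_Accuracy"]])
```

Note that the two scorers need not agree on the same candidate.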
6. Finalizing the model
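A sketch of finalizing: when `refit` is enabled, `best_estimator_` is already retrained on the full training set, so it only remains to evaluate it once on the held-out test set. Synthetic data again stands in for the census features:

```python
# Take the refit best estimator from the search and score it on test data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    {"n_estimators": [50, 100]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)

final_model = search.best_estimator_  # already refit on all of X_train
test_auc = roc_auc_score(y_test, final_model.predict_proba(X_test)[:, 1])
print(f"test AUC: {test_auc:.3f}")
```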
Discussion
AUC: the mean_train_AUC is close to the mean_test_AUC, although the std_train_AUC is two orders of magnitude smaller than the std_test_AUC, indicating some overfitting.
Accuracy: the mean_train_Accuracy is close to the mean_test_Accuracy, although the std_train_Accuracy is two orders of magnitude smaller than the std_test_Accuracy, indicating some overfitting.