Tuning of Adaboost with Computational Complexity

Srijani Chaudhury
Aug 24, 2020 · 10 min read



Boosting is a class of ensemble machine learning algorithms that involve combining the predictions from many weak learners.

A weak learner is a model that is very simple, although has some skill on the dataset. Boosting was a theoretical concept long before a practical algorithm could be developed, and the AdaBoost (adaptive boosting) algorithm was the first successful approach for the idea.

The AdaBoost algorithm involves using very short (one-level) decision trees as weak learners that are added sequentially to the ensemble. Each subsequent model attempts to correct the predictions made by the model before it in the sequence. This is achieved by weighting the training dataset to put more focus on training examples on which prior models made prediction errors.

We call the algorithm AdaBoost because, unlike previous algorithms, it adjusts adaptively to the errors of the weak hypotheses

A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting, 1996.

the new algorithm needs no prior knowledge of the accuracies of the weak hypotheses. Rather, it adapts to these accuracies and generates a weighted majority hypothesis in which the weight of each weak hypothesis is a function of its accuracy.

A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting, 1996.


In this blog, I have tried to explain:

  • Adaboost using Scikit-Learn
  • Tuning Adaboost Hyperparameters
  • Grid Searching Adaboost Hyperparameters
  • Train time complexity, test time complexity, and space complexity of Adaboost.

1. Adaboost using Scikit-Learn

Adaboost is generally used for classification problems, so we use the AdaBoostClassifier.

Tip: Before using Scikit-Learn, check the version first. The machine learning modules are continuously updated by the developers, and our intention is to use the latest technology.

# check scikit-learn version
import sklearn
print(sklearn.__version__)

This prints the installed version as output.

0.22.1

If your installed version is this or newer, you are good to go. :)

Now we import the AdaBoostClassifier class and use the make_classification() function to create a random dataset.
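Below is a minimal sketch of how such a dataset can be created; the n_informative, n_redundant, and random_state values are my own assumptions, chosen only for illustration.

# create a synthetic binary classification dataset
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=6)
# summarize the shape of the dataset
print(X.shape, y.shape)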

Thus we created a dataset of 1000 rows and 20 columns. Now let’s see the output.

(1000, 20) (1000,)

Randomness is used in the construction of the model. This means that each time the algorithm is fit on the data, it will produce slightly different results.

When using a machine learning algorithm with a stochastic learning procedure, it is good practice to evaluate it by averaging its performance across multiple runs or repeats of cross-validation. When fitting a final model, it may be desirable either to increase the number of trees until the variance of the model is reduced across repeated evaluations, or to fit multiple final models and average their predictions.

We will evaluate the model using repeated stratified k-fold cross-validation, with three repeats and 10 folds.
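Here is a sketch of that evaluation on the same synthetic dataset; the random_state values are my assumptions.

# evaluate a default AdaBoostClassifier with repeated stratified 10-fold CV
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=6)
model = AdaBoostClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))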

Using the default parameters we get an accuracy of around eighty percent.

Accuracy: 0.806 (0.041)

1(a) AdaBoostClassifier Parameters

  1. base_estimator: This parameter specifies the type of base learner or weak learner we want to use. It can be a decision tree, logistic regression, SVC, and so on. It cannot be k-NN, because sample weights cannot be assigned in that model. By default, the base estimator is DecisionTreeClassifier(max_depth=1).
  2. n_estimators: The number of base estimators or weak learners we want to use in our ensemble. By default, n_estimators is 50.
  3. learning_rate: This parameter shrinks the contribution of each classifier. By default, it is set to 1.
  4. algorithm: It can be either SAMME or SAMME.R. SAMME.R uses the probability estimates to update the additive model, while SAMME uses the classifications only. SAMME.R typically converges faster than SAMME, achieving a lower test error with fewer boosting iterations. SAMME.R stops once the error of a weak hypothesis goes above 1/2. This is not the case for SAMME: even when the error is bigger than (or equal to) 1/2, the weight of the estimator is still positive, so the misclassified training samples get more weight and the test error can keep decreasing even after 600 iterations. (A sketch of setting these parameters explicitly follows this list.)
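As mentioned above, here is a minimal sketch of constructing an AdaBoostClassifier with these four parameters set explicitly; the values shown are simply the defaults, used for illustration.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

model = AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=1),  # one-level tree (decision stump)
    n_estimators=50,       # number of weak learners
    learning_rate=1.0,     # shrinks the contribution of each weak learner
    algorithm='SAMME.R')   # update the model using class probabilities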

1(b) Adaboost Attributes.

  1. estimators_: The collection of fitted weak learners.
  2. classes_: The class labels.
  3. estimator_weights_: The weight assigned to each base estimator.
  4. estimator_errors_: The classification error of each estimator in the boosted ensemble.
  5. feature_importances_: Shows us which features are more important than others (illustrated in the sketch after this list).
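As noted above, here is a quick sketch of inspecting these attributes after fitting a model on the synthetic dataset created earlier.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=6)
model = AdaBoostClassifier().fit(X, y)
print(len(model.estimators_))        # number of fitted weak learners
print(model.classes_)                # class labels
print(model.estimator_weights_[:5])  # weights of the first five estimators
print(model.estimator_errors_[:5])   # errors of the first five estimators
print(model.feature_importances_)    # relative importance of each feature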

2. Tuning Adaboost Hyperparameters

Let us play with the various parameters provided by the AdaBoost class and observe how the accuracy changes:

Explore the number of trees

An important hyperparameter for Adaboost is n_estimators. Often, by changing the number of base models or weak learners, we can adjust the accuracy of the model. The number of trees added to the model must be high for the model to work well, often hundreds, if not thousands. After all, the more weak learners we use, the more the model moves from being high-bias to low-bias.

Let’s see the code:
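What follows is a sketch of that experiment; the loop structure is my assumption, while the n_estimators values and the cross-validation setup follow the text above.

# explore the effect of the number of weak learners on accuracy
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=6)
for n in [10, 50, 100, 500, 1000, 5000]:
    model = AdaBoostClassifier(n_estimators=n)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    print('>%d %.3f (%.3f)' % (n, mean(scores), std(scores)))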

The output of the code:

>10 0.773 (0.039)
>50 0.806 (0.041)
>100 0.801 (0.032)
>500 0.793 (0.028)
>1000 0.791 (0.032)
>5000 0.782 (0.031)


From the above we can observe that the accuracy increases up to 50 estimators, but after that it decreases. This suggests that beyond 50 estimators we may be overfitting the model, which we don't want at all.

Explore Weak Learner

As we know, Adaboost uses a series of weak learners and makes the final prediction from the weighted say of these weak learners. So let's explore the weak learner itself; in effect, we are exploring the depth of the decision tree.
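Here is a sketch of varying the depth of the decision tree used as the weak learner; the loop structure is my assumption.

# explore the effect of the weak learner's depth on accuracy
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=6)
for depth in range(1, 11):
    base = DecisionTreeClassifier(max_depth=depth)
    model = AdaBoostClassifier(base_estimator=base)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    print('>%d %.3f (%.3f)' % (depth, mean(scores), std(scores)))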

The output:

>1 0.806 (0.041)
>2 0.864 (0.028)
>3 0.867 (0.030)
>4 0.889 (0.029)
>5 0.909 (0.021)
>6 0.923 (0.020)
>7 0.927 (0.025)
>8 0.928 (0.028)
>9 0.923 (0.017)
>10 0.926 (0.030)


We see here that the accuracy increases with depth, peaking at about 92.8 percent at a depth of 8 and reaching 92.6 percent (+/- 0.030) at a depth of 10. You can explore some higher depths, but we should be careful not to overfit the decision tree.

Explore Learning Rate

The learning rate scales the weight assigned to each base model:

weight = learning_rate * log((1 - e) / e), where e is the error of the base model
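As a tiny worked example of this formula, here is a sketch that computes the weight for a hypothetical error value; e = 0.3 is an arbitrary illustration.

# compute the estimator weight for an example error value
from math import log

e = 0.3  # weighted error of a hypothetical weak learner
for lr in (1.0, 0.5):
    weight = lr * log((1 - e) / e)
    print('learning_rate=%.1f -> estimator weight=%.3f' % (lr, weight))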

The learning rate depends highly upon the number of estimators. By default it is set to 1, but it can be increased or decreased depending on the estimators used. Generally, for a large number of estimators, we use a smaller learning rate; for example, when our weak classifier predicts only slightly better than a random guess, a learning rate of 0.5 may be used. It is common to use small learning rates between 0 and 1, such as 0.1 or 0.001, because larger values can give rise to overfitting.

Let's see the code:
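Below is a sketch that scans the learning rate from 0.1 to 2.0 in steps of 0.1; the loop structure is my assumption.

# explore the effect of the learning rate on accuracy
from numpy import arange, mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=6)
for lr in arange(0.1, 2.1, 0.1):
    model = AdaBoostClassifier(learning_rate=lr)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    print('>%.3f %.3f (%.3f)' % (lr, mean(scores), std(scores)))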

The output:

>0.100 0.767 (0.049) 
>0.200 0.786 (0.042)
>0.300 0.802 (0.040)
>0.400 0.798 (0.037)
>0.500 0.805 (0.042)
>0.600 0.795 (0.031)
>0.700 0.799 (0.035)
>0.800 0.801 (0.033)
>0.900 0.805 (0.032)
>1.000 0.806 (0.041)
>1.100 0.801 (0.037)
>1.200 0.800 (0.030)
>1.300 0.799 (0.041)
>1.400 0.793 (0.041)
>1.500 0.790 (0.040)
>1.600 0.775 (0.034)
>1.700 0.767 (0.054)
>1.800 0.768 (0.040)
>1.900 0.736 (0.047)
>2.000 0.682 (0.048)


As we can see, the best accuracy is for a learning rate of 1.0. Beyond this, the accuracy decreases, which indicates overfitting.

Explore with Alternate Algorithms

Other than decision trees, we can use various other weak learner models, such as a Support Vector Classifier (SVC) or logistic regression. The intent is to use weak learners only. We can use any weak learner as long as it supports weighting of the samples. We can explore by playing with the base_estimator parameter.

The code below illustrates what I meant to say:
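Here is a sketch of that comparison; probability=True on the SVC is my assumption, so that it exposes the probability estimates required by the default SAMME.R algorithm.

# compare SVC and logistic regression as weak learners
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=6)
for base in [SVC(probability=True), LogisticRegression()]:
    model = AdaBoostClassifier(base_estimator=base)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    print('> %.3f (%.3f)' % (mean(scores), std(scores)))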

The output:

> 0.707 (0.083)
> 0.794 (0.032)


Thus we observe that SVC is a weaker classifier than logistic regression here.

3. Using GridSearchCV

Adaboost can sometimes be difficult to tune because it has many hyperparameters. Popular search processes include random search and grid search, and using GridSearchCV is always a smart approach.

In this case, we will grid search two key hyperparameters for AdaBoost: the number of trees used in the ensemble and the learning rate. We will use a range of popular well-performing values for each hyperparameter.

Each configuration combination will be evaluated using repeated k-fold cross-validation and configurations will be compared using the mean score, in this case, classification accuracy.
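A sketch of this grid search follows; the exact grid values are inferred from the results printed below.

# grid search the number of trees and the learning rate for AdaBoost
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=6)
model = AdaBoostClassifier()
grid = {'n_estimators': [10, 50, 100, 500],
        'learning_rate': [0.0001, 0.001, 0.01, 0.1, 1.0]}
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1)
result = search.fit(X, y)
print('Best: %f using %s' % (result.best_score_, result.best_params_))
for m, s, p in zip(result.cv_results_['mean_test_score'],
                   result.cv_results_['std_test_score'],
                   result.cv_results_['params']):
    print('%f (%f) with: %r' % (m, s, p))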

In this case, we can see that a configuration with 500 trees and a learning rate of 0.1 performed the best, with a classification accuracy of about 81.3 percent.

The accuracy might increase further with more trees, such as 1000 or 5000, or more.

Best: 0.813667 using {'learning_rate': 0.1, 'n_estimators': 500}
0.646333 (0.036376) with: {'learning_rate': 0.0001, 'n_estimators': 10}
0.646667 (0.036545) with: {'learning_rate': 0.0001, 'n_estimators': 50}
0.646667 (0.036545) with: {'learning_rate': 0.0001, 'n_estimators': 100}
0.647000 (0.038136) with: {'learning_rate': 0.0001, 'n_estimators': 500}
0.646667 (0.036545) with: {'learning_rate': 0.001, 'n_estimators': 10}
0.647000 (0.038136) with: {'learning_rate': 0.001, 'n_estimators': 50}
0.654333 (0.045511) with: {'learning_rate': 0.001, 'n_estimators': 100}
0.672667 (0.046543) with: {'learning_rate': 0.001, 'n_estimators': 500}
0.648333 (0.042197) with: {'learning_rate': 0.01, 'n_estimators': 10}
0.671667 (0.045613) with: {'learning_rate': 0.01, 'n_estimators': 50}
0.715000 (0.053213) with: {'learning_rate': 0.01, 'n_estimators': 100}
0.767667 (0.045948) with: {'learning_rate': 0.01, 'n_estimators': 500}
0.716667 (0.048876) with: {'learning_rate': 0.1, 'n_estimators': 10}
0.767000 (0.049271) with: {'learning_rate': 0.1, 'n_estimators': 50}
0.784667 (0.042874) with: {'learning_rate': 0.1, 'n_estimators': 100}
0.813667 (0.032092) with: {'learning_rate': 0.1, 'n_estimators': 500}
0.773333 (0.038759) with: {'learning_rate': 1.0, 'n_estimators': 10}
0.806333 (0.040701) with: {'learning_rate': 1.0, 'n_estimators': 50}
0.801000 (0.032491) with: {'learning_rate': 1.0, 'n_estimators': 100}
0.792667 (0.027560) with: {'learning_rate': 1.0, 'n_estimators': 500}

4. Computational Complexity

Time complexity is the computational complexity that describes the amount of time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations performed by the algorithm, supposing that each elementary operation takes a fixed amount of time to perform. Thus, the amount of time taken and the number of elementary operations performed by the algorithm is taken to differ by at most a constant factor.

As we know, different operations in an algorithm take different amounts of time to execute, so we generally consider the worst-case time complexity. By considering the “worst case” we set aside flaws in the hardware or external factors that might delay the computation. Not only that, we can also actually improve the performance of an algorithm by reasoning about its “worst” situation.

“Sometimes, when I try to understand a person’s motives, I play a little game. I assume the worst. What’s the worst reason they could possibly have for saying what they say, or doing what they do? Then I ask myself, ‘how well does that reason explain what they say and what they do?’” — Lord Petyr Baelish (Game of Thrones) (…just trying to add a little drama for GoT fans :))

So, like in GoT, there is a Petyr Baelish in the ML world, known as Big O.

Big O notation, commonly written as O, is an asymptotic notation for the worst case of a given function. It provides us with an asymptotic upper bound for the growth rate of the runtime of an algorithm. O is generally expressed as a function of some feature of the algorithm's input.


Some real-life examples:

O(n²): You go and ask the first person in the class if he has your pen, and you also ask this person about the other 99 people in the classroom, whether they have that pen, and so on, repeating this for each person.
This is what we call O(n²).

O(n): Going and asking each student individually is O(n).

O(log n): Now I divide the class into two groups, then ask: “Is it on the left side, or the right side of the classroom?” Then I take that group and divide it into two and ask again, and so on. Repeat the process until you are left with one student who has your pen. This is what you mean by O(log n).

Space complexity is a measure of the amount of working storage an algorithm needs. That means how much memory, in the worst case, is needed at any point in the algorithm. As with time complexity, we’re mostly concerned with how the space needs grow, in Big-Oh terms, as the size N of the input problem grows.

Now we need to consider the time and space complexity while constructing an ML algorithm. Our intent is to minimize these complexities. In order to construct an efficient prediction system, we have to choose an ML model depending on the situation and its complexity.

Computational Complexity of Adaboost

  • Train time complexity: O(n · p · n_trees)
  • Test time complexity: O(p · n_trees)

Here n denotes the number of data points in the dataset, p the number of features, and n_trees the number of estimators we are using in our model.

While training, for each decision stump we go through the n data points for each of the p features to find the split, then calculate the loss function and update the weights, which takes O(n · p) operations. The same is repeated for each of the n_trees estimators, so training takes O(n · p · n_trees) operations.

While testing, the model predicts using the decision stumps, and each weak learner comprises only one feature (we consider the entropy of each column, find the column with the least entropy, and split it on the basis of a threshold). This operation is repeated for each base estimator, and each one considers one feature at a time, so prediction takes O(p · n_trees).
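As a rough illustration of the linear dependence on n_trees, here is a sketch that times training for a few values of n_estimators; timings vary by machine, so this is not a rigorous benchmark.

# time AdaBoost training for an increasing number of estimators
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=6)
for n in (50, 100, 200, 400):
    start = time.time()
    AdaBoostClassifier(n_estimators=n).fit(X, y)
    print('n_estimators=%d -> %.2f seconds' % (n, time.time() - start))

If the timings grow roughly in proportion to n_estimators, that is consistent with the O(n · p · n_trees) training cost discussed above.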
