Different types of Hyper-Parameter Tuning.

This article aims to show Python implementations of different hyperparameter tuning techniques using the RandomForest model.

Abhigyan
Analytics Vidhya
Jul 17, 2021

Contents:

→ Importance of Hyper-Parameter Tuning!
→ Hyperparameter Tuning/Optimization
→ Defining Functions
→ Checking Performance on Base Model
→ Different Hyperparameter Tuning Methods

1. GridSearch
2. RandomSearch
3. Successive Halving
4. Bayesian Optimizers
5. Manual Search

→ Difference between Parameters and Hyperparameters
→ Conclusion

Hyperparameters are the soul of any model in today’s ML world. Their values cannot be learned from the data and must be set manually, and they control the whole learning process.

Hyperparameters need to be set before fitting the data in order to get a more robust and optimized model.

Importance of Hyper-Parameter Tuning!

  1. The goal of any model is to achieve a minimum error, and hyperparameters help achieve that, as they are responsible for the outcome of any ML model.
  2. They influence the convergence of any ML algorithm to a large extent.

Hyperparameter Tuning/Optimization

The process that involves the search of the optimal values of hyperparameters for any machine learning algorithm is called hyperparameter tuning/optimization.

I will use the pulsar star data; you can download it from the Kaggle link.

Complete Code can be found in my GitHub repo.

Defining Functions

Function to evaluate Train Set.

Function to evaluate Test Set

Function to calculate time taken
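The article’s original helper functions are embedded as gists; since they are not visible here, below is a minimal sketch of what they might look like, assuming a fitted scikit-learn classifier and train/test splits named X_train, y_train, X_test, y_test. The names evaluate_train, evaluate_test, and time_taken are illustrative and are reused in the later sketches.

```python
import time
from sklearn.metrics import accuracy_score, f1_score, classification_report

def evaluate_train(model, X_train, y_train):
    # Predict on the training set and report common classification metrics.
    preds = model.predict(X_train)
    print("Train accuracy:", accuracy_score(y_train, preds))
    print("Train F1 score:", f1_score(y_train, preds))
    print(classification_report(y_train, preds))

def evaluate_test(model, X_test, y_test):
    # Predict on the held-out test set and report the same metrics.
    preds = model.predict(X_test)
    print("Test accuracy:", accuracy_score(y_test, preds))
    print("Test F1 score:", f1_score(y_test, preds))
    print(classification_report(y_test, preds))

def time_taken(start, end):
    # Convert a duration in seconds into hours, minutes and seconds.
    elapsed = end - start
    hours, rem = divmod(elapsed, 3600)
    minutes, seconds = divmod(rem, 60)
    print(f"Time taken: {int(hours)}h {int(minutes)}m {seconds:.2f}s")
```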

Checking Performance on Base Model

→ Checking default Parameters of the RandomForest Base Model
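A sketch of this step, assuming the pulsar data has already been split into X_train, X_test, y_train, y_test and using the hypothetical helpers defined above; random_state=42 is an arbitrary choice for reproducibility.

```python
from sklearn.ensemble import RandomForestClassifier

# Base model with default hyperparameters
rf_base = RandomForestClassifier(random_state=42)
print(rf_base.get_params())  # inspect the defaults

# Fit on the training data and evaluate both sets
rf_base.fit(X_train, y_train)
evaluate_train(rf_base, X_train, y_train)
evaluate_test(rf_base, X_test, y_test)
```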

Performance on Train set

Performance on Test set

Different Hyperparameter tuning methods:

1. GridSearch:

  • Grid search picks out hyperparameter values by forming every possible combination of the values passed in the grid, evaluates each one, and returns the best.
  • This leads to an exhaustive search over the entire grid.
  • GridSearch may suffer from the Curse of Dimensionality: the more parameters we pass, the more time and memory the search takes.

The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (higher feature count) that do not occur in low-dimensional spaces (lower feature count).
This means the more dimensions we add, the more the search grows in time complexity, ultimately making this strategy inconvenient.

Providing a dictionary of hyperparameters
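The exact grid used in the article is shown in a gist; a representative dictionary with the same number of values per hyperparameter as the counts listed below might look like this (the specific values are assumptions):

```python
# Hypothetical grid: 2 x 3 x 2 x 3 x 3 x 3 = 324 combinations
param_grid = {
    'bootstrap': [True, False],
    'max_depth': [10, 50, None],
    'max_features': ['sqrt', 'log2'],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [100, 300, 500],
}
print(param_grid)
```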

OUTPUT:

Now, we fit the GridSearch model to find the set of optimal hyperparameter values.

The model will try out 324 combinations of hyperparameters. This gives you an idea of how grid search increases the time complexity:
2 values of bootstrap
3 values of max_depth
2 values of max_features
3 values of min_samples_leaf
3 values of min_samples_split
3 values of n_estimators
which gives a combination count of 2 × 3 × 2 × 3 × 3 × 3 = 324.
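A sketch of the GridSearchCV fit, assuming the param_grid above and the hypothetical helpers defined earlier; cv=3 and scoring='f1' are assumptions, not necessarily the article’s settings.

```python
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=3,           # assumed number of folds
    scoring='f1',
    n_jobs=-1,
    verbose=1,
)

start = time.time()
grid_search.fit(X_train, y_train)
time_taken(start, time.time())

print(grid_search.best_params_)
best_grid = grid_search.best_estimator_
```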

OUTPUT:

Performance on Train Set

Performance on Test Set

2. RandomSearch:

  • Random Search removes the exhaustive search done by GridSearch by sampling combinations of values at random.
  • Since the selection of parameters is completely random, it yields high variance in the results.
  • For example,
    instead of checking all 100 combinations, RandomSearch checks only 50 random ones.
  • However, there is a trade-off for the reduced time complexity: Random Search is good at testing a wide range of values and normally reaches a very good combination very fast, but it doesn’t guarantee to find the best parameter combination.

Using the same dictionary of hyperparameters

Now, we fit the RandomSearch model. This will take some time to execute, depending on the size of the data.

Note:
→ The most important arguments in RandomizedSearchCV are n_iter, which controls the number of different parameter combinations to try, and cv.
→ cv is the number of folds to use for cross-validation. Increasing the cv folds reduces the chance of overfitting, but it also increases the run time.
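A sketch of the RandomizedSearchCV fit under the same assumptions; n_iter=50 mirrors the example above and is not necessarily the article’s value.

```python
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_grid,  # reuse the same dictionary
    n_iter=50,      # number of random combinations to try
    cv=3,
    scoring='f1',
    n_jobs=-1,
    random_state=42,
)

start = time.time()
random_search.fit(X_train, y_train)
time_taken(start, time.time())

print(random_search.best_params_)
best_random = random_search.best_estimator_
```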

OUTPUT:

Performance on Train Set

Performance on Test Set

3. Successive Halving:

Scikit-learn also provides the HalvingGridSearchCV and HalvingRandomSearchCV estimators, which can be used to search a parameter space using successive halving.

  • Successive halving (SH) is like a tournament among candidate parameter combinations.
  • SH is an iterative selection process where all candidates (the parameter combinations) are evaluated with a small amount of resources at the first iteration.
  • Only some of these candidates are selected for the next iteration, which will be allocated more resources.
  • For parameter tuning, the resource is typically the number of training samples, but it can also be an arbitrary numeric parameter such as n_estimators in a random forest.

3.1 — Halving GridSearch

Using the same dictionary of hyperparameters
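A sketch of HalvingGridSearchCV with the same grid; note that the halving estimators are experimental in scikit-learn and need the explicit enable import. factor=3 (keep roughly the best third of candidates each round) and the other settings are assumptions.

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import HalvingGridSearchCV

halving_grid = HalvingGridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    factor=3,       # roughly 1/factor of the candidates survive each iteration
    cv=3,
    scoring='f1',
    n_jobs=-1,
)
halving_grid.fit(X_train, y_train)
```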

Checking Best Parameters

Performance on Train Set

Performance on Test Set

3.2 — Halving RandomSearch

Using the same dictionary of hyperparameters
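And the HalvingRandomSearchCV counterpart, under the same assumptions:

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import HalvingRandomSearchCV

halving_random = HalvingRandomSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_grid,
    factor=3,
    cv=3,
    scoring='f1',
    n_jobs=-1,
    random_state=42,
)
halving_random.fit(X_train, y_train)
```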

Checking Best Parameters

Performance on Train Set

Performance on Test Set

Complete Code can be found in my GitHub repo.

4. Bayesian Optimizers:

4.1 — Hyperopt

Hyperopt is a Python library for serial and parallel optimization over awkward search spaces, which may include real-valued, discrete, and conditional dimensions.

Defining Search Space

Defining Function to minimize

Minimizing the function
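A sketch of the Hyperopt workflow (search space, objective, fmin with TPE). The ranges mirror the hypothetical grid above rather than the article’s exact space, and the objective minimizes the negative cross-validated F1 score.

```python
from hyperopt import hp, fmin, tpe, Trials, STATUS_OK, space_eval
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Search space: hp.choice picks one of the listed values for each parameter
space = {
    'n_estimators': hp.choice('n_estimators', [100, 300, 500]),
    'max_depth': hp.choice('max_depth', [10, 50, None]),
    'max_features': hp.choice('max_features', ['sqrt', 'log2']),
    'min_samples_leaf': hp.choice('min_samples_leaf', [1, 2, 4]),
    'min_samples_split': hp.choice('min_samples_split', [2, 5, 10]),
}

def objective(params):
    # Hyperopt minimizes, so return the negative mean cross-validated F1
    model = RandomForestClassifier(random_state=42, **params)
    score = cross_val_score(model, X_train, y_train, cv=3, scoring='f1').mean()
    return {'loss': -score, 'status': STATUS_OK}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=trials)

# fmin returns indices into the hp.choice lists; decode them with space_eval
best_params = space_eval(space, best)
print(best_params)
```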

Checking Best Parameters

Fitting Base Model with the set of best Parameters

Performance on Train Set

Performance on Test Set

4.2 — Optuna

Optuna is an automatic hyperparameter optimization framework. Its key features include:

  • Eager dynamic search spaces
  • Efficient sampling and pruning algorithms
  • Easy integration
  • Good visualizations
  • Distributed optimization

Defining Function

Creating Study

Minimizing the Function
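A sketch of the Optuna workflow under the same assumptions; to match the article’s “minimizing” framing, the objective returns the negative cross-validated F1 score and the study minimizes it. The ranges and n_trials are illustrative.

```python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Optuna samples each hyperparameter from the ranges defined here
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'max_depth': trial.suggest_int('max_depth', 5, 50),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 4),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 10),
        'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2']),
    }
    model = RandomForestClassifier(random_state=42, **params)
    score = cross_val_score(model, X_train, y_train, cv=3, scoring='f1').mean()
    return -score  # minimize the negative F1

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=50)

print(study.best_params)
best_optuna = RandomForestClassifier(random_state=42, **study.best_params)
best_optuna.fit(X_train, y_train)

# Optimization history plot (needs plotly installed)
fig = optuna.visualization.plot_optimization_history(study)
fig.show()
```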

Checking Best Parameters

Fitting Base Model with the set of best Parameters

Performance on Train Set

Performance on Test Set

Plotting Optimization History

4.3 — Scikit-Optimize

  • Sequential model-based optimization
  • Built on NumPy, SciPy, and Scikit-Learn
  • Open source, commercially usable

Skopt:
Defining Search Space

Defining Objective Function to minimize

Minimizing the Objective Function
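A sketch of the skopt workflow with gp_minimize; the search space and n_calls are illustrative assumptions, and the objective again returns the negative cross-validated F1 score.

```python
from skopt import gp_minimize
from skopt.space import Integer, Categorical
from skopt.utils import use_named_args
from skopt.plots import plot_convergence
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Search space (illustrative ranges)
space = [
    Integer(100, 500, name='n_estimators'),
    Integer(5, 50, name='max_depth'),
    Integer(1, 4, name='min_samples_leaf'),
    Integer(2, 10, name='min_samples_split'),
    Categorical(['sqrt', 'log2'], name='max_features'),
]

@use_named_args(space)
def objective(**params):
    # gp_minimize minimizes, so return the negative cross-validated F1
    model = RandomForestClassifier(random_state=42, **params)
    return -cross_val_score(model, X_train, y_train, cv=3, scoring='f1').mean()

result = gp_minimize(objective, space, n_calls=50, random_state=42)

best_params = dict(zip([dim.name for dim in space], result.x))
print(best_params)

plot_convergence(result)
```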

Checking Best Parameters

Fitting Base Model with the set of best Parameters

Performance on Train Set

Performance on Test Set

Plotting Convergence Graph

4.4 — BayesSearchCV

As of now, BayesSearchCV is not compatible with scikit-learn version 0.24.
To use BayesSearchCV, downgrade scikit-learn to 0.23.2.

Defining Search Space

Fitting the BayesSearchCV
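A sketch of BayesSearchCV (from skopt), again with assumed ranges, n_iter, and scoring:

```python
from skopt import BayesSearchCV
from skopt.space import Integer, Categorical
from sklearn.ensemble import RandomForestClassifier

bayes_search = BayesSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    search_spaces={
        'n_estimators': Integer(100, 500),
        'max_depth': Integer(5, 50),
        'min_samples_leaf': Integer(1, 4),
        'min_samples_split': Integer(2, 10),
        'max_features': Categorical(['sqrt', 'log2']),
    },
    n_iter=32,      # assumed number of Bayesian optimization steps
    cv=3,
    scoring='f1',
    n_jobs=-1,
    random_state=42,
)
bayes_search.fit(X_train, y_train)

print(bayes_search.best_params_)
```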

Checking Best Parameters

Performance on Train Set

Performance on Test Set

Plotting Objective

5. Manual Search:

  • Manual Search is done on the basis of our judgment/experience.
  • We train the model with hyperparameter values that we assign manually, evaluate its accuracy, and start the process again.
  • This loop is repeated until a satisfactory accuracy is reached (see the sketch below).
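A minimal sketch of a manual search loop; the candidate settings are purely illustrative, and it reuses the hypothetical evaluate_test helper from earlier.

```python
from sklearn.ensemble import RandomForestClassifier

# Hand-picked candidate settings, refined by judgment after each run
manual_candidates = [
    {'n_estimators': 100, 'max_depth': 10},
    {'n_estimators': 300, 'max_depth': 20, 'min_samples_leaf': 2},
    {'n_estimators': 500, 'max_depth': None, 'max_features': 'sqrt'},
]

for params in manual_candidates:
    model = RandomForestClassifier(random_state=42, **params)
    model.fit(X_train, y_train)
    print(params)
    evaluate_test(model, X_test, y_test)
```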

Difference between Parameters and Hyperparameters

→ Model Parameters: These are learned by the model during training as it processes the data.
Model parameters differ from experiment to experiment and depend entirely on the type of data passed and the task being solved.

Some examples of model parameters include:

  • The weights in an artificial neural network(ANN).
  • The support vectors in a support vector machine.
  • The coefficients in linear regression or logistic regression.
  • For NLP tasks: word frequency, sentence length, noun or verb distribution per sentence, the number of specific character n-grams per word, lexical diversity, etc.

→ Hyperparameters: These are the values that must be passed to the model before training; they are not learned from the data and are tuned to obtain optimal performance on any given data, for any task.

Some examples of model hyperparameters include:

  • The learning rate for training a neural network.
  • The C and sigma hyperparameters for support vector machines.
  • The k in k-nearest neighbors.
  • The depth of the tree in decision trees.


Conclusion

After using all the different methods, we create a dataframe from the results so that we can compare each of the techniques.

→ Sorting with respect to the F1 score on the test set

→ Sorting with respect to the difference between the F1 score on the train and test sets

After sorting the values with respect to the F1 score on the train and test sets, it turns out that the Bayesian techniques worked best.

However, in a production environment we not only have to get the best result, but we also have to get it as quickly as possible, and in that respect RandomSearch performed best.

Complete Code can be found in my GitHub repo.

Like my article? Do give me a clap and share it, as that will boost my confidence.
Also, check out my other posts and stay connected for future articles in my data science and machine learning basics series.

Also, do connect with me on LinkedIn.

