Hyper Parameter Tuning (GridSearchCV Vs RandomizedSearchCV)

Vishnu Satheesh · Analytics Vidhya · Dec 22, 2020

Data scientists deal with hyper parameter tuning quite often in their day-to-day machine learning work. So what are hyper parameters and why do we need them? In this article we will discuss the two main hyper parameter search methods in scikit-learn: GridSearchCV and RandomizedSearchCV.


What are Hyper Parameters?

Hyper parameters are like handles available to control the behavior and output of the algorithm used for modeling, and they are supplied to the algorithm as arguments. For example, in model = DecisionTreeClassifier(criterion='entropy'), the criterion 'entropy' is a hyper parameter passed to the decision tree.

The get_params() method returns a dictionary of all the hyper parameters of an estimator along with their current values.
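
For example, a quick sketch (assuming scikit-learn is installed) that prints the hyper parameters of the DecisionTreeClassifier mentioned above:

# List every hyper parameter of an estimator with get_params()
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(criterion='entropy')
print(model.get_params())  # dict of every hyper parameter and its current value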

When hyper parameters are not supplied, the algorithm falls back on its default values, which are rarely the best choice for a given dataset. This makes hyper parameter tuning one of the critical steps in a machine learning implementation.

Steps involved in hyper parameter tuning

  1. Choose the appropriate algorithm for the model
  2. Decide the parameter space
  3. Decide the method for searching parameter space
  4. Decide the cross-validation method
  5. Decide the score metrics to evaluate your model

In order to search for the best values in the hyper parameter space, we can use

  1. GridSearchCV (considers all possible combinations of hyper parameters)
  2. RandomizedSearchCV (only few samples are randomly selected)

Cross-validation is a resampling procedure used to evaluate machine learning models. It has a single parameter k, which is the number of partitions the data sample is split into, so the procedure is often called k-fold cross-validation. The data is first divided into training and testing sets to prevent data leakage: all tuning happens on the training data, and the testing set is only used after the model has been fit. Within the training data, the model is fit k times, each time training on k-1 folds and evaluating on the remaining held-out fold, and the average of the k evaluation scores is used to judge the model overall.
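
As a minimal sketch of k-fold cross-validation (the toy regression data here is an illustrative assumption, not from the article), scikit-learn's cross_val_score handles the splitting, fitting and averaging:

# Evaluate a model with 5-fold cross-validation and average the fold scores
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=10, random_state=42)
model = RandomForestRegressor(random_state=42)
scores = cross_val_score(model, X, y, cv=5)  # k = 5 folds
print(scores.mean(), scores.std())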


GridSearchCV

Grid search is one of the most basic hyper parameter tuning techniques, and its implementation is quite simple. A model is built for every possible combination of the supplied hyper parameter values, the performance of each model is evaluated, and the best performing one is selected. Since GridSearchCV builds and evaluates a model for each and every combination, the method is highly computationally expensive. The Python implementation of GridSearchCV for the random forest algorithm is shown below.

# Run GridSearchCV to tune the hyper parameters of a random forest regressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rfr = RandomForestRegressor()
k_fold_cv = 5  # 5-fold cross validation (KFold is used, since this is a regressor)
grid_params = {
    "n_estimators": [10, 50, 100],
    "max_features": ["auto", "log2", "sqrt"],  # note: "auto" was removed in scikit-learn 1.3
    "bootstrap": [True, False]
}
grid = GridSearchCV(rfr, param_grid=grid_params, cv=k_fold_cv,
                    n_jobs=1, verbose=0, return_train_score=True)
grid.fit(X_train, y_train)  # X_train, y_train from an earlier train/test split
print('Best hyper parameter:', grid.best_params_)
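
Once the search finishes, the best estimator (refit on the full training data by default) can be checked against the held-out test set. A short sketch, assuming X_test and y_test were split off earlier:

# Inspect and evaluate the best model found by the grid search
best_rfr = grid.best_estimator_
print('Best CV score:', grid.best_score_)
print('Test R^2:', best_rfr.score(X_test, y_test))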

If you look at grid_params, there are three values each for n_estimators and max_features, so these two hyper parameters alone produce 3 x 3 = 9 combinations; including the two bootstrap values, the full grid contains 18 combinations, each of which is fit and scored in every cross-validation fold.
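
A small sketch using scikit-learn's ParameterGrid on the grid_params dictionary defined above makes that count explicit:

# Count every combination GridSearchCV will try
from sklearn.model_selection import ParameterGrid
n_combos = len(ParameterGrid(grid_params))
print(n_combos)               # 18 hyper parameter combinations
print(n_combos * k_fold_cv)   # 90 model fits with 5-fold cross-validation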


The full set of hyper parameter permutations can generate a huge number of models, and as the data size grows the time needed to fit them all increases drastically. This is why data scientists prefer RandomizedSearchCV over GridSearchCV when dealing with huge datasets.

RandomizedSearchCV

In RandomizedSearchCV, instead of providing a discrete set of values for each hyper parameter, we can provide either a statistical distribution or a list of values, and the values tried are sampled at random from it. Only a fixed number of combinations (n_iter) is built and evaluated. The Python implementation of RandomizedSearchCV for the random forest algorithm is shown below.

# Run RandomizedSearchCV to tune the hyper parameters of a random forest regressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rfr = RandomForestRegressor()
k_fold_cv = 5  # 5-fold cross validation
params = {
    "n_estimators": [10, 50, 100],
    "max_features": ["auto", "log2", "sqrt"],
    "bootstrap": [True, False]
}
random = RandomizedSearchCV(rfr, param_distributions=params, cv=k_fold_cv,
                            n_iter=5, scoring='neg_mean_absolute_error', verbose=2,
                            random_state=42, n_jobs=-1, return_train_score=True)
random.fit(X_train, y_train)  # X_train, y_train from an earlier train/test split
print('Best hyper parameter:', random.best_params_)
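
The example above samples from plain lists, but param_distributions also accepts scipy.stats distributions. A sketch of that variant, reusing rfr, k_fold_cv and the training data from above (the particular distributions chosen here are illustrative assumptions, not from the original article):

# Sample hyper parameter values from distributions instead of fixed lists
from scipy.stats import randint, uniform
dist_params = {
    "n_estimators": randint(10, 200),    # any integer in [10, 200)
    "max_features": uniform(0.1, 0.9),   # fraction of features in [0.1, 1.0)
    "bootstrap": [True, False]
}
random_dist = RandomizedSearchCV(rfr, param_distributions=dist_params,
                                 cv=k_fold_cv, n_iter=5, random_state=42)
random_dist.fit(X_train, y_train)
print('Best hyper parameter:', random_dist.best_params_)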

We can conclude that GridSearchCV is practical only when the dataset and the hyper parameter grid are small. For larger datasets and larger search spaces, RandomizedSearchCV typically reaches a comparable result at a fraction of the computational cost.

Hope you got some insights from the article. Follow for more!
