Different types of Hyper-Parameter Tuning.
This article shows Python implementations of several hyperparameter tuning techniques using a RandomForest model.
Contents:
→ Importance of Hyper-Parameter Tuning!
→ Hyperparameter Tuning/Optimization
→ Defining Functions
→ Checking Performance on Base Model
→ Different Hyperparameter Tuning Methods
1. GridSearch
2. RandomSearch
3. Successive Halving
4. Bayesian Optimizers
5. Manual Search
→ Difference between Parameters and Hyperparameters
→ Conclusion
Hyperparameters are the soul of any model in today's ML world. Their values cannot be learned from the data, so they must be set manually, and they control the whole learning process.
Hyperparameters need to be set before fitting the data in order to get a more robust and optimized model.
Importance of Hyper-Parameter Tuning!
- The goal of any model is to achieve minimum error; hyperparameters help achieve that, as they are largely responsible for the outcome of any ML model.
- They influence the convergence of any ML algorithm to a large extent.
Hyperparameter Tuning/Optimization
The process of searching for the optimal hyperparameter values of a machine learning algorithm is called hyperparameter tuning/optimization.
I will use the pulsar star data; you can download it from the Kaggle link.
Complete Code can be found in my GitHub repo.
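The snippets below assume the usual imports are already in place. Here is a minimal sketch of what that import block likely looks like; the module names are inferred from the calls used later in the article rather than copied from the repo, and the library-specific imports (hyperopt, optuna, skopt) appear in their own sections:
import time
from pprint import pprint

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score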
Defining Functions
Function to evaluate Train Set.
def eval_model_train(model):
    # calculate the metrics on the train data
    pred = model.predict(x_train)
    Precision = precision_score(y_train, pred)
    Recall = recall_score(y_train, pred)
    F1_Score = f1_score(y_train, pred)
    return pred, Precision, Recall, F1_Score
Function to evaluate Test Set
def eval_model_test(model):
    # calculate the metrics on the test data
    pred = model.predict(x_test)
    Precision = precision_score(y_test, pred)
    Recall = recall_score(y_test, pred)
    F1_Score = f1_score(y_test, pred)
    return pred, Precision, Recall, F1_Score
Function to calculate execution time
def exec_time(start, end):
    # convert the elapsed seconds into an h:m:s string
    diff_time = end - start
    m, s = divmod(diff_time, 60)
    h, m = divmod(m, 60)
    s, m, h = int(round(s, 0)), int(round(m, 0)), int(round(h, 0))
    return f"{h}:{m}:{s}"
Checking Performance on Base Model
→ Checking default Parameters of the RandomForest Base Model
Rf_model = RandomForestClassifier()
pprint(Rf_model.get_params())
start_base = time.time()
Rf_model.fit(x_train,y_train)
end_base = time.time()
basemodel_time = exec_time(start_base, end_base)
basemodel_time
Performance on Train set
_, precision_basetrain, recall_basetrain, f1_basetrain = eval_model_train(Rf_model)
print("Precision = {} \nRecall = {} \nf1 = {}".format(precision_basetrain, recall_basetrain, f1_basetrain))
Performance on Test set
_, precision_basetest, recall_basetest, f1_basetest = eval_model_test(Rf_model)
print("Precision = {} \nRecall = {} \nf1 = {}".format(precision_basetest, recall_basetest, f1_basetest))
Different Hyperparameter tuning methods:
1. GridSearch:
- Grid search builds every possible combination of the hyperparameter values passed in the grid, evaluates each one, and returns the best.
- This means the entire grid of candidate values is searched exhaustively.
- GridSearch may suffer from the Curse of Dimensionality: the more parameters (and values) we pass, the more time and memory the search takes.
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (higher feature count) that do not occur in low-dimensional spaces (lower feature count).
This means every dimension we add increases the time complexity of the search, ultimately making this strategy inconvenient for large grids.
Providing a dictionary of hyperparameters
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=1500, num=3)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 80, num=3)]
# Minimum number of samples required to split a node
min_samples_split = [2, 10, 15]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 4, 9]
# Method of selecting samples for training each tree
bootstrap = [True, False]

para = {'n_estimators': n_estimators,
        'max_features': max_features,
        'max_depth': max_depth,
        'min_samples_split': min_samples_split,
        'min_samples_leaf': min_samples_leaf,
        'bootstrap': bootstrap}

pprint(para)  # print our grid of hyperparameter values
Now, we fit the GridSearch model to find the set of optimal hyperparameter values.
The model will try out 324 combinations of hyperparameters, which gives you an idea of how grid search increases the time complexity:
2 values of bootstrap
3 values of max_depth
2 values of max_features
3 values of min_samples_leaf
3 values of min_samples_split
3 values of n_estimators
which gives 2 × 3 × 2 × 3 × 3 × 3 = 324 combinations, as the quick check below confirms.
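To verify this count in code, sklearn's ParameterGrid can enumerate the grid directly; a small illustrative check (not part of the original notebook), assuming the para dictionary defined above:
from sklearn.model_selection import ParameterGrid

n_candidates = len(ParameterGrid(para))   # 324 candidate combinations
print(n_candidates, "candidates ->", n_candidates * 5, "fits with cv=5")
With cv=5, GridSearchCV will actually perform 324 × 5 = 1620 model fits, which is what its verbose output reports.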
start_gridsearch = time.time()

grid_search = GridSearchCV(estimator=Rf_model,
                           param_grid=para,
                           scoring="f1",
                           cv=5, n_jobs=-1, verbose=1)

# Fit the grid search model
grid_search.fit(x_train, y_train)

end_gridsearch = time.time()
gridsearchmodel_time = exec_time(start_gridsearch, end_gridsearch)
gridsearchmodel_time
grid_search.best_params_ #outputs the set of best hyperparameter values.
Performance on Train Set
_, precision_gridtrain, recall_gridtrain, f1_gridtrain = eval_model_train(grid_search)
print("Precision = {} \n Recall = {} \n f1 = {}".format(precision_gridtrain, recall_gridtrain, f1_gridtrain))
Performance on Test Set
_, precision_gridtest, recall_gridtest, f1_gridtest = eval_model_test(grid_search)
print("Precision = {} \n Recall = {} \n f1 = {}".format(precision_gridtest, recall_gridtest, f1_gridtest))
2. RandomSearch:
- Random Search avoids the exhaustive search done by GridSearch by sampling hyperparameter combinations at random.
- Since the selection of combinations is completely random, the results can show high variance between runs.
- For example, instead of checking all 100 candidate combinations, RandomSearch might check only 50 random ones.
- However, there is a trade-off to decreasing the time complexity: it is good at testing a wide range of values and normally reaches a very good combination very fast, but it does not guarantee the best parameter combination.
Using the same dictionary of hyperparameters
Now, we fit the RandomSearch model. This may take some time to execute, depending on the size of the data.
Note:
→ The most important argument in RandomizedSearchCV is n_iter, which controls the number of different hyperparameter combinations to try (it defaults to 10 if not set explicitly).
→ cv is the number of folds to use for cross-validation. Increasing the number of folds reduces the chance of overfitting but increases the run time.
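As a side note, RandomizedSearchCV can also sample each hyperparameter from a continuous distribution instead of a fixed list. A small optional sketch using scipy.stats; this para_dist dictionary is illustrative and is not used in the runs below, which keep the same para grid:
from scipy.stats import randint

# distributions are sampled fresh on every iteration instead of
# enumerating a fixed list of values
para_dist = {'n_estimators': randint(200, 1500),
             'max_depth': randint(10, 80),
             'min_samples_split': randint(2, 15),
             'min_samples_leaf': randint(1, 9),
             'max_features': ['auto', 'sqrt'],
             'bootstrap': [True, False]}
# para_dist could be passed as param_distributions in place of para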
start_randomsearch = time.time()

random_search = RandomizedSearchCV(estimator=Rf_model,
                                   param_distributions=para,
                                   cv=5, verbose=1, random_state=42,
                                   scoring="f1", n_jobs=-1)

# Fit the random search model
random_search.fit(x_train, y_train)

end_randomsearch = time.time()
randomsearchmodel_time = exec_time(start_randomsearch, end_randomsearch)
randomsearchmodel_time
random_search.best_params_
random_search.best_params_
Performance on Train Set
_, precision_randtrain, recall_randtrain, f1_randtrain = eval_model_train(random_search)
print("Precision = {} \n Recall = {} \n f1 = {}".format(precision_randtrain, recall_randtrain, f1_randtrain))
Performance on Test Set
_, precision_randtest, recall_randtest, f1_randtest = eval_model_test(random_search)
print("Precision = {} \n Recall = {} \n f1 = {}".format(precision_randtest, recall_randtest, f1_randtest))
3. Successive Halving:
Scikit-learn also provides the HalvingGridSearchCV and HalvingRandomSearchCV estimators, which can be used to search a parameter space using successive halving.
- Successive halving (SH) is like a tournament among candidate parameter combinations.
- SH is an iterative selection process where all candidates (the parameter combinations) are evaluated with a small amount of resources at the first iteration.
- Only some of these candidates are selected for the next iteration, which will be allocated more resources.
- For parameter tuning, the resource is typically the number of training samples, but it can also be an arbitrary numeric parameter such as n_estimators in a random forest.
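To build intuition for that schedule, here is a rough plain-Python sketch of how successive halving allocates resources; the numbers (324 candidates, 100 starting samples, elimination factor 3) are illustrative, with factor=3 being the default in scikit-learn's halving estimators:
candidates = 324     # parameter combinations, e.g. the grid defined earlier
resources = 100      # e.g. training samples given to each candidate in round 1
factor = 3           # default elimination factor in the halving search estimators

round_no = 1
while candidates > 1:
    print(f"round {round_no}: {candidates} candidates, ~{resources} samples each")
    candidates = max(1, candidates // factor)   # keep roughly the best 1/factor
    resources *= factor                         # survivors get factor times more resources
    round_no += 1
print(f"round {round_no}: {candidates} candidate evaluated with the largest budget")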
3.1 — Halving GridSearch
Using the same dictionary of hyperparameters
# the halving search estimators are still experimental in scikit-learn,
# so the explicit enabling import is required
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingGridSearchCV, HalvingRandomSearchCV

start_halvinggrid = time.time()

Halving_grid_search = HalvingGridSearchCV(estimator=Rf_model, param_grid=para,
                                          cv=5, verbose=1, random_state=42,
                                          n_jobs=-1)

# Fit the halving grid search model
Halving_grid_search.fit(x_train, y_train)

end_halvinggrid = time.time()
halvinggridmodel_time = exec_time(start_halvinggrid, end_halvinggrid)
halvinggridmodel_time
Checking Best Parameters
Halving_grid_search.best_params_
Performance on Train Set
_, precision_halvinggridtrain, recall_halvinggridtrain, f1_halvinggridtrain = eval_model_train(Halving_grid_search)
print("Precision = {} \n Recall = {} \n f1 = {}".format(precision_halvinggridtrain, recall_halvinggridtrain, f1_halvinggridtrain))
Performance on Test Set
_, precision_halvinggridtest, recall_halvinggridtest, f1_halvinggridtest = eval_model_test(Halving_grid_search)
print("Precision = {} \n Recall = {} \n f1 = {}".format(precision_halvinggridtest, recall_halvinggridtest, f1_halvinggridtest))
3.2 — Halving RandomSearch
Using the same dictionary of hyperparameters
start_halvingrandom = time.time()

Halving_random_search = HalvingRandomSearchCV(estimator=Rf_model,
                                              param_distributions=para,
                                              cv=5, n_jobs=-1, verbose=1)

# Fit the halving random search model
Halving_random_search.fit(x_train, y_train)

end_halvingrandom = time.time()
halvingrandommodel_time = exec_time(start_halvingrandom, end_halvingrandom)
halvingrandommodel_time
Checking Best Parameters
Halving_random_search.best_params_
Performance on Train Set
_, precision_halvingrandtrain, recall_halvingrandtrain, f1_halvingrandtrain = eval_model_train(Halving_random_search)
print("Precision = {} \n Recall = {} \n f1 = {}".format(precision_halvingrandtrain, recall_halvingrandtrain, f1_halvingrandtrain))
Performance on Test Set
_, precision_halvingrandtest, recall_halvingrandtest, f1_halvingrandtest = eval_model_test(Halving_random_search)
print("Precision = {} \n Recall = {} \n f1 = {}".format(precision_halvingrandtest, recall_halvingrandtest, f1_halvingrandtest))
Complete Code can be found in my GitHub repo.
4. Bayesian Optimizers:
Bayesian optimization builds a probabilistic model of the objective function from past evaluations and uses it to pick the next hyperparameter combination to try, so it typically needs far fewer evaluations than an exhaustive search.
4.1 — Hyperopt
Hyperopt is a Python library for serial and parallel optimization over awkward search spaces, which may include real-valued, discrete, and conditional dimensions.
Defining Search Space
space = {
    "n_estimators": hp.choice("n_estimators", [200, 850, 1500]),
    "max_depth": hp.quniform("max_depth", 10, 80, 5),
    "max_features": hp.choice("max_features", ["auto", "sqrt"]),
    "min_samples_split": hp.choice("min_samples_split", [2, 10, 15]),
    "min_samples_leaf": hp.choice("min_samples_leaf", [1, 4, 9]),
    "bootstrap": hp.choice("bootstrap", [True, False])
}
Defining Function to minimize
def tune_random(params):
    # hp.quniform returns floats, so cast max_depth back to an int
    params["max_depth"] = int(params["max_depth"])
    rand = RandomForestClassifier(**params, n_jobs=-1)
    score = cross_val_score(rand, x_train, y_train,
                            scoring="f1", cv=5).mean()
    # fmin minimizes the loss, so negate the f1 score in order to maximize it
    return {"loss": -score, "status": STATUS_OK}
Minimizing the function
start_hpot = time.time()

trials = Trials()
best = fmin(
    fn=tune_random,
    space=space,
    algo=tpe.suggest,
    max_evals=100,
    trials=trials
)

end_hpot = time.time()
hpotmodel_time = exec_time(start_hpot, end_hpot)
hpotmodel_time
Checking Best Parameters
print("Best: {}".format(best))
Fitting Base Model with the set of best Parameters
rf_hyperopt = RandomForestClassifier(n_estimators=200,
max_depth=35,
max_features='auto',
min_samples_split=10,
min_samples_leaf=9,
bootstrap = True).fit(x_train,y_train)
Performance on Train Set
_, precision_hptrain, recall_hptrain, f1_hptrain = eval_model_train(rf_hyperopt)
print("Precision = {} \n Recall = {} \n f1 = {}".format(precision_hptrain, recall_hptrain, f1_hptrain))
Performance on Test Set
_, precision_hptest, recall_hptest, f1_hptest = eval_model_test(rf_hyperopt)
print("Precision = {} \n Recall = {} \n f1 = {}".format(precision_hptest, recall_hptest, f1_hptest))
4.2 — Optuna
Optuna is an automatic hyperparameter optimization framework; its key features include:
- Eager dynamic search spaces
- Efficient sampling and pruning algorithms
- Easy integration
- Good visualizations
- Distributed optimization
Defining Function
def objective(trial):
    n_estimators = trial.suggest_int("n_estimators", 200, 1500)
    max_features = trial.suggest_categorical("max_features", ["auto", "sqrt"])
    max_depth = trial.suggest_int("max_depth", 10, 80, log=True)
    min_samples_split = trial.suggest_int("min_samples_split", 2, 15)
    min_samples_leaf = trial.suggest_int("min_samples_leaf", 1, 9)
    bootstrap = trial.suggest_categorical("bootstrap", [True, False])

    rand = RandomForestClassifier(n_estimators=n_estimators,
                                  max_features=max_features,
                                  max_depth=max_depth,
                                  min_samples_leaf=min_samples_leaf,
                                  min_samples_split=min_samples_split,
                                  bootstrap=bootstrap)

    score_cr = cross_val_score(rand, x_train, y_train,
                               n_jobs=-1, cv=5, scoring='f1')
    score = score_cr.mean()
    return score
Creating Study
# the objective returns the mean f1 score, so we want to maximize it
study = optuna.create_study(direction='maximize')
Minimizing the Function
start_optuna = time.time()

optuna.logging.set_verbosity(optuna.logging.WARNING)
study.optimize(objective, n_trials=100)

end_optuna = time.time()
optunamodel_time = exec_time(start_optuna, end_optuna)
optunamodel_time
Checking Best Parameters
for key, value in study.best_trial.params.items():
    print(f'{key}: {value}')
Fitting Base Model with the set of best Parameters
rf_optuna = RandomForestClassifier(**study.best_trial.params).fit(x_train,y_train)
Performance on Train Set
_, precision_opttrain, recall_opttrain, f1_opttrain = eval_model_train(rf_optuna)
print("Precision = {} \n Recall = {} \n f1 = {}".format(precision_opttrain, recall_opttrain, f1_opttrain))
Performance on Test Set
_, precision_opttest, recall_opttest, f1_opttest = eval_model_test(rf_optuna)
print("Precision = {} \n Recall = {} \n f1 = {}".format(precision_opttest, recall_opttest, f1_opttest))
Plotting Optimization History
optuna.visualization.plot_optimization_history(study)
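Optuna can also estimate which hyperparameters influenced the objective the most; an optional one-liner (like the history plot above, it renders with plotly):
# rank hyperparameters by their estimated importance for the objective
optuna.visualization.plot_param_importances(study)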
4.3 — Scikit-Optimize
- Sequential model-based optimization
- Built on NumPy, SciPy, and Scikit-Learn
- Open source, commercially usable
Skopt's gp_minimize function performs Bayesian optimization using Gaussian processes.
Defining Search Space
space = [
Integer(200,1500,name = "n_estimators"),
Integer(10, 80, name = "max_depth"),
Categorical(["auto", "sqrt"], name = "max_features"),
Integer(2,15, name = "min_samples_split"),
Integer(1,9, name = "min_samples_leaf"),
Categorical([True,False], name = "bootstrap")
]
Defining Objective Function to minimize
# this decorator maps the positional values from the search space
# onto the parameter names we defined above
@use_named_args(space)
def objective(**params):
    Rf_model.set_params(**params)
    # gp_minimize minimizes the objective, so return the negative mean f1 score
    return -cross_val_score(Rf_model, x_train, y_train,
                            cv=5, n_jobs=-1, scoring="f1").mean()
Minimizing the Objective Function
start_skopt = time.time()

# gp_minimize runs 100 evaluations by default (n_calls=100)
tune_rand_gp = gp_minimize(objective, space, random_state=1234)

end_skopt = time.time()
skoptmodel_time = exec_time(start_skopt, end_skopt)
skoptmodel_time
Checking Best Parameters
print(f"Best parameters: \n")
print(f'n_estimators={tune_rand_gp.x[0]}')
print(f'max_depth={tune_rand_gp.x[1]}')
print(f'max_features={tune_rand_gp.x[2]}')
print(f'min_samples_split={tune_rand_gp.x[3]}')
print(f'min_samples_leaf={tune_rand_gp.x[4]}')
print(f'bootstrap = {tune_rand_gp.x[5]}')
Fitting Base Model with the set of best Parameters
rf_skopt = RandomForestClassifier(n_estimators=200,
max_depth=67,
max_features='sqrt',
min_samples_split=2,
min_samples_leaf=9,
bootstrap = True).fit(x_train,y_train)
Performance on Train Set
_, precision_sktrain, recall_sktrain, f1_sktrain = eval_model_train(rf_skopt)
print("Precision = {} \n Recall = {} \n f1 = {}".format(precision_sktrain, recall_sktrain, f1_sktrain))
Performance on Test Set
_, precision_sktest, recall_sktest, f1_sktest = eval_model_test(rf_skopt)
print("Precision = {} \n Recall = {} \n f1 = {}".format(precision_sktest, recall_sktest, f1_sktest))
Plotting Convergence Graph
plot_convergence(tune_rand_gp)
4.4 — BayesSearchCV
At the time of writing, BayesSearchCV (from scikit-optimize) is not compatible with scikit-learn 0.24.
To use BayesSearchCV, downgrade scikit-learn to 0.23.2.
Defining Search Space
param_bayes = {
"n_estimators": Integer(200,1500),
"max_depth": Integer(10, 80),
"max_features": Categorical(["auto", "sqrt"]),
"min_samples_split": Integer(2,15),
"min_samples_leaf": Integer(1,9),
"bootstrap": Categorical([True,False])
}
Fitting the BayesSearchCV
bayes_rf = BayesSearchCV(Rf_model,
                         search_spaces=param_bayes,
                         cv=5,
                         scoring="f1",
                         refit=True)

start_bayes = time.time()
bayes_rf.fit(x_train, y_train)
end_bayes = time.time()

bayesmodel_time = exec_time(start_bayes, end_bayes)
bayesmodel_time
Checking Best Parameters
bayes_rf.best_params_
Performance on Train Set
_, precision_bayestrain, recall_bayestrain, f1_bayestrain = eval_model_train(bayes_rf)
print("Precision = {} \n Recall = {} \n f1 = {}".format(precision_bayestrain, recall_bayestrain, f1_bayestrain))
Performance on Test Set
_, precision_bayestest, recall_bayestest, f1_bayestest = eval_model_test(bayes_rf)
print("Precision = {} \n Recall = {} \n f1 = {}".format(precision_bayestest, recall_bayestest, f1_bayestest))
Plotting Objective
bayes_rf_plot = plot_objective(bayes_rf.optimizer_results_[0],
dimensions=["n_estimators", "max_depth", "max_features", "min_samples_split", "min_samples_leaf", "bootstrap"],
n_minimum_search=int(1e8))
plt.show()
5. Manual Search:
- Manual search is done on the basis of our own judgment/experience.
- We train the model with hyperparameter values that we assign manually, evaluate its performance, and start the process again.
- This loop is repeated until a satisfactory score is reached, as sketched below.
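A minimal sketch of what such a manual loop might look like, reusing the eval_model_test helper defined earlier; the candidate values below are illustrative guesses, not the article's:
# hand-picked candidates, chosen from judgment/experience
manual_candidates = [
    {"n_estimators": 200, "max_depth": 10},
    {"n_estimators": 850, "max_depth": 40},
    {"n_estimators": 1500, "max_depth": 80},
]

for params in manual_candidates:
    model = RandomForestClassifier(**params, n_jobs=-1).fit(x_train, y_train)
    _, precision, recall, f1 = eval_model_test(model)
    print(params, "-> f1 =", round(f1, 3))

# inspect the scores, adjust the candidates by hand, and repeat until satisfied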
Difference between Parameters and Hyperparameters
→ Model Parameters: These are learned from the data while the model is training.
Model parameters differ from experiment to experiment and depend entirely on the type of data passed and the task being solved.
Some examples of model parameters include:
- The weights in an artificial neural network(ANN).
- The support vectors in a support vector machine.
- The coefficients in linear regression or logistic regression.
- For NLP tasks: word frequency, sentence length, noun or verb distribution per sentence, the number of specific character n-grams per word, lexical diversity, etc.
→ Hyperparameters: These are the values that must be supplied to the model, for any given data and task, to obtain optimal performance.
Some examples of model hyperparameters include:
- The learning rate for training a neural network.
- The C and sigma hyperparameters for support vector machines.
- The k in k-nearest neighbors.
- The maximum depth of the tree in decision trees.
Conclusion
After using all the different methods, we create a dataframe from the results so that we can compare each of the techniques.
models = ['RandomForest', 'RandomForest_gridsearch',
          'RandomForest_randomsearch', 'RandomForest_Halvinggrid',
          'RandomForest_Halvingrandom', 'RandomForest_hyperopt',
          'RandomForest_optuna', 'RandomForest_skopt',
          'RandomForest_bayes']

model_time = [basemodel_time, gridsearchmodel_time, randomsearchmodel_time,
              halvinggridmodel_time, halvingrandommodel_time, hpotmodel_time,
              optunamodel_time, skoptmodel_time, bayesmodel_time]

model_precision_train = [precision_basetrain, precision_gridtrain,
                         precision_randtrain, precision_halvinggridtrain,
                         precision_halvingrandtrain, precision_hptrain,
                         precision_opttrain, precision_sktrain,
                         precision_bayestrain]

model_recall_train = [recall_basetrain, recall_gridtrain, recall_randtrain,
                      recall_halvinggridtrain, recall_halvingrandtrain,
                      recall_hptrain, recall_opttrain, recall_sktrain,
                      recall_bayestrain]

model_f1_train = [f1_basetrain, f1_gridtrain, f1_randtrain,
                  f1_halvinggridtrain, f1_halvingrandtrain, f1_hptrain,
                  f1_opttrain, f1_sktrain, f1_bayestrain]

model_precision_test = [precision_basetest, precision_gridtest,
                        precision_randtest, precision_halvinggridtest,
                        precision_halvingrandtest, precision_hptest,
                        precision_opttest, precision_sktest,
                        precision_bayestest]

model_recall_test = [recall_basetest, recall_gridtest, recall_randtest,
                     recall_halvinggridtest, recall_halvingrandtest,
                     recall_hptest, recall_opttest, recall_sktest,
                     recall_bayestest]

model_f1_test = [f1_basetest, f1_gridtest, f1_randtest,
                 f1_halvinggridtest, f1_halvingrandtest, f1_hptest,
                 f1_opttest, f1_sktest, f1_bayestest]

comp_dict = {"models": models,
             "model_time": model_time,
             "model_precision_train": [round(i, 3) for i in model_precision_train],
             "model_precision_test": [round(i, 3) for i in model_precision_test],
             "model_recall_train": [round(i, 3) for i in model_recall_train],
             "model_recall_test": [round(i, 3) for i in model_recall_test],
             "model_f1_train": [round(i, 3) for i in model_f1_train],
             "model_f1_test": [round(i, 3) for i in model_f1_test]}

comparison = pd.DataFrame(comp_dict)
comparison
→ Sorting with respect to f1 score on test
comparison.set_index('models').sort_values('model_f1_test', ascending = False).head(3)
→ Sorting with respect to difference between f1 score on train and test
comparison['Diff_f1_train_test'] = np.abs(comparison['model_f1_train'] - comparison['model_f1_test'])
comparison.set_index('models').sort_values('Diff_f1_train_test').head(3)
After sorting the values with respect to the F1 score on the train and test sets, it turns out that the Bayesian techniques worked best.
However, in a production environment we need not only the best result but also one obtained as quickly as possible, and in that respect RandomSearch performed best.
Complete Code can be found in my GitHub repo.