Finding the optimal parameters in a RF model using pipes for code reuse

Jonathan Loscalzo
Hexacta Engineering
5 min read · Apr 16, 2020

This story is the second part of this one, where we select a model after evaluating many results.

After selecting a model to train, we have to find which hyperparameters perform best, train a final model and submit our predictions.

How can we select the best hyperparameters? The two main methods from scikit-learn are: GridSearch and RandomizedSearch.

GridSearch: exhaustive search

In GridSearch, we generate hyperparameter candidates from a grid of parameters, and internally it tests every possible combination of them. Once all combinations have been tried, the model that achieved the best objective metric (in this case, accuracy) is selected, together with its hyperparameters.
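To make “every possible combination” concrete, here is a tiny illustrative grid (the values are made up for this example) expanded with scikit-learn’s ParameterGrid:

from sklearn.model_selection import ParameterGrid

# hypothetical grid: 2 x 2 = 4 candidates
grid = {'n_estimators': [100, 200], 'max_depth': [5, 10]}
list(ParameterGrid(grid))
# [{'max_depth': 5, 'n_estimators': 100}, {'max_depth': 5, 'n_estimators': 200},
#  {'max_depth': 10, 'n_estimators': 100}, {'max_depth': 10, 'n_estimators': 200}]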

RandomizedSearch: does random work well?

On the other hand, in RandomizedSearch not all hyperparameter combinations are tried out. RandomizedSearch expects a parameter n_iter, which is the number of parameter settings that are evaluated. What is the difference? RandomizedSearch usually treats parameters as distributions and randomly samples some of them, so the search takes less time to finish (because it does not try out the full parameter space).
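A minimal sketch of what this looks like with scikit-learn’s RandomizedSearchCV (the distributions and ranges below are illustrative, not the ones from our project):

from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# hyperparameters are sampled from distributions instead of a fixed grid
param_distributions = {
    'n_estimators': randint(100, 1000),
    'max_depth': randint(5, 50),
}
rs = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions=param_distributions,
    n_iter=20,            # only 20 random settings are evaluated
    scoring='accuracy',
    cv=5,
)
rs.fit(X_train, y_train)  # fit on the training split introduced later in the post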

In their paper, Bergstra and Bengio show that Random Search is more efficient for hyperparameter optimization than Grid Search.

Regardless of the approach we select, what is the catch? If the parameter space is wrong, you could waste a lot of time and still not find the right parameters. For instance, in a RandomForest model, you could search values for n_estimators between 100 and 500, but the best value might be 750.

In fact, other more advanced, though more expensive, techniques exist. To name a few: Bayesian optimization and evolutionary optimization.

We selected GridSearch, the simplest option. Remember, though, that we could pivot or try another technique to figure out a combination that gives us a better result.

5. Tuning Hyperparameters: GridSearch

After selecting the RandomForest model and our search technique, we are going to figure out the best hyperparameters for our problem. This task takes a long time to compute if the parameter space is too broad: the larger our search space, the longer it will take.

In the future, we could probably improve this solution by pivoting to a parallelized approach or by mixing RandomizedSearch and GridSearch.

As you remember from our first post, we use a pipeline for preprocessing and training a classifier:
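A simplified sketch of that pipeline, assuming a get_pipeline_model helper that chains a preprocess step with a clf step (the real preprocessing is built in the first post), could look like this:

from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def get_pipeline_model(model):
    # 'preprocess' and 'clf' are the step names GridSearch will refer to below
    preprocess = ColumnTransformer(
        [('cat', OneHotEncoder(handle_unknown='ignore'),
          make_column_selector(dtype_include=object))],
        remainder='passthrough',
    )
    return Pipeline([
        ('preprocess', preprocess),
        ('clf', model),
    ])

pipe_rf = get_pipeline_model(RandomForestClassifier())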

To train all the parameter combinations, execute code similar to the following:
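A sketch of that search (the values in params_rf and the exact scorers are assumptions, chosen only so that the grid has the 432 candidates counted below):

from sklearn.model_selection import GridSearchCV, StratifiedKFold

# hyperparameters of the RandomForest are addressed through the 'clf' step
params_rf = {
    'clf__n_estimators': [100, 200, 300, 500],
    'clf__max_depth': [10, 20, 30, None],
    'clf__min_samples_split': [2, 5, 10],
    'clf__min_samples_leaf': [1, 2, 4],
    'clf__max_features': ['sqrt', 'log2', None],
}
gs_rf = GridSearchCV(
    pipe_rf,                        # preprocessing + RandomForest pipeline
    param_grid=params_rf,
    scoring=['accuracy', 'f1_macro'],
    refit='accuracy',               # keep the model that is best on accuracy
    cv=StratifiedKFold(n_splits=5),
    n_jobs=-1,                      # parallelize the search across all cores
)
gs_rf.fit(X_train, y_train)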

As you can see above, GridSearch evaluates hyperparameters over the pipeline. Remember that our pipeline was built from other pipes, whose steps are named clf and preprocess. GridSearch accepts parameters of the form <component>__<parameter>, so it is possible to update each component of a nested object, in this case a pipeline.
So, if we had wanted to find hyperparameters for our preprocessing pipe, we could have used that technique.
But how many combinations do we have?

len(ParameterGrid(params_rf)) # 432 combinations!

Of course, that is a lot of time… As homework, we challenge you to find the best hyperparameter combination. As a tip, you could reduce the search parameter space, use some tool like tune, or try another hyperparameter tuning technique.

StratifiedKFold is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each target class.
We set refit to accuracy because we want the hyperparameters that optimize accuracy rather than f1-score; in this case, accuracy is our objective metric.
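As a quick, toy check of the stratification (not part of the original code), each validation fold keeps roughly the same class proportions as y_train:

import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)
for fold, (train_idx, valid_idx) in enumerate(skf.split(X_train, y_train)):
    y_fold = np.asarray(y_train)[valid_idx]
    _, counts = np.unique(y_fold, return_counts=True)
    # the class proportions in each fold mirror those of the whole training set
    print(fold, counts / counts.sum())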

Once the fit has finished, gs_rf exposes best_params_, a dictionary with the hyperparameters that were selected by the exhaustive search.
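For example, assuming rf_params (used for the final model below) is simply that dictionary:

rf_params = gs_rf.best_params_   # e.g. {'clf__max_depth': ..., 'clf__n_estimators': ...}
print(gs_rf.best_score_)         # mean cross-validated accuracy of the best candidate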

6. Evaluate with selected hyperparameters

We could use hold-out or k-fold cross-validation methods for evaluating these hyperparameters.

Hold-out is when we split the dataset into train and test sets (sometimes train, validation and test sets). We fit on the train set and evaluate on the test set.

K-fold cross-validation is when we split our dataset into k folds. Then we train a model on k-1 folds and evaluate it on the remaining fold. The process is repeated for each fold, so in the end we have trained k models.
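A compact sketch of k-fold evaluation with scikit-learn’s cross_val_score, assuming the pipe_rf pipeline from above:

from sklearn.model_selection import cross_val_score

# one accuracy score per fold; their mean estimates performance on unseen data
scores = cross_val_score(pipe_rf, X, y, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())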

We could choose hold-out when we have a large dataset or when we are just starting to build a model. CV is usually preferred because, by training k models, it gives a more reliable estimate of how the model will perform on unseen data.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.33,
    stratify=y  # stratify by target labels
)

If you remember, we had already chosen hold-out, because we fit the grid search with X_train and y_train.

Then we print a report with the model fitted on train data and evaluated on test data:
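A sketch of that report, assuming we use the estimator that the grid search refit with the best parameters:

from sklearn.metrics import classification_report

best_rf = gs_rf.best_estimator_   # pipeline refit on X_train with the best params
y_pred = best_rf.predict(X_test)
print(classification_report(y_test, y_pred))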

As we can see, functional needs repair is the imbalanced label: its support is much lower than that of functional and non-functional. Our model cannot perform as well on it as on the other labels because it does not generalize well over this minority class.

The accuracy is roughly 80% on unseen data (variance) and 95% on the train set (bias). What might be happening here is that our model is overfitting. Please see the bias-variance tradeoff and learning curves.

7. Train with the whole dataset! Final Model.

At this point, we fit the final model, which is composed of the preprocessing pipeline and the RandomForest model, then we predict and submit the results to the competition.

from datetime import datetime
import pandas as pd

rf_final = (
    get_pipeline_model(RandomForestClassifier())
    .set_params(**rf_params)
)
rf_final = rf_final.fit(X, y)

# predict on the competition test set and map labels back to their names
predictions = rf_final.predict(datatest)
predictions = y_transformer.inverse_transform(predictions)

# save the submission with a timestamp
now_str = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
pd.DataFrame(predictions, index=datatest.index, columns=['status_group']) \
    .to_csv(f'../data/results/results-rf-{now_str}.csv')

We would expect our model to obtain roughly 80%, as we saw in the test report.

Submit the file and, well done! We obtained a good result (although it will change in the future).

As homework, we challenge you to find the best hyperparameter combination and submit your predictions to the competition!

The first article can be found here: Choosing between ML models using pipes for code reuse. If you want to read more, these articles are based on this and this. The full code is here.

