Performing hyperparameter tuning using scikit-learn

Anya Pfeiffer
Jun 5 · 7 min read

Objectives

By the end of this piece, you should understand:

  • The difference between hyperparameters and parameters
  • Why hyperparameters should be tuned
  • The difference between grid and randomized searches

Key Terms: hyperparameter, parameter, hyperparameter tuning, grid search, randomized search


Getting Started

Before jumping into different methods of hyperparameter tuning, let’s make sure we understand what hyperparameters are, and by extension, why the process of hyperparameter tuning is important when creating machine learning models.

Hyperparameters are settings that are external to a model: they are chosen by the programmer rather than learned during training. By contrast, parameters are the values that are estimated from the data. These are internal to your model and aren’t set by the programmer of the model. For a more in-depth description of the differences between parameters and hyperparameters, check out this article.
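To make the distinction concrete, here is a minimal sketch. It assumes scikit-learn's bundled copy of the breast cancer data (load_breast_cancer) as a stand-in dataset, and the value C=1.0 and the liblinear solver are arbitrary choices for illustration. C is a hyperparameter picked before fitting, while coef_ and intercept_ are parameters estimated from the data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

# Stand-in data (assumption): scikit-learn's bundled breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)

# Hyperparameter: chosen by the programmer before training starts
model = LogisticRegression(C=1.0, solver="liblinear")
print(model.get_params()["C"])  # the value we set, not something learned

# Parameters: estimated from the data when fit() is called
model.fit(X, y)
print(model.coef_.shape)   # one learned weight per feature
print(model.intercept_)    # learned bias term
```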

It’s good practice to adjust hyperparameters before beginning the learning process in order to improve the overall performance of a machine learning model. The process of tweaking them to find the best value for each one is hyperparameter tuning (sometimes called hyperparameter optimization). In this article, I’ll focus on grid search and randomized search, two widely used methods for hyperparameter tuning. After explaining the basics of each, we’ll look at some basic examples of implementing these methods using sklearn.


The Data

In order to display these tuning methods in Python, I used a Breast Cancer Diagnostic dataset. If you’re interested in looking at the dataset, it can be found on Kaggle and is also available through the UCI Machine Learning Repository. The dataset contains a variety of measurements calculated from the cell nuclei of a given sample, as well as a patient ID number, and a column containing “M” if the tumor was malignant and “B” if the tumor was benign.

In order to prepare the data for hyperparameter tuning, I separated the features (the measurements and patient ID numbers) from the target (the diagnosis of malignant or benign). Additionally, I replaced all “M” values in the diagnosis column with 0, and all “B” values with 1 in order to make my model predictions simpler. Finally, there was a column of “unnamed” data, which only contained NaN values, in the file I downloaded from Kaggle. I dropped this column.
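The preparation steps above might look roughly like the following sketch in pandas. The tiny DataFrame stands in for the downloaded CSV, and the column names (id, diagnosis, radius_mean, texture_mean, Unnamed: 32) and the numeric values are assumptions made for illustration, so adjust them to match your file.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the Kaggle CSV (column names and values are assumptions)
df = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [17.99, 13.54, 12.45, 20.57],
    "texture_mean": [10.38, 14.36, 15.70, 17.77],
    "Unnamed: 32": [np.nan] * 4,
})

# Drop the column that only contains NaN values
df = df.drop(columns=["Unnamed: 32"])

# Encode the diagnosis: malignant -> 0, benign -> 1
df["diagnosis"] = df["diagnosis"].map({"M": 0, "B": 1})

# Separate the features from the target
features = df.drop(columns=["diagnosis"])
target = df["diagnosis"]

print(features.columns.tolist())
print(target.tolist())
```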


Grid Search

Grid search is a method for hyperparameter tuning that finds the best hyperparameter values by checking every combination of the candidate values you supply for a given model. Said another way, grid search essentially brute forces its way through all combinations in the grid and keeps the combination with the best performance. As you can probably imagine, the more hyperparameters (and candidate values) you’re optimizing, the longer this method takes to run. That said, it’s useful because it’s exhaustive and leaves no stone unturned within the grid. It can also be used with any model. In these examples, I’ll use both a logistic regression model and a random forest classifier.
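To see what "every combination" means in practice, scikit-learn's ParameterGrid can enumerate a grid directly. This is only a sketch, and the candidate values below are arbitrary examples rather than the ones used in this article.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical candidate values: 3 choices for C, 2 choices for penalty
grid = {"C": [0.1, 1.0, 10.0], "penalty": ["l1", "l2"]}

# A grid search would fit and score the model once per combination (per fold)
combinations = list(ParameterGrid(grid))
for combo in combinations:
    print(combo)

print(len(combinations))  # 3 * 2 = 6 combinations
```

Adding a third hyperparameter with four candidate values would multiply the count to 24, which is why grid search slows down quickly as the grid grows.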

You can perform a grid search in Python using sklearn.model_selection.GridSearchCV().

Check out the documentation here.

Before running the grid search, create an object for the model you want to use. Here, we’ll start with logistic regression. Create a list of penalties, L1 and L2, to be used as well as a set of C values and hyperparameters that take both the C values and the penalties into account. In this case, C values are meant to control the strength of regularization; smaller values lead to stronger regularization and larger values weaken the regularization.
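A setup along these lines could look like the sketch below. The specific range of C values and the liblinear solver are assumptions on my part: liblinear is used here because it supports both the L1 and L2 penalties.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Model object to tune; liblinear handles both L1 and L2 penalties
model = LogisticRegression(solver="liblinear", max_iter=1000)

# Candidate penalties and regularization strengths (range is an assumption)
penalties = ["l1", "l2"]
C_values = np.logspace(-2, 2, 5)  # 0.01, 0.1, 1, 10, 100

# Dictionary mapping hyperparameter names to the values to try
hyperparameters = {"C": C_values, "penalty": penalties}
print(hyperparameters)
```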

Now we’re ready to perform the grid search. In addition to taking in the model and the hyperparameters, GridSearchCV also allows the user to specify the number of folds in cross-validation using cv=, as well as a verbosity level using verbose=, which prints more detail the higher it gets. For the purposes of this article, verbose will always equal 1. If you want to get the mean accuracy of the best model found by the grid search on a given dataset, use .score(features, target).
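Putting these pieces together, a run might look like the sketch below. It assumes scikit-learn's bundled breast cancer data as a stand-in for the Kaggle file (so there is no patient ID column), and the C range, solver, and cv=5 are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Stand-in data (assumption): bundled dataset, 0 = malignant, 1 = benign
features, target = load_breast_cancer(return_X_y=True)

model = LogisticRegression(solver="liblinear", max_iter=1000)
hyperparameters = {"C": np.logspace(-2, 2, 5), "penalty": ["l1", "l2"]}

# 5-fold cross-validation over all 10 combinations
grid_search = GridSearchCV(model, hyperparameters, cv=5, verbose=1)
grid_search.fit(features, target)

print(grid_search.best_params_)             # best combination in the grid
print(grid_search.best_score_)              # its mean cross-validated accuracy
print(grid_search.score(features, target))  # accuracy of the refit best model
```

Note that best_score_ is averaged over held-out folds, while .score(features, target) evaluates the refit model on whatever data you pass in, so scoring on the training data will usually look more optimistic.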

In order to explore this a little bit further, I ran the grid search multiple times using different values for cv, thereby increasing or decreasing the number of cross-validation folds. Using a for loop, I recorded the cv value and the mean accuracy from each run, and graphed the results in a scatterplot to see if there was any relationship between accuracy and cv value.
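A loop of this kind could be sketched as follows. The cv values, the reduced grid, and the stand-in dataset are assumptions chosen to keep the example quick, and the plotting step is left out; the collected lists can be passed to any scatterplot function.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

features, target = load_breast_cancer(return_X_y=True)  # stand-in data
model = LogisticRegression(solver="liblinear", max_iter=1000)
hyperparameters = {"C": np.logspace(-1, 1, 3), "penalty": ["l1", "l2"]}

cv_values = [3, 5, 10]  # hypothetical fold counts to compare
mean_accuracies = []
best_C_values = []

for cv in cv_values:
    search = GridSearchCV(model, hyperparameters, cv=cv)
    search.fit(features, target)
    # Record the accuracy and the chosen regularization strength for this run
    mean_accuracies.append(search.score(features, target))
    best_C_values.append(search.best_params_["C"])

for cv, acc, c in zip(cv_values, mean_accuracies, best_C_values):
    print(f"cv={cv}  accuracy={acc:.3f}  best C={c}")
```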

As you can see from this visualization, there isn’t a particularly strong relationship between these two things.

Next, I checked to see if the C value, which controls regularization, had any relationship with the cross-validation values. Similar to the graph above, there wasn’t much of a relationship there, so I moved on to the last combination: regularization values vs. mean accuracy.

While this isn’t a super linear relationship, this graph does seem to reflect that a weaker regularization leads to a slightly more accurate score.

To recap grid search:

  • Advantages: exhaustive search, will find the best combination of hyperparameters within the grid you specify, based on the training set
  • Disadvantages: time-consuming, danger of overfitting

Randomized Search

A randomized search provides an alternative to the exhaustive grid search method. As the name suggests, it randomly selects combinations of hyperparameters and tests them to find the optimal hyperparameter values out of the randomly selected group. This method is typically faster than a grid search since it doesn’t test the full range of possibilities. Additionally, since it isn’t exhaustive, a randomized search reduces the chance of a model overfitting to the training data. In some cases, a model tuned with randomized search is more accurate than a model tuned with grid search in the long run, especially if the model has a smaller number of important hyperparameters. One potential disadvantage to this method is that there is a high possibility of variance between runs since it’s random.
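The sampling step can be illustrated on its own with scikit-learn's ParameterSampler. This is a sketch under assumptions: the log-uniform range for C is my own choice, and the two seeds are only there to show how the drawn combinations change between runs.

```python
from scipy.stats import loguniform
from sklearn.model_selection import ParameterSampler

# C is drawn from a continuous distribution; penalty from a list of options
distributions = {"C": loguniform(1e-2, 1e2), "penalty": ["l1", "l2"]}

# Two runs with different seeds draw different combinations
samples_a = list(ParameterSampler(distributions, n_iter=5, random_state=0))
samples_b = list(ParameterSampler(distributions, n_iter=5, random_state=1))

for sample in samples_a:
    print("seed 0:", sample)
for sample in samples_b:
    print("seed 1:", sample)
```

Only five combinations are evaluated per run here, no matter how large the search space is, which is where both the speed and the run-to-run variance come from.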

In Python, a randomized search exists in the same library as a grid search: sklearn.model_selection.RandomizedSearchCV().

You can check out the documentation for this here.

In order to begin the randomized search, you’ll need to create the model you want to run it on. Here, we’ll start with a logistic regression model again. Just like a grid search, the hyperparameters will need to be selected.

RandomizedSearchCV takes in a few parameters in addition to the ones we saw during the grid search. n_iter is the number of iterations, or sampled combinations of hyperparameters, that will be run. In this case, I’ll use 100 iterations. You also have the option to set a random_state, which makes the sampled combinations reproducible. Now, run the randomized search:
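A randomized search along these lines might look like the sketch below. It again assumes the bundled breast cancer data as a stand-in, uses a log-uniform distribution for C that I chose for illustration, and sets n_iter=20 instead of 100 only to keep the example quick.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

features, target = load_breast_cancer(return_X_y=True)  # stand-in data

model = LogisticRegression(solver="liblinear", max_iter=1000)

# Distributions (or lists) to sample from; the C range is an assumption
hyperparameters = {"C": loguniform(1e-2, 1e2), "penalty": ["l1", "l2"]}

random_search = RandomizedSearchCV(
    model,
    hyperparameters,
    n_iter=20,        # the article uses 100; reduced here for speed
    cv=5,
    random_state=1,   # fixes the sampled combinations
    verbose=1,
)
random_search.fit(features, target)

print(random_search.best_params_)
print(random_search.best_score_)              # mean cross-validated accuracy
print(random_search.score(features, target))  # accuracy of the refit best model
```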

Again, using the .score(features, target) call will return the mean accuracy of the model.

Similar to the grid search, I wanted to see if there was much of a relationship between cross-validation values, C values, and the mean accuracy values. Below are graphs visualizing the relationship (or lack thereof) between each of these features.

Above, you can see that mean accuracy and the number of cross-validation folds don’t really have a visible relationship, and neither do C values and the number of folds.

As you can see below, in a randomized search there does seem to be a loose relationship between regularization values and mean accuracy values.

To recap randomized search:

  • Advantages: reduced chance of overfitting, much faster than grid search
  • Disadvantages: a lot of potential for variance, since it’s random

Conclusion

Just to recap what we’ve gone over:

  • Hyperparameters are parameters external to the learning process that are set before learning starts.
  • Hyperparameter tuning (sometimes called hyperparameter optimization) is the process by which the optimal values for different hyperparameters are selected. This helps improve the learning model.
  • Two popular methods for hyperparameter tuning are grid search and randomized search.
  • Grid search is thorough and will yield the best results within the grid you specify, based on the training data. However, it does have some flaws: (1) it is time-consuming, depending on the size of your dataset and the number of hyperparameters; (2) it could lead to overfitting of the training set, leading to a less viable model in the long run.
  • Randomized search selects a random sampling of hyperparameter combinations, reduces the danger of overfitting, and is likely to provide more accurate long-term results, especially when only a small number of hyperparameters are significant.

Hopefully, this article has provided clarity on why hyperparameter tuning is important and also given insight into two useful methods.

Better Programming

Advice for programmers.
