Hyperparameter Optimization for 🤗Transformers: A guide

Amog Kamsetty
Distributed Computing with Ray
8 min read · Aug 26, 2020

By Amog Kamsetty, Kai Fricke, Richard Liaw

Overview of fine-tuning a pre-trained model. Two new fully connected layers are appended to the pre-trained Transformer network. Since we leverage the existing knowledge of the pre-trained model, only a few training epochs are needed. Hyperparameters, which have a significant impact on training, are provided to the model and optimizer.

Training NLP models from scratch takes hundreds of hours of training time. Instead, it’s much easier to use a pre-trained model and fine-tune it for a certain task. Using the Hugging Face transformers library, we can easily load a pre-trained NLP model with several extra layers, and run a few epochs of fine-tuning on a specific task.
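Concretely, here's a rough sketch of what this fine-tuning workflow looks like with the transformers Trainer API. The checkpoint name, hyperparameter values, and dataset variables below are illustrative placeholders rather than the exact setup from our experiments.

    # A minimal fine-tuning sketch with the Hugging Face Trainer API (values are illustrative).
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    model_name = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # A fresh classification head (the "extra layers") is added on top of the
    # pre-trained encoder and trained during fine-tuning.
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    training_args = TrainingArguments(
        output_dir="./results",
        num_train_epochs=3,               # only a few epochs are needed for fine-tuning
        per_device_train_batch_size=16,   # hyperparameters like these are what we tune below
        learning_rate=2e-5,
        weight_decay=0.01,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,      # assumed: a tokenized training split
        eval_dataset=val_dataset,         # assumed: a tokenized validation split
    )
    trainer.train()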

But what hyperparameters should we use for this fine-tuning?

Although a single fine-tuning run is relatively quick, repeating it with different hyperparameter configurations quickly becomes time consuming. Pretty much everyone (1, 2, 3, 4), including the original BERT authors, either ends up disregarding hyperparameter tuning or does a simple grid search over a few hyperparameters with a very limited search space.

In this blog post, we’ll show that basic grid search is not optimal, and that the hyperparameters we choose can have a significant impact on our final model performance.

We fine-tune BERT using more advanced search algorithms like Bayesian Optimization and Population Based Training. As a result, we can

  1. gain a better understanding of our hyperparameters and
  2. train a model with 5% better accuracy in the same amount of time.

We also conclude with a couple of tips and tricks for hyperparameter tuning for 🤗 Transformer models.

To reproduce these results for yourself, you can check out our Colab notebook leveraging Hugging Face transformers and Ray Tune!

We also use Weights & Biases to visualize our results. Click here to view the plots on W&B!

Comparing Hyperparameter Optimization Strategies

We compare 3 different optimization strategies — Grid Search, Bayesian Optimization, and Population Based Training — to see which one results in a more accurate model in less time. You can learn more about these strategies in this blog post or video.

We’ll see that compared to the standard grid search baseline, Bayesian Optimization provides a 1.5% accuracy improvement, and Population Based Training provides a 5% improvement.

Experiment Setup

We use a standard uncased BERT model from Hugging Face transformers and fine-tune it on the RTE dataset from the SuperGLUE benchmark. Since we don’t have access to the labels for the test set, we split the dev set in half and use one half for validation and the other for testing.
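As a sketch of this data setup with the Hugging Face datasets library (the seed and exact split mechanics are assumptions for illustration, not necessarily what our notebook does):

    from datasets import load_dataset

    # Load RTE from the SuperGLUE benchmark.
    rte = load_dataset("super_glue", "rte")
    train_dataset = rte["train"]

    # The test labels aren't public, so split the dev set 50/50 into a validation
    # half and a held-out "test" half. (These splits still need to be tokenized
    # before being passed to the Trainer.)
    dev_split = rte["validation"].train_test_split(test_size=0.5, seed=42)
    val_dataset = dev_split["train"]
    test_dataset = dev_split["test"]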

All of the experiments below are run on a single AWS p3.16xlarge instance which has 8 NVIDIA V100 GPUs. We use the Ray Tune library in order to easily execute multiple runs in parallel and leverage different state-of-the-art tuning algorithms with minimal code changes.

Setting a Baseline with Grid Search

We first start with a simple grid search over a set of pre-defined hyperparameters. We use the search space recommended by the BERT authors:
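Expressed as a Ray Tune search space, the grid from the BERT paper’s recommendations looks roughly like this (the parameter names are illustrative and depend on how your training function reads its config):

    from ray import tune

    # 3 learning rates x 2 batch sizes x 3 epoch counts = 18 combinations.
    grid_search_space = {
        "learning_rate": tune.grid_search([2e-5, 3e-5, 5e-5]),
        "per_device_train_batch_size": tune.grid_search([16, 32]),
        "num_train_epochs": tune.grid_search([2, 3, 4]),
    }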

We run a total of 18 trials, or full training runs, one for each combination of hyperparameters.

Results and configurations for best 5 Grid Search trials. Click on the image to play around with it on W&B!

Out of these trials, the final validation accuracy for the top 5 ranged from 71% to 74%. Taking the best configuration, we get a test set accuracy of 65.4%. The results are summarized below:

Best validation accuracy = 74%
Best run test set accuracy = 65.4%
Total # of GPU min: 5.66 min * 8 GPUs = 45 min
Total cost: 5.66 min * $24.48/hour = $2.30

Improving Grid Search with Bayesian Optimization

The simple grid search did alright, but it had a very limited search space and only considered 3 hyperparameters. What if a much better configuration exists that we aren’t searching over? Although it only took ~6 minutes to run the 18 trials above, every new value we want to search over means 6 additional trials, and this gets amplified even further if we want to tune more hyperparameters!

Instead, a more advanced approach is Bayesian Optimization. Here, we fit a Gaussian Process model that tries to predict model performance (i.e. the loss) as a function of the hyperparameters, and use it to inform which hyperparameters to try next.

We also combine this with an early stopping algorithm, Asynchronous Hyperband (ASHA), where we stop poorly performing trials early to avoid wasting resources on them. This way we can start more runs in parallel and test a larger number of hyperparameter configurations.

Example of the Bayes Opt. + Early Stopping flow for a single concurrent trial. We start training with random hyperparameters and, after every epoch, terminate the trial if it’s not performing well. After full training is done, we update our Bayesian optimizer and start a new trial with newly suggested hyperparameters.

For this experiment, we also search over weight_decay and warmup_steps, and extend our search space:
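Here’s a rough sketch of what this could look like in Ray Tune. The ranges, the eval_acc metric name, and the train_rte training function are illustrative assumptions rather than our exact configuration, and the searcher’s import path varies across Ray versions.

    from ray import tune
    from ray.tune.schedulers import ASHAScheduler
    from ray.tune.suggest.bayesopt import BayesOptSearch  # ray.tune.search.bayesopt in newer Ray

    # Illustrative ranges. BayesOptSearch models continuous parameters, so integer-valued
    # ones (warmup_steps, num_train_epochs) are sampled as floats and rounded inside
    # the training function.
    search_space = {
        "learning_rate": tune.uniform(1e-5, 5e-5),
        "weight_decay": tune.uniform(0.0, 0.3),
        "warmup_steps": tune.uniform(0, 500),
        "num_train_epochs": tune.uniform(2, 5),
    }

    # Asynchronous Hyperband: stop poorly performing trials after early epochs.
    scheduler = ASHAScheduler(metric="eval_acc", mode="max", grace_period=1)

    # Gaussian-Process-based search, seeded with 15 random configurations.
    search_alg = BayesOptSearch(metric="eval_acc", mode="max", random_search_steps=15)

    analysis = tune.run(
        train_rte,                        # assumed: a training function that reports eval_acc
        config=search_space,
        search_alg=search_alg,
        scheduler=scheduler,
        num_samples=60,                   # 60 trials in total
        resources_per_trial={"gpu": 1},   # 8 V100s -> up to 8 trials in parallel
    )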

We run a total of 60 trials, with 15 of these used for initial random searches.

The top few runs get a validation accuracy ranging from 72% to 77%. Overall, compared to basic grid search, we have more runs with good accuracy.

On our test set, we pick the best configuration and get an accuracy of 66.9%, a 1.5 percent improvement over the best configuration from grid search. The experiment took a total of ~13 min to run, and while this is longer than grid search, we ran a total of 60 trials and searched over a much larger space.

Best validation accuracy = 77% (+ 3% over grid search)
Best run test set accuracy = 66.9% (+ 1.5% over grid search)
Total # of GPU min: 13 min * 8 GPUs = 104 min
Total cost: 13 min * $24.48/hour = $5.30

Analyzing the most important hyperparameters for BERT fine-tuning

Because Bayesian Optimization builds a model of our performance, we can examine which hyperparameters have the largest impact on our objective, a measure known as feature importance. Interestingly, we see that weight_decay is the second most important hyperparameter, underscoring the value of searching over more hyperparameters.

Hyperparameters for our experiment ranked by importance. Click on the image to play around with this on W&B!

We can also see below that our best trials are mostly created towards the end of the full experiment, showing that our hyperparameter configurations get better as time goes on and our Bayesian optimizer is working.

Relationship between time trial was created and final validation accuracy. We see that most of our good performing trials are created towards the end of the full experiment run. Click on the image to play around with this on W&B!

Leveraging Population Based Training to Maximize Accuracy

With Bayesian Optimization, we were able to leverage a guided hyperparameter search. But even though we stopped poorly performing trials early, each new trial still had to start training from scratch.

Population Based Training also uses a guided hyperparameter search, but does not need to restart training for new hyperparameter configurations. Rather than discarding poorly performing trials, we exploit well-performing runs by copying their network weights and hyperparameters, and then explore new hyperparameter configurations while continuing to train.

Overview of Population Based Training. At some predefined interval, poorly performing trials copy the state of well-performing trials, randomly perturb the cloned hyperparameters, and then continue to train. Image from the DeepMind blog.

The search space we use for this experiment is as follows:
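Below is a sketch of the corresponding Ray Tune setup; the initial values, mutation ranges, and the train_rte training function are again illustrative assumptions rather than our exact configuration.

    import random

    from ray import tune
    from ray.tune.schedulers import PopulationBasedTraining

    # Starting hyperparameters for each member of the population (illustrative values).
    initial_config = {
        "learning_rate": 2e-5,
        "weight_decay": 0.0,
        "per_device_train_batch_size": 32,
        "num_train_epochs": 4,
    }

    pbt_scheduler = PopulationBasedTraining(
        time_attr="training_iteration",   # one iteration == one epoch in our training function
        metric="eval_acc",
        mode="max",
        perturbation_interval=1,          # exploit/explore after every epoch
        hyperparam_mutations={
            # How hyperparameters are resampled when a trial explores.
            "learning_rate": lambda: random.uniform(1e-5, 5e-5),
            "weight_decay": lambda: random.uniform(0.0, 0.3),
            "per_device_train_batch_size": [16, 32, 64],
        },
    )

    analysis = tune.run(
        train_rte,                        # assumed: same training function as before
        config=initial_config,
        scheduler=pbt_scheduler,
        num_samples=8,                    # a population of 8 trials
        resources_per_trial={"gpu": 1},
    )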

We run only 8 trials, far fewer than with Bayesian Optimization, since instead of stopping bad trials early, PBT has them copy from the good ones.

The top 5 trials have a validation accuracy ranging from 75% to 78%, and none of the 8 trials have a validation accuracy less than 70%. We pick the best configuration and get a test set accuracy of 70.5%. The whole experiment took ~6 min to run, which is roughly on par with our basic grid search.

Best validation accuracy = 78% (+ 4% over grid search)
Best run test set accuracy = 70.5% (+ 5% over grid search)
Total # of GPU min: 6 min * 8 GPUs = 48 min
Total cost: 6 min * $24.48/hour = $2.45

Summary of Results

Comparison of 3 different hyperparameter tuning approaches.

Tips & Tricks

The key takeaway here is that Population Based Training was the most effective approach for tuning the hyperparameters of our Transformer model. However, here are a few other insights we uncovered about hyperparameter tuning for NLP models that might be of broader interest:

  • Avoiding local minima with Bayesian Optimization: When using a Bayesian Optimization method, it is important to provide an initial set of “random guesses”. Intuitively, this provides a more informative prior for the Bayesian Optimization to start with. Otherwise, the optimizer can be myopic and overfit to a small number of samples.
  • Cutting down iteration time is super important: Always make sure you utilize all of the compute resources available on your machine. Anything that can be run in parallel should be run in parallel.
  • Tweaking the perturbation/mutation interval for PBT: With PBT, an important consideration is the perturbation interval, or how frequently we want to exploit and explore our hyperparameters. For our experiments, we performed this mutation after every epoch. However, doing this too frequently is actually counterproductive since model performance is noisy if only trained for a few batch steps.
  • Random seeds also factor into our accuracy results. In addition to tuning the hyperparameters above, it might also be worth sweeping over different random seeds to find the best model. A two-step approach could work best here: first use an early stopping algorithm to train over many different seeds, and then, keeping only the best performing seeds, use Population Based Training to tune the other hyperparameters.

Do it yourself — Implementation of our Approach

You can check out our implementation of Population Based Training in this Colab Notebook. With Ray Tune we can easily implement scalable PBT without much modification to our standard fine-tuning workflow.
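For example, here is roughly what plugging a PBT scheduler into the Trainer’s built-in hyperparameter_search looks like with the Ray Tune backend. The search space, metric key, and population size are illustrative, and the Trainer needs to be constructed with a model_init function so that each trial starts from a fresh model.

    from ray import tune

    # Initial sampling distributions for each population member; PBT then mutates
    # these as training progresses (names and ranges are illustrative).
    def ray_hp_space(trial):
        return {
            "learning_rate": tune.uniform(1e-5, 5e-5),
            "per_device_train_batch_size": tune.choice([16, 32]),
            "weight_decay": tune.uniform(0.0, 0.3),
        }

    # `trainer` is a transformers Trainer set up for RTE fine-tuning (with model_init),
    # and `pbt_scheduler` is the PopulationBasedTraining scheduler sketched earlier.
    best_run = trainer.hyperparameter_search(
        hp_space=ray_hp_space,
        backend="ray",
        direction="maximize",
        # assumed: the Trainer's compute_metrics reports an "eval_accuracy" key
        compute_objective=lambda metrics: metrics["eval_accuracy"],
        n_trials=8,                       # population size
        scheduler=pbt_scheduler,          # forwarded to tune.run by the Ray backend
        resources_per_trial={"gpu": 1},
    )
    print(best_run.hyperparameters)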

Summary & Outlook

Hopefully this blog post inspires you to consider optimizing hyperparameters more when training your models. And as you can see, hyperparameter tuning a transformer model is not rocket science.

And this is just the start. The Ray libraries offer a host of features and integrations. If you’re inclined to try this out on a multi-node cluster, feel free to give the Ray Cluster Launcher a shot to easily start up a cluster on AWS.

Check here for the full code examples. And if you want to try out any of the other algorithms or features from Tune, we’d love to hear from you either on our GitHub or Slack!

To learn more about how researchers and companies use Ray to tune their models in production, join us at the upcoming Ray Summit!
