Using Large Language Models for Hyperparameter Optimization

Patrick Wagner
6 min read · Jan 26, 2024
This image was created with the assistance of DALL·E 2

Introduction

Hyperparameter optimization (HPO) is an essential step in training machine learning models to improve their performance. Hyperparameters are fixed parameters that determine the structure of a model and need to be set prior to training. In contrast to model parameters, they cannot be learned from the data. Examples include the strength of regularization, the learning rate, or the number of hidden layers in a neural network.

There are many algorithms of varying complexity for optimizing hyperparameters. A simple but potentially expensive approach is an exhaustive grid search, where all parameter combinations in a given search space are tested. More sophisticated approaches that aim to reduce computational cost include grid search with successive halving, Bayesian optimization, or combinations of different techniques. However, the skill and judgement of an experienced data scientist remain an important asset for model tuning. It therefore seems tempting to replace this human resource with AI and let a Large Language Model (LLM) do the tuning.
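To make these options concrete, here is a minimal sketch of an exhaustive grid search and its successive-halving variant in scikit-learn; the parameter grid is purely illustrative and not the search space used later in this article.

# Illustrative parameter grid for an SVC; not the search space used below.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import GridSearchCV, HalvingGridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.01, 0.1, 1, 10, 100], "kernel": ["rbf", "poly", "sigmoid"]}

# Exhaustive grid search: evaluates every combination with cross-validation.
grid_search = GridSearchCV(SVC(), param_grid, cv=3)

# Successive halving: starts with many candidates on a small budget and
# keeps only the best-performing ones for larger budgets.
halving_search = HalvingGridSearchCV(SVC(), param_grid, cv=3, factor=2)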

Zhang et al. (2023) proposed a novel approach to leveraging the power of an LLM to reduce the computational cost and runtime associated with hyperparameter optimization. They demonstrated on a series of models and HPO benchmarks that LLMs can perform comparably to or better than traditional HPO methods, at least in settings with a limited computational budget. In this article, I will build on their ideas and show how to use an LLM for HPO of a text classification model.

The Dataset

The problem at hand is the classification of unstructured text, namely the “20 newsgroups” dataset, which consists of 18,846 documents distributed over 20 classes. The number of characters per document ranges from 115 to 160,616 with a median of 1,175. As shown in Figure 1b, the dataset appears to be rather balanced. The training dataset includes 11,314 records and the test dataset includes 7,532. Here is an example from the category “motorcycle”.

From: irwin@cmptrc.lonestar.org (Irwin Arnstein)
Subject: Re: Recommendation on Duc
Summary: What's it worth?
Distribution: usa Expires: Sat, 1 May 1993 05:00:00 GMT Organization: CompuTrac Inc., Richardson TX
Keywords: Ducati, GTS, How much?
Lines: 13
I have a line on a Ducati 900GTS 1978 model with 17k on the clock.
Runs very well, paint is the bronze/brown/orange faded out, leaks a bit of oil
and pops out of 1st with hard accel. The shop will fix trans and oil leak.
They sold the bike to the 1 and only owner. They want $3495, and I am thinking
more like $3K. Any opinions out there? Please email me. Thanks. It would be
a nice stable mate to the Beemer. Then I'll get a jap bike and call myself
Axis Motors!
"Tuba" (Irwin) "I honk therefore I am" CompuTrac-Richardson,Tx irwin@cmptrc.lonestar.org DoD #0826

A classification model is trained on this dataset to predict the classes. But first, the text needs to be embedded to obtain a numeric representation that the models can work with. Here, the “distiluse-base-multilingual-cased-v2” model with 512 embedding dimensions, taken from the transformers library, is used.
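A sketch of this preprocessing step is shown below. It assumes the sentence-transformers package as the interface to the embedding model and uses scikit-learn's built-in loader for the “20 newsgroups” dataset.

# Load the raw documents and embed them into 512-dimensional vectors.
# Assumes the sentence-transformers package is installed.
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

encoder = SentenceTransformer("distiluse-base-multilingual-cased-v2")
X_train = encoder.encode(train.data)  # shape: (11314, 512)
X_test = encoder.encode(test.data)    # shape: (7532, 512)
y_train, y_test = train.target, test.target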

Figure 1) The “20 newsgroups” dataset. a) The number of characters per document. b) The number of documents per class.

The Experiment

The aim is to train two models on this classification task: a Support Vector Classifier (SVC) with two hyperparameters to optimize, and an XGBoost tree with nine hyperparameters. The LLM is used as follows: in an initial query, the model setup is introduced to the LLM, which is then asked for an initial set of hyperparameters.

"""You are helping tune hyperparameters for a SVM model.
Training is done with the sklearn library in python.
This is our hyperparameter search space:
C: float, - Regularization parameter.
The strength of the regularization is inversely proportional to C.
Must be strictly positive. The penalty is a squared l2 penalty.
Range: [0.001, 1000]
kernel {'poly', 'rbf', 'sigmoid'} - Specifies the kernel type to be used in the algorithm.
We have a budget to try 10 configurations in total.
You will get the validation error rate (1- accuracy) before you need
to specify the next configuration.
The goal is to find the configuration that minimizes the error rate
with the given budget, so you should explore different parts of the
search space if the loss is not changing."""

The model is trained with this set and evaluated by computing the validation error (1 − accuracy) on the test split. In a follow-up query, the error is passed to the LLM, which is asked for the next set of parameters; this set is evaluated in the same way and the result is passed back to the LLM.

"""error = {error :.4e}. Specify the next config."""

This procedure is repeated until the search budget of 10 iterations is exhausted (see Figure 2).
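A minimal sketch of this query loop is shown below. It assumes the OpenAI chat completions client as the LLM backend; parse_config and train_and_evaluate are hypothetical helpers that extract the suggested configuration from the reply and return the validation error, respectively.

# Hedged sketch of the iterative HPO loop driven by the LLM.
from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": INITIAL_PROMPT}]  # the prompt shown above

best_error = float("inf")
for _ in range(10):  # search budget of 10 configurations
    response = client.chat.completions.create(model="gpt-4-turbo", messages=messages)
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})

    config = parse_config(reply)        # hypothetical: extract e.g. C and kernel from the reply
    error = train_and_evaluate(config)  # hypothetical: fit the model, return 1 - accuracy
    best_error = min(best_error, error)

    messages.append({"role": "user", "content": f"error = {error:.4e}. Specify the next config."})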

Figure 2) A schematic representation of the experimental setup. (Image by the author, inspired by Zhang et al., 2023)

A random search over the grid, with 10 iterations per epoch and 20 epochs in total, serves as a benchmark; the smallest validation error per epoch is reported. The hyperparameters and their respective search spaces are listed in Table 1. GPT-4 Turbo is used as the LLM.
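The random search benchmark for the SVC could look roughly like the sketch below; C is sampled log-uniformly from the range given in the prompt, and the embedded features from the preprocessing step above are assumed to be available.

# Random search benchmark: 20 epochs of 10 random configurations each,
# reporting the smallest validation error per epoch.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
best_errors = []
for epoch in range(20):
    errors = []
    for _ in range(10):
        config = {
            "C": 10 ** rng.uniform(-3, 3),  # log-uniform in [0.001, 1000]
            "kernel": str(rng.choice(["poly", "rbf", "sigmoid"])),
        }
        model = SVC(**config).fit(X_train, y_train)
        errors.append(1 - model.score(X_test, y_test))
    best_errors.append(min(errors))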

Table 1) Parameters and their respective search spaces for the SVM and XGBoost models.

Results

Figure 3 summarizes the results of the HPO of the SVC. The LLM starts off with an error of over 0.9 but quickly adjusts its suggestions to reach error rates around 0.2. It subsequently tries different configurations and narrows down the search space. For example, “rbf” is quickly identified as the most promising kernel candidate, and, except for iteration 8, the search is focused on the regularization strength (Figure 3a). This is arguably a reasonable strategy considering the limited search budget.

Comparing the results to a random search with 10 iterations also shows that GPT-4 Turbo beats the random search in 16 out of 20 epochs (Figure 3b).

Figure 3) HPO for an SVC model. a) Validation error rate per iteration for the hyperparameter sets provided by GPT-4 Turbo. The optimized parameters kernel (top) and C (bottom) are printed for each iteration. b) Blue bars denote the best validation error rate per epoch of the random search. The black line indicates the best validation error rate achieved by GPT-4 Turbo (the minimum of the curve in a).

To ramp up the challenge, the LLM is now confronted with an XGBoost model and asked to tune nine hyperparameters instead of only two (Table 1). When run for 10 iterations as before, it beats the random search in 6 out of 20 epochs (Figure 4). The error rate decreases further when the experiment is continued for another 10 iterations, reaching its minimum at iteration 19. This continued downward trend over 20 iterations suggests that GPT-4 Turbo is capable of handling longer trajectories. After 20 iterations, it beats the random search in 50% of all epochs.

To add another benchmark, the result of the state-of-the-art AutoML library FLAML is also shown. It is limited to an XGBoost model and run for 15 minutes, which is roughly the time needed to run 10 iterations on the hardware used for the experiment (Apple M1 Pro with 16 GB of memory; the library does not allow specifying the number of iterations). The result of FLAML is comparable to that of GPT-4 Turbo and beats the random search in 9 out of 20 epochs.
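For reference, a FLAML run along these lines might look like the sketch below; the keyword names follow recent FLAML releases, and the exact settings of the original benchmark may differ.

# Restrict FLAML to XGBoost and give it a 15-minute time budget.
from flaml import AutoML

automl = AutoML()
automl.fit(
    X_train=X_train,
    y_train=y_train,
    task="classification",
    estimator_list=["xgboost"],
    time_budget=900,  # seconds
    metric="accuracy",
)
# best_loss is the minimized objective, i.e. 1 - accuracy for the accuracy metric.
print("Best validation error:", automl.best_loss)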

Figure 4) HPO for an XGBoost model. a) Validation error rate per iteration for the hyperparameter sets provided by GPT-4 Turbo. b) Blue bars denote the best validation error rate per epoch of the random search. Black lines indicate the best validation error rate of GPT-4 Turbo (the minimum of the curve in a) after 10 (dashed) and 20 (solid) iterations. The gray line shows the result of the AutoML library FLAML.

In summary, in settings with a very limited computational budget, GPT-4 Turbo is able to optimize a model with only a few hyperparameters to tune and outperform a random search. When the complexity is increased by adding more hyperparameters, the random search outperforms the LLM in 50% of all cases. However, GPT-4 Turbo is still able to systematically reduce the validation error rate, even when the length of the trajectory is doubled to 20 iterations, and reaches results comparable to a state-of-the-art AutoML library.

While relying solely on an LLM for HPO tasks is probably not (yet?) appropriate, including one in your HPO strategy might yield some benefit, for example by providing initial and/or refined search spaces in combination with grid search algorithms.

References and further reading

M. R. Zhang, N. Desai, J. Bae, J. Lorraine, and J. Ba, Using Large Language Models for Hyperparameter Optimization (2023), Foundation Models for Decision Making Workshop at NeurIPS 2023

Shuai Guo, When AutoML Meets Large Language Model, Towards Data Science

20 newsgroup dataset

The Python code can be found here

Interested in text classification? Check out the article of my colleague Niklas von Moers to understand Why CNNs and Text Embeddings Don’t Mix.


Patrick Wagner

Climate Researcher by training with a PhD in Theoretical Oceanography. I currently work as a Data Scientist at crossnative GmbH in Hamburg, Germany.