An Introduction to the Implementation of Optuna, a Hyperparameter Optimization Framework
I am Masashi Shibata from the CyberAgent AI Lab (GitHub: @c-bata).
Hyperparameter optimization is one of the most important processes for a machine learning model to deliver high performance. Optuna  is a popular Python library for hyperparameter optimization, and is an easy-to-use and well-designed software that supports a variety of optimization algorithms. This article describes the internal implementations of Optuna, with a main focus on the software aspects.
In order to understand the internal implementations of Optuna, you need to know the roles of the major components and the overall flow of the execution. However, with Optuna actively being developed and the amount of code growing, it has become difficult to get a hold of the overall flow from reading the code. So, I created a tiny program called Minituna. Minituna has three versions with each having 100, 200, and 300 lines of code. The final version is a small program of only about 300 lines, but with a practical pruning algorithm, it is quite authentic.
Minituna: A toy hyperparameter optimization framework intended for understanding Optuna’s internal design.
I have created each version of Minituna with the intention of helping you to understand how Optuna is designed in three steps:
- Understand the main components of Optuna and how they are called
- Understand how to use categorical, integer, and loguniform
- Understand the pruning API and the median stopping rule algorithm
I encourage you to practice reading the Minituna code by typing out its entire code. Once you get a good understanding of how Minituna is designed, it will not be too difficult to read the Optuna code. This article will not cover the simple facts that you can easily read from the Minituna code. Instead, I will try to provide some tips to help you in reading and practicing with the Minituna code, as well as to explain the differences between Minituna and Optuna.
Note that Minituna was implemented based on the Optuna v2.1.0 code. Since v2.4.0 (https://github.com/optuna/optuna/releases/tag/v2.4.0), Optuna has a new interface that can pass multiple objective values in order to support multi-objective optimization. The design has not changed significantly, but you should keep this in mind when reading the Optuna code.
minituna_v1: The Roles of Trial, Study, Sampler, and Storage and How They Are Called
Minituna v1 is a very small program of about 100 lines. But, the main components have already been implemented in it, and you can run a program like the following. Please note that all the examples presented in this article are compatible with Optuna and will run without problems even if you replace the import statement with
import optuna as minituna.
This is how we define the objective function in Optuna. The objective function takes the trial object and returns the objective value. In this example, the objective function is called 10 times, and the best value it has obtained out of 10 trials (we assume a minimization problem here) and the parameters are printed.
Now let us read the minituna_v1 code. There are five classes defined in minituna_v1, and these key components appear frequently in the Optuna code as well. Take a look at the code to see what each class does and how it is called.
- Study: A component that manages information about a given optimization task
For a given optimization task, the study component manages all the information such as which algorithm to use (Sampler) and where to store the trial results (Storage). They are specified as arguments in the
- Trial: A component that corresponds to each trial
The objective function samples parameters from Optuna via the API provided by the trial object, and reports intermediate values to perform pruning.
- Storage: A component that stores the results of optimization trials.
Thanks to this component, Optuna is able to support RDB storages, enabling persistence of trial results and distributed optimization.
- FrozenTrial: A representation of each trial on the storage layer
This component holds the objective value and the parameters used to evaluate the objective function in each trial. For example, when using RDB storages, the information retrieved from the DB by SQL is put into the FrozenTrial object.
- Sampler: A component to implement an algorithm to select the next parameter to evaluate
This component implements an algorithm to find out which parameters should be evaluated in order to get a better objective value. The sampling of hyperparameters with Bayesian optimization or evolution strategies is defined in the sampler component. For simplicity of explanation, I have implemented only random sampling in Minituna. This article does not cover the details of the Bayesian optimization and evolution strategies supported by Optuna.
minituna_v2: How to Use Categorical, Integer, and LogUniform
minituna_v2 supports the following suggest APIs in addition to
suggest_uniform() (samples of real parameters from a uniform distribution).
suggest_categorical(name, choices=[...]): Samples categorical parameters
suggest_int(name, low, high): Samples integer parameters
suggest_loguniform(name, low, high): Samples real parameters from the space in logarithmic scale
This allows us to optimize the objective function like the following:
The key to understanding the minituna_v2 code is that all parameters are represented as floats inside the storage. In the previous example, the categorical parameters are strings such as “SVC” and “RandomForest”, but even these are represented as floats. For this, we introduce the following abstract base class.
Each parameter has two representations: internal_repr and external_repr. internal_repr is a parameter representation inside the storage and is a float value. external_repr is a representation that is actually used by the objective function, so it can be a string, integer, or anything. To help you understand, I encourage you to try it out.
The distribution object is needed for the conversion between internal_repr and external_repr. Therefore, the distribution object is also saved in the storage. That is the reason why the distributions field has been added to FrozenTrial.
There are several differences between Minituna and Optuna in terms of the suggest API.
- Optuna has
DiscreteUniformDistributionis essentially the same as
IntLogUniformDistributionwith the only difference being the range of discretization.
IntLogUniformDistributionis no more than adding the same kind of functionality as
IntUniformDistribution. Therefore, you will be able to read Optuna easily if you practice reading the minituna_v2 code. (* Update in Nov. 2021: Please note that two additional distributions,
IntDistributionhave recently been introduced. See https://github.com/optuna/optuna/pull/3063 for more details.)
- Optuna provides an API called
suggest_float(name, low, high, step=None, log=False), which was added because the word “uniform” in
suggest_uniform(name, low, high)and
suggest_loguniform(name, low, high)is a little confusing, and also for consistency with
suggest_int. Now that
suggest_floathas been formally adopted,
suggest_discrete_uniformcould be deprecated, but the discussion on this topic has not progressed. (* Update in Nov. 2021: The consistency of suggest APIs is being discussed in https://github.com/optuna/optuna/issues/2939 and will be improved in Optuna V3.)
- To put the distribution into RDBStorage or RedisStorage, you need to serialize it into some format. In Optuna, it is serialized into JSON and then stored. This means that the possible categorical parameter types are limited to JSON serializable objects, such as
CategoricalChoiceType = Union[None, bool, int, float, str].
Understand the Storage Layer Properly
If you have read the codes all the way to minituna_v2, you can start reading the Optuna code. Next, I will give a brief introduction to the storage layer before going into the details of minituna_v3.
When I started reading Optuna’s source code, the first thing I did was to get a good understanding of the storage layer. This is because if you understand what kind of information is in the storage, you can easily imagine what Optuna needs to do in order to implement each function. You can also roughly figure out the design differences between Minituna and Optuna by reading the code of the storage layer. To better understand the storage layer, I recommend checking out the SQLAlchemy model definition for RDBStorage.
Here are a few extra things to keep in mind when you read the code in the storage layer:
- To keep the code simple, Minituna’s storage can only hold information about one study while Optuna’s code supports multiple studies.
- Since Optuna supports multiple studies, the trial_id field in TrialModel does not always rise incrementally within the study. So, the number field was added to Optuna because it is useful to have an identifier that simply goes up like 1, 2, 3, and so on within the study in order to implement an algorithm.
- TrialState has two states: PRUNED and WAITING. The former was added to implement the pruning function which will be explained later in the minituna_v3 section, and the latter was added to implement the
minituna_v3: Pruning by Median Stopping Rule
minituna_v3 is a program of about 300 lines. It supports pruning (early stopping). Pruning is not a feature that all Optuna users use, so you do not have to understand it unless you are interested.
You only need two APIs to use the pruning function.
trial.report(value, step)to save intermediate values in the storage.
trial.shoud_prune()is true, raise a
TrialPruned()exception and stop the learning process.
As you can see from this API, all the pruning algorithms currently supported by Optuna decide whether to prune or not based on the intermediate values. In Minituna, I have implemented an algorithm called Median Stopping Rule . Fortunately, none of the pruning algorithms supported by Optuna are that complex.
All these algorithms allow terminating the learning process prematurely based on the empirical knowledge that if the intermediate value is low, the final value will not be that good either, in order to save time. In addition to the median stopping rule, which I will describe below, Optuna also supports algorithms such as Successive Halving [3, 4] and Hyperband , but the basic ideas remain largely the same.
The median stopping rule basically trims away the lower half (the ones worse than the median value of the previous trial’s intermediate values) at each step. The following figure illustrates how the median stopping rule works. Since the code is simple, reading it together with the following figure should give you a good idea of how the median stopping rule works.
If you would like to read the implementations of SuccessiveHalving and Hyperband in Optuna, there is one thing worth keeping in mind. Because Optuna is not designed to suspend or resume a particular trial, its algorithm has been slightly modified, and behaves differently from the algorithm described in the paper. To learn more about Optuna’s implementation of Successive Halving, please read Algorithm 1 in Optuna’s paper . For Hyperband, please read the blog article “How We Implement Hyperband in Optuna” posted by crcrpar who implemented Hyperband.
I will also touch on an issue related to the design of Pruner. As you can see from Minituna’s source code, the Pruner and Sampler interfaces are clearly separated in Optuna. This gives Optuna’s code good clarity and allows us to easily switch between Sampler and Pruner and use them in any combination. This is a great advantage.
On the other hand, some algorithms require Pruner and Sampler to work together closely, which cannot be implemented in the current design. In fact, a little trick was used in the background to implement Hyperband in order to make Pruner and Sampler work together. There is still room for discussion about the interfaces of Pruner and Sampler. If you have read the code and come up with some ideas for improvement, I would appreciate your suggestions.
How Joint Sampling Works in the Define-by-Run Interface
I will also describe Joint Sampling, which was not covered in Minituna due to the limited size of the implementation. This mechanism is unique to Optuna, which uses the Define-by-Run interface. To implement optimization algorithms such as
CmaEsSampler, which consider dependencies between parameters, you need to understand the concept of Joint Sampling. If you are doing research on Bayesian optimization or evolution strategies, you should be able to implement your own Sampler on Optuna once you become familiar with Joint Sampling.
The design of Minituna’s Sampler interface is almost identical to that of Optuna’ s when it was v0.12.0. However, the interfaces in Optuna v0.13.0 and later versions are different from those. You can see the difference by comparing the following Sampler interfaces.
To gain a better understanding of Joint Sampling, let us take a look at some examples and see what the search space looks like for each of the objective functions.
The search space of this objective function will always be like the one below.
Now, what about the search space of the following objective function?
In the Define-by-Run interface, the search space is determined at runtime. This example has more than one search space because there is a conditional if statement in the search space. It switches between the following two search spaces depending on the context.
SkoptSampler will then extract the ones that appear across all search spaces using the
Sampler.infer_relative_search_space(study, trial) method and pass them as the third argument to
Sampler.sample_relative(study, trial, search_space). In other words, only the classifier parameter is treated as a joint search space in the above example. GP-BO and CMA-ES are used only for samples from this joint search space, which is called Joint Sampling. Here is a diagram of how the Joint Sampling works.
rf_max_depth, and other parameters that are not included in the joint search space will fallback to Samplers where the method does not consider dependencies between variables, such as
This article introduced Minituna and provided some additional tips on how to read through the Minituna and Optuna codes. If you have practiced reading the Minituna code by typing out all its code up to Minituna v2 and have a good grasp of the whole process now, you should be able to read the Optuna code. I would recommend that you start with the component that interests you the most. Also, the Optuna development team is very supportive, and they give detailed PR reviews. I hope this article will encourage you to join us in the development of Optuna in the future.
- T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama: Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2623–2631, 2019.
- D. Golovin, B. Sonik, S. Moitra, G. Kochanski, J. Karro, and D.Sculley: Google Vizier: A Service for Black-Box Optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1487–1495, 2017.
- K. Jamieson and A.Talwalkar: Non-stochastic Best Arm Identification and Hyperparameter Optimization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR 51:240–248, 2016.
- L. Li, K. Jamieson, A. Rostamizadeh, E. Gonina, J. Ben-tzur, M. Hardt, B. Recht, and A. Talwalkar. A System for Massively Parallel Hyperparameter Tuning. Proceedings of Machine Learning and Systems, 2:230–246, 2020.
- L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh and A. Talwalkar: Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. Journal of Machine Learning Research, 18, pp. 1–52, 2017.
I would like to thank the Optuna development team and Mr. Saito of SkillUp AI for reviewing this article.
This article is a translation of the Japanese blog post of CyberAgent AI tech studio.