Stories by Masaki Kozuki on Medium

Parallel Hyperparameter Tuning With Optuna and Kubeflow Pipelines

Masaki Kozuki — Fri, 06 Nov 2020 02:31:48 GMT

This entry is a translation of the Japanese-language blog post originally authored by Mr. Masao Tsukiyama of Mobility Technologies Co., Ltd. The Optuna community would like to thank Mr. Tsukiyama for permitting us to post this translation.
Disclaimer: All the linked slides and videos are in Japanese.

Introduction

Hi there. This is Masao Tsukiyama of ML Engineering Group 1 in the AI Technology Development Department at Mobility Technologies (MoT).

The other day, I tuned hyperparameters in parallel with Optuna and Kubeflow Pipelines (KFP), summarized the work in a slide deck for an internal seminar, and published the slides, which drew several responses.

https://medium.com/media/bef7314e87d8e8742bf306c8cfa4010c/href

An Optuna maintainer noticed the slides and asked me to write a blog post about the work. Some of the content overlaps with the slides above, but let me walk you through the following items with some figures and snippets:

Optuna and KFP use case
Introduction and tutorial of Optuna and KFP
Optuna’s usability and contribution to our project
Tips for integration into production

Use Case Definition

When it comes to hyperparameter tuning, you may think of the hyperparameters of machine learning and deep learning models, such as the number of layers of a deep neural network.

However, we wanted to optimize the parameters of the reinforcement learning model that proposes the optimal route in one of our products, called Passenger Search Navigation.

What is “Passenger Search Navigation”?

At MoT, we provide a machine-learning-based service named Passenger Search Navigation (Japanese name: “お客様探索ナビ”) to taxi drivers who use the taxi dispatch service “GO”. The service helps drivers find riders when they are not waiting for riders at stations. It aims to enable new hires at taxi companies, and drivers who are not familiar with an area, to earn more than the average driver.

How Does Machine Learning Play a Role?

This section explains how Passenger Search Navigation recommends routes to drivers.

To suggest a route to a driver, we take the following steps:

Create machine learning models that use stats, such as the number of rides, to predict the number of ride requests on each road in the next 30 minutes (demand).
Create other machine learning models that take the same stats and predict the number of taxis that will be on each road in the next 30 minutes (supply).
Run inference with both models every 15 minutes to continually update the predicted demand and supply.
Recommend a route based on the predicted demand and supply, using reinforcement learning.

Reference

If you are interested in the details of the algorithm, the following materials are available in Japanese.

https://medium.com/media/e9652a20900e4ad81c786621aaafb2e4/href https://medium.com/media/b0925d66e94926370a50e668b41ed22c/href

I also highly recommend the materials below to those who are interested in the overall architecture and MLOps of our service.

https://medium.com/media/07cc883de703a1429ca46fcc67f26619/href https://medium.com/media/194b0e9b041c117b5247aa7d1d3b692c/href

Hyperparameters of Value Iterator

The reinforcement learning step above contains a component called Value Iterator. The Value Iterator has a number of hyperparameters, and these hyperparameters, together with the predicted demand and supply, affect the recommended route.

Furthermore, the recommended route affects the profit in the simulation and, ultimately, the profit of each driver. This is why we wanted to tune the hyperparameters of the Value Iterator.

To be exact, the situation was as follows:

The Value Iterator component has many hyperparameters, but we had never tuned them between the Proof of Concept and the present.
Due to the nature of Passenger Search Navigation, we need different models for different areas, and the optimal hyperparameters are likely to differ between them.
The optimal hyperparameters are also likely to differ between timeframes.

Given the above, we settled on the following tuning requirements:

The entire tuning process must be automated.
Tuning must be easy to run, either on a regular schedule or on demand.
It must not be too time-consuming.
It must not be too compute-hungry, because our server budget is limited.

While there are various frameworks for hyperparameter tuning, we chose Optuna because it is popular and its intuitive interface looked easy to use.

Simulation and Evaluation of Machine Learning Models

Because it relates to why we adopted Kubeflow Pipelines (KFP) for parallel hyperparameter tuning, let me first describe how machine learning models are simulated and evaluated in Passenger Search Navigation.

In Passenger Search Navigation, two machine learning models predict the demand and supply, and then the reinforcement learning model suggests a route from that demand and supply. To evaluate the suggested route, we have a simulator to see how much profit it will make.

We evaluate the route against records of actual demand and supply: how many rides occurred, and when and where they happened.

The criteria for updating the machine learning models include minimizing the squared error and maximizing the simulated profit.

So, our tuning task repeatedly runs the following steps:

Update the hyperparameters of Value Iterator component
See the profit by running the simulator with the updated hyperparameters

The simulation covers one week (seven days). We had already built a KFP pipeline that collects the data for the seven days in parallel and runs the simulator.

We chose KFP because it makes automating the tuning simple and lets us reuse this existing pipeline.

Introduction and Tutorial of Optuna and KFP

Let me briefly summarize Optuna and KFP for those who are not familiar with them.

What is Optuna?

Optuna is an open source hyperparameter optimization framework that automates hyperparameter search. It was first released in December 2018, and its first stable version came out in January 2020. It’s implemented in Python, like other machine learning frameworks.

The official page lists the following key features:

Parallelize hyperparameter searches over multiple threads or processes without modifying code
Automated search for optimal hyperparameters using Python conditionals, loops, and syntax
Efficiently search large spaces and prune unpromising trials for faster results

In general, it’s designed to make it easy to implement distributed and parallelized tuning.

Tutorial of Optuna

The snippet below is taken from the official tutorial, which can be downloaded as a Python script or a Jupyter notebook.

https://medium.com/media/3c3f37184d69266abb0470520922f0f9/href

In Optuna, the whole tuning process is called a Study, and each evaluation of one set of hyperparameters is called a Trial.

You define how each trial is evaluated, from sampling the hyperparameters to returning the evaluated value. In the above snippet:

create_study instantiates a new Study. Pass direction="minimize" or direction="maximize" depending on whether the objective function is to be minimized or maximized.
study.optimize executes the tuning. The number of trials is 100.
The objective function samples the hyperparameter x and evaluates a quadratic function of it.

A Study object provides the following useful methods and attributes.

study.best_params
>> {'x': 1.9926578647650126}

study.best_trial
>> FrozenTrial(number=26, state=TrialState.COMPLETE, params={'x': 1.9926578647650126}, value=5.390694980884334e-05, datetime_start=xx, datetime_complete=xx, trial_id=26)

study.trials
>> [FrozenTrial(number=0, …), …]

The example above uses only one floating-point value in linear space, sampled with the suggest_float method. Trial provides the following methods for suggesting hyperparameters:

# Categorical
optimizer = trial.suggest_categorical("optimizer", ["MomentumSGD", "Adam"])

# Integer
num_layers = trial.suggest_int("num_layers", 1, 3)

# Floating point values in linear space
dropout_rate = trial.suggest_float("dropout_rate", 0.0, 1.0)

# Floating point values in logarithmic space
learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)

# Floating point values in discrete linear space
drop_path_rate = trial.suggest_float("drop_path_rate", 0.0, 1.0, step=0.1)

Other than Study and Trial, Optuna has the concept of Storage. As the name implies, Storage tracks the history of Study and its Trials, and there are several types available depending on the use case.

InMemoryStorage:

The default storage class
Keeps the Study in the memory of the process that runs the tuning
Not intended for keeping Trials over the long term
Faster than RDBStorage if you parallelize tuning on a single instance

RDBStorage:

Keeps the Study in an external relational database
MySQL, PostgreSQL, and SQLite are supported
Best for distributed optimization, because every Study and its Trials are recorded
Allows a Study to be stopped and resumed

In Optuna, you can specify the number of parallel jobs with the n_jobs argument of Study.optimize. We run a single Optuna job and execute its Trials in parallel through KFP using n_jobs, so either storage would have been feasible. However, being able to refer back to the history of the Study and its Trials turned out to be helpful, and we sometimes want to increase the number of trials afterwards, so we set up a MySQL server on GCP Cloud SQL and use RDBStorage.

What is Kubeflow Pipeline (KFP)?

Kubeflow, which includes KFP, is a framework developed by Google that provides the tools needed to run the whole cycle of a machine learning project on Kubernetes. As a side note, we did not use any Kubeflow component other than KFP.

KFP is a workflow engine oriented toward machine learning and is becoming more and more popular these days, whereas we rarely see use cases for the other components.

There are other well-known workflow engines such as Apache Airflow and Digdag, but KFP has the following strong points:

A feature named “Experiment” lets you prepare a pipeline for each experiment, change its input parameters, and run it from the Web UI.
The inputs and outputs of every pipeline task are visualized, so artifacts such as Jupyter notebooks and confusion matrices can be checked on the Web UI.
Experiments and their results are easy to compare, which makes comparing parameters easier.

KFP is built on top of Argo, an OSS workflow engine for Kubernetes, but KFP is friendlier because pipelines can be defined in Python. In Passenger Search Navigation, we use KFP for R&D work such as simulations and experiments, and Apache Airflow for operations such as the deployment and inference pipelines of the machine learning models.

Sometimes it’s better to use different tools for different phases.

Parallel Hyperparameter Tuning Flow with Optuna and KFP

So, here comes the result: the implementation and hyperparameter tuning of Passenger Search Navigation.

Let’s work out the implementation policy, starting from the Optuna snippet above.

https://medium.com/media/5f3d20d2914dc9cdef8f03f7a900f46e/href

As this is preliminary, we tentatively set n_trials (the number of trials) to 100.

The default optimization algorithm of Optuna is TPE, which is sequential by nature, so too large an n_jobs might harm its performance. We therefore set n_jobs to 5, which turned out not to be harmful.

So, what the objective function needs to do is run the simulation with the suggested hyperparameters for the Value Iterator component and collect the simulated profit.

We had already implemented the simulation as a KFP pipeline job and had been using it for experiments and for evaluation in the deployment pipeline.

The figure below illustrates how this flow is organized.

Of course, you can run Optuna locally, but it’s tedious to wait a couple of hours for the tuning to finish, so I implemented a KFP job to execute Optuna.

First, the deployed Optuna job calls create_study to start a new Study backed by the MySQL server on Cloud SQL.

As n_jobs is 5, study.optimize evaluates 5 trials in parallel. Each running trial submits a simulator pipeline job to KFP with the suggested hyperparameters for the Value Iterator component and then obtains the simulated profit.

Each thread waits for its simulation to end. The figure above may look as if the threads are synchronized, but each thread actually runs its Trial independently.

After a Trial finishes, its result is stored in the storage and the next trial starts. In each new trial, hyperparameters are suggested based on the past records, and these steps are repeated until the simulated profit converges.

Thanks to RDBStorage, if n_trials turns out not to be enough, we can resume the tuning with optuna.load_study.

Codebase

So far we’ve looked at a schematic view of how our fully automated tuning flow is implemented. Now I will show you the actual codebase.

Deploy Optuna Job

https://medium.com/media/00a5ade8fc7ae6ee5730ff1240776396/href

Let’s focus on the essentials. The Optuna job pipeline is implemented in wf.create_optuna_pipeline, which is explained next. The code above compiles the created pipeline and deploys it to the KFP cluster with wf.run_pipeline().

KFP Pipeline Function of Optuna Job

https://medium.com/media/91b6b878e62108fd7d9ae7547cd869a0/href

This is the implementation of wf.create_optuna_pipeline(). In KFP, pipeline functions are defined with the @kfp.dsl.pipeline decorator. By passing a Slack notification Operator to dsl.ExitHandler, we are notified when the job terminates, whether it succeeds or not.

KFP Operator of Optuna Job

The Operator that runs the Optuna task is implemented as follows:

https://medium.com/media/dbefaf0261f8f25b320958e3cbbfb33f/href

create_optuna_op creates a Container Operator that runs the Optuna job. Each task runs in its own container, so we can deploy the task to the KFP cluster with any Docker image. Note that a Sidecar Container can also be attached to a Container Operator.

As our Optuna job uses RDBStorage (a MySQL server on Cloud SQL, to be exact), GCP’s official container gcr.io/cloudsql-docker/gce-proxy is used as the sidecar. The tuning process itself is encapsulated in the image passed to this Operator.

Tuning Execution with Optuna Job

https://medium.com/media/4d058e6a56c70156971e209312e98bac/href

Containers started by the Operator first call this method, which sets up the Study on the storage given by the storage argument.

We pass settings["study_storage"], which is a string of the format mysql+pymysql://{user}:{password}@localhost/{cloudsql_datasetname} (remember, we use MySQL). This connects to the MySQL server via the Cloud SQL Proxy running as a sidecar.

If the load_if_exists argument is True, an existing Study with the same name on the Storage is resumed.
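As a small illustration of the storage string described above, the following sketch builds such a URL. The helper name, user, password, and database name are all hypothetical; the only point is that special characters in the credentials should be percent-encoded before they are placed in the URL.

```python
from urllib.parse import quote_plus


def build_storage_url(user, password, database, host="localhost"):
    """Build a SQLAlchemy-style MySQL URL (hypothetical helper).

    host is "localhost" because the database is reached through a proxy
    that runs next to the job.
    """
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/{database}"
    )


# Made-up credentials; "@" and "/" in the password must be escaped.
url = build_storage_url("optuna", "p@ss/word", "optuna_db")
print(url)
```

A string built this way can be passed as the storage argument when creating or loading a Study.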

The Objective Function

To recap, our objective function does the following:

Sample hyperparameters for Value Iterator component
Run simulation with the hyperparameters sampled above
Get and return the result (= profit) from the simulation

https://medium.com/media/1811f7af3f0aa9e952d3557c57102510/href

The Value Iterator component has multiple hyperparameters with different distributions. We write the distribution and the search space of each hyperparameter in a config file, and then suggest each value by using getattr to select the suggest method that matches its distribution and domain.
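The getattr-based dispatch can be sketched as follows. The hyperparameter names and ranges are invented for illustration, and StandInTrial is a tiny stand-in that only mimics the three suggest methods so that the example runs without Optuna; with a real Trial object the same suggest_from_config function would be used unchanged.

```python
import random

# Hypothetical search space, as it might be loaded from a config file.
SEARCH_SPACE = {
    "discount": {"method": "suggest_float", "kwargs": {"low": 0.5, "high": 0.99}},
    "n_iterations": {"method": "suggest_int", "kwargs": {"low": 10, "high": 100}},
    "reward_type": {
        "method": "suggest_categorical",
        "kwargs": {"choices": ["fare", "rides"]},
    },
}


def suggest_from_config(trial, search_space):
    """Look up each suggest method by name and call it with its domain."""
    params = {}
    for name, spec in search_space.items():
        suggest = getattr(trial, spec["method"])
        params[name] = suggest(name, **spec["kwargs"])
    return params


class StandInTrial:
    """Minimal stand-in exposing the same method names as an Optuna Trial."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def suggest_float(self, name, low, high):
        return self._rng.uniform(low, high)

    def suggest_int(self, name, low, high):
        return self._rng.randint(low, high)

    def suggest_categorical(self, name, choices):
        return self._rng.choice(choices)


params = suggest_from_config(StandInTrial(seed=0), SEARCH_SPACE)
print(params)
```

Keeping the search space in data rather than in code means a hyperparameter can be added or its range changed without touching the objective function.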

Next, the code checks whether the number of successfully completed trials in the Study is smaller than the specified number of trials. Waiting for a simulation sometimes fails, so len(trials) does not necessarily equal the number of completed trials; this check is a workaround for that.

This kind of handling can also be implemented with a callback function for Study.optimize. The callback, which takes the Study and the Trial as its inputs, checks the number of completed trials against the maximum number of trials in order to decide whether to stop (Study.stop).

run is a function that generates a simulation pipeline (Operators) and deploys it to the KFP cluster, as already explained.

Finally, unlike the deployment of the Optuna job, we need to wait for the simulation to terminate after run_pipeline, and then collect, summarize, and return the simulated profit.
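The "submit, wait, then collect" part of the objective function can be sketched as a simple polling loop. FakeSimulationClient, its method names, and the status strings are hypothetical stand-ins so that the example runs on its own; they are not the KFP client API.

```python
import time


class FakeSimulationClient:
    """Hypothetical stand-in for a pipeline client: a run succeeds after a few polls."""

    def __init__(self, polls_until_done=3, profit=123.0):
        self._remaining = polls_until_done
        self._profit = profit

    def get_status(self, run_id):
        self._remaining -= 1
        return "Succeeded" if self._remaining <= 0 else "Running"

    def get_profit(self, run_id):
        return self._profit


def wait_for_profit(client, run_id, poll_interval=0.01, timeout=5.0):
    """Poll until the simulation run ends, then return its profit."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = client.get_status(run_id)
        if status == "Succeeded":
            return client.get_profit(run_id)
        if status == "Failed":
            raise RuntimeError(f"simulation run {run_id} failed")
        time.sleep(poll_interval)
    raise TimeoutError(f"simulation run {run_id} did not finish in {timeout} s")


profit = wait_for_profit(FakeSimulationClient(), run_id="run-1")
print(profit)
```

Raising on failure or timeout, instead of returning a dummy value, lets the tuning framework record the trial as failed rather than as a real measurement.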

Evaluation of Hyperparameter Tuning

Once the simulated profit has converged, we move on to comparing hyperparameters.

To compare the performance of the previously used hyperparameters with that of the tuned ones, we did the following:

Tune the hyperparameters, with the two machine learning models fixed, on the data of 2019/10/01–2019/10/07.
To check that the tuned hyperparameters are not overfitting, evaluate the two sets of hyperparameters by simulation on the data of 2019/10/08–2020/02/29.
For the demand and supply in the simulation, use both the predicted values and statistical values, as both can occur in our production environment.

From this comparison, we confirmed that the profit rose by 1.4% when using the predicted values and by 2.2% when using the statistical values.

We then released the new set of hyperparameters to production after the quality assurance team confirmed that the proposed routes were valid.

In short, fully automated parallel tuning with Optuna achieved a profit increase greater than or equal to the one achieved with the policy devised by our algorithm team. Note that the latter requires manual work and is not automated.

Usability of Optuna

As said previously, tuning with Optuna brought a gain of up to 2.2% in simulated profit. The tuned hyperparameters are already live and are being used to suggest routes to our drivers.

We have not been able to compare our drivers’ actual profit before and after the release, so unfortunately we cannot yet say “our automated tuning gives our drivers joy”. A real-world comparison is difficult because profit and routes depend on season, timeframe, and luck.

So here are some thoughts on using Optuna in our product:

1. The interface is intuitive, and tuning is easy to parallelize and distribute.
2. It is easy to keep the implementation simple, because the objective function can be treated as a black box that takes hyperparameters as inputs and returns a value.
   Thanks to this design, we were able to use Optuna without modifying our existing simulation code or the way it is executed; we only encapsulated the Optuna-specific logic in new functions.
3. The SDK is useful.
   Most of the information we want can be accessed from Study and Trial, which makes it easier to do detailed processing in the objective function and to analyze the tuning afterwards.
4. Integrating Optuna into existing projects is not hard because of 1, 2, and 3.
5. It is customizable, though we haven’t tried this yet.

As the simulation is itself implemented as a KFP pipeline, our codebase deploys KFP jobs in a nested manner, which makes it a bit complicated.

We initially looked for a better design, but chose this approach because it keeps the relationship between tuning and simulation easy to understand.

In short, the design is easy to understand, and we could move on to implementing our logic quickly.

Tips for Optuna in Production

We benefited considerably from integrating parallel hyperparameter tuning into our product because:

we create multiple models, and each model has different optimal hyperparameters.
the hyperparameters that perform best also differ between timeframes.

As noted earlier, Optuna let us treat the logic we want to tune as a black box, so keeping the existing codebase small was not difficult.

Our problem is not a common one, and we had enough resources, so we spent the time to automate the tuning process. However, even if your problem is as simple as running tuning on a few instances for experiments, I would say that you will still benefit from Optuna.

So, whatever hyperparameters you are tuning, the general workflow will be as follows:

Identify the inputs, such as hyperparameters, and the outputs that you want to optimize.
  If the number of hyperparameters is astronomical, starting with a small subset of them is a good idea.
Confirm that your hyperparameters are allowed to be tuned.
  Sometimes they include ones that should not be tuned, or for which tuning is meaningless.
Implement the objective function.
  Keeping this function loosely coupled with the experiment logic is a good idea, e.g., by encapsulating the logic in a KFP experiment pipeline.
  The only requirement is that the function takes some inputs and returns some output.
Choose the number of trials and run the tuning.
  With RDBStorage you can resume the study, so you can increase the number of trials step by step, e.g., 100 -> 150 -> 200.
Run experiments to check the performance of the best hyperparameters Optuna found.
  In our case, we ran simulations for a period that was not used for tuning to see the gain.
  When training machine learning models, tune on validation accuracy or loss, and check the test accuracy at the end.
Finally, decide whether to adopt those hyperparameters.

Conclusion

This was our first time integrating Optuna, whereas we had already been using KFP for simulations.

To repeat, Optuna is good for its intuitive interface, good documentation, and its design that allows for simple implementation.

Since this problem benefits from periodic re-tuning, we scheduled a regular tuning pipeline and automated the comparison experiment.

I hope you found some takeaways in this post. Thank you.

Parallel Hyperparameter Tuning With Optuna and Kubeflow Pipelines was originally published in Optuna on Medium, where people are continuing the conversation by highlighting and responding to this story.

How We Implement Hyperband in Optuna

Masaki Kozuki — Wed, 26 Feb 2020 05:27:44 GMT

UPDATE (2020/08/31): This content is somewhat outdated because we have since improved the algorithm and the interface. For details, see the following five pull requests: optuna/optuna#1138, optuna/optuna#1141, optuna/optuna#1171, optuna/optuna#1188, optuna/optuna#1196.

This post requires some familiarity with Optuna and targets readers who are more interested in software for machine learning (ML) and deep learning (DL) than in ML and DL techniques or algorithms themselves. Also, I hope some of you feel like giving Optuna a try, or contributing to Optuna, after you read this.

If you’re new to Optuna, here are the quickstart on Colab and the v1 release announcement post! :)

TL; DR

Sampler and Pruner of Optuna are loosely coupled for modularity.
Optuna has experimentally supported Hyperband [4], one of the most popular and competitive hyperparameter optimization (HPO) algorithms, since v1.1.0.
Challenges in implementing Hyperband in Optuna are discussed in this post.
We resolved the challenges in a simple and beautiful manner.
We can now use Hyperband’s competitive hyperparameter optimization with ease. Feel free to try it now! Benchmark results are shown later.

In the first section, I briefly introduce the structure and modules of Optuna. Then I’ll illustrate how pruning algorithms work with a naive algorithm. After that, Successive Halving, a popular HPO algorithm, is introduced and its weak points are discussed. Finally, Hyperband is introduced and our implementation is described.

Under the Hood of Optuna

If you are already familiar with Optuna, feel free to move on to the next section.

Optuna finds the best hyperparameter configuration from a number of possible hyperparameters. Under the hood, there are five major components in Optuna:

A Study object is responsible for finding the set of hyperparameters that achieves the best value of the user-defined objective function. One set of hyperparameters, and the values of the objective function obtained with this set, are monitored by one Trial. The Trial reports those values and its ID (Trial.number) to the Storage of the Study. The Sampler is in charge of sampling hyperparameters. The Pruner stops unpromising Trials by comparing their intermediate values with those of previous Trials. Note that the Pruner is not always enabled. It works only when you report the intermediate values through Trial.report within your objective function.

How Pruning Algorithms Work

Before diving into Hyperband, let us go through an example of a basic pruning algorithm to show how it works and how the best hyperparameter configuration might be overlooked. Every epoch, we cut off the worst half of the running configurations. Assume there are eight hyperparameter configurations and that the learning curves of their complete training are as follows. Note that these curves are NOT available before hyperparameter optimization.

First, we run the 8 configurations for one epoch to collect their validation accuracy.

Here, the bottom four configurations, i.e., 1, 4, 2, and 3, are stopped. We repeat this procedure until only one configuration is left. So, we run the remaining four configurations for another epoch.

At epoch 2, configurations 5 and 8 are pruned. At this point, the algorithm fails to detect that 8 is ultimately the best configuration.

After 3 epochs of training, configuration 7 is better than 6, so 7 will be trained fully.

As you might notice, pruning algorithms might stop the best configuration prematurely. However, there is no way to fully prevent this, as no algorithm can predict learning curves with 100% accuracy.
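The walk-through above can be replayed with a few lines of Python. The accuracy values below are made up for illustration and are not taken from the original figures; they are chosen only so that the pruning order matches the description, with configuration 8 starting slowly and ending up best.

```python
# Hypothetical validation accuracy of each configuration after epochs 1..4.
CURVES = {
    1: [0.10, 0.20, 0.30, 0.35],
    2: [0.14, 0.25, 0.33, 0.40],
    3: [0.16, 0.28, 0.36, 0.42],
    4: [0.12, 0.22, 0.31, 0.38],
    5: [0.30, 0.34, 0.40, 0.45],
    6: [0.40, 0.50, 0.55, 0.60],
    7: [0.35, 0.48, 0.58, 0.65],
    8: [0.32, 0.36, 0.60, 0.80],  # late bloomer: best in the end
}


def halve_each_epoch(curves):
    """Drop the worst half of the running configurations after every epoch."""
    alive = sorted(curves)
    history = []  # list of (epoch, configurations pruned at that epoch)
    epoch = 0
    while len(alive) > 1:
        ranked = sorted(alive, key=lambda c: curves[c][epoch], reverse=True)
        half = len(ranked) // 2
        history.append((epoch + 1, sorted(ranked[half:])))
        alive = sorted(ranked[:half])
        epoch += 1
    return alive[0], history


winner, history = halve_each_epoch(CURVES)
true_best = max(CURVES, key=lambda c: CURVES[c][-1])

print(history)
print("selected:", winner, "best after full training:", true_best)
```

With these numbers the procedure keeps configuration 7, although configuration 8 would have been the best one after full training.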

Optuna’s Pruning Algorithms So Far

Optuna has so far provided two pruning algorithms: the median stopping rule [1] and (asynchronous) Successive Halving (SHA) [2][3], implemented as MedianPruner and SuccessiveHalvingPruner, respectively. The median stopping rule “prunes if the trial’s best intermediate result is worse than median of intermediate results of previous trials at the same step” (quoted from the Optuna documentation).

SHA introduces the concept of a budget (hereafter denoted B), which is equivalent to the computational resources of the HPO experiment. The design principle is to allocate more resources, e.g., epochs, to promising trials by repeatedly stopping unpromising ones, for example by halving the number of trials each time. The intervals between pruning decisions grow exponentially by a user-defined ratio (e.g., 3). However, as its focus is on faster configuration evaluation rather than on the configuration selection done by Bayesian optimization, Successive Halving has a hyperparameter of its own, related to the number of trials to examine (hereafter denoted n).

This means that SHA has a trade-off between n and B. When more resources are required to tell better configurations from worse ones, i.e., when learning curves can change drastically during training, it is better to set n small to reduce wrong judgments. Conversely, if learning curves change monotonically, i.e., their order does not change during training, we should set n larger to increase the probability of finding better hyperparameters.
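The trade-off between n and B can be made concrete with a small calculation. The sketch below is a simplified, synchronous version of Successive Halving with made-up numbers; it is not how SuccessiveHalvingPruner is implemented. For simplicity, each round is charged its full resource, as if the surviving configurations were trained from scratch.

```python
def sha_schedule(n_configs, min_resource, eta):
    """Return [(resource_per_config, n_alive), ...] for simplified successive halving.

    After each round only the top 1/eta of the configurations survive,
    and the resource given to each survivor is multiplied by eta.
    """
    schedule = []
    n_alive, resource = n_configs, min_resource
    while True:
        schedule.append((resource, n_alive))
        if n_alive <= 1:
            break
        n_alive = max(1, n_alive // eta)
        resource *= eta
    return schedule


def total_budget(schedule):
    """Sum of resource * number of configurations over all rounds."""
    return sum(resource * n_alive for resource, n_alive in schedule)


many_short = sha_schedule(n_configs=27, min_resource=1, eta=3)
few_long = sha_schedule(n_configs=9, min_resource=4, eta=3)

print(many_short, total_budget(many_short))
print(few_long, total_budget(few_long))
```

Both schedules consume the same budget under this accounting, but the first one judges 27 configurations after a single unit of resource, while the second judges only 9 configurations after 4 units each. Which one is better depends on how early the learning curves become informative.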

For further details, please refer to the original papers and the post by the authors.

What is Hyperband

As mentioned briefly, Successive Halving has hyperparameters that trade off against each other. This trade-off, called “n versus B/n” in the Hyperband paper, affects the final result of HPO. Of course, all the trials could be correctly sorted and selected if their final results were available. But pruning tries to stop unpromising trials as quickly as possible, and the shapes of learning curves differ greatly from one trial to another. Also, to judge with enough confidence which of two trials is better, there should be a large enough gap between them. If some prior knowledge about the tendency of the learning curves is available, we can choose an appropriate n. However, what makes things complex and challenging is that the characteristics of learning curves depend on the task.

To address this trade-off, Hyperband runs multiple Successive Halving instances with different n values. Each instance, called a bracket, is responsible for a subset of the trials.
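To give a feeling for what “different n values” means, the sketch below computes a bracket schedule following the formulas in the algorithm listing of the Hyperband paper: with maximum resource R and reduction factor eta, bracket s starts n = ceil((B/R) * eta^s / (s + 1)) configurations with r = R * eta^(-s) resource each, where B = (s_max + 1) * R. The values R = 81 and eta = 3 are just an example, integer division is used for r, and this is not what Optuna’s HyperbandPruner computes.

```python
def hyperband_brackets(max_resource, eta):
    """Return [(s, n_configs, initial_resource), ...] for each bracket."""
    # s_max = floor(log_eta(max_resource)), computed with integers to avoid
    # floating-point rounding issues.
    s_max = 0
    while eta ** (s_max + 1) <= max_resource:
        s_max += 1

    brackets = []
    for s in range(s_max, -1, -1):
        numerator = (s_max + 1) * eta**s
        n_configs = (numerator + s) // (s + 1)  # ceiling of numerator / (s + 1)
        initial_resource = max_resource // eta**s
        brackets.append((s, n_configs, initial_resource))
    return brackets


for s, n_configs, initial_resource in hyperband_brackets(max_resource=81, eta=3):
    print(f"bracket s={s}: start {n_configs} configs with {initial_resource} resource each")
```

The first bracket explores many configurations very briefly, and the last one trains a handful of configurations with the full resource from the start, so the brackets together cover both ends of the “n versus B/n” trade-off.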

Implementing Hyperband in Optuna

In this section, I’ll first show you the most naive implementation and discuss its weak points. Then I explain the characteristics a Hyperband implementation needs, the difficulties in realizing them, and how we resolved them.

The Naive Implementation and What It Sacrifices

The simplest way to implement Hyperband with Optuna is to use multiple Studies, as follows.

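import optuna
from optuna.pruners import SuccessiveHalvingPruner


def main():
    # all_bracket_configs and objective are assumed to be defined elsewhere.
    best_trials = []
    for bracket_config in all_bracket_configs:
        sampler = ...
        # One Study, with its own Successive Halving pruner, per bracket.
        pruner = SuccessiveHalvingPruner(**bracket_config)
        study = optuna.create_study(sampler=sampler, pruner=pruner)
        study.optimize(objective, ...)
        best_trials.append(study.best_trial)
    # Pick the best trial across all brackets (assuming minimization).
    best_trial = min(best_trials, key=lambda trial: trial.value)
    ...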

This approach is solid in that it makes it easy to implement the algorithm faithfully as described in the paper, but only if you don’t care about how to resume the optimization workflow (optuna.load_study), how to manage storage, or how to execute it correctly in a distributed environment. Since distributed execution is one of Optuna’s major features, it is not desirable to partially give it up just to add a new feature, however impactful that feature is.

Challenges to Keep the Current Design and How We Resolved

So we decided to implement Hyperband in the same manner as the other pruners. This approach posed three challenges:

how to choose a bracket for a new trial
how to compute budgets of brackets
how to collect trials of one single bracket in pruning/sampling phases

In each section below, the description of the challenges is followed by the design decisions we made.

Challenges 1 & 2: Design of Hyperband

By definition, a budget relates to time, not to the number of trials. So, for each bracket to consume approximately the same budget, it’s necessary to assign a reasonable number of trials to each bracket. Also, as Optuna allows users to run study.optimize indefinitely and stop it with Ctrl+C, both n_trials and timeout can be None, in which case no budget information is available. No algorithm can satisfy all the requirements of Hyperband and the constraints of Optuna at once, so we decided to introduce randomness into the bracket selection and bracket budget computation algorithms. This design choice makes the implementation a bit different from the algorithm described in the paper; however, its performance is solid in the benchmark. The budget computation is also slightly modified, but the trend of the budget values follows the paper. It’s worth mentioning that, in spite of this naive allocation, it beats the other pruning algorithms in benchmarks executed with sile/kurobako.
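One way to picture random bracket selection is a weighted draw per trial, sketched below. This is only an illustration of the idea: the weights are made-up relative shares of trials per bracket, and the actual selection rule in HyperbandPruner is not reproduced here.

```python
import random
from collections import Counter


def choose_bracket(weights, rng):
    """Pick a bracket index with probability proportional to its weight."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


# Hypothetical relative share of trials that each bracket should receive.
bracket_weights = [81, 34, 15, 8, 5]

rng = random.Random(0)  # fixed seed so that the example is repeatable
assignments = [choose_bracket(bracket_weights, rng) for _ in range(10_000)]
counts = Counter(assignments)

for bracket_id in range(len(bracket_weights)):
    print(f"bracket {bracket_id}: {counts[bracket_id]} trials")
```

Because every trial is assigned independently, this works even when the total number of trials is unknown in advance, which is the situation described above when both n_trials and timeout are None.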

Challenge 3: How to Collect Appropriate Trials

How to collect trials in the pruning and sampling phases was the most debatable point. Optuna has kept the pruning module and the sampling module loosely coupled, and this has contributed to the modularity of Optuna (this also means that the Study module is responsible for a number of tasks). So, without any modification, brackets and samplers take into consideration the history of all the trials, including those monitored by the other brackets. Therefore, we needed to implement some filtering logic while keeping samplers and pruners as loosely coupled as possible. We had two ideas for how to resolve this.

One is to add the bracket index to trial.user_attrs and to set the list of trials of the same bracket as an attribute so that samplers can access it. The required changes can be found in optuna/optuna#785. As you can see, they touch 22 files, which is not reasonable from the perspective of maintainability. Also, most of the changes are quite ad hoc, even though they are required only by HyperbandPruner.

The other is to wrap the study object, in particular get_trials, so that trials are filtered based on the current trial’s bracket index, as done in optuna/optuna#809. This lets us encapsulate the required ad hoc changes inside the Hyperband implementation, i.e., the HyperbandPruner class. How? We implemented a wrapper class of Study whose get_trials method filters out irrelevant trials before returning the list of trials. This design made it simple to make TPESampler compatible with Hyperband. Since TPESampler uses the history of HPO, it has to track which bracket monitors which configurations. This seems to require a lot of changes, but as you can see in https://github.com/optuna/optuna/pull/828, the changes are compact.
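The wrapper idea can be sketched without Optuna at all. FakeTrial, FakeStudy, and BracketStudy below are hypothetical stand-ins written for this illustration; they show the pattern of filtering in get_trials while delegating everything else, not the actual classes in the pull requests.

```python
from dataclasses import dataclass, field


@dataclass
class FakeTrial:
    number: int
    bracket_id: int
    value: float


@dataclass
class FakeStudy:
    trials: list = field(default_factory=list)

    def get_trials(self):
        return list(self.trials)


class BracketStudy:
    """Wrap a study so that get_trials only returns trials of one bracket."""

    def __init__(self, study, bracket_id):
        self._study = study
        self._bracket_id = bracket_id

    def get_trials(self):
        return [
            t for t in self._study.get_trials() if t.bracket_id == self._bracket_id
        ]

    def __getattr__(self, name):
        # Anything other than get_trials is delegated to the wrapped study.
        return getattr(self._study, name)


study = FakeStudy(
    trials=[
        FakeTrial(number=i, bracket_id=b, value=float(i))
        for i, b in enumerate([0, 1, 0, 2, 1, 0])
    ]
)
view = BracketStudy(study, bracket_id=0)

print([t.number for t in view.get_trials()])
```

A sampler or pruner that receives the wrapper instead of the original study sees only the history of its own bracket, so it needs no knowledge of brackets at all.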

Benchmark Results

Takeaway: the current implementation works well and is easy to use.

The task is chosen from HPOBench [5].

First of all, to show the benefit of Hyperband, we ran an experiment with TPESampler and different pruners. In the figure below, deep green represents HyperbandPruner, which achieves the best performance. Intuitively, Hyperband removes the burden of finding the best eta value for Successive Halving. Naturally, some Successive Halving instances would beat Hyperband if they used good hyperparameters, but that is not the case in this benchmark. We attribute this to the nature of TPESampler: it sometimes gets stuck at saddle points, while HyperbandPruner virtually runs four TPESamplers, which helps avoid those local optima. It might be worth mentioning that the task is different from that of the Hyperband paper.

RandomSampler

As mentioned above, SuccessiveHalvingPruners with different values of eta, a parameter that affects n, show different characteristics. More specifically, the case of eta=0 is terrible while the others achieve better and similar scores. This is the “n versus B/n tradeoff.” Also, you can see that Hyperband is not the best, but it looks competitive considering that we don’t need to run different Successive Halving experiments.

Conclusion

In this post, I briefly introduced Optuna and explained how I implemented Hyperband. The current implementation works really well, although a well-configured Successive Halving is sometimes better. However, it’s experimental, and there’s room to improve how a bracket is selected for a Trial and how budgets are computed for the SuccessiveHalvingPruners run by HyperbandPruner. The Optuna team really welcomes any comments, feedback, and thoughts! Feel free to comment, join Gitter, and enjoy Optuna!

Cited Works

[1] Google Vizier: A Service for Black-Box Optimization, https://research.google.com/pubs/archive/46180.pdf

[2] [1502.07943] Non-stochastic Best Arm Identification and Hyperparameter Optimization

[3] [1810.05934] Massively Parallel Hyperparameter Tuning

[4] Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization

[5] [1905.04970] Tabular Benchmarks for Joint Architecture and Hyperparameter Optimization

How We Implement Hyperband in Optuna was originally published in Optuna on Medium, where people are continuing the conversation by highlighting and responding to this story.

Get Better fastai Tabular Model with Optuna

Masaki Kozuki — Fri, 01 Nov 2019 03:01:01 GMT

Note: this post uses fastai v1.0.58 (PyTorch v1.3.0) and optuna v0.17.1.

Introduction

Optuna is a hyperparameter optimization framework applicable to machine learning frameworks and black-box optimization solvers. We can use Optuna with ease in our code by defining an objective function to be optimized. See examples in the repository.

The fastai library makes it easier to try deep learning and provides a bunch of the latest techniques (best practices) that enable us to obtain competitive models: learn.lr_find for finding good learning rates, learn.fit_one_cycle for superconvergence, and MixUpCallback for the de facto standard data augmentation. It also supports features for data validation (Look at data | fastai) and for investigating trained computer vision models (Computer Vision Interpret [fastai]).

fastai has three applications: vision, text, and tabular. It focuses on fine-tuning in vision and text, as there are a ton of neural network models trained on massive datasets, e.g., ImageNet for vision models and texts collected from the web for language models. Those models are said to have common sense (I mean they have enough basic knowledge that they can quickly adapt to new tasks). However, for the last application, tabular, there are no appropriate datasets for pretraining because it seems almost impossible to define what a general feature of tabular tasks is. If you look for tabular datasets on Kaggle, there are a bunch of competitions, for example, Instacart, Rossmann, and Titanic.

Optimize TabularModel for Rossmann data

So, I’ll try to get a better TabularModel trained on the Rossmann dataset than the one obtained in fastai’s lecture by letting Optuna find the optimal number of layers, the number of units in each layer, and the dropout ratios.

The task is https://www.kaggle.com/c/rossmann-store-sales#. In this competition, the goal is to create a model that predicts store sales for the coming six weeks. As you can see in the Data fields, there are both categorical and numerical features. In TabularModel, numerical features are handled as one vector, and each categorical feature is embedded into a vector. This technique is called Entity Embedding. Intuitively, Entity Embedding enables models to learn useful relationships between instances of categorical features from the training dataset. So, TabularModel has some embedding layers followed by groups of linear (a.k.a. dense), batchnorm, and dropout layers. The activation function is ReLU.

For those interested in the details of data processing, please see the lesson video. Here, I just use the preprocessing Jupyter notebook to get the same data used in the lecture.

In the original notebook, a model is defined as below.

learn = tabular_learner(data, layers=[1000,500], ps=[0.001,0.01], emb_drop=0.04, 
                        y_range=y_range, metrics=exp_rmspe)

This means that the model has two hidden layers, which apply dropout with ratios of 0.001 and 0.01, respectively. It also applies dropout with a ratio of 0.04 to the concatenated vector of the embeddings of the categorical features. See the docs for the details. I’ll use Optuna to find better values for the following hyperparameters:

the number of layers
the number of units each layer has
the dropout ratio of each layer
the dropout ratio of a concatenated vector

Define fastai TabularModel with Optuna

To use Optuna in your training scripts, the only thing to do is to define an objective function that takes an optuna.trial.Trial as its input and returns the value to optimize, for instance, accuracy or loss on the validation dataset, as follows.

https://medium.com/media/9261118f09010fd85d05eceb955e9e63/href
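As a rough, self-contained sketch of what such an objective can look like for the four kinds of hyperparameters listed above: StubTrial is a hypothetical stand-in for optuna.trial.Trial so that the example runs without Optuna or fastai, the search ranges are assumptions, and the returned score is a dummy value rather than a real validation metric.

```python
import random


class StubTrial:
    """Hypothetical stand-in for optuna.trial.Trial (only two suggest methods)."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self.params = {}

    def suggest_int(self, name, low, high):
        self.params[name] = self._rng.randint(low, high)
        return self.params[name]

    def suggest_discrete_uniform(self, name, low, high, q):
        steps = int(round((high - low) / q))
        self.params[name] = low + q * self._rng.randint(0, steps)
        return self.params[name]


def suggest_architecture(trial):
    """Sample layer count, units, and dropout ratios (all ranges are assumptions)."""
    n_layers = trial.suggest_int("n_layers", 1, 5)
    layers, ps = [], []
    for i in range(n_layers):
        layers.append(int(trial.suggest_discrete_uniform(
            "n_units_layer_{}".format(i), 100, 1500, 100)))
        ps.append(trial.suggest_discrete_uniform(
            "dropout_p_layer_{}".format(i), 0.0, 0.5, 0.05))
    emb_drop = trial.suggest_discrete_uniform("emb_drop", 0.0, 0.5, 0.05)
    return layers, ps, emb_drop


def objective(trial):
    layers, ps, emb_drop = suggest_architecture(trial)
    # A real script would build tabular_learner(data, layers=layers, ps=ps,
    # emb_drop=emb_drop, ...) here, train it, and return the validation metric.
    # The dummy score below only keeps the sketch runnable.
    return sum(ps) + emb_drop + 1.0 / sum(layers)


trial = StubTrial(seed=0)
print(objective(trial), trial.params)
```

The point of the define-by-run style is visible in the loop: the number of per-layer parameters depends on the value sampled for n_layers.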

As a reference, in the original notebook, fit_one_cycle was used three times, and each fitting ran for five epochs. The exp_rmspe scores on validation were 0.105433, 0.116344, and 0.126323. With Optuna optimization over 100 trials, I got 0.102660. The details are below.

Best trial:
  Value:  0.10201516002416611
  Params:
    n_layers: 3
    n_units_layer_0: 800
    dropout_p_layer_0: 0.1
    n_units_layer_1: 900
    dropout_p_layer_1: 0.2
    emb_drop: 0.1

Faster Optimization with Pruning

While I did get a better model, Optuna ran all 100 trials to completion, i.e., approximately `100 trials x 5 epochs/trial = 500 epochs`. However, not all the trials use reasonable hyperparameters, because each hyperparameter is sampled with some randomness.

So, intuitively, we can stop trials with bad hyperparameters early to reduce the total time. This early stopping in hyperparameter optimization is called pruning. Optuna supports several pruning strategies, such as Successive Halving, and provides callbacks for popular machine learning frameworks such as Keras, MXNet, Chainer, and PyTorch Lightning. See the documentation for the list.
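To illustrate the idea of pruning, here is a toy sketch of a simplified median stopping rule. It is not Successive Halving and not Optuna's implementation; the class, its methods, and the learning curves are made up for the example, and lower values are assumed to be better.

```python
from statistics import median


class MedianRule:
    """Prune a trial whose value at a step is worse than the median of the
    values earlier trials reported at the same step (lower is better)."""

    def __init__(self):
        self._history = {}  # step -> values reported by earlier trials

    def should_prune(self, step, value):
        previous = self._history.get(step, [])
        return bool(previous) and value > median(previous)

    def record(self, curve):
        for step, value in enumerate(curve):
            self._history.setdefault(step, []).append(value)


def run_trial(curve, rule):
    """Replay a learning curve; return the number of epochs actually run."""
    seen = []
    for step, value in enumerate(curve):
        seen.append(value)
        if rule.should_prune(step, value):
            break
    rule.record(seen)
    return len(seen)


curves = [[0.5, 0.4, 0.3], [0.6, 0.55, 0.5], [0.3, 0.2, 0.1]]
rule = MedianRule()
epochs = [run_trial(curve, rule) for curve in curves]
print(epochs)  # [3, 1, 3]
```

The second trial starts worse than the first one and is stopped after a single epoch, which is where the time saving comes from.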

Implement FastAIPruningCallback(TrackerCallback)

In fastai, training and validation loops are abstracted in learn.fit or learn.fit_one_cycle. Pruning is a variant of early stopping; the only difference is that the decision is made by optuna.trial.Trial, not by the Learner. So I implemented the callback in this PR for Optuna; a simpler implementation is below.

from typing import Any

import optuna
from fastai.basic_train import Learner
from fastai.callbacks import TrackerCallback


class FastAIPruningCallback(TrackerCallback):
    def __init__(self, learn, trial, monitor):
        # type: (Learner, optuna.trial.Trial, str) -> None

        super(FastAIPruningCallback, self).__init__(learn, monitor)

        self.trial = trial

    def on_epoch_end(self, epoch, **kwargs):
        # type: (int, Any) -> None

        # Value of the monitored metric for this epoch; None if unavailable.
        value = self.get_monitor_value()
        if value is None:
            return

        # Report the intermediate value and let the pruner decide.
        self.trial.report(value, step=epoch)
        if self.trial.should_prune():
            message = 'Trial was pruned at epoch {}.'.format(epoch)
            raise optuna.structs.TrialPruned(message)

With pruning, the final result might be less competitive than that of a study without pruning, because it is almost impossible to predict learning curves precisely. However, the optimization time should be reduced a lot. The result is as follows:

Study statistics:
  Number of finished trials:  100
  Number of pruned trials:  63
  Number of complete trials:  37
Best trial:
  Value:  0.10323499143123627
  Params:
    n_layers: 3
    n_units_layer_0: 900
    dropout_p_layer_0: 0.1
    n_units_layer_1: 1100
    dropout_p_layer_1: 0.15000000000000002
    emb_drop: 0.05

This post is summarized in the table below. Training was done on a GTX 1080 Ti.

https://medium.com/media/0dee3da91e36411ceea7f953347b4464/href

Also, as the table shows, the total time needed by Optuna is reduced from 852 to 555 minutes, a reduction of about 35%.

How Pruning Affects Study Time

The script used in this blog post is https://github.com/crcrpar/fastai-optuna-rossman.

fin.

Swift for TensorFlow meetup in Tokyo

Masaki Kozuki — Fri, 12 Jul 2019 06:15:21 GMT

On 2019/07/10, I attended Swift for TensorFlow meetup #1.

First, I want to say thank you to date-san for kick-off, some members from mercari for offering awesome event space and food&snacks, Eugene for solid tutorials with Colab, omochi-san for the great general introduction to neural network and frameworks for it in Japanese, and of course, all the attendees with curiosity.

I expected the attendees’ backgrounds to be either iOS or machine learning. However, some compiler experts also took part and asked advanced questions. I had never imagined that situation, but the interaction among iOS, ML, and compiler engineers was so exciting.

The first speaker was omochi-san, an iOS developer and an expert on the Swift compiler. He briefly covered the major components of deep learning, from differentiation to define-by-run and define-and-run. After this review, he explained why Swift for TensorFlow is promising and what he expects S4TF to be: for instance, the killer app that makes Swift a general-purpose language in terms of its share of usage.

The second and main speaker was Eugene Burmako, who works on the S4TF team. Thanks to omochi-san’s excellent introductory talk, Eugene seemed to skip some slides and dive directly into Jupyter notebook examples on Colab.
S4TF has many cool features, but for me the coolest are that its compiler can differentiate our custom functions if they are mathematically differentiable, and that its compiler extracts Tensor operations and hands them to the TensorFlow runtime (this is called Graph Program Extraction, but it is deprecated now). Additionally, the compiler tells us why a function is not differentiable with rich messages.

Lastly, although I’m not familiar with TensorFlow, S4TF, or compilers, I talked about MLIR, a new intermediate representation that helps unify the complicated TensorFlow compiler ecosystem and, at the same time, can improve the current Clang compilation flow. I chose this topic because studying new things always gives me a new perspective on what I am familiar with or accustomed to, and I believed it would be a great opportunity to get feedback from experts, which is sometimes difficult to obtain through self-study.

The preparation process was naive and straightforward. First, I googled “MLIR” and found videos from the 2019 LLVM European Developers Meeting, where some developers working on MLIR gave a general introduction to MLIR and a tutorial on how to create a toy language with it. I will explain why MLIR is needed and what MLIR supports; however, I recommend that you watch the conference talk and tutorial :)

Today, TensorFlow supports several frameworks optimized for inference: TensorFlow Serving, TensorFlow Lite including NN API, TensorRT, nGraph, and Core ML. TensorFlow also improves the performance of models by supporting XLA, which is good at optimizing computational graphs (a.k.a. dataflow graphs) by, for instance, fusing operations using knowledge about them. Therefore, there are various kinds of computational graph representations that TF needs to support or target: TF Graph, XLA HLO, TF Lite, TensorRT, nGraph, and Core ML. To make things worse, those representations are similar but different. Also, since machine learning techniques evolve day by day, there will be more and more operations to support in their APIs.

The MLIR type system includes scalar types (including bfloat16, which is used in TPUs), vector types, and tensor types, which allow multi-dimensional arrays with dynamic shapes. Also, MLIR operations support declarative operation definitions, which ease the work of implementing tons of ML operations in C++. There’s no fixed list of operations, and it’s totally extensible. One cool thing about MLIR is that its operations can take multiple arguments and return multiple values. They can also be configured using attributes, which should be important for things like the epsilon value of BatchNorm or the strides and paddings of Convolution. Furthermore, operations can have more than one region.

This rich and flexible spec of MLIR will ease the cumbersome work of translating an operation of framework A into the corresponding operations of framework B by defining dialects using MLIR. In an MLIR dialect, we need to define its own types, pass managers, and so on.

So far, we’ve seen that MLIR lets dialects keep their structures and supports declarative operation definitions using dialects. But how about translating an operation of one framework to another? To address this, MLIR provides M-to-N patterns.

Finally, MLIR can improve the Clang compilation flow by enabling us to implement a C/C++-specific higher-level IR (= CIL), like SIL in Swift. We can also make OpenMP a dialect. This will make optimized C/C++ (with OpenMP) code more accessible.

Lessons from fastai Machine Learning

Masaki Kozuki — Sun, 03 Feb 2019 06:43:45 GMT

I watched all the lessons, a long time after [this post](https://medium.com/@crcrpar/what-i-learned-from-fast-ai-ml-till-5-510040c6d91f). However, the last couple of lessons had some content, like the Kaggle Rossmann competition, that was also covered in the Deep Learning course, so I did not take many notes.

The most impressive contents are
* Implement Random Forest from scratch using numpy and optionally Cython. Here, I think we can use CuPy: A NumPy-compatible matrix library accelerated by CUDA.
* Random Forest interpretation: feature importance, partial dependence, tree interpreter type 2 (explain one prediction), and extrapolation.
As to interpretation, I think Christoph Molnar’s “Interpretable Machine Learning” is a must-read because the book covers conventional machine learning as well as methods for explaining neural networks’ predictions.

But because I’ve been busy, I just copy and paste my log here. I might elaborate on this post afterward.
Also, the last couple of lessons covered some content from the fast.ai Deep Learning course.

What Impressed Me Most

One lesson from the second half is that what really matters is feature importance, not the AUC score or accuracy. Another lesson is that Random Forest returns the average of neighboring points in the tree space. So, if the inputs are far from the training samples in that space, predictions will just be the average of the whole training dataset.

Hereafter, I just copy my notes. So feel free to stop reading :)

Lesson 6

Why do I need machine learning?
the drivetrain method
Examples of how ppl use machine learning in business (in slide).
Horizontal applications.

Churn: to predict who is going to leave.
jeremy howard data products book would be interesting.
defined objective, levers, data, models.
levers = what inputs can we control
data = what data we can collect
models = how the levers influence the objective
levers for churn prediction is…
motivate users not to leave the service?
change the prices?
clarify what we can actually do!
after this, clarify what data is available or necessary.
In practice, care more about simulation.
build a simulation model;
predict what happens by what the model predicts.
~~optimization model basically ~~
predictive model goes into simulation model giving the predictions.
simulation model predicts the probability the target changes his/her behavior by the action we made.

about interpretation more of prediction.
Use feature importance to decide the next action!

In business, what really matters is feature importance, i.e., understanding not AUC score.

---

vertical applications
readmission risk
a predictive model is helpful of course but feature importance would play a role.
you can build a chart w/o machine learning, but if with machine learning and its feature importance, the chart will be much improved and help decision making.

there is still skepticism from unfamiliarity with the approach to data.

--- break ---
random forest interpretation.
confidence based on tree variance.

how to calculate feature importance for a certain feature. type 1
how to calculate from a trained random forest?
- shuffle randomly the column and calculate the score and the gain.
jeremy looks at the relative difference.
The scalars themselves are not important to him?
also plots of gain is helpful. plateau of low values features would be not helpful.

Partial Dependence.
there will always be a bunch of interactions between different features.
So a 2D plot cannot describe this and would be a big problem.
How to calculate?
by leaving all other features as is, replacing the values of one feature with a single value, and then calculating the prediction. Repeat this!
partial dependence plot tells the underlying truth.

Tree Interpreter type 2
feature interpretation for a specific observation
like a waterfall

Extrapolation
(live coding)
gain from multiple enclosures is the interaction of them.
RF just returns the average of neighbor points in the tree space.
If the inputs are really far from the samples in training dataset,
it just returns the average of the whole training dataset.
ATM, no way to handle this, but there are time series analysis and neural net.

Lesson 7

random forest and neural nets are 2 keys.
a lot of progress has been made in decision tree based methods like random forest and GBM.
RF is harder to screw up than GBM.

22 observations. t-distribution turns to normal distribution.
validation.
standard error = std / sqrt(n)
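A tiny sketch of the formula above, using the sample standard deviation from the standard library (the function name and the example values are made up):

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


scores = [0.80, 0.82, 0.78, 0.81]
print(standard_error(scores))
```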

oversampling till the number of instances in each class is equal to that of the most common class is the right thing to do.
Or, stratified sampling to create mini batches.

---

bulldozer kaggle competition.
sample rows w/ replacement

Decision Tree doesn’t have randomness.
Randomness happens in creating a bunch of decision trees in Random Forest, i.e., choosing indexes.

how to find variables to split in decision tree?
lhs.std() * lhs.score + rhs.std() * rhs.score()
O(N) implementation using sqrt(sum(x**2) / N - mean(x)**2)
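The split search can be sketched with running sums, so that each candidate split is scored in constant time after one sort. This is a sketch under assumptions: each side's standard deviation is weighted by its number of samples (my reading of the note above), and the function names are chosen for this example.

```python
import math


def std_agg(cnt, s1, s2):
    """Standard deviation from a count, a sum, and a sum of squares."""
    return math.sqrt(max(s2 / cnt - (s1 / cnt) ** 2, 0.0))


def find_best_split(x, y, min_leaf=1):
    """Return (score, threshold) of the best split 'x <= threshold'; lower is better."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]

    lhs_cnt, lhs_sum, lhs_sum2 = 0, 0.0, 0.0
    rhs_cnt, rhs_sum, rhs_sum2 = len(ys), float(sum(ys)), float(sum(v * v for v in ys))
    best_score, best_threshold = float("inf"), None

    for i in range(len(xs) - min_leaf):
        # Move sample i from the right-hand side to the left-hand side.
        v = ys[i]
        lhs_cnt += 1
        lhs_sum += v
        lhs_sum2 += v * v
        rhs_cnt -= 1
        rhs_sum -= v
        rhs_sum2 -= v * v
        # Skip splits that leave too few samples or fall between equal values.
        if i < min_leaf - 1 or xs[i] == xs[i + 1]:
            continue
        score = (std_agg(lhs_cnt, lhs_sum, lhs_sum2) * lhs_cnt
                 + std_agg(rhs_cnt, rhs_sum, rhs_sum2) * rhs_cnt)
        if score < best_score:
            best_score, best_threshold = score, xs[i]
    return best_score, best_threshold


print(find_best_split([1, 2, 3, 10, 11, 12], [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]))
```

Without the running sums, every candidate split would recompute the standard deviation of both sides from scratch, which makes the search quadratic in the number of rows.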

class A
def foo(self, …)
def foo(self, …)
A.foo = foo

Start from assumption and assuming I’m wrong in coding.
ternary operator is helpful.

sklearn’s Random Forest is written in Cython.
First time, it should be slower.

Working with NumPy (Cython docs)
https://cython.readthedocs.io/en/latest/src/tutorial/numpy.html

RF is a nearest neighbor method.

Lesson 8

pickle works for nearly every python object but not optimally.
pickle files are only for Python.

In random forest, normalization to independent variables doesn’t matter.
The order does matter. Random Forest ignores the scale or statistical distribution problems.

What I Learned from fast.ai ML till 5

Masaki Kozuki — Sun, 21 Oct 2018 03:09:20 GMT

This post is about what I learned from fastai Machine Learning course published this September.
Edit (2018/10/20): Since fastai course 1 v3 starts in a few days, I stopped watching lectures 6~12.
Note: In this post, I use fastai v0.7, not v1.0.

What is Random Forest?

Random Forest is one of the most famous machine learning algorithms because it is easy to use and applicable to both classification and regression problems, even if each data sample is composed of both categorical (e.g. ZIP code) and continuous (e.g. price) variables. Also, random forest avoids bad overfitting, and it can achieve fairly good results with little feature engineering. Further, data samples do not need to be i.i.d., while most linear machine learning algorithms require this property. So, it is a good starting point for any project related to machine learning.

Use Random Forest To Understand Data More!

Random Forest consists of a bunch of trees. In scikit-learn, we can choose the number of trees with the n_estimators argument. If we pass 1, the trained model is a single decision tree, and the visualization of that model tells you which features are more important/effective than the others.

Let me show you an example from the Lesson 1 notebook, where the goal is to predict the sale price (= regression) of bulldozers. The details are in the Kaggle bulldozers competition. This dataset has both categorical and continuous variables, and each data sample has a timestamp.

Visualization of the decision tree from lecture notebook.

In the above figure, every single node has 4 lines: 1) feature (column) name ≤ criterion, 2) Mean Squared Error = loss value, 3) # of samples included, 4) average of predicted sale prices. As you can see, features further to the left are more important/effective than the others. In this figure, some features are categorical, yet their criteria are float numbers. Why? Because categorical variables are translated into integers. Of course, categorical variables sometimes have an order, but usually, reflecting the order in the translation doesn’t improve scores. Another thing I want to mention is that fastai provides a really useful draw_tree function, shown below:

draw_tree function

Random Forest in Supervised Learning

In the section above, I showed that machine learning sometimes helps us learn more about datasets. However, our original goal is to get super cool models that can predict the values/labels of unknown data samples. As the name suggests, a random forest is composed of a bunch of decision trees, each of which is like the tree used in the previous section, and it averages their predictions to make outputs more accurate with less variance. This averaging method is usually called bagging. As a side note, we can use bagging when there are different models, like a pair of SVM and Random Forest. In scikit-learn’s random forest, n_estimators defines how many trees are used to build the forest, so a larger n_estimators means less variance and higher accuracy. You can check this effect by changing the n_estimators argument of Random Forest, but note that there is a limit beyond which you cannot improve any more. When you use Random Forest and your model has a serious gap between the scores on the training set and the corresponding validation/development set, it is a good choice to set oob_score to True. OOB stands for “out-of-bag”. The out-of-bag score is calculated using the samples that were not used to build each tree, so intuitively it is like a quasi-validation score. However, this score is usually worse than the validation score.
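Where the out-of-bag samples come from can be sketched with the standard library alone. The function names are made up for this illustration; this is not scikit-learn's implementation.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Sample row indices with replacement, as done for each tree in bagging."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def out_of_bag(n_samples, in_bag):
    """Rows that were never drawn for a tree; they can be used to score that tree."""
    drawn = set(in_bag)
    return [i for i in range(n_samples) if i not in drawn]


rng = random.Random(0)
n_samples = 1000
in_bag = bootstrap_indices(n_samples, rng)
oob = out_of_bag(n_samples, in_bag)
print(len(set(in_bag)), len(oob))
```

Since each row is missed by one draw with probability 1 - 1/n, about 1/e (roughly 37%) of the rows are expected to be out-of-bag for any given tree, so every tree gets its own small held-out set for free.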

Frequently Tuned Hyper Parameters

Of course, there are some frequently tuned parameters. I experimented with the effects of these parameters in this notebook.

n_estimators : Number of trees composing one forest.
max_depth : Maximum depth (= height) of each tree. If not specified, the depth is limited only by min_samples_leaf.
min_samples_leaf : Minimum number of samples each leaf node must have. In other words, the minimum number of samples required to expand nodes / deepen trees.
max_features: Number of features (columns) to consider when looking for the best split. If this is not specified, the algorithm looks at all the remaining columns in each step. With this argument, every single tree becomes less accurate, but the trees have different properties. This leads to better models.

Technique for Categorical Variables

Random Forest doesn’t know the order of categorical variables unless we tell it. For example, if one category is the size of a product (large, middle, small), we would assign the labels 2, 1, and 0, or 1, 0, and -1, respectively. However, if the labels are messed up, like large->1, middle->2, and small->0, the model knows nothing about the order. The situation gets worse when the number of classes in a category is large. One way to attack this is one-hot encoding. It is easy to implement: add one column per class, where each column represents whether the sample has the attribute represented by that column. So, if you apply one-hot encoding to large, middle, and small, you add a large column, a middle column, and a small column to the tabular dataset, and every row has a 1 in exactly one of the 3 columns. This technique might not improve the score, but it usually changes the order of feature importance, so it can give you a new understanding of your datasets.
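One-hot encoding is short enough to write by hand. The sketch below uses plain lists rather than a DataFrame, and the function name is made up for the example.

```python
def one_hot(values):
    """Turn a list of category labels into one 0/1 column per class."""
    classes = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in classes}


sizes = ["large", "small", "middle", "large"]
columns = one_hot(sizes)
for name, column in columns.items():
    print(name, column)
# large [1, 0, 0, 1]
# middle [0, 0, 1, 0]
# small [0, 1, 0, 0]
```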

Feature Importance

One easy way to tell the importance is calculating the difference between scores on the dataset where one column is randomly shuffled and the original validation dataset. After applying this method to all the columns one by one, you will get the list of gaps. Intuitively, if the gap is small, the shuffled column (feature) is not important.
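The shuffling procedure can be sketched as follows. The toy predict function stands in for a trained model, and RMSE is used as the score; both are assumptions for the demo.

```python
import random


def rmse(y_true, y_pred):
    return (sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)) ** 0.5


def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Increase in RMSE after shuffling each column, one column at a time."""
    rng = random.Random(seed)
    baseline = rmse(targets, [predict(r) for r in rows])
    gaps = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        gaps.append(rmse(targets, [predict(r) for r in shuffled]) - baseline)
    return gaps


def predict(row):
    # Toy "trained model": the target depends only on the first feature.
    return 3.0 * row[0]


rng = random.Random(1)
rows = [[rng.random(), rng.random()] for _ in range(200)]
targets = [predict(r) for r in rows]
gaps = permutation_importance(predict, rows, targets, n_features=2)
print(gaps)
```

The second feature is ignored by the toy model, so shuffling it leaves the score unchanged, while shuffling the first feature makes the error jump.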

Another way is partial dependence. This is obtained by replacing one column with one constant value. By plotting this, we can know whether something extraordinary happens or not.
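A partial dependence curve can be computed the same way: force one column to a constant, average the predictions, and repeat for several constants. The toy model and the function name are assumptions for the demo.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each value in `grid`."""
    curve = []
    for value in grid:
        replaced = [r[:feature] + [value] + r[feature + 1:] for r in rows]
        curve.append(sum(predict(r) for r in replaced) / len(replaced))
    return curve


def predict(row):
    # Toy model standing in for a trained random forest.
    return 2.0 * row[0] + row[1] ** 2


rows = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
curve = partial_dependence(predict, rows, feature=0, grid=[0.0, 1.0, 2.0])
print(curve)
```

In this toy case the curve rises by 2.0 per unit of the first feature, which recovers the coefficient of that feature regardless of what the other column contains.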

When you use temporal datasets…

Datasets containing timestamps are more difficult than other datasets to split into training, validation, and test sets, because if the validation and/or test dataset includes samples older than those in the training dataset, your model predicts with strong prior knowledge about validation and/or test. So if your dataset is time-related, split it in chronological order. The prediction on the test dataset is then executed after finding good hyperparameters and retraining a model with them on the training + validation dataset.
Also, it is worth trying to remove implicitly time-related variables from the model inputs. To detect whether a variable is time-related, train a model to predict whether each sample comes from the training or the validation dataset. Variables that make this prediction easy should be removed, which will improve your models.
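A chronological split is only a sort followed by a cut. In this sketch the row format, the field name, and the validation fraction are assumptions.

```python
def chronological_split(rows, timestamp_of, valid_fraction=0.2):
    """Sort by time and keep the most recent rows for validation."""
    ordered = sorted(rows, key=timestamp_of)
    n_valid = max(1, int(len(ordered) * valid_fraction))
    return ordered[:-n_valid], ordered[-n_valid:]


rows = [
    {"date": "2018-03-01", "price": 10},
    {"date": "2018-01-15", "price": 12},
    {"date": "2018-05-20", "price": 9},
    {"date": "2018-02-02", "price": 11},
    {"date": "2018-04-11", "price": 13},
]
train, valid = chronological_split(rows, timestamp_of=lambda r: r["date"])
print([r["date"] for r in train], [r["date"] for r in valid])
```

ISO-formatted date strings sort correctly as plain strings, which is why no date parsing is needed here.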

some deep learning algorithm related to fashion

Masaki Kozuki — Sat, 10 Feb 2018 02:36:11 GMT

This January, I worked on a survey of virtual try-on and new-pose image synthesis methods that use deep learning, mainly GANs.

https://medium.com/media/37d00afb9d12372283793a45a380c62a/href

The algorithms mentioned in the slides above are useful; however, most of them rely on other pre-trained models like OpenPose, which is not free for commercial use. In addition, some use pre-trained semantic segmentation models tuned for the ATR human parsing dataset. They use these auxiliary models so that the main image-generating model can identify the areas of faces and clothes and generate more vivid and sharp images.

These methods share one critical problem: there is no clear criterion for evaluating models’ outputs. The only option is qualitative evaluation, so we need to devise one. Another problem is that most of them apply only to tops, and they struggle with human body parts such as forearms and hair.

note on Style Transfer methods

Masaki Kozuki — Wed, 22 Nov 2017 14:41:22 GMT

Style transfer is one of the most famous deep learning applications. You know, one photo takes on the style of another photo or a famous painting, as distinct from turning a horse into a zebra (that is CycleGAN). Style transfer methods differ from ordinary deep learning methods in that they use pre-trained convolutional neural networks, e.g., VGG16, as one component of the loss function.

The first algorithm did not generate stylized images with a neural network; it generated them by optimization (L-BFGS).
So, it didn’t require training, which takes a lot of time, but it required a lot of time to “infer” compared to NN-based style transfer methods. On the other hand, NN-based style transfer methods require training time in exchange for fast image generation. Naive NN-based style transfer is limited in that one trained model can handle only one style.

So what should be solved is how to handle multiple style images in one NN, or generator. As new style transfer methods were invented and improved, a new normalization method, Instance Normalization, was formulated, and it is also used in CycleGAN.
Recent advances in NN based style transfer are:

iterative generation
newly designed layer which contains multiple style features
autoencoder based

Iterative generation is described in “Universal Style Transfer via Feature Transforms”, which is to appear at NIPS 2017. There, Instance Normalization is not used and images are generated in 5 steps; “5” corresponds to the number of blocks in VGG16/19.