Bayesian Optimization Meets Self-Distillation

Published in Lunit Team Blog, May 8, 2024 (8 min read)

Introduction

At Lunit, model accuracy is critical because our models are used directly in medical diagnostics and treatment decisions that can save lives. The performance of deep learning models is highly sensitive to hyperparameters such as learning rate, weight decay, data augmentation settings, and resampling strategies for imbalanced data. However, tuning these hyperparameters by hand is not only inefficient but often beyond human capability because the search space is high-dimensional.

Difficulty of tuning high-dimensional hyperparameter search space (Drawn with DALL-E)

To overcome these challenges, we have used Bayesian optimization (BO), which iteratively suggests promising hyperparameters based on the outcomes of prior trials. However, BO overlooks a valuable asset: the knowledge stored in the models trained during those trials. We saw an opportunity to improve the process. Inspired by recent advances in self-distillation (SD), which have shown that transferring knowledge from a previously trained model can improve performance, we decided to integrate SD into BO to fully leverage the knowledge obtained from previous trials.

We call this integrated framework Bayesian Optimization meets Self-diStillation (BOSS), and our paper on it was accepted for presentation at ICCV 2023. BOSS first follows the BO process and suggests, based on past observations, the hyperparameter configuration most likely to improve performance. It then selects pre-trained networks from previous trials, which the conventional BO process would discard, and uses them for the next round of training with SD. This process is repeated iteratively, allowing the network to keep improving upon previous trials. Let’s explore in detail how BOSS works and how it has enhanced the capabilities of our models.

Performance curves of various methods on CIFAR-100 with VGG-16 architecture. BOSS persistently improves upon previous trials by leveraging prior knowledge as a stepping stone to push the performance boundary.

What is Bayesian Optimization?

Bayesian Optimization (BO) is a method used to optimize complex functions that are expensive to evaluate. It is effective for tuning hyperparameters in machine learning models, where each evaluation can be computationally intensive.

An animation of Bayesian optimization (source)

Key Elements of Bayesian Optimization

Surrogate Model: BO uses a probabilistic model to approximate the objective function (e.g. accuracy). This model predicts the outcome based on the hyperparameters used.
Acquisition Function: This function guides the selection of the next set of hyperparameters to evaluate by balancing exploration (testing new areas) and exploitation (focusing on promising areas).
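
To make the exploration and exploitation trade-off concrete, here is a minimal sketch of one widely used acquisition function, expected improvement, for a maximization problem. It assumes the surrogate returns a Gaussian predictive mean and standard deviation for a candidate; the function name and the xi margin are illustrative choices, and this post does not state which acquisition function BOSS uses.

```python
import math


def expected_improvement(mean, std, best_so_far, xi=0.0):
    """Expected improvement of a candidate over the best value seen so far.

    mean, std: surrogate's predictive mean and standard deviation at the candidate.
    best_so_far: best objective value observed in previous trials.
    xi: optional margin that encourages exploration (illustrative default).
    """
    gain = mean - best_so_far - xi
    if std <= 0.0:
        # No uncertainty left: the improvement is deterministic.
        return max(gain, 0.0)
    z = gain / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    # The first term rewards a high predicted mean (exploitation),
    # the second term rewards high uncertainty (exploration).
    return gain * cdf + std * pdf
```

With the same predicted mean, a candidate with larger uncertainty receives a higher score, which is how the acquisition function steers the search toward unexplored regions.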

How It Works

BO builds the surrogate model from initial random evaluations of the objective function.
It then enters a loop where the acquisition function selects the next hyperparameters to evaluate.
The objective function is evaluated, and the surrogate model is updated.
This process repeats, refining the hyperparameter choices each time to converge on the optimal solution.
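
The loop above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the implementation used in BOSS: a single hyperparameter scaled to [0, 1], a zero-mean Gaussian-process surrogate with an RBF kernel, an upper-confidence-bound acquisition maximized over a fixed grid, and a hypothetical toy_objective standing in for "train a model and report validation accuracy".

```python
import math
import random


def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two scalar hyperparameter values."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def cholesky(A):
    """Lower-triangular L with L * L^T == A, for a small positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L


def forward_sub(L, b):
    """Solve L y = b for lower-triangular L."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def backward_sub(L, y):
    """Solve L^T x = y for lower-triangular L."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Posterior mean and std of a zero-mean GP surrogate at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    L = cholesky(K)
    alpha = backward_sub(L, forward_sub(L, ys))
    k_star = [rbf(x_new, a) for a in xs]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = forward_sub(L, k_star)
    var = max(1.0 - sum(t * t for t in v), 0.0)
    return mean, math.sqrt(var)


def bayesian_optimization(objective, n_init=3, n_iter=10, kappa=2.0, seed=0):
    """Maximize objective over [0, 1]; returns (best_x, best_y, history)."""
    rng = random.Random(seed)
    # Step 1: initial random evaluations to build the first surrogate.
    xs = [rng.random() for _ in range(n_init)]
    ys = [objective(x) for x in xs]
    grid = [i / 100 for i in range(101)]
    for _ in range(n_iter):
        # Center the observations so the zero-mean prior is reasonable.
        offset = sum(ys) / len(ys)
        centered = [y - offset for y in ys]

        def ucb(x):
            mean, std = gp_posterior(xs, centered, x)
            return mean + kappa * std

        # Step 2: the acquisition function picks the next hyperparameter.
        x_next = max(grid, key=ucb)
        # Step 3: evaluate the objective; the surrogate is refit next iteration.
        xs.append(x_next)
        ys.append(objective(x_next))
    best = max(range(len(ys)), key=ys.__getitem__)
    return xs[best], ys[best], list(zip(xs, ys))


def toy_objective(x):
    # Hypothetical stand-in for "train with hyperparameter x, return accuracy".
    return 1.0 - (x - 0.3) ** 2
```

Calling bayesian_optimization(toy_objective) returns the best hyperparameter found, its score, and the full trial history. In practice the surrogate would cover several hyperparameters at once and each objective call would be a full training run, which is why the number of evaluations is kept small.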

Understanding Self-Distillation

Self-distillation (SD) is a technique that enhances model performance by transferring knowledge from a teacher model to a student model, typically one with the same architecture. Recently, the effectiveness of SD has been theoretically explained by the “multi-view” hypothesis introduced by Allen-Zhu and Li, who showed that SD performs an implicit ensemble of various models.

Key Concepts of Self-Distillation

Teacher and Student Models: In self-distillation, the teacher is a model that has already been trained. The student is a model with the same architecture that learns from the teacher. The idea is that the student can exceed the teacher’s performance by learning from its knowledge.
Knowledge Transfer: The student model learns to replicate the behavior of the teacher model. By training the student to mimic the teacher’s output distributions, the student can pick up patterns that are not obvious from the training labels alone.

How It Works

The teacher model is first trained following the standard training procedure.
The student model is then trained not only to achieve good performance on the training data but also to mimic the output of the teacher model. This dual objective helps the student model capture the knowledge that the teacher model has learned.
Multiple rounds of SD can further improve performance: the trained student becomes the new teacher in the following round.
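
The dual objective in the second step can be written as a weighted sum of two terms. The sketch below uses a common formulation, cross-entropy on the ground-truth label plus a temperature-scaled KL divergence from the teacher's output distribution to the student's, for a single example. The weight alpha, the temperature, and the function names are illustrative assumptions; this post does not specify the exact loss used in BOSS.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_total for z in scaled]


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """(1 - alpha) * cross-entropy + alpha * T^2 * KL(teacher || student)."""
    # Hard-label term: fit the training data.
    hard = -log_softmax(student_logits)[label]
    # Soft-label term: match the teacher's softened output distribution.
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    soft = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    # T^2 keeps the soft term's gradient scale comparable across temperatures.
    return (1.0 - alpha) * hard + alpha * temperature ** 2 * soft
```

When the student's logits equal the teacher's, the soft term is zero and only the cross-entropy term remains. For multiple rounds, the student trained with this loss would simply be passed in as the teacher for the next round.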

BOSS Framework

The BOSS framework combines the hyperparameter optimization capabilities of Bayesian Optimization (BO) with the knowledge retention benefits of Self-Distillation (SD). This is not a simple stacking of the two methods but an integration designed to address specific challenges that arise during model training.

How It Works

SD requires a trained teacher network to guide the student network. However, no trained network exists at the beginning, which poses a cold start problem. To address this issue, BOSS introduces a warm-up phase that is similar to the regular BO process. The BOSS framework therefore operates in two distinct phases:

Phase 1: Warm-up

Objective: This phase aims to address the cold start problem by providing a set of initial models that will be utilized in the subsequent optimization phase.
Process: Models are trained with the standard Bayesian Optimization process to explore and evaluate different hyperparameter configurations.

Phase 2: BOSS Training

Objective: In this phase, BOSS leverages both the hyperparameter configurations suggested by BO and the models trained in previous trials to aggregate knowledge from various models.
Process: BOSS randomly selects from the high-performing models and uses them to initialize the teacher and student networks. The student model is trained not only to perform well on the training data but also to emulate the outputs of its teacher model, effectively transferring the distilled knowledge.

Strategic Initialization from Prior Knowledge

A critical aspect of the BOSS framework’s effectiveness is its strategic approach to initializing both teacher and student models:

Dual Initialization: Both teacher and student models are initialized from the high-performing models of past trials to fully leverage the knowledge from various models.
Diverse Origins: To maximize the exploitation of this prior knowledge, teacher and student models are initialized from different training trials. This diversity allows the student model to benefit from a broader spectrum of learned behaviors and insights, potentially leading to improved performance.
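
The two phases and the initialization rule above can be sketched as a single loop. This is a schematic illustration, not the released BOSS code: run_boss, Trial, top_k, and the dummy suggest and train functions are hypothetical names introduced here, the suggestion step is plain random sampling standing in for BO, and the training step returns a made-up score so the loop can run.

```python
import math
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trial:
    trial_id: int
    config: dict
    score: float
    teacher_id: Optional[int] = None       # trial that initialized the teacher
    student_init_id: Optional[int] = None  # trial that initialized the student


def run_boss(suggest, train, n_warmup=4, n_boss=6, top_k=3, seed=0):
    """Warm-up trials without SD, then trials that reuse two past models."""
    if n_warmup < 2 or top_k < 2:
        raise ValueError("need at least two past trials to pick teacher and student")
    rng = random.Random(seed)
    trials = []
    for trial_id in range(n_warmup + n_boss):
        config = suggest(trials)
        teacher = student_init = None
        if trial_id >= n_warmup:
            # Phase 2: draw two *different* high-performing past trials.
            top = sorted(trials, key=lambda t: t.score, reverse=True)[:top_k]
            teacher, student_init = rng.sample(top, 2)
        # Phase 1 trains from scratch (both arguments are None).
        score = train(config, teacher, student_init)
        trials.append(Trial(
            trial_id, config, score,
            teacher.trial_id if teacher is not None else None,
            student_init.trial_id if student_init is not None else None,
        ))
    return trials


_suggest_rng = random.Random(1)


def dummy_suggest(trials):
    # Stand-in for BO; a real run would query the surrogate built from trials.
    return {"lr": 10 ** _suggest_rng.uniform(-4, -1)}


def dummy_train(config, teacher, student_init):
    # Stand-in for training with SD; the score is made up for illustration.
    return -abs(math.log10(config["lr"]) + 2.5)
```

Calling run_boss(dummy_suggest, dummy_train) yields a list of trials in which the warm-up trials have no teacher, and every later trial records two distinct earlier trials as the sources of its teacher and student weights.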

Experiments

To validate the effectiveness of the BOSS framework, a series of experiments were conducted across various tasks, including general image classification, learning with noisy labels, semi-supervised learning, and medical image analysis. These experiments were designed to test the framework’s performance against traditional Bayesian Optimization (BO), Self-Distillation (SD), Random search (Random), and standard training methods.

General Image Classification

The BOSS framework achieves considerably higher accuracy than the other methods across all tested datasets. While random search succeeds in improving on the baseline, BO further boosts performance by adaptively suggesting promising configurations. As expected, SD also improves on the baseline. However, the effectiveness of SD and BO varies across datasets. BOSS, on the other hand, consistently improves performance by a large margin, leveraging the advantages of both methods.

Top-1 accuracy (%) on CIFAR10/100 and Tiny-ImageNet with VGG-16. The reported results are the average and the 95% confidence interval over 5 repetitions.

Learning with Noisy Labels

BOSS demonstrated superior resilience to label noise compared to the other methods. Its ability to transfer clean and robust features from teacher to student models helped mitigate the impact of erroneous labels, resulting in higher accuracy. This is particularly important in our context because the ground truth of medical data is often noisy.

Comparison with Different Noise Rates on CIFAR-100

Semi-Supervised Learning

In scenarios with limited labeled data, BOSS leveraged both labeled and unlabeled data more effectively than other methods. It is worth noting that the baseline is one of the state-of-the-art SSL methods, trained with carefully tuned hyperparameters, advanced regularization techniques, and long training iterations. By distilling knowledge across iterations, BOSS could use unlabeled data to enhance the learning process, showing significant improvements. This aspect is crucial because annotating medical data is expensive and a large amount of unlabeled data is available.

How Automated Hyperparameter Optimization Improved Our Models at Lunit

At Lunit, we have a deep learning training platform called INtelligent CLoud (INCL). The BOSS framework is integrated into the INCL platform along with other improvements to our existing hyperparameter optimization (HPO) algorithm. This setup makes automated HPO straightforward to run and has substantially improved how we develop AI models.

Previously, each product at Lunit required extensive manual tuning, which was time-consuming and often yielded suboptimal results. By automating this process through our HPO system, we have not only freed up valuable research time but also consistently achieved better performance.

The automatic HPO on INCL has been thoroughly validated across a wide range of internal products. These include diagnostic imaging such as chest x-rays and mammography as well as tissue segmentation and cell detection from histology images. Each of these areas presents unique challenges. However, our HPO consistently improved performance across all tasks.

AUC Scores for Chest X-Ray: HPO demonstrates superior or comparable performance across various datasets

In addition to improving performance on validation datasets, our automatic HPO has also demonstrated enhanced effectiveness on test sets. This indicates that the models are not merely overfitting to the validation data but genuinely learning to generalize better from the training data. Improved test set performance indicates practical utility of our models in real-world scenarios, ensuring that our products remain reliable and effective when deployed in clinical settings.

AUC Scores for 3D mammography: Automatic HPO boosts performance on test sets, ensuring reliable and effective real-world application in clinical settings

mF1 and mIoU Scores for Cell Detection and Tissue Segmentation from Histology Images

Conclusion

In this blog post, we presented the BOSS framework, which combines Bayesian Optimization and Self-Distillation, allowing for more efficient hyperparameter tuning and enhanced knowledge retention. Our testing across various tasks has shown that BOSS achieves significant performance improvements, consistently outperforming standard BO or SD used alone. The BOSS framework is integrated into our INtelligent CLoud (INCL) platform, facilitating easy-to-use automated hyperparameter optimization. This integration reduces the time and expertise required for manual tuning while achieving superior model performance. We have also made our code available on GitHub!

There are still many areas where we could improve the deep learning process at Lunit. If you’re passionate about building and optimizing deep learning systems and want to work on cutting-edge technology that’s making a big impact in the industry, consider joining Lunit’s team!

Acknowledgements

Many thanks to Heon Song, Hyeonsoo Lee, Gi-hyeon Lee, Suyeong Park and Donggeun Yoo for making the research possible.

Written by HyunJae Lee