Unveiling the Power of Semi-Supervised Learning: The Unified Semi-Supervised Learning Benchmark

Published in PyTorch, Jul 6, 2023

Machine Learning models thrive on high-quality, fully-annotated data. The traditional supervised learning approach typically requires data on the scale of millions, or even billions, to train large foundational models. However, obtaining such a vast amount of labeled data is often tedious and labor-intensive. As an alternative, semi-supervised learning (SSL) aims to enhance model generalization with only a fraction of labeled data, complemented by a considerable amount of unlabeled data. This blog introduces USB — the Unified Semi-Supervised Learning Framework and Benchmark, covering multi-modalities and various SSL scenarios.

Meet USB: A New, More Academia-Friendly Benchmark Library with Diverse SSL Tasks

Researchers from Microsoft Research Asia, in conjunction with researchers from Westlake University, the Tokyo Institute of Technology, Carnegie Mellon University, and the Max Planck Institute, proposed USB: the first Unified Semi-Supervised Learning Benchmark for Computer Vision (CV), Natural Language Processing (NLP), and Audio classification tasks. In contrast to previous SSL benchmarks, such as TorchSSL, which solely concentrated on a limited number of vision tasks, USB offers a broad spectrum of SSL tasks, spanning multi-modalities and accommodating different practical situations.

In particular, USB encompasses standard semi-supervised learning tasks for vision, text, and audio classification, where both the labeled and the unlabeled data distributions are balanced. It also extends to more demanding scenarios like long-tailed/imbalanced SSL and open-set SSL (work in progress), wherein either or both the labeled and unlabeled data distributions may be skewed. USB’s code structure encourages easy expansion to more benchmark settings. Furthermore, USB is the first to utilize pretrained models to substantially reduce the training cost of SSL algorithms (from 7000 GPU hours to 900 GPU hours), thereby making SSL research more accessible to academic researchers, particularly those in smaller research groups. The paper detailing USB has been accepted by NeurIPS 2022.

Paper: https://arxiv.org/pdf/2208.07204.pdf

Github Repo: https://github.com/microsoft/Semi-supervised-learning

Motivation Behind USB

The past few decades have seen SSL flourish, with significant strides made in Consistency Regularization and Self-Training with confidence thresholding, both of which have produced promising results. On unlabeled data, the model is encouraged to make consistent predictions for the same input under different perturbations, and a confidence thresholding mechanism is usually employed to select which unlabeled samples contribute to training.

For instance, FixMatch [1] employs Augmentation Anchoring and Fixed Thresholding to enhance the model’s generalization to strongly augmented data and to reduce noisy pseudo-labels. During training, FixMatch filters out unlabeled data whose prediction confidence falls below a user-provided or predefined threshold. FlexMatch [3] and FreeMatch [4] introduce class-wise adaptive thresholding for more flexible and efficient use of unlabeled data. SoftMatch [4] approaches the use of unlabeled data from a re-weighting perspective. These methods are typically implemented in TorchSSL [5], the PyTorch-based SSL codebase proposed with FlexMatch [3].
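To make the fixed-thresholding idea concrete, here is a minimal, framework-free sketch of a FixMatch-style unlabeled loss. It is not USB's implementation: the function names are hypothetical, plain Python lists stand in for tensors, and the 0.95 threshold is only an illustrative value. The pseudo-label comes from the weakly augmented view, and the cross-entropy is computed on the strongly augmented view only when the weak-view confidence reaches the threshold.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]


def fixmatch_unlabeled_loss(weak_logits, strong_logits, threshold=0.95):
    """Masked cross-entropy between pseudo-labels and strong-view predictions.

    weak_logits / strong_logits: one list of class logits per unlabeled sample.
    Returns (mean loss over all samples, per-sample keep mask).
    """
    losses, mask = [], []
    for weak, strong in zip(weak_logits, strong_logits):
        weak_probs = [math.exp(lp) for lp in log_softmax(weak)]
        confidence = max(weak_probs)
        pseudo_label = weak_probs.index(confidence)
        keep = confidence >= threshold  # fixed confidence threshold
        mask.append(keep)
        cross_entropy = -log_softmax(strong)[pseudo_label]
        losses.append(cross_entropy if keep else 0.0)
    # Average over the whole unlabeled batch, including masked-out samples.
    return sum(losses) / len(losses), mask


if __name__ == "__main__":
    weak = [[8.0, 0.0, 0.0], [0.5, 0.4, 0.3]]    # confident vs. uncertain prediction
    strong = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    loss, mask = fixmatch_unlabeled_loss(weak, strong)
    print(mask, loss)
```

Only the first sample passes the threshold here, so the second contributes nothing to the loss. Class-wise adaptive methods would replace the single `threshold` with a per-class value.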

Despite the fast development of SSL, we noticed that most SSL papers primarily focus on computer vision (CV) classification tasks, leaving researchers in other fields like natural language processing (NLP) and audio processing wondering whether these SSL algorithms would work in their domains as well. Also, most work tends to focus on ideal data scenarios with balanced labeled and unlabeled data distributions, while, in reality, we often encounter imbalanced and out-of-domain data distributions. Another issue is that most SSL papers are published by tech giants rather than by academia. Academic labs, often constrained by limited computational resources, are unable to contribute significantly to the development of the semi-supervised learning field. Generally, current SSL benchmarks suffer from these major issues:

(1) Insufficient Diversity: Existing SSL benchmarks are predominantly limited to computer vision (CV) classification tasks (e.g., CIFAR-10/100, SVHN, STL-10, and ImageNet classification). This neglects consistent and diverse evaluation on classification tasks in NLP and audio, where insufficient labeled data is also a common issue.

(2) Impractical Evaluation: Current SSL benchmarks mostly conduct evaluations on ideal data distribution with perfectly balanced labeled and unlabeled data. However, in practical situations, the data distribution, especially the unlabeled data, can be imbalanced and out-of-domain.

(3) Time-Consuming and Unfriendly to Academia: Existing SSL benchmarks like TorchSSL are often resource-intensive and environmentally unfriendly, requiring training deep neural network models, often from scratch. For instance, evaluating FixMatch[1] with TorchSSL takes about 300 GPU days, which puts SSL-related research beyond the reach of many research labs (especially in academia or smaller research groups), impeding the progress of SSL.

Advancements Introduced by USB

So how exactly does USB solve the prevailing challenges faced by current SSL benchmarks? It does so by incorporating the following advancements:

(1) Augmented Task Diversity: USB expands the scope of tasks by introducing 5 datasets each for CV, NLP, and audio domains. This comprehensive benchmark provides a diverse and challenging platform enabling consistent evaluation of SSL algorithms across domains and tasks. The table below gives a detailed comparison of tasks and training time between USB and TorchSSL.

(2) Facilitating Practical Evaluations: USB not only includes settings with balanced data distributions, but also more challenging settings with imbalanced and open-set data distributions. These configurations can also be expanded to different modalities that USB supports.

(3) Enhanced Training Efficiency: USB has integrated the pretrained Vision Transformer into SSL, thus eliminating the need for training ResNets from scratch. It has been observed that using a pretrained model significantly reduces the number of training iterations (for instance, it reduces the number of training iterations for CV tasks from 1 million steps to just 200,000 steps) without compromising performance.

(4) Improved User-Friendliness: The research team has open-sourced a modular codebase featuring 14 SSL algorithms along with related configuration files for easy replication. To ensure users can get started quickly, USB comes with detailed documentation and tutorials. Furthermore, USB offers a pip package — semilearn — allowing users to directly use the SSL algorithm. The team aspires to include new algorithms (such as imbalanced SSL algorithms, etc.) and more challenging datasets in USB’s future iterations. The following figure showcases the algorithms and modules currently supported by USB.

Diving into USB: Usage Examples

USB provides an easy-to-use, modular codebase for training and evaluating SSL algorithms on the supported benchmarks, adopting SSL algorithms on custom datasets, and designing and implementing new SSL algorithms. Detailed tutorials are provided. In this section, we will quickly go through an example of adopting any supported SSL algorithm on custom data.

Step 0: Import semilearn

import numpy as np
from torchvision import transforms
from semilearn import get_data_loader, get_net_builder, get_algorithm, get_config, Trainer
from semilearn import split_ssl_data, BasicDataset

Step 1: Define the config file

In USB, we provide a set of config files for each algorithm and each setting (number of labels, imbalance ratio, etc.). Before training on custom data, users need to specify, in the config file or dict, the algorithm they want to use and the hyper-parameters for that algorithm.

# define configs and create config
config = {
    'algorithm': 'fixmatch',  # specify which algorithm you want to use.
    'net': 'vit_tiny_patch2_32', # specify which model you want to use.
    'use_pretrain': True, # whether or not to use pre-trained models
    'pretrain_path': 'https://github.com/microsoft/Semi-supervised-learning/releases/download/v.0.0.0/vit_tiny_patch2_32_mlp_im_1k_32.pth', # the pretrained model path we have provided

    # optimization configs
    'epoch': 100,  # set to 100
    'num_train_iter': 102400,  # set to 102400
    'num_eval_iter': 1024,   # set to 1024
    'num_log_iter': 256,    # set to 256
    'optim': 'AdamW',   # AdamW optimizer
    'lr': 5e-4,  # Learning rate
    'layer_decay': 0.5,  # Layer-wise decay learning rate  
    'batch_size': 16,  # Batch size 
    'eval_batch_size': 16,

    # dataset configs
    'dataset': 'mnist', # default dataset config, can be ignored if using custom data
    'num_labels': 40,   # number of labels in the dataset, can be ignored if already specified/split in custom data
    'num_classes': 10, # number of classes
    'img_size': 32,  # image size
    'crop_ratio': 0.875,
    'data_dir': './data',

    # algorithm specific configs
    'hard_label': True,
    'uratio': 2,
    'ulb_loss_ratio': 1.0,

    # device configs
    'gpu': 0,
    'world_size': 1,
    "num_workers": 2,
    'distributed': False,
}
config = get_config(config)
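As a quick sanity check on the optimization settings, the arithmetic below shows what the values in the config dict above imply. This is a sketch under the assumption that training iterations are split evenly across epochs and that evaluation and logging fire every `num_eval_iter` and `num_log_iter` steps. The variable names on the left are illustrative, not semilearn attributes.

```python
# Values copied from the config dict above.
epoch = 100
num_train_iter = 102400
num_eval_iter = 1024
num_log_iter = 256
batch_size = 16
uratio = 2

# Assumption: iterations are divided evenly across epochs.
iters_per_epoch = num_train_iter // epoch

# Assumption: one evaluation / log event every num_eval_iter / num_log_iter steps.
num_evaluations = num_train_iter // num_eval_iter
num_log_events = num_train_iter // num_log_iter

# The unlabeled batch is uratio times the labeled batch.
ulb_batch_size = int(batch_size * uratio)
samples_per_step = batch_size + ulb_batch_size

print(f"iterations per epoch: {iters_per_epoch}")
print(f"evaluations over training: {num_evaluations}")
print(f"log events over training: {num_log_events}")
print(f"labeled / unlabeled samples per step: {batch_size} / {ulb_batch_size}")
```

With these settings, each step consumes 16 labeled and 32 unlabeled samples, so lowering `batch_size` or `uratio` is the most direct way to reduce memory use.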

After specifying the config, we can load the algorithm (with the specified parameters):

# create model and specify algorithm
algorithm = get_algorithm(config,  get_net_builder(config.net, from_name=False), tb_log=None, logger=None)

Step 2: Load Custom Data

The next step is loading the custom data. If your custom data is already split into labeled and unlabeled sets, you can load them directly using the Dataset class provided in USB.

lb_data = np.random.randint(0, 255, size=3072 * 1000).reshape((-1, 32, 32, 3))
lb_data = np.uint8(lb_data)
lb_target = np.random.randint(0, 10, size=1000)

ulb_data = np.random.randint(0, 255, size=3072 * 5000).reshape((-1, 32, 32, 3))
ulb_data = np.uint8(ulb_data)
ulb_target = np.random.randint(0, 10, size=5000)

# weak augmentation: random flip and crop
train_transform = transforms.Compose([transforms.RandomHorizontalFlip(),
                                      transforms.RandomCrop(32, padding=int(32 * 0.125), padding_mode='reflect'),
                                      transforms.ToTensor(),
                                      transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])])

# as written this mirrors the weak pipeline; add a stronger augmentation (e.g. RandAugment) in practice
train_strong_transform = transforms.Compose([transforms.RandomHorizontalFlip(),
                                             transforms.RandomCrop(32, padding=int(32 * 0.125), padding_mode='reflect'),
                                             transforms.ToTensor(),
                                             transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])])

lb_dataset = BasicDataset(config.algorithm, lb_data, lb_target, config.num_classes, train_transform, is_ulb=False)
# the unlabeled dataset wraps the unlabeled arrays, not the labeled ones
ulb_dataset = BasicDataset(config.algorithm, ulb_data, ulb_target, config.num_classes, train_transform, is_ulb=True, strong_transform=train_strong_transform)

If you do not have pre-split data and only want to experiment on a complete academic dataset, you can use the API provided for splitting the complete data:

# replace with your own code
data = np.random.randint(0, 255, size=3072 * 1000).reshape((-1, 32, 32, 3))
data = np.uint8(data)
target = np.random.randint(0, 10, size=1000)
lb_data, lb_target, ulb_data, ulb_target = split_ssl_data(config, data, target, config.num_classes,
                                                          config.num_labels, include_lb_to_ulb=config.include_lb_to_ulb)
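For intuition about what such a split involves, the following is a minimal sketch of a class-balanced labeled/unlabeled split. It is not semilearn's `split_ssl_data`: the function `split_balanced` and its parameters are hypothetical, and it assumes the labeled budget is divided equally across classes and that `include_lb_to_ulb=True` means the labeled samples also stay in the unlabeled pool.

```python
import random
from collections import defaultdict


def split_balanced(targets, num_labels, num_classes, include_lb_to_ulb=True, seed=0):
    """Return (labeled_idx, unlabeled_idx) with an equal number of labels per class."""
    rng = random.Random(seed)
    per_class = num_labels // num_classes

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(targets):
        by_class[label].append(idx)

    labeled = []
    for c in range(num_classes):
        if len(by_class[c]) < per_class:
            raise ValueError(f"class {c} has fewer than {per_class} samples")
        labeled.extend(rng.sample(by_class[c], per_class))
    labeled.sort()

    if include_lb_to_ulb:
        # Labeled samples are reused (without their labels) in the unlabeled pool.
        unlabeled = list(range(len(targets)))
    else:
        chosen = set(labeled)
        unlabeled = [i for i in range(len(targets)) if i not in chosen]
    return labeled, unlabeled


if __name__ == "__main__":
    toy_targets = [i % 10 for i in range(1000)]  # 100 samples per class
    lb_idx, ulb_idx = split_balanced(toy_targets, num_labels=40, num_classes=10)
    print(len(lb_idx), len(ulb_idx))
```

With 40 labels and 10 classes, this picks 4 labeled samples per class; the indices can then be used to slice the data and target arrays.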

Then create the evaluation dataset:

eval_data = np.random.randint(0, 255, size=3072 * 100).reshape((-1, 32, 32, 3))
eval_data = np.uint8(eval_data)
eval_target = np.random.randint(0, 10, size=100)

eval_transform = transforms.Compose([transforms.Resize(32),
                                      transforms.ToTensor(),
                                      transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])])

# the evaluation dataset wraps the held-out evaluation arrays
eval_dataset = BasicDataset(config.algorithm, eval_data, eval_target, config.num_classes, eval_transform, is_ulb=False)

Wrap the datasets to dataloaders:

train_lb_loader = get_data_loader(config, lb_dataset, config.batch_size)
train_ulb_loader = get_data_loader(config, ulb_dataset, int(config.batch_size * config.uratio))
eval_loader = get_data_loader(config, eval_dataset, config.eval_batch_size)

Step 3: Train and Evaluate

Training and evaluation can be done in three lines of code with semilearn:

trainer = Trainer(config, algorithm)
trainer.fit(train_lb_loader, train_ulb_loader, eval_loader)
trainer.evaluate(eval_loader)

More Examples

More examples on how to use USB can be found in our repo!

Looking Forward: The Future of USB

As we look towards the future of USB, we have a clear vision in mind. We aim to expand the functionality and usability of USB. We plan to extend USB to more practical settings, tackling more complex and challenging real-world data distributions. A focus will be on dealing with imbalanced datasets and out-of-domain data. We are also looking at the wider application of pre-training within USB’s SSL algorithms. By leveraging pre-trained models, we anticipate significant improvements in performance and efficiency. Lastly, we aim to create an open and vibrant research community around USB. We plan to continually integrate state-of-the-art SSL algorithms into our codebase and encourage contributions from researchers globally. Join us on this exciting journey as we aim to revolutionize the landscape of semi-supervised learning and set new benchmarks in machine learning research.