Self-Training for Natural Language Understanding
This article explains an exciting development in Natural Language Processing: a paper presenting a Semi-Supervised Learning algorithm that significantly improves RoBERTa’s performance with Self-Training. If you prefer a video explanation of the paper, please check this out!
Transfer Learning has been extremely successful in Deep Learning. It describes initializing a Deep Neural Network with weights learned on another task. In Computer Vision, that other task is commonly supervised learning on ImageNet; in Natural Language Processing, it is commonly Self-Supervised Language Modeling on an internet-scale corpus.
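For concreteness, here is a minimal sketch of this recipe using the Hugging Face transformers library: the RoBERTa encoder starts from pre-trained language-modeling weights, while the classification head is initialized from scratch and fine-tuned on the downstream task. The model name, label count, and toy batch are illustrative choices, not something prescribed by the paper.

```python
# A minimal sketch of Transfer Learning in NLP, assuming the Hugging Face
# `transformers` library; model name, label count, and data are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Encoder weights come from self-supervised language modeling on a large corpus;
# the classification head on top is randomly initialized.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Fine-tune on the downstream supervised task (one toy step shown here).
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
batch = tokenizer(["great movie!", "terrible plot"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

model.train()
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
```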
The success of Transfer Learning has inspired Deep Learning researchers to explore more tasks to use in pre-training. A promising alternative task is Self-Training. Self-Training is the inverse of Knowledge Distillation, which was developed to compress large Deep Neural Networks.
Self-Training and Knowledge Distillation describe using one neural network to label the training data for another. Knowledge Distillation uses the larger network to label the data of the smaller network, and Self-Training uses the smaller network to label the data of the larger network. This may loop for several iterations…
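To make the shared mechanics concrete, here is a minimal sketch of the pseudo-labeling loop that both methods rely on. The teacher and student below are tiny linear models standing in for whichever network does the labeling and whichever network is trained on those labels, and all of the data is synthetic.

```python
# A minimal sketch of the pseudo-labeling loop shared by Self-Training and
# Knowledge Distillation; `teacher`, `student`, and the toy data are illustrative.
import torch
import torch.nn as nn

teacher = nn.Linear(16, 2)   # stands in for the network that produces labels
student = nn.Linear(16, 2)   # stands in for the network trained on those labels

labeled_x = torch.randn(32, 16)
labeled_y = torch.randint(0, 2, (32,))
unlabeled_x = torch.randn(128, 16)

loss_fn = nn.CrossEntropyLoss()

# 1) Train the teacher on the labeled data.
opt_t = torch.optim.SGD(teacher.parameters(), lr=0.1)
for _ in range(100):
    opt_t.zero_grad()
    loss_fn(teacher(labeled_x), labeled_y).backward()
    opt_t.step()

# 2) Use the teacher to pseudo-label the unlabeled data.
with torch.no_grad():
    pseudo_y = teacher(unlabeled_x).argmax(dim=-1)

# 3) Train the student on the labeled plus pseudo-labeled data.
x = torch.cat([labeled_x, unlabeled_x])
y = torch.cat([labeled_y, pseudo_y])
opt_s = torch.optim.SGD(student.parameters(), lr=0.1)
for _ in range(100):
    opt_s.zero_grad()
    loss_fn(student(x), y).backward()
    opt_s.step()
```

In Knowledge Distillation the teacher is the larger network and the student the smaller one; in Self-Training the roles are reversed, and steps 2 and 3 can be repeated over multiple rounds.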