A Primer on Semi-Supervised Learning — Part 1

Neeraj Varshney · Published in Analytics Vidhya · Jun 27, 2020 · 5 min read

Semi-Supervised Learning (SSL) is a Machine Learning technique where a task is learned from a small labeled dataset and a relatively larger unlabeled one. The objective of SSL is to perform better than a supervised learning technique trained on the labeled data alone. This is Part 1 of the article series on Semi-Supervised Learning; it gives a brief introduction to this important sub-domain of Machine Learning. Future parts cover SSL approaches in detail.

Photo by Franck V. on Unsplash

Outline Part 1:

  1. Distinguishing Semi-Supervised Learning from Supervised and Unsupervised Learning
  2. Why should we care about Semi-Supervised Learning?
  3. Examples of Semi-Supervised Learning Tasks
  4. Conclusions and Future Parts

Outline Future Parts:

  1. Consistency Regularization, Entropy Minimization, and Pseudo Labeling
  2. Approaches for Semi-Supervised Learning
    — Π model
    — Temporal Ensembling
    — Mean Teacher method
    — Unsupervised Data Augmentation
    — MixMatch

Part 2 is available here.

Distinguishing Semi-Supervised Learning from Supervised and Unsupervised Learning

The fraction of the training dataset that is labeled is what distinguishes these three related fields of Machine Learning.

Supervised Learning is the most popular paradigm of Machine Learning, where full supervision is available in the form of labels: a label is associated with every example in the training dataset. A machine learning model is trained on this labeled dataset and is expected to predict the label for a new example at test time. Supervised Learning primarily covers two kinds of tasks: Classification and Regression. A Classification problem asks the algorithm to predict a discrete value (a class label), while a Regression task asks it to approximate a mapping function f from input variables X to a continuous output variable y. Let’s see a few examples of Supervised Learning tasks:

Handwritten digit classification using the MNIST dataset. Each record has an image and the corresponding digit as its label. The task is to learn to predict the label (i.e., the digit) from an image.

MNIST images with labels 4, 5, 6, and 7, respectively.

Another example is sentiment classification using the IMDb reviews dataset. Each record contains a review and a corresponding label (positive or negative). Here, the task is to predict the sentiment of a given review.

House Price Prediction is a Regression task where the label (house price) is a continuous variable.
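To make the distinction between the two task types concrete, here is a minimal scikit-learn sketch: a classifier that predicts a discrete digit label and a regressor that predicts a continuous value. The digits dataset and the synthetic regression data below are illustrative stand-ins, not the exact datasets mentioned above.

```python
# Classification vs. regression: a minimal scikit-learn sketch.
from sklearn.datasets import load_digits, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Classification: predict a discrete label (a digit, 0-9) from pixel features.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("digit classification accuracy:", clf.score(X_te, y_te))

# Regression: approximate a mapping f(X) -> y where y is continuous (e.g., a price).
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_tr, y_tr)
print("regression R^2 score:", reg.score(X_te, y_te))
```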

An interested reader can refer to this post for a detailed review of Supervised Learning.

In Unsupervised Learning, no labeled data is available: the training dataset contains examples without a desired outcome or label. A Machine Learning model attempts to automatically find structure in the data by extracting useful features and analyzing them. Tasks such as Clustering, Anomaly Detection, and Association Rule Mining fall under Unsupervised Learning.

Clustering is the task of dividing the dataset into groups (clusters) such that data points within the same cluster are more similar to one another than to data points in other clusters. For instance, the data points in the figure below (Left) can be divided into 3 clusters, as shown (Right). Note that clusters can be of any shape.

A Clustering example. Three clusters can be identified from the Left Plot.
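As a concrete sketch, here is how k-means (one common clustering algorithm) recovers such groups from unlabeled points with scikit-learn. Note that k-means assumes roughly convex clusters, so the arbitrarily shaped clusters mentioned above usually call for density-based methods such as DBSCAN.

```python
# Clustering: group unlabeled points; no labels are given to the model.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # toy 2-D data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])  # cluster assignment (0, 1, or 2) for the first ten points
```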

Semi-Supervised Learning (SSL), as the name indicates, sits between the two extremes (supervised, where the entire dataset is labeled, and unsupervised, where there are no labels) in terms of the availability of labeled data. A semi-supervised learning task comes with both a labeled and an unlabeled dataset, and uses the unlabeled data to gain a better understanding of the data’s structure. Typically, SSL is performed with a small labeled dataset and a relatively larger unlabeled dataset.

The goal is to learn a predictor that predicts future test data better than the predictor learned from the labeled training data alone.

Difference between Supervised, Semi-Supervised, and Unsupervised Learning in terms of availability of labeled data (colored dots). Source: KDnuggets.
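To make this setup concrete, scikit-learn’s semi-supervised API marks unlabeled examples with the label -1. The sketch below hides roughly 90% of the digit labels and fits a simple self-training baseline (a flavor of the pseudo-labeling idea covered in Part 2); the 10% labeled fraction is an illustrative choice.

```python
# A small labeled dataset plus a larger unlabeled one, in scikit-learn.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
rng = np.random.RandomState(0)

# scikit-learn convention: unlabeled examples carry the label -1.
y_semi = y.copy()
hidden = rng.rand(len(y)) > 0.10  # hide ~90% of the labels
y_semi[hidden] = -1

# Self-training: iteratively pseudo-label confident unlabeled examples.
model = SelfTrainingClassifier(SVC(probability=True)).fit(X, y_semi)
print("accuracy on examples whose labels were hidden:",
      model.score(X[hidden], y[hidden]))
```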

Why should we care about Semi-Supervised Learning?

In many real-world applications, it is either too expensive or infeasible to collect a large labeled dataset, but a large volume of unlabeled data is available. For such scenarios, Semi-Supervised Learning is a perfect fit: SSL techniques leverage the labeled data and also derive structure from the unlabeled data to solve the overall task better.

When the labeled dataset is small, typical supervised learning algorithms are vulnerable to overfitting. SSL alleviates this issue by also learning structure from the unlabeled data during training.

Furthermore, such learning techniques relieve the burden of building huge labeled datasets to learn a task. SSL methods are a step closer to the way we humans learn.

Let’s take an example to see the efficacy of Semi-Supervised Learning visually. In the figure below, when a model is trained on only the labeled data (large black and white dots, i.e., supervised learning), the decision boundary (dashed line) does not follow the contours of the data “manifold” indicated by the additional unlabeled data (small grey dots). The objective of SSL is to utilize the unlabeled data to produce a decision boundary that better reflects the data’s underlying structure.

Decision Boundary found by various SSL approaches on the “two moons dataset”. From Paper: Realistic Evaluation of Deep Semi-Supervised Learning Algorithms.
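The same setup can be reproduced in miniature with scikit-learn: only six points of the “two moons” data are labeled, and a graph-based method (label spreading, used here as a simple stand-in for the neural approaches evaluated in the paper) lets the unlabeled points pull the decision boundary along the two moons.

```python
# "Two moons": 6 labeled points, 494 unlabeled points.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y_true = make_moons(n_samples=500, noise=0.08, random_state=0)

# Hide all labels except three per class (-1 marks unlabeled points).
y = np.full_like(y_true, -1)
for c in (0, 1):
    y[np.where(y_true == c)[0][:3]] = c

# Propagate the six labels through a k-nearest-neighbor graph.
model = LabelSpreading(kernel="knn", n_neighbors=10).fit(X, y)
print("accuracy with 6 labels + 494 unlabeled points:", model.score(X, y_true))
```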

Examples of Semi-Supervised Learning Tasks

  1. CIFAR-10 — a dataset of 32 × 32 pixel RGB images from ten classes; the task is image classification. Random images from the Tiny Images dataset are typically used to form the unlabeled dataset (see the split sketch after this list).
  2. SVHN — the Street View House Numbers dataset consists of 32 × 32 pixel RGB images of real-world house numbers, and the task is to classify the centermost digit. It is accompanied by the “SVHN-extra” dataset, which consists of 531,131 additional digit images that can be used as unlabeled data.
  3. Text Classification — e.g., the Amazon Reviews and Yelp Reviews datasets.
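Below is a sketch of how such a benchmark split is typically constructed, using the common 4,000-label CIFAR-10 setting from the paper referenced above (assuming torchvision is available; the seed and the per-class count are illustrative choices).

```python
# Build a class-balanced labeled/unlabeled split of CIFAR-10.
import numpy as np
from torchvision.datasets import CIFAR10

train = CIFAR10(root="./data", train=True, download=True)
targets = np.array(train.targets)

rng = np.random.RandomState(0)
labeled_idx = []
for c in range(10):  # 400 labeled examples per class -> 4,000 total
    cls_idx = np.where(targets == c)[0]
    labeled_idx.extend(rng.choice(cls_idx, size=400, replace=False))

labeled_idx = np.array(labeled_idx)
unlabeled_idx = np.setdiff1d(np.arange(len(targets)), labeled_idx)
print(len(labeled_idx), "labeled /", len(unlabeled_idx), "unlabeled")
```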

Conclusions and Future Parts:

Semi-Supervised Learning is an interesting approach to addressing the scarcity of labeled data in Machine Learning. By learning about the structure of the data from unlabeled examples, SSL algorithms can improve performance over purely supervised algorithms while reducing the need for labels.

Interested readers can also read my articles on Few-Shot Learning, a related field.

Part 2 of this series covers a few SSL techniques and is available here.

References:

  1. Difference Between Classification and Regression in Machine Learning.
  2. Clustering in Machine Learning.
  3. Realistic Evaluation of Deep Semi-Supervised Learning Algorithms.
