Semi-supervised learning (notes from Hung-Yi Lee's ML class)

Cecile Liu
Published in Analytics Vidhya
Feb 5, 2020

Lecture video: https://www.youtube.com/watch?v=fX_guE7JNnY&list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49&index=21

It is a good tutorial lecture on semi-supervised learning for everyone; if you understand Chinese, this lecture is a good choice.
All figures in this post come from this lecture.

The first thing to know about semi-supervised learning is the difference between 'transductive' and 'inductive' learning.
— Transductive learning: the testing set itself is used as the unlabeled data (its features are used, but never its labels).
— Inductive learning: additional data beyond the testing set is collected and used as the unlabeled data.

semi-supervised learning in generative model

How is semi-supervised learning implemented in a generative model?

  1. Initialize the model parameters, either randomly or from a pre-trained model.
  2. Step 1: compute the posterior probability of each class for every unlabeled data point.
  3. Step 2: update the prior probability and the mean of each class, using both the labeled data and these posteriors.
  4. Feed the new prior and mean back into step 1 to update the posterior probabilities, and repeat.

The whole process looks like the EM (Expectation–Maximization) algorithm, where step 1 is the 'E' step and step 2 is the 'M' step.
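Here is a minimal sketch of that EM-style loop, assuming two classes modeled as Gaussians with a shared covariance matrix; the function name, number of iterations, and keeping the covariance fixed are illustrative choices, not the lecture's exact recipe.

```python
# Sketch: semi-supervised generative model, EM-style loop (assumptions noted above).
import numpy as np
from scipy.stats import multivariate_normal

def semi_supervised_gaussian(x_lab, y_lab, x_unlab, n_iter=20):
    """x_lab: (N, D) labeled features, y_lab: (N,) labels in {0, 1},
    x_unlab: (M, D) unlabeled features."""
    classes = [0, 1]
    # Initialize prior, mean, covariance from the labeled data only.
    prior = np.array([np.mean(y_lab == c) for c in classes])
    mu = np.array([x_lab[y_lab == c].mean(axis=0) for c in classes])
    cov = np.cov(x_lab, rowvar=False) + 1e-6 * np.eye(x_lab.shape[1])

    for _ in range(n_iter):
        # Step 1 (E): posterior P(c | x) for every unlabeled point.
        lik = np.stack([prior[c] * multivariate_normal.pdf(x_unlab, mu[c], cov)
                        for c in classes], axis=1)           # (M, 2)
        post = lik / lik.sum(axis=1, keepdims=True)

        # Step 2 (M): update prior and mean with labeled counts plus
        # the "soft counts" of the unlabeled data (covariance kept fixed here).
        for c in classes:
            n_lab_c = np.sum(y_lab == c)
            soft_c = post[:, c].sum()
            prior[c] = (n_lab_c + soft_c) / (len(y_lab) + len(x_unlab))
            mu[c] = (x_lab[y_lab == c].sum(axis=0)
                     + post[:, c] @ x_unlab) / (n_lab_c + soft_c)
    return prior, mu, cov
```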

Two assumptions are often used when implementing semi-supervised learning: low-density separation and the smoothness assumption.

Low-density separation (Black and white world)

the most typical case of low-density separation in semi-supervised learning is self-training

Although self-training looks much like semi-supervised learning in a generative model, we usually use hard labels in self-training and soft labels in the generative model. Why? If we feed a soft label back into an NN model, for example [0.7, 0.3] as the new target, we are unable to update the model, because that target is exactly what the model already outputs; only a hard label (e.g. [1, 0]) provides a new training signal.
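A minimal self-training sketch with hard pseudo-labels follows, assuming a scikit-learn-style classifier with `fit`/`predict_proba`; the confidence threshold and number of rounds are illustrative choices.

```python
# Sketch: self-training with hard pseudo-labels (assumptions noted above).
import numpy as np

def self_training(model, x_lab, y_lab, x_unlab, threshold=0.9, rounds=5):
    x_lab, y_lab = x_lab.copy(), y_lab.copy()
    for _ in range(rounds):
        model.fit(x_lab, y_lab)                  # train on the current labeled set
        if len(x_unlab) == 0:
            break
        proba = model.predict_proba(x_unlab)     # soft outputs on unlabeled data
        conf = proba.max(axis=1)
        hard = proba.argmax(axis=1)              # hard pseudo-labels
        keep = conf >= threshold                 # keep only confident predictions
        if not keep.any():
            break
        # Move the confidently pseudo-labeled points into the labeled set.
        x_lab = np.vstack([x_lab, x_unlab[keep]])
        y_lab = np.concatenate([y_lab, hard[keep]])
        x_unlab = x_unlab[~keep]
    return model
```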

how unlabeled data can update the model

How do we let 'unlabeled data → labeled data' influence the model training? Modify the loss function! We calculate the entropy of the predicted output (the new label) on the unlabeled data and add it to the loss function as a regularization term. Since NN training minimizes the loss, the entropy of the new labels is minimized as well.

Why does it work? Take a look at the left part of the picture above. When doing classification, we expect the output distribution to be 'sharp'. If the probabilities of all classes are close to each other, the model still needs more training, which is bad. From this point of view, the smaller the output entropy, the better the output, and this matches the low-density separation assumption. A loss-function sketch is given below.
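The sketch below shows one way to write such a loss, assuming a PyTorch classifier whose forward pass returns logits; the weight `lambda_u` is an illustrative hyperparameter.

```python
# Sketch: cross-entropy on labeled data plus entropy regularization on unlabeled data.
import torch
import torch.nn.functional as F

def semi_supervised_loss(model, x_lab, y_lab, x_unlab, lambda_u=0.1):
    # Supervised part: usual cross-entropy on the labeled batch.
    ce = F.cross_entropy(model(x_lab), y_lab)

    # Unsupervised part: entropy of the predicted distribution on the
    # unlabeled batch. Minimizing it pushes outputs toward "sharp",
    # low-entropy predictions (the low-density separation idea).
    probs = F.softmax(model(x_unlab), dim=1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1).mean()

    return ce + lambda_u * entropy
```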

Smoothness assumption

concept of smoothness assumption

The idea of the smoothness assumption is that if two points lie in the same high-density cluster (i.e. they are connected through a high-density region), they are very likely to share the same label. In other words, if close points end up with the same label, the model (function) is smooth.

We’ll use 2 examples to explain this concept.

example 1
example 2

In example 1, if the NN only saw pictures 1, 2 and 3 in the beginning, it might think pictures 2 and 3 belong to the same label. As we collect more data (pictures 4–7), we discover that pictures 1, 2 and 4–7 lie 'in the same high-density region', so they should be the same class.

Example 2 is document classification. In the beginning, there are only documents 1, 2, 3 and 4 in the data set, where document 1 is class 1, document 4 is class 2, and documents 2 and 3 are unlabeled. It is hard for the model to classify documents 2 and 3 correctly. After we collect more documents, the correct classification becomes possible.

We can extend the idea of clusters to a graph as well.

If two nodes lie in the same connected component of the graph, there is a higher probability that they belong to the same class. The trouble is the size of the data set: with too little data, the connections between regions may be missing and the labels cannot propagate.

example of insufficient data
how to turn data points into a graph

We use a deep autoencoder to extract features from the data (both labeled and unlabeled) and cluster them, then connect neighboring points to form a graph. We can also use a Gaussian RBF similarity, e.g. exp(−γ‖x^i − x^j‖²), as the edge weight. A small construction sketch follows.
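This is one possible way to build such a graph, assuming the features have already been extracted (e.g. by a deep autoencoder); the k-nearest-neighbor rule, `k`, and `gamma` are illustrative choices.

```python
# Sketch: k-NN graph with Gaussian RBF edge weights (assumptions noted above).
import numpy as np

def build_knn_graph(features, k=5, gamma=1.0):
    """Return a symmetric (N, N) weight matrix W with Gaussian RBF weights
    on the k-nearest-neighbor edges."""
    n = len(features)
    # Pairwise squared Euclidean distances.
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    sim = np.exp(-gamma * d2)                    # Gaussian RBF similarity
    W = np.zeros((n, n))
    for i in range(n):
        # Indices of the k nearest neighbors of node i (excluding itself).
        nn = np.argsort(d2[i])[1:k + 1]
        W[i, nn] = sim[i, nn]
    return np.maximum(W, W.T)                    # make the graph undirected
```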

After the graph is constructed, how do we combine the graph with semi-supervised learning?

figure 1
figure 2
figure 3

(Figure 1) We define the smoothness of the labels over the graph: S = ½ Σ_{i,j} w_{i,j}(y^i − y^j)², summed over all pairs of data points (labeled and unlabeled). In the example in the figure the label of each node is a scalar; in the real world the labels usually form a vector or matrix.

(Figure 2) The smoothness in matrix form: S = yᵀLy, where y stacks the labels of all data points and L = D − W is the graph Laplacian (W is the weight matrix, D the diagonal degree matrix).

(Figure 3) Add the smoothness S to the loss function as a regularization term, then update the model parameters, recompute the smoothness, and repeat. A sketch is given below.
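Below is a minimal sketch of this graph-regularized loss, assuming a PyTorch classifier and the weight matrix W from the graph-construction step above (as a torch tensor); `lambda_s` is an illustrative trade-off weight.

```python
# Sketch: supervised loss plus graph Laplacian smoothness regularization.
import torch
import torch.nn.functional as F

def graph_regularized_loss(model, x_lab, y_lab, x_all, W, lambda_s=0.01):
    # Supervised cross-entropy on the labeled data.
    ce = F.cross_entropy(model(x_lab), y_lab)

    # Smoothness S = 1/2 * sum_ij w_ij * ||y_i - y_j||^2 = trace(Y^T L Y),
    # computed on the model outputs for all (labeled + unlabeled) points.
    Y = F.softmax(model(x_all), dim=1)           # (N, C) predicted labels
    D = torch.diag(W.sum(dim=1))                 # degree matrix
    L = D - W                                    # graph Laplacian
    smoothness = torch.trace(Y.T @ L @ Y)

    return ce + lambda_s * smoothness
```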
