Notes on Deep Learning Tech Talk: From supervised to unsupervised learning
This is my attempt at summarizing the main points from VietAI’s Tech Talk 2, which took place on Saturday, June 23, 2018.
First speaker: Trinh H. Trieu
Trieu majored in Computer Science, graduating last year, and is currently a resident at Google AI (formerly Google Brain). His work there focuses on Natural Language Processing (NLP), so in this talk he presented two papers that he first-authored during his time at Google, both NLP-related.
A Simple Method for Commonsense Reasoning (Trieu H. Trinh, Quoc V. Le)
- At first, Trieu introduced language models in the context of Natural Language Processing. A language model is a probabilistic model that, given an input sentence, outputs the probability of that sentence (roughly, how much sense it makes). The current state-of-the-art (SOTA) language models are built using Recurrent Neural Network (RNN) architectures.
- Example: P(I come from Vietnam) > P(Vietnam comes from me). A good language model will assign the first sentence a much higher probability than the second, even though they contain the same words.
- Language models are great for transfer learning. In the past year, a lot of low-hanging fruit in NLP research has been picked by applying transfer learning from pre-trained language models: SOTA, SOTA, SOTA everywhere.
- This paper uses a language model to tackle the problem of commonsense reasoning.
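The sentence-probability comparison above can be illustrated with a toy smoothed bigram model (a stand-in for a real RNN language model; the tiny corpus and sentences below are made up for illustration):

```python
import math
from collections import Counter

# Tiny made-up corpus standing in for the large text a real language model
# would be trained on.
corpus = ("i come from vietnam . i come from hanoi . "
          "people come from vietnam .").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = len(unigrams)

def log_prob(sentence):
    """Add-one-smoothed bigram log-probability of a sentence."""
    words = sentence.lower().split()
    return sum(
        math.log((bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab))
        for w1, w2 in zip(words, words[1:])
    )

# The natural word order gets the higher score:
print(log_prob("i come from vietnam") > log_prob("vietnam come from i"))  # -> True
```

A real language model replaces the bigram counts with an RNN, but the interface is the same: sentence in, (log-)probability out.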
Trieu then introduced the Winograd Schema Challenge. The challenge consists of many questions, all in a single format:
The trophy would not fit in the brown suitcase because it was too big. What was too big?
Answer 0: the trophy
Answer 1: the suitcase
- The random baseline is 50%, and the previous SOTA, which used pre-trained Word2Vec and a supervised deep neural network, achieved around 53%.
- The method in this paper achieves around 64%. It’s still far from human performance, but (he thinks) it’s a step in the right direction.
The method works as follows:
- Train a good language model.
- Replace the pronoun with one of the answers. Ex.: “The trophy would not fit in the brown suitcase because the trophy was too big”.
- Use the language model to calculate the probability of the new sentence.
- Do the same for the second answer “the suitcase”.
- The answer corresponding to the sentence with the higher probability is the correct one.
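The steps above can be sketched in a few lines. Here `score` stands in for the log-probability from any trained language model (the paper uses large pre-trained RNN language models); the `toy_score` heuristic below is made up just to make the example run:

```python
import re

def winograd_answer(sentence, pronoun, candidates, score):
    """Substitute each candidate for the pronoun, score the resulting
    sentences with a language model, and return the higher-scoring one."""
    substituted = {
        # \b keeps us from replacing the "it" inside "fit"
        c: re.sub(r"\b%s\b" % re.escape(pronoun), c, sentence, count=1)
        for c in candidates
    }
    return max(substituted, key=lambda c: score(substituted[c]))

# Stand-in scorer for this one example only; a real language model goes here.
def toy_score(sent):
    return 1.0 if "the trophy was too big" in sent else 0.0

question = "The trophy would not fit in the brown suitcase because it was too big."
print(winograd_answer(question, "it", ["the trophy", "the suitcase"], toy_score))
# -> the trophy
```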
The unsupervised part of this paper is that it uses a language model trained on a large text corpus. It does not need any labeled examples from the Winograd Schema Challenge to achieve SOTA on this task.
Even though the challenge does not require this, the method in this paper is also able to detect the “special word”: if you change that word, the correct answer flips to the other candidate. Ex.: in the previous example, if “big” is replaced with “small”, the correct answer changes from “trophy” to “suitcase”.
Does this paper contribute to solving the Winograd Schema Challenge? Yes.
Is the Winograd Schema Challenge important for developing a model with “commonsense reasoning”? Yes.
Is the probabilistic model in this paper actually doing commonsense reasoning? I don’t think so.
I think that as long as a model cannot tell what a trophy is, what a suitcase is, or what a suitcase is for (putting stuff inside), it can hardly be said to do commonsense reasoning. A probabilistic approach cannot be “reasoning” if all it learns is the probability of some words appearing together.
How can we build a commonsense-reasoning model? I don’t know (yet), but a probabilistic model will at best be a small part of such a system, if we ever build one.
To reproduce the results in this paper or download his pre-trained language models, check out the TensorFlow models repository: https://github.com/tensorflow/models/tree/master/research/lm_commonsense
Learning Longer-term Dependencies in RNNs with Auxiliary Losses (Trieu H. Trinh, Andrew M. Dai, Minh-Thang Luong, Quoc V. Le)
In this paper he tackled the problem of learning long-term dependencies in some supervised tasks.
- Trieu first described the success of using language models on the Stanford Question Answering Dataset (SQuAD) and how some NLP tasks have reached super-human level.
- In addition to problems in NLP such as language understanding, two problems that Trieu thinks are also important are “long-term dependencies” and “time and computational resources”.
- This paper focuses on learning long-term dependencies in a language model with an eye on time and computational constraints.
- He uses auxiliary losses.
- Instead of having just one final loss, with all gradients flowing from the very end of a long sequence of RNN units, auxiliary losses give us multiple streams of gradient coming from different parts of the sequence.
- The task of the auxiliary losses is simply to reconstruct the input sequence. Therefore the majority of the RNN units are biased toward remembering information. Remembering is important, and the results turn out quite well.
To prevent the computation from blowing up because of those auxiliary losses and gradients, he uses truncated backpropagation through time to limit the gradient flow to only a few steps back.
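A minimal PyTorch sketch of the two ideas together (this is my illustration, not the paper’s code; all sizes and heads are made up): an auxiliary head reconstructs past inputs, and `.detach()` between segments truncates backprop so each loss only reaches a short window of steps.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative model: an LSTM with a main supervised head at the end plus an
# auxiliary head that reconstructs part of the input sequence.
rnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
classify = nn.Linear(16, 4)      # main supervised head (e.g. classification)
reconstruct = nn.Linear(16, 8)   # auxiliary head: reconstruct past inputs

x = torch.randn(2, 20, 8)        # a batch of two length-20 sequences
window = 5                       # truncated-backprop window

out_a, state = rnn(x[:, :10])                # first segment
state = tuple(s.detach() for s in state)     # cut the gradient path here
out_b, _ = rnn(x[:, 10:], state)             # second segment

# Auxiliary loss: reconstruct the last `window` inputs of the first segment.
aux = ((reconstruct(out_a[:, -window:]) - x[:, 10 - window:10]) ** 2).mean()
# Main loss at the very end; its gradients stop at the detach() above.
main = nn.functional.cross_entropy(classify(out_b[:, -1]), torch.tensor([1, 3]))

loss = main + 0.5 * aux
loss.backward()
```

Because of the `detach()`, the early RNN units receive gradients only from the auxiliary reconstruction loss, which is exactly the point the talk made about where the main task’s gradients stop.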
- Trieu then walked through a toy example to (kind of) show how these auxiliary losses and truncated backprop go hand in hand and work nicely together.
- Then he compared his results with standard baselines and SOTA models. Thanks to truncated backprop in both the main loss and the auxiliary losses, he was able to train and test on very long text sequences (up to 16K tokens).
- One thing that seemed to confuse the audience is that the gradients of the main supervised task are backpropagated only a few steps. The earlier RNN units are mostly adjusted by the gradients of the auxiliary reconstruction losses.
- Finally, he showed that this method acts as both a regularizer and an optimization aid. It’s promising to apply it to other models in Deep Learning.
- The unsupervised part of this paper is the auxiliary losses. Because the task for these outputs is simply to reconstruct the input, no labeled data is needed for them at all.
Second speaker: Hung Bui
Hung H. Bui is a senior Machine Learning researcher whose interests span many fields. He has a long track record of researching and developing high-impact Machine Learning systems. He is currently working as a Research Scientist at Google DeepMind.
In this talk he introduced the problem of Domain Adaptation and his recent DIRT-T paper on this task.
It is the task of quickly adapting a model trained on one dataset to another dataset with a different data distribution. Examples:
- Non-stationary past-to-present data: a classification model previously trained on MNIST data needs to work on SVHN data.
- Simulation to reality: DeepMind focuses on AI for games (Go, StarCraft); applying those models to the real world is the key.
- One country to another: use a large corpus of English text to build an NLP model for Vietnamese.
My note: Build a self-driving car in GTA and bring that model to a Tesla.
Relationship to Semi/Unsupervised learning:
He discussed the theoretical and mathematical connections between this problem and unsupervised learning. There are two main points:
- Leverage the Cluster Assumption by maximizing the certainty of the classifier and making sure a small neighborhood around each data point is label-consistent (via KL divergence, etc.).
- Relationship to the Covariate Shift problem: if we don’t know the distribution of the new domain’s data but we do know the likelihood ratio between the original and new domains, we can correct the bias simply by reweighting with this likelihood ratio.
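The likelihood-ratio correction can be checked numerically. In this made-up example the source domain is uniform on (0, 1) and the target density is p_t(x) = 2x, so the ratio is w(x) = 2x and E_target[f(x)] = E_source[w(x)·f(x)]:

```python
import random
random.seed(0)

# Importance weighting for covariate shift (distributions here are made up):
# estimate a target-domain expectation using only source-domain samples,
# reweighted by the density ratio p_target(x) / p_source(x) = 2x.
f = lambda x: x * x

xs = [random.random() for _ in range(200_000)]   # source samples ~ U(0, 1)
naive = sum(f(x) for x in xs) / len(xs)              # estimates E_source[f] = 1/3
weighted = sum(2 * x * f(x) for x in xs) / len(xs)   # estimates E_target[f] = 1/2

print(round(naive, 2), round(weighted, 2))
```

The naive average stays near the source-domain value 1/3, while the reweighted average recovers the target-domain value 1/2 without ever sampling from the target distribution.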
There are some even more difficult scenarios in Domain Adaptation:
- Supports of the two distributions might not be the same.
- Target classifiers might change.
- There is no supervision for the target domain. -> The focus of his paper.
A DIRT-T Approach to Unsupervised Domain Adaptation (Rui Shu, Hung H. Bui, Hirokazu Narui, Stefano Ermon)
He then introduced Domain Adversarial Training, which is an important step toward his paper.
- In addition to the classification loss (blue part), there is an adversarial loss (pink part) branching out at the end of the feature-extractor module (green part).
- The purpose of the adversarial loss is to force the encoder to learn hidden representations of the original domain and the target domain that are indistinguishable from each other. For example, the digit 1 in the MNIST dataset and the digit 1 in the SVHN dataset may look very different, but the features extracted by the encoder are the same.
Learning from a different source distribution:
- Task: Learn a classifier on SVHN data set.
- Train data: Labeled MNIST data and unlabeled SVHN data.
- Test data: Labeled SVHN data.
This way we can train a classifier for the SVHN dataset without any SVHN labels, thus unsupervised.
In the figure above of pure Domain Adversarial Training, when the input is from MNIST, gradients flow from both the classification loss and the adversarial loss; in the case of SVHN input, there is no gradient from the classifier because we have no labels. Leveraging the Cluster Assumption, the authors introduce Virtual Adversarial Domain Adaptation (VADA).
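The Cluster Assumption is typically operationalized as a conditional-entropy penalty on the unlabeled target data: confident predictions (decision boundary in low-density regions) incur a small penalty, uncertain ones a large penalty. A tiny sketch with made-up probability vectors:

```python
import math

# Conditional entropy of a single prediction; VADA-style training adds the
# average of this quantity over unlabeled target data to the loss.
def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = [0.98, 0.01, 0.01]   # boundary far from this point: low penalty
uncertain = [0.34, 0.33, 0.33]   # boundary cuts through the data: high penalty

print(entropy(confident) < entropy(uncertain))  # -> True
```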
DIRT-T: Decision boundary Iterative Refinement Training — with a Teacher
It’s pronounced “dirty”. I wonder how long it took the authors to come up with that pun.
- DIRT-T uses VADA as its initialization: the source-domain information initializes the decision boundary for the target domain.
- It then takes advantage of the Cluster Assumption to adapt this decision boundary to the target domain.
- The key is to change the decision boundary slowly and gradually; if the change is too abrupt, the decision boundary will be spoiled. Hence the “Iterative Refinement” in the name.
- The authors use natural gradients to keep the new decision boundary close to the existing solution.
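The “with a Teacher” part can be sketched as a KL penalty between the previous iterate’s predictions (the teacher) and the refined classifier’s, so the boundary only moves a little per step. The probability vectors below are made up for illustration:

```python
import math

# KL(teacher || student) on one target-domain point; DIRT-T-style refinement
# adds such a penalty so each update stays close to the previous iterate.
def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [0.60, 0.40]      # previous iterate's prediction
small_step = [0.65, 0.35]   # gradual refinement: tiny penalty
big_jump = [0.05, 0.95]     # abrupt boundary change: large penalty

print(kl(teacher, small_step))  # ~ 0.005
print(kl(teacher, big_jump))    # ~ 1.14
```

The penalty is near zero for a small refinement and large for an abrupt flip, which is what keeps the iterative updates gradual.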
The speaker then showed evaluation results on different source-target pairs such as MNIST -> MNIST-M, MNIST -> SVHN, SVHN -> MNIST, etc. The method achieved SOTA on all of these test datasets.
He also showed an empirical analysis of the method using t-SNE visualization in 2D:
If a classifier is trained on the red data only, it has no reason to work on the blue data because the distributions are very different. After training with DIRT-T, the data distributions of the source and target domains are clustered similarly.
Conclusion: Cluster Assumption — It works.
Github code: https://github.com/RuiShu/dirt-t
Questions from the audience
Most of the questions were technical; the answers can more or less be obtained by reading the papers and inspecting the published source code. That aside, there were some interesting questions:
- Why are these methods called unsupervised learning? For example, the auxiliary loss in Trieu’s second paper: the loss still needs to be computed, it’s still a mapping from x to y, so why call it unsupervised?
- Trieu answered: if not a mapping from x to y, then a mapping from x to what?
My note: I think whether something counts as unsupervised should be defined at the task level, not the model level. In the case of the Winograd challenge, the final model doesn’t use any labels from the Winograd questions, only text data for the language model, so it can be called unsupervised. In the case of DIRT-T, even though MNIST labels are still used, the task is to classify SVHN data and the model uses no SVHN labels, so it is still unsupervised.
Want to work on something exciting like this? We’re hiring for Deep Learning / AI Engineer positions at Vitalify Asia. Apply here.