Read Paper with Me: Data2Vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
Interpretation of data2vec, a model for learning across modalities
- This blog introduces a new paper on self-supervised learning from Meta AI: data2vec: A General Framework for Self-supervised Learning in Speech, Vision, and Language
- If you have a hard time understanding this blog, it’s recommended to read my blog on BYOL first. This work is built upon BYOL.
Background: Traditional machine learning relies on labeled data for training. However, annotating data is costly and laborious; in a larger sense, it is impossible to label all the data in the world. Recently, self-supervised learning (SSL) has attracted attention as a promising way to address this problem and to approximate common sense in AI systems, eventually working toward Artificial General Intelligence (AGI).
Instead of using supervised signals from labeled data, SSL exploits relationships within the data itself. However, different modalities typically need different SSL models. This paper therefore proposes a unified framework called data2vec (the name is a play on another famous algorithm, word2vec) for SSL in three modalities: images (which the original paper calls computer vision), text (sometimes referred to as language in the paper), and speech. Data2vec achieves state-of-the-art (SOTA) results in all three modalities.
Before we move on, it is important to note the differences between these three modalities: images are 2D structured data, text is discrete 1D data, and speech is a continuous 1D signal.
*Note that this work does NOT train multimodally (i.e., during training, only one of the three modalities is passed as input, never a mixture of them), but it can be helpful for multimodal learning.
Method Overview: data2vec uses one model but with two modes: the teacher mode and the student mode. At each training step, the student mode of data2vec tries to learn from the teacher mode and updates the model parameters.
Specifically, the teacher mode generates representations from a given sample (i.e., an image, a speech utterance, or a text). A masked version of the same sample is passed to the student mode. Learning happens by minimizing an objective function between the student's prediction and a target constructed from the teacher's representations.
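To make this concrete, here is a minimal sketch of one training step in PyTorch. It is my own simplification, not the authors' code: masking is done by zeroing (the real model replaces masked positions with a learned embedding), and the target here is just the teacher's final output (the actual target construction is described in the Method section below).

```python
import torch
import torch.nn.functional as F

def training_step(student, teacher, x, mask, optimizer, tau):
    """One data2vec-style training step (simplified sketch).

    x:    a batch of samples, shape (B, T, D)
    mask: binary tensor of shape (B, T, 1); 0 marks masked time steps
    tau:  EMA coefficient for the teacher update
    """
    # Teacher mode: encode the FULL sample; no gradients flow into the teacher.
    with torch.no_grad():
        target = teacher(x)                      # (B, T, D)

    # Student mode: encode a MASKED version of the same sample.
    pred = student(x * mask)                     # (B, T, D)

    # Regress the teacher's representations at the masked positions only.
    masked = (mask == 0).squeeze(-1)             # (B, T) boolean
    loss = F.smooth_l1_loss(pred[masked], target[masked])

    optimizer.zero_grad()   # the optimizer holds only the student's parameters
    loss.backward()
    optimizer.step()

    # The teacher's parameters track the student's via an exponential moving average.
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(tau).add_(p_s, alpha=1.0 - tau)

    return loss.item()
```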
Method:
1. Model Architecture
The samples are turned into representations by the data2vec model. A representation of the full sample comes from the teacher, and a representation of the masked sample comes from the student.
These are contextualized representations, meaning that they encode the particular time step as well as other information from the sample, thanks to the use of self-attention in the Transformer [6]. This is the major difference between this work and previous works, whose training targets lack such contextual information.
2. Target Construction
An exponential moving average (EMA) [7, 8], which places greater weight on more recent values, is used to update the teacher's parameters from the student's. The coefficient τ increases linearly over training, so the teacher is updated more aggressively at the beginning of training, when the student changes quickly.
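Written out, with θ denoting the student's parameters and Δ the teacher's, the EMA update is:

```latex
\Delta \leftarrow \tau \, \Delta + (1 - \tau) \, \theta
```

τ is increased linearly from a starting value to a target value over the first part of training and then held constant; a small τ early on means the teacher closely follows the rapidly changing student.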
The target is then constructed from the top K blocks of the teacher's Transformer (the blocks closest to the output): their outputs at the masked time steps are normalized and averaged.
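A rough sketch of this target construction, assuming the teacher exposes its per-block outputs (the instance normalization here follows what the paper uses for speech; text and vision use a parameter-free layer normalization instead, and I am glossing over those details):

```python
import torch
import torch.nn.functional as F

def build_targets(block_outputs, k):
    """Average the normalized outputs of the top K Transformer blocks.

    block_outputs: list of tensors, one per teacher block, each of shape
                   (B, T, D), ordered from the first block to the last.
    """
    top_k = block_outputs[-k:]    # the K blocks closest to the output
    normalized = [
        # instance-normalize over the time dimension: (B, T, D) -> (B, D, T) -> back
        F.instance_norm(h.transpose(1, 2)).transpose(1, 2)
        for h in top_k
    ]
    return sum(normalized) / k    # target representations, (B, T, D)
```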
3. Objective Function
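Concretely, the paper regresses the student's prediction f_t(x) at a masked time step t onto the target y_t with a Smooth L1 (Huber-style) loss, where the hyperparameter β controls the transition from the squared to the linear regime:

```latex
\mathcal{L}\bigl(y_t, f_t(x)\bigr) =
\begin{cases}
  \tfrac{1}{2}\,\bigl(y_t - f_t(x)\bigr)^2 / \beta & \text{if } \bigl|y_t - f_t(x)\bigr| \le \beta \\[4pt]
  \bigl|y_t - f_t(x)\bigr| - \tfrac{1}{2}\,\beta & \text{otherwise}
\end{cases}
```

The squared part gives small, stable gradients near the target, while the linear part makes the loss less sensitive to outliers.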
In short, the teacher and the student have a dynamic behavior: the student's parameters are updated by optimizing the objective function in step 3, while the teacher's parameters are updated by computing the EMA from step 2. This dynamic is believed to prevent the model from collapsing into a constant representation [7]. I will write another blog to introduce [7] in the future.
Results:
The paper reports SOTA results on all three modalities.
1. Images (metric: accuracy; higher is better)
2. Speech (metric: word error rate; lower is better)
3. Text (metric: GLUE score; higher is better)
Ablation Study:
1. Top K Blocks
The paper argues that averaging the top K blocks in the teacher mode is better than using only the topmost block. (In this section the paper says "top K layers", which is inconsistent with "top K blocks" in the "Targets" section; I assume "layers" and "blocks" are used interchangeably and both refer to Transformer blocks.)
2. Target Feature Type
Beyond which blocks to use, the authors also tried constructing targets from different features within the teacher's blocks and found that using the output of the feed-forward network (FFN) works best.
Conclusion: The paper introduces a new general self-supervised learning framework and achieves SOTA performance for three modalities.
The framework involves one model with two modes: teacher and student. The teacher is given a full sample input while the student is given a masked input of the same sample. Self-supervised learning is achieved by letting the student learn from the teacher.
Personal Remarks:
- It'd be interesting to see how this method performs on less regularly structured modalities, e.g., graphs :P
- The Transformer and BYOL play important roles in the success of this method. The Transformer is a flexible architecture that is not constrained to a specific modality, so it can be applied to different modalities, and BYOL provides the core self-supervised learning mechanism of this method.
- This work serves as a key step toward unifying inputs from different modalities. As humans likely use a similar learning process to understand the visual world as they do for language [9, 10], this work is significant in bringing us closer to AGI.
References:
[1] Bao, H., Dong, L., and Wei, F. BEiT: BERT pre-training of image transformers (2021). arXiv
[2] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale (2020). arXiv
[3] Baevski, A., Zhou, Y., Mohamed, A., and Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations (2020). In Proc. of NeurIPS
[4] Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units (2016). In Proc. of ACL
[5] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding (2019). Proc. of NAACL
[6] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need (2017). In Proc. of NIPS
[7] Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., Doersch, C., Pires, B. A., Guo, Z. D., Azar, M. G., Piot, B., Kavukcuoglu, K., Munos, R., and Valko, M. Bootstrap your own latent: A new approach to self-supervised learning (2020). arXiv
[8] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. Emerging properties in self-supervised vision transformers (2021). arXiv
[9] Friston, K. and Kiebel, S. Predictive coding under the free-energy principle (2009). Philosophical Transactions of the Royal Society B: Biological Sciences
[10] Friston, K. The free-energy principle: a unified brain theory? (2010). Nature Reviews Neuroscience