Read Paper with Me: Data2Vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

Interpretation of data2vec, a model for learning across modalities

Yunchao (Lance) Liu 刘运超
6 min read · Jan 22, 2022

Background: Traditional machine learning relies on labeled data for training. However, annotating data is costly and laborious, and in a larger sense it is impossible to label all the data in the world. Recently, self-supervised learning (SSL) has attracted attention as a promising way to address this problem and to approximate common sense in AI systems, a step toward Artificial General Intelligence (AGI).

Instead of relying on supervised signals from labeled data, SSL exploits relationships within the data itself. However, different modalities have typically required different models for SSL. This paper therefore proposes a unified framework called data2vec (the name is a play on the famous word2vec algorithm) for SSL in three modalities: images (referred to as computer vision in the original paper), texts (sometimes referred to as language in the original paper), and speech. Data2vec achieves state-of-the-art (SOTA) results in all three modalities.

Before we move on, it's important to note the differences between these three modalities: images are 2D structured data, texts are discrete 1D data, and speech is continuous 1D data.

Examples of the three modalities: image, speech, and language (consistently called texts in this blog). Image from the original paper.

*This work does NOT train on multimodal inputs (i.e., during training only one of the three modalities is passed as input, not a mixture of them), but it can be helpful for multimodal learning.

Method Overview: data2vec uses a single model with two modes: a teacher mode and a student mode. At each time step, data2vec in student mode learns from the teacher mode and updates the model parameters.

In each time step, data2vec in student mode learns from its teacher mode and updates its parameters. Image by the author.

Specifically, the teacher mode generates representations from a given sample (i.e., an image, a speech recording, or a text). A masked version of the same sample is passed to the student mode. Learning happens by minimizing an objective function between the student's prediction and a target constructed using the teacher's parameters.

For a given timestep, the teacher's input is the full sample (i.e., the unchanged image, speech, or text) and the student's input is a masked version of it. The student learns from the teacher by predicting a target constructed from the top K blocks of the teacher (shown in blue). Image from the original paper.

Method:

  1. Model Architecture

The samples are turned into representations by the data2vec model. A representation of the full sample comes from the teacher, and a representation of the masked sample comes from the student.

These are contextualized representations: thanks to the self-attention in the Transformer, they encode not only a particular timestep but also other information from the sample. This is the major difference from previous works, whose targets lack this contextual information.

Samples are turned into representations by passing through the data2vec model. First, samples are embedded into tokens. In student mode, a masking method is applied to the tokens before they are passed to the Transformer; in teacher mode, the tokens are passed to the Transformer directly. Image by the author.
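To make this concrete, below is a minimal PyTorch sketch of the shared model in its two modes. The class, the random token masking, and the hyperparameter values are illustrative assumptions on my part, not the official implementation (each modality uses its own embedding and masking strategy in the paper).

```python
import torch
import torch.nn as nn

class Data2VecSketch(nn.Module):
    def __init__(self, dim=768, depth=12, heads=12, mask_prob=0.15):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            for _ in range(depth)
        ])
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.mask_prob = mask_prob

    def forward(self, tokens, student=False):
        # tokens: (batch, seq, dim) -- modality-specific embeddings
        # (image patches, word pieces, or speech frames)
        if student:
            # student mode: replace a random subset of timesteps with a learned mask embedding
            mask = torch.rand(tokens.shape[:2], device=tokens.device) < self.mask_prob
            tokens = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
        # run the Transformer and keep every block's output; the teacher's
        # top K block outputs are what the targets are built from (next section)
        outputs = []
        x = tokens
        for block in self.blocks:
            x = block(x)
            outputs.append(x)
        return outputs
```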

2. Target Construction

An exponential moving average (EMA [7, 8], which places greater weight on more recent values) is used to update the teacher's parameters. The τ here increases linearly, allowing the teacher to be updated more strongly (following the student more closely) at the beginning of training.

θ denotes the student-mode model parameters. Δ denotes the teacher-mode model parameters and is updated as an exponential moving average (EMA [7, 8]) of θ: Δ ← τΔ + (1 − τ)θ. τ increases linearly from τ₀ to a target value τₑ over the first τₙ updates and then stays constant for the remainder of training. Image from the original paper.
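Concretely, the update can be sketched in a few lines of PyTorch. This is a hedged sketch, assuming the teacher is kept as a separate parameter copy of the student; the function name and the default values of τ₀, τₑ, τₙ are illustrative placeholders, not the paper's exact settings.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, step, tau_0=0.999, tau_e=0.9999, tau_n=30000):
    # Linearly anneal tau from tau_0 to tau_e over the first tau_n updates,
    # then keep it constant (values here are placeholders, tuned per modality in the paper).
    tau = tau_e if step >= tau_n else tau_0 + (tau_e - tau_0) * step / tau_n
    for p_teacher, p_student in zip(teacher.parameters(), student.parameters()):
        # Delta <- tau * Delta + (1 - tau) * theta
        p_teacher.mul_(tau).add_(p_student, alpha=1.0 - tau)
```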

The target is then constructed using the top K blocks of the Transformer in teacher mode (i.e., the K blocks closest to the output).

L is the total number of blocks in the network. âₜˡ is obtained by normalizing aₜˡ, where aₜˡ denotes the output of block l at timestep t; the target yₜ is the average of âₜˡ over the top K blocks. Image from the original paper.
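Given the teacher's per-block outputs, target construction can be sketched as follows. This assumes the list of block outputs returned by the teacher-mode forward pass sketched earlier; the choice of normalization and the value of K vary by modality in the paper, so the parameter-free layer normalization here is only one possibility.

```python
import torch.nn.functional as F

def build_targets(teacher_outputs, k=8):
    # take the K blocks closest to the output (blocks L-K+1, ..., L)
    top_k = teacher_outputs[-k:]
    # normalize each block's output a_t^l to obtain a-hat_t^l
    normalized = [F.layer_norm(a, a.shape[-1:]) for a in top_k]
    # y_t = average of the normalized top-K block outputs
    return sum(normalized) / k
```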

3. Objective Function

yₜ is the target and fₜ(x) is the student's prediction. β controls the transition from a squared loss to an L₁ loss: when the gap |yₜ − fₜ(x)| is large, the L₁ branch is used, making the loss less sensitive to outliers. Image from the original paper.
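This is the familiar Smooth L1 (Huber-style) formulation; a minimal sketch is below, with β = 1.0 as a placeholder value. PyTorch's built-in torch.nn.functional.smooth_l1_loss implements the same piecewise form.

```python
import torch

def data2vec_loss(prediction, target, beta=1.0):
    # Smooth-L1-style loss between the student's prediction f_t(x) and the target y_t
    diff = (target - prediction).abs()
    quadratic = 0.5 * diff.pow(2) / beta   # used when |y_t - f_t(x)| <= beta
    linear = diff - 0.5 * beta             # L1 branch: less sensitive to outliers
    return torch.where(diff <= beta, quadratic, linear).mean()
```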

In short, the teacher and the student have a dynamic relationship: the student's parameters are updated by optimizing the objective function in step 3, while the teacher's parameters are updated with the EMA in step 2. This dynamic is believed to prevent the model from collapsing into a constant representation [7]. I will write another blog post introducing [7] in the future.
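Tying the pieces together, here is a hedged sketch of one training step built from the helper functions above. It omits details such as computing the loss only at masked timesteps and any modality-specific prediction head, and it is not the authors' code.

```python
import torch

def training_step(student, teacher, tokens, optimizer, step):
    # teacher sees the full sample; no gradients flow into the teacher
    with torch.no_grad():
        targets = build_targets(teacher(tokens, student=False))
    # student sees the masked sample; its top-block output serves as the prediction here
    predictions = student(tokens, student=True)[-1]
    loss = data2vec_loss(predictions, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                      # update theta (student) via the objective
    ema_update(teacher, student, step)    # update Delta (teacher) via the EMA of theta
    return loss.item()
```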

Results:

The paper reports SOTA results on all three modalities.

  1. Image (metric: accuracy; higher is better):
Image from the original paper.

2. Speech (metric: word error rate; lower is better)

Image from the original paper.

3. Texts (metric: GLUE score; higher is better)

Image from the original paper.

Ablation Study:

  1. Top K blocks

The paper argues that using the average of the top K blocks in teacher mode is better than using just the topmost block. (In this section the authors write "top K layers", which is inconsistent with "top K blocks" in the paper's "Targets" section; I'm assuming "layers" and "blocks" are used interchangeably and both mean the Transformer blocks.)

Using the average of the top K blocks of the teacher model is better than using only the single top block. In the results shown, lower is better for speech (word error rate), and higher is better for NLP (i.e., texts) and vision (i.e., images). The effect is more pronounced for speech and texts than for images. Image from the original paper.

2. Target Feature Type

Besides choosing how many blocks to average, the authors also tried constructing targets from different parts of each block in teacher mode and found that using the output of the feed-forward network (FFN) works best.

WER stands for word error rate. Image from the original paper.

Conclusion: The paper introduces a new general self-supervised learning framework and achieves SOTA performance for three modalities.

The framework involves one model with two modes: teacher and student. The teacher is given a full sample input while the student is given a masked input of the same sample. Self-supervised learning is achieved by letting the student learn from the teacher.

Personal Remarks:

  1. It'd be interesting to see how this method performs on irregularly structured modalities, e.g., graphs :P
  2. The Transformer and BYOL play important roles in the success of this method. The Transformer is a flexible architecture that is not constrained to a specific modality, so it can be applied to different modalities. And BYOL provides the core self-supervised learning component of this method.
  3. This work serves as a key step for unifying inputs from different modalities. As humans are likely to use a similar learning process to understand the visual world as they do for language [9, 10], this work is significant in bringing us closer to AGI.

References:

[1] Bao, H., Dong, L., and Wei, F. BEiT: BERT pre-training of image transformers (2021). arXiv

[2] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale (2020). arXiv

[3] Baevski, A., Zhou, Y., Mohamed, A., and Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations (2020). In Proc. of NeurIPS

[4] Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units (2016). In Proc. of ACL

[5] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding (2019). Proc. of NAACL

[6] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need (2017). In Proc. of NIPS

[7] Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., Doersch, C., Pires, B. A., Guo, Z. D., Azar, M. G., Piot, B., Kavukcuoglu, K., Munos, R., and Valko, M. Bootstrap your own latent: A new approach to self-supervised learning (2020). arXiv

[8] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. Emerging properties in self-supervised vision transformers (2021). arXiv

[9] Friston, K. and Kiebel, S. Predictive coding under the free-energy principle (2009). Philosophical Transactions of the Royal Society B: Biological Sciences

[10] Friston, K. The free-energy principle: a unified brain theory? (2010). Nature Reviews Neuroscience

Further reading:

BYOL (the method that data2vec is built upon)

Jimmy Chen’s blog on data2vec


Yunchao (Lance) Liu 刘运超

CS PhD candidate @VanderbiltU, interested in developing novel deep learning methods for drug discovery. Website: www.LiuYunchao.com Twitter: @YunchaoLanceLiu