Paper Sharing | RETHINKING POSITIONAL ENCODING IN LANGUAGE PRE-TRAINING

Tsung-Yi, Kao · Published in IM日記 · Jul 16, 2020

  1. Introduction
  2. Preliminary
  3. Transformer with Untied Positional Encoding
  4. Experiments
  5. Conclusion

1. Introduction

  • The original Transformer's positional encoding is an absolute positional embedding, which mixes "semantics" and "positional information" together inside the Query-Key dot product (QK^T).
  • For example, a word may act as a key queried by a position, or a position may act as a key queried by a word (this will be visible in the formulas later). Yet the correlation between semantics and positional information should be extremely weak.
  • The [CLS] symbol carries the semantics of the whole sentence and is unlike the other positions, but existing positional encodings do not treat this token separately; the authors argue this limits the model's expressiveness.

Solutions:

  • Compute the word correlation and the positional correlation separately inside self-attention and then add them together, decoupling the two.
  • Handle the positional encoding of [CLS] differently from the other positions, so the model can learn the meaning of the whole sentence more accurately.
  • The authors call the resulting model "TUPE"; it improves training efficiency and shortens pre-training time.

2. Preliminary

Attention Module

The attention mechanism was already covered in a previous paper-sharing post, so I will not go over it again here. In short, vanilla self-attention does not take the word sequence into account; the next section surveys how earlier methods inject positional information into self-attention.
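
For reference, this is the standard scaled dot-product self-attention that everything below builds on (x_i is the representation of token i, W^Q, W^K, W^V are the projection matrices, d is the hidden dimension); note that nothing in it depends on word order:

```latex
\alpha_{ij} = \frac{1}{\sqrt{d}}\,(x_i W^Q)(x_j W^K)^{\top},
\qquad
z_i = \sum_{j} \mathrm{softmax}(\alpha_i)_j \,(x_j W^V)
```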

Positional Encoding

There are two main families of positional encoding: absolute and relative.

Absolute Positional Encoding: The original Transformer (Vaswani et al., 2017) proposed to use absolute positional encoding to represent positions.

  • In such a way, the Transformer model can differentiate words coming from different positions. For example, in the first self-attention layer, we have the score sketched after this list.
  • In Vaswani et al. (2017), a hand-crafted positional encoding based on sinusoidal functions is proposed.
  • But learnable positional encoding, i.e., treating p_i as parameters, is often used in recent works.
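
Written out (following the paper's notation, with w_i the word embedding and p_i the absolute positional embedding of position i), the first-layer score is roughly:

```latex
\alpha_{ij} = \frac{1}{\sqrt{d}}\,\big((w_i + p_i) W^Q\big)\big((w_j + p_j) W^K\big)^{\top}
```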

Relative Positional Encoding: As pointed out in Shaw et al. (2018), absolute positional encoding is not effective for the model to capture relative word order.

  • Shaw et al. propose relative positional encoding as an inductive bias to help the learning of the self-attention module (the first formula sketched after this list).
  • T5 (Raffel et al., 2019) further simplifies it by eliminating a_{j-i} in the Query-Key product (the second formula below).
  • For each offset j-i, b_{j-i} is a learnable scalar shared across all layers.
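
Sketched in the paper's notation, the two formulas are roughly as follows (a_{j-i} is a learnable relative-position vector, b_{j-i} a learnable scalar):

```latex
% Shaw et al. (2018)
\alpha_{ij} = \frac{1}{\sqrt{d}}\,(x_i W^Q)\big(x_j W^K + a_{j-i}\big)^{\top}

% T5 (Raffel et al., 2019)
\alpha_{ij} = \frac{1}{\sqrt{d}}\,(x_i W^Q)(x_j W^K)^{\top} + b_{j-i}
```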

For the details of these two positional encoding schemes, please refer to the original papers.

3. Transformer with Untied Positional Encoding

a. UNTIE THE CORRELATIONS BETWEEN POSITIONS AND WORDS

In self-attention, positional information and semantics are originally added together before being fed into the model, even though these two kinds of information are completely heterogeneous.

The absolute positional embedding contributes nothing to semantics. Below is the decomposition of the attention score that uses absolute positional embeddings:
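
Written out roughly in the paper's notation (this expansion is what is referred to below as Eq. (6)):

```latex
\alpha_{ij}
  = \frac{1}{\sqrt{d}}\Big[
      (w_i W^Q)(w_j W^K)^{\top}     % word-to-word
    + (w_i W^Q)(p_j W^K)^{\top}     % word-to-position
    + (p_i W^Q)(w_j W^K)^{\top}     % position-to-word
    + (p_i W^Q)(p_j W^K)^{\top}     % position-to-position
    \Big]
```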

The expansion contains four kinds of information: word-to-word correlation, word-to-position correlation, position-to-word correlation, and position-to-position correlation.

Among these, applying the semantic projections (W^Q, W^K) to the positional encodings is unreasonable.

Our modification: use separate projection matrices for the positions and remove the two middle terms of the decomposition, as in the following formula:
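
Roughly, the untied score looks like this: the word term keeps W^Q and W^K, the position term gets its own projections U^Q and U^K, and the two middle terms of Eq. (6) are gone:

```latex
\alpha_{ij}
  = \frac{1}{\sqrt{2d}}\,(x_i W^Q)(x_j W^K)^{\top}
  + \frac{1}{\sqrt{2d}}\,(p_i U^Q)(p_j U^K)^{\top}
```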

  • Because the two middle terms of Eq. (6) are removed, the scaling term is adjusted accordingly (to √(2d)).
  • The proposed TUPE can also be combined with relative positional encoding (see the sketch after this list).
  • As that formula shows, it contains both absolute and relative positional information.
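
Roughly, the combined score is the untied score above plus the T5-style relative bias:

```latex
\alpha_{ij}
  = \frac{1}{\sqrt{2d}}\,(x_i W^Q)(x_j W^K)^{\top}
  + \frac{1}{\sqrt{2d}}\,(p_i U^Q)(p_j U^K)^{\top}
  + b_{j-i}
```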

b. UNTIE THE [CLS] SYMBOL FROM POSITIONS

Clark et al. (2019a) found that [CLS] has high entropy in the attention distribution, meaning that [CLS] carries global information about the whole sentence. If [CLS] is not handled specially in the relative-position formulation, this token becomes biased toward (focuses especially on) the first few words, which hurts model performance.

Our modification: we denote v_ij as the content-free (position-only) correlation between positions i and j.

  • Reset v_ij (the query-key product, i.e., the computed attention weight) by the equation sketched after this list,
  • where θ = {θ_1, θ_2} are learnable parameters.
  • This change can be applied to any position-only correlations.
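
Reconstructed from the description (position 1 is [CLS]), the reset is roughly:

```latex
\mathrm{reset}_{\theta}(v)_{ij} =
\begin{cases}
  v_{ij}     & i \neq 1,\ j \neq 1 \quad \text{(not related to [CLS])} \\
  \theta_{1} & i = 1               \quad \text{(attention from [CLS] to others)} \\
  \theta_{2} & i \neq 1,\ j = 1    \quad \text{(attention from others to [CLS])}
\end{cases}
```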

c. IMPLEMENTATION DETAILS AND DISCUSSIONS

  • By combining above, we obtain a new positional encoding method and call it TUPE (Transformer with Untied Positional Encoding).
  • TUPE has two versions: the first is the absolute version, called TUPE-A; the second also incorporates relative positional encoding and is called TUPE-R.

The multi-head version, parameter sharing, and efficiency:

  • TUPE can be easily extended to the multi-head version.
  • In the multi-head version, originally only p_i is shared while everything else differs across heads, but for efficiency U^Q and U^K are shared as well. Because U^Q and U^K are shared, the positional term only needs to be computed once, in the first layer (a minimal sketch follows this list).
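
A minimal PyTorch sketch of this sharing scheme (my own illustration, not the authors' fairseq code; the module name and shapes are assumptions): the position-to-position correlation is computed once from the shared U^Q, U^K and then reused as a per-head additive bias for every layer's word-to-word attention logits. The paper also applies layer normalization to p_i, which is omitted here for brevity.

```python
import math

import torch
import torch.nn as nn


class UntiedPositionBias(nn.Module):
    """Hypothetical helper computing the untied position-to-position term."""

    def __init__(self, max_len: int, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.pos_emb = nn.Embedding(max_len, d_model)  # absolute position embeddings p_i
        self.proj_q = nn.Linear(d_model, d_model)      # U^Q, shared by all layers
        self.proj_k = nn.Linear(d_model, d_model)      # U^K, shared by all layers

    def forward(self, seq_len: int) -> torch.Tensor:
        """Return a (num_heads, seq_len, seq_len) positional bias."""
        p = self.pos_emb.weight[:seq_len]                                 # (L, d)
        q = self.proj_q(p).view(seq_len, self.num_heads, self.head_dim)   # (L, H, d_h)
        k = self.proj_k(p).view(seq_len, self.num_heads, self.head_dim)   # (L, H, d_h)
        # position-to-position correlation, scaled by sqrt(2 * d_h)
        return torch.einsum("ihd,jhd->hij", q, k) / math.sqrt(2 * self.head_dim)


# Usage: compute the bias once per forward pass and add it to every layer's
# word-to-word attention logits before the softmax.
pos_bias = UntiedPositionBias(max_len=512, d_model=768, num_heads=12)(seq_len=128)
print(pos_bias.shape)  # torch.Size([12, 128, 128])
```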

Normalization and rescaling:

  • Layer normalization (Ba et al., 2016; Xiong et al., 2020) is a key component in Transformer. In TUPE, we also apply layer normalization on pi whenever it is used.

Redundancy in absolute positional encoding + relative positional encoding:

  • It may look like positional encoding is being done twice, but the two approaches capture different views.
  • Analyzing relative positional encoding through the lens of a Toeplitz matrix (Gray, 2006) shows that it captures the local dependency of words, while the p in absolute positional encoding, being constrained to a low-rank matrix, captures more complementary information (see the sketch after this list).
  • Together, the absolute and relative positional encodings form the inductive bias over position correlations.
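
For readers unfamiliar with the term: the relative bias, viewed as a matrix B with B_ij = b_{j-i}, is Toeplitz, i.e., constant along each diagonal, which is why it mainly encodes how far apart two words are:

```latex
B =
\begin{pmatrix}
  b_{0}  & b_{1}  & b_{2}  & \cdots \\
  b_{-1} & b_{0}  & b_{1}  & \cdots \\
  b_{-2} & b_{-1} & b_{0}  & \cdots \\
  \vdots & \vdots & \vdots & \ddots
\end{pmatrix}
```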

4. Experiments

The code is publicly available: all code is implemented based on fairseq (Ott et al., 2019) in PyTorch (Paszke et al., 2017) and available at https://github.com/guolinke/TUPE.

  • The experiments are basically run on BERT (Base), but the method can easily be applied to all kinds of Transformer-based models, such as RoBERTa (Liu et al., 2019) and ELECTRA (Clark et al., 2019b).
  • Model architecture and baselines: We use the BERT-Base (110M parameters) architecture for all experiments.
  • To compare with TUPE-A and TUPE-R, we set up two baselines correspondingly: BERT-A, which is the standard BERT-Base with absolute positional encoding (Devlin et al., 2018); BERT-R, which uses both absolute positional encoding and relative positional encoding (Raffel et al., 2019) (Eq. (5)).
  • We use the GLUE (General Language Understanding Evaluation) dataset (Wang et al., 2018) as the downstream tasks to evaluate the performance of the pre-trained models.

Results:

Here, TUPE-A^mid is the model trained for only 300k steps, TUPE-A^tie-cls is the TUPE model without the v_ij reset step, and BERT-A^d is BERT using a different projection method.

The proposed methods all outperform the baselines, and both kinds of untying proposed by the authors, untying semantics from positional information and untying the [CLS] token, contribute to the improvement in model performance.

Figure 4 shows the convergence speed of the models. The proposed methods train more efficiently and converge faster, reaching the same accuracy with only 30% of the original model's training steps.

5. Conclusion

  1. We propose TUPE (Transformer with Untied Positional Encoding), which improves existing methods in two ways: untying the correlations between words and positions, and untying [CLS] from sequence positions.
  2. Furthermore, with a better inductive bias over the positional information, TUPE can even outperform the baselines while only using 30% pre-training computational costs.

The code for this paper is public at: https://github.com/guolinke/TUPE.

If you like this post, you can follow my Medium or connect with me on LinkedIn.


Tsung-Yi, Kao · IM日記
Graduated from the Graduate Institute of Information Management, National Taiwan University; now working in the Intelligent Finance Division of E.SUN Bank. I share some knowledge here, and everyone is welcome to discuss with me!