Paper Sharing | RETHINKING POSITIONAL ENCODING IN LANGUAGE PRE-TRAINING

Tsung-Yi, Kao · Published in IM日記 · Jul 16, 2020

  1. Introduction
  2. Preliminary
  3. Transformer with Untied Positional Encoding
  4. Experiments
  5. Conclusion

1. Introduction

  • The original Transformer's positional encoding is an absolute positional embedding, which mixes "semantics" and "positional information" together inside the Query-Key dot product (QK^T).
  • For example, a word may act as a key queried by a position, or a position may act as a key queried by a word (this will be visible in the formulas later). Yet the correlation between semantics and positional information should be extremely weak.
  • The [CLS] symbol carries the semantics of the whole sentence and is unlike the other positions, but existing positional encodings do not treat this token separately; the authors argue this limits the model's expressiveness.

Solutions:

  • Compute the word correlation and the positional correlation separately inside self-attention and then add them together, decoupling the two.
  • Handle the positional encoding of [CLS] differently from the other positions, so the model can learn the meaning of the whole sentence more accurately.
  • The authors call the resulting model "TUPE"; it improves training efficiency and shortens pre-training time.

2. Preliminary

Attention Module

The attention mechanism was already covered in a previous paper-sharing post, so I will not go over it again here. In short, vanilla self-attention does not take the word sequence into account; the next section surveys how earlier methods inject positional information into self-attention.
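
For reference, this is the standard scaled dot-product self-attention that everything below builds on (x_i is the representation of token i, W^Q, W^K, W^V are the projection matrices, d is the hidden dimension); note that nothing in it depends on word order:

```latex
\alpha_{ij} = \frac{1}{\sqrt{d}}\,(x_i W^Q)(x_j W^K)^{\top},
\qquad
z_i = \sum_{j} \mathrm{softmax}(\alpha_i)_j \,(x_j W^V)
```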

Positional Encoding

There are two main families of positional encoding: absolute and relative.

Absolute Positional Encoding: The original Transformer (Vaswani et al., 2017) proposed to use absolute positional encoding to represent positions.

  • In such a way, the Transformer model can differentiate words coming from different positions. For example, in the first self-attention layer, we have the score sketched after this list.
  • In Vaswani et al. (2017), a hand-crafted positional encoding based on sinusoidal functions is proposed.
  • But learnable positional encoding, i.e., treating p_i as parameters, is often used in recent works.
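
Written out (following the paper's notation, with w_i the word embedding and p_i the absolute positional embedding of position i), the first-layer score is roughly:

```latex
\alpha_{ij} = \frac{1}{\sqrt{d}}\,\big((w_i + p_i) W^Q\big)\big((w_j + p_j) W^K\big)^{\top}
```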

Relative Positional Encoding: As pointed out in Shaw et al. (2018), absolute positional encoding is not effective for the model to capture relative word order.

  • Shaw et al. propose relative positional encoding as an inductive bias to help the learning of the self-attention module (the first formula sketched after this list).
  • T5 (Raffel et al., 2019) further simplifies it by eliminating a_{j-i} in the Query-Key product (the second formula below).
  • For each offset j-i, b_{j-i} is a learnable scalar shared across all layers.
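
Sketched in the paper's notation, the two formulas are roughly as follows (a_{j-i} is a learnable relative-position vector, b_{j-i} a learnable scalar):

```latex
% Shaw et al. (2018)
\alpha_{ij} = \frac{1}{\sqrt{d}}\,(x_i W^Q)\big(x_j W^K + a_{j-i}\big)^{\top}

% T5 (Raffel et al., 2019)
\alpha_{ij} = \frac{1}{\sqrt{d}}\,(x_i W^Q)(x_j W^K)^{\top} + b_{j-i}
```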

For the details of these two positional encoding schemes, please refer to the original papers.

3. Transformer with Untied Positional Encoding

a. UNTIE THE CORRELATIONS BETWEEN POSITIONS AND WORDS

In self-attention, positional information and semantics are originally added together before being fed into the model, even though these two kinds of information are completely heterogeneous.

The absolute positional embedding contributes nothing to semantics. Below is the decomposition of the attention score that uses absolute positional embeddings:
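
Written out roughly in the paper's notation (this expansion is what is referred to below as Eq. (6)):

```latex
\alpha_{ij}
  = \frac{1}{\sqrt{d}}\Big[
      (w_i W^Q)(w_j W^K)^{\top}     % word-to-word
    + (w_i W^Q)(p_j W^K)^{\top}     % word-to-position
    + (p_i W^Q)(w_j W^K)^{\top}     % position-to-word
    + (p_i W^Q)(p_j W^K)^{\top}     % position-to-position
    \Big]
```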

The expansion contains four kinds of information: word-to-word correlation, word-to-position correlation, position-to-word correlation, and position-to-position correlation.

Among these, applying the semantic projections (W^Q, W^K) to the positional encodings is unreasonable.

Our modification: use separate projection matrices for the positions and remove the two middle terms of the decomposition, as in the following formula:
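
Roughly, the untied score looks like this: the word term keeps W^Q and W^K, the position term gets its own projections U^Q and U^K, and the two middle terms of Eq. (6) are gone:

```latex
\alpha_{ij}
  = \frac{1}{\sqrt{2d}}\,(x_i W^Q)(x_j W^K)^{\top}
  + \frac{1}{\sqrt{2d}}\,(p_i U^Q)(p_j U^K)^{\top}
```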

  • Because the two middle terms of Eq. (6) are removed, the scaling term is adjusted accordingly (to √(2d)).
  • The proposed TUPE can also be combined with relative positional encoding (see the sketch after this list).
  • As that formula shows, it contains both absolute and relative positional information.
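
Roughly, the combined score is the untied score above plus the T5-style relative bias:

```latex
\alpha_{ij}
  = \frac{1}{\sqrt{2d}}\,(x_i W^Q)(x_j W^K)^{\top}
  + \frac{1}{\sqrt{2d}}\,(p_i U^Q)(p_j U^K)^{\top}
  + b_{j-i}
```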

b. UNTIE THE [CLS] SYMBOL FROM POSITIONS

Clark et al. (2019a) found that [CLS] has high entropy in the attention distribution, meaning that [CLS] carries global information about the whole sentence. If [CLS] is not handled specially in the relative-position formulation, this token becomes biased toward (focuses especially on) the first few words, which hurts model performance.

Our modification: we denote v_ij as the content-free (position-only) correlation between positions i and j.

  • Reset v_ij (the query-key product, i.e., the computed attention weight) by the equation sketched after this list,
  • where θ = {θ_1, θ_2} are learnable parameters.
  • This change can be applied to any position-only correlations.
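
Reconstructed from the description (position 1 is [CLS]), the reset is roughly:

```latex
\mathrm{reset}_{\theta}(v)_{ij} =
\begin{cases}
  v_{ij}     & i \neq 1,\ j \neq 1 \quad \text{(not related to [CLS])} \\
  \theta_{1} & i = 1               \quad \text{(attention from [CLS] to others)} \\
  \theta_{2} & i \neq 1,\ j = 1    \quad \text{(attention from others to [CLS])}
\end{cases}
```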

c. IMPLEMENTATION DETAILS AND DISCUSSIONS

  • By combining above, we obtain a new positional encoding method and call it TUPE (Transformer with Untied Positional Encoding).
  • TUPE has two versions: the first is the absolute version, called TUPE-A; the second also incorporates relative positional encoding and is called TUPE-R.

The multi-head version, parameter sharing, and efficiency:

  • TUPE can be easily extended to the multi-head version.
  • In the multi-head version, originally only p_i is shared while everything else differs across heads, but for efficiency U^Q and U^K are shared as well. Because U^Q and U^K are shared, the positional term only needs to be computed once, in the first layer (a minimal sketch follows this list).
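
A minimal PyTorch sketch of this sharing scheme (my own illustration, not the authors' fairseq code; the module name and shapes are assumptions): the position-to-position correlation is computed once from the shared U^Q, U^K and then reused as a per-head additive bias for every layer's word-to-word attention logits. The paper also applies layer normalization to p_i, which is omitted here for brevity.

```python
import math

import torch
import torch.nn as nn


class UntiedPositionBias(nn.Module):
    """Hypothetical helper computing the untied position-to-position term."""

    def __init__(self, max_len: int, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.pos_emb = nn.Embedding(max_len, d_model)  # absolute position embeddings p_i
        self.proj_q = nn.Linear(d_model, d_model)      # U^Q, shared by all layers
        self.proj_k = nn.Linear(d_model, d_model)      # U^K, shared by all layers

    def forward(self, seq_len: int) -> torch.Tensor:
        """Return a (num_heads, seq_len, seq_len) positional bias."""
        p = self.pos_emb.weight[:seq_len]                                 # (L, d)
        q = self.proj_q(p).view(seq_len, self.num_heads, self.head_dim)   # (L, H, d_h)
        k = self.proj_k(p).view(seq_len, self.num_heads, self.head_dim)   # (L, H, d_h)
        # position-to-position correlation, scaled by sqrt(2 * d_h)
        return torch.einsum("ihd,jhd->hij", q, k) / math.sqrt(2 * self.head_dim)


# Usage: compute the bias once per forward pass and add it to every layer's
# word-to-word attention logits before the softmax.
pos_bias = UntiedPositionBias(max_len=512, d_model=768, num_heads=12)(seq_len=128)
print(pos_bias.shape)  # torch.Size([12, 128, 128])
```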

Normalization and rescaling:

  • Layer normalization (Ba et al., 2016; Xiong et al., 2020) is a key component in Transformer. In TUPE, we also apply layer normalization on pi whenever it is used.

Redundancy in absolute positional encoding + relative positional encoding:

  • It may look like positional encoding is being done twice, but the two approaches capture different views.
  • Analyzing relative positional encoding through the lens of a Toeplitz matrix (Gray, 2006) shows that it captures the local dependency of words, while the p in absolute positional encoding, being constrained to a low-rank matrix, captures more complementary information (see the sketch after this list).
  • Together, the absolute and relative positional encodings form the inductive bias over position correlations.
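
For readers unfamiliar with the term: the relative bias, viewed as a matrix B with B_ij = b_{j-i}, is Toeplitz, i.e., constant along each diagonal, which is why it mainly encodes how far apart two words are:

```latex
B =
\begin{pmatrix}
  b_{0}  & b_{1}  & b_{2}  & \cdots \\
  b_{-1} & b_{0}  & b_{1}  & \cdots \\
  b_{-2} & b_{-1} & b_{0}  & \cdots \\
  \vdots & \vdots & \vdots & \ddots
\end{pmatrix}
```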

4. Experiments

The code is publicly available: all code is implemented based on fairseq (Ott et al., 2019) in PyTorch (Paszke et al., 2017) and available at https://github.com/guolinke/TUPE.

  • The experiments are basically run on BERT (Base), but the method can easily be applied to all kinds of Transformer-based models, such as RoBERTa (Liu et al., 2019) and ELECTRA (Clark et al., 2019b).
  • Model architecture and baselines: We use the BERT-Base (110M parameters) architecture for all experiments.
  • To compare with TUPE-A and TUPE-R, we set up two baselines correspondingly: BERT-A, which is the standard BERT-Base with absolute positional encoding (Devlin et al., 2018); BERT-R, which uses both absolute positional encoding and relative positional encoding (Raffel et al., 2019) (Eq. (5)).
  • We use the GLUE (General Language Understanding Evaluation) dataset (Wang et al., 2018) as the downstream tasks to evaluate the performance of the pre-trained models.

Results:

Here, TUPE-A^mid is the model trained for only 300k steps, TUPE-A^tie-cls is the TUPE model without the v_ij reset step, and BERT-A^d is BERT using a different projection method.

The proposed methods all outperform the baselines, and both kinds of untying proposed by the authors, untying semantics from positional information and untying the [CLS] token, contribute to the improvement in model performance.

Figure 4 shows the convergence speed of the models. The proposed methods train more efficiently and converge faster, reaching the same accuracy with only 30% of the original model's training steps.

5. Conclusion

  1. We propose TUPE (Transformer with Untied Positional Encoding), which improves existing methods in two ways: untying the correlations between words and positions, and untying [CLS] from sequence positions.
  2. Furthermore, with a better inductive bias over the positional information, TUPE can even outperform the baselines while only using 30% pre-training computational costs.

The code for this paper is public at: https://github.com/guolinke/TUPE.

If you like this post, you can follow my Medium or connect with me on LinkedIn.


Tsung-Yi, Kao · IM日記
Graduated from the Graduate Institute of Information Management, National Taiwan University; now working in the Intelligent Finance Division of E.SUN Bank. I share some knowledge here, and everyone is welcome to discuss with me!