Paper Sharing | Parameter-Efficient Transfer Learning for NLP

Tsung-Yi, Kao

Published in IM日記 · 3 min read · Jul 27, 2020

ICML 2019

Introduction
Adapter tuning for NLP
Experiments

1. Introduction

The authors want to use a pre-trained model such as BERT without having to retrain the entire model for every new task, so they propose a transfer learning method to address this.

There are two main transfer learning techniques in NLP: feature-based transfer and fine-tuning. The authors propose a third option -> the adapter module.

Both feature-based transfer and fine-tuning require a new set of weights for every task, whereas adapters use parameters far more efficiently.

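feature-based -> train a new set of parameters
fine-tuning -> re-adjust the parameters of the original model
adapter -> insert "adapter modules" between the layers of the original model, freeze the original model's parameters, and train only the parameters inside the modules.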

Adapter-based tuning is related to both multi-task learning and continual learning.

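Multi-task learning has to handle all tasks at the same time; adapter-based tuning does not.
Continual learning tends to forget what it learned earlier once it is re-trained on a new task; adapter-based tuning does not, because tasks do not interact: the shared parameters are frozen, and each task adds only a few task-specific parameters.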

our strategy almost matches the performance of the fully fine-tuned BERT, but uses only 3% task-specific parameters, while fine-tuning uses 100% task-specific parameters.
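To make the storage argument concrete, here is a small back-of-the-envelope sketch. The function names are made up for illustration, the 3% figure is taken from the quote above, and the choice of nine tasks is just an example value.

```python
def fine_tuning_total(n_tasks: int) -> float:
    """Stored parameters, in multiples of the base model size.

    Full fine-tuning keeps one complete copy of the model per task.
    """
    return float(n_tasks)


def adapter_total(n_tasks: int, task_fraction: float = 0.03) -> float:
    """Stored parameters, in multiples of the base model size.

    Adapter tuning keeps one shared frozen base model plus a small
    task-specific fraction (assumed 3% here) for every task.
    """
    return 1.0 + n_tasks * task_fraction


n_tasks = 9  # example value chosen for illustration
print(fine_tuning_total(n_tasks))        # 9 full copies of the model
print(round(adapter_total(n_tasks), 2))  # 1 shared copy + 9 small add-ons
```

Under these assumptions, nine tasks cost 9 model copies with fine-tuning but only about 1.27 model-sizes with adapters, and the gap widens as more tasks are added.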

2. Adapter tuning for NLP

The proposed method has three key properties:

it attains good performance
it permits training on tasks sequentially, that is, it does not require simultaneous access to all datasets -> the different datasets do not have to be processed at the same time.
it adds only a small number of additional parameters per task -> each new task adds only a few parameters.

Tuning with adapter modules involves adding a small number of new parameters to a model, which are trained on the downstream task -> adapters are new layers inserted into the original model.

The figure above shows the adapter architecture, with a short explanation in the text on its right. Adapters work as follows:

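Each Transformer layer gets two adapters: one after the projection that follows multi-head attention (MHA), and one after the two feed-forward layers.
The weights of the original model are left untouched, while the adapter layers are initialized randomly.
In standard fine-tuning, the new top layer is trained jointly with the original network. In adapter tuning, the original model's weights are frozen, so the same model can be shared across many different tasks.
Adapter modules have two main features: a small number of parameters, and a near-identity initialization (by initializing the adapters to a near-identity function, the original network is unaffected when training starts).
The skip connection inside the adapter module ensures that, when the projection layers are initialized to near zero, the module approximates the identity function -> this is the "near-identity initialization" mentioned above.
The bottleneck inside the adapter module keeps the parameter count low.
Besides the adapter parameters, the layer normalization parameters are also re-trained.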
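The bottleneck, skip connection and near-identity initialization described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function and parameter names, the ReLU nonlinearity, the sizes d = 768 and m = 64, and the initialization scale are all assumptions chosen for the example.

```python
import numpy as np


def init_adapter(d: int, m: int, std: float = 1e-3, seed: int = 0) -> dict:
    """Random near-zero weights for a bottleneck adapter (d -> m -> d).

    std is an illustrative value; any small scale gives a near-identity start.
    """
    rng = np.random.default_rng(seed)
    return {
        "W_down": rng.normal(0.0, std, size=(d, m)),  # project d -> m
        "b_down": np.zeros(m),
        "W_up": rng.normal(0.0, std, size=(m, d)),    # project m -> d
        "b_up": np.zeros(d),
    }


def adapter_forward(h: np.ndarray, p: dict) -> np.ndarray:
    """Down-project, apply a nonlinearity, up-project, add the skip connection."""
    z = np.maximum(h @ p["W_down"] + p["b_down"], 0.0)  # ReLU is an assumed choice
    return h + (z @ p["W_up"] + p["b_up"])              # skip connection


def adapter_param_count(d: int, m: int) -> int:
    """Weights and biases of both projections: 2*d*m + d + m."""
    return 2 * d * m + d + m


d, m = 768, 64  # hidden size and bottleneck size, example values
params = init_adapter(d, m)
h = np.random.default_rng(1).normal(size=(4, d))
out = adapter_forward(h, params)
print(np.max(np.abs(out - h)))    # tiny: the module starts close to identity
print(adapter_param_count(d, m))  # 99136
```

Because the projections start near zero, the bottleneck branch contributes almost nothing and the skip connection passes the input through nearly unchanged. The count of 2dm + d + m also shows why a small m keeps each adapter cheap compared with a d-by-d layer.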

We instantiate adapter-based tuning for text Transformers. These models attain state-of-the-art performance in many NLP tasks, including translation, extractive QA, and text classification.

The authors implement adapters on top of the original Transformer (Vaswani et al., 2017).
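The freezing rules from the list above can be summarised as a simple filter over parameter names. This is only a sketch: the parameter names below are hypothetical and do not come from any real BERT checkpoint. Including the task-specific classification head among the trained parameters follows the paper's description of adapter tuning.

```python
def is_trainable(name: str) -> bool:
    """Adapter tuning: train adapters, layer norm and the task head; freeze the rest."""
    return any(key in name for key in ("adapter", "layer_norm", "classifier"))


# Hypothetical parameter names for one Transformer layer plus a task head.
parameter_names = [
    "layer0.attention.query.weight",
    "layer0.attention.output_projection.weight",
    "layer0.attention_adapter.down.weight",
    "layer0.layer_norm1.scale",
    "layer0.ffn.dense1.weight",
    "layer0.ffn_adapter.up.weight",
    "layer0.layer_norm2.scale",
    "classifier.weight",
]

trainable = [n for n in parameter_names if is_trainable(n)]
frozen = [n for n in parameter_names if not is_trainable(n)]
print("trainable:", trainable)
print("frozen:", frozen)
```

Everything in the frozen list belongs to the pre-trained network and can be shared by all tasks, while the trainable list is what has to be stored once per task.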

3. Experiments

We show that adapters achieve parameter efficient transfer for text tasks.
All runs are trained on 4 Google Cloud TPUs with a batch size of 32. For each dataset and algorithm, we run a hyperparameter sweep and select the best model according to accuracy on the validation set.

For the GLUE tasks, the reported numbers are the test metrics of submissions to the official GLUE website; the other classification tasks report test-set accuracy. The baseline is a fine-tuned BERT model.

First, GLUE: the model used is BERT-LARGE.

The experiments use both a fixed adapter size (the number of units in the bottleneck) and a per-task adapter size chosen from {8, 64, 256}.

we re-run 5 times with different random seeds and select the best model on the validation set.
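The selection procedure (several adapter sizes, several seeds, keep the best run on the validation set) might look like the sketch below. The `toy_evaluate` function is a stand-in for a real training-and-validation run, and its scores are placeholders rather than results from the paper.

```python
from itertools import product

ADAPTER_SIZES = (8, 64, 256)  # per-task candidates for the GLUE experiments
N_SEEDS = 5                   # number of re-runs with different random seeds


def select_best_run(evaluate):
    """Return the (adapter_size, seed) pair with the highest validation score."""
    runs = product(ADAPTER_SIZES, range(N_SEEDS))
    return max(runs, key=lambda run: evaluate(*run))


def toy_evaluate(size: int, seed: int) -> float:
    """Placeholder for 'train with this size and seed, return validation accuracy'.

    The formula is invented so the example is deterministic; it is NOT real data.
    """
    return 0.80 + 0.01 * (size == 64) + 0.001 * seed


best_size, best_seed = select_best_run(toy_evaluate)
print(best_size, best_seed)
```

The point of the design is that model selection only ever looks at validation scores, so the test metrics reported afterwards are not used for tuning.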

The results show that adapters add and tune only a very small number of parameters, yet match or come close to full fine-tuning.

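Next come the additional classification experiments, which use publicly available text classification datasets collected from the web.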

The best adapter size is chosen from {2, 4, 8, 16, 32, 64}.

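Here, the "NO BERT" baseline uses AutoML to select the model architecture. Variable FT stands for variable fine-tuning, where only the top n layers are fine-tuned, with n in {1, 2, 3, 5, 7, 9, 11, 12}. The additional classification tasks use BERT-base, which has 12 layers, so full fine-tuning corresponds to n = 12.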
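As a small illustration of what "fine-tune only the top n layers" means for a 12-layer BERT-base, the sketch below lists which layer indices are updated for each n. The function name and the convention that index 0 is the bottom layer are assumptions made for this example.

```python
N_LAYERS = 12                              # number of Transformer layers in BERT-base
TOP_N_CHOICES = (1, 2, 3, 5, 7, 9, 11, 12)  # values of n used for Variable FT


def trainable_layers(top_n: int, n_layers: int = N_LAYERS) -> list:
    """Indices (0 = bottom layer) of the layers updated when only the top n are fine-tuned."""
    if not 1 <= top_n <= n_layers:
        raise ValueError(f"top_n must be between 1 and {n_layers}")
    return list(range(n_layers - top_n, n_layers))


for n in TOP_N_CHOICES:
    print(n, trainable_layers(n))
```

With n = 2 only layers 10 and 11 are updated, and with n = 12 every layer is updated, which is why n = 12 coincides with full fine-tuning.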

Parameter/Performance trade-off

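Fine-tuning BERT performs poorly when only a small number of parameters are tuned, whereas adapters still maintain strong performance even when very few parameters are trained.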

Analysis and Discussion

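The experiments show that removing some of the adapter layers from the model has little effect on performance, but removing all of them hurts performance substantially.
Adapters in the higher layers matter more than adapters in the lower layers. A likely reason is that lower layers capture lower-level features, while higher layers capture more task-specific features. -> In the experiments, fine-tuning only the top layers even outperforms full fine-tuning on some tasks, which supports this idea.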
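One way to organise such an ablation is to switch off the trained adapters in a contiguous span of layers and evaluate without retraining. The sketch below only enumerates the spans and builds the on/off flags. The span-based setup is an assumption about how the experiment is run, and the function names are made up.

```python
def ablate_span(n_layers: int, first: int, last: int) -> list:
    """Per-layer flags: False where adapters are removed (first..last inclusive)."""
    return [not (first <= layer <= last) for layer in range(n_layers)]


def all_spans(n_layers: int) -> list:
    """Every contiguous (first, last) span of layers, single layers included."""
    return [(first, last)
            for first in range(n_layers)
            for last in range(first, n_layers)]


n_layers = 12  # example: a 12-layer model
print(len(all_spans(n_layers)))           # 78 spans to evaluate
print(ablate_span(n_layers, 10, 11))      # adapters removed from the top two layers
print(any(ablate_span(n_layers, 0, 11)))  # False: every adapter removed
```

Comparing spans near the bottom with spans near the top is what reveals that the higher-layer adapters carry more of the effect.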

Finally, we tried a number of extensions to the adapter’s architecture that did not yield a significant boost in performance. -> For example, adding more layers per adapter or adding layer/batch normalization to the adapter.

The authors therefore recommend their original adapter architecture.

Reference:

arXiv paper: https://arxiv.org/abs/1902.00751

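That's all for today. I share machine learning notes and implementation write-ups from time to time, so follow me if you like my posts. If you have any questions, feel free to email me at solomonjoeykao@gmail.com or reach out on LinkedIn.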