TRANSFER LEARNING

Get the Fine-tuning Just Right

For NTCIR-15 FinNum-2 and DialEval-1 Tasks

IMAO
The Zeals Tech Blog

--

Disclaimer: this post won’t use the authorial or editorial “we.” The overconfident title and content are opinionated. The work described here is a team effort, but my words don’t express the views of my employer or co-authors.

On December 14, 2020, the NTCIR-15 conference concluded successfully. I led two teams working on the conference’s shared tasks, namely DialEval-1 and FinNum-2. My teams reached first place in DialEval-1 and second place in FinNum-2. Following my previous post that introduced NTCIR and DialEval, this one describes what FinNum-2 is, reports the contributions of the team efforts, and proposes a workflow that generalizes how I apply transfer learning to tasks like DialEval and FinNum.

The road so far

Since I work as an NLP-ML engineer at ZEALS, dealing with the practical concerns of transfer learning is part of my job description. Despite all the attention to Transformers and BERTology, getting those models fine-tuned just right is still nontrivial, and some aspects of fine-tuning are less studied than others. This post uses DialEval-1 and FinNum-2 as examples to show what those under-invested subjects are. (Should there be any unexplained acronym, please kindly refer to the previous post.)

FinNum is a task for fine-grained numeral understanding in online financial texts. For tweets about stock prices, FinNum-1 at NTCIR-14 tried to disambiguate the meaning of the numerals and found that insufficient for pragmatic use, so FinNum-2 at NTCIR-15 aims to identify the link between cashtags and numerals. Figure 1 illustrates an example linkage.

Figure 1. Numeral Attachment. Image Credit: FinNum-2 Overview Paper

Whilst there is more than one way to do it, the FinNum-2 organizers frame the task as a binary classification problem. Although DialEval-1 doesn’t enforce a particular scheme for its subtasks ND and DQ, all participants model them as multiclass classification and multi-label classification, respectively. Therefore, the two tasks provide an opportunity to examine one general and practical approach across various classification settings and datasets, at least that’s the way I see them.
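To make that “one general approach” concrete, here is a minimal sketch of the shared setup I have in mind: a single pretrained encoder with a thin classification head, where only the number of labels and the loss function change across the three settings. The checkpoint names, label counts, and first-token pooling below are illustrative assumptions, not the exact configurations of my runs.

```python
# A minimal sketch, assuming the HuggingFace transformers and PyTorch APIs;
# checkpoint names and label counts are illustrative, not my exact runs.
import torch.nn as nn
from transformers import AutoModel

class EncoderClassifier(nn.Module):
    def __init__(self, checkpoint: str, num_labels: int, multi_label: bool = False):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)
        # BCE with logits for multi-label (DQ-style: multi-hot float labels);
        # cross-entropy otherwise (FinNum-2 binary linkage, ND multiclass: class indices).
        self.loss_fn = nn.BCEWithLogitsLoss() if multi_label else nn.CrossEntropyLoss()

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        logits = self.head(hidden.last_hidden_state[:, 0])  # first-token pooling
        loss = self.loss_fn(logits, labels) if labels is not None else None
        return loss, logits

# The same skeleton covers all three settings (label counts are placeholders):
finnum_model = EncoderClassifier("xlm-roberta-base", num_labels=2)
nd_model = EncoderClassifier("bert-base-chinese", num_labels=4)
dq_model = EncoderClassifier("bert-base-chinese", num_labels=5, multi_label=True)
```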

Devil’s Bargain

Declaring something “general” and “practical” could be subjective, so allow me to quote the official reports for the actual outcomes from my teams, IMTKU and CYUT*. (For detailed statistics, please check the papers by the task organizers: DialEval-1, FinNum-2.)

For DialEval-1:

At STC-3, none of the participant runs are statistically significantly better than the BL-LSTM model. However, at DialEval-1, IMTKU-run2 outperforms the baselines significantly (p < 0.05) in the Chinese DQ subtask in terms of NMD.

For FinNum-2:

(CYUT) experiment on both BERT and RoBERTa. … We find that the vanilla BERT and RoBERTa achieve good performances.

(It is actually XLM-RoBERTa, and I will explain the difference soon.)

It all sounds well and good, except that I’m trading time and energy away to big bad pretrained models. If applied blindly, the generalization ability of those models can be disappointing, not to mention that “practical” sometimes implies special treatments that won’t hold for new tasks. So I figure a repeatable procedure may at least be a baseline that I can improve upon.

I’m a man of letters

Some may consider it obvious: the bigger the model, the better the performance, isn’t it? However, as the FinNum-2 organizers have pointed out,

“The difference between the performances of different teams using the same architecture may be caused by the preprocessing procedures and the hyperparameter settings.”

In my experience, there are more factors besides preprocessing and hyperparameters. The diagram below demonstrates my decision process.

The Activity Diagram of My Decision Process for Fine-Tuning

Here I’d like to emphasize the subtle differences among tokenizers. For example, XLM-RoBERTa uses unigram SentencePiece, whereas RoBERTa uses byte-pair encoding (BPE). Roughly speaking, BPE builds its vocabulary by greedily merging frequent byte pairs and then applies those merges left to right at tokenization time; for certain languages, the resultant subwords may be less than ideal. That being said, since different models predetermine their own tokenizations with variations in training datasets and training schemes, I’m not aware of any universal answer. Empirically, I find that models based on unigram SentencePiece and whole-word masking are more likely to outperform those based on WordPiece and BPE.
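If you want to see the difference for yourself, a few lines of the HuggingFace tokenizer API will do; the sample tweet and checkpoints below are just illustrative.

```python
# Compare how different subword schemes cut the same string;
# the checkpoints and the sample tweet are illustrative.
from transformers import AutoTokenizer

text = "2C how $TSLA reacts to a 2C warmer world"
for name in ["roberta-base", "xlm-roberta-base", "bert-base-uncased"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name:>20}: {tok.tokenize(text)}")
# roberta-base      -> byte-level BPE subwords
# xlm-roberta-base  -> unigram SentencePiece subwords
# bert-base-uncased -> WordPiece subwords
```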

Admittedly, the process isn’t perfect. For example, I have found an intriguing error case, “2C”:

In the test set, a tweet uses [it] to refer to the link between global warming and the stock price of Tesla. In the training and development sets, however, all occurrences of “2C” and “2c” stand for “to see.” This case indicates that both the informal usage in tweets and the domain knowledge of stocks could use some more effort.

Will polishing the remaining steps from the above diagram help resolve this issue? I don’t know yet, but I’d like to think that data augmentation is promising. As for the keywords of those steps, I will simply put a hyperlinked list together here for your reference:

Just Kidding. https://xkcd.com/518/

Carry on

One more thing may be worth mentioning: with the One-cycle Policy, I managed to fine-tune models for DialEval-1 and FinNum-2 with fewer epochs than the other teams. For example, I only needed 5 epochs for FinNum-2, while others usually required 30. Not only does it save me time, the lowered energy consumption may also contribute a little to the world. A recent controversy has reminded me that the pursuit of transfer learning comes at the expense of the environment. Honestly, I don’t know whether my trying to find more efficient ways to do transfer learning can really make a difference, but it surely makes me feel less guilty. Interestingly, some endeavors in machine learning may also contribute indirectly to the greater good. For example, going beyond Mixed Precision, training models with a 4-bit chip may liberate personal data from the cloud, and then personalized and federated models may empower more underrepresented people.
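For reference, the sketch below shows how a One-cycle learning-rate schedule and mixed precision fit into an ordinary PyTorch fine-tuning loop. Here `model` and `train_loader` are placeholders, and the learning rate and epoch count are only indicative, not a faithful reproduction of my runs.

```python
# A rough sketch of One-cycle + mixed-precision fine-tuning; `model` and
# `train_loader` are placeholders for your encoder/head and data.
import torch
from torch.cuda.amp import GradScaler, autocast

EPOCHS = 5
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=2e-5, epochs=EPOCHS, steps_per_epoch=len(train_loader)
)
scaler = GradScaler()

model.train()
for epoch in range(EPOCHS):
    for batch in train_loader:
        optimizer.zero_grad()
        with autocast():  # mixed-precision forward pass and loss
            loss, _ = model(**batch)  # assuming a head that returns (loss, logits)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()  # OneCycleLR steps once per batch, not per epoch
```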

* Conventionally, almost all NTCIR participants name their teams after colleges regardless of industry involvement, even though I’m the first author on behalf of my company.
