Transfer Learning In NLP — Part 2

The new tricks

Pratik Bhavsar | @nlpguy_
Modern NLP
5 min read · May 15, 2020


Photo by David Boca on Unsplash

This is a follow-up to my first post, Transfer Learning In NLP. If you are just getting started in NLP, have a look at it first.

A lot has happened since October 2018, when BERT was released.

Do you know BERT’s masked language modelling is old school?

Do you know attention doesn’t need to be quadratic in time?

Do you know you can steal Google’s model?

Some of the smartest people of our generation have been working intensely and churning out a lot! NLP is probably the sexiest field to be in right now 😋

NLProc has come a long way.

We need another summary!!!

These 20 questions will test how up to date you are with the current state of NLP and prepare you for any technical discussion.

After all, you are only as good as the amount of jargon you know 😆

What is the current state of pre-trained models (PTMs)?

https://arxiv.org/pdf/2003.08271.pdf

Which tasks are used for training PTMs?

https://arxiv.org/pdf/2003.08271.pdf

What is the current state of PTMs on GLUE?

https://arxiv.org/pdf/2003.08271.pdf

Does more data always lead to a better language model?

The T5 paper says no. Quality matters more than quantity.

https://arxiv.org/pdf/1910.10683.pdf

What tokenisation method seems best for training language models?

This paper says that a newer method, Unigram LM, works better than BPE and WordPiece.
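If you want to try it, the SentencePiece library implements the Unigram LM algorithm directly. A minimal sketch, assuming a plain-text training file corpus.txt (a placeholder name):

```python
import sentencepiece as spm

# Train a Unigram LM tokenizer; "corpus.txt" is a placeholder path to raw text.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="unigram_lm",
    vocab_size=8000,
    model_type="unigram",   # swap for "bpe" to compare against byte-pair encoding
)

# Load the trained model and tokenise a sentence into subword pieces.
sp = spm.SentencePieceProcessor(model_file="unigram_lm.model")
print(sp.encode("Transfer learning has changed NLP.", out_type=str))
```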

Which task is best for training a language model?

The current best approach is ELECTRA's replaced token detection: a small generator corrupts some input tokens, and a discriminator then predicts which tokens were replaced.

https://arxiv.org/pdf/2003.10555.pdf
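You can poke at the discriminator side of this yourself with the public ELECTRA checkpoints in the transformers library. A minimal sketch (during pre-training the replaced tokens come from a small, jointly trained generator, which this snippet skips; here I simply hand-pick an odd word):

```python
import torch
from transformers import ElectraTokenizerFast, ElectraForPreTraining

tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")
model = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")

# "cooked" stands in for a token a generator might have substituted for "read".
sentence = "The chef cooked the book and enjoyed it."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # one score per token; > 0 leans "replaced"

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, score in zip(tokens, logits[0]):
    print(f"{token:>10s}  replaced={score.item() > 0}")
```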

The T5 paper also finds that corrupting spans with a mean length of 3 works well.

https://arxiv.org/pdf/1910.10683.pdf

Is gradual unfreezing needed when fine-tuning a transformer on a task?

The T5 paper says no; fine-tuning all parameters at once worked as well or better.

https://arxiv.org/pdf/1910.10683.pdf
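For context, gradual unfreezing means thawing the top layers first and working downwards as training progresses. A minimal sketch of what that looks like with a Hugging Face BERT encoder (the schedule itself is illustrative, not from the paper):

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Start with the whole encoder frozen; only a task head you add on top would train.
for param in model.parameters():
    param.requires_grad = False

def unfreeze_top_layers(model, n):
    """Unfreeze the top n encoder layers (call with a growing n as epochs pass)."""
    for layer in model.encoder.layer[-n:]:
        for param in layer.parameters():
            param.requires_grad = True

unfreeze_top_layers(model, 1)   # epoch 1: only the last layer trains
unfreeze_top_layers(model, 4)   # later: the last four layers train
```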

What would you change to get a better language model if you have a fixed training budget?

The T5 paper suggests increasing both the model size and the number of training steps.

https://arxiv.org/pdf/1910.10683.pdf

Which model would you use if your sequence is longer than 512 tokens?

Transformer-XL or Longformer

How does the processing time of a transformer scale with sequence length?

Quadratically.

How can we bring down the processing time of transformers for long documents, given that self-attention is quadratic in sequence length?

Longformer uses an attention mechanism that scales linearly with sequence length.

https://arxiv.org/pdf/2004.05150.pdf

Longformer can be really good for encoding long documents for semantic search. The table below summarises the work done so far.

https://arxiv.org/pdf/2004.05150.pdf
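Here is a minimal sketch of encoding a long document with Longformer via the transformers library, e.g. as a building block for such semantic search setups. The placeholder text and the choice of giving only the first token global attention are illustrative:

```python
import torch
from transformers import LongformerTokenizerFast, LongformerModel

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

long_text = "A very long document about transfer learning in NLP. " * 300  # placeholder
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=4096)

# 1 = global attention, 0 = local sliding-window attention (linear in length).
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1   # give the first (<s>) token a global view

with torch.no_grad():
    outputs = model(**inputs, global_attention_mask=global_attention_mask)

doc_embedding = outputs.last_hidden_state[:, 0]   # a simple document vector
print(doc_embedding.shape)
```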

Does BERT perform great because of its attention layer?

The paper Attention is not Explanation argues that attention weights do not correlate well with other measures of feature importance, so we cannot conclude that the model performs better because of its attention mechanism.

Will the performance of BERT fall drastically if we turn off a head?

No, according to the paper Revealing the Dark Secrets of BERT.

Will the performance of BERT fall drastically if we turn off a layer?

No, according to the paper Revealing the Dark Secrets of BERT.

Will the performance of BERT fall drastically if we randomly initialise it?

No, according to the paper Revealing the Dark Secrets of BERT.
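If you want to try this yourself, transformers lets you switch off individual attention heads at inference time with the head_mask argument, or remove them for good with prune_heads. A minimal sketch (measuring the effect on a downstream task is left out):

```python
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Turning off one head rarely hurts.", return_tensors="pt")

# head_mask has shape (num_layers, num_heads); a 0 disables that head.
head_mask = torch.ones(model.config.num_hidden_layers, model.config.num_attention_heads)
head_mask[11, 0] = 0   # switch off head 0 in the last layer

with torch.no_grad():
    outputs = model(**inputs, head_mask=head_mask)

# Or permanently remove the same head from the model:
model.prune_heads({11: [0]})
```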

Do we really need model compression?

Maybe not! Notes from this amazing article.

“Model compression techniques give us a hint about how to train appropriately-parameterized models by elucidating the types of solutions over-parameterized models tend to converge to. There are many types of model compression, and each one exploits a different type of “simplicity” that tends to be found in trained neural networks:”

  • Many weights are close to zero (Pruning)
  • Weight matrices are low rank (Weight Factorization)
  • Weights can be represented with only a few bits (Quantization)
  • Layers typically learn similar functions (Weight Sharing)
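For a taste of what the last two items look like in practice, here is a minimal PyTorch sketch of pruning and dynamic quantization applied to a BERT encoder (illustrative only; real size and speed gains depend on the model and hardware backend):

```python
import torch
import torch.nn.utils.prune as prune
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Quantization: store the weights of all Linear layers as 8-bit integers.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Pruning: zero out the 30% smallest-magnitude weights of one attention projection.
layer = model.encoder.layer[0].attention.self.query
prune.l1_unstructured(layer, name="weight", amount=0.3)
print(float((layer.weight == 0).float().mean()))  # roughly 0.3 of the weights are zero
```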

Can we steal a model if it is exposed as an API?

Yes, we can → explained in this mind-blowing post.

What is the current state of distillation?

https://arxiv.org/pdf/2003.08271.pdf

Does a bigger or a smaller transformer model train faster?

The bigger model, as per this article.

What are the applications of the student-teacher framework?

Knowledge distillation for making smaller models

https://arxiv.org/pdf/1909.10351.pdf

Generating similar sentence embeddings across different languages

https://arxiv.org/pdf/2004.09813v1.pdf
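At the core of both applications is the same distillation loss: the student matches the teacher's softened output distribution in addition to the hard labels. A minimal PyTorch sketch (the temperature and weighting below are illustrative defaults, not values from either paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: the usual cross-entropy against the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits for a 3-class task.
student = torch.randn(4, 3, requires_grad=True)
teacher = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
print(distillation_loss(student, teacher, labels))
```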

How to design models? Which are the most important parameters?

This is a very difficult question!

All of this is explained in Scaling Laws for Neural Language Models, which gives us the relationships between

  • test loss vs compute
  • test loss vs dataset size
  • test loss vs parameters

To design and train a model, first decide on the architecture and count its parameters. From the parameter count, you can estimate the achievable test loss, and from that choose the dataset size and compute required.
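As a back-of-the-envelope example, the paper's parameter-count law has the form L(N) ≈ (N_c / N)^α_N. A small sketch plugging in the approximate constants reported in the paper (treat them as indicative, not exact):

```python
# Rough scaling-law estimate: L(N) ~ (N_c / N) ** alpha_N
# Constants are approximate values from Kaplan et al. for non-embedding parameters.
ALPHA_N = 0.076
N_C = 8.8e13

def predicted_loss(num_params):
    """Predicted test loss (nats/token) from the non-embedding parameter count."""
    return (N_C / num_params) ** ALPHA_N

for n in (1e8, 1e9, 1e10):   # 100M, 1B and 10B parameter models
    print(f"{n:.0e} params -> loss ~ {predicted_loss(n):.2f}")
```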

See the scaling equations in the graph.

Scaling Laws for Neural Language Models

Finallyyy…

I hope you got to learn as much as I did going through these questions. To become better NLP engineers, we need a deeper understanding of transfer learning, which is evolving very quickly.

I believe these ideas will keep you busy for weeks 😃😬😎
