Transfer Learning In NLP — Part 2

The new tricks

Pratik Bhavsar
May 15, 2020 · 5 min read
Photo by David Boca on Unsplash

This is a follow-up to my first post, Transfer Learning In NLP. If you are just getting started in NLP, have a look at that one first.

A lot has happened since October 2018, when BERT was released.

Do you know BERT’s masked language modelling is old school?

Do you know attention doesn’t need to be quadratic in time?

Do you know you can steal Google’s model?

Some of the smartest people of our generation have been working intensely and churning out a lot! NLP is probably the sexiest field to be in right now 😋

These 20 questions will test how up to date you are with the current state of NLP and get you ready for any technical discussion.

After all, you are only as good as the amount of jargon you know 😆

What is the current state of pre-trained models (PTMs)?

[Figure source: https://arxiv.org/pdf/2003.08271.pdf]

Which tasks are used for training PTMs?

[Figure source: https://arxiv.org/pdf/2003.08271.pdf]

What is the current state of PTMs on GLUE?

[Figure source: https://arxiv.org/pdf/2003.08271.pdf]

Does more data always lead to a better language model?

The T5 paper says no. Quality matters more than quantity.

[Figure source: https://arxiv.org/pdf/1910.10683.pdf]

What tokenisation method seems best for training language models?

This paper says that a newer method, unigram LM tokenisation, is better than BPE and WordPiece.
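If you want to experiment with this, the sentencepiece library implements the unigram LM algorithm alongside BPE. Here is a minimal sketch; the corpus file name and vocabulary size are placeholder choices of mine, not values from the paper:

```python
# Minimal sketch: training a unigram LM tokenizer with the sentencepiece library.
# "corpus.txt" and vocab_size=32000 are placeholder choices, not from the paper.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",          # one sentence per line
    model_prefix="unigram_tok",  # writes unigram_tok.model / unigram_tok.vocab
    vocab_size=32000,
    model_type="unigram",        # alternatives: "bpe", "word", "char"
)

sp = spm.SentencePieceProcessor(model_file="unigram_tok.model")
print(sp.encode("Transfer learning is eating NLP", out_type=str))
```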

Which task is best for training a language model?

The current best approach is ELECTRA's replaced token detection: a small generator replaces some input tokens, and a discriminator then predicts which tokens were corrupted.

[Figures source: https://arxiv.org/pdf/2003.10555.pdf]
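To make replaced token detection concrete, here is a minimal sketch that runs the publicly released ELECTRA discriminator through Hugging Face transformers on a sentence where I corrupted one token myself (standing in for the generator):

```python
# Sketch: asking ELECTRA's discriminator which token looks replaced.
# The checkpoint is the public google/electra-small-discriminator model;
# the corrupted sentence is my own example, not taken from the paper.
import torch
from transformers import ElectraTokenizerFast, ElectraForPreTraining

name = "google/electra-small-discriminator"
tokenizer = ElectraTokenizerFast.from_pretrained(name)
model = ElectraForPreTraining.from_pretrained(name)

# "jumps" has been corrupted to "painted" (playing the role of the generator).
fake_sentence = "The quick brown fox painted over the lazy dog"
inputs = tokenizer(fake_sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # one replaced/original score per token

predictions = (logits > 0).int().squeeze().tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0].tolist())
print(list(zip(tokens, predictions)))  # 1 = predicted as replaced
```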

The T5 paper also finds that corrupting spans of length 3 works well.

[Figure source: https://arxiv.org/pdf/1910.10683.pdf]

Is gradual unfreezing needed when fine-tuning a transformer on a task?

The T5 paper says no.

[Figure source: https://arxiv.org/pdf/1910.10683.pdf]

What would you change to get a better language model if you have a fixed training budget?

The T5 paper suggests increasing both the model size and the number of training steps.

[Figure source: https://arxiv.org/pdf/1910.10683.pdf]

Which model would you use if your sequence is longer than 512 tokens?

Transformer-XL or Longformer

How does the processing time of a transformer scale with sequence length?

Quadratically, since every token attends to every other token.

How can we bring down the processing time of transformers on long documents, given that self-attention is quadratic in sequence length?

Longformer uses an attention mechanism that scales linearly with sequence length.

[Figure source: https://arxiv.org/pdf/2004.05150.pdf]
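As a rough sketch, this is how you might encode a document longer than 512 tokens with the allenai/longformer-base-4096 checkpoint in Hugging Face transformers, keeping sliding-window attention everywhere and putting global attention only on the [CLS] token (the pooling choice is my assumption, not a recommendation from the paper):

```python
# Sketch: encoding a long document with Longformer's linear attention.
# Checkpoint and [CLS]-pooling are assumptions for illustration only.
import torch
from transformers import LongformerTokenizerFast, LongformerModel

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

long_text = " ".join(["Transfer learning keeps evolving."] * 400)  # well past 512 tokens
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=4096)

# Local sliding-window attention everywhere, global attention only on [CLS].
global_attention_mask = torch.zeros_like(inputs.input_ids)
global_attention_mask[:, 0] = 1

with torch.no_grad():
    outputs = model(**inputs, global_attention_mask=global_attention_mask)

doc_embedding = outputs.last_hidden_state[:, 0]  # [CLS] vector, e.g. for semantic search
print(doc_embedding.shape)
```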

Longformer can be really good for encoding long documents for semantic search. The table below shows the work done so far.

[Figure source: https://arxiv.org/pdf/2004.05150.pdf]

Does BERT perform so well because of its attention mechanism?

The paper Attention is not Explanation argues that attention weights do not correlate well with model outputs, so we cannot credit BERT's performance to the attention mechanism alone.

Will the performance of BERT fall drastically if we turn off an attention head?

No, as per the paper Revealing the Dark Secrets of BERT.

Will the performance of BERT fall drastically if we turn off a layer?

No, as per the paper Revealing the Dark Secrets of BERT.

Will the performance of BERT fall drastically if we randomly initialise it?

No, as per the paper Revealing the Dark Secrets of BERT.
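You can try the turn-off-a-head experiment yourself with the prune_heads utility in Hugging Face transformers and then re-evaluate on your task; the layer and head indices below are arbitrary picks of mine, not the ones analysed in the paper:

```python
# Sketch: pruning a few attention heads from BERT before fine-tuning/evaluation.
# The chosen layers/heads are arbitrary examples, not from the paper.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
print(model.config.num_attention_heads)  # 12 heads per layer

# Remove heads 0 and 2 in layer 0, and head 5 in layer 4,
# then fine-tune or evaluate as usual to see how little the score moves.
model.prune_heads({0: [0, 2], 4: [5]})
```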

Do we really need model compression?

Maybe not! The notes below are from this amazing article.

“Model compression techniques give us a hint about how to train appropriately-parameterized models by elucidating the types of solutions over-parameterized models tend to converge to. There are many types of model compression, and each one exploits a different type of “simplicity” that tends to be found in trained neural networks:”

  • Many weights are close to zero (Pruning)
  • Weight matrices are low rank (Weight Factorization)
  • Weights can be represented with only a few bits (Quantization)
  • Layers typically learn similar functions (Weight Sharing)
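As an illustrative sketch of the quantization point, PyTorch dynamic quantization stores a model's Linear-layer weights in int8; the BERT checkpoint and file names here are just placeholders:

```python
# Sketch: dynamic quantization of BERT's Linear layers to int8 weights.
# Checkpoint and file names are placeholders for illustration.
import os
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Rough on-disk size comparison before and after quantization.
torch.save(model.state_dict(), "bert_fp32.pt")
torch.save(quantized.state_dict(), "bert_int8.pt")
print(os.path.getsize("bert_fp32.pt") / 1e6, "MB fp32")
print(os.path.getsize("bert_int8.pt") / 1e6, "MB int8")
```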

Can we steal a model if it is exposed as an API?

Yes, we can, as explained in this mind-blowing post.

What is the current state of distillation?

[Figure source: https://arxiv.org/pdf/2003.08271.pdf]

Does a bigger transformer model train faster than a smaller one?

The bigger model, as per this article.

What are the applications of the student-teacher framework?

Knowledge distillation for making smaller models

[Figure source: https://arxiv.org/pdf/1909.10351.pdf]

Generating similar sentence embeddings across different languages

[Figure source: https://arxiv.org/pdf/2004.09813v1.pdf]
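For the first application, the classic soft-target distillation loss has the student match the teacher's temperature-softened distribution while still learning from the hard labels. A minimal sketch, with temperature and mixing weight as typical placeholder values rather than numbers from the papers above:

```python
# Sketch of the standard soft-target knowledge distillation objective.
# T (temperature) and alpha (mixing weight) are typical placeholder values.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    # Scale the KL term by T^2 so its gradient magnitude matches the CE term.
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * T * T
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```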

How to design models? Which are the most important parameters?

This is a very difficult question!

All of this is explained in Scaling Laws for Neural Language Models, which gives us the relationships between

  • test loss vs compute
  • test loss vs dataset size
  • test loss vs parameters

To design and train a model, first decide the architecture, then work out its number of parameters. From this, you can estimate the achievable loss, and then pick the dataset size and compute required.

See the scaling equations in the graph.

[Figure: Scaling Laws for Neural Language Models]
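As a rough sketch of how those scaling equations are used, the paper fits power laws for test loss as a function of model size and dataset size; the constants below are ballpark values read off the paper's plots and should be treated as approximate:

```python
# Sketch of the power-law scaling form from "Scaling Laws for Neural Language Models".
# The default constants are approximate, ballpark readings of the paper's fits.
def loss_from_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """Predicted test loss when model size is the only bottleneck."""
    return (n_c / n_params) ** alpha_n

def loss_from_data(n_tokens, d_c=5.4e13, alpha_d=0.095):
    """Predicted test loss when dataset size is the only bottleneck."""
    return (d_c / n_tokens) ** alpha_d

# e.g. a hypothetical 100M-parameter model and a 10B-token dataset
print(loss_from_params(1e8))  # ~2.8
print(loss_from_data(1e10))   # ~2.3
```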

Finallyyy…

I hope you got to learn as much as I did going through these questions. To become better NLP engineers, we need a deeper understanding of transfer learning, which is evolving very quickly.

I believe these ideas will keep you busy for weeks 😃😬😎

