How to become an expert in NLP in 2019 (1)

In this post, I will focus on all of the theoretical knowledge you need for the latest trends in NLP. I made this reading list as I learned new concepts. In the next post, I will share what I use to practice these concepts, including fine-tuning and models from rank-1 solutions on competition leaderboards. Use this link to get to part 2 (still to be written).

For resources, I include papers, blog posts, and videos.

It is not necessary to read everything in depth. Your main goal should be to understand, for each paper, what was introduced, how it works, and how it compares with the state of the art.

Trend: Use bigger transformer-based models and solve multiple tasks with multi-task learning.

Warning: An increasing trend in NLP is that if you come up with a new idea while reading any of these papers, you will need massive compute power to get any reasonable results. So in practice you are limited to the open-source pretrained models.

  1. fastai:- I had already watched the videos, so I thought I should add it to the top of the list.
  • Lesson 4 of Practical Deep Learning for Coders. It will get you up to speed on how to implement a language model in fastai.
  • There is also Lesson 12 in part 2 of the course, but it has not been released officially yet, so I will update the link when it is uploaded.

2. LSTM:- Although transformers dominate nowadays, in some cases you can still use an LSTM, and it was the first successful model to get good results on language modeling. If you do, you should use AWD_LSTM now.

3. AWD_LSTM:- It was proposed to overcome the shortcomings of the vanilla LSTM by introducing dropout between hidden layers, embedding dropout, and weight tying. You should use AWD_LSTM instead of a plain LSTM.
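One of those tricks, embedding dropout, zeroes out entire word vectors rather than individual units, so a dropped word disappears completely for that batch. Here is a toy sketch in plain Python (my own minimal illustration of the idea, not the actual AWD_LSTM or fastai code):

```python
import random

def embedding_dropout(embeddings, p, seed=None):
    """Zero out entire word vectors with probability p, scaling the
    survivors by 1/(1-p) so the expected value is unchanged.
    Toy illustration of the AWD_LSTM trick, not the real implementation."""
    rng = random.Random(seed)
    dropped = {}
    for word, vec in embeddings.items():
        if rng.random() < p:
            dropped[word] = [0.0] * len(vec)            # whole word dropped
        else:
            dropped[word] = [v / (1 - p) for v in vec]  # rescaled survivor
    return dropped

emb = {"the": [1.0, 2.0], "cat": [3.0, 4.0], "sat": [5.0, 6.0]}
out = embedding_dropout(emb, p=0.5, seed=0)
```

Note that the whole row is dropped at once; ordinary dropout on the embedding output would instead zero individual dimensions independently.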

4. Pointer Models:- Although not necessary, it is a good read. You can think of it as pre-attention theory.

Extra: What is the difference between weight decay and regularization? With weight decay, you add a term directly to the update rule, while with regularization the penalty is added to the loss function. Why bring this up? Most probably the DL libraries are implementing weight_decay rather than true regularization under the hood.
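For plain SGD the two formulations happen to coincide, which is why the distinction is easy to miss; the divergence only shows up with adaptive optimizers like Adam. A toy 1-D sketch (my own numbers, assuming the common L2 penalty of (wd/2)·w²):

```python
# Toy 1-D example: loss L(w) = (w - 3)^2, so dL/dw = 2*(w - 3).
# With plain SGD, decoupled weight decay and an L2 penalty in the loss
# give identical updates; with Adam they differ (hence AdamW).

def grad(w):
    return 2 * (w - 3)

lr, wd, w0 = 0.1, 0.01, 5.0

# (a) weight decay: the shrinkage term goes directly into the update rule
w_decay = w0 - lr * grad(w0) - lr * wd * w0

# (b) L2 regularization: (wd/2)*w^2 added to the loss, then plain SGD,
#     so the gradient gains an extra wd*w term
w_l2 = w0 - lr * (grad(w0) + wd * w0)

print(w_decay, w_l2)  # identical for SGD
```

With Adam, the L2 gradient term gets rescaled by the adaptive learning rates while decoupled decay does not, so the two stop being equivalent.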

In some papers, you will see that the authors preferred SGD over Adam, citing that Adam does not give good performance. The reason may be that PyTorch/TensorFlow are making the above mistake. This is explained in detail in this post.

5. Attention:- Just remember Attention is not all you need.
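The core computation, scaled dot-product attention, softmax(QKᵀ/√d)·V, is small enough to write out by hand. A minimal sketch in plain Python (single head, no masking or projections):

```python
import math

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for lists-of-lists.
    Minimal single-head sketch, not a full multi-head implementation."""
    d = len(Q[0])
    out = []
    for q in Q:
        # similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        # numerically stable softmax over the scores
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # output = attention-weighted average of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]                      # one query
K = [[1.0, 0.0], [0.0, 1.0]]          # two keys
V = [[1.0, 2.0], [3.0, 4.0]]          # two values
print(attention(Q, K, V))
```

The query matches the first key more closely, so the output leans toward the first value vector; a transformer just runs many of these in parallel with learned projections.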

There is a lot of research going on to make better transformers; maybe I will read more papers on this in the future. Some other transformers include the Universal Transformer and the Evolved Transformer, which used AutoML to come up with a transformer architecture.

New transformer architectures do not solve the problem by themselves, because for your NLP tasks you need language models built from these transformer blocks. In most cases, you will not have the computational resources necessary to train such models, as it has been found that the more transformer blocks you use, the better. Also, you need large batch sizes to train these language models, which means you have to use either an Nvidia DGX or Google Cloud TPUs (PyTorch support coming someday).

6. Random resources:- You can skip this section. But for completeness, I provide all the resources I used.

7. Multi-task Learning:- I am really excited about this. In this case, you train a single model on multiple tasks (more than 10 if you want), with data that looks like “translate to english some_text_in_german”. The model learns to use this initial task description to choose which task it should perform.
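The "task as part of the input" idea is just data formatting: prefix every example with a natural-language task description so one model can be trained on all of them. A hypothetical sketch (the task names and format below are my own, not from any particular paper):

```python
def make_example(task, text):
    """Prefix the input with a task description so a single model
    trained on such strings learns to dispatch on the task itself.
    Toy format of my own; real setups differ in details."""
    return f"{task}: {text}"

# one mixed batch covering three different tasks
batch = [
    make_example("translate English to German", "The house is wonderful."),
    make_example("summarize", "Heavy rain flooded the town overnight."),
    make_example("sentiment", "The movie was surprisingly good."),
]
for example in batch:
    print(example)
```

At training time these strings are simply interleaved, so the model never needs a separate head per task; the prefix carries all the routing information.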

8. PyTorch:- PyTorch provides good tutorials that serve as references for how to code up most of the stuff in NLP. Although transformers are not covered in the tutorials, you should still go through them once.


Now we come to the latest research in NLP, which has resulted in NLP’s ImageNet moment. All you need to understand is how attention works and you are set.

9. ELMo:- The first prominent work where we moved from pretrained word embeddings to using pretrained models for getting the word embeddings. You run the input sentence through the model to get contextual embeddings for the tokens in that sentence.
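The key contrast with static word vectors is that the same token gets a different vector depending on its sentence. A toy illustration in plain Python, where averaging over neighbouring words stands in for ELMo's real deep bidirectional LSTM (all vectors here are made up):

```python
STATIC = {  # made-up static vectors: "bank" has one fixed vector
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank":  [0.5, 0.5],
}

def contextual(sentence):
    """Toy 'contextual embedding': each token's vector is averaged with
    its neighbours', so the same token differs across sentences.
    ELMo really runs a deep biLSTM; this only illustrates the idea."""
    vecs = [STATIC.get(w, [0.0, 0.0]) for w in sentence]
    out = []
    for i in range(len(vecs)):
        window = vecs[max(0, i - 1): i + 2]   # token plus its neighbours
        out.append([sum(c) / len(window) for c in zip(*window)])
    return out

a = contextual(["river", "bank"])[1]   # "bank" near "river"
b = contextual(["money", "bank"])[1]   # "bank" near "money"
print(a, b)
```

A static embedding table would return the identical vector for "bank" in both sentences; the context-dependent version is what made ELMo features so much stronger for downstream tasks.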

10. ULMFiT:- Is it better than BERT? Maybe not, but in Kaggle competitions and external competitions ULMFiT still gets first place.

11. OpenAI GPT:- I have not compared BERT with GPT-2, but you can work on some kind of ensemble if you want. Do not use GPT-1, as BERT was made to overcome its limitations.

12. BERT:- The most successful language model right now (as of May 2019).

In order to use all these models in PyTorch, you should use the huggingface/pytorch-pretrained-BERT repo, which gives you complete implementations along with pretrained weights for BERT, GPT, GPT-2, and Transformer-XL.
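As a taste of what the tokenizers in that repo do before the model ever runs, here is a toy greedy longest-match-first WordPiece splitter. The `##` continuation convention follows BERT's scheme, but the code and the tiny vocabulary are my own minimal sketch, not the library implementation:

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first subword split, BERT-style:
    continuation pieces carry a '##' prefix. Toy sketch only."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand      # not at word start -> continuation
            if cand in vocab:
                piece = cand            # longest match found
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]            # no piece matches at this position
        pieces.append(piece)
        start = end
    return pieces

vocab = {"un", "##aff", "##able", "##ff", "play", "##ing"}
print(wordpiece("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece("playing", vocab))    # ['play', '##ing']
```

This is why BERT-family models handle rare words without a huge vocabulary: unknown surface forms decompose into known subword pieces.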

13. Next Blog:- I may be late writing the next blog post, so I wanted to share this last thing.

Congrats, you made it to the end. You now have most of the theoretical knowledge needed to practice NLP using the latest models and techniques.

What to do now?

You have only learned the theory; now practice as much as you can. Create crazy ensembles if you want, and try to get to the top of the leaderboards. I am struggling to practice my NLP tasks right now, as I am busy with some computer vision projects, which you can check below or on my GitHub.

Most probably I will make a follow-up post by mid or late June with a list like this one, covering some new techniques I plan to read about and the things I will do for practice.

If you find this post useful, share it with others who may benefit from it and give it a clap (it helps a lot; you can give up to 50 claps).

Follow me on Medium to get my latest blog posts in your Medium feed. My socials: LinkedIn, GitHub, Twitter.

My previous blog posts

  1. Training AlexNet with tips and checks on how to train CNNs: Practical CNNs in PyTorch(1)
  2. All you need for Photorealistic Style Transfer in PyTorch
  3. SPADE: State of the art in Image-to-Image Translation by Nvidia
  4. Weight Standardization: A new normalization in town