Is NLP actually advancing? Have we actually cracked linguistic learning with deep learning models? Turing-NLG by Microsoft.

Mustaffa Hussain · TheCyPhy · Feb 18, 2020

It has been barely a week since Microsoft announced its latest Turing-NLG model for natural language processing. For those who have stumbled here by chance, it's an NLP model like Google's BERT, OpenAI's GPT-2, Google's XLNet, Microsoft's MT-DNN, Facebook's RoBERTa, and Nvidia's Megatron. In simple words, it is a model that learns rich representations of language for purposes like context extraction, question answering, and summarization.
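
For a concrete feel of what these models do, here is a minimal sketch of text generation using the open-source Hugging Face transformers library. GPT-2 stands in as the example, since T-NLG itself has not been publicly released:

```python
# A minimal sketch: text generation with a pretrained language model.
# GPT-2 is used as a stand-in, since T-NLG is not publicly available.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Encode a prompt and let the model continue it.
prompt = "Natural language processing is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output_ids = model.generate(input_ids, max_length=40, do_sample=True, top_k=50)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```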

We already had models that performed well, so what's different about this one?

Machine learning is the talk of the decade. It has seen notable advancements and resource allocation across all sectors, driven mostly by progress in deep learning techniques and in hardware.

The last two years have been golden years for natural language processing. We have seen the launch of many Transformer-based language models by tech giants like Google, Facebook, and Microsoft, and one from Carnegie Mellon University researchers in collaboration with Google AI (XLNet).

Bigger the better, really?

Named Turing Natural Language Generation (T-NLG), the model is the largest transformer model published to date, and it achieves state-of-the-art results on a range of natural language processing tasks.

Turing Natural Language Generation (T-NLG) is a 17 billion parameter language model by Microsoft that outperforms the state of the art on many downstream NLP tasks. We present a demo of the model, including its freeform generation, question answering, and summarization capabilities, to academics for feedback and research purposes. <|endoftext|>

– This summary was generated by the Turing-NLG language model itself.

The model was implemented with the DeepSpeed library (compatible with PyTorch) using the ZeRO optimizer. The library was open-sourced alongside the release of the T-NLG model.
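
As a rough illustration of how DeepSpeed is typically used (a sketch based on the library's public API, not Microsoft's actual T-NLG training code), it wraps an ordinary PyTorch model in an engine that handles ZeRO partitioning and distributed training:

```python
# Sketch: wrapping a plain PyTorch model with DeepSpeed's ZeRO optimizer.
# In practice this script is launched across GPUs with the `deepspeed` CLI.
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # stand-in for a large transformer

ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 1},  # partition optimizer states across GPUs
}

# DeepSpeed returns an engine that manages the distributed training details.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# A training step: the engine replaces loss.backward() / optimizer.step().
x = torch.randn(8, 1024).to(model_engine.device)
loss = model_engine(x).pow(2).mean()
model_engine.backward(loss)
model_engine.step()
```

ZeRO's key idea is to partition optimizer state (and, at higher stages, gradients and parameters) across workers instead of replicating them, which is what makes a 17-billion-parameter model trainable on existing hardware.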

The trend.

The trend for the past year has been: the bigger the model, the better the results. The bigger the model and the more diverse and comprehensive the pretraining data, the better it generalizes to multiple downstream tasks, even with fewer training examples. Turing-NLG is a model with 17 billion parameters. Before T-NLG was announced, Nvidia's Megatron (8 billion parameters) held the crown as the largest model. At the same time, we have seen models like DistilBERT, RoBERTa, and XLNet achieve comparable results through clever optimizations while keeping parameter counts considerably smaller.

T-NLG future applications

T-NLG has advanced the state of the art in natural language generation, providing new opportunities for Microsoft and our customers. Beyond saving users time by summarizing documents and emails, T-NLG can enhance experiences with the Microsoft Office suite by offering writing assistance to authors and answering questions that readers may ask about a document. Furthermore, it paves the way for more fluent chatbots and digital assistants, as natural language generation can help businesses with customer relationship management and sales by conversing with customers.

Analysis

  1. The approach of deep learning is inspired by the human brain. The human brain has approximately 100 billion neurons, which can loosely be compared to the parameters of a neural network (see the parameter-counting sketch after this list). So yes, size does matter in neural networks, but that is not all: the amount of data fed into training these models is also huge. GPT-2 was trained on 8 million web pages, 10X the data used to train GPT-1, and it is a 1.5-billion-parameter model compared to T-NLG's 17 billion. The size of the data used to train T-NLG has not been disclosed yet, but it most likely follows the trend. No human could ever read 8 million web pages. So it is not just size and data; we are still far from creating AI models that match the complexity of the human brain.
  2. According to the paper titled “Energy and Policy Considerations for Deep Learning in NLP” by Emma Strubell, Ananya Ganesh, and Andrew McCallum, training a 213-million-parameter transformer-based model (with neural architecture search) emits as much carbon as five cars over their entire lifetimes. Google's famous BERT language model is 340 million parameters, OpenAI's GPT-2 is 1.5 billion parameters, Nvidia's Megatron is 8 billion parameters, and T-NLG is 17 billion parameters. So it is very clear that increasing size has a huge impact on the environment.
  3. The training of these models depends on GPUs and TPUs; these deep learning NLP models are simply hardware- and data-hungry. Training them is not at all cheap, which is why, over the last two years, we have seen such models come only from big private firms. This acts as a bottleneck to research on deep learning models in general and prevents universities and independent researchers from entering the field.
  4. We are far from creating a linguistic model that is both cheap and path-breaking. We can expect another model from one of the players in the market in the near future, with more parameters and claims of benchmark results. The current market is a resource-driven one.
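
Since every point above turns on parameter counts, here is a minimal sketch (assuming PyTorch and the Hugging Face transformers library, neither of which the figures above depend on) of how such numbers are computed:

```python
# Sketch: counting the trainable parameters of a pretrained model,
# the kind of figure (340M for BERT-large, 1.5B for GPT-2) quoted above.
from transformers import BertModel

model = BertModel.from_pretrained("bert-large-uncased")
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
# Prints roughly 335M for the encoder alone; the commonly quoted
# figure of 340M includes the pretraining heads.
print(f"BERT-large parameters: {n_params / 1e6:.0f}M")
```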

References

  1. https://thenextweb.com/syndication/2020/02/10/googles-meena-is-impressive-but-ai-chatbots-are-still-cheap-initiations-of-humans/
  2. https://news.developer.nvidia.com/using-nvidia-dgx-2-systems-microsoft-trains-worlds-largest-transformer-language-model/
  3. https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
  4. https://venturebeat.com/2020/01/02/top-minds-in-machine-learning-predict-where-ai-is-going-in-2020/
  5. https://nv-adlr.github.io/MegatronLM
  6. https://arxiv.org/abs/1906.08237
