6 Natural Language Processing Models you should know

Takoua Saadani
Published in UBIAI NLP · 7 min read · Aug 29, 2022

Natural language processing, or NLP, is one of the most fascinating areas of artificial intelligence, and it already powers many of the technological utilities we use every day.

Deep learning models that have been trained on large datasets to perform specific NLP tasks are referred to as pre-trained models (PTMs). They can be reused for downstream NLP tasks, avoiding the need to train a new model from scratch.
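
As a quick illustration of what this reuse looks like in practice, here is a minimal sketch using the Hugging Face transformers library; the library, the pipeline task name, and the example sentence are illustrative assumptions rather than anything prescribed by the models discussed below.

```python
# Minimal sketch: reusing a pre-trained model for a downstream task instead of
# training from scratch. Assumes `transformers` and `torch` are installed
# (pip install transformers torch); the pipeline's default checkpoint is
# downloaded automatically on first use.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # loads a pre-trained classifier
print(classifier("Pre-trained models save us from training from scratch."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```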

This article will introduce you to six natural language processing models you should know about, whether you want your models to perform more accurately or you simply need an update on the field.

1- BERT

BERT is short for Bidirectional Encoder Representations from Transformers and was created by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Released in 2018, this natural language processing machine learning (ML) model serves as a Swiss Army knife for 11+ of the most common language tasks, such as sentiment analysis and named entity recognition.

Unlike earlier language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to produce state-of-the-art models for a variety of tasks, including question answering and language inference, without substantial task-specific architecture modifications.

BERT’s success has been aided by a massive training dataset of 3.3 billion words: English Wikipedia (2.5B words) and BooksCorpus (800M words). These large corpora gave BERT a deep understanding not only of the English language but also of the world it describes.

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE benchmark score to 80.4% (a 7.6% absolute improvement), MultiNLI accuracy to 86.7% (a 5.6% absolute improvement), and the SQuAD v1.1 question answering Test F1 to 93.2 (a 1.5-point absolute improvement), outperforming human performance by 2.0 points.

Key performances of BERT

  • BERT provides a pre-trained model that can be applied to specific NLP tasks without significant architecture changes.
  • It advances the state of the art on 11 NLP tasks, for example achieving a GLUE score of 80.4% (a 7.6% absolute improvement over the previous best result) and an F1 score of 93.2 on SQuAD v1.1, outperforming human performance by 2 points.
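
To make the “one additional output layer” idea concrete, here is a hedged sketch of how a classification head is commonly attached to pre-trained BERT with the Hugging Face transformers library; the bert-base-uncased checkpoint and the two-label setup are illustrative assumptions, not the authors’ original code.

```python
# Sketch: pre-trained BERT plus a freshly initialized classification head.
# Assumes `transformers` and `torch`; "bert-base-uncased" and num_labels=2
# are illustrative choices.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# num_labels=2 adds one new output layer on top of the pre-trained encoder;
# fine-tuning then updates the whole network on labeled task data.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

inputs = tokenizer("BERT handles many tasks with one extra layer.",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2]): one score per label, before fine-tuning
```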

2- XLNet

The XLNet model was proposed in XLNet: Generalized Autoregressive Pretraining for Language Understanding by Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le.

XLNet is an extension of the Transformer-XL model, pre-trained with an autoregressive method that maximizes the expected likelihood over all permutations of the input sequence’s factorization order.

XLNet is a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order, and its autoregressive formulation overcomes the limitations of purely masked-language-model approaches. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, XLNet outperforms BERT on 20 tasks, often by a large margin, and achieves state-of-the-art results on 18 tasks, including question answering, natural language inference, sentiment analysis, and document ranking.

Key performances of XLNet

  • The new model outperforms previous models on 18 NLP tasks, including question answering, natural language inference, sentiment analysis, and document ranking.
  • XLNet consistently outperforms BERT, often by a wide margin.
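
Because XLNet is exposed through the same interfaces as BERT in the Hugging Face transformers library, swapping it into a downstream task is mostly a matter of changing the checkpoint name; the sketch below assumes the publicly available xlnet-base-cased checkpoint and a two-label classification setup.

```python
# Sketch: XLNet as a drop-in replacement for BERT in sequence classification.
# Assumes `transformers`, `torch`, and `sentencepiece` (needed by the XLNet
# tokenizer) are installed; "xlnet-base-cased" is an illustrative checkpoint.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlnet-base-cased", num_labels=2
)

inputs = tokenizer("XLNet models bidirectional context autoregressively.",
                   return_tensors="pt")
logits = model(**inputs).logits  # meaningful only after fine-tuning on labeled data
print(logits)
```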

3- RoBERTa

RoBERTa stands for Robustly Optimized BERT Pretraining Approach; it was created by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov.

RoBERTa modifies BERT’s pretraining recipe: it trains the model for longer, in larger batches, and on more data; removes the next sentence prediction objective; trains on longer sequences; and dynamically changes the masking pattern applied to the training data. With these changes, RoBERTa can match or outperform every post-BERT method. To better control for training set size effects, the authors also collected a large new dataset (CC-News) of comparable size to other privately used datasets. When training data is controlled for, RoBERTa’s improved training procedure outperforms published BERT results on both GLUE and SQuAD. When trained on more data for a longer period of time, the model achieves a score of 88.5 on the public GLUE leaderboard, matching the 88.4 reported by Yang et al. (2019).

Key performances of RoBERTa

  • On the General Language Understanding Evaluation (GLUE) benchmark, the new model matches the recently introduced XLNet model and establishes a new state of the art on four of the nine individual tasks.
  • RoBERTa outperforms BERT on every individual GLUE task.
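
RoBERTa keeps BERT’s masked-language-modelling objective (minus next sentence prediction), so its pre-trained head can be probed directly with a fill-mask pipeline; the roberta-base checkpoint and the example sentence below are illustrative assumptions.

```python
# Sketch: probing RoBERTa's masked-language-modelling head.
# Assumes `transformers` and `torch`; note RoBERTa's mask token is "<mask>",
# not BERT's "[MASK]".
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")
for prediction in fill_mask("RoBERTa was pre-trained on much more <mask> than BERT."):
    print(prediction["token_str"], round(prediction["score"], 3))
```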

4- ALBERT

ALBERT, a Lite BERT for Self-supervised Learning of Language Representations, was developed by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. The Google Research team proposed it to address the continuously growing size of pretrained language models, which results in memory limitations, longer training times, and sometimes unexpectedly degraded performance.

ALBERT employs two parameter-reduction techniques, namely factorized embedding parameterization and cross-layer parameter sharing. In addition, the proposed method includes a self-supervised loss for sentence-order prediction to improve inter-sentence coherence. The experiments show that the best version of ALBERT achieves new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while using fewer parameters than BERT-large.

Key performances of ALBERT

  • A much larger ALBERT configuration, which still has fewer parameters than BERT-large, outperforms all current state-of-the-art language models, achieving 89.4% accuracy on the RACE benchmark, an F1 score of 92.2 on the SQuAD 2.0 benchmark, and a score of 89.4 on the GLUE benchmark.
  • An ALBERT configuration comparable to BERT-large, with 18x fewer parameters and roughly 1.7x faster training, achieves only slightly worse performance thanks to the introduced parameter-reduction techniques.
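
The parameter-reduction claim is easy to sanity-check by counting parameters; the sketch below compares an assumed albert-base-v2 checkpoint with bert-base-uncased, and the exact counts depend on the versions actually downloaded.

```python
# Sketch: comparing parameter counts to see ALBERT's cross-layer parameter
# sharing and factorized embeddings in effect. Checkpoints are illustrative.
from transformers import AutoModel

for name in ("bert-base-uncased", "albert-base-v2"):
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
# Expected order of magnitude: roughly 110M for bert-base versus roughly 12M
# for albert-base, even though the two encoders share the same depth and
# hidden size.
```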

5- PaLM

PaLM, or the Pathways Language Model, was introduced in PaLM: Scaling Language Modeling with Pathways by Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, and others.

The Pathways Language Model (PaLM) is a 540-billion-parameter, dense, decoder-only Transformer model trained with the Pathways system, which orchestrates distributed computation across accelerators. With Pathways, it was possible to train this single model across multiple TPU v4 Pods. Experiments on hundreds of language understanding and generation tasks demonstrated that PaLM achieves state-of-the-art few-shot performance across most of them, with breakthrough capabilities in language understanding, language generation, reasoning, and code-related tasks.

Key performances of PaLM

  • Numerous experiments show that as the team scaled up to their largest model, model performance skyrocketed.
  • PaLM 540B achieved ground-breaking performance on a variety of extremely difficult tasks, including language understanding and generation, reasoning, and code generation.
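
PaLM itself is not openly downloadable, so the most transferable part of these results is the few-shot prompting pattern they rely on; the sketch below only builds such a prompt as plain text, with an invented sentiment task standing in for a real one.

```python
# Sketch: the few-shot prompting pattern behind results like PaLM's.
# No model or API is called here; the demonstrations and the task are invented
# purely for illustration.
few_shot_examples = [
    ("The movie was a waste of two hours.", "negative"),
    ("Absolutely loved the soundtrack and the acting.", "positive"),
]

def build_prompt(query: str) -> str:
    """Concatenate labeled demonstrations followed by the unlabeled query."""
    blocks = [f"Review: {text}\nSentiment: {label}\n"
              for text, label in few_shot_examples]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n".join(blocks)

print(build_prompt("The plot dragged, but the visuals were stunning."))
```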

6- GPT-3

Generative Pre-trained Transformer 3 (GPT-3) is an autoregressive language model developed by OpenAI that uses deep learning to produce human-like text.

Given the wide variety of possible tasks and the difficulty of collecting a large labeled training dataset, researchers proposed an alternative solution, which was scaling up language models to improve task-agnostic few-shot performance.
They put their solution to the test by training and evaluating a 175B-parameter autoregressive language model called GPT-3 on a variety of NLP tasks.
The evaluation results show that GPT-3 achieves promising results in the few-shot, one-shot, and zero-shot settings, occasionally surpassing the state of the art achieved by fine-tuned models.

Key performances of GPT-3

  • It can generate anything that has a text structure, not just human-language prose.
  • It can automatically produce text summaries and even programming code.
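
GPT-3 is accessed as a hosted service rather than downloaded, so a typical interaction looks like the hedged sketch below; it assumes the legacy openai Python client (pre-1.0) and a text-davinci-003-style completion model, and the hosted API has evolved since this style of call was current.

```python
# Sketch: zero-shot text generation with GPT-3 via the hosted API.
# Assumes the legacy `openai` Python client (pre-1.0); the model name and the
# prompt are illustrative, and the API key below is a placeholder.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; never hard-code real keys

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=(
        "Summarize in one sentence: pre-trained language models let teams "
        "reuse large-scale training instead of starting from scratch."
    ),
    max_tokens=60,
)
print(response.choices[0].text.strip())
```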

Conclusion

NLP language models are a critical component in improving machine learning capabilities. They democratize access to knowledge and resources while also fostering a diverse community.

When it comes to choosing the best NLP language model for an AI project, the decision is primarily driven by the scope of the project, the type of dataset, the training approach, and a variety of other factors that we will explore in future articles.
