LePetit: A pre-training efficient and lightning fast French Language Model

Micheli Vincent
Jul 14, 2020 · 6 min read
Header image credits: Jérôme Daburon

Recent advances in language modeling have led to computationally intensive and resource demanding state-of-the-art models. In an effort towards sustainable practices, we introduce LePetit: a tiny French Language Model.

In this story we’ll also discuss why compact models are necessary and evaluate LePetit on the French Question Answering Dataset (FQuAD) as well as the Cross Lingual Sentiment (CLS) dataset.

If you’re looking for a more in-depth analysis, we will soon release a research paper on the importance of pre-training data volume for compact language models. In the meantime, the model is available in the Hugging Face collection!
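As a quick start, here is a minimal sketch of how the released checkpoint can be loaded with the transformers library for masked-token prediction. The "illuin/lepetit" model identifier below is a placeholder; check the Hugging Face hub for the actual name of the published checkpoint.

```python
# Minimal sketch: load the checkpoint and predict a masked token.
# "illuin/lepetit" is a hypothetical identifier for the released model.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="illuin/lepetit",      # hypothetical identifier
    tokenizer="illuin/lepetit",  # hypothetical identifier
)

# CamemBERT-style models use "<mask>" as the mask token.
print(fill_mask("Le camembert est le meilleur <mask> du monde."))
```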

The need for compact language models

Following these advances, the ongoing trend has been to build bigger models with an ever-increasing amount of data (e.g. RoBERTa) and parameters (e.g. GPT-3). However, pre-training models with billions of parameters over hundreds of gigabytes of text requires tremendous computational resources that only a few companies and institutions can afford. Besides, these cumbersome models introduce significant delays at inference time, especially on non-dedicated hardware. Hence, our goal is to explore model architectures and data volumes lowering the entry barrier to new research and practical applications.

LePetit

Model

LePetit has what we call a small architecture (12 layers, 256 hidden dimensions, 4 attention heads, 17M parameters). The main difference from the original CamemBERT lies in the use of subword masking: the CamemBERT authors later found that whole-word masking had at best a marginal impact on downstream task performance.
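For concreteness, this architecture can be written down as a transformers configuration. In the sketch below, the layer count, hidden size and head count come from the description above; the vocabulary size (CamemBERT's 32,005-token SentencePiece vocabulary) and the intermediate size (1024, i.e. four times the hidden size as in ELECTRA-SMALL) are assumptions.

```python
# A minimal sketch of the "small" architecture as a CamembertConfig.
from transformers import CamembertConfig, CamembertForMaskedLM

config = CamembertConfig(
    vocab_size=32005,        # assumption: CamemBERT's SentencePiece vocabulary
    num_hidden_layers=12,    # 12 layers
    hidden_size=256,         # 256 hidden dimensions
    num_attention_heads=4,   # 4 attention heads
    intermediate_size=1024,  # assumption: 4x hidden size, as in ELECTRA-SMALL
)

model = CamembertForMaskedLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```

With these settings the parameter count lands around the 17M mentioned above, most of it in the token embedding matrix.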

Apart from inference speed and size considerations, two main factors explain this architectural choice:

  • This is almost the same architecture as ELECTRA-SMALL++, a recently released compact language model. While ELECTRA and CamemBERT differ in many regards (ELECTRA being trained as a discriminator rather than a generator), prior experiments conducted by the ELECTRA team give us an acceptable set of hyperparameters when pre-training and fine-tuning the model.
  • The “Well-Read Students Learn Better” authors observed that depth should be prioritized over width when pre-training compact models. Note that even though depth outperforms width for a given parameter budget, it comes at a cost in inference speed, as observed by the DistilBERT team.
Model size and inference speed: LePetit vs CamemBERT

As illustrated above, LePetit is much smaller and faster than its larger siblings. It provides, respectively, a 4.5-fold and 15-fold inference speed-up when compared to CamemBERT-base and CamemBERT-large while being 6.2 and 18.8 times smaller.

Therefore, LePetit should be used in memory or time-constrained applications. For instance, it may find its purpose on smartphones or in information retrieval systems.
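For readers who want to reproduce such a comparison, here is a rough CPU latency benchmark sketch with PyTorch. The "illuin/lepetit" identifier is a placeholder, "camembert-base" is the public CamemBERT checkpoint on the Hugging Face hub, and the absolute numbers will of course depend on hardware, batch size and sequence length.

```python
# Rough sketch: average forward-pass latency of two checkpoints on CPU.
import time
import torch
from transformers import AutoModel, AutoTokenizer

def mean_latency(model_name: str, text: str, runs: int = 20) -> float:
    """Average time of a single forward pass, in seconds."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(runs):
            model(**inputs)
    return (time.perf_counter() - start) / runs

sentence = "LePetit est un petit modèle de langue français pré-entraîné sur peu de données."
for name in ["illuin/lepetit", "camembert-base"]:  # the first identifier is hypothetical
    print(f"{name}: {mean_latency(name, sentence) * 1000:.1f} ms per forward pass")
```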

Dataset

The pre-training data comes from the French portion of the OSCAR corpus. Credits: traces1.inria.fr/oscar/fr

Pre-training phase

Pre-training is conducted for 200k training steps, which took only 35 hours on a single Tesla V100 GPU!

The MLM setup. Credits: http://jalammar.github.io/illustrated-bert/. If you’re looking to deepen your understanding of Transformers, I highly recommend checking out Jay Alammar’s blog.
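Below is a condensed sketch of what such an MLM pre-training run looks like with the transformers Trainer. The corpus file, tokenizer choice, batch size, masking probability and learning rate are illustrative assumptions; only the 200k training steps figure comes from the text above, and the actual training script may differ.

```python
# Condensed sketch of MLM pre-training with the transformers Trainer.
from datasets import load_dataset
from transformers import (
    CamembertConfig,
    CamembertForMaskedLM,
    CamembertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Assumption: reuse CamemBERT's SentencePiece tokenizer for the small model.
tokenizer = CamembertTokenizerFast.from_pretrained("camembert-base")
config = CamembertConfig(
    vocab_size=32005, num_hidden_layers=12, hidden_size=256,
    num_attention_heads=4, intermediate_size=1024,
)
model = CamembertForMaskedLM(config)

# Hypothetical local dump of the French pre-training corpus, one document per line.
dataset = load_dataset("text", data_files={"train": "oscar_fr_subset.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(
    output_dir="lepetit-mlm",
    max_steps=200_000,                # the 200k training steps mentioned above
    per_device_train_batch_size=128,  # assumed batch size
    learning_rate=5e-4,               # assumed, ELECTRA-SMALL-style learning rate
)
Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset).train()
```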

Downstream Evaluation

Text Classification

Dataset

Amazon reviews are wild

Fine-tuning phase
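As an illustration of the setup, here is a minimal sketch of fine-tuning the pre-trained checkpoint for binary sentiment classification on the French CLS reviews with the transformers Trainer. The model identifier, CSV file names, column names and hyperparameters are placeholders rather than the exact values used in our experiments.

```python
# Minimal sketch: binary sentiment fine-tuning on a CSV export of CLS reviews.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "illuin/lepetit"  # hypothetical identifier for the pre-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Hypothetical CSV files with "review" (text) and "label" (0/1) columns.
data = load_dataset("csv", data_files={"train": "cls_fr_train.csv", "test": "cls_fr_test.csv"})
data = data.map(
    lambda batch: tokenizer(batch["review"], truncation=True, max_length=512),
    batched=True,
)

args = TrainingArguments(
    output_dir="lepetit-cls",
    num_train_epochs=3,              # assumed number of epochs
    per_device_train_batch_size=32,  # assumed batch size
)
Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["test"],
    tokenizer=tokenizer,  # enables dynamic padding of each training batch
).train()
```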

Results

Question Answering

Dataset

Examples with various inference types

Fine-tuning phase

Results

For reference, LePetit “fine-tuned” directly on FQuAD without a pre-training phase achieves an F1 score of only 17.76. As for larger architectures, CamemBERT-base and CamemBERT-large trained with the standard two-step process obtain F1 scores of 88 and 92, respectively.
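To make the task concrete, here is how an extractive question-answering model fine-tuned on FQuAD can be queried with the transformers question-answering pipeline. The "illuin/lepetit-fquad" checkpoint name is hypothetical and the context is a toy example.

```python
# Toy sketch: querying a hypothetical LePetit checkpoint fine-tuned on FQuAD.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="illuin/lepetit-fquad",      # hypothetical identifier
    tokenizer="illuin/lepetit-fquad",  # hypothetical identifier
)

result = qa(
    question="Combien de temps a duré le pré-entraînement de LePetit ?",
    context="LePetit est un petit modèle de langue français. "
            "Son pré-entraînement a duré 35 heures sur un seul GPU Tesla V100.",
)
print(result["answer"], result["score"])  # expected answer span: "35 heures"
```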

Learning from the big cheese: distillation

This will be the subject of a second part investigating classical and advanced distillation strategies. If you want to know more about how LePetit reached an F1 score of 80+ by distilling the knowledge of CamemBERT, stay tuned!
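As a preview, here is a minimal sketch of the classic soft-target distillation loss (Hinton et al.), one of the classical strategies in question: the student (LePetit) is trained to match the temperature-softened output distribution of the teacher (CamemBERT) on top of the usual hard-label loss. The temperature and mixing weight below are illustrative, not necessarily the values used in our experiments.

```python
# Sketch of the classic soft-target distillation loss (Hinton et al.).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between the temperature-softened
    # teacher and student distributions, scaled by T^2.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy on the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```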

To sum up:

  • We showed that LePetit is much smaller and faster than existing French models.
  • We saw that it is pre-training efficient, relying on little data and modest computational resources.
  • We evaluated LePetit on two downstream tasks, comparing it to larger models.

Written by Micheli Vincent
Data Science Intern at Illuin Technology and Data Science Master Student at EPFL.

Illuin Technology builds strategic AI projects (Data Science, NLP) and new tech interactions (Mobile, AR, VR, Voice).