Recent advances in language modeling have led to computationally intensive and resource-demanding state-of-the-art models. In an effort towards sustainable practices, we introduce LePetit: a tiny French Language Model.
If you’re looking for a more in-depth analysis, we will soon release a research paper on the importance of pre-training data volume for compact language models. In the meantime, the model is available in the Hugging Face collection!
The need for compact language models
Pre-trained language models have become the norm in Natural Language Processing. These large-scale Transformer-based networks considerably advanced the state of the art in language understanding via a two-step process: self-supervised learning on a vast text corpus followed by fine-tuning on a specific downstream task.
Following these advances, the ongoing trend has been to build bigger models with an ever-increasing amount of data (e.g. RoBERTa) and parameters (e.g. GPT-3). However, pre-training models with billions of parameters over hundreds of gigabytes of text requires tremendous computational resources that only a few companies and institutions can afford. Besides, these cumbersome models introduce significant delays at inference time, especially on non-dedicated hardware. Hence, our goal is to explore model architectures and data volumes that lower the entry barrier to new research and practical applications.
LePetit was inspired by its larger relative CamemBERT. CamemBERT is a multi-layer bidirectional Transformer with two architectures: base (12 layers, 768 hidden dimensions, 12 attention heads, 110M parameters) and large (24 layers, 1024 hidden dimensions, 16 attention heads, 355M parameters). It is very similar to RoBERTa, the main differences being the use of whole-word masking and SentencePiece tokenization instead of subword masking and byte-level BPE tokenization. RoBERTa itself improves upon BERT by aggregating several modifications on top of the original architecture, such as removing the next-sentence prediction task, using dynamic masking, and training with larger batches on more data.
LePetit has what we call a small architecture (12 layers, 256 hidden dimensions, 4 attention heads, 17M parameters). The main difference from the original CamemBERT lies in the use of subword masking: the authors later found that whole-word masking had at best a marginal impact on downstream task performance.
Apart from inference speed and size considerations, two main factors explain this architectural choice:
- This is almost the same architecture as ELECTRA-SMALL++, a recently released compact language model. While ELECTRA and CamemBERT differ in many regards (ELECTRA being trained as a discriminator rather than a generator), prior experiments conducted by the ELECTRA team give us an acceptable set of hyperparameters for pre-training and fine-tuning the model.
- The “Well-Read Students Learn Better” authors observed that depth should be prioritized over width when pre-training compact models. Note that even though depth outperforms width for a given parameter budget, it comes at a cost in inference speed, as observed by the DistilBERT team.
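As a sanity check on the 17M figure, here is a rough back-of-the-envelope parameter count for a BERT-style encoder of this size. The ~32k vocabulary, 512 positions, and 1024-dimensional feed-forward layer are our assumptions (in line with CamemBERT and ELECTRA-SMALL), not published LePetit details:

```python
# Rough parameter count for a BERT-style encoder, to sanity-check the
# "small" architecture (12 layers, 256 hidden dimensions ~= 17M parameters).
# Vocabulary size, maximum positions, and FFN width are assumptions.

def transformer_param_count(layers, hidden, ffn, vocab, max_pos=512):
    embeddings = vocab * hidden + max_pos * hidden          # token + position embeddings
    attention = 4 * (hidden * hidden + hidden)              # Q, K, V, O projections (+ biases)
    ffn_block = hidden * ffn + ffn + ffn * hidden + hidden  # two FFN projections (+ biases)
    layer_norms = 2 * 2 * hidden                            # two LayerNorms per layer (scale + shift)
    per_layer = attention + ffn_block + layer_norms
    return embeddings + layers * per_layer

small = transformer_param_count(layers=12, hidden=256, ffn=1024, vocab=32_000)
print(f"~{small / 1e6:.1f}M parameters")  # → ~17.8M parameters
```

The count lands in the right ballpark, and it also shows why the embedding matrix dominates at this scale: shrinking the hidden dimension shrinks every term at once.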
As illustrated above, LePetit is much smaller and faster than its larger siblings. It provides a 4.5-fold and a 15-fold inference speed-up compared to CamemBERT-base and CamemBERT-large, respectively, while being 6.2 and 18.8 times smaller.
Therefore, LePetit should be used in memory- or time-constrained applications. For instance, it may find its purpose on smartphones or in information retrieval systems.
OSCAR is a recently released large-scale multilingual open-source corpus obtained by language classification and filtering of the Common Crawl corpus. Its French part amounts to 138 GB of text. LePetit was pre-trained on only 2 GB, and our experiments revealed that models with similar performance can be obtained with as little as 100 MB!
LePetit is pre-trained with the standard masked language modeling (MLM) objective. MLM consists of training a model to predict masked words in a paragraph. That knowledge is then transferred to the downstream task of our choosing, such as Question Answering or Natural Language Inference, where the model is further fine-tuned.
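The masking procedure can be sketched as follows. This is a minimal illustration of BERT's standard 80/10/10 rule with a toy vocabulary, not LePetit's actual data pipeline:

```python
import random

# Sketch of BERT-style MLM data preparation: 15% of tokens are selected;
# of those, 80% become [MASK], 10% become a random token, and 10% are left
# unchanged. The model is trained to recover the original token at every
# selected position.

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                    # target the model must predict
            roll = rng.random()
            if roll < 0.8:
                masked.append("[MASK]")
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # random replacement
            else:
                masked.append(tok)                # kept as-is
        else:
            labels.append(None)                   # position not trained on
            masked.append(tok)
    return masked, labels

rng = random.Random(0)
vocab = ["le", "petit", "modèle", "français", "apprend", "vite"]
tokens = vocab * 20  # 120 toy tokens
masked, labels = mask_tokens(tokens, vocab, rng)
```

The random-replacement and keep-as-is cases prevent the model from only ever seeing `[MASK]` during training, which would create a mismatch with fine-tuning inputs.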
Pre-training is conducted for 200k training steps, which only took 35 hours on a single Tesla V100 GPU!
The French part of the Cross Lingual Sentiment (CLS) dataset is one of the FLUE tasks. It consists of 4,000 Amazon reviews for three product categories: books, DVD, and music. We consider the music category. Reviews have an associated rating ranging from 1 to 5. Those rated higher than 3 are labelled as positive, while the rest are labelled as negative (3s are excluded). Given a review, the task consists of predicting whether it is positive or negative.
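The labelling rule above fits in a few lines (the function name is ours, for illustration):

```python
# CLS labelling rule: ratings above 3 are positive, below 3 negative,
# and 3-star reviews are excluded from the dataset.

def cls_label(rating):
    if rating == 3:
        return None  # excluded
    return "positive" if rating > 3 else "negative"

labels = [cls_label(r) for r in [1, 2, 3, 4, 5]]
# → ['negative', 'negative', None, 'positive', 'positive']
```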
After pre-training, LePetit is fine-tuned on the text classification task as follows: a review goes through the model, which produces a representation of a special token prepended to the review; this token acts as a paragraph-level embedding. That representation is then fed to a classification head tasked with discriminating between positive and negative reviews.
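Conceptually, the classification head is just a linear layer followed by a softmax over the prepended token's final hidden state. Here is a toy sketch with made-up weights and a 4-dimensional embedding (in practice the weights are learned during fine-tuning and the embedding has 256 dimensions):

```python
import math

# Toy classification head: linear layer + softmax over two classes,
# applied to the prepended token's final hidden state.
# All numbers below are invented for illustration.

def classify(cls_embedding, weights, bias):
    logits = [sum(w * x for w, x in zip(row, cls_embedding)) + b
              for row, b in zip(weights, bias)]
    exps = [math.exp(l - max(logits)) for l in logits]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]  # [P(negative), P(positive)]

probs = classify([0.2, -0.5, 0.1, 0.9],
                 weights=[[0.3, -0.2, 0.5, -0.1], [-0.3, 0.2, -0.5, 0.1]],
                 bias=[0.0, 0.0])
```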
LePetit and CamemBERT-base achieve accuracy scores of 88 and 95, respectively, on the CLS test set.
The French Question Answering Dataset (FQuAD) is a recently released native French reading comprehension dataset. We consider its new 1.1 version. It consists of 60,000 questions and answers gathered from a set of 1,769 high-quality Wikipedia articles. In many aspects, it is the French equivalent of SQuAD 1.1. Given a question and a paragraph, the task consists of extracting from the paragraph the span of text answering the question.
After pre-training, LePetit is fine-tuned on the question answering task with the same span prediction method as BERT. That is, for a given question/paragraph pair, it predicts for each token in the paragraph its likelihood of being either the start or the end of the expected answer. All the tokens between these start and end delimiters (inclusive) constitute the returned answer.
LePetit reaches an F1 score of 72 and an Exact Match of 58 on the FQuAD validation set. This means that, on average, 72% of the tokens flagged as an answer by the model correspond to the ground-truth answer, and that 58% of the predicted answers exactly match the expected answers.
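For clarity, here is how these two metrics are typically computed in SQuAD-style evaluation. This is a simplified sketch that skips answer normalization (lowercasing, punctuation and article stripping):

```python
from collections import Counter

# SQuAD-style metrics: Exact Match is a strict string comparison,
# F1 measures token overlap between prediction and ground truth.

def exact_match(prediction, truth):
    return float(prediction == truth)

def f1_score(prediction, truth):
    pred_tokens, true_tokens = prediction.split(), truth.split()
    common = Counter(pred_tokens) & Counter(true_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

f1 = f1_score("la tour Eiffel", "tour Eiffel")
# precision 2/3, recall 1 → F1 = 0.8
```

In the official benchmarks, each metric is averaged over all question/answer pairs, usually against the best-matching of several ground-truth answers.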
For reference, LePetit fine-tuned directly on FQuAD without a pre-training phase achieves an F1 score of only 17.76. Regarding larger architectures, CamemBERT-base and CamemBERT-large trained with the standard two-step process obtain F1 scores of 88 and 92, respectively.
Learning from the big cheese: distillation
While question answering is a notoriously difficult task for compact models and performance gaps between compact and large models are smaller on GLUE-like tasks, LePetit can still learn how to answer questions from its elders!
This will be the subject of a second part investigating classical and advanced distillation strategies. If you want to know more about how LePetit reached an F1 score of 80+ by distilling the knowledge of CamemBERT, stay tuned!
To sum up:
- We introduced LePetit, a tiny French Language Model.
- We showed that it is much smaller and faster than existing French models.
- We saw that it is pre-training efficient, relying on little data and few computational resources.
- We evaluated LePetit on two downstream tasks, comparing it to larger models.