A single legal text representation at Doctrine: the legal camemBERT

Pauline Chavallard
Published in Inside Doctrine · May 4, 2020


As a legal platform, Doctrine aggregates a lot of legal data with the intent of making them accessible, understandable and usable. The Machine Learning Engineers’ day-to-day material is mostly text: court decisions, legislation, legal commentaries, user queries, etc. All of our content is natural language, which we process in a number of ways: bag-of-words, embeddings or with language models.

In an ideal world though, our product would be built on top of scalable, flexible and reusable modules, ones that would be generic enough to accommodate a wide variety of legal contents and feed the whole spectrum of our product features. It is exactly with that vision in mind that we started working on a unified language model a few months ago, whose associated challenges, findings and results we’ll do our best to summarize in this article.

I. One language model to rule them all

Depending on the project, we were representing our legal contents with:

  1. different techniques:
  • TF-IDF vectors
  • BM25 (e.g., with ElasticSearch)
  • Wang2Vec embeddings (a variant of Word2Vec), fine-tuned on legal data — note that even if those embeddings work pretty well for a lot of tasks, they are no longer state-of-the-art. Simple word embeddings lack modeling power, and we now clearly see their limits on some tasks.

  2. different data:
  • the vocabulary of the content itself,
  • the vocabulary of the linked contents from our legal graph,
  • the vocabulary from some metadata provided by the courts.
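As an illustration of the first family of techniques, a minimal TF-IDF baseline can be sketched with scikit-learn (the documents below are made up; our real pipelines run on full legal corpora):

```python
# Toy TF-IDF baseline: one sparse vector per document.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "indemnisation du préjudice corporel de la victime",
    "réparation du dommage corporel subi par la victime",
    "résiliation du bail commercial pour défaut de paiement",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix, one row per document
print(X.shape)  # (3, vocabulary size)
```

One limit is visible even on this toy example: préjudice and dommage get unrelated dimensions, which is exactly the kind of semantic proximity a language model should capture.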

Yet eventually, we want to be able to represent all of our legal content using a unified framework for any text-understanding based feature, because of:

  1. Reusability: all teams can rely on this unique language model for their projects.

  2. Scalability:
  • a modeling power sufficient to be applied to any new legal content (e.g., legal documents from the lower house and the upper house),
  • robust enough to unlock use cases we’re not yet considering, like legal bots, legal trend detection, argument mining, etc.,
  • generic enough to be applied to a new language (with retraining on the new language, of course).

3. Agnostic usage: one of the problems with our current representations is that courts follow guidelines in the way they phrase statements, so textual similarity is strongly biased towards documents that share the same overall phrasing (those of the same court, for example), even when they are not invoking the same laws about the same thing. For example, it is now difficult for us to match decisions from the High Court/Court of Appeal to those from the Supreme Court simply because of their different writing styles: the former tends to focus primarily and precisely on the facts, while the latter usually relies only on the legal matter, which has an adverse effect on our current representations.

When we initially started thinking about this, there were some properties that we thought our language model should ideally cover:

  1. Taking advantage of semantic proximity:
  • In French: préjudice corporel should be equivalent to dommage corporel
  • In English: death should be equivalent to loss of life

  2. Being able to represent our content at different granularities:
  • Token-level for Named Entity Recognition: anonymization, entity detection, …
  • Paragraph-level: structure detection, argument similarities, …
  • Document-level: legal domain classification, document recommendation, …

It’s with all those things in mind that we started to work on a unique, all-encompassing language model serving all our use cases and features.

II. Our legal language model

The first step of this project was to design the architecture and implementation of our language model. This step was crucial since it would serve as the foundation to all of our future work and help us move towards our initial vision. We first thought about our technical constraints:

  • use an existing and robust implementation, in order to take advantage of the support and the community,
  • use a state-of-the-art technique to achieve very good performance,
  • ideally use a PyTorch implementation, because our previous Deep Learning algorithms were made with PyTorch. Moreover, PyTorch (along with a few others) remains the dominant deep learning library at the time of writing this article,
  • if possible, find an implementation with a French pre-trained model before fine-tuning, because transfer learning has shown its efficiency in NLP.

It should also be noted that compared to other use cases, especially in academic research, the framework should be efficient at representing very long texts. Here is an interesting blog post about different document embedding techniques. We’ll come back to that later.

Under these constraints, the Hugging Face Transformers library appeared to be a very good choice:

  • they offer all the recent state-of-the-art architectures (BERT, RoBERTa, XLNet, DistilBERT, …) complete with their associated PyTorch and TensorFlow implementations,
  • some of them have a French pre-trained model,
  • their implementation has quickly become an international reference, to the point where the famous NLP framework spaCy provides a Transformer implementation based on the Hugging Face one.

Among the models providing a French pre-trained model, we had the choice between:

  • BERT-Base, multilingual
  • DistilmBERT, multilingual
  • camemBERT, French RoBERTa model

We decided to go for camemBERT, since it already provided good results for the French language on several tasks according to this paper. Of course, multilingual models will probably be very useful for internationalization later, but we initially wanted to check that a transformer model could be relevant. Moreover, camemBERT has fewer parameters than multilingual models, which makes it a little easier to use.

Note that camemBERT is case-sensitive, which will be useful for Named Entity Recognition and especially for anonymization.

The legal CamemBERT

Now that we had settled on the underlying technology, we decided to check how well it would perform on actual, real-life legal data.

Knowing that camemBERT was initially trained on the French subcorpus of OSCAR, which features gigabytes of data crawled from the web, we knew that it would fare well at general French language tasks. We suspected, however, that speaking the more specific French legalese would prove a tougher nut to crack, which our initial tests confirmed.

For example, when asked to predict the next word of the sentence Par ces ... , camemBERT suggested the word mots, which is not exactly legal-oriented. We would expect something like moyens or motifs.

It was obvious at this point that the trove of millions of legal documents we have at our disposal at Doctrine would prove to be great material for the subsequent fine-tuning needed to harness the full power of our model. We were confident that the model could be trained; however, we needed it to be usable universally across features. Yet one issue remained: how to handle long texts, a strong prerequisite for legal documents, but something that doesn’t pair naturally with transformers’ inherent limitations.

BERT models, for example, have a hard limit of 512 input tokens (514 position embeddings in RoBERTa-style models, as set by the max_position_embeddings parameter), which would surely be a challenge when dealing with court decisions: texts that can be infamously verbose, with an average token count hovering around 2,000 (and some even more extreme cases like this decision).

To circumvent this issue, we envisioned two different approaches:

  • Embedding each paragraph
  • Having sliding windows, as explained here

To avoid ending up with redundancy in the embeddings, we decided to go with paragraph embeddings first, with exceedingly long paragraphs getting snipped past the limit during training. What was left for us to determine at that point was an aggregation strategy over the different paragraphs, so that we could harvest the final document embeddings, something that we would come back to later.
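For illustration, both strategies can be sketched over a tokenized document, i.e. a plain list of token ids (hypothetical helper functions; the sizes follow the 512-token limit):

```python
# Sketch of the two long-text strategies over token ids
# (illustrative helpers, not our production code).

def sliding_windows(token_ids, size=512, stride=256):
    """Overlapping windows over a long token sequence."""
    windows = []
    for start in range(0, len(token_ids), stride):
        windows.append(token_ids[start:start + size])
        if start + size >= len(token_ids):
            break
    return windows

def truncate_paragraph(token_ids, limit=512):
    """Paragraph strategy: snip overly long paragraphs at the model limit."""
    return token_ids[:limit]

doc = list(range(2000))  # a 2,000-token court decision, roughly our average
print(len(sliding_windows(doc)))     # 7 overlapping windows
print(len(truncate_paragraph(doc)))  # 512
```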

We then proceeded with the implementation, which was done by splitting our legal documents on paragraphs and fine-tuning camemBERT on the masked language model task (using dedicated AWS GPU instances). It converged after a few days and we tested its relevance by using a few qualitative checks:

Comparison between the standard pre-trained French camemBERT model and our legal camemBERT on a masked LM task

We assessed the differences in prediction for semantically similar sentences, which seemed consistent: the qualitative checks provided very good results. It was now time to validate the language model on a real task.

III. Our first legal camemBERT use-case: classification of legal domain

We wanted to try our legal camemBERT on a simple task for a first validation: text classification of legal domains on court decisions.

This is indeed a simple and well delimited task, and easy to compare to other basic models. Moreover, this classification has a huge product impact, on the search filters, recommender systems and analytics.

We have two hierarchies on the legal domains at Doctrine:

  1. the main legal domain:
  • Droit civil,
  • Droit commercial,
  • Droit social,
  • Droit public,

  2. the subdomain: for example, in Droit civil there are:
  • Divorce et séparation de corps
  • Droit locatif
  • Droit des successions
  • Droit de la responsabilité
Today, we support 9 different domains and 40 different subdomains, some of which are more complex to determine than others. These categories have a hierarchical structure, but we addressed the problem by reducing it to a flat 40-class classification problem.
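The hierarchy-to-flat-labels reduction can be sketched as follows (only a few of the 9 domains and 40 subdomains are shown, for illustration):

```python
# Partial taxonomy (illustrative subset only).
taxonomy = {
    "Droit civil": [
        "Divorce et séparation de corps",
        "Droit locatif",
        "Droit des successions",
        "Droit de la responsabilité",
    ],
    "Droit social": ["Droit du travail"],
}

# Each subdomain becomes one flat class; the parent domain stays
# recoverable from the predicted subdomain.
label_to_id = {sub: i for i, sub in enumerate(
    s for subs in taxonomy.values() for s in subs)}
domain_of = {s: d for d, subs in taxonomy.items() for s in subs}

print(label_to_id["Droit locatif"])   # 1
print(domain_of["Droit du travail"])  # Droit social
```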

The Hugging Face repository suggests a classification head module integrated with CamemBERT. However, as discussed earlier, the main problem is that court decisions can be very verbose (have a look at this very long decision for example), and BERT does not work well on long texts. A very good review of document embeddings showed that there is no clear embedding technique that works better than the others for very long documents. It really depends on your objective.

Working at a paragraph level seemed more relevant, all the more so as the language model has been trained at a paragraph scale. BERT will then provide an embedding for each paragraph. We then had to think about a way to aggregate the paragraphs in order to get a decision embedding.


  1. Paragraph embeddings method

It is known that BERT architectures provide not only word-level contextual embeddings but also the special [CLS] token, whose output embedding is used for classification tasks. However, it turns out to be a poor embedding of the input sequence for other tasks if not fine-tuned on the specific task:

  • The paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks from Reimers et al, 2019, shows that BERT out-of-the-box maps sentences to a vector space that is rather unsuitable to be used with common similarity measures like cosine-similarity.
  • According to BERT creator Jacob Devlin: “I’m not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations.” (source)

Still, the most classic ways to embed a document (in our case, a paragraph) with BERT are:

  • to use the [CLS]-token
  • to use an aggregation of the last X hidden states of the word embeddings (we usually saw X=4)

What is interesting in our case is that one paragraph does not represent the whole court decision, so we had to plug something on top of it. We decided to go with the [CLS] token as paragraph embedding for a first shot, because our task is a classification task.
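Concretely, for a batch of paragraphs run through the model, the [CLS] embedding is just the first position of the last hidden state. A minimal PyTorch sketch (a random tensor stands in for a real CamemBERT output here):

```python
import torch

# Stand-in for a CamemBERT forward pass over a batch of paragraphs:
# with the real model this tensor would be outputs.last_hidden_state.
batch, seq_len, hidden = 4, 128, 768
last_hidden_state = torch.randn(batch, seq_len, hidden)

# The [CLS] token (<s> for camemBERT) sits at position 0:
# one embedding vector per paragraph.
cls_embeddings = last_hidden_state[:, 0, :]
print(cls_embeddings.shape)  # torch.Size([4, 768])

# The alternative mentioned above: aggregate hidden states instead,
# e.g. a plain average over the sequence dimension.
mean_embeddings = last_hidden_state.mean(dim=1)
```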

2. Document embedding with an aggregation over paragraphs

Given embeddings for all our paragraphs, we then had to think of a way to get document embeddings.

Here again, different approaches can be considered, since this is another sequence-to-one vector modeling:

  1. A simple average of all paragraph embeddings (the [CLS] token of each BERT paragraph output),
  2. A weighted average of the paragraph embeddings, with weights built with the self-attention mechanism explained in the paper A Structured Self-attentive Sentence Embedding,
  3. A bi-LSTM to exploit the sequential information contained in the paragraphs,
  4. A Convolutional Neural Network,
  5. Another BERT that would learn the language at the paragraph scale.

Given that our task is a mere classification problem, the solution with a self-attention mechanism seemed to be pretty relevant for our case because:

  • It’s a bit smarter than a simple average-pooling, and it will automatically get rid of the useless paragraphs that contain no information for the legal domain. Indeed, the final paragraphs of French decisions are often related to the operative part of the judgment, and about who pays the costs. This is usually not relevant to our current problem.
  • It also provides some precious insights on how to best interpret the model. We can indeed have access to the attention weights and check on which paragraphs the model focused on the most for its prediction.
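The weighted-average aggregation can be sketched as a small PyTorch module (a simplified, single-head variant of the structured self-attention from the paper; dimensions are illustrative):

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Weighted average of paragraph embeddings with learned weights
    (single-head simplification of structured self-attention)."""
    def __init__(self, hidden_size=768, attn_size=128):
        super().__init__()
        self.proj = nn.Linear(hidden_size, attn_size)
        self.score = nn.Linear(attn_size, 1, bias=False)

    def forward(self, paragraphs):  # (batch, n_paragraphs, hidden)
        scores = self.score(torch.tanh(self.proj(paragraphs)))  # (b, n, 1)
        weights = torch.softmax(scores, dim=1)  # attention over paragraphs
        doc = (weights * paragraphs).sum(dim=1)  # (b, hidden)
        return doc, weights.squeeze(-1)

pooling = AttentionPooling()
paragraphs = torch.randn(2, 10, 768)  # 2 decisions, 10 paragraphs each
doc_embedding, attn = pooling(paragraphs)
print(doc_embedding.shape, attn.shape)
```

The returned weights are exactly the per-paragraph attention scores discussed below, which is what makes the model inspectable.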

With all that in mind, here’s the final architecture for the classification task:

Final architecture of our legal document classification on documents, using the legal camemBERT

We first tried to train the whole pipeline, including the fine-tuning of the legal camemBERT on this task, but we ran into memory errors. We therefore froze the BERT model and trained only the rest of the pipeline (attention layer + classification layer). It provided good results, so we didn’t experiment further with end-to-end training. This is something we made a note of though, since unsupervised BERT outputs are known to be poor if not fine-tuned, as discussed earlier in this article.
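The freezing step amounts to disabling gradients on the encoder’s parameters so that only the layers on top are trained. A sketch, with a small stand-in module in place of the fine-tuned camemBERT:

```python
import torch.nn as nn

# Stand-ins: in practice `encoder` is the legal camemBERT, and the head
# is the attention + classification layers from the architecture above.
encoder = nn.Sequential(nn.Linear(768, 768), nn.Linear(768, 768))
classifier = nn.Linear(768, 40)  # 40 subdomain classes

# Freeze the encoder: its parameters stop receiving gradients.
for param in encoder.parameters():
    param.requires_grad = False

trainable = [p for p in list(encoder.parameters()) + list(classifier.parameters())
             if p.requires_grad]
print(len(trainable))  # 2: only the classifier's weight and bias
```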


The goal here was not only to improve our legal domains classification, but also to show that we could achieve at least the same results as a simple TF-IDF model.

Dataset creation

Deep learning generally requires a substantial training set. That’s why we used a semi-automatically labelled training dataset, labelled:

  • by humans, using Prodi.gy
  • with business rules, using the associated court as a reference. If a decision is linked to another one from a labor court, it’s very likely that the decision is about Droit du travail (labor law).
  • with the most reliable predictions of our former algorithm, based on TF-IDF for the domain, and a legal taxonomy for the subdomain.
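The court-based business rule can be sketched as a simple lookup (the helper and the mapping are hypothetical; the real rules rely on our legal graph and court metadata):

```python
# Hypothetical court-to-domain rule table for semi-automatic labelling.
COURT_TO_DOMAIN = {
    "Conseil de prud'hommes": "Droit du travail",  # French labor court
}

def rule_based_label(decision):
    """Return a domain label when a reliable court-based rule applies,
    None otherwise (the decision is then left to humans or the model)."""
    return COURT_TO_DOMAIN.get(decision.get("court"))

print(rule_based_label({"court": "Conseil de prud'hommes"}))  # Droit du travail
print(rule_based_label({"court": "Cour de cassation"}))       # None
```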

Comparison between models and discussion

We achieved the same performance with our legal camemBERT and with a simple TF-IDF, which is actually good news! We indeed didn’t spend a lot of time on the modeling part of camemBERT, and this classification task is in the end a rather simple NLP task.

Moreover, and perhaps just as interestingly, a qualitative analysis of the models’ prediction errors showed that the errors of the simple model were more often out of context: when the TF-IDF gets it wrong, it’s really way off the mark. For example, this decision is predicted as Droit du transport with a probability of 0.96, instead of Droit des assurances, because the decision is about a vehicle insurance claim and contains a lot of vocabulary related to transportation, and not that much about insurance.

On the other hand, the legal camemBERT can of course be wrong, but it never steers too far out of context: looking at the confusion matrix, it mostly confuses subdomains that are very close, like Droit immobilier et de la construction and Droit de la copropriété et de la propriété immobilière.

Moreover, CamemBERT managed to predict some subdomains that were not obvious at all, even for humans. For example, this decision has been predicted as Divorce et séparation de corps without any explicit mention of the word divorce in the decision! The subdomain here is very implicit, implied only by a mention of a father who has to pay alimony to the mother of his child.

Let’s now have a look at the attention weights of our modeling. Here are some examples below:

Paragraph with the highest attention score (0.34) for the prediction of https://www.doctrine.fr/d/CA/Reims/2008/SK60FC7292250FC0B001E6 as Divorce et Séparation de corps
Paragraph with the highest attention score (0.26) for the prediction of https://www.doctrine.fr/d/CA/Rouen/2016/1F43DFAE32435B18DC90 as Droit des étrangers et de la nationalité

These attention scores make complete sense and confirmed the approach.

We also confirmed that paragraphs related to generic procedures had a very low attention weight, like this one:

Paragraph with a very low attention weight of 0.01 for the prediction of https://www.doctrine.fr/d/CA/Rouen/2016/1F43DFAE32435B18DC90 as Droit des étrangers et de la nationalité

Finally, when we had a look at the errors of both models, we quickly noticed that some classes were very well predicted while others were not. Our intuition about this discrepancy boils down to the fact that language models are only ever as good as their training dataset; in our case, the issue seems to stem from the volume of, and errors in, the training set. This is definitely the next priority for this task, before playing with different architectures. Indeed, the current one seems to work pretty well on subdomains for which the training dataset is satisfactory.


We built a legal language model with a state-of-the-art technique that proved to be very efficient at capturing highly relevant information on a simple classification task. This is a huge step for Doctrine, as we have a lot of very complex Natural Language Processing tasks to tackle! The granularity of this new language model, which can seamlessly provide token, paragraph and document embeddings, will be key for us to find new applications on a wide array of complex Natural Language Processing tasks at Doctrine.

In fact, the legal camemBERT has already found a second problem to tackle: semantic similarity between users and legal content, in the context of a recommendation system. It already seems to have yielded promising results, which we’ll share in an upcoming blog post. Stay tuned!