Semantic recommendation system using CamemBERT
Doctrine is a legal intelligence platform that gives its users easier access to a wide variety of legal content: court decisions, regulatory content (laws, ordinances, decrees…), parliamentary documents and doctrinal content published by third-party websites. A large quantity of each of these types of content is produced every year. In France in 2019, for instance, the various jurisdictions handed down more than 3.3 million court decisions, and several tens of thousands of regulatory and legislative texts were published in the “Journal Officiel”.
With this continuous flow of new content, law professionals have to keep their knowledge of their areas of law up to date in order to defend their clients with the most relevant and current arguments. Manually selecting what matters in their domains from this mass of information is clearly difficult. That is why Doctrine offers automatic, personalized legal monitoring tools that select relevant legal content in each user’s domains, so that they do not miss valuable information in this deluge of data. These tools take the form of emails containing the recommended content. In this article, we focus on the weekly email, which contains recommendations of doctrinal content, also called doctrinal articles. For the avoidance of doubt, it is necessary to specify that these doctrinal articles are written by third-party authors and accessible only on third-party websites. Doctrine acts as a search engine and recommendation system for these doctrinal articles, but does not provide direct access to them.
Let’s see how the prior recommendation system worked and what its limitations were.
Prior recommendation system
The prior recommendation system was pretty simple. This is how it worked:
- Index the new documents to recommend using the indexing engine Elasticsearch.
- Extract the 100 keywords and 100 bigrams with the highest TF-IDF scores from the court decisions and doctrinal articles the user has read. This is where the personalization comes in. For the avoidance of doubt, “to read” a doctrinal article here means “has clicked a link and was redirected to the third party website that publishes said doctrinal article”.
- Build an Elasticsearch query from these keywords and bigrams, each weighted by its normalized TF-IDF score: the higher the score, the higher the weight.
- Run the query in Elasticsearch to get the 10 documents with the highest matching score.
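The query-building step above can be sketched as follows. This is only an illustration under stated assumptions: the field name `content`, the choice of `match_phrase` clauses, and the normalization by the maximum score are all hypothetical, since the article does not describe the actual query or how the TF-IDF scores were normalized.

```python
from typing import Dict


def build_weighted_query(term_scores: Dict[str, float],
                         field: str = "content",
                         size: int = 10) -> dict:
    """Build an Elasticsearch search body whose clauses are boosted by TF-IDF.

    `term_scores` maps a keyword or bigram to its TF-IDF score.
    The field name and the max-normalization are illustrative assumptions.
    """
    if not term_scores:
        raise ValueError("at least one term is required")
    max_score = max(term_scores.values())
    if max_score <= 0:
        raise ValueError("TF-IDF scores must be positive")

    should = []
    ranked = sorted(term_scores.items(), key=lambda kv: kv[1], reverse=True)
    for term, score in ranked:
        boost = score / max_score  # the top term gets a boost of 1.0
        # match_phrase works for single keywords as well as bigrams
        should.append({"match_phrase": {field: {"query": term, "boost": boost}}})

    # `size` limits the response to the best-matching documents
    return {"size": size, "query": {"bool": {"should": should}}}


query = build_weighted_query({"contrat": 4.0, "préjudice corporel": 2.0})
```

The resulting dictionary could then be passed as the body of a search request. A document matching several high-weight terms accumulates a higher score than one matching a single low-weight term.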
Despite the simplicity of this system, we had pretty good results in terms of email click-through rate, i.e. the proportion of users who clicked on at least one of the recommended items among the users who received the email, which suggests that the recommendations were relevant. Indeed, our click-through rate was between 6% and 7%, versus an average of 2.6% aggregated across different industries by MailChimp in this study.
However, we were aware of the limitations of this recommendation system (we will call it “syntactic recommendation system” or “syntactic RS” in the following sections):
- The syntactic RS is based on a list of the 100 keywords and bigrams with the highest TF-IDF score, so we lose the information that was contained in the keywords that are not recurring enough to end up in this list.
- The syntactic RS only does keyword matching and manages stemmed versions of these keywords, but it does not handle the semantics behind the words. Thus, for a user who has the keyword “contract”, we will not be able to recommend documents containing only synonyms of “contract” such as “agreement”, even though these documents would have been relevant.
- The syntactic RS cannot be quickly extended to recommend a new type of content. We can re-use the keywords and bigrams already extracted for the users, but we may need to enrich that list with terms that appear only in the new type of content. Indeed, a new type of content does not necessarily use the same vocabulary as the other types of content (this aspect is highlighted in this other article, written in French by my colleagues Adèle and Pauline for non-technical people, about the challenges of automatically understanding legal language). Besides this time-consuming step, we have to index the new content in Elasticsearch and design the query that will assign a score to content based on matches. All these stages can easily take several weeks, which makes iterations slow.
To address these limitations, we decided to develop a semantic-based recommendation system (we will refer to it as “semantic RS” in the following sections), which should have the following characteristics:
- It should be based on the semantics of the documents.
- It should be generic enough to be used for recommending any type of content (regulation texts, court decisions, doctrinal articles etc…).
- It should provide recommendations at least as relevant as those of the syntactic RS described above, which we want to replace. We actually hope to get better recommendations thanks to the semantic aspects.
To represent our textual content semantically, we use the French language model CamemBERT, which was, to our knowledge, the state of the art for French at the time we worked on this subject. We did not use the original model but an internally trained one, for which we used a dataset of court decisions to make the model learn the legal vocabulary. If you are interested in how it was trained, you can refer to this excellent article written by my colleague Pauline.
This model is used to build vector representations of the users, through the documents they have read, on the one hand, and of the documents to recommend on the other hand. Then, using a similarity measure between these representations, we can obtain the list of the most similar documents for each user.
How to represent a user?
With the goal of building a semantic recommendation system, the first step consists in obtaining a vector representation of a user from their activity on Doctrine, in particular from the court decisions and doctrinal articles they have read. We experimented with several approaches to obtain this representation, and two of them were eventually tested in production.
A first method based on keywords/bigrams representation
The first method we tried consists in using the keywords and bigrams that we have already extracted for the syntactic RS. An advantage of this approach is that we can quickly test the relevance of the vector representation thus obtained because we already have these keywords and bigrams without any additional computations.
In practice, for each user, we have a list of 100 keywords and a list of 100 bigrams obtained by retaining the highest TF-IDF scores over all the content read by that user. The advantage of a semantic representation of words is that a limited number of keywords and bigrams is sufficient to capture the information that represents the user’s interests. It also turned out that using all the keywords and bigrams at our disposal gave qualitatively poorer representations. It seems that terms with lower scores bring more noise than information.
Thus, we retained the 20 keywords and the 10 bigrams with the highest TF-IDF scores for each user. Note that we use the unstemmed versions of the keywords and bigrams. We give CamemBERT the list of 20 keywords on the one hand and the list of 10 bigrams on the other hand, each separated by commas (it’s a “hack” to simulate the structure of a sentence, as BERT is more used to data with context). An example of a keyword sequence would be “victime, pénal, atteinte, …” and an example of a bigram sequence would be “préjudice corporel, infraction pénal, …”. For each of the two sequences, we separately take as the embedding the representation of the [CLS] token (the `<s>` start token in CamemBERT’s vocabulary) in the last hidden state of CamemBERT. The user is then represented by these 2 embeddings.
This first method is not optimal, but the idea was to quickly test the whole production pipeline and get user feedback on the recommended items.
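As a rough sketch of this step, the snippet below builds the two comma-separated sequences and shows how the first-token vector could be read from the last hidden state with the Hugging Face `transformers` library. The public `camembert-base` checkpoint is only a stand-in for the internally trained legal model, and the helper names `to_sequence`, `load_camembert` and `embed_first_token` are hypothetical.

```python
from typing import List


def to_sequence(terms: List[str], limit: int) -> str:
    """Join the `limit` best-ranked terms with commas to mimic a sentence."""
    return ", ".join(terms[:limit])


def load_camembert(model_name: str = "camembert-base"):
    """Load tokenizer and model once (assumption: public checkpoint as stand-in)."""
    # Imported lazily so that to_sequence works without these packages.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    return tokenizer, model


def embed_first_token(text: str, tokenizer, model) -> List[float]:
    """Return the last-hidden-state vector of the first token (<s>)."""
    import torch

    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (batch, tokens, hidden size)
    return hidden[0, 0].tolist()


# Terms are assumed to be already sorted by decreasing TF-IDF score.
keywords = ["victime", "pénal", "atteinte"]
bigrams = ["préjudice corporel", "infraction pénal"]
keyword_sequence = to_sequence(keywords, limit=20)
bigram_sequence = to_sequence(bigrams, limit=10)

# Loading the model requires the checkpoint to be available, so it is left commented:
# tokenizer, model = load_camembert()
# user_vectors = [embed_first_token(keyword_sequence, tokenizer, model),
#                 embed_first_token(bigram_sequence, tokenizer, model)]
```

Loading the tokenizer and model once and re-using them for every user avoids paying the loading cost for each sequence.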
A second method based on read documents representation
The second method consists in directly representing the content of documents read by the user on Doctrine.
We chose to favor doctrinal articles over court decisions in this approach. Doctrinal articles tend to address their subject directly, whereas decisions contain more formalism, which could add noise to the resulting vector representation. In addition, recall that we must obtain a vector representation for a hundred thousand users, and that a user may have read several hundred different documents, so we also have to find a trade-off in which documents we represent, and how many, in order to keep computation time reasonable. For the avoidance of doubt, “to read” a doctrinal article here means “has clicked a link and was redirected to the third party website that publishes said doctrinal article”.
Below are the two possible cases for a user:
- For users who have read doctrinal articles, we use only these articles to represent their interests. An interesting feature of doctrinal articles is that their title describes very precisely the subject covered in the body, so we chose to represent a doctrinal article by its title only. This lets us use every doctrinal article the user has read while keeping computation time reasonable. We use CamemBERT to obtain the vector representation of each title, then average these vectors to obtain the representation of the user.
- For users who haven’t read doctrinal articles, we use the 50 most important court decisions according to the user’s reading time and, if the user is a lawyer, the 50 most recent court decisions pleaded by the user. Decisions are usually very long, so for the sake of computation time, we restricted each decision to 10 paragraphs in the middle of the content. We are fully aware that we could target even more relevant information by retrieving the most relevant parts of a court decision, for example the “Grounds” section, which is the most interesting part, by using the results of what has been done in this article.
Each paragraph is limited to its first 512 tokens, as this is the maximum sequence length for BERT models. The vector representation of the user is then the average of the representations of the paragraphs.
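The two cases above boil down to embedding a list of texts (titles or paragraphs) and averaging the resulting vectors. Here is a minimal sketch in plain Python, where `embed` is a hypothetical callable standing in for CamemBERT (assumed to truncate its input to 512 tokens), and where the way the 10 middle paragraphs are picked is our assumption, as the article does not detail it.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Component-wise average of vectors of equal dimension."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def middle_paragraphs(paragraphs: Sequence[str], count: int = 10) -> List[str]:
    """Keep `count` consecutive paragraphs centered in the decision."""
    start = max(0, (len(paragraphs) - count) // 2)
    return list(paragraphs[start:start + count])


def represent_user(texts: Sequence[str], embed: Callable[[str], Vector]) -> Vector:
    """Average the embeddings of the titles (or paragraphs) read by a user."""
    return mean_vector([embed(text) for text in texts])


def toy_embed(text: str) -> Vector:
    """Toy stand-in embedder, for illustration only: (text length, constant)."""
    return [float(len(text)), 1.0]


user_vector = represent_user(["ab", "abcd"], toy_embed)
```

For a user without doctrinal articles, the texts passed to `represent_user` would be the paragraphs returned by `middle_paragraphs` for each selected decision.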
To assess the quality of the representations obtained, we looked at the 5 closest users, by cosine similarity, of certain users whose preferred domains are known. Both methods turned out to represent users well enough to find people with similar interests among these closest neighbors.
How to represent documents to recommend?
For the vector representation of recommendable documents, we take an approach similar to the second method used for user representation. In the case of doctrinal articles, the volume is small (several hundred per week), so computation is less of an issue than for user representation and we are not restricted to the article’s title. We were therefore able to test several choices of content for the vector representation. Let’s say the article has a title and a content (certain publishers do not allow their articles’ content to be indexed, in which case Doctrine cannot take said content into account):
- Represent the doctrinal article with the vectorial representation of the sequence made up of the first 512 tokens from the content
- Split the doctrinal article’s content into paragraphs of 512 tokens, represent each of them separately and take the average of the representations
- Represent the doctrinal article by representing only its title
- Represent the doctrinal article by the average of the representations of the title and of the paragraphs of 512 tokens, where the title has exactly the same weight as a paragraph
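The last option in the list above can be sketched as follows: the content is split into chunks of 512 tokens, and the title vector is averaged together with the chunk vectors with equal weights. The tokens and vectors are placeholders here, and the function names are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def chunk_tokens(tokens: Sequence, size: int = 512) -> List[list]:
    """Split a token sequence into consecutive chunks of at most `size` tokens."""
    return [list(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def represent_article(title_vector: Vector,
                      paragraph_vectors: Sequence[Vector]) -> Vector:
    """Average the title and paragraph vectors, each with the same weight."""
    vectors = [title_vector] + list(paragraph_vectors)
    dim = len(title_vector)
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# A 1100-token article is split into 3 chunks: 512, 512 and 76 tokens.
chunks = chunk_tokens(list(range(1100)))

# With one title vector and two paragraph vectors, each weighs one third.
article_vector = represent_article([0.0, 0.0], [[3.0, 3.0], [3.0, 6.0]])
```

With a real tokenizer, special tokens also count toward the 512-token limit, so the usable chunk size may be slightly smaller. When a publisher does not allow its content to be indexed, `paragraph_vectors` is empty and the article is represented by its title alone.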
To assess the quality of the representations obtained, we took as a test dataset a set of 200 doctrinal articles whose law area we know, and represented them using each of the previous methods. We then qualitatively analyzed the representations by retrieving, for each article, its 5 nearest neighbors by cosine similarity within the set of 200 doctrinal articles. The last method (the average of the representations of the title and paragraphs) gave the most relevant results, i.e. an article’s closest neighbors deal more or less with the same subject. We therefore decided to use it to represent the doctrinal articles to recommend.
How to make recommendations?
Once we get the vector representation of the user on the one hand and the recommendable doctrinal articles on the other hand, we can start the actual recommendation. The obtained representations are in the form of vectors of dimension 768. The recommendation process is as follows for a given user:
- Take the cosine similarity between the vector representing the user (in the case of the representation with keywords and bigrams, we use the average of the 2 representations) and the vectors of the doctrinal articles that we want to recommend. We are using the cosine similarity as it is the commonly used measure for semantic textual similarity tasks, see papers related to these tasks.
- Recommend the N doctrinal articles with the highest similarity score.
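The two steps above amount to ranking articles by cosine similarity with the user vector. A minimal sketch in plain Python, using 2-dimensional toy vectors instead of the real 768-dimensional ones, with hypothetical article identifiers:

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if one of them is null)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(user_vector: Vector,
              article_vectors: Dict[str, Vector],
              n: int = 10) -> List[str]:
    """Return the ids of the `n` articles most similar to the user."""
    ranked = sorted(
        article_vectors.items(),
        key=lambda item: cosine_similarity(user_vector, item[1]),
        reverse=True,
    )
    return [article_id for article_id, _ in ranked[:n]]


articles = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
top_two = recommend([1.0, 0.0], articles, n=2)
```

For the keywords and bigrams variant, `user_vector` would be the average of the 2 user embeddings, as described above.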
How to measure the relevancy of the generated recommendations?
In order to iterate on our methodology, we looked at some offline metrics. We have for example checked the diversity of recommended doctrinal articles by looking at the proportion of users receiving each of the recommendable doctrinal articles (see the figure below). We can see that for the semantic recommendation system (we used the one where the user is represented using the documents), the recommendations are well distributed, so no red flag in terms of recommendation diversity.
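The diversity check described above, i.e. the proportion of users receiving each recommendable article, could be computed along these lines. The user and article identifiers are made up for the example.

```python
from collections import Counter
from typing import Dict, List


def article_reach(recommendations: Dict[str, List[str]]) -> Dict[str, float]:
    """Proportion of users who receive each recommended article.

    `recommendations` maps a user id to the list of recommended article ids.
    """
    n_users = len(recommendations)
    if n_users == 0:
        return {}
    counts = Counter(
        article
        for articles in recommendations.values()
        for article in set(articles)  # count an article once per user
    )
    return {article: count / n_users for article, count in counts.items()}


reach = article_reach({
    "u1": ["a", "b"],
    "u2": ["a", "c"],
    "u3": ["a", "b"],
    "u4": ["d", "b"],
})
```

A handful of articles with a reach close to 1 while most others are near 0 would indicate that everyone receives the same recommendations.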
But we mainly checked manually whether the recommendations were relevant, by simulating recommendations with the semantic system for a list of users whose law areas we know, then checking that these recommendations covered those law areas.
The reason why we did not spend more time on offline metrics is that we know that the ground truth for relevancy cannot be found in offline metrics, but in the users’ future interactions with these new recommendations. Therefore, we planned from the beginning to run an A/B test. In an A/B test, we show a new feature to a certain proportion of the users and the old one to the remaining users. In our case, the old version, or control, is the syntactic RS and the variant is the semantic RS.
In the case of the semantic RS, we tested the 2 variants of the representation for the user described previously, which are the representation based on keywords and bigrams extracted from the content (referred to as “semantic RS keywords” below) and the representation based on the content directly (referred to as “semantic RS documents” below).
As we said before, the recommendations are sent in a weekly email. That’s why we are mainly interested in 2 classic email metrics:
- The open rate, which is the proportion of users who opened the email containing the recommendations
- The click-through rate on opened emails referred to as the click-through rate, which is the proportion of users who clicked on at least one recommendation over users who opened the email and who were therefore able to read the recommendations
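As a small worked example of these two definitions, with made-up counts:

```python
from typing import Tuple


def email_metrics(sent: int, opened: int, clicked: int) -> Tuple[float, float]:
    """Return (open rate, click-through rate on opened emails).

    `sent`: users who received the email.
    `opened`: users who opened it.
    `clicked`: users who clicked on at least one recommendation.
    """
    if sent <= 0 or opened <= 0:
        raise ValueError("sent and opened must be positive")
    open_rate = opened / sent
    click_through_rate = clicked / opened  # denominator is opens, not sends
    return open_rate, click_through_rate


# Illustrative counts only, not Doctrine's actual figures.
open_rate, click_through_rate = email_metrics(sent=1000, opened=400, clicked=50)
```

With these counts, 40% of the recipients opened the email and 12.5% of those who opened it clicked on at least one recommendation.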
The A/B test was conducted on 2 emails sent in 2 different weeks. The first email pitted the syntactic RS against the semantic RS keywords, and the second pitted the same syntactic system against the semantic RS documents. In both cases, the semantic system was tested on 10% of our users, while the remaining 90% received the recommendations from the syntactic system. We therefore randomly sampled 10% of our user base to form the so-called group B, which receives recommendations from the semantic system. Group A is formed from the rest of the users and receives recommendations from the syntactic system.
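The random split into groups A and B can be sketched as follows. The fixed seed is our own assumption, added so that the same groups can be reproduced from one email to the next.

```python
import random
from typing import List, Sequence, Tuple


def split_users(user_ids: Sequence[int],
                variant_share: float = 0.10,
                seed: int = 42) -> Tuple[List[int], List[int]]:
    """Randomly assign `variant_share` of the users to group B, the rest to A."""
    rng = random.Random(seed)  # local generator, so the split is reproducible
    ids = list(user_ids)
    variant_size = round(len(ids) * variant_share)
    variant = set(rng.sample(ids, variant_size))
    group_a = [user for user in ids if user not in variant]
    group_b = [user for user in ids if user in variant]
    return group_a, group_b


group_a, group_b = split_users(range(1000))
```

Calling `split_users` again with the same seed and the same user list returns the same two groups.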
After sampling our two groups, we wanted to check on historical data that they were homogeneous in terms of open rate and click-through rate, to avoid possible sampling bias. We therefore measured these metrics on a previous email for which the syntactic system was used for all users, and we obtained these results:
There is no significant difference on these 2 metrics: the sampled groups are homogeneous and do not contain any prior bias.
- A/B test between syntactic RS and semantic RS keywords
For this first test, we sent the recommendations obtained via the syntactic system to 90% of the users and those obtained via the semantic system keywords to the remaining 10%. We let users interact with the emails they received, and after a few days we measured open rate and click-through rate. The results obtained are presented in the following table:
Thus, the semantic system in which the user representation is based on keywords provides recommendations that are less relevant than those of the syntactic system. We indeed measured a statistically significant drop in click-through rate of 11.5%. We concluded that representing users through their keywords and bigrams is not good enough. This is not surprising: BERT provides contextual representations and therefore works best when representing entire passages of text, such as paragraphs. Our second A/B test confirms this.
- A/B test between syntactic RS and semantic RS documents
The second test was carried out the week after the first one, as the results of the first test were not satisfactory. This time, we tested the semantic RS documents against the syntactic system. The same user samples were used.
We measured an increase in open rate of 5.1% and an increase in click-through rate of 11.8% for users who received recommendations from the semantic RS documents compared with those who received recommendations from the syntactic system. These changes are statistically significant with a confidence greater than 99%. The semantic system that represents the user directly from the content they read is therefore more relevant than the prior system. The increase in click-through rate naturally reflects better relevance. In our case, so does the increase in open rate: the title of the emails we send is built from the titles of the recommendations, so we can infer that more relevant titles prompt users to open more emails.
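The article does not say which statistical test was used. One common choice for comparing two rates is a two-proportion z-test, sketched below with made-up counts (not Doctrine’s actual figures). The helper `relative_change` assumes that the reported increases are relative changes, which the article does not state either.

```python
import math
from typing import Tuple


def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int) -> Tuple[float, float]:
    """Two-sided z-test for the difference between two proportions.

    Returns (z statistic, p-value). Group A is the control, group B the variant.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value


def relative_change(rate_a: float, rate_b: float) -> float:
    """Relative change of the variant rate with respect to the control rate."""
    return (rate_b - rate_a) / rate_a


# Made-up counts: 100 clickers out of 1000 openers (A) vs 200 out of 1000 (B).
z, p_value = two_proportion_z_test(100, 1000, 200, 1000)
significant_at_99 = p_value < 0.01
```

A p-value below 0.01 corresponds to the “confidence greater than 99%” wording used above.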
Finally, after this successful second A/B test, we deployed the solution to all the users.
In this article, we have detailed how we developed a semantic recommendation system that addresses the limitations of the prior recommendation system. This new system captures the semantics of the documents to recommend, leading to more relevant recommendations, as evidenced by the results of A/B testing. It can also be re-used more easily for other types of content: we only have to build vector representations of the new types of content to recommend and compute similarities with the existing user representations.
We also identified opportunities to improve this semantic recommendation system even further. As shown in the paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, directly averaging the embeddings from a BERT model often gives poorer sentence representations than averaging GloVe embeddings. So, by fine-tuning our CamemBERT, already specialized in legal language, on semantic similarity tasks, we can expect to obtain even more relevant representations and therefore more relevant recommendations.