
Semantic recommendation system using CamemBERT

Example of a sent email with recommendations of doctrinal articles

Prior recommendation system

  1. Index new documents to recommend using the indexing engine Elasticsearch.
  2. Extract the 100 keywords and 100 bigrams with the highest TF-IDF scores from the court decisions and doctrinal articles the user has read; this is where the personalization comes in. For the avoidance of doubt, “to read” a doctrinal article here means the user has clicked a link and was redirected to the third-party website that publishes said doctrinal article.
  3. Build an Elasticsearch query using the list of keywords and bigrams weighted by their normalized TF-IDF score: the higher the score, the higher the weight (see the sketch after the schema below).
  4. Run the query in Elasticsearch to get the 10 documents with the highest matching score.
Schema of the prior recommendation system
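To make steps 2–4 more concrete, here is a minimal sketch of how such a weighted query could be built with the elasticsearch Python client. The index name, field name and example terms are illustrative assumptions, not Doctrine’s actual configuration.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster, for illustration

def build_weighted_query(weighted_terms, size=10):
    """Step 3: one match clause per keyword/bigram, boosted by its
    normalized TF-IDF score (the higher the score, the higher the weight)."""
    should = [
        {"match": {"content": {"query": term, "boost": weight}}}
        for term, weight in weighted_terms.items()
    ]
    return {"size": size, "query": {"bool": {"should": should}}}

# Terms extracted from the user's reading history (step 2), with normalized weights.
weighted_terms = {"contrat": 1.0, "bail commercial": 0.8, "résiliation": 0.6}
response = es.search(index="doctrinal_articles", body=build_weighted_query(weighted_terms))
top_10 = [hit["_source"] for hit in response["hits"]["hits"]]  # step 4
```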
  • The syntactic RS is based on the list of the 100 keywords and bigrams with the highest TF-IDF scores, so we lose the information contained in the keywords that do not recur often enough to make it into this list.
  • The syntactic RS only does keyword matching (it also handles stemmed versions of these keywords), but it does not capture the semantics behind the words. Thus, for a user whose profile contains the keyword “contract”, we will not be able to recommend documents containing only synonyms of “contract”, such as “agreement”, even though those documents would have been relevant.
  • The syntactic RS cannot be quickly extended to the recommendation of new types of content. We can re-use the list of keywords and bigrams already extracted for the users, but we may need to enrich it with terms that only appear in the new type of content. Indeed, a new type of content does not necessarily use the same vocabulary as the other types (this aspect is highlighted in another article, written in French for non-technical readers by my colleagues Adèle and Pauline, about the challenges of automatically understanding legal language). Besides this time-consuming step, we have to index the new contents in Elasticsearch and design the query that assigns a score to each content based on matches. All these stages can easily take several weeks, making iterations slow.
Given these limitations, we defined the following requirements for the new recommendation system:
  • It should be based on the semantics of the documents.
  • It should be generic enough to be used for recommending any type of content (regulation texts, court decisions, doctrinal articles, etc.).
  • It should provide recommendations at least as relevant as those of the syntactic RS described above, which it is meant to replace. We actually hope to get better recommendations thanks to the semantic aspect.

Methodology

Schema of the semantic recommendation system

How to represent a user?

A first method based on keywords/bigrams representation

A second method based on read documents representation

  • For users who have read doctrinal articles, we only use those articles to represent their interests. An interesting feature of doctrinal articles is that their title describes very precisely the subject covered in the body of the content, so we chose to represent a doctrinal article through its title only. This has the advantage of keeping all the information provided by the doctrinal articles while keeping a reasonable computation time. We use CamemBERT to obtain the vector representation of each title, then average the resulting vectors to obtain the representation of the user (see the sketch after this list).
  • For users who haven’t read doctrinal articles, we use the 50 court decisions on which the user has spent the most reading time and, if the user is a lawyer, the 50 most recent court decisions they have pleaded. Decisions are usually very long, so for the sake of computation time we restricted each decision to 10 paragraphs from the middle of the content. We are fully aware that we could target even more relevant information by retrieving the most relevant parts of a court decision, for example the “Grounds” section, which is the most interesting part, by re-using the results of what has been done in this article.
    Each paragraph is limited to its first 512 tokens, as this is the maximum sequence length for BERT models. The vector representation of the user is then the average of the representations of the paragraphs.
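As a rough illustration of this documents-based representation, the sketch below builds a user vector by averaging CamemBERT embeddings of the titles of the doctrinal articles the user has read. The camembert-base checkpoint and the mean pooling over token embeddings are our own assumptions; the article does not specify the exact model variant or pooling strategy.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# camembert-base is an assumption; the exact checkpoint used is not specified.
tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModel.from_pretrained("camembert-base")
model.eval()

def embed(texts, max_length=512):
    """Return one vector per text via mean pooling over CamemBERT token embeddings.
    512 tokens is the maximum sequence length mentioned above."""
    inputs = tokenizer(texts, padding=True, truncation=True,
                       max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    mask = inputs["attention_mask"].unsqueeze(-1)
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    return summed / mask.sum(dim=1)

# A user who has read doctrinal articles is the average of the title vectors.
titles = ["La résiliation du bail commercial", "Le préjudice d'anxiété"]
user_vector = embed(titles).mean(dim=0)
```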

How to represent documents to recommend?

  • Represent the doctrinal article by the vector representation of the sequence made up of the first 512 tokens of its content
  • Split the doctrinal content into paragraphs of 512 tokens, represent each of them separately, and take the average of the representations
  • Represent the doctrinal article by representing only its title
  • Represent the doctrinal article by the average of the representations of the title and of the 512-token paragraphs, where the title has exactly the same weight as a paragraph (see the sketch below)
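As an example, the fourth strategy could be sketched as follows, re-using the embed helper and tokenizer from the previous snippet; the chunking by raw token ids is an illustrative simplification, not necessarily how the authors split the content.

```python
def represent_article(title, content, chunk_tokens=512):
    """Fourth strategy above: average the representation of the title and of each
    512-token chunk of the body, the title weighing exactly as much as a chunk."""
    token_ids = tokenizer(content, add_special_tokens=False)["input_ids"]
    chunks = [tokenizer.decode(token_ids[i:i + chunk_tokens])
              for i in range(0, len(token_ids), chunk_tokens)]
    vectors = embed([title] + chunks)  # embed() from the previous sketch
    return vectors.mean(dim=0)
```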

How to make recommendations?

  • Compute the cosine similarity between the vector representing the user (in the case of the keywords-and-bigrams representation, we use the average of the two representations) and the vectors of the doctrinal articles we want to recommend. We use cosine similarity as it is the measure commonly used for semantic textual similarity tasks; see the papers related to these tasks.
  • Recommend the N doctrinal articles with the highest similarity score (a minimal sketch follows this list).
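A minimal sketch of this scoring step, assuming the user and article vectors have already been computed as above (the function and variable names are illustrative):

```python
import numpy as np

def recommend(user_vector, article_vectors, article_ids, n=10):
    """Rank doctrinal articles by cosine similarity to the user vector
    and return the N most similar ones."""
    user = user_vector / np.linalg.norm(user_vector)
    articles = article_vectors / np.linalg.norm(article_vectors, axis=1, keepdims=True)
    scores = articles @ user  # cosine similarity for each article
    best = np.argsort(-scores)[:n]
    return [(article_ids[i], float(scores[i])) for i in best]
```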

How to measure the relevancy of the generated recommendations?

Distribution of the proportion of users receiving the 10 most recommended articles

A/B testing

  • The open rate, which is the proportion of users who opened the email containing the recommendations
  • The click-through rate on opened emails, referred to simply as the click-through rate, which is the proportion of users who clicked on at least one recommendation among the users who opened the email and were therefore able to see the recommendations (both metrics are spelled out in the sketch below)
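In code, the two metrics amount to simple ratios (hypothetical helper names, shown only to pin down the definitions):

```python
def open_rate(n_opened_emails, n_sent_emails):
    # Proportion of users who opened the email containing the recommendations.
    return n_opened_emails / n_sent_emails

def click_through_rate(n_users_who_clicked, n_opened_emails):
    # Proportion of openers who clicked on at least one recommendation.
    return n_users_who_clicked / n_opened_emails
```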
A/A testing results for open rate and click-through rate
  • A/B test between the syntactic RS and the semantic RS with the keywords representation
A/B testing results for the semantic RS with the keywords representation for users
  • A/B test between the syntactic RS and the semantic RS with the documents representation
A/B testing results for the semantic RS with documents representation for users

Conclusion
