Last year, SIGIR, a top international conference on Information Retrieval (IR), took place in Paris. It was thus a tremendous opportunity for Doctrine to stay up to date in the fast-moving field of search, the main feature of our platform. In this article, we describe how research impacts Doctrine (and vice versa), then progressively dive into the technical aspects of the conference we appreciated.
Doctrine’s activity in research
Besides attending many very interesting talks, a selection of which we shortlist further down in this article, we also had the opportunity to give an invited talk in the industry track (the video of the talk can be downloaded on this page). At Doctrine, the data science team frequently gathers to discuss new or state-of-the-art approaches. This allows us to keep proposing cutting-edge features to our customers: we apply the current best algorithms in the field and adapt them to the specific domain of legal research.
Still, beyond just reviewing papers together, we also try to attend international conferences like SIGIR, because we believe attending is far more proactive and motivating than just reading research outputs. It's also a fantastic opportunity to share our methods with others and to better understand a paper's authors through their presentation and/or follow-up discussions. Because SIGIR has a pretty dense program (with up to 4 parallel sessions), we had 3 of our Data Scientists attend it: Nicolas Fiorini, Geoffroy Nicolas and Nathan Tedgui.
The next step for us is to publish a peer-reviewed article on one of our numerous innovative approaches, but that will be the topic of another post. In the meantime, we continue to participate in various international events; for example, we gave a talk at Search Solutions 2019 in November. This year, we will present some of the search challenges we face during the Industry Day at ECIR 2020. Feel free to join and meet us there!
IR trends and lessons learned
Before getting to our paper selection, we think it's worth sharing our understanding of where the field stands and where it's going. Note, however, that we focus on AI only; this article won't cover user studies despite their value. The field is currently split into two major subfields:
- Traditional IR: this includes more or less traditional ranking models such as BM25 along with their tweaks and improvements, as well as machine learning (ML) approaches like learning-to-rank (LTR), classification, etc.
- Neural IR: this one is quite self-descriptive and covers all attempts to apply deep neural networks to search, along with their scalability and performance (especially in comparison with more traditional approaches).
Surprisingly, while we expected the domain to be shifting towards neural IR, we attended many talks on more traditional, ML-based methods. To us, this means gradient-boosted tree approaches are still very well represented in production systems. That is understandable: they are fast, scalable approaches with strong performance, which makes them good production candidates. They are, in fact, what we also use for search at Doctrine. In particular, LambdaMART still seems to drive a lot of the research, although it was released more than 10 years ago. The best paper award (see further down) went to a gradient exploration optimization that could make training LambdaMART faster in an online learning setup.
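To make this concrete, here is a minimal pure-Python sketch of the "lambda" gradients at the heart of LambdaRank/LambdaMART: pairwise RankNet-style gradients, weighted by how much NDCG would change if the two documents swapped positions. This is an illustrative simplification of the general technique (the function names are ours, and a real implementation would feed these lambdas into gradient-boosted trees), not Doctrine's production code:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for a ranked list of relevance labels."""
    return sum((2**rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances))

def lambda_gradients(scores, relevances):
    """Pairwise "lambda" gradients, weighted by |delta-NDCG| as in LambdaRank.

    For each pair (i, j) with rel_i > rel_j, the gradient pushes score_i up
    and score_j down, scaled by how much NDCG would change if the two
    documents swapped positions in the current ranking.
    """
    n = len(scores)
    # current ranking induced by the model scores (best first)
    order = sorted(range(n), key=lambda k: -scores[k])
    rank_of = {doc: pos for pos, doc in enumerate(order)}
    ideal = dcg(sorted(relevances, reverse=True)) or 1.0
    lambdas = [0.0] * n
    for i in range(n):
        for j in range(n):
            if relevances[i] <= relevances[j]:
                continue
            ri, rj = rank_of[i], rank_of[j]
            # |delta-NDCG| if documents i and j swapped positions
            gain_diff = (2**relevances[i] - 1) - (2**relevances[j] - 1)
            disc_diff = 1 / math.log2(ri + 2) - 1 / math.log2(rj + 2)
            delta_ndcg = abs(gain_diff * disc_diff) / ideal
            # RankNet-style sigmoid on the score difference
            rho = 1.0 / (1.0 + math.exp(scores[i] - scores[j]))
            lambdas[i] += rho * delta_ndcg
            lambdas[j] -= rho * delta_ndcg
    return lambdas
```

A document that is highly relevant but currently scored low accumulates a large positive lambda, so the next boosting iteration is pushed to raise its score.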
Online learning. That's a phrase we also heard quite a bit this year. The main motivation for online LTR is that implicit user feedback (clicks, success events, etc.) is automatically generated and exploitable. The assumption is: if we can train a ranking model while our users use the system, we will be more flexible (if trends change, for example) and will converge towards a good model more quickly than by running the system live for months and deriving a training set from query logs alone.
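A classic instance of this idea is Dueling Bandit Gradient Descent (Yue & Joachims, 2009): perturb the current ranker in a random direction, show users an interleaving of both rankings, and step towards the perturbed ranker if users prefer its results. The sketch below is our own simplified illustration; in particular, `clicks_prefer_candidate` is a hypothetical callback abstracting the interleaving experiment that compares the two rankings using real user clicks:

```python
import random

def rank(w, docs):
    """Rank documents (feature vectors) by linear score, best first."""
    return sorted(docs, key=lambda d: -sum(wi * di for wi, di in zip(w, d)))

def dbgd_step(w, docs, clicks_prefer_candidate, delta=1.0, alpha=0.1):
    """One step of Dueling Bandit Gradient Descent (simplified sketch).

    Proposes a randomly perturbed ranker; if interleaved user clicks favour
    it (abstracted here by the hypothetical `clicks_prefer_candidate`
    callback), move the weights a small step in that direction.
    """
    u = [random.gauss(0, 1) for _ in w]
    norm = sum(x * x for x in u) ** 0.5 or 1.0
    u = [x / norm for x in u]                          # random unit direction
    candidate = [wi + delta * ui for wi, ui in zip(w, u)]
    if clicks_prefer_candidate(rank(w, docs), rank(candidate, docs)):
        return [wi + alpha * ui for wi, ui in zip(w, u)]  # step toward winner
    return w                                           # keep current ranker
```

Each step only needs a binary preference signal from live traffic, which is exactly what interleaved click data provides.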
Finally, a pretty large body of research was dedicated to neural IR. However, this work felt a lot more exploratory, and speed was not always mentioned. For many papers, in fact, the objective was not to outperform LambdaMART-like approaches, but rather to do better than other neural IR methods and to understand how they behave:
- what works now that didn’t use to?
- why don’t we see similar or better performances than with LambdaMART
- what does LambdaMART fail to do that neural IR does well?
These questions still feel unanswered at this point, but the field is making consistent progress. Neural IR seems poised to replace more traditional IR in the future, especially if the number of discoveries in neural IR continues to grow rapidly.
We share here a selection of 3 papers we think proposed interesting approaches, concepts or innovations. This is certainly biased towards our applications, but we thought it could be of interest to some of our readers!
This paper won this year's best paper award, and it truly deserves it. Online learning-to-rank is (in)famous in industry because it takes time to converge to a satisfying model, and in the meantime users are exposed to poor search result quality. This is mainly because of the variance generated by exploring random gradients, whereas during batch training, gradients are defined by the objective function being optimized. The authors propose a smart way to reduce the space of features being updated, making sure they evolve in the right direction, and faster than with previous online learning-to-rank approaches.
The motivation of this work is that a single ranking model for enterprise search (say, the Gmail suite of the company you work at) is not sufficient to address the much wider set of user needs. While the web is the same for everyone (personalization aside), email accounts contain personal data that would benefit from a better-tailored search engine. Since training a new model from scratch for each company is not feasible (they don't necessarily generate tons of data), the authors propose a domain adaptation of the general model, fine-tuned to the specific enterprise's needs. To us, this is a very scalable and smart way to address the issue.
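One common way to implement this kind of adaptation (a sketch under our own assumptions, not necessarily the authors' exact method) is to fine-tune the general model on the small in-domain dataset while regularizing the weights towards the general model, so that scarce enterprise data cannot drag the model too far from a good starting point:

```python
def adapt(base_w, X, y, lam=0.5, lr=0.01, steps=500):
    """Fine-tune a general linear scoring model on a small in-domain dataset.

    Minimises mean squared error on the enterprise data plus an L2 penalty
    pulling the weights back towards the general-purpose model `base_w`
    (illustrative sketch; `lam` controls how strongly we stay near the base).
    """
    w = list(base_w)
    for _ in range(steps):
        # regularization gradient: stay close to the general model
        grad = [lam * (wi - bi) for wi, bi in zip(w, base_w)]
        # data gradient: fit the in-domain examples
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj / len(X)
        w = [wj - lr * g for wj, g in zip(w, grad)]
    return w
```

With `lam=0`, this degenerates into training from scratch on the small dataset; larger values trade in-domain fit for proximity to the general model.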
NDCG is often optimized indirectly in learning-to-rank algorithms, by integrating it into the loss function. This makes it possible to optimize the overall ranking even though we may only evaluate pairs, or even single documents. This usually works because traditional LTR algorithms such as LambdaMART use indirect boosting: gradients are weighted by NDCG, but they are derived from a differentiable cost function. In this paper, the authors propose an approximation of NDCG that is differentiable and can be added directly to the cost function. Any algorithm can thus optimize this approximation directly.
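As a hedged sketch of the general idea (in the spirit of such smooth approximations, with details simplified and function names ours): the hard rank of each document can be replaced by a sum of sigmoids over score differences, which turns NDCG into a smooth function of the model scores:

```python
import math

def approx_ndcg(scores, relevances, temp=0.1):
    """Smooth, differentiable approximation of NDCG (illustrative sketch).

    The hard rank of each document is replaced by a soft count of documents
    scored above it (a sum of sigmoids over score differences), so the
    metric becomes a smooth function of the scores and can sit directly in
    a gradient-based loss. `temp` controls the sharpness of the sigmoids.
    """
    def sig(x):
        return 1.0 / (1.0 + math.exp(-x / temp))

    n = len(scores)
    # soft rank: ~1 + number of documents scored above this one
    ranks = [1.0 + sum(sig(scores[j] - scores[i]) for j in range(n) if j != i)
             for i in range(n)]
    dcg = sum((2**rel - 1) / math.log2(1 + r) for rel, r in zip(relevances, ranks))
    ideal = sum((2**rel - 1) / math.log2(i + 2)
                for i, rel in enumerate(sorted(relevances, reverse=True)))
    return dcg / ideal if ideal > 0 else 0.0
```

As `temp` goes to zero the soft ranks converge to the true ranks, so the approximation approaches the exact NDCG while remaining differentiable everywhere.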
If these topics are of interest to you and if you want to revolutionize legal research, apply and come share the fun!