Using Haystack to create a neural search engine for Dutch law, part 1: why use Haystack?

Felix van Litsenburg
Mar 9, 2022


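This series explains how Wetzoek, a neural search engine for Dutch law, employs deepset’s Haystack to deliver superior search results. This is part 1: why use Haystack? Later parts will cover the Dutch legal dataset, Haystack’s building blocks, setting up a simple pipeline, and using a query classifier.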

Neural search vs symbolic search

Traditional search engines retrieve documents by keyword, through so-called “symbolic search”. This means that they simply scan documents and note where one or more keywords appear. This approach leaves little room for nuance: a search for “summary dismissal” would not pick up the text “he was immediately fired”.
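To make this concrete, here is a minimal sketch of symbolic search in plain Python. It is not how Wetzoek or any production search engine is implemented; the function names and the two example sentences are made up for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def keyword_search(query, documents):
    """Return the documents that contain every word of the query."""
    query_terms = set(tokenize(query))
    return [doc for doc in documents if query_terms <= set(tokenize(doc))]

# Hypothetical example documents, invented for this sketch.
documents = [
    "The court upheld the summary dismissal of the employee.",
    "He was immediately fired after the incident.",
]

results = keyword_search("summary dismissal", documents)
print(results)
```

Only the first sentence is returned. The second describes the same event, but it shares no keyword with the query, so the search never sees it.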

Keyword-based search engines are ill-equipped to recognise such variations on the desired search terms. One can teach the search engine a few rules: for example, that “fired” should be treated as a synonym for “dismissal”. You can imagine that this is time-consuming work that leaves a lot of room for error. A small omission on the developer’s part means the search engine no longer works as effectively.

Moreover, such rules can still lead to confusion, because the same word can have very different meanings depending on the context. Compare “he was all fired up” to “he was fired from his job”.
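The sketch below shows both problems at once. It assumes a tiny hand-written synonym table (the `SYNONYMS` dictionary and the three sentences are invented for this example) and is only meant to illustrate the idea, not a real search engine's rule system.

```python
import re

# Hand-written synonym rules: every entry has to be added and maintained manually.
SYNONYMS = {
    "dismissal": {"dismissal", "dismissed", "fired"},
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def expand(term):
    """Return the known variants of a term, or just the term itself."""
    return SYNONYMS.get(term, {term})

def synonym_search(query, documents):
    """Return documents containing at least one variant of every query term."""
    hits = []
    for doc in documents:
        tokens = set(tokenize(doc))
        if all(expand(term) & tokens for term in tokenize(query)):
            hits.append(doc)
    return hits

# Hypothetical example documents.
documents = [
    "He was fired from his job.",
    "He was all fired up before the hearing.",
    "He was let go from his job.",
]

results = synonym_search("dismissal", documents)
print(results)
```

The rule finds the first sentence, but it also returns “all fired up”, which has nothing to do with dismissal, and it misses “let go” because nobody added that variant to the table.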

This is where Transformers come in handy. Advances in Natural Language Processing (NLP) have led to markedly better language models. Transformer-based models achieve this by taking into account the context in which words appear, both when they are trained and when they are deployed. So, in the example above, the “fired” in “he was fired from his job” would be placed closer in vector space to the “let go” in “he was let go from his job” than to the “fired” in “he was all fired up”. This article provides a good primer.
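“Closer in vector space” is usually measured with something like cosine similarity. The sketch below uses made-up three-dimensional vectors purely to show the comparison; they are not the output of any real model, and real Transformer embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented toy vectors, NOT produced by a language model.
embeddings = {
    "fired (from his job)": [0.9, 0.1, 0.0],
    "let go (from his job)": [0.8, 0.2, 0.1],
    "fired (all fired up)": [0.1, 0.9, 0.3],
}

job = embeddings["fired (from his job)"]
sim_let_go = cosine_similarity(job, embeddings["let go (from his job)"])
sim_fired_up = cosine_similarity(job, embeddings["fired (all fired up)"])

# With these toy numbers, "let go" scores higher than "fired up".
print(sim_let_go > sim_fired_up)
```

A neural search engine ranks documents by this kind of score instead of by literal keyword overlap, which is why a sentence about being “let go” can match a query about being “fired”.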

Why is neural search useful?

It is easy to see the strength of Transformers for many fields. Legal research, in particular, is interesting for three reasons:

  1. the interpretation of words is highly context-dependent, and some meanings are specific to the legal realm
  2. accurate and comprehensive legal research may require some creativity to find all tangential terms. If, for example, you are researching the sale of a farm, you may want to search for “livestock”, but perhaps the answer you are looking for uses the word “cattle”, or even just “chickens” or “animals”
  3. current research often requires (junior) lawyers and paralegals to effectively write RegEx queries to search through legal databases
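As a rough illustration of points 2 and 3, this is the kind of pattern a researcher might end up writing by hand for the farm example. The pattern and the clauses are hypothetical, and Python's standard `re` module stands in for whatever query syntax a given legal database offers.

```python
import re

# Every variant the researcher can think of has to be listed explicitly.
farm_animals = re.compile(
    r"\b(livestock|cattle|chickens?|animals?)\b", re.IGNORECASE
)

# Hypothetical contract clauses, invented for this sketch.
clauses = [
    "The sale includes all cattle present on the premises.",
    "The buyer takes over the barn and the tractor.",
    "Thirty chickens are transferred together with the farm.",
    "The herd of cows remains with the seller.",
]

matches = [clause for clause in clauses if farm_animals.search(clause)]
print(matches)
```

The clause about the herd of cows is clearly relevant, yet it slips through because “cows” and “herd” were not in the pattern. The quality of the search depends on how many variants the researcher thinks of in advance.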

Enter Haystack: an out-of-the-box neural search engine

Haystack directly applies Transformers to allow users to build next-generation search engines. In their words, it “is an open-source framework for building search systems that work intelligently over large document collections.” By directly accessing Transformer models, you can set up a superior search engine and add nifty functionality like question-answering and document summarization on top!

But that isn’t all. Haystack is also highly configurable. Once you have your documents set up, it is quite easy to use Haystack’s pipelines to perform different tasks. In part 3, we will see how a query classifier lets us give our users a better experience.

In the next piece, we’ll dive into the Dutch legal dataset, Haystack’s building blocks, and how to set up a simple Haystack pipeline!
