Using Haystack to create a neural search engine for Dutch law, part 2: setting up a simple pipeline

Felix van Litsenburg
8 min read · Mar 23, 2022


This series explains how Wetzoek, a neural search engine for Dutch law, employs deepset’s Haystack to deliver superior search results. Part 2: Setting up a simple Haystack pipeline.

Last week, we looked at why one should set up Haystack. This week, we are exploring setting up a simple Haystack pipeline to query open data from the Dutch legal system.

The data — parsing for Haystack

The data we will explore is the Open Data of Dutch case law, which can be downloaded here. I have written a parser, which can be found here, that converts the data into a number of .csv files. In general, cases include:

  • The text of the case (taken here as raw, unstructured data, though with a fair bit of manual work the XML can be parsed into sections such as the verdict vs. the preamble)
  • A summary of the case (provided by the Dutch repository)
  • A European standard “ECLI”-type identifier
  • Metadata (e.g. the date the case was published, the date it was heard, and the court that heard it)
Data schema for Dutch legal data for Haystack neural search engine

This data is stored in .csv files, delimited by the ‘|’ symbol and UTF-8 encoded. For 1990–2021, ignoring cases with an empty ‘Text’ field, we get around half a million cases for our search engine. We will load these .csv files into Haystack next, but before we get to that, let’s review the Haystack basics.
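
To make this format concrete, here is a minimal sketch of reading such a ‘|’-delimited file with Python’s standard csv module and dropping cases with an empty text field. The column names (ecli, summary, text, date_published) are assumptions for illustration, not necessarily the parser’s actual schema:

```python
import csv
import io

# A tiny '|'-delimited sample in the shape described above.
# Column names are assumptions for illustration.
raw = ("ecli|summary|text|date_published\n"
       "ECLI:NL:HR:2020:123|Theft case summary|Full text of the verdict...|2020-05-01\n"
       "ECLI:NL:HR:2020:124|Empty case||2020-05-02\n")

reader = csv.DictReader(io.StringIO(raw), delimiter="|")
# Ignore cases with an empty text field, as done for the 1990-2021 corpus.
cases = [row for row in reader if row["text"].strip()]

print(len(cases))  # 1
```

For the real files you would pass `open(path, encoding="utf-8")` instead of the in-memory sample.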

Haystack basics: primitives and pipelines

To understand Haystack at a basic level, we must understand its three essential components:

  1. Documents — straightforwardly, a document containing ‘content’ (the text field), an ‘id’ (the id field), and metadata
  2. DocumentStore — documents need to be placed in a database for access. Haystack offers a range of options, from Elasticsearch to Weaviate, and depending on your use case you may want a different DocumentStore. This becomes particularly important in the next instalment of the series, when we look at more complex pipelines that use Dense Passage Retrievers and require document embeddings.
  3. Pipelines — with pipelines, you can chain different functionalities (called nodes) together to query the documents in your DocumentStore meaningfully.
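
As a conceptual sketch (plain Python, not Haystack’s real classes), this is how the three primitives relate: documents are written to a store, and a pipeline chains nodes that act on them:

```python
# Conceptual sketch only -- not Haystack's actual API. It illustrates how the
# three primitives relate: Documents live in a DocumentStore, and a Pipeline
# chains nodes (here, a toy Retriever) that operate on them.

documents = [
    {"content": "The court ruled in favour of the defendant.", "id": "doc-1",
     "meta": {"date": "2020-05-01"}},
    {"content": "The appeal was dismissed.", "id": "doc-2",
     "meta": {"date": "2021-03-12"}},
]

class ToyDocumentStore:
    def __init__(self):
        self.docs = []

    def write_documents(self, docs):
        self.docs.extend(docs)

class ToyRetriever:
    """A stand-in node: returns documents sharing a word with the query."""
    def __init__(self, store):
        self.store = store

    def run(self, query):
        terms = set(query.lower().split())
        return [d for d in self.store.docs
                if terms & set(d["content"].lower().split())]

store = ToyDocumentStore()
store.write_documents(documents)

# A real Haystack pipeline would chain Retriever -> Reader; here we run
# just the retrieval step.
shortlist = ToyRetriever(store).run("court verdict")
print([d["id"] for d in shortlist])  # ['doc-1']
```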

For this first, ‘simple’ pipeline, we’ll put together a question-answering tool that queries Dutch case law. It uses two Haystack nodes: a Retriever, which selects an initial shortlist of documents, and a Reader, which performs question answering on that shortlist.

To get this all going, I spun up a GPU-enabled Ubuntu EC2 instance. I use one of the AWS Deep Learning AMIs, which has the added advantage of coming with Anaconda and the necessary packages. Because Haystack uses Transformer-type models, you’ll want a GPU; otherwise, the query time for a single question can run into the minutes, especially as you work with larger databases. Also take care that the instance is large enough and has sufficient RAM: a free-tier instance will not be enough.

Once you have your instance set up, all you need to do is pull Haystack.

git clone https://github.com/deepset-ai/haystack.git

Setting up a pipeline

Next, we want to set up a Pipeline that runs a user query through a Reader and a Retriever. We navigate to rest_api/pipeline/pipelines.yaml (or pipelines.haystack-pipeline.yml as of release 1.5) where we find the following:

components:    # define all the building-blocks for Pipeline
  - name: DocumentStore
    type: ElasticsearchDocumentStore
    params:
      host: localhost
  - name: Retriever
    type: ElasticsearchRetriever
    params:
      document_store: DocumentStore    # params can reference other components defined in the YAML
      top_k: 5
  - name: Reader    # custom-name for the component; helpful for visualization & debugging
    type: FARMReader    # Haystack Class name for the component
    params:
      model_name_or_path: deepset/roberta-base-squad2
      context_window_size: 500
      return_no_answer: true
  - name: TextFileConverter
    type: TextConverter
  - name: PDFFileConverter
    type: PDFToTextConverter
  - name: Preprocessor
    type: PreProcessor
    params:
      split_by: word
      split_length: 1000
  - name: FileTypeClassifier
    type: FileTypeClassifier

pipelines:
  - name: query    # a sample extractive-qa Pipeline
    type: Query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: Reader
        inputs: [Retriever]
  - name: indexing
    type: Indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextFileConverter
        inputs: [FileTypeClassifier.output_1]
      - name: PDFFileConverter
        inputs: [FileTypeClassifier.output_2]
      - name: Preprocessor
        inputs: [PDFFileConverter, TextFileConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]

The pipelines file consists of two sections, “components” and “pipelines”. Think of the components section as imports in Python: we need to clarify what functionality we want to use. The pipelines section lets us define how we tie these components together.

The only components we need for our question-answering pipeline are the DocumentStore, the Retriever, and the Reader. The defaults will mostly do, but for the Reader we want to select a different language model: the current selection is an English-language question-answering model. Since we are looking at Dutch data, we want a Dutch question-answering model, if one exists. From the Hugging Face model hub we can pick one, for example this multilingual model fine-tuned for Dutch. Ideally, we’d want a Dutch-language, domain-specific model, but that might be too much to ask for off the shelf!

Plugging that in and removing some unnecessary components, we get the following pipeline:

components:    # define all the building-blocks for Pipeline
  - name: DocumentStore
    type: ElasticsearchDocumentStore
    params:
      host: localhost
  - name: Retriever
    type: ElasticsearchRetriever
    params:
      document_store: DocumentStore    # params can reference other components defined in the YAML
      top_k: 5
  - name: Reader    # custom-name for the component; helpful for visualization & debugging
    type: FARMReader    # Haystack Class name for the component
    params:
      model_name_or_path: henryk/bert-base-multilingual-cased-finetuned-dutch-squad2
      context_window_size: 500
      return_no_answer: true

pipelines:
  - name: query    # a sample extractive-qa Pipeline
    type: Query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: Reader
        inputs: [Retriever]
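
Once this pipeline is served by the REST API, it can be queried over HTTP. Here is a hedged sketch of the request payload, based on the shape of Haystack’s /query endpoint at the time of writing; treat the host, port, and parameter names as assumptions to verify against your version:

```python
import json

# Request payload for Haystack's POST /query endpoint; the keys under
# "params" must match the node names defined in the YAML above.
payload = {
    "query": "Wat zegt de rechter over diefstal?",
    "params": {"Retriever": {"top_k": 5}, "Reader": {"top_k": 3}},
}
body = json.dumps(payload).encode("utf-8")

# With the service running, you could send it like so (hypothetical host/port):
# from urllib.request import Request, urlopen
# req = Request("http://localhost:8000/query", data=body,
#               headers={"Content-Type": "application/json"})
# answers = json.loads(urlopen(req).read())["answers"]

print(json.loads(body)["query"])
```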

Let’s take a quick step back and see how this fits into the bigger picture. This pipeline can be used to underpin a REST API, which in turn can serve a front end. To build something like Wetzoek, you’d have a structure like so:

Incorporating Haystack in a SaaS

A quick word on each component:

  • the DocumentStore we already defined up top: it’s where we store our documents
  • the Retriever quickly scans the full DocumentStore and generates a shortlist of documents to generate an answer from. Haystack retrievers come in two flavours: “sparse” and “dense”. The standard, sparse retriever does not require document embeddings; those are reserved for Dense Passage Retrieval. This makes it quicker, but it does not bring the semantic-search benefits of Haystack and is therefore reliant on exact keyword matches. Next week, we’ll look at Dense Passage Retrieval.
  • the Reader is Haystack’s question-answering capability: it returns not just a specific document, but an answer containing a string for the exact answer, a string for the context, and a confidence score
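
The reliance on exact keyword matches can be illustrated with a toy scoring function. This is a drastic simplification of what a sparse retriever (BM25/TF-IDF style) does, with no term weighting or length normalisation, but it shows why synonyms score nothing:

```python
def sparse_score(query, document):
    """Count exact query terms appearing in the document: a toy version of
    sparse retrieval, minus the weighting and normalisation of real BM25."""
    doc_terms = set(document.lower().split())
    return sum(1 for t in query.lower().split() if t in doc_terms)

doc = "de rechtbank veroordeelt de verdachte wegens diefstal"

print(sparse_score("diefstal verdachte", doc))  # 2: both exact terms match
print(sparse_score("stelen dader", doc))        # 0: synonyms score nothing
```

A dense retriever would embed both queries and the document into vectors, so the second query (synonyms of “theft” and “suspect”) could still score well.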

With the pipeline specified, we can go into the main Haystack folder and see if we can get the API running. First, we need to tell Haystack to load the pipeline. Go into the Haystack folder. Depending on whether you are running on the GPU or the CPU, you may want to replace “docker-compose.yml” with “docker-compose-gpu.yml” or “docker-compose-cpu.yml”, respectively.

Then, make sure to actually edit this file and un-comment the lines that mount the custom pipelines folder:

volumes:
  - ./rest_api/pipeline:/home/user/rest_api/pipeline

Moreover, you’ll want to set up a separate, empty ElasticSearch instance by editing the template like this:

elasticsearch:
  # This will start an empty elasticsearch instance (so you have to add your documents yourself)
  image: "elasticsearch:7.9.2"
  # If you want a demo image instead that is "ready-to-query" with some indexed articles
  # about countries and capital cities from Wikipedia:
  # image: "deepset/elasticsearch-countries-and-capitals"

Spinning up Haystack

Haystack is nicely captured in a docker container, so let’s get it running. Navigate to the Haystack directory and run:

sudo apt-get install docker.io
sudo apt install docker-compose
sudo service docker restart
docker-compose up

You’ll want to do this in a separate screen session, by the way, so that you can let Haystack run in the background. Moreover, on the first run a lot of packages need to be downloaded, so it may take a while (about 10 minutes)!

Once this has been done, we have a running pipeline. You can check out the Streamlit app on port 8501. Of course, make sure your instance’s security settings allow access to the appropriate ports.

On your separate screen, you should see:

Haystack running on CPU

Loading files into Haystack

Now we can begin loading files into Haystack. You’ll want to create a folder to store your data; you can then transfer your .csv files to this folder using e.g. FileZilla.

First, let’s get our environment in order. Since we use a Deep Learning AMI, a lot of the required packages are already in place. Navigate to the Haystack folder and:

conda create -n python38 python=3.8
conda activate python38
pip install --upgrade pip
pip install -e '.[all]'  # or '.[all-gpu]' for the GPU-enabled dependencies

Then, navigate to your data folder, get the haystack_load.py file in place, and run it.
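
I won’t reproduce haystack_load.py here, but its core job can be sketched as follows. The column names are illustrative, and the commented-out DocumentStore calls are an assumption based on Haystack 1.x’s API:

```python
import csv
import io

def rows_to_documents(reader):
    """Turn '|'-delimited case rows into Haystack-style document dicts:
    the case text becomes 'content', everything else goes into 'meta'."""
    docs = []
    for row in reader:
        text = row.pop("text", "").strip()
        if not text:
            continue  # skip cases without a text body
        docs.append({"content": text, "meta": row})
    return docs

# Tiny in-memory sample; the real script would iterate over the .csv files.
sample = ("ecli|text|date_published\n"
          "ECLI:NL:HR:2020:123|Full text of the verdict...|2020-05-01\n")
docs = rows_to_documents(csv.DictReader(io.StringIO(sample), delimiter="|"))
print(docs[0]["meta"]["ecli"])  # ECLI:NL:HR:2020:123

# With Haystack installed and Elasticsearch running, writing is roughly:
# from haystack.document_stores import ElasticsearchDocumentStore
# store = ElasticsearchDocumentStore(host="localhost")
# store.write_documents(docs)
```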

Checking if it worked

If everything worked well, you should see progress output in your terminal.

There are also some other ways to check that everything has gone well. First off, you can check the Streamlit app on port 8501. For some reason, the surrounding text was still set to the Haystack demo, but the neural search itself worked just fine:

Haystack Streamlit app

You can also go to the Swagger docs on :8000/docs and even test out the API there. Lastly, on port :9200 you can access the Elasticsearch engine; a tool like elasticsearch-head is great for inspecting the data you have uploaded.

Concluding remarks

We now have our Haystack engine set up! To turn this into a production-ready tool, we’d of course want to build a nice front end on top of it. That, however, would be the topic of an entirely different series.

More interestingly: we now have a very simple pipeline, and Haystack is set up for question answering. If you were to ask Haystack “What is the definition of IP theft?” you would get a very different output than if you typed “definition of IP theft”. If you look at popular Google queries, you’ll see that a lot of users put in the second type of query. After all, it’s quicker to type a statement like that than to write out a full question!

In the next article in the series, we will build a more complex pipeline. In particular, we will take on two challenges: first, can we treat different types of user queries differently? (Answer: yes). Second, what more cool stuff can we do with the available Haystack pipelines?
