Using Haystack to create a neural search engine for Dutch law, part 3: handling a more complex pipeline

Aug 11, 2022

This series explains how Wetzoek, a neural search engine for Dutch law, employs deepset’s Haystack to deliver superior search results. Part 3: Handling a more complex Haystack pipeline.

Part 1: Why Haystack?
Part 2: Setting up a simple Haystack pipeline
Part 3: Handling a more complex Haystack pipeline
Part 4: Trying out different vector databases, Weaviate vs ElasticSearch

Last time, we looked at setting up a simple Haystack Retriever-Reader pipeline and loading data into it. Now, we will look at a more complex Haystack pipeline. Since the last article, Haystack has developed a lot: it is now on version 1.6!

Question-Answering and Ranking

Last time, we looked at a pipeline that took a user query, such as “How is personal data defined?”, and searched through our Document Store to get an answer. We noted then that this is not how everyone searches. Many users might instead enter “definition of personal data” or just “personal data definition”.

Haystack’s Reader component is optimized to answer questions, so the results for a handful of keywords will be markedly different. Therefore, we want to split the pipeline: if the user query is a question, we use Haystack’s Reader. If it is not a question, we use the Ranker instead. The Ranker takes the documents found by the Retriever and simply reorders them by relevance to the query.

Pipeline Splitting

But let’s wait a minute: how does Haystack know what’s a question and what’s a statement? For this, we need to add another Node: the Query Classifier. This Node uses an existing classification model to determine whether a user’s query is a question or more of a statement.

(Note that you could also build a Node that looks for tell-tale signs of a question, like a query ending with a question mark. Don’t forget, however, that lazy users might enter a question but omit the “?” at the end! For this reason, a trained query classifier could be preferable.)
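The rule-based alternative mentioned above could be sketched roughly as follows. This is plain Python rather than a Haystack node: the function name `looks_like_question` and the word list are hypothetical, and the English question words merely stand in for whatever language your users type in.

```python
# Hypothetical rule-based check: not part of Haystack, just the decision
# logic that a custom classifier node could wrap.
QUESTION_WORDS = {
    "who", "what", "when", "where", "why", "how", "which",
    "is", "are", "can", "does", "do",
}  # assumed word list; adjust for your own language and domain


def looks_like_question(query: str) -> bool:
    """Return True if the query ends with '?' or starts with a question word."""
    text = query.strip().lower()
    if not text:
        return False
    if text.endswith("?"):
        return True
    # Catch users who type a question but leave out the question mark.
    first_word = text.split()[0]
    return first_word in QUESTION_WORDS


for q in [
    "How is personal data defined?",
    "how is personal data defined",
    "personal data definition",
]:
    label = "question" if looks_like_question(q) else "statement"
    print(f"{q!r} -> {label}")
```

A word list like this will misfire on statements that happen to start with one of the listed words, which is one more reason a trained classifier may be the safer choice.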

The Query Classifier splits your pipeline in two and routes each query either through the “question” branch or through the “statement” branch. You can specify yourself what happens on each path.

For my legal search engine, I updated the YAML file as follows:

components:   
  - name: DocumentStore
    type: ElasticsearchDocumentStore
    params:
      host: localhost
  - name: Retriever-Rank
    type: BM25Retriever
    params:
      document_store: DocumentStore    
      top_k: 5
  - name: Retriever-Read
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 5
  - name: Reader      
    type: FARMReader   
    params:
      model_name_or_path: henryk/bert-base-multilingual-cased-finetuned-dutch-squad2
      use_gpu: True
  - name: DocAnswer  
    type: Docs2Answers 
  - name: Ranker       
    type: SentenceTransformersRanker    
    params:
      model_name_or_path: pdelobelle/robbert-v2-dutch-base
      use_gpu: True
  - name: Classifier       
    type: TransformersQueryClassifier    
    params:
      model_name_or_path: shahrukhx01/question-vs-statement-classifier

Compared to the previous article, a lot of new components have been added. There are two Retrievers, Retriever-Read and Retriever-Rank. Each forms the start of one branch after the pipeline has been split by Classifier. Looking at the structure of the pipeline is illustrative:

pipelines:
  - name: query
    nodes:
      - name: Classifier
        inputs: [Query]
      - name: Retriever-Read
        inputs: [Classifier.output_1]
      - name: Reader
        inputs: [Retriever-Read]
      - name: Retriever-Rank
        inputs: [Classifier.output_2]
      - name: Ranker
        inputs: [Retriever-Rank]
      - name: DocAnswer
        inputs: [Ranker]

As you can see, the Classifier sends questions to Classifier.output_1 and statements to Classifier.output_2. Each query then follows its own branch: Retriever and Reader for questions, Retriever and Ranker (followed by DocAnswer) for statements.
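To make the two branches concrete, here is a small sketch that walks the node wiring from the pipeline definition and lists which nodes a query would visit on each Classifier output. It is a hand-rolled tracer, not how Haystack executes its graph internally; `trace_branch` is a hypothetical helper and the node list is simply copied from the YAML.

```python
# Node wiring copied from the "query" pipeline definition.
NODES = [
    {"name": "Classifier", "inputs": ["Query"]},
    {"name": "Retriever-Read", "inputs": ["Classifier.output_1"]},
    {"name": "Reader", "inputs": ["Retriever-Read"]},
    {"name": "Retriever-Rank", "inputs": ["Classifier.output_2"]},
    {"name": "Ranker", "inputs": ["Retriever-Rank"]},
    {"name": "DocAnswer", "inputs": ["Ranker"]},
]


def trace_branch(nodes, classifier_output):
    """List the nodes visited when the Classifier picks the given output edge."""
    path = ["Classifier"]
    current = classifier_output  # e.g. "Classifier.output_1"
    while True:
        # Find the node that consumes the current output, if there is one.
        nxt = next((n["name"] for n in nodes if current in n["inputs"]), None)
        if nxt is None:
            return path
        path.append(nxt)
        current = nxt


print(trace_branch(NODES, "Classifier.output_1"))
# ['Classifier', 'Retriever-Read', 'Reader']
print(trace_branch(NODES, "Classifier.output_2"))
# ['Classifier', 'Retriever-Rank', 'Ranker', 'DocAnswer']
```

Only the statement branch ends in DocAnswer, because only that branch produces documents rather than answers.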

Docs2Answers

“But hold up!”, you might say. “There’s another node there: DocAnswer. What’s that?”

The DocAnswer node uses Docs2Answers to transform the output format. This is quite important when you are building an API with Haystack. The Ranker produces documents as its output, because it does not look for an answer to a statement. The Reader produces answers as its output. Because these two formats differ, the output of your API would also differ depending on how the query is classified. To get around this, the Docs2Answers node transforms document-type output into answer-type output, so that you have a uniform format throughout.
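The idea behind this conversion can be sketched in a few lines of plain Python. The `Doc` and `Ans` classes below are simplified stand-ins for Haystack’s Document and Answer primitives, and their field names, as well as the choice to leave the answer text empty, are assumptions made for illustration rather than a copy of Haystack’s implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


# Simplified stand-ins for Haystack's primitives; field names are assumed.
@dataclass
class Doc:
    id: str
    content: str
    score: Optional[float] = None
    meta: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Ans:
    answer: str
    context: str
    score: Optional[float]
    document_id: str
    meta: Dict[str, Any] = field(default_factory=dict)


def docs_to_answers(docs: List[Doc]) -> List[Ans]:
    """Wrap each document in an answer-shaped record, keeping order and score."""
    return [
        Ans(
            answer="",  # assumption: a ranked document has no extracted answer span
            context=d.content,
            score=d.score,
            document_id=d.id,
            meta=dict(d.meta),  # copy so the original document is not mutated
        )
        for d in docs
    ]


ranked = [
    Doc(id="a1", content="Example article text about personal data.",
        score=0.92, meta={"source": "example"}),
]
for ans in docs_to_answers(ranked):
    print(ans.document_id, ans.score, ans.context)
```

With a wrapper like this, an API client can read the same fields regardless of which branch handled the query.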

Conclusion

Once you have Haystack up and running on a simple pipeline, it’s quite easy to add more complexity. Just keep Haystack’s primitives (documents versus answers) in mind, so that the API output stays uniform.


Written by Felix van Litsenburg