Using Haystack to create a neural search engine for Dutch law, part 3: handling a more complex pipeline
This series explains how Wetzoek, a neural search engine for Dutch law, employs deepset’s Haystack to deliver superior search results. Part 3: Handling a more complex Haystack pipeline.
- Part 1: Why Haystack?
- Part 2: Setting up a simple Haystack pipeline
- Part 3: Handling a more complex Haystack pipeline
- Part 4: Trying out different vector databases, Weaviate vs ElasticSearch
Last time, we looked at setting up a simple Haystack Retriever-Reader pipeline and loading data into it. Now, we will look at a more complex Haystack pipeline. Since the last article, Haystack has developed a lot: it is now on version 1.6!
Question-Answering and Ranking
Last time, we looked at a pipeline that took a user query, such as “How is personal data defined?”, and searched through our Document Store to get an answer. We noted then that this is not how everyone searches. A lot of users might enter “definition of personal data” or just “personal data definition”.
Haystack’s Reader component is optimized to answer questions, so the results for a handful of keywords will be markedly different. Therefore, we want to split the pipeline: if the user query is a question, we use Haystack’s Reader. If it’s not a question, we use the Ranker. The Ranker takes the documents found by the Retriever and simply reorders them by relevance to the query.
Pipeline Splitting
But let’s wait a minute: how does Haystack know what’s a question versus what’s a statement? For this, we will need to add another Node: the Query Classifier. This Node uses an existing, pre-trained classification model to determine whether a user’s query is a question or more of a statement.
(Note that you could also build a Node that looks for tell-tale signs of a question, like a query ending with a question mark. Don’t forget, however, that lazy users might enter a question but omit the “?” at the end! For this reason, using a trained query classifier could be preferable).
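To make the rule-based alternative concrete, here is a minimal sketch of such a heuristic as a plain Python function. It is not part of Wetzoek or Haystack: the function name and the word lists are hypothetical, and wrapping it in a custom Haystack Node is left out.

```python
# Hypothetical word lists (assumptions); extend them for your own users.
QUESTION_WORDS = {
    # English
    "who", "what", "when", "where", "why", "how", "which",
    # Dutch
    "wie", "wat", "wanneer", "waar", "waarom", "hoe", "welke",
}


def looks_like_question(query: str) -> bool:
    """Return True if the query ends in '?' or starts with a question word."""
    text = query.strip().lower()
    if not text:
        return False
    if text.endswith("?"):
        return True
    # Catch lazy users who leave out the question mark.
    first_word = text.split(maxsplit=1)[0]
    return first_word in QUESTION_WORDS
```

A heuristic like this still misses questions that do not start with a question word, such as “is a name personal data”, which is one more reason a trained classifier could be preferable.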
The Query Classifier will split your pipeline in two and route each query either through the “question” pipes or through the “statement” pipes. You can specify yourself what happens on each path.
For my legal search engine, I updated the YAML file as follows:
components:
  - name: DocumentStore
    type: ElasticsearchDocumentStore
    params:
      host: localhost
  - name: Retriever-Rank
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 5
  - name: Retriever-Read
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 5
  - name: Reader
    type: FARMReader
    params:
      model_name_or_path: henryk/bert-base-multilingual-cased-finetuned-dutch-squad2
      use_gpu: True
  - name: DocAnswer
    type: Docs2Answers
  - name: Ranker
    type: SentenceTransformersRanker
    params:
      model_name_or_path: pdelobelle/robbert-v2-dutch-base
      use_gpu: True
  - name: Classifier
    type: TransformersQueryClassifier
    params:
      model_name_or_path: shahrukhx01/question-vs-statement-classifier
Compared to the previous article, a lot of new components have been added. There are two Retrievers, Retriever-Read and Retriever-Rank. Each forms the beginning of one branch after the pipeline has been split by the Classifier. Looking at the structure of the pipeline is illustrative:
pipelines:
  - name: query
    nodes:
      - name: Classifier
        inputs: [Query]
      - name: Retriever-Read
        inputs: [Classifier.output_1]
      - name: Reader
        inputs: [Retriever-Read]
      - name: Retriever-Rank
        inputs: [Classifier.output_2]
      - name: Ranker
        inputs: [Retriever-Rank]
      - name: DocAnswer
        inputs: [Ranker]
As you can see, the Classifier routes questions to Classifier.output_1 and statements to Classifier.output_2. Each output then feeds its own branch: Retriever and Reader for questions, Retriever and Ranker for statements.
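The branching can be traced with a small stand-alone sketch. It does not use Haystack at all: it only mirrors the node list from the YAML above in a Python dictionary and follows one Classifier output at a time. The names NODES and trace_branch are made up for this illustration, and the sketch assumes each output is consumed by exactly one node, as is the case in this pipeline.

```python
from typing import List

# Mirror of the "query" pipeline: each node lists the inputs it reads from.
NODES = {
    "Classifier": ["Query"],
    "Retriever-Read": ["Classifier.output_1"],
    "Reader": ["Retriever-Read"],
    "Retriever-Rank": ["Classifier.output_2"],
    "Ranker": ["Retriever-Rank"],
    "DocAnswer": ["Ranker"],
}


def trace_branch(classifier_output: str) -> List[str]:
    """Follow the nodes reached from one output edge of the Classifier."""
    path = ["Classifier"]
    current = f"Classifier.{classifier_output}"
    while True:
        # Find the node that consumes the current output, if there is one.
        next_node = next(
            (name for name, inputs in NODES.items() if current in inputs), None
        )
        if next_node is None:
            return path
        path.append(next_node)
        current = next_node


print(trace_branch("output_1"))  # ['Classifier', 'Retriever-Read', 'Reader']
print(trace_branch("output_2"))  # ['Classifier', 'Retriever-Rank', 'Ranker', 'DocAnswer']
```

Note that only the statement branch ends in DocAnswer; the question branch stops at the Reader.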
Docs2Answers
“But hold up!”, you might say. “There’s another node there: DocAnswer. What’s that?”
The DocAnswer node uses Docs2Answers to transform the output format. This is quite important when you are building an API with Haystack. The Ranker normally produces documents as its output, because it only reorders documents and does not extract an answer from them. The Reader produces answers as its output. Because these two formats differ, the output of your API would differ as well, depending on how the query is classified. To get around this, the Docs2Answers node transforms document-type output into answer-type output, so that you have a uniform format throughout.
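The idea behind this conversion can be sketched in a few lines of plain Python. The Doc and Ans classes below are simplified stand-ins, not Haystack’s real Document and Answer classes, and the choice to reuse the document text as the “answer” is an assumption made for illustration rather than a description of what Haystack fills in.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Simplified stand-in for a ranked document (not Haystack's Document)."""
    id: str
    content: str
    score: float
    meta: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Ans:
    """Simplified stand-in for an answer (not Haystack's Answer)."""
    answer: str
    context: str
    score: float
    document_id: str
    meta: Dict[str, Any] = field(default_factory=dict)


def docs_to_answers(docs: List[Doc]) -> List[Ans]:
    """Wrap each document in an answer-shaped record, keeping order and score."""
    return [
        Ans(
            answer=d.content,   # assumption: the document text doubles as the answer
            context=d.content,
            score=d.score,
            document_id=d.id,
            meta=dict(d.meta),  # copy so later changes do not leak back
        )
        for d in docs
    ]
```

With a wrapper like this, an API client can read the same fields whether the query went down the question branch or the statement branch.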
Conclusion
Once you have Haystack up and running on a simple pipeline, it’s quite easy to add more complexity. You want to be aware of Haystack’s primitives to make sure the API output is uniform.