Using Haystack to create a neural search engine for Dutch law, part 4: trying out different vector databases, Weaviate vs ElasticSearch

Felix van Litsenburg
5 min read · Apr 21, 2023


This series explains how Wetzoek, a neural search engine for Dutch law, employs deepset’s Haystack to deliver superior search results. Part 4: trying out different vector databases, Weaviate vs ElasticSearch

It has been almost nine months since the last article, and since then the NLP space, Wetzoek, and Haystack have seen a lot of progress. At Wetzoek, we have added an automated filter setting functionality based on the labels identified through graph theory. An in-depth report will probably follow in an upcoming article.

One of the reasons for setting filters before doing neural search (or any search) is that it speeds up retrieval significantly. After all, if you tell the machine to only look through e.g. civil law, then you reduce the document scope by roughly a third! Now imagine doing this for even smaller domains.
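In Haystack v1, such a pre-filter is passed as a `filters` dict to the retriever. The effect itself can be sketched in plain Python; the `domain` field and its labels below are illustrative, not Wetzoek's actual schema:

```python
# Sketch: pre-filtering a corpus by legal domain before running search.
# Fields and labels are made up for illustration.
docs = [
    {"id": 1, "domain": "civil", "text": "..."},
    {"id": 2, "domain": "criminal", "text": "..."},
    {"id": 3, "domain": "civil", "text": "..."},
]

def filter_by_domain(docs, domain):
    """Keep only documents tagged with the given domain label."""
    return [d for d in docs if d["domain"] == domain]

scope = filter_by_domain(docs, "civil")
print(len(scope), "of", len(docs), "documents remain in scope")
```

Every document the filter removes is one the retriever never has to compare vectors against, which is where the speed-up comes from.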

This is only necessary if search is already slow, which it has been on Wetzoek. Haystack’s logs very helpfully tell you how long it takes to return a query:

[Image: fragment of a Haystack query response in the API, showing the time taken]

In some of the worst cases, this could take almost a minute! And that’s using a GPU-enabled instance (g4dn on AWS) created for inference…

Embeddings and splitting documents

Neural search does not do well on large documents, because they contain too many tokens. Therefore, Wetzoek’s database of ~600,000 documents has been split into a rather enormous 18 million split documents of around 100 tokens each. As you can imagine, this poses quite a challenge for fast retrieval, and would explain why some searches take up to a minute.
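The splitting itself is straightforward. The sketch below uses whitespace words as a stand-in for model tokens (a simplification; Haystack's PreProcessor handles the real tokenization and overlap):

```python
def split_into_chunks(text, chunk_size=100):
    """Split a document into chunks of roughly `chunk_size` whitespace tokens.

    A simplified stand-in for a document splitter: "tokens" here are
    whitespace-separated words, not model subword tokens.
    """
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]

chunks = split_into_chunks("word " * 250)
print(len(chunks))  # 3 chunks: 100 + 100 + 50 tokens
```

At ~600,000 source documents averaging 30 chunks each, you quickly arrive at numbers in the millions, which is exactly the scale problem described above.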

To slow things down even more, each document has an embedding. Remember, we are using neural search and so we are looking for semantic similarity, not keywords. That means that under the hood, our query is converted into a vector. This vector is then compared to the vector representations of the embedded documents, to find the most similar one in vector-space. This is a specialised type of search that requires specialised databases.
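The comparison in vector space is typically cosine similarity (the similarity metric configured on the document store earlier in this series). A minimal, dependency-free sketch with toy 3-dimensional vectors, where a real sentence-transformer would produce hundreds of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings"; document names are made up for illustration.
query = [0.9, 0.1, 0.0]
doc_vectors = {
    "contract law ruling": [0.8, 0.2, 0.1],
    "criminal appeal": [0.1, 0.9, 0.3],
}

best = max(doc_vectors, key=lambda name: cosine_similarity(query, doc_vectors[name]))
print(best)  # contract law ruling
```

Doing this naively means one similarity computation per stored vector per query; vector databases exist to avoid exactly that, using approximate nearest-neighbour indexes.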

Vector databases

While ElasticSearch has functionality for vector embeddings, it is not built entirely with that purpose in mind. Fortunately, there are many dedicated vector databases available. Several are listed on Haystack’s DocumentStores page, including Weaviate, Pinecone, and Qdrant. Another one not listed is Chroma.

At the time, I picked Weaviate mainly because its integration with Haystack allowed filtering. This was not yet possible for FAISS or Pinecone, and an integration with Qdrant did not yet exist. As it happens, as of writing this article, Weaviate has just raised $50mn in Series B funding, so this primer is quite timely!

Setting up Weaviate

When setting up Weaviate through Haystack, I found this article to be immensely helpful. In this case, I used weaviate 1.17.3. Following the steps in the article was enough to get me set up on Weaviate, after some bug fixing. Both the Haystack and Weaviate communities on, respectively, Discord and Slack were immensely helpful. C-suite members from both companies were often the first to reply to questions!

Nonetheless, there were still a few challenges. In particular, Weaviate only accepts documents that are already embedded. This is no problem if you use Haystack’s Uploading pipeline, as outlined in the article mentioned above. But in my case, I was used to updating embeddings periodically like this:

    import time

    from haystack.document_stores import ElasticsearchDocumentStore
    from haystack.nodes import EmbeddingRetriever

    document_store = ElasticsearchDocumentStore(
        analyzer=language, index=doc_index, timeout=300, similarity="cosine"
    )
    tic = time.perf_counter()
    retriever = EmbeddingRetriever(
        document_store=document_store,
        embedding_model="jegormeister/bert-base-dutch-cased-snli",
        model_format="sentence_transformers",
    )
    print("Updating embeddings")
    document_store.update_embeddings(retriever, update_existing_embeddings=False)

I therefore had to update my data-loading pipeline to embed documents before writing them to the DocumentStore. My solution was to build a “pre-embedder” function where you have to feed in the Haystack docs:

    def pre_embedder(docs):
        print("Running the pre-embedding")
        retriever = EmbeddingRetriever(
            document_store=split_document_store,  # the Weaviate DocumentStore
            embedding_model=embedding_model,  # in practice, the jegormeister Dutch model
            model_format=model_format,  # in practice, sentence_transformers again
        )
        embeds = retriever.embed_documents(docs)
        for doc, emb in zip(docs, embeds):
            try:
                doc.embedding = emb
            except Exception as e:
                print(e)
        return docs

This worked out fine, and in fact I would not be surprised if the latest Haystack iteration (this was built on 1.10; Haystack is currently on 1.15.1) no longer requires this.
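The embed-then-write pattern itself can be demonstrated without Haystack installed, using minimal stand-ins. Everything below is a mock for illustration; only the structure (attach `embedding` to each document before writing) mirrors the real pipeline:

```python
# Minimal mock of the "embed before write" pattern Weaviate requires.
class FakeDoc:
    """Stand-in for a Haystack Document: content plus an embedding slot."""
    def __init__(self, content):
        self.content = content
        self.embedding = None

def fake_embed(docs):
    """Stand-in for retriever.embed_documents: one vector per document."""
    return [[float(len(d.content))] for d in docs]

def pre_embed(docs):
    # Same zip-and-assign structure as pre_embedder above.
    for doc, emb in zip(docs, fake_embed(docs)):
        doc.embedding = emb
    return docs

docs = pre_embed([FakeDoc("ab"), FakeDoc("abcd")])
print([d.embedding for d in docs])  # [[2.0], [4.0]]
```

In the real pipeline, the embedded documents would then be passed to the document store's `write_documents` call.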

Weaviate outperforms ElasticSearch

To test the performance of the pipeline I set up previously, I queried both ElasticSearch and Weaviate ten times each, at different numbers of total documents. I captured the output in an Excel table for 1k, 10k, and ~100k docs. Remember: each document represents ~60x as many split documents once divided into manageable chunks! Times shown are in seconds.
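The benchmark loop is simple to reproduce. A hedged sketch, where `search_fn` stands in for a call to either document store's query pipeline:

```python
import time

def time_queries(search_fn, queries, repeats=10):
    """Time a search callable over repeated queries, in seconds.

    `search_fn` is any callable taking a query string; in the benchmark
    above it would wrap an ElasticSearch or Weaviate query pipeline.
    """
    timings = []
    for query in queries:
        for _ in range(repeats):
            tic = time.perf_counter()
            search_fn(query)
            timings.append(time.perf_counter() - tic)
    return timings

# Demo with a dummy search function.
timings = time_queries(lambda q: sorted(q), ["alpha", "beta"], repeats=3)
print(len(timings))  # 6 measurements
```

Keeping all individual timings (rather than just the mean) is what makes warm-up effects like the one described below visible.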

As you might expect, for small numbers of documents, Weaviate and ElasticSearch are roughly level in terms of speed. As the number of documents increases, however, Weaviate really starts to shine. Whereas ElasticSearch can take up to a minute for some queries, Weaviate consistently returns results in a fraction of that time.

(Another interesting thing I noticed is that the first queries take longer, almost as if the engine is ‘warming up’. Because I have seen this happen for other queries too, I don’t think it is specific to the queries used here.)

Weaviate’s performance comes at a price

Reaching 100k documents in Weaviate was not an easy task. My instance kept crashing as I was writing new documents to Weaviate! Very helpfully, the people at Weaviate explained that I needed more RAM, and a lot of it. Fortunately, I could simply shut down my EC2 instance and upgrade it. I found I needed a g4dn.8xlarge instance to write 100k documents to Weaviate successfully.

For my purposes with Wetzoek at the moment, this was overkill (and far too expensive). For a professional outfit, however, I can imagine Weaviate’s performance is worth the price.

Summary and the future of search

While it took a little bit of wrangling, I was quite pleased by how easy it was to port my Haystack setup on ElasticSearch to Weaviate. The Haystack layer really adds a lot of benefit here, as it makes it much easier for beginners like me to add metadata, set filters, etc. than when using the formats of the underlying search technology.

The interest in semantic search has been booming lately, as not just Weaviate but also Pinecone have raised money. As a user of search myself, however, I believe the power of symbolic search should not be underestimated. Often, matching some exact keywords is what the user wants; not the semantic similarity.

This is certainly a feature I will add to Wetzoek, allowing users to do semantic or symbolic search. In the future, I can imagine a new way of writing search queries: where the user specifies which part of the query should be treated ‘semantically’ and which part should be treated ‘symbolically’; e.g.: (achievements of) [julius caesar] (as emperor), where [julius caesar] instructs symbolic search, but (achievements of) and (as emperor) instruct semantic search.
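That hypothetical syntax is easy to parse. A sketch of what a query parser for it could look like, where `(…)` groups are routed to semantic search and `[…]` groups to symbolic (exact keyword) search; the syntax itself is the speculative proposal above, not an existing feature:

```python
import re

def parse_query(query):
    """Split a mixed query into semantic and symbolic parts.

    (...) groups -> semantic (embedding) search
    [...] groups -> symbolic (exact keyword) search
    """
    semantic = re.findall(r"\(([^)]*)\)", query)
    symbolic = re.findall(r"\[([^\]]*)\]", query)
    return {"semantic": semantic, "symbolic": symbolic}

parts = parse_query("(achievements of) [julius caesar] (as emperor)")
print(parts)
```

A real implementation would then send the symbolic parts to a BM25 retriever as filters or keyword clauses, and the semantic parts to an embedding retriever, merging the two result lists.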
