# Fun With Apache Lucene and BERT Embeddings

After publishing the blog post on neural search with BERT and Solr (6.6.0), I got a few questions on how to run this with version 8.6.x of Solr. It took me a few days of going back and forth, quite honestly a bit of despair, and finally a helping hint from Lucene committer Adrien Grand (https://twitter.com/jpountz/status/1324093784460873731) to solve it. I thought I’d share a few bits on what it took to upgrade the vector query functionality from Solr 6.6 to 8.6.x, and also explain the nitty-gritty details of storing dense embeddings in Lucene and querying them in Solr.

## Background

The original implementation was published in https://github.com/saaay71/solr-vector-scoring, accompanied by an easy-to-follow tutorial on how to set up vector search in Solr 6.6.0. The plugin lets you index vector data representing your documents and query it, scoring documents by cosine or dot product similarity. This plugin is very similar to the Elasticsearch plugin https://github.com/MLnick/elasticsearch-vector-scoring (currently unmaintained, since Elasticsearch got its own implementation of dense vectors and vector-based scoring, https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html, offered under X-Pack, which requires a commercial subscription).

Google’s BERT got a lot of traction in the NLP community because it allowed Google to improve search relevance for 10% of their queries. There have been a few implementations of BERT with Solr in academia, but not many publicly available tutorials on how to implement it at home. That is why I wrote the blog post about going practical with Solr and BERT.

In this post I will go one step more technical and dive into Lucene internals; understanding them will allow us to upgrade to more recent versions of Lucene and Solr.

## BERT embeddings: how to store in Lucene

The solr-vector-scoring plugin stores the vector embeddings in binary format, with each vector dimension encoded as a float. On the configuration level it looks like this. As input we pass whitespace-delimited vector values in the payload format i|j, where i is the dimension id and j is the value along the i-th dimension:

`"vector":"0|1.55 1|3.53 2|2.3 3|0.7 4|3.44 5|2.33 "`

To store such values in Solr, we define the following field type:

```xml
<fieldType name="VectorField" class="solr.TextField" indexed="true" termOffsets="true" stored="true" termPayloads="true" termPositions="true" termVectors="true" storeOffsetsWithPositions="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
  </analyzer>
</fieldType>
```

Under the hood, solr.DelimitedPayloadTokenFilterFactory creates a DelimitedPayloadTokenFilter, which encodes anything to the right of the payload delimiter | in the requested format (float). This data structure is then sent to the Lucene index for the given document.

So, in order to store dense BERT embeddings in the search engine, we first compute them. For the word “mathematics”, we get a 768-dimensional vector (the exact number of dimensions depends on the BERT model you have chosen):

`0.31364498,-0.081439406, …, -0.052598316`

This vector gets transformed into the payload form that Solr understands:

`0|0.31364498 1|-0.081439406 … 767|-0.052598316`

This string is then indexed as a field of type VectorField defined above.
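The transformation itself is straightforward. Here is a minimal sketch in plain Java; `toPayloadString` is a hypothetical helper for illustration, not part of the plugin:

```java
// Hypothetical helper (not part of the plugin): converts a float[] embedding
// into the whitespace-delimited "i|value" payload format that the
// VectorField analyzer expects.
public class PayloadFormat {
    static String toPayloadString(float[] vec) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < vec.length; i++) {
            if (i > 0) sb.append(' ');
            // dimension id, delimiter, value along that dimension
            sb.append(i).append('|').append(vec[i]);
        }
        return sb.toString();
    }
}
```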

## Computing similarity score for dense vectors

When you want to query against this dense vector representation, you need to compute the same kind of embedding for your input query and then a distance between the two vectors. The solr-vector-scoring plugin supports two modes of distance computation: cosine similarity and dot product similarity.
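To make the two modes concrete, here is a minimal, self-contained sketch of both similarity functions in plain Java (reference math only, no Lucene involved):

```java
// Reference implementations of the two similarity modes the plugin supports.
public class Similarity {
    // Dot product: sum of elementwise products of the two vectors.
    static double dot(float[] a, float[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // Cosine: dot product normalized by the product of the vector norms.
    static double cosine(float[] a, float[] b) {
        double na = 0, nb = 0;
        for (float x : a) na += x * x;
        for (float x : b) nb += x * x;
        if (na == 0 || nb == 0) return 0;
        return dot(a, b) / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```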

The plugin implements **two classes** as building blocks for both cosine and dot product scoring: the **VectorQuery** class and the **VectorScoreQuery** class. **VectorQuery** provides a scorer that iterates over all documents in your index and computes the distance between a document embedding and the query embedding:

```java
@Override
public Scorer scorer(LeafReaderContext context) throws IOException {
  return new ConstantScoreScorer(
      this,
      score(),
      DocIdSetIterator.all(context.reader().maxDoc()));
}
```

In the method above, score() returns the score value, which is the measure of similarity (cosine or dot product based). This scoring runs for every single document in the index, because the DocIdSetIterator is configured to visit all documents. This is, by the way, a point for further optimization: instead of visiting all documents, we could do something more clever based on finding nearest neighbours (this is my next challenge to understand and implement).

The **VectorScoreQuery** class implements the “meat” of the vector score computation logic by extending the CustomScoreQuery class found in org.apache.lucene.queries (you need Lucene of at least 6.6.0 and up to 7.7.3, where it is marked as deprecated). This class receives the query vector and computes its distance to the document vector by reading the vector values directly from the Lucene index. Here is the relevant method, implemented in an anonymous subclass of CustomScoreProvider:

```java
@Override
public float customScore(int docID, float subQueryScore, float valSrcScore) throws IOException {
  float score = 0;
  double docVectorNorm = 0;
  LeafReader reader = context.reader();
  Terms terms = reader.getTermVector(docID, field);
  if (vector.size() != terms.size()) {
    throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
        "indexed and input vector array must have same length");
  }
  TermsEnum iter = terms.iterator();
  BytesRef text;
  while ((text = iter.next()) != null) {
    String term = text.utf8ToString();
    float payloadValue = 0f;
    PostingsEnum postings = iter.postings(null, PostingsEnum.ALL);
    while (postings.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
      int freq = postings.freq();
      while (freq-- > 0) postings.nextPosition();
      BytesRef payload = postings.getPayload();
      payloadValue = PayloadHelper.decodeFloat(payload.bytes, payload.offset);
      if (cosine)
        docVectorNorm += Math.pow(payloadValue, 2.0);
    }
    score = (float) (score + payloadValue * (vector.get(Integer.parseInt(term))));
  }
  if (cosine) {
    if ((docVectorNorm == 0) || (queryVectorNorm == 0)) return 0f;
    return (float) (score / (Math.sqrt(docVectorNorm) * Math.sqrt(queryVectorNorm)));
  }
  return score;
}
```

Essentially, the reverse of the payload encoding happens here: we call the decodeFloat() method on the payloads corresponding to the vector stored in the index for the document with id docID. If cosine similarity is requested, we normalize the score by the product of the norms of the document and query vectors.
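For intuition, here is what that encoding does under the hood: Lucene’s PayloadHelper stores a float as its four raw IEEE-754 bytes, big-endian. The sketch below mirrors that logic in self-contained Java (my own re-implementation for illustration, not the Lucene class itself):

```java
// Self-contained illustration of the payload float encoding/decoding,
// mirroring what PayloadHelper.encodeFloat / decodeFloat do: a float is
// stored as its 4 raw IEEE-754 bytes, big-endian.
public class PayloadFloat {
    static byte[] encodeFloat(float payload) {
        int v = Float.floatToIntBits(payload);
        return new byte[] { (byte) (v >> 24), (byte) (v >> 16), (byte) (v >> 8), (byte) v };
    }

    static float decodeFloat(byte[] bytes, int offset) {
        int v = ((bytes[offset] & 0xFF) << 24)
              | ((bytes[offset + 1] & 0xFF) << 16)
              | ((bytes[offset + 2] & 0xFF) << 8)
              |  (bytes[offset + 3] & 0xFF);
        return Float.intBitsToFloat(v);
    }
}
```

Since the exact bit pattern is preserved, the round trip is lossless for every float value.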

## Querying dense vectors with Solr

The same implementation provides the VectorQParserPlugin configured like so:

`<queryParser name="vp" class="com.github.saaay71.solr.VectorQParserPlugin" />`

The query param q, which usually contains the query keywords, is transformed using the local parameters syntax, in which we pass the dense vector representation of the query instead of its plain keywords:

`q={!vp f=vector vector="0.1,4.75,0.3,1.2,0.7,4.0"}`

Again, the query above is a short example to fit on the screen. In the case of BERT, the real vector will have 768 dimensions and will exceed the maximum URL length a GET request can handle. Therefore, you will need to send it as a POST. I’ve in fact implemented a demo with the streamlit library (check it out if you haven’t yet!).

## Upgrading to 8.6.x

Now that we know the inner workings of vector indexing and querying, let’s take a look at upgrading the solr-vector-scoring plugin to Solr 8.6.x. The trick is that, as I mentioned above, CustomScoreQuery was deprecated in 7.7.3 and removed in 8.0.0. Before removing it, Lucene offered a hint: use *org.apache.lucene.queries.function.FunctionScoreQuery*. However, if you look at its source code, you will be surprised to find that it is declared final, and therefore you can’t inherit from it. But thanks to the hint from Adrien Grand, I was able to wrap my input query of type VectorQuery into a FunctionScoreQuery. The one missing ingredient was an instance of DoubleValuesSource. This class does the core of the work, similar to what VectorScoreQuery did in its CustomScoreProvider::customScore() implementation.

But it wasn’t that easy.

At first, I moved all the code for the similarity computation into the doubleValue() method of an anonymous DoubleValues class. This class has two logical parts: computing the value in the doubleValue() method and advancing to the next document in the advanceExact(int doc) method.

For inspiration I looked at subclasses of DoubleValuesSource in Lucene, the closest of which was the TermFreqDoubleValuesSource class, which computes the term frequency for a given term. Here is what it does:

```java
public DoubleValues getValues(LeafReaderContext ctx, DoubleValues scores) throws IOException {
  Terms terms = ctx.reader().terms(this.term.field());
  TermsEnum te = terms == null ? null : terms.iterator();
  if (te != null && te.seekExact(this.term.bytes())) {
    final PostingsEnum pe = te.postings((PostingsEnum) null);
    assert pe != null;
    return new DoubleValues() {
      public double doubleValue() throws IOException {
        return (double) pe.freq();
      }

      public boolean advanceExact(int doc) throws IOException {
        if (pe.docID() > doc) {
          return false;
        } else {
          return pe.docID() == doc || pe.advance(doc) == doc;
        }
      }
    };
  } else {
    return DoubleValues.EMPTY;
  }
}
```

As you can see, it tracks a PostingsEnum for the given term and advances the postings to the next document after extracting the term frequency for the current one.

But I don’t have an exact term to load! This is because an input query can have *any* float values representing the query embedding, and the only reason the exact same sequence of floats would appear in my index is that I’d stored that exact document, which defeats the purpose. We want to find documents that are *similar* to our query’s vector, not exactly the same. And using BERT embeddings, we aspire to find *semantically similar* documents.

Next followed a remote debugging session with IntelliJ IDEA and Solr, in an attempt to understand the search process. If you are like me and need to debug Solr occasionally: I documented how to do it here. During the session I could clearly see that if advanceExact() returned false, then, very logically, doubleValue() wouldn’t be called for the given document and a default score of 0.0 would be returned to the client. As I was mimicking the implementation in TermFreqDoubleValuesSource, I ran into various issues like ArrayIndexOutOfBoundsException and NullPointerException.

So I took a step back and thought about which building blocks I needed. First of all, as I didn’t have an exact term to search for, I had to iterate over all documents. Next, for each document I needed to extract the payload values and compute the necessary quantities, such as the norms and the final distance score. Last, but not least, I needed access to *all* of the terms of the given vector field.

This brought me to thinking that the advanceExact(int doc) method should be bound to the Terms and TermsEnum of the query field in a given document:

```java
public boolean advanceExact(int doc) throws IOException {
  terms = reader.getTermVector(doc, field);
  if (terms == null) {
    return false;
  }
  te = terms.iterator();
  return true;
}
```

And once we have the TermsEnum instance te, we can compute the score by iterating over all payload components (we will have exactly 768 of them) like so:

```java
public double doubleValue() throws IOException {
  double docVectorNorm = 0.0;
  double score = 0;
  BytesRef text;
  while ((text = te.next()) != null) {
    String term = text.utf8ToString();
    if (term.isEmpty()) {
      continue;
    }
    float payloadValue = 0f;
    PostingsEnum postings = te.postings(null, PostingsEnum.ALL);
    while (postings.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
      int freq = postings.freq();
      while (freq-- > 0) postings.nextPosition();
      BytesRef payload = postings.getPayload();
      payloadValue = PayloadHelper.decodeFloat(payload.bytes, payload.offset);
      if (cosine)
        docVectorNorm += Math.pow(payloadValue, 2.0);
    }
    score = score + payloadValue * vector.get(Integer.parseInt(term));
  }
  if (cosine) {
    // return early to avoid dividing by a zero norm
    if ((docVectorNorm == 0) || (queryVectorNorm == 0)) return 0.0;
    score = score / (Math.sqrt(docVectorNorm) * Math.sqrt(queryVectorNorm));
  }
  return score;
}
```

## Putting it all together

Next, I configured Solr 8.0.0 with the query parser, a _text_ field for the document’s plain text, and a vector field for the BERT embedding. I indexed 1000 DBPedia abstracts using index_dbpedia_abstracts.py and launched a streamlit demo in search_demo.py (all these scripts, along with the bert and solr clients, can be found in https://github.com/DmitryKey/bert-solr-search).

Searching for ‘mathematics’ took 110 ms, so scoring 1000 documents is pretty quick.

A nice property of BERT is that searching for ‘math’ gives the same top documents.

For analyzing relevance, it may be useful to visualize the distribution of scores from top to bottom. The same demo script implements this.

I hope this Lucene tech study is useful for understanding the inner building blocks of a neural search implementation. If you are interested in setting up neural search for your data, I recommend starting with the original post.

If you have questions or feedback, comment below or create issues on GitHub.