Fun With Apache Lucene and BERT Embeddings

Dmitry Kan
Published in The Startup
8 min read · Nov 15, 2020

After publishing the blog post on neural search with BERT and Solr (6.6.0), I got a few questions on how to run it with Solr 8.6.x. Getting there took me a few days of going back and forth, quite honestly a bit of despair, and finally a helpful hint from Lucene committer Adrien Grand (https://twitter.com/jpountz/status/1324093784460873731). I thought I’d share what it took to upgrade the vector query functionality from Solr 6.6 to 8.6.x, and also explain the nitty-gritty details of storing dense embeddings in Lucene and querying them in Solr.

BERT with Lucene in mind

Background

The original implementation was published at https://github.com/saaay71/solr-vector-scoring, accompanied by an easy-to-follow tutorial on setting up vector search in Solr 6.6.0. The plugin lets you index vector data representing your documents and query it, scoring documents by cosine or dot product similarity. It is very similar to the Elasticsearch plugin https://github.com/MLnick/elasticsearch-vector-scoring (currently unmaintained, since Elasticsearch got its own implementation of dense vectors and vector-based scoring, https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html, offered under X-Pack, which requires a commercial subscription).
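To make the two scoring modes concrete, here is a minimal sketch in plain Python of what dot product and cosine scoring compute. It is not the plugin's code: the function names, the `rank` helper, and the toy 3-dimensional vectors are all made up for illustration (real BERT embeddings typically have hundreds of dimensions).

```python
import math


def dot_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat a zero vector as "no similarity"
    return dot_product(a, b) / (norm_a * norm_b)


def rank(query, docs, similarity=cosine_similarity):
    """Score every document vector against the query, best match first."""
    scored = [(doc_id, similarity(query, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy "embeddings" keyed by document id.
docs = {
    "doc1": [1.0, 0.0, 0.0],
    "doc2": [0.5, 0.5, 0.0],
    "doc3": [0.0, 0.0, 2.0],
}
query = [1.0, 0.0, 0.0]

ranking = rank(query, docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 4))
```

Cosine similarity ignores vector length and compares direction only, while the dot product also rewards longer vectors, so the two modes can rank the same documents differently unless the embeddings are normalized.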

Dmitry Kan
Founder and host of Vector Podcast, tech team lead, software engineer, manager, but also: cat lover and cyclist. Host: https://www.youtube.com/c/VectorPodcast