Fun With Apache Lucene and BERT Embeddings
--
After publishing the blog post on neural search with BERT and Solr (6.6.0), I got a few questions on how to run this with version 8.6.x of Solr. It took me a few days of going back and forth, quite honestly a bit of despair, and finally a helpful hint from the Lucene committer Adrien Grand (https://twitter.com/jpountz/status/1324093784460873731) to solve the problem. I thought I’d share a few bits on what it took to upgrade the vector query functionality from Solr 6.6 to 8.6.x and also explain the nitty-gritty details of storing dense embeddings in Lucene and querying them in Solr.
Background
The original implementation was published at https://github.com/saaay71/solr-vector-scoring, accompanied by an easy-to-follow tutorial on how to set up vector search in Solr 6.6.0. The plugin lets you index vector data representing your documents and query it, scoring documents by cosine or dot product similarity. This plugin is very similar to the Elasticsearch plugin https://github.com/MLnick/elasticsearch-vector-scoring (currently unmaintained, since Elasticsearch got its own implementation of dense vectors and vector-based scoring: https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html, offered under X-Pack, which requires a commercial subscription).
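To make the two scoring modes concrete, here is a minimal sketch of dot product and cosine similarity in plain Python. It is not the plugin's code (the plugin is written in Java and works on Lucene payloads); the function names, the toy three-dimensional vectors and the document ids are all made up for illustration.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product normalised by the lengths of both vectors."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot_product(a, b) / (norm_a * norm_b)


# Hypothetical query and document vectors (real BERT embeddings have
# hundreds of dimensions).
query = [1.0, 0.0, 1.0]
docs = {
    "doc1": [1.0, 0.0, 1.0],
    "doc2": [0.0, 1.0, 0.0],
    "doc3": [2.0, 0.0, 0.0],
}

# Rank document ids by cosine similarity to the query, best match first.
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
```

Note that doc1 and doc3 get the same dot product against this query, while cosine similarity prefers doc1 because it points in the same direction as the query. That is the practical difference between the two modes: dot product is sensitive to vector length, cosine is not.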
Google’s BERT got a lot of traction in the NLP community because it allowed Google to improve the search relevance for 10% of their queries. There have been a few implementations of BERT with Solr in academia, but not many publicly available tutorials on how to implement it at home. That is why I wrote the blog post about going practical with Solr and BERT.
In this post I will go one step deeper technically and dive into Lucene internals; understanding them will allow us to upgrade to more recent versions of Lucene and Solr.
BERT embeddings: how to store in Lucene
The solr-vector-scoring plugin stores the vector embeddings in binary format, with each vector dimension encoded as a float. On the configuration level it looks like this. As input we pass whitespace-delimited vector values in the payload format i|j, where i is the dimension id and j is the value along the i-th dimension: