# Fun With Apache Lucene and BERT Embeddings

After publishing the blog post on neural search with BERT and Solr (6.6.0), I got a few questions on how to run this with version 8.6.x of Solr. It took me a few days of going back and forth, quite honestly a bit of despair, and finally a helping hint from Lucene committer Adrien Grand (https://twitter.com/jpountz/status/1324093784460873731) to solve it. I thought I’d share a few bits on what it took to upgrade the vector query functionality from Solr 6.6 to 8.6.x, and also explain the nitty-gritty details of storing dense embeddings in Lucene and querying them in Solr.

## Background

The original implementation was published in https://github.com/saaay71/solr-vector-scoring, accompanied by an easy-to-follow tutorial on how to set up vector search in Solr 6.6.0. The plugin lets you index vector data representing your documents and query it, scoring documents by cosine or dot product similarity. This plugin is very similar to the Elasticsearch plugin https://github.com/MLnick/elasticsearch-vector-scoring (currently unmaintained, since Elasticsearch got its own implementation of dense vectors and vector-based scoring, https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html, offered under X-Pack, which requires a commercial subscription).

Google’s BERT got a lot of traction in the NLP community because it allowed Google to improve search relevance for 10% of their queries. There have been a few implementations of BERT with Solr in academia, but not many publicly available tutorials on how to implement it at home. That is why I wrote the blog post about going practical with Solr and BERT.

In this post I will go one step more technical and dive into Lucene internals; understanding them will allow us to upgrade to more recent versions of Lucene and Solr.

## BERT embeddings: how to store in Lucene

The solr-vector-scoring plugin stores the vector embeddings in binary format, with each vector dimension encoded as a float. On the configuration level it looks like this. As input we pass whitespace-delimited vector values in the payload format i|j, where i is the dimension id and j is the value along the i-th dimension:

`"vector":"0|1.55 1|3.53 2|2.3 3|0.7 4|3.44 5|2.33 "`

To store such values in Solr, we define the following field type:

```xml
<fieldType name="VectorField" class="solr.TextField" indexed="true" termOffsets="true" stored="true" termPayloads="true" termPositions="true" termVectors="true" storeOffsetsWithPositions="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
  </analyzer>
</fieldType>
```

Under the hood, solr.DelimitedPayloadTokenFilterFactory creates a DelimitedPayloadTokenFilter, which encodes anything to the right of the payload delimiter | in the requested format (float). This data structure is then sent to the Lucene index for the given document.

So, in order to store dense BERT embeddings in the search engine, we first compute them. For the word “mathematics”, we get a 768-dimensional vector (the exact number of dimensions depends on the BERT model you have chosen):

`0.31364498,-0.081439406, …, -0.052598316`

This vector gets transformed into the payload form that Solr understands:

`0|0.31364498 1|-0.081439406 … 767|-0.052598316`

This string is then indexed as a field of type VectorField defined above.
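The transformation itself is straightforward. Here is a minimal sketch in plain Java; `toPayloadString` is a hypothetical helper for illustration, not part of the plugin:

```java
// Hypothetical helper (not part of the plugin): converts a float[] embedding
// into the whitespace-delimited "i|value" payload format that the
// VectorField analyzer expects.
public class PayloadFormat {
    static String toPayloadString(float[] vec) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < vec.length; i++) {
            if (i > 0) sb.append(' ');
            // dimension id, delimiter, value along that dimension
            sb.append(i).append('|').append(vec[i]);
        }
        return sb.toString();
    }
}
```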

## Computing similarity score for dense vectors

When you want to query against this dense vector representation, you need to compute the same kind of embedding for your input query and then a distance between the two vectors. The solr-vector-scoring plugin supports two modes of distance computation: cosine similarity and dot product similarity.
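To make the two modes concrete, here is a minimal, self-contained sketch of both similarity functions in plain Java (reference math only, no Lucene involved):

```java
// Reference implementations of the two similarity modes the plugin supports.
public class Similarity {
    // Dot product: sum of elementwise products of the two vectors.
    static double dot(float[] a, float[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // Cosine: dot product normalized by the product of the vector norms.
    static double cosine(float[] a, float[] b) {
        double na = 0, nb = 0;
        for (float x : a) na += x * x;
        for (float x : b) nb += x * x;
        if (na == 0 || nb == 0) return 0;
        return dot(a, b) / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```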

The plugin implements **two classes** as building blocks for both cosine and dot product scoring: the **VectorQuery** class and the **VectorScoreQuery** class. **VectorQuery** provides a scorer that iterates over all documents in your index and computes the distance between a document embedding and the query embedding:

```java
@Override
public Scorer scorer(LeafReaderContext context) throws IOException {
  return new ConstantScoreScorer(
      this,
      score(),
      DocIdSetIterator.all(context.reader().maxDoc()));
}
```

In the method above, score() returns the score value, which is the measure of similarity (cosine or dot product based). This scoring runs for every single document in the index, because the DocIdSetIterator is configured to visit all documents. This is, by the way, a point for further optimization: instead of visiting all documents, we could do something more clever based on finding nearest neighbours (this is my next challenge to understand and implement).

The **VectorScoreQuery** class implements the “meat” of the vector score computation logic by extending the CustomScoreQuery class found in org.apache.lucene.queries (you need Lucene of at least 6.6.0 and up to 7.7.3, where it is marked as deprecated). This class receives the query vector and computes its distance to the document vector by reading the vector values directly from the Lucene index. Here is the relevant method, implemented in an anonymous subclass of CustomScoreProvider:

```java
@Override
public float customScore(int docID, float subQueryScore, float valSrcScore) throws IOException {
  float score = 0;
  double docVectorNorm = 0;
  LeafReader reader = context.reader();
  Terms terms = reader.getTermVector(docID, field);
  if (vector.size() != terms.size()) {
    throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
        "indexed and input vector array must have same length");
  }
  TermsEnum iter = terms.iterator();
  BytesRef text;
  while ((text = iter.next()) != null) {
    String term = text.utf8ToString();
    float payloadValue = 0f;
    PostingsEnum postings = iter.postings(null, PostingsEnum.ALL);
    while (postings.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
      int freq = postings.freq();
      while (freq-- > 0) postings.nextPosition();
      BytesRef payload = postings.getPayload();
      payloadValue = PayloadHelper.decodeFloat(payload.bytes, payload.offset);
      if (cosine)
        docVectorNorm += Math.pow(payloadValue, 2.0);
    }
    score = (float) (score + payloadValue * (vector.get(Integer.parseInt(term))));
  }
  if (cosine) {
    if ((docVectorNorm == 0) || (queryVectorNorm == 0)) return 0f;
    return (float) (score / (Math.sqrt(docVectorNorm) * Math.sqrt(queryVectorNorm)));
  }
  return score;
}
```

Essentially, the reverse of the payload encoding happens here: we call the decodeFloat() method on the payloads corresponding to the vector stored in the index for the document with id docID. If cosine similarity is requested, we normalize the score by the product of the norms of the document and query vectors.
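For intuition, here is what that encoding does under the hood: Lucene’s PayloadHelper stores a float as its four raw IEEE-754 bytes, big-endian. The sketch below mirrors that logic in self-contained Java (my own re-implementation for illustration, not the Lucene class itself):

```java
// Self-contained illustration of the payload float encoding/decoding,
// mirroring what PayloadHelper.encodeFloat / decodeFloat do: a float is
// stored as its 4 raw IEEE-754 bytes, big-endian.
public class PayloadFloat {
    static byte[] encodeFloat(float payload) {
        int v = Float.floatToIntBits(payload);
        return new byte[] { (byte) (v >> 24), (byte) (v >> 16), (byte) (v >> 8), (byte) v };
    }

    static float decodeFloat(byte[] bytes, int offset) {
        int v = ((bytes[offset] & 0xFF) << 24)
              | ((bytes[offset + 1] & 0xFF) << 16)
              | ((bytes[offset + 2] & 0xFF) << 8)
              |  (bytes[offset + 3] & 0xFF);
        return Float.intBitsToFloat(v);
    }
}
```

Since the exact bit pattern is preserved, the round trip is lossless for every float value.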

## Querying dense vectors with Solr

The same implementation provides the VectorQParserPlugin configured like so:

`<queryParser name="vp" class="com.github.saaay71.solr.VectorQParserPlugin" />`

The query param q, which usually contains the query keywords, is transformed using the local parameters syntax, in which we pass the dense vector representation of the query instead of its plain keywords:

`q={!vp f=vector vector="0.1,4.75,0.3,1.2,0.7,4.0"}`

Again, the query above is a short example to fit on the screen. In the case of BERT, the real vector will have 768 dimensions and will exceed the maximum URL length a GET request can handle. Therefore, you will need to send it as a POST. I’ve in fact implemented a demo with the streamlit library (check it out if you haven’t yet!).

## Upgrading to 8.6.x

Now that we know the inner workings of vector indexing and querying, let’s take a look at upgrading the solr-vector-scoring plugin to Solr 8.6.x. The trick is that, as I mentioned above, CustomScoreQuery was deprecated in 7.7.3 and removed in 8.0.0. Before removing it, Lucene offered a hint: use *org.apache.lucene.queries.function.FunctionScoreQuery*. However, if you look at its source code, you will be surprised to find that it is declared final, and therefore you can’t inherit from it. But thanks to the hint from Adrien Grand, I was able to wrap my input query of type VectorQuery into a FunctionScoreQuery. The one missing ingredient was an instance of DoubleValuesSource. This class does the core of the work, similar to what VectorScoreQuery did in its CustomScoreProvider::customScore() implementation.

But it wasn’t that easy.

At first, I moved all the code for the similarity computation into the doubleValue() method of an anonymous DoubleValues class. This class has two logical parts: computing the value in the doubleValue() method and advancing to the next document in the advanceExact(int doc) method.

For inspiration I looked at subclasses of DoubleValuesSource in Lucene, the closest of which was the TermFreqDoubleValuesSource class, which computes the term frequency for a given term. Here is what it does:

```java
public DoubleValues getValues(LeafReaderContext ctx, DoubleValues scores) throws IOException {
  Terms terms = ctx.reader().terms(this.term.field());
  TermsEnum te = terms == null ? null : terms.iterator();
  if (te != null && te.seekExact(this.term.bytes())) {
    final PostingsEnum pe = te.postings((PostingsEnum) null);
    assert pe != null;
    return new DoubleValues() {
      public double doubleValue() throws IOException {
        return (double) pe.freq();
      }

      public boolean advanceExact(int doc) throws IOException {
        if (pe.docID() > doc) {
          return false;
        } else {
          return pe.docID() == doc || pe.advance(doc) == doc;
        }
      }
    };
  } else {
    return DoubleValues.EMPTY;
  }
}
```

As you can see, it tracks a PostingsEnum for the given term and advances the postings to the next document after extracting the term frequency for the current one.

But I don’t have an exact term to load! This is because an input query can have *any* float values representing the query embedding, and the only reason the exact same sequence of floats would appear in my index is that I’d stored that exact document, which defeats the purpose. We want to find documents that are *similar* to our query’s vector, not exactly the same. And using BERT embeddings, we aspire to find *semantically similar* documents.

Next followed a remote debugging session with IntelliJ IDEA and Solr, in an attempt to understand the search process. If you are like me and need to debug Solr occasionally: I documented how to do it here. During the session I could clearly see that if advanceExact() returned false, then, very logically, doubleValue() wouldn’t be called for the given document and a default score of 0.0 would be returned to the client. As I was mimicking the implementation in TermFreqDoubleValuesSource, I ran into various issues like ArrayIndexOutOfBoundsException and NullPointerException.

So I took a step back and thought about which building blocks I needed. First of all, as I didn’t have an exact term to search for, I had to iterate over all documents. Next, for each document I needed to extract the payload values and compute the necessary quantities, such as the norms and the final distance score. Last, but not least, I needed access to *all* of the terms of the given vector field.

This brought me to thinking that the advanceExact(int doc) method should be bound to the Terms and TermsEnum of the query field in a given document:

```java
public boolean advanceExact(int doc) throws IOException {
  terms = reader.getTermVector(doc, field);
  if (terms == null) {
    return false;
  }
  te = terms.iterator();
  return true;
}
```

And once we have the TermsEnum instance te, we can compute the score by iterating over all payload components (we will have exactly 768 of them) like so:

```java
public double doubleValue() throws IOException {
  double docVectorNorm = 0.0;
  double score = 0;
  BytesRef text;
  while ((text = te.next()) != null) {
    String term = text.utf8ToString();
    if (term.isEmpty()) {
      continue;
    }
    float payloadValue = 0f;
    PostingsEnum postings = te.postings(null, PostingsEnum.ALL);
    while (postings.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
      int freq = postings.freq();
      while (freq-- > 0) postings.nextPosition();
      BytesRef payload = postings.getPayload();
      payloadValue = PayloadHelper.decodeFloat(payload.bytes, payload.offset);
      if (cosine)
        docVectorNorm += Math.pow(payloadValue, 2.0);
    }
    score = score + payloadValue * vector.get(Integer.parseInt(term));
  }
  if (cosine) {
    // return early to avoid dividing by a zero norm
    if ((docVectorNorm == 0) || (queryVectorNorm == 0)) return 0.0;
    score = score / (Math.sqrt(docVectorNorm) * Math.sqrt(queryVectorNorm));
  }
  return score;
}
```

## Putting it all together

Next, I configured Solr 8.0.0 with the query parser, a _text_ field for the document’s plain text, and a vector field for the BERT embedding. I indexed 1000 DBPedia abstracts using index_dbpedia_abstracts.py and launched a streamlit demo in search_demo.py (all these scripts, along with the bert and solr clients, can be found in https://github.com/DmitryKey/bert-solr-search).

Searching for ‘mathematics’ took 110 ms, so scoring 1000 documents is pretty quick.

A nice property of BERT is that searching for ‘math’ gives the same top documents.

For analyzing relevance, it may be useful to visualize the distribution of scores from top to bottom. The same demo script implements this.

I hope this Lucene tech study is useful for understanding the inner building blocks of a neural search implementation. If you are interested in setting up neural search for your data, I recommend starting with the original post.

If you have questions or feedback, comment below or create issues on GitHub.