Fun With Apache Lucene and BERT Embeddings

Dmitry Kan
Nov 15

After publishing the blog post on neural search with BERT and Solr (6.6.0), I got a few questions on how to run this with version 8.6.x of Solr. It took me a few days of going back and forth, quite honestly a bit of despair, and finally a helpful hint from Lucene committer Adrien Grand (https://twitter.com/jpountz/status/1324093784460873731) to solve it. I thought I’d share what it took to upgrade the vector query functionality from Solr 6.6 to 8.6.x and also explain the nitty-gritty details of storing dense embeddings in Lucene and querying them in Solr.

Bert with Lucene in mind

The original implementation was published in https://github.com/saaay71/solr-vector-scoring, accompanied by an easy-to-follow tutorial on how to set up vector search in Solr 6.6.0. The plugin allows you to index vector data representing your documents and query it, scoring documents by cosine or dot product similarity. It is very similar to the Elasticsearch plugin https://github.com/MLnick/elasticsearch-vector-scoring (currently unmaintained, since Elasticsearch got its own implementation of dense vectors and vector-based scoring: https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html, offered under X-Pack, which requires a commercial subscription).

Google’s BERT got a lot of traction in the NLP community because it allowed Google to improve search relevance for 10% of their queries. There have been a few implementations of BERT with Solr in academia, but not many publicly available tutorials on how to implement it at home, for instance with Solr. That is why I wrote the blog post about going practical with Solr and BERT.

In this post I will go one step deeper technically and dive into Lucene internals, an understanding of which will allow us to upgrade to more recent versions of Lucene and Solr.

The solr-vector-scoring plugin stores the vector embeddings in binary format, with each vector dimension encoded as a float. On the configuration level it looks like this: as input we pass whitespace-delimited vector values in the payload format i|j, where i is the dimension id and j is the value along the i-th dimension:

"vector":"0|1.55 1|3.53 2|2.3 3|0.7 4|3.44 5|2.33 "

To store such values in Solr, we define the following field type:

<fieldType name="VectorField" class="solr.TextField" indexed="true" termOffsets="true" stored="true" termPayloads="true" termPositions="true" termVectors="true" storeOffsetsWithPositions="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
  </analyzer>
</fieldType>

Under the hood, solr.DelimitedPayloadTokenFilterFactory creates a DelimitedPayloadTokenFilter, which encodes everything to the right of the payload delimiter | in the requested format (float). This data structure is then sent to the Lucene index for the given document.

So, in order to store dense BERT embeddings in the search engine, we first compute them. For the word “mathematics”, we get a 768-dimensional vector (the exact number of dimensions depends on the BERT model you have chosen):

0.31364498,-0.081439406, …, -0.052598316

This vector gets transformed into the payload form that Solr understands:

0|0.31364498 1|-0.081439406 … 767|-0.052598316

This string is then indexed as a field of type VectorField defined above.
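To make the transformation concrete, here is a minimal Python sketch of the payload encoding step described above (the helper name to_payload is mine, not the plugin’s):

```python
def to_payload(vector):
    """Encode a dense vector as the whitespace-delimited i|j payload
    string that the VectorField analyzer chain expects."""
    return " ".join(f"{i}|{v}" for i, v in enumerate(vector))

# a truncated 3-dimensional example; a real BERT vector has 768 entries
print(to_payload([0.31364498, -0.081439406, -0.052598316]))
```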

When you want to query against this dense vector representation, you need to compute the same kind of representation for your input query and then compute a distance between the two vectors. The solr-vector-scoring plugin supports two modes of distance computation: cosine similarity and dot product similarity.
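For reference, both measures are straightforward to write down. A minimal Python sketch (function names are mine):

```python
import math

def dot(a, b):
    # dot product similarity: sum of pairwise products
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # cosine similarity: dot product normalized by the vector norms
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    if na == 0 or nb == 0:
        return 0.0
    return dot(a, b) / (na * nb)
```

Cosine similarity ignores vector magnitude and only compares direction, which is usually what you want when comparing embeddings of texts of different lengths.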

The plugin implements two classes as building blocks for both cosine and dot product scoring: the VectorQuery class and the VectorScoreQuery class. VectorQuery provides a scorer that iterates over all documents in your index and computes the distance between a document embedding and the query embedding:

@Override
public Scorer scorer(LeafReaderContext context) throws IOException {
  return new ConstantScoreScorer(
      this,
      score(),
      DocIdSetIterator.all(context.reader().maxDoc()));
}

In the method above, score() returns the score value, which is the measure of similarity (cosine or dot product based). This scoring runs for every single document in the index, because the DocIdSetIterator is configured to visit all documents. This, by the way, is a point for further optimization: instead of visiting all documents, we could do something more clever based on finding nearest neighbours (this is my next challenge to understand and implement).
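In other words, the scorer amounts to a brute-force scan. A Python sketch of the equivalent logic (names are mine, and I use dot product scoring for brevity):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def brute_force_search(doc_vectors, query_vector, top_k=10):
    # visit every document, the analog of DocIdSetIterator.all(maxDoc)
    scored = [(doc_id, dot(vec, query_vector))
              for doc_id, vec in enumerate(doc_vectors)]
    # rank by similarity, highest first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

This is O(number of documents) per query, which is fine for a thousand documents but is exactly what approximate nearest-neighbour methods are meant to avoid at scale.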

The VectorScoreQuery class implements the “meat” of the vector score computation by extending the CustomScoreQuery class found in org.apache.lucene.queries (you need Lucene of at least 6.6.0 and up to 7.7.3, where it is marked as deprecated). This class receives the query vector and computes its distance to the document vector by reading the vector values directly from the Lucene index. Here is the relevant method, implemented in an anonymous subclass of CustomScoreProvider:

@Override
public float customScore(int docID, float subQueryScore, float valSrcScore) throws IOException {
  float score = 0;
  double docVectorNorm = 0;
  LeafReader reader = context.reader();
  Terms terms = reader.getTermVector(docID, field);
  if (vector.size() != terms.size()) {
    throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
        "indexed and input vector array must have same length");
  }
  TermsEnum iter = terms.iterator();
  BytesRef text;
  while ((text = iter.next()) != null) {
    String term = text.utf8ToString();
    float payloadValue = 0f;
    PostingsEnum postings = iter.postings(null, PostingsEnum.ALL);
    while (postings.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
      int freq = postings.freq();
      while (freq-- > 0) postings.nextPosition();

      BytesRef payload = postings.getPayload();
      payloadValue = PayloadHelper.decodeFloat(payload.bytes, payload.offset);

      if (cosine)
        docVectorNorm += Math.pow(payloadValue, 2.0);
    }

    score = (float) (score + payloadValue * (vector.get(Integer.parseInt(term))));
  }

  if (cosine) {
    if ((docVectorNorm == 0) || (queryVectorNorm == 0)) return 0f;
    return (float) (score / (Math.sqrt(docVectorNorm) * Math.sqrt(queryVectorNorm)));
  }

  return score;
}

Essentially, the payload encoding is reversed here: we call the decodeFloat() method on the payloads corresponding to the vector stored in the index for the document with id docID. If cosine similarity is requested, we normalize the score by the product of the norms of the document and query vectors.
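If I read PayloadHelper’s source correctly, it stores the IEEE-754 bits of the float in big-endian byte order. A Python sketch of the round trip (the big-endian assumption is mine and worth verifying against your Lucene version):

```python
import struct

def encode_float(value):
    # pack the IEEE-754 bits big-endian, my reading of
    # PayloadHelper.encodeFloat (an assumption, not the actual Lucene code)
    return struct.pack(">f", value)

def decode_float(data, offset=0):
    # the counterpart of PayloadHelper.decodeFloat(bytes, offset)
    return struct.unpack_from(">f", data, offset)[0]
```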

The same implementation provides the VectorQParserPlugin, configured like so:

<queryParser name="vp" class="com.github.saaay71.solr.VectorQParserPlugin" />

The query param q, which usually contains the query keywords, is transformed using the local parameters syntax, in which we pass the dense vector representation of the query instead of its plain keywords:

q={!vp f=vector vector="0.1,4.75,0.3,1.2,0.7,4.0"}

Again, the query above is a short example to fit on the screen. In the case of BERT, the real vector will have 768 dimensions and will exceed the maximum length a GET request can handle. Therefore, you will need to send it as a POST request. I’ve in fact implemented a demo with the streamlit library (check it out if you haven’t yet!).
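A hedged sketch of how such a POST request could be assembled in Python (the helper name, the URL, and the core name are placeholders of mine, not from the plugin):

```python
import urllib.parse

def build_vector_query(field, vector):
    # assemble the local-params query string, e.g.
    # {!vp f=vector vector="0.1,4.75,0.3"}
    values = ",".join(str(v) for v in vector)
    return '{!vp f=%s vector="%s"}' % (field, values)

q = build_vector_query("vector", [0.1, 4.75, 0.3])
# form-encode the body so the 768-dimensional vector travels in the
# POST body instead of the URL
body = urllib.parse.urlencode({"q": q})
# e.g. requests.post("http://localhost:8983/solr/<core>/select", data=body,
#      headers={"Content-Type": "application/x-www-form-urlencoded"})
```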

Now that we know the inner workings of vector indexing and querying, let’s take a look at upgrading the solr-vector-scoring plugin to the 8.6.x version of Solr. The trick is that, as I mentioned above, CustomScoreQuery got deprecated in 7.7.3 and removed in 8.0.0. Before removing it, Lucene offered a hint: use org.apache.lucene.queries.function.FunctionScoreQuery. However, if you look at its source code, you will be surprised that it is declared final, so you can’t inherit from it. But, thanks to the hint from Adrien Grand, I was able to wrap my input query of type VectorQuery into a FunctionScoreQuery. The one missing ingredient was an instance of DoubleValuesSource. This class does the core of the work, similar to what VectorScoreQuery did in its CustomScoreProvider::customScore() implementation.

But it wasn’t that easy.

At first, I moved all the code for the similarity computation into the doubleValue() method of an anonymous DoubleValues class. This class has two logical parts: computing the value in the doubleValue() method and advancing to the next document in the advanceExact(int doc) method.

For inspiration I looked at subclasses of DoubleValuesSource in Lucene, the closest of which was the TermFreqDoubleValuesSource class, which computes the term frequency for a given term. Here is what it does:

public DoubleValues getValues(LeafReaderContext ctx, DoubleValues scores) throws IOException {
  Terms terms = ctx.reader().terms(this.term.field());
  TermsEnum te = terms == null ? null : terms.iterator();
  if (te != null && te.seekExact(this.term.bytes())) {
    final PostingsEnum pe = te.postings((PostingsEnum) null);

    assert pe != null;

    return new DoubleValues() {
      public double doubleValue() throws IOException {
        return (double) pe.freq();
      }

      public boolean advanceExact(int doc) throws IOException {
        if (pe.docID() > doc) {
          return false;
        } else {
          return pe.docID() == doc || pe.advance(doc) == doc;
        }
      }
    };
  } else {
    return DoubleValues.EMPTY;
  }
}

As you can see, it tracks a PostingsEnum for the given term and advances the postings to the next document after extracting the term frequency for the current document.

But I don’t have an exact term to load! This is because an input query can contain any float values representing the query embedding, and the only reason the exact same sequence of floats would appear in my index is if I had stored that exact document. Which defeats the purpose: we want to find documents that are similar to our query’s vector, not exactly the same. With BERT embeddings we aspire to find semantically similar documents.

Next followed a remote debugging session with IntelliJ IDEA and Solr in an attempt to understand the search process. If you are like me and need to debug Solr occasionally, I documented how to do it here. During the session I could clearly see that if advanceExact() returned false, then, very logically, doubleValue() wouldn’t be called for the given document and a default score of 0.0 would be returned to the client. While mimicking the implementation in TermFreqDoubleValuesSource, I ran into various issues like ArrayIndexOutOfBoundsException and NullPointerException.

So I took a step back and thought about what building blocks I needed. First, as I didn’t have an exact term to search for, I had to iterate over all documents. Next, for each document I needed to extract the payload values and compute the necessary quantities, like the norms and the final distance score. Last, but not least, I needed access to all of the terms of the given vector field.

This led me to the conclusion that the advanceExact(int doc) method should bind the Terms and TermsEnum for the query field in the given document:

public boolean advanceExact(int doc) throws IOException {
  terms = reader.getTermVector(doc, field);
  if (terms == null) {
    return false;
  }
  te = terms.iterator();
  return true;
}

And once we have the TermsEnum instance te, we can compute the score by iterating over all payload components (we will have exactly 768 of them) like so:

public double doubleValue() throws IOException {
  double docVectorNorm = 0.0;
  double score = 0;
  BytesRef text;
  while ((text = te.next()) != null) {
    String term = text.utf8ToString();
    if (term.isEmpty()) {
      continue;
    }
    float payloadValue = 0f;
    PostingsEnum postings = te.postings(null, PostingsEnum.ALL);
    while (postings.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
      int freq = postings.freq();
      while (freq-- > 0) postings.nextPosition();

      BytesRef payload = postings.getPayload();
      payloadValue = PayloadHelper.decodeFloat(payload.bytes, payload.offset);

      if (cosine)
        docVectorNorm += Math.pow(payloadValue, 2.0);
    }

    score = score + payloadValue * vector.get(Integer.parseInt(term));
  }

  if (cosine) {
    // return early on a zero norm to avoid dividing by zero
    if ((docVectorNorm == 0) || (queryVectorNorm == 0)) return 0;
    return score / (Math.sqrt(docVectorNorm) * Math.sqrt(queryVectorNorm));
  }
  return score;
}
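To double-check the arithmetic, here is a small Python model of doubleValue() (the function name and the dict-based term-to-payload representation are mine, standing in for what the TermsEnum iteration reads back from the index):

```python
import math

def double_value(term_payloads, query_vector, cosine=True):
    # term_payloads maps each dimension id (the indexed "term") to its
    # decoded payload float, mirroring the TermsEnum loop above
    query_norm = math.sqrt(sum(v * v for v in query_vector))
    doc_norm_sq = 0.0
    score = 0.0
    for term, payload in term_payloads.items():
        if cosine:
            doc_norm_sq += payload ** 2
        score += payload * query_vector[int(term)]
    if cosine:
        # guard the zero norms, as in the Java version
        if doc_norm_sq == 0 or query_norm == 0:
            return 0.0
        return score / (math.sqrt(doc_norm_sq) * query_norm)
    return score
```

A document whose vector equals the query vector scores exactly 1.0 under cosine similarity, which is a handy sanity check.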

Next, I configured Solr 8.0.0 with the query parser, a _text_ field for the document plain text, and a vector field for the BERT embedding. I indexed 1000 DBPedia abstracts using index_dbpedia_abstracts.py and launched a streamlit demo in search_demo.py (all these scripts, along with the BERT and Solr clients, can be found in https://github.com/DmitryKey/bert-solr-search).

Searching for ‘mathematics’ took 110 ms, so scoring 1000 documents is pretty quick.


A nice property of BERT embeddings is that searching for ‘math’ gives the same top documents.

To analyze relevance, it may be useful to visualize the distribution of scores from top to bottom. The same demo script implements this:

Score distribution from top doc down to 10-th (left to right) for the query ‘theorem proving’

I hope this Lucene tech study is useful for understanding the inner building blocks of implementing neural search. If you are interested in setting up neural search for your data, I recommend starting with the original post.

If you have questions or feedback, comment below or create issues on GitHub.

The Startup

Medium's largest active publication, followed by +734K people. Follow to join our community.

Dmitry Kan

Written by

Founder, tech team lead, software engineer, manager, but also: cat lover and cyclist. Follow me on Twitter: twitter.com/@DmitryKan
