How Peaksys Developed a Cutting-Edge SolR Component

Emmanuel Gosse
Peaksys Engineering
8 min read · Nov 17, 2023
Peaksys SolR Autocomplete Component

Our search engine receives 1.5 billion queries a year, but, as strange as it may seem, user query autocompletion was previously handled outside of our IT system, via a SaaS service. For reasons of both cost savings and technology control, we chose to bring the service back in-house.

Solr and Elastic (open-source search engines) have offered basic autocomplete components for many years, but their relevance has never matched that of SaaS services. So we redeveloped these features ourselves, and we are now offering them as open source to the community.

Cdiscount Autocomplete Website

About us

A few years ago, Cdiscount’s tech subsidiary Peaksys set up teams specializing in new topics revolving around search engines, data science and AI. After a decade of Lucene development on e-commerce topics, they took up the challenge of developing a new state-of-the-art autocompletion component in less than a month.

What you will find in this article

  • Why we chose to create a SolR autocomplete component and our thoughts on the limits of existing components.
  • A feature analysis of the best-known e-commerce autocompletions: Google and Amazon.
  • A new Lucene query that makes it possible to rival the relevance of SaaS solutions.
  • The most interesting code blocks from these developments; the full source code is available in the Cdiscount/Peaksys GitHub repository: https://github.com/Cdiscount/solr-autocomplete.

Why share all this?

First, because open-source sharing is essential. Doug Cutting, “his” Lucene and the creative power it brought to the search world are still active today, more than 20 years later. And second, we are simply happy to give something back.

About standard plugins

With the standard plugins, you have the choice between a word-splitting solution with Ngram and typo tolerance with Fuzzy. You can layer them, but you cannot mix their features, meaning that you cannot match a partially written expression that also contains errors.

That really is too bad, as our customers do that a lot, and that is why we were soon convinced that good autocompletion is based primarily on a query, not a prepared structure.

About SolR standard suggesters

Autocomplete features

Google Autocomplete example

Based on common sense, and especially on analysis of Google’s and Amazon’s autocomplete relevance, here are the basic features of a good system. Of course, there is nothing new or outstanding here:

  • Multi-word matching,
  • Partial-word matching (mostly on the last word),
  • Fault tolerance,
  • Concatenated word matching,
  • Split word matching,
  • Finally, business value or any statistics you may want to add.

Now, let us transpose this into a Lucene query.

Relevance with a Lucene query

As mentioned earlier, we preferred building a search query to relying on a prepared autocomplete structure. This is how we built it; the code blocks below are written in a meta language so that they are easy to understand.

At each step, a new block of code will be added and explained.

Part 1 — Match simple words + Partial final word + Business value.

part 1

The ‘for’ loop and the specific block for the last word are a classic pattern. The first N-1 words are considered complete (N being the number of words). So, textField contains the original text, and this query looks for the words in it.

You can notice the simple Ngram search for the last word as it may still be unfinished. The business value is added via the weightField field.
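
To make this more concrete, here is a minimal sketch of such a query in plain Lucene. It is not the actual Peaksys code: the field names follow the article (textField, weightField), the boosts are examples only, and a PrefixQuery stands in for the Ngram search on the unfinished last word.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.function.FunctionScoreQuery;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.DoubleValuesSource;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class Part1QuerySketch {

    /** Builds a sketch of the Part 1 query; assumes at least one word in the user input. */
    public static Query build(String[] words) {
        BooleanQuery.Builder bq = new BooleanQuery.Builder();
        int n = words.length;

        // The first N-1 words are considered complete: plain term matches on textField.
        for (int i = 0; i < n - 1; i++) {
            bq.add(new BoostQuery(new TermQuery(new Term("textField", words[i])), 1.0f),
                   BooleanClause.Occur.SHOULD);
        }

        // The last word may still be unfinished: match it as a prefix, with a lower weight.
        bq.add(new BoostQuery(new PrefixQuery(new Term("textField", words[n - 1])), 0.8f),
               BooleanClause.Occur.SHOULD);

        // Business value: multiply the textual score by the weightField value.
        return FunctionScoreQuery.boostByValue(bq.build(),
                DoubleValuesSource.fromFloatField("weightField"));
    }
}
```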

Part 2 — Typo tolerance with Fuzzy

part 2
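
As an illustration of what this step could look like in plain Lucene (the edit distances and boosts are examples only, not the component's actual values), the typo-tolerance clauses would be appended to the same Boolean query before the business-value boost is applied:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.FuzzyQuery;

public class Part2FuzzySketch {

    /** Adds illustrative typo-tolerance clauses; assumes at least one word in the user input. */
    public static void addFuzzyClauses(BooleanQuery.Builder bq, String[] words) {
        int n = words.length;

        // Finished words: allow one edit, with a moderate weight.
        for (int i = 0; i < n - 1; i++) {
            bq.add(new BoostQuery(new FuzzyQuery(new Term("textField", words[i]), 1), 0.5f),
                   BooleanClause.Occur.SHOULD);
        }

        // The word still being typed: more aggressive (two edits), lower weight.
        bq.add(new BoostQuery(new FuzzyQuery(new Term("textField", words[n - 1]), 2), 0.3f),
               BooleanClause.Occur.SHOULD);
    }
}
```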

Part 3 — Looking for concatenated and split words

part 3
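
As an illustration only (the actual Part 3 query is in the repository), split input such as “i phone” can be covered at query time by also trying the concatenation of adjacent words as a single term; the boost value is again just an example:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.TermQuery;

public class Part3ConcatSketch {

    /** Illustrative handling of split input: "i phone" should still match a document containing "iphone". */
    public static void addConcatenationClauses(BooleanQuery.Builder bq, String[] words) {
        for (int i = 0; i < words.length - 1; i++) {
            String glued = words[i] + words[i + 1];
            bq.add(new BoostQuery(new TermQuery(new Term("textField", glued)), 0.4f),
                   BooleanClause.Occur.SHOULD);
        }
        // The opposite case (the user glues words that are separate in the index,
        // e.g. "appleiphone") is typically handled on the index side, for instance
        // with an analysis chain that also indexes concatenated or decomposed forms.
    }
}
```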

Global view


I have deliberately omitted the weights of each element here.

Key points

Notice that the code block is separated into two parts: one for “finished” words, the other for the word the user is currently typing. Since the user may still be typing this last word and could make a mistake, the trigger coefficients and weights of Fuzzy and Ngram are different and more aggressive than those for finished words.

You can also note that the weights decrease: although we must assume the customer is writing correctly, they may have made mistakes, omitted parts of words, or split or concatenated others.

The overall idea is that the less probable the fault, the lower the weight. The weights and coefficients represented here are only examples to show global and expected behavior. They are all editable and can be changed for your specific use.
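
As a purely illustrative summary of that idea, the weights could be organized roughly like this (example values only; as noted above, they are all editable and must be tuned against your own data):

```java
/** Example weight table only; the actual values are configurable and data-dependent. */
public final class AutocompleteWeights {
    public static final float EXACT_WORD = 1.0f; // complete, correctly spelled word
    public static final float PARTIAL    = 0.8f; // Ngram/prefix match on the word being typed
    public static final float FUZZY      = 0.5f; // word with a probable typo
    public static final float SPLIT_JOIN = 0.4f; // split or concatenated words

    private AutocompleteWeights() {}
}
```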

Relevance Measure

If this article ended here with this query description, you might think it was a joke, like the hero killing the villain at the beginning of a movie.

But very simple tests show that one of the main behaviors is missing: this is autocompletion, yet the query matches words regardless of their position in the text, when it should take word position into account.

Term position matters

One little thing is missing from the Lucene query toolkit to tie all these rules together, the feature that binds and rules them all: term position matters.

The query should allow us to find a term (exact or fuzzy) and score it according to its position in the phrase and in the field, because an autocomplete feature is supposed to reflect the way a customer thinks and writes: from left to right, with each new word adding further detail about the target.

PositionSpanQuery

As its name suggests, we have implemented this new subquery based on the fact that SpanQueries can access the matching term’s position.

How does it work?

First, note that the main constructor arguments are a SpanQuery, the word’s expected position, and an array of values used as weights.

When the collectLeaf function walks through the matching terms in the field, it records each term’s position, which is then used at scoring time to determine the “distance” between the expected position (given to the constructor by your code) and the best term position (found in collectLeaf).

This difference is used as an index into the weights array, and the corresponding value is applied as a weight in the score.

So, if you fill the array with decreasing values, you get a “from the left” relevance: {1f, 0.8f, 0.7f, 0.6f, 0.5f, 0.4f, 0.3f, 0.2f}
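
To illustrate just the scoring idea, here is a small sketch of the weight lookup (it is not the actual PositionSpanQuery source, which lives in the repository):

```java
/** Minimal sketch of the positional weighting idea, not the actual PositionSpanQuery code. */
public final class PositionWeightSketch {

    private static final float[] WEIGHTS = {1f, 0.8f, 0.7f, 0.6f, 0.5f, 0.4f, 0.3f, 0.2f};

    /** expectedPos comes from the query; bestPos is the matched term position found at collect time. */
    static float positionWeight(int expectedPos, int bestPos) {
        int distance = Math.abs(bestPos - expectedPos);
        // Clamp so that very distant terms still get the smallest weight rather than an error.
        return WEIGHTS[Math.min(distance, WEIGHTS.length - 1)];
    }

    public static void main(String[] args) {
        System.out.println(positionWeight(0, 0)); // 1.0 -> the term is exactly where it was expected
        System.out.println(positionWeight(0, 3)); // 0.6 -> the term is further right, so a lower weight
    }
}
```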

Finally, you can use the PositionSpanQuery and you are all done.

PositionSpanQuery

Let’s try

Building a simple query “apple iphone”

Let us give it a try and look at the results of an “apple iphone” query. To do so, you build a BooleanQuery containing a list of two PositionSpanQuery clauses, as described below.

Each subquery carries the expected term position (0 for “apple” and 1 for “iphone”).
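
Here is a sketch of that construction, assuming the constructor described above (inner SpanQuery, expected position, weights array); check the repository for the exact signature:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanTermQuery;

public class AppleIphoneQuerySketch {

    public static Query build() {
        float[] weights = {1f, 0.8f, 0.7f, 0.6f, 0.5f, 0.4f, 0.3f, 0.2f};

        BooleanQuery.Builder bq = new BooleanQuery.Builder();
        // PositionSpanQuery comes from the solr-autocomplete repository.
        // "apple" is expected at position 0, "iphone" at position 1.
        bq.add(new PositionSpanQuery(new SpanTermQuery(new Term("textField", "apple")), 0, weights),
               BooleanClause.Occur.SHOULD);
        bq.add(new PositionSpanQuery(new SpanTermQuery(new Term("textField", "iphone")), 1, weights),
               BooleanClause.Occur.SHOULD);
        return bq.build();
    }
}
```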

Scores example in PositionSpanQuery

With a very simple two-document index, you can check that the two scores respect the “from the left” effect.

Please note that this type of query allows disordered answers. As such, if no document matches the exact order of your query, you may get an answer anyway, which is exactly what is expected of a relevant autocompletion: first the best answers, then some disordered ones, and sometimes…well sometimes, even the best autocomplete cannot find a result!

Second round: what about Fuzzy?

There is a special Lucene wrapper between SpanQuery and MultiTermQuery Java classes — SpanMultiTermQueryWrapper — which can nest any MultiTermQuery within other SpanQueries.
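
For reference, this is how the standard wrapper is typically used in Lucene 8.x to turn a FuzzyQuery into a SpanQuery that could then feed a positional query:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.spans.SpanMultiTermQueryWrapper;
import org.apache.lucene.search.spans.SpanQuery;

public class FuzzySpanSketch {

    /** Wraps a FuzzyQuery so it can be used wherever a SpanQuery is expected. */
    public static SpanQuery fuzzySpan(String field, String text, int maxEdits) {
        FuzzyQuery fuzzy = new FuzzyQuery(new Term(field, text), maxEdits);
        return new SpanMultiTermQueryWrapper<>(fuzzy);
    }
}
```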

Even though SpanMultiTermQueryWrapper handles document matching correctly and finds the right documents, the scores returned by the nested query (here the FuzzyQuery) are not taken into account at all, a problem we encountered during testing and that needed to be fixed.

We ended up creating a fork for this wrapper: BoostedSpanMultiTermQueryWrapper.

This BoostedSpanMultiTermQueryWrapper allows the final score to be a multiplication of FuzzyQuery score and PositionSpanQuery score.

I will not show any code from BoostedSpanMultiTermQueryWrapper since it is just wrapping data in its embedded technical classes, but feel free to explore the code available on our GitHub: https://github.com/Cdiscount/solr-autocomplete.

A/B Test, Latency and Relevance: Success!

Latency metric

We tested it with custom data and an index of around 500,000 documents, a normal volume for a big e-commerce site, built from searched phrases and product text extractions. How they were selected is not part of this story.

The mean latency of the serving server is about 12 ms, which is quite good for a Java program, so this module can be exposed directly to client requests. In production, this result is combined with other “product suggester” modules, which means customers get their autocomplete answers within 50–100 ms.

Cdiscount Autocompletion Latency time (2023)

So, what about Relevance?

By comparing with the sites mentioned previously, we were able to measure and understand the limits of relevance (the moment when responses become wrong), and we had time to tweak each subquery’s weight and effect. As a result, all the elements of a perfect autocompletion are in place.

Conclusion

2022 Contest vs SaaS solution: Success!

Our new component won the 2-week A/B test against our previous SaaS autocompletion solution on Cdiscount.com global traffic. That is why we are proud to publish and share it today.

In the end, we found:

  • Plus 0.86% in sales volume,
  • Plus 1.08% in mean cart price,
  • No license or SaaS subscription to pay, and therefore a substantial saving of several hundred thousand dollars.

With only a small SolR server cluster (6 servers: 8 CPUs / 4 GB RAM each), we managed to handle more than 200 million queries per month (around 2.5 billion a year).

Next features 🚀

The most obvious thing to do now is to port this component to SolR 9.X, as it is currently adapted to SolR 8.9.X.

There is also a somewhat hidden feature on Google that we really like: Google takes your previous search into account in its autocompletion. They assume that if you launch a second search, it means that you have not achieved your goal and that, perhaps, you are trying another approach. This feature may be part of the next release…

Now, just give it a try: https://github.com/Cdiscount/solr-autocomplete

Special thanks to the happy Find Team: Jérémy Thizy and Christophe Piquet. 😊
