Tuning search relevance with Apache Solr (part 1)

Ovidiu Mihalcea
Sep 1, 2015 · 4 min read

Confronted with increasingly large amounts of data, we need to go the extra mile and really underline the importance of the search box as the first place users go to explore and find the things they need, on their own terms.

Even though we tend to ignore it, relevance ranking cannot be avoided. Only a few organizations have the expertise to build a search experience that responds in a relevant, immediate manner. This is called relevance engineering, and we can help shed some light on the subject.

Have you ever found yourself frustrated when the search does not return the information you were expecting? Worse, have you found yourself assuming that what you are searching for is not on the website, only to find the item on the same site through other means?

The most important aspect of search relevancy to keep in mind is that search engines like Solr are simply complex text-matching systems. They can tell you when a certain term matches in a document, but they are not as adaptable as a human being (for example, a sales representative). Moreover, to understand why Solr needs to be “helped”, you need to understand the algorithm behind its result matching.

Lucene (and thus Solr) uses the Boolean model to find matching documents, and a formula called the practical scoring function to calculate relevance. A positive floating-point number called score represents the relevance of each document. The higher the score, the more relevant the document. However, what we usually mean by relevance is the algorithm that we use to calculate how similar the contents of a text field are to a text query. The standard similarity algorithm used in Solr is term frequency/inverse document frequency, or TF/IDF.

Term frequency refers to how often the searched term appears in the field. The more often, the more relevant. A field containing five mentions of the searched term is likely to be more relevant than a field containing just one mention.

[Figure: term frequency formula]

Inverse document frequency refers to how often the term appears in all documents in the collection. The more often, the lower the weight. The inverse document frequency formula is as follows:

[Figure: inverse document frequency formula]

Another aspect that Solr takes into account when calculating the score is the field-length norm, which reflects how long the field is. The shorter the field, the higher the weight.

[Figure: field-length norm formula]

When we run a simple term query with debugQuery set to true, the explain output shows that the only factors involved in calculating the score are the ones described above.

For example, a search for “fox” in a single-document index: {“text”: “the quick brown fox”}:

[snippet id=”5"]
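As a rough illustration of how the three factors described above combine for this example, here is a minimal Python sketch. It assumes the classic Lucene TF/IDF formulas (tf = sqrt(frequency), idf = 1 + ln(numDocs / (docFreq + 1)), norm = 1 / sqrt(numTerms)) and whitespace tokenization. It ignores query normalization and the lossy way Lucene stores norms, so the numbers Solr reports can differ depending on version and configured similarity. The function names are made up for this sketch and are not part of any Solr API.

```python
import math

def tf(freq):
    # Term frequency: grows with the square root of the number of occurrences.
    return math.sqrt(freq)

def idf(num_docs, doc_freq):
    # Inverse document frequency: terms found in many documents weigh less.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def field_norm(num_terms):
    # Field-length norm: shorter fields weigh more.
    return 1.0 / math.sqrt(num_terms)

# The single-document index from the example above.
docs = [{"text": "the quick brown fox"}]
term = "fox"

tokens = docs[0]["text"].split()          # naive whitespace tokenization (assumption)
freq = tokens.count(term)                 # occurrences of the term in the field
doc_freq = sum(1 for d in docs if term in d["text"].split())

factors = {
    "tf": tf(freq),
    "idf": idf(len(docs), doc_freq),
    "fieldNorm": field_norm(len(tokens)),
}
score = factors["tf"] * factors["idf"] * factors["fieldNorm"]
print(factors, round(score, 4))
```

Under these assumptions the term appears once in a four-term field, so tf is 1, the norm is 0.5, and the product comes out at roughly 0.153.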

The criteria above are used to match a single term against the collection of documents. To match multiple terms, Solr uses the vector space model; you can read more about it here: https://en.wikipedia.org/wiki/Vector_space_model.
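To make the vector space idea a little more concrete, the sketch below represents the query and each document as a vector of term weights and compares them by the cosine of the angle between them. This is a simplification for illustration only: it uses raw term counts as weights and naive whitespace tokenization, whereas Solr weights terms with TF/IDF and does not literally run this code. All names here are hypothetical.

```python
import math
from collections import Counter

def term_vector(text):
    # One dimension per term; the weight is the raw term count (assumption).
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    # Dot product over shared terms, divided by the product of vector lengths.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = term_vector("quick fox")
doc_a = term_vector("the quick brown fox")
doc_b = term_vector("the lazy brown dog")

sim_a = cosine_similarity(query, doc_a)
sim_b = cosine_similarity(query, doc_b)
print(sim_a, sim_b)
```

The first document shares both query terms and scores about 0.71, while the second shares none and scores 0, so a multi-term query ranks the first document higher.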

Beyond this core matching algorithm, much of the work on search relevancy lies in building features that allow fuzzy matching and that boost or weight the right criteria.

One of the first steps we can take to improve result relevance is text analysis, which is the act of normalizing text, both at index time and at query time, to achieve fuzzy matching. One of the simple steps we can take in this direction is term stemming, which is the process of reducing many forms of the same word to a more normalized form (“colored”, “colors”, “coloring” are all reduced to the “color” stem) so that they match regardless of the form in which we index or query them. Solr features stemming filters for multiple languages.

For example, the Romanian and English stemming filters are specified like this:

[snippet id=”6"]

More info about stemming: https://wiki.apache.org/solr/LanguageAnalysis
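The effect of stemming at both index time and query time can be sketched with a toy example. The suffix stripper below is a deliberately naive stand-in, not the algorithm used by Solr's stemming filters; the suffix list, the minimum stem length, and the function names are assumptions made for illustration.

```python
# Suffixes to strip, checked in order (toy rule set, not a real stemmer).
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    # Lowercase the word and strip the first matching suffix,
    # keeping at least three characters of stem.
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def analyze(text):
    # The same analysis chain is applied when indexing and when querying.
    return [toy_stem(token) for token in text.split()]

indexed_terms = set(analyze("Hand colored prints"))   # index time
query_terms = analyze("coloring")                     # query time
matches = all(term in indexed_terms for term in query_terms)
print(indexed_terms, query_terms, matches)
```

Because both sides are reduced to the stem "color", a query for "coloring" matches a document that only contains "colored".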

Another important step for improving search relevance is applying query-time boosts and weights. Say, for example, that you want to search for DSLR camera lens. A match in a product’s title is definitely more important than a match in a product’s description or characteristics. Because to Solr all searchable fields are born equal, we must specify which fields are more important. We can do this using the query-time parameter qf:

[snippet id=”7"]

More info about query fields: https://wiki.apache.org/solr/ExtendedDisMax#qf_.28Query_Fields.29
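As a sketch of how such a request might be assembled from application code, the snippet below builds the parameters for an extended DisMax query with per-field boosts. The field names (title, description, characteristics) and the boost values are hypothetical and would need to match your own schema; the helper name is also made up for this example.

```python
from urllib.parse import urlencode

def build_edismax_params(user_query, field_boosts):
    # qf is a space-separated list of field^boost pairs.
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    return {"defType": "edismax", "q": user_query, "qf": qf}

# Hypothetical fields: a title match counts far more than a description match.
params = build_edismax_params(
    "DSLR camera lens",
    {"title": 10, "description": 2, "characteristics": 1},
)
query_string = urlencode(params)  # ready to append to a select handler URL
print(params["qf"])
print(query_string)
```

Keeping the boosts in a dictionary makes it easy to tune them per field without touching the rest of the query-building code.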

Alongside text analysis and query time boosts and weights, you can also boost phrase matching. What this means is that you can boost documents in which the search terms appear closer together. This is done using the pf query parameter:

[snippet id=”8"]

In the above example, title~2^10 will use the title field with a phrase slop of 2 and a boost of 10.
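To spell out how a pf entry of the form field~slop^boost breaks down, here is a small Python sketch that splits such a value into its parts. The parser is only an illustration written for this article, not something Solr exposes, and the defaults it applies when the slop or boost is omitted (0 and 1.0) are assumptions.

```python
import re

# field, then optional ~slop, then optional ^boost
PF_PATTERN = re.compile(
    r"^(?P<field>[^~^\s]+)(?:~(?P<slop>\d+))?(?:\^(?P<boost>\d+(?:\.\d+)?))?$"
)

def parse_phrase_field(spec):
    # Split a single pf entry such as "title~2^10" into its components.
    match = PF_PATTERN.match(spec)
    if match is None:
        raise ValueError(f"not a valid pf entry: {spec!r}")
    return {
        "field": match["field"],
        "slop": int(match["slop"] or 0),         # assumed default: exact phrase
        "boost": float(match["boost"] or 1.0),   # assumed default: no extra boost
    }

parsed = parse_phrase_field("title~2^10")
print(parsed)
```

Here the entry resolves to the title field, a slop of 2 and a boost of 10.0.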

Search relevancy is becoming more and more critical, because users require good service. In a future article, we will talk about click scoring and spellcheck optimizations.

eMAG TechLabs

On this blog you will find materials written by eMAG Tech community about the projects they are currently developing, the technologies they use and the manner they are using them for best results.

Written by Ovidiu Mihalcea, web developer with a special interest in text mining and data journalism
