“You’ve sent a query to Solr but these aren’t the results you were looking for”

Daniel González Gómez
Empathy.co
Sep 21, 2018

Debugging is the right path to take when we need to find and resolve unexpected software behaviour, and that is also the case when, for example, there's a relevancy problem in Solr. Let's imagine we have a Solr instance with a collection deployed and we want to obtain all the documents containing "Darth Vader." We start by building a query like:
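A sketch of what that request might look like, assuming a local Solr instance and a hypothetical collection named "jedi" (only the "jedi_name" field and the search terms come from this article):

  # Illustrative request; the host and collection name are assumptions.
  curl -G 'http://localhost:8983/solr/jedi/select' \
    --data-urlencode 'q=jedi_name:Darth Vader'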

And we send it to Solr but… wait! These aren't the results we're looking for… Some results are missing and the matched results are not what we expected. So, what's the reason behind this?

The process of sending queries to Solr can be a little opaque, and understanding the results can be difficult when we don't get what we were expecting. Some common results problems include:

  • Results content not matching
  • A different number of results than expected
  • Implausible relevancy scores
  • Poor time performance

Understanding Solr query results

Debugging padawan! This is still a weird situation: you made a wildcard request ("q = *") and you got a lot of results, so the collection is not empty, but your beloved query is not working. We need to tackle this problem as part of your debugging skills training.

First of all, we need more information about how Solr is processing the query. To do this, we should talk about the "debug" parameter and the Solr debugging behaviour provided by the "DebugComponent." Since this extra information is not always required, the behaviour is turned off by default. However, it is the fastest and easiest way to find out what's happening internally with a Solr request, so we need to turn it back on. Don't worry, it's really easy: we just need to specify the "debug" parameter in the troubled request and assign it the "true" value (in older Solr versions this can be specified as "debug=all"):
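For our troubled request, that might look like the following (still using the hypothetical "jedi" collection from the earlier sketch):

  # Same illustrative request as before, now with debugging enabled.
  curl -G 'http://localhost:8983/solr/jedi/select' \
    --data-urlencode 'q=jedi_name:Darth Vader' \
    --data-urlencode 'debug=true'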

After sending the request with the above parameter to Solr we’ll obtain a response containing descriptive information about our request:
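An abbreviated sketch of the "debug" section for our query might look roughly like this (the field names are the ones discussed below; the exact output depends on the Solr version and configuration):

  "debug": {
    "rawquerystring": "jedi_name:Darth Vader",
    "parsedquery": "+(+jedi_name:darth)",
    "parsedquery_toString": "+(+jedi_name:darth)",
    "explain": { ... },
    "QParser": "ExtendedDismaxQParser",
    "timing": { ... }
  }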

Understanding the meaning of every element in the response will facilitate finding a solution to our problem:

  • rawquerystring: Represents the original query content sent to Solr.
  • parsedquery: This is the query after having been translated into the Lucene objects that Solr interprets internally. For example, a wildcard query is represented as "+(+MatchAllDocsQuery(*:*))", whereas our multi-term query without quotation marks is translated as "+(+jedi_name:darth)".
  • parsedquery_toString: A simplified and more readable way of showing the query translated into Lucene objects. The wildcard example would be shown as "+(+*:*)", while our multi-term query without quotation marks would be shown as "+(+jedi_name:darth)".
  • explain: Shows the relevancy score calculation of each document.
  • QParser: The query parser in charge of translating the query. In this case "ExtendedDismaxQParser" is being used, which generates "DisjunctionMaxQueries" based on the user configuration.
  • timing: The time it took for each component of the search to be processed. For example, the output of how long it took for the query to run would be: “query”: {“time”: 0}.

Is there a query parsing problem?

To begin with, we must address an important issue: how is Solr parsing this query? This will help us understand what kind of query we have made and whether the results we have obtained make sense. The query sent to Solr is a multi-term query and, as we can see in the "parsedquery" field above with the "+(+jedi_name:darth)" value, the two terms are not both included in the parsed query. We should have realized that Solr only applies the field to all the terms of a multi-term query if they are inside quotation marks. This is a great clue as to what the problem might be, but let's also rule out other possible causes.
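To make the contrast concrete, here is a hedged sketch of how the two variants of the query are typically parsed (the second line is an approximation; the exact output depends on the schema, the analysis chain, and the parser configuration):

  q=jedi_name:Darth Vader    →  parsedquery_toString: +(+jedi_name:darth)
  q=jedi_name:"Darth Vader"  →  parsedquery_toString: +(+jedi_name:"darth vader")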

Is it a relevancy problem?

This brings us to one of the most difficult concepts to understand within this lesson: the relevancy of data. One of the most important aspects is knowing why our results are ranked the way they are. We can make this part of the analysis easier by getting rid of the rest of the debugging sections. To do this, we can assign the "results" value to the "debug" parameter, so that only the "explain" section, containing the scoring calculation information, is shown in the response.
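For example (same hypothetical host and collection as before):

  # Return only the scoring ("explain") part of the debug information.
  curl -G 'http://localhost:8983/solr/jedi/select' \
    --data-urlencode 'q=jedi_name:Darth Vader' \
    --data-urlencode 'debug=results'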

Analyzing the response, we can observe that, for every document found, the scoring is calculated based on two key factors: weight and term frequency. As curiosity is a great virtue, you may be interested in knowing more about scoring; if so, you can learn more here.
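As an illustration, each entry of the "explain" section is a nested breakdown along these lines (the numbers below are made up, and the exact formula text depends on the Solr version and the similarity in use):

  1.89 = weight(jedi_name:darth in 7) [SchemaSimilarity], result of:
    1.89 = score(freq=1.0), computed as boost * idf * tf from:
      1.69 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from: ...
      1.12 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from: ...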

It therefore seems that this is not a scoring problem, because the scored values are plausible, but our expected results still don't appear: they are not in the list of calculated ids. So everything still points to the problem being that the query is not matching the documents we expect from the collection.

Could it be a timing problem?

Well, what if the results we're interested in never appear because the request runs out of time before they can be included in the response? This is often the case with collections that contain huge numbers of documents, and in some cases we may be retrieving more data than we actually need. "debug=timing" is useful when we only need the "SearchComponent" timing information to be shown, that is, the time spent in each component involved in the search behaviour.
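For example (same hypothetical collection again):

  # Return only the timing section of the debug information.
  curl -G 'http://localhost:8983/solr/jedi/select' \
    --data-urlencode 'q=jedi_name:Darth Vader' \
    --data-urlencode 'debug=timing'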

We can also try to retrieve some partial results to get more information about the problem; sometimes a little is better than nothing. For this we can use the "timeAllowed" parameter, which specifies a limit (in milliseconds) for the request to finish: if the request expires, only partial results will be returned.
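For example (the 100 ms limit is an arbitrary value chosen for illustration):

  # Stop searching after ~100 ms and return whatever partial results exist.
  curl -G 'http://localhost:8983/solr/jedi/select' \
    --data-urlencode 'q=jedi_name:Darth Vader' \
    --data-urlencode 'timeAllowed=100' \
    --data-urlencode 'debug=timing'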

The main drawback of using this parameter for debugging purposes is that some statistics relative to the whole result set, such as numFound, facet counts and stats, become inaccurate. In our case, timing information is not enough to shed light on this dark debugging issue, as the time values were not very high and the response was quick.

Which documents aren’t in the result set?

Maybe you’re tired of fighting and thinking: “I wish there was some way to compare other query results with this very problematic query…”, well, I have good news for you, there is!

When we want to know why a specific document or set of documents doesn't show up in the result set of a Solr query, we can compare the debug information of both queries by using the "explainOther" parameter. This parameter allows us to provide an additional Lucene query alongside the main one, and Solr then shows the scoring explanation, relative to the main query, for the documents matched by that additional query. For example, it can be used like this:
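A hedged sketch (the main query is still the troubled one, and the additional query is the quoted phrase we suspect should match; the host and collection name remain illustrative):

  # Debug the problematic query and, in addition, explain how the documents
  # matched by the quoted phrase query score against it.
  curl -G 'http://localhost:8983/solr/jedi/select' \
    --data-urlencode 'q=jedi_name:Darth Vader' \
    --data-urlencode 'explainOther=jedi_name:"Darth Vader"' \
    --data-urlencode 'debug=results'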

In the resulting response, we can find a new section containing the elements "otherQuery": "jedi_name:Darth Vader" and "explainOther": {scoring info}. Looking at the "explainOther" section, we can see that the query with quotes matches and calculates the scores of the results we were expecting.

So, what’s the solution?

After the analysis we have carried out, we can see that the problem was caused by forgetting to include the quotation marks when searching for a multi-term phrase (q=jedi_name:"Darth Vader"). It's an easy fix, but it was a long journey to get here. If we want to know a little more, there are other interesting parameters that can help us become better query debuggers.
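The fixed request, in the same hedged style as the earlier sketches:

  # Quoting the phrase makes the field qualifier apply to both terms.
  curl -G 'http://localhost:8983/solr/jedi/select' \
    --data-urlencode 'q=jedi_name:"Darth Vader"'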

Other useful parameters when debugging

So… do you think you know a lot about Solr debugging now, huh? Don't underestimate the power of Solr! Mastering query debugging consists not only of knowing the main techniques that Solr offers but also of knowing the other interesting parameters that can help in the debugging process thanks to the information they provide.

debug=query → The "debug" parameter can also be assigned the "query" value so that only the query parsing section, describing how the query is being translated, is returned.

defType → If we need to specify another query parser in charge of translating the main query parameter ("q") sent to Solr, we can use the "defType" parameter. For example, we can set it to "defType=dismax", or we can let Solr use the default, the "Standard Query Parser". After sending a query with this parameter, we can observe that "QParser" changes from "ExtendedDismaxQParser" to "DisMaxQParser".

omitHeader → The "omitHeader" parameter allows us to remove information we don't need when we want to go directly to a section other than the header. We can avoid showing the header part of the response by setting it to "true" (the default is "false").

debug.explain.structured → Changes the "explain" output of the "DebugComponent" from the default large string to a nested data structure whose format depends on the "wt" parameter.
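Combining some of the parameters above, a debugging request could look something like this (the host and collection name remain illustrative):

  # Full debug output with a structured "explain" section, header omitted.
  curl -G 'http://localhost:8983/solr/jedi/select' \
    --data-urlencode 'q=jedi_name:"Darth Vader"' \
    --data-urlencode 'debug=true' \
    --data-urlencode 'debug.explain.structured=true' \
    --data-urlencode 'omitHeader=true'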

Summary

In this article, we've addressed the issue of what to do when a query returns unexpected results. To do this, we've investigated the tools that Solr itself provides in the form of extra query parameters, such as "debug=true".

However, it’s necessary to clarify that the fact of knowing these parameters does not necessarily guarantee knowing what the problem is associated to that query. It’s also necessary to have knowledge on how Solr obtains these results and how it manages them, for example, by knowing how the faceting works.

Debugging must be seen as understanding the whole process by analysing each element involved in it, so… May the debugging be with you.
