Getting different results while issuing a query multiple times in SolrCloud!
Keen-eyed observers (and testers!) have occasionally brought to our attention something like “I issue a query multiple times and get different responses! What’s wrong?”. It turns out that the normal operation of Solr when querying can return slightly different results in two “flavors”:
- different numbers of hits and
- slightly different sort ordering.
The Solr model is “eventual consistency”. The first bullet point above corrects itself as commits propagate through the system, and the second can be addressed by specifying “distributed statistics”. We should emphasize that unless the circumstances are fairly unusual, this issue is rarely noticed. This article is intended to allay fears that “something’s wrong with Solr” when seeing this behavior.
To understand this behavior, we have to break down the process to understand:
- Indexing in the collection
- Querying the same collection
Varun Thacker, Solr Committer, wrote a well explanatory article on how indexing works in SolrCloud. Please refer: How does indexing work in SolrCloud
When we send a document for indexing this is what happens:
- The sharding algorithm decides which shard a document goes to.
- The document then gets forwarded to the shard leader and a version gets assigned to it.
- The document is then indexed on the leader and is forwarded asynchronously to all replicas
When the document gets indexed, it gets reflected in Solr query results for that particular replica/leader/node only after autocommit has been issued on it. If a soft-commit is issued, the newly indexed/updated document is searchable while if a hard-commit has been issued, a new searcher must be opened if soft-commit is not enabled.
It’s almost certain that the time an autocommit kicks off on a leader and a replica can be skewed up to their autocommit interval.
The following happens internally while indexing:
- Leader gets an update and starts to index it. The autocommit timer on the leader is started.
- Some time later due to propagation delay, the follower gets the record and starts its autocommit timer.
- More updates/documents come in for indexing.
- The timer on the leader expires and the commit happens.
- The timer on the follower happens to have a few more (or less) uncommitted documents than the leader had when its timer expired.
- I’ve used leader and follower, but the same holds true for all replicas, their counts can be slightly different due to their autocommit timers expiring at different times.
Query in a Solrcloud is distributed and is explained below:
- Querying on any node of collection will cause that node to send the query out to one node/replica in each shard. This one node/replica of each shard is selected randomly internally.
- Results are received by the node which received the request from the client/user from all the nodes/replicas and gets aggregated.
- Response is sent to the client/user.
As evident from above, the request is received by arbitrary node of a shard, for every request made, the response will be different if all the replicas of a shard are not consistent, which is exactly the case considering the indexing process stated above.
If we stop the indexing and wait for the max autocommit interval for the collection, we will receive consistent result every time.
As mentioned above, there are two “flavors” of difference, count and relevance (score) ordering. So far we’ve outlined why counts may be different. To understand why ordering may be different consider what happens during a commit:
- The current segment is closed.
- New segments are opened.
- Segments are merged by background threads. As part of merging, deleted documents and their associated data are purged.
This last point is the important part for this discussion. The choice of segments to merge is affected by their content, number of deleted docs etc. Different decisions may be made on different replicas as to which segments are merged. Thus the statistics used to calculate score may be slightly different on different replicas.
In this case it’s possible that the document may have a slightly different score depending on which replica calculated the score. Thus a particular document might sort differently relative to a document from another shard depending on which replica calculated the score. Since the aggregator node sorts by score, the ordering may be different. If this is unacceptable, Solr has an option to use distributed statistics. Please refer the official Solr reference link: Distributed Requests to implement the same. This should never be the case when sorting by other criteria, just when sorting by score.
Please understand that this is an edge case! There are only a few situations we at Lucidworks have seen where anyone notices this behavior, and often these kinds of anomalies are only noticed when running automated tests. This article is intended as a reference to reassure readers that Solr is performing as designed and there is no data loss. Distributed systems are complicated and, by their nature, sometimes have surprising behaviors!
Originally published at support.lucidworks.com.