“overrequest” and “refine” JSON facet bucket counts in Solr
Solr 7.0 brings a lot to the table: Autoscaling, plus a number of fixes and improvements we have been waiting on for more than a year. One such improvement is finally getting CORRECT, rather than incorrect or missing, bucket counts from the JSON Facet API. “overrequest” was introduced in 6.3, while “refine” makes its debut in 7.0.
The JSON Facet API was introduced in Solr 5.x by the master, Yonik Seeley, himself. It makes bucketing possible up to “n” levels, making it one of Solr’s most flexible features. Check out the JSON Facet API documentation for more. One limitation it brings along with its super-fast processing, though, is inaccurate bucket counts in a multi-sharded environment. If you are here for the solution, you may safely skip to the “overrequest” and “refine” sections below; for the rest of us, we will first walk through the limitation and its cause.
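To get a feel for that n-level flexibility, here is a minimal sketch of a two-level facet; the sub-field ‘subcat_s’ is a made-up example field for illustration, not part of the test data used below:
curl http://localhost:8983/solr/collection/select -d 'q=*:*&json.facet={
  cat_s: {
    type: terms,
    field: cat_s,
    limit: 5,
    facet: {
      subcat_s: {
        type: terms,
        field: subcat_s,
        limit: 5 }}}}'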
All tests below are run on a multi-sharded collection, here with 3 shards.
Inaccurate/missing bucket counts appear when the “limit” parameter is less than the total number of buckets at that level of the JSON facet request:
[at level i: limit < num_buckets]
bucketVal:  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T
shard1:     2  0  3  1  1  2  2  1  3  0  0  1  3  1  1  1  3  0  0  1
shard2:     0  0  1  0  2  0  1  2  1  0  0  0  1  3  0  2  2  2  1  0
shard3:     0  0  2  2  0  1  0  0  0  0  0  1  1  0  0  1  2  1  0  0
Total:      2  0  6  3  3  3  3  3  4  0  0  2  5  4  1  4  7  3  1  1
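For reference, the full listing that follows can be reproduced with a single request asking for every bucket; in the JSON Facet API, a limit of -1 should return them all:
curl http://localhost:8983/solr/collection/select -d 'q=*:*&json.facet={
  cat_s: {
    type: terms,
    field: cat_s,
    sort: "count desc",
    limit: -1 }}'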
"facets": {
  "count": 55,
  "cat_s": {
    "buckets": [
      { "val": "Q", "count": 7 },
      { "val": "C", "count": 6 },
      { "val": "M", "count": 5 },
      { "val": "I", "count": 4 },
      { "val": "N", "count": 4 },
      { "val": "P", "count": 4 },
      { "val": "D", "count": 3 },
      { "val": "E", "count": 3 },
      { "val": "F", "count": 3 },
      { "val": "G", "count": 3 },
      { "val": "H", "count": 3 },
      { "val": "R", "count": 3 },
      { "val": "A", "count": 2 },
      { "val": "L", "count": 2 },
      { "val": "O", "count": 1 },
      { "val": "S", "count": 1 },
      { "val": "T", "count": 1 }]}}
Those are the correct bucket counts. Moving on…
curl http://localhost:8983/solr/collection/select -d 'q=*:*&json.facet={
  cat_s: {
    type: terms,
    field: cat_s,
    sort: "count asc",
    limit: 2 }}'
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
"facets": {
  "count": 55,
  "cat_s": {
    "buckets": [
      { "val": "D", "count": 1 },
      { "val": "F", "count": 1 }]}}
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
The true counts for bucketVals ‘D’ and ‘F’ are 3, not 1, and the actual minimum bucketVals are ‘O’, ‘S’ and ‘T’, each with count 1. Let’s take another example:
curl http://localhost:8983/solr/collection/select -d 'q=*:*&json.facet={
  cat_s: {
    type: terms,
    field: cat_s,
    sort: "count desc",
    limit: 2 }}'
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
"facets": {
  "count": 55,
  "cat_s": {
    "buckets": [
      { "val": "Q", "count": 7 },
      { "val": "C", "count": 5 }]}}
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
We know that’s wrong again; ‘C’ should be 6.
Observe that whenever limit < num_buckets (i.e., we don’t request all buckets), the bucketVal counts go haywire. JSON facet requests behave this way in a multi-shard collection: every shard is asked for its own top “N” buckets, where “N” is derived from the “limit”; the per-shard bucket counts are then merged, the global top “N” are selected from that merged set, and those are returned as the response. Solr does request some additional buckets from every shard by default (there is another parameter at play here, which we will get to), but not enough. How much extra? 10% of the ‘limit’, plus 4. Too much theory; let’s take the first example above (count asc, limit 2):
for limit = 2, the number of buckets requested from each shard is: 2 + (0.1 × 2) + 4 ≈ 6
shard1: any six of (D, E, H, L, N, O, P, T), each with count 1
shard2: C, G, I, M, S with count 1, plus any one of (E, H, P, Q, R) with count 2
shard3: F, L, M, P, R with count 1, plus any one of (C, D, Q) with count 2
cumulative (global top 6): many combinations are possible depending on tie-breaking, but starting from the buckets with count 1, one thing is certain:
F: 1, G: 1, …
‘F’ with count 1 is incorrect for sure, and we need not look any further (the true count of ‘F’ is 3). Now we can see what the problem is; it’s the classic algorithmic/logical-reasoning question we have all encountered at some point: “find the minimum X from Y sets of random numbers”.
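You can inspect this per-shard picture yourself by querying a single replica core directly with distrib=false (the core name ‘collection_shard1_replica_n1’ below is an assumption; check the Cores admin screen for yours):
curl http://localhost:8983/solr/collection_shard1_replica_n1/select -d 'q=*:*&distrib=false&json.facet={
  cat_s: {
    type: terms,
    field: cat_s,
    sort: "count asc",
    limit: 6 }}'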
“overrequest” to the rescue (Solr 6.3):
‘overrequest’ doesn’t quite explain itself: it is the buffer of extra buckets requested from each shard above the “limit”, and its default value is the 10% of limit + 4 we saw above. Starting with 6.3, we can pass this buffer explicitly as the ‘overrequest’ parameter in the JSON facet request. If limit + overrequest ≥ num_buckets on every shard, we will get correct bucket counts every time. See: SOLR-9654
curl http://localhost:8983/solr/collection/select -d 'q=*:*&json.facet={
  cat_s: {
    type: terms,
    field: cat_s,
    sort: "count asc",
    limit: 2,
    overrequest: 10 }}'
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
"facets": {
  "count": 55,
  "cat_s": {
    "buckets": [
      { "val": "I", "count": 1 },
      { "val": "O", "count": 1 }]}}
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
limit + overrequest = 12 buckets per shard, still not enough to pull all the relevant buckets from each of the three shards: ‘I’ is reported as 1 but is really 4.
curl http://localhost:8983/solr/collection/select -d 'q=*:*&json.facet={
  cat_s: {
    type: terms,
    field: cat_s,
    sort: "count asc",
    limit: 2,
    overrequest: 18 }}'
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
"facets": {
  "count": 55,
  "cat_s": {
    "buckets": [
      { "val": "O", "count": 1 },
      { "val": "S", "count": 1 }]}}
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Now it’s correct. NOTE: “overrequest”: 0 does not invoke the default behavior; it disables the buffer entirely. “-1” is what invokes the default.
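As a quick way to see the raw, unbuffered merge in action, a request along these lines should do (output omitted, since the exact buckets returned depend on tie-breaking):
curl http://localhost:8983/solr/collection/select -d 'q=*:*&json.facet={
  cat_s: {
    type: terms,
    field: cat_s,
    sort: "count asc",
    limit: 2,
    overrequest: 0 }}'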
“refine” to put the cherry on the cake (Solr 7.0):
What do we miss when we pull X balls each from Y bags of random numbers? A ball ‘X1’ pulled from bag A may not get pulled from bag B. So what if we could go back and request ‘X1’ from bag B explicitly? That’s the idea behind ‘refine’. See: SOLR-11159
curl http://localhost:8983/solr/collection/select -d 'q=*:*&json.facet={
  cat_s: {
    type: terms,
    field: cat_s,
    sort: "count asc",
    limit: 2,
    overrequest: 10,
    refine: true }}'
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
"facets": {
  "count": 55,
  "cat_s": {
    "buckets": [
      { "val": "O", "count": 1 },
      { "val": "S", "count": 1 }]}}
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Even if ‘overrequest’ is low, i.e., ‘limit’ + ‘overrequest’ < num_buckets, ‘refine’ makes up for it. Refinement works as follows:
1) collect the top N buckets from each shard and find the global “top N” buckets.
2) correct the counts of this global “top N” by requesting counts from shards that didn’t provide a value for each bucket.
So while this guarantees correct counts for the buckets that are returned, it doesn’t guarantee that no bucket value is missed altogether.
To increase the chances of getting the true global top N, ‘overrequest’ applies at step 1); increasing that number reduces or eliminates the chance of missing buckets entirely.
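So a cheap but honest configuration might look like the sketch below: the counts of whatever buckets come back are exact, though with no overrequest buffer the true minimum buckets can still be missed if no shard happens to surface them:
curl http://localhost:8983/solr/collection/select -d 'q=*:*&json.facet={
  cat_s: {
    type: terms,
    field: cat_s,
    sort: "count asc",
    limit: 2,
    overrequest: 0,
    refine: true }}'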
I thank Yonik Seeley for providing the insights on the various Solr JIRAs and the respective improvements listed here.
Moving forward, in future releases the explanations and examples above may no longer apply. Please leave your suggestions, improvements and feedback in the comments. Cheers!