Improve Heavy Elasticsearch Aggregations with Random Score and Sampler Aggregation

Golan Kopichinsky · Cognigo · Apr 22, 2019

Combining two Elasticsearch features, the Sampler Aggregation and random scoring, can help create efficient approximate facets and insights while significantly reducing the cost of heavy, slow aggregations.

Random Sampling

Consider the following scenario: you have an Elasticsearch-based search engine service that returns not only regular search results but also facets and insights, built from Elasticsearch aggregations at a very large scale. You want to provide those insights on every query, without knowing in advance how many results each query will return.

While a home page or a dashboard can be handled with some sort of pre-fetching and caching, you cannot tell in advance which query the user will enter. Nor can you predict how heavy the aggregations will be, or how much they will slow down the entire query. A user's filter may exclude only a small percentage of the total results, so their query ends up almost as heavy as a match-all query that aggregates over all documents in the index.

These heavy queries are not only very slow but also very CPU-intensive, and may lead to risks such as node crashes due to heap out-of-memory errors.

Yet in most cases, a highly accurate aggregation result is not that important, and Elasticsearch aggregations are approximate by design anyway.

For example, when estimating the number of documents in some bucket, would it really matter to report 58,730,244 documents and not an estimation of around 60,000,000 documents?

Moreover, the most relevant question is typically something like "which are the top X buckets for this field?", together with their overall percentages.

Motivation: execute heavy aggregations on a sample of the data and get approximate results which closely represent the distribution of the real data.

All examples in this article rely on data from the DonorsChoose.org data set. The original data set is used but with a small manipulation:

data = data.sort_values("donor_state")

This causes the data to be indexed in a non-random order, making it resemble a real-life system where data is not spread evenly across each timestamp or source.
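For reference, indexing the sorted data might look roughly like the sketch below. The index name, the "donations.csv" file name, and the use of the elasticsearch-py bulk helper are assumptions for illustration; adapt them to your own setup and client version.

# A rough indexing sketch (assumptions: a local cluster, pandas and the
# elasticsearch-py client installed, and a "donations.csv" file with the
# DonorsChoose columns used later in this post).
import pandas as pd
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

data = pd.read_csv("donations.csv")
data = data.sort_values("donor_state")  # index in a non-random order, as above

actions = (
    {"_index": "donorschoose", "_source": row.to_dict()}
    for _, row in data.iterrows()
)
helpers.bulk(es, actions)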

Ground Truth

As ground truth, we ran a regular Terms Aggregation over 3 fields: one terms aggregation on donor_city, and another on projectid with a sub-aggregation on donor_state:

POST donorschoose/_search
{
  "size": 0,
  "aggs": {
    "city": {
      "terms": {
        "field": "donor_city",
        "size": 10
      }
    },
    "project": {
      "terms": {
        "field": "projectid",
        "size": 10
      },
      "aggs": {
        "by_state": {
          "terms": {
            "field": "donor_state",
            "size": 10
          }
        }
      }
    }
  }
}

The above query returned in about 800ms on average (with over 6 million documents indexed). Let's examine the results for donor_city:

These are the real, ground-truth results. Keep in mind that the more aggregations you need and the more complex they are, the heavier the query becomes, at an unreasonable cost in time and resources.

Now let's try to produce the same values with a different approach.

Sampler Aggregation

Elasticsearch provides a way to create a sampled aggregation via the Sampler Aggregation:

“A filtering aggregation used to limit any sub aggregations’ processing to a sample of the top-scoring documents.”

It uses a parameter called shard_size:

The sampler aggregation limits how many top-scoring documents are collected in the sample processed on each shard.

The basic syntax is:

"aggs": {
  "SAMPLE": {
    "sampler": {
      "shard_size": 100
    },
    "aggs": {…}
  }
}

Let's try the Sampler Aggregation with the query we ran before:

POST donorschoose/_search
{
  "size": 0,
  "aggs": {
    "SAMPLE": {
      "sampler": {
        "shard_size": 2000
      },
      "aggs": {
        "city": {…},
        "project": {…}
      }
    }
  }
}

We use shard_size = 2,000 to sample a total of 10,000 documents (the index uses the default of 5 shards).

The query result would look like:

"hits": {
  "total": 6211956,
  "max_score": 0,
  "hits": []
},
"aggregations": {
  "SAMPLE": {
    "doc_count": 10000,
    "city": {
      "doc_count_error_upper_bound": 18,
      "sum_other_doc_count": 1932,
      "buckets": [
        {
          "key": "",
          "doc_count": 6805
        },
        {
          "key": "New York",
          "doc_count": 353
        },
        {
          "key": "Atlanta",
          "doc_count": 294
        },…

We can already notice an improvement in time: the average query time drops to around 100ms (8x faster), and this is just a small demonstration. As mentioned, in larger systems the difference can be much greater.

The Sampler Aggregation returns the number of documents actually sampled under the doc_count key, based on the number of shards and the number of real query results. If we request a sample size that is larger than the number of results in the query, doc_count will be equal to the query's total hits count.

Approximation Of Total Hits

Using doc_count and the query's total hits, we can approximate the real bucket size for each term with a simple calculation:

estimated bucket size = bucket doc_count * (total hits / sample doc_count)

In our query case:

  • total hits = 6,211,956
  • sample size (doc_count) = 10,000
  • document count of “New York” is 353 (for example)

Then a calculated approximation would be 353 * (6,211,956 / 10,000) = 219,282 documents.
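As a small illustration, here is a minimal helper (ours, not part of Elasticsearch) that applies this extrapolation to the sampled buckets:

def extrapolate_buckets(total_hits, sample_doc_count, buckets):
    # Scale each sampled bucket count by total_hits / sample_doc_count.
    factor = total_hits / sample_doc_count
    return {b["key"]: round(b["doc_count"] * factor) for b in buckets}

buckets = [{"key": "New York", "doc_count": 353}]
print(extrapolate_buckets(6_211_956, 10_000, buckets))
# {'New York': 219282}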

In this example, however, the calculated approximation is not reliable: the real document count for New York is 162,192.

When comparing the sample to the actual results, they differ:

We can conclude that while the Sampler Aggregation significantly improves performance, in our case it produces estimations that are not (on their own) reliable.

Looking more closely, we can see that the main cause of the difference is that the Sampler Aggregation selects the top-scoring documents.

When a query returns most or all of the index, or more generally when scoring is not relevant (since all documents are equally relevant for aggregations), this aggregation returns a misleading sample. In other words, results that do not represent the actual data.

Moreover, using a different scoring logic will not help either, since in the terms/facets aggregation use case all documents are created equal.

Therefore, we need to create a sample that will not take into account scoring — and for that we can use random scoring.

Random Scoring

Elasticsearch enables the ability of random scoring via the Function Score Query:

The random_score generates scores that are uniformly distributed from 0 to 1. By default, it uses the internal Lucene doc ids as a source of randomness, which is very efficient but not reproducible, meaning each query will return a new random set of scores.

It is possible to use field and seed to create a reproducible score, but this is less efficient: the whole process can take more time than a regular query with no sampling and no randomization at all, making this option rather pointless for our case.

The basic random_score syntax is:

"size": 0,
"query": {
  "function_score": {
    "random_score": {},
    "query": {…}
  }
}

The actual query is placed in the inner query object.
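For completeness, the reproducible variant mentioned above might look like the sketch below, expressed here as a Python dict. The seed value and the choice of _seq_no as the field are our assumptions, following the Elasticsearch documentation:

reproducible_query = {
    "function_score": {
        # Same seed + field => the same "random" scores on every execution,
        # at the cost of the efficiency discussed above.
        "random_score": {"seed": 42, "field": "_seq_no"},
        "query": {"match_all": {}},
    }
}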

Random Scoring with Sampler Aggregation

Using this feature gives us what we need — aggregating on a small number of documents, while using a sample that actually represents the data.

So, in our experiment this is the final query:

POST donorschoose/_search
{
  "size": 0,
  "query": {
    "function_score": {
      "random_score": {}
    }
  },
  "aggs": {
    "SAMPLE": {
      "sampler": {
        "shard_size": 2000
      },
      "aggs": {
        "city": {…},
        "project": {…}
      }
    }
  }
}
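For illustration, an end-to-end sketch in Python could assemble the full request body from the aggregations shown earlier and run it with the elasticsearch-py client. The client usage, the local host, and the body= style call are assumptions that depend on your client version:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def terms(field):
    # Terms aggregation on a field, with the same size used earlier in the post.
    return {"terms": {"field": field, "size": 10}}

body = {
    "size": 0,
    "query": {"function_score": {"random_score": {}}},
    "aggs": {
        "SAMPLE": {
            "sampler": {"shard_size": 2000},
            "aggs": {
                "city": terms("donor_city"),
                "project": {**terms("projectid"),
                            "aggs": {"by_state": terms("donor_state")}},
            },
        }
    },
}

response = es.search(index="donorschoose", body=body)
sample = response["aggregations"]["SAMPLE"]
print(sample["doc_count"], sample["city"]["buckets"][:3])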

The randomness adds a small amount of time to the query, but as mentioned before, the basic random_score is efficient and does not add too much overhead.

As expected, the query will now produce a different result on each execution, yet even as a random sample, these results are now a truer representation of the data:

We can see that the differences between the samples and the actual numbers tend to be small, while, most importantly, the order of the buckets remains the same. We achieved this by sampling merely 10,000 documents, which are only about 0.16% of the actual data!

Keep in mind that a relatively small sample was used here for demonstration; the concept maintains its effectiveness with larger indices, so we can (and should) use a larger sample size.

A larger sample size also keeps the approach robust for "narrow" queries that return a small number of results: in such cases the sample size equals the result size and the aggregation is exact. For large indices, even a sample of a few million documents is still a very small percentage of the total data.
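Since shard_size is a per-shard limit, a simple way to derive it from a desired total sample size is to divide by the number of primary shards. A rough sketch (the function name is ours):

import math

def shard_size_for_sample(desired_sample_size, number_of_shards):
    # Each shard contributes up to shard_size documents to the sample.
    return math.ceil(desired_sample_size / number_of_shards)

print(shard_size_for_sample(10_000, 5))     # 2000, as used above
print(shard_size_for_sample(1_000_000, 5))  # 200000 for a larger sample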

Rounding

Another simple yet effective trick that can increase estimation reliability on the client side is to round the results.

For example, rounding 46,590 to 47,000: as mentioned earlier, the exact difference between the two matters little in most use cases. Moreover, when estimating with random scoring, rounding helps hide some of the noise generated by the randomness.
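A possible client-side helper (ours) rounds an estimate to a given number of significant digits:

def round_estimate(value, significant_digits=2):
    # e.g. 46,590 -> 47,000 with two significant digits.
    if value == 0:
        return 0
    magnitude = len(str(abs(int(value)))) - significant_digits
    return int(round(value, -magnitude)) if magnitude > 0 else int(round(value))

print(round_estimate(46_590))      # 47000
print(round_estimate(219_282, 2))  # 220000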

Why Does It Work? (and in which cases it doesn’t)

So far we have been counting documents, a quantity that is spread evenly across the index, since each document appears exactly once in it.

But what would happen if we tried different types of aggregations? In such cases the sampling sometimes won’t give us what we need.

Consider two cases:

  1. Cardinality: how many different cities appear in each query? In this case we cannot simply scale the sampled value by total hits / sample size; the calculation would have to account for each query's own distribution. For example, if we query for donations from "New York or Los Angeles", the cardinality is at most 2, regardless of sample size.
  2. AVG: what is the average amount per donation? Here, too, the amount does not necessarily distribute evenly across the index. For example, assume the data has 1 billion donations of $1 and a single extreme donation of 1 billion dollars. The overwhelming majority of samples will not "hit" that large donation, so the calculated average will be badly wrong (see the short simulation after this list).
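The AVG pitfall is easy to see with a short back-of-the-envelope simulation. The numbers follow the example above; the script itself is ours:

N = 1_000_000_000          # one billion $1 donations
EXTREME = 1_000_000_000    # plus a single $1B donation
SAMPLE_SIZE = 10_000

true_avg = (N * 1 + EXTREME) / (N + 1)                    # ~ $2
p_outlier_sampled = 1 - (1 - 1 / (N + 1)) ** SAMPLE_SIZE  # ~ 0.00001
typical_sampled_avg = 1.0                                 # what almost every sample reports

print(true_avg, p_outlier_sampled, typical_sampled_avg)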

To summarize: unless we aggregate on quantities that are distributed evenly across the index, estimating results from a sample is prone to large errors. A more refined approach would combine sampling with some prior estimate of the field's probability distribution for each query, and that we cannot predict in advance.

Conclusion

In this post we demonstrated the effectiveness of combining two Elasticsearch features: random scoring and the Sampler Aggregation. On top of that, we showed how rounding the results can be useful. Together, they provide an efficient way to aggregate and create useful facets and insights at scale, and help handle heavy queries while significantly reducing costs in both CPU and time.

Thanks to Adam Bali, who helped review this post!
