Improve elasticsearch performance by blocking queries based on cardinality

3 min read · Aug 22, 2018

This story is about blocking Elasticsearch queries that group by fields with high cardinality.

The challenging part of the problem is that we don’t want to block every query on a high-cardinality field, only the ones expected to produce too many groups.

A few examples of queries that are blocked

  1. GroupBy transactionId on data for last 7 days
  2. GroupBy userId and transactionId for last 24 hours

A few examples of queries that are not blocked

  1. GroupBy transactionId on data for last 1 day
  2. GroupBy userId and transactionId for last 2 hours

Calculate cardinality for query

First Step

Calculate the cardinality of every field in every index, using the docs indexed in the last 24 hours.

Cardinality was divided into 3 parts

a) Terms cardinality : For fields with a limited set of values (in our case, distinctValues < 100, or distinctValues < 5000 and docCount/distinctValues > 100)

Eg : One such field is status, which can take one of the values SUCCESS, FAILED, ROLLBACK, CANCELLED

b) Percentile cardinality : For numeric fields (integer, long, double etc.)

c) Cardinality : For fields having large value set

Eg : transactionId, userId
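
The classification above can be sketched in Java roughly as follows. This is only an illustration: the class, enum and method names are hypothetical, and checking the numeric type before the distinct-value thresholds is an assumption, since the order is not spelled out here. The thresholds are the ones quoted above.

```java
/** Illustrative sketch only; all names here are hypothetical. */
final class CardinalityClassifier {

    enum CardinalityType { TERMS, PERCENTILE, CARDINALITY }

    /**
     * Picks the cardinality bucket for a field.
     * Assumption: numeric fields always go to PERCENTILE, checked first.
     */
    static CardinalityType classify(boolean numericField, long distinctValues, long docCount) {
        if (numericField) {
            return CardinalityType.PERCENTILE;
        }
        // Very small value sets are always treated as terms.
        if (distinctValues < 100) {
            return CardinalityType.TERMS;
        }
        // docCount / distinctValues > 100, written without integer division.
        if (distinctValues < 5000 && docCount > 100L * distinctValues) {
            return CardinalityType.TERMS;
        }
        return CardinalityType.CARDINALITY;
    }
}
```

For example, a status field with 4 distinct values would be classified as TERMS, while a transactionId field with 1m distinct values would fall through to CARDINALITY.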

Store the above cardinalities, along with the doc count for the last 24 hours, in a distributed cache (we used Hazelcast). Rebuild this cache every 24 hours at the start of the day, when the load on the cluster is low.

Assumption : Only the docs indexed in the last 24 hours are considered for the cardinality calculation, assuming that all distinct values of a limited-value field show up within 24 hours

Second Step

Estimate doc count for the query

a) Estimate doc count based on time filter : We already have the doc count for the last 24 hours stored in the cache. Using that doc count, estimate the doc count for the time filter in the query, where n is the number of hours the filter covers

Let’s call it estimatedDocCountBasedOnTime

estimatedDocCountBasedOnTime = docCount * n hours/24

b) Estimate doc count based on all other filters : Using the cardinality information from Step 1 and estimatedDocCountBasedOnTime information from Step 2.a, estimate the final doc count

Eg : Filter has status = SUCCESS

cardinalityRatio = number of docs with status as ‘SUCCESS’/ docCount (this information is calculated and cached in Step 1)

estimatedDocCount = estimatedDocCountBasedOnTime * cardinalityRatio
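
A minimal Java sketch of the two estimates in this step is given below. The class and method names are hypothetical, and the sketch assumes the cached values (24-hour doc count and the number of docs matching a filter value) have already been read from the cache.

```java
/** Illustrative sketch only; names are hypothetical. */
final class DocCountEstimator {

    /** Step 2.a: scale the cached 24-hour doc count to the query's time window. */
    static double estimateByTime(long docCountLast24h, double queryHours) {
        return docCountLast24h * queryHours / 24.0;
    }

    /**
     * Step 2.b: scale the time-based estimate by the share of docs matching a filter,
     * e.g. the share of docs with status = SUCCESS.
     */
    static double applyFilterRatio(double estimatedDocCountBasedOnTime,
                                   long matchingDocs,
                                   long docCountLast24h) {
        if (docCountLast24h == 0) {
            return 0.0; // nothing indexed in the last 24 hours, so nothing to estimate from
        }
        double cardinalityRatio = (double) matchingDocs / docCountLast24h;
        return estimatedDocCountBasedOnTime * cardinalityRatio;
    }
}
```

With a made-up 24m docs in the last 24 hours, a 6 hour time filter gives an estimate of 6m docs. If 18m of the 24m docs have status = SUCCESS, the final estimate becomes 6m * 0.75 = 4.5m docs.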

Third Step

Estimate the overall cardinality of the query based on the groupBy fields present in it. The purpose of this step is to find out how many unique groups the query would return

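Let’s calculate the overallCardinality, starting from 1 and multiplying in the cardinality of each groupBy field

double overallCardinality = 1;

for (String groupBy : groupBys) {

overallCardinality = overallCardinality * fieldCardinality(groupBy);

}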

Calculate cardinality for a field [fieldCardinality (String groupBy) ]

a) Terms cardinality : Return the distinct values count calculated in step 1 (assuming every value of the limited set is present in the docs)

b) Percentile cardinality : Return the percentile cardinality calculated in step 1

c) Cardinality : For fields having large set of distinct values

docCountRatio = estimatedDocCount (step 2)/ docCount(step 1)

fieldCardinality = cardinality of field (step 1) * docCountRatio

This fieldCardinality gives us the number of distinct values expected for this field in the groupBy result

Eg : 1m distinct values for transactionId in last 24 hours. If I execute the query with a time filter of 6 hours and no other filters, I can expect around 1m * 6/24 = 250k distinct values for the transactionId field
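
Putting this step together, here is a hedged Java sketch of fieldCardinality and the overall product. The FieldStats holder, the map standing in for the cache and the method signatures are assumptions made for illustration; they are not the code we ran in production.

```java
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; names and the cache shape are hypothetical. */
final class QueryCardinalityEstimator {

    enum CardinalityType { TERMS, PERCENTILE, CARDINALITY }

    /** Per-field values assumed to have been cached in step 1. */
    static final class FieldStats {
        final CardinalityType type;
        final double cachedCardinality;

        FieldStats(CardinalityType type, double cachedCardinality) {
            this.type = type;
            this.cachedCardinality = cachedCardinality;
        }
    }

    static double fieldCardinality(FieldStats stats, double estimatedDocCount, long docCountLast24h) {
        // a) and b): return the cached value as-is.
        if (stats.type != CardinalityType.CARDINALITY) {
            return stats.cachedCardinality;
        }
        // c): scale the cached cardinality by the share of docs the query touches.
        if (docCountLast24h == 0) {
            return 0.0;
        }
        double docCountRatio = estimatedDocCount / docCountLast24h;
        return stats.cachedCardinality * docCountRatio;
    }

    static double overallCardinality(List<String> groupBys,
                                     Map<String, FieldStats> cache,
                                     double estimatedDocCount,
                                     long docCountLast24h) {
        double overall = 1.0; // neutral element for the product
        for (String groupBy : groupBys) {
            overall *= fieldCardinality(cache.get(groupBy), estimatedDocCount, docCountLast24h);
        }
        return overall;
    }
}
```

Note that the sketch assumes every groupBy field is present in the cache; a real implementation would need to decide what to do for a field that is missing.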

Final Step : Blocking the query

The purpose of this step is to evaluate whether to block the query based on the calculations done above

The evaluation depends a lot on the amount of data, load, cluster configuration and various other factors

A few things we took into consideration when deciding whether to block a query

  1. If estimatedDocCount (step 2) < DOC_THRESHOLD : Pass the query, as it runs on a small number of docs
  2. If overallCardinality (step 3) < CARDINALITY_THRESHOLD : Pass the query, as the expected number of groups is below the threshold. We used CARDINALITY_THRESHOLD =~ 50k, i.e. we started blocking queries expected to return more than 50k groups. We made CARDINALITY_THRESHOLD configurable and arrived at this value by trial and error

If the query meets neither of the above criteria, block it, as it might destabilize the cluster.

You might show an alert asking the user to either reduce the time filter or avoid a groupBy on a field with high cardinality
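
The pass/block decision of this step can be sketched in Java as below. The class name is hypothetical, and both thresholds are constructor arguments because the right values depend on your cluster; the 1m doc threshold used in the example afterwards is purely made up, only the ~50k cardinality threshold comes from our setup.

```java
/** Illustrative sketch only; names are hypothetical and thresholds must be tuned per cluster. */
final class QueryBlocker {

    private final double docThreshold;
    private final double cardinalityThreshold;

    QueryBlocker(double docThreshold, double cardinalityThreshold) {
        this.docThreshold = docThreshold;
        this.cardinalityThreshold = cardinalityThreshold;
    }

    /** Returns true when the query should be blocked. */
    boolean shouldBlock(double estimatedDocCount, double overallCardinality) {
        // 1. Few docs to scan: pass the query.
        if (estimatedDocCount < docThreshold) {
            return false;
        }
        // 2. Few expected groups: pass the query.
        if (overallCardinality < cardinalityThreshold) {
            return false;
        }
        // Many docs and many expected groups: block.
        return true;
    }
}
```

With new QueryBlocker(1_000_000, 50_000), a query estimated at 6m docs and 250k groups is blocked, while the same 6m docs with only 10k expected groups pass.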

Feel free to reach out to me at nitishgoyal13@gmail.com in case of any doubts

Kindly clap or share the article, if you find it helpful. Thanks

Nitish Goyal
Written by Nitish Goyal

Software Architect @ PhonePe, Trader, Sports-enthusiast, Traveller
