Improve elasticsearch performance by blocking queries based on cardinality
This story is about blocking elasticsearch queries for fields with high cardinality
The challenging part of the problem is we don’t want to restrict querying on fields with high cardinality.
Few examples of queries blocked
- GroupBy transactionId on data for last 7 days
- GroupBy userId and transactionId for last 24 hours
Few examples of queries not blocked
- GroupBy transactionId on data for last 1 day
- GroupBy userId and transactionId for last 2 hours
Calculate cardinality for query
First Step
Calculate cardinality for all the fields of all the indices for docs indexed in last 24 hours.
Cardinality was divided into 3 parts
a) Terms cardinality : For fields having limited value set (In our case, we used distinct values < 100 or (distinctValues < 5000 and docCount/distinctValues> 100))
Eg : One such field is status. Status can have values from (SUCCESS, FAILED, ROLLBACK, CANCELLED)
b) Percentile cardinality : For integer/double/long etc fields
c) Cardinality : For fields having large value set
Eg : transactionId, userId
Store above cardinalities along with doc count(last 24 hours) in a distributed cache (used Hazelcast). Rebuild this cache every 24 hours at the start of the day when the load on cluster is low
Assumption : Took into consideration only the docs indexed in last 24 hours for cardinality calculation assuming that all distinct values for a field would show up in 24 hours
Second Step
Estimate doc count for the query
a) Estimate doc count based on time filter : We already have the doc count stored in cache for 24 hours. Using that doc count, estimate the doc count got the time filter in the query
Let’s call it estimatedDocCountBasedOnTime
estimatedDocCountBasedOnTime = docCount * n hours/24
b) Estimate doc count based on all other filters : Using the cardinality information from Step 1 and estimatedDocCountBasedOnTime information from Step 2.a, estimate the final doc count
Eg : Filter has status = SUCCESS
cardinalityRatio = number of docs with status as ‘SUCCESS’/ docCount (this information is calculated and cached in Step 1)
estimatedDocCount = estimatedDocCountBasedOnTime * cardinalityRatio
Third Step :
Estimate overall cardinality of the query based on groupBy present in the query. The purpose of this step is to find out how many unique groupings result would be the outcome of the query
Let’s calculate the overallCardinality
for (String groupBy : groupBys){
overallCardinality = overallCardinality * fieldCardinality(groupBy)
}
Calculate cardinality for a field [fieldCardinality (String groupBy) ]
a) Terms cardinality : Return the distinct values count calculated in step 1(assuming all the limited set of values would be present in docs)
b) Percentile cardinality : Return the percentile cardinality calculated in step 1
c) Cardinality : For fields having large set of distinct values
docCountRatio = estimatedDocCount (step 2)/ docCount(step 1)
fieldCardinality = cardinality of field (step 1) * docCountRatio
This fieldCardinality would give us the distinct values present for this field in the groupBy result
Eg : 1m distinct values for transactionId in last 24 hours. If I execute the query with time filter as 6 hours, I can expect around 1m * 6/25 =~ 250k distinct values for transactionId field
Final Step : Blocking the query
The purpose of this step is to evaluate whether to block the query based on the calculations done above
The evaluation depends a lot on the amount of data, load, cluster configuration and various other factors
Few things we took into consideration for blocking the query
- If the estimatedDocCount (step 2)< DOC_THRESHOLD : Pass this query as this query is getting executed on small number of docs
- If overallCardinality (step 3) < CARDINALITY_THRESHOLD : Pass this query as the number of grouping outcomes are below a certain threshold (We used CARDINALITY_THRESHOLD =~ 50k, we started blocking queries if the grouping results expected are more than 50k. We made this CARDINALITY_THRESHOLD configurable and figured out this value with hit and trial method)
If the query is not meeting the above criteria, block this query as it might screw the cluster.
You might throw an alert for the user asking to either reduce the time filter or don’t do a groupBy on field with high cardinality
Feel free to reach out to me at nitishgoyal13@gmail.com in case of any doubts
Kindly clap or share the article, if you find it helpful. Thanks