Elasticsearch Global Ordinals
By: Noah Matisoff
Global ordinals are internal data structures that Elasticsearch uses to pre-compute work for terms aggregations and speed up their execution. They “maintain an incremental numbering for each unique term in a lexicographic order.”
An example of a global ordinal mapping can be seen below, assuming each document has a make field indicating the manufacturer of a vehicle. With many documents and five makes across all of them, it would look something along the lines of:
Ordinal | Make
----------------
0 | Audi
1 | BMW
2 | Honda
3 | Lexus
4 | Toyota

Doc ID | Make (ordinal)
----------------
N | 0
N + 1 | 0
N + 2 | 1
N + 3 | 2
N + 4 | 2
N + 5 | 3
N + 6 | 4
With the above mapping held in memory, Elasticsearch can aggregate on the compact ordinals rather than on the string values, which is a much simpler and more efficient operation. Once the aggregation is complete, the conversion from ordinal back to string only needs to happen once, in the final reduce phase.
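As a rough illustration of that idea (a toy model in plain Python, not how Elasticsearch is implemented internally), the two tables above can be represented as a list of terms and a list of per-document ordinals. The names TERMS, DOC_ORDINALS and terms_aggregation are made up for this sketch.

```python
# Ordinal -> term table from the example above, in lexicographic order.
TERMS = ["Audi", "BMW", "Honda", "Lexus", "Toyota"]

# One ordinal per document, in doc ID order N .. N + 6.
DOC_ORDINALS = [0, 0, 1, 2, 2, 3, 4]


def terms_aggregation(doc_ordinals, terms):
    """Count documents per term using integer ordinals only."""
    counts = [0] * len(terms)
    for ordinal in doc_ordinals:
        # Integer array indexing; no string hashing or comparison here.
        counts[ordinal] += 1
    # Ordinals are resolved back to strings once, at the very end.
    return {terms[o]: c for o, c in enumerate(counts) if c > 0}


result = terms_aggregation(DOC_ORDINALS, TERMS)
print(result)
```

The counting loop touches only small integers; the strings appear only when the final buckets are assembled.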
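Global ordinals are effective, except when a new segment is introduced, because this requires Elasticsearch to rebuild the mapping. By default that rebuild happens lazily, on the first search that needs it; with eager loading it happens at refresh time instead.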
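One way to picture why a new segment forces a rebuild is the simplified model below. It assumes each segment keeps its own sorted list of terms and that the global numbering is derived by merging those lists; build_global_ordinals and the sample segments are hypothetical, chosen only to show the effect.

```python
def build_global_ordinals(segment_terms):
    """Merge per-segment sorted term lists into one global numbering.

    Returns (global_terms, segment_to_global), where
    segment_to_global[s][i] is the global ordinal of the term that has
    local ordinal i in segment s.
    """
    global_terms = sorted(set().union(*segment_terms))
    term_to_global = {term: i for i, term in enumerate(global_terms)}
    segment_to_global = [
        [term_to_global[term] for term in terms] for terms in segment_terms
    ]
    return global_terms, segment_to_global


segments = [["Audi", "Honda"], ["BMW", "Honda", "Toyota"]]
before_terms, before_map = build_global_ordinals(segments)

# A refresh introduces a segment containing a term not seen before.
segments.append(["Lexus"])
after_terms, after_map = build_global_ordinals(segments)

print(before_map)  # [[0, 2], [1, 2, 3]]
print(after_map)   # [[0, 2], [1, 2, 4], [3]]
```

In this model, adding "Lexus" moves "Toyota" from global ordinal 3 to 4, so the previously built mapping can no longer be reused as-is.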
To see the impact of eager loading global ordinals, I spun up an index and generated ~300K documents, each with a single field. Below are the results of a terms aggregation on that field, both with and without eager loading global ordinals, at a cardinality of ten (10 unique values for make):
Without Eager Loading:
$ curl -X GET \
    -H "Content-Type: application/json" \
    -d @agg.json \
    http://localhost:9200/cars/_search
{"took":152}
With Eager Loading:
$ curl -X GET \
    -H "Content-Type: application/json" \
    -d @agg.json \
    http://localhost:9200/cars/_search
{"took":105}
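The two took values reported above (in milliseconds) can be compared with a quick calculation. The helper name percent_reduction is just for this sketch.

```python
def percent_reduction(before_ms, after_ms):
    """Relative drop from before_ms to after_ms, as a percentage."""
    return (before_ms - after_ms) / before_ms * 100


# took values from the two runs: lazy loading vs. eager loading.
reduction = percent_reduction(152, 105)
print(f"{reduction:.1f}% reduction in took")
```

Keep in mind this compares a single request from each configuration, so it is an indication rather than a benchmark.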
That is roughly a 31% reduction in time for the terms aggregation, achieved by making the simple change in mappings below:
{
  "properties": {
    "make": {
      "type": "keyword",
      "eager_global_ordinals": true
    }
  }
}
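One possible way to apply that mapping to the existing index is the update mapping endpoint (PUT on /cars/_mapping). The sketch below uses only the Python standard library and assumes Elasticsearch 7 or later, where mappings are typeless, plus the localhost:9200 host and cars index from the curl commands above. The function names are hypothetical, and nothing is sent until apply_mapping is called.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # host taken from the curl commands above
INDEX = "cars"                    # index name taken from the curl commands above

# Same mapping as shown above, expressed as a Python dict.
MAPPING = {
    "properties": {
        "make": {
            "type": "keyword",
            "eager_global_ordinals": True,
        }
    }
}


def build_mapping_request(es_url, index, mapping):
    """Build (but do not send) a PUT request for the update mapping API."""
    return urllib.request.Request(
        url=f"{es_url}/{index}/_mapping",
        data=json.dumps(mapping).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def apply_mapping(request):
    """Send the request; needs a reachable cluster, so it is not called here."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_mapping_request(ES_URL, INDEX, MAPPING)
print(request.get_method(), request.full_url)
```

Calling apply_mapping(request) against a running cluster would submit the change; on older Elasticsearch versions the endpoint and body may need a mapping type name.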
Conclusion
Ultimately, Elasticsearch cannot know ahead of time which fields are going to have terms aggregations run against them. Because of that, the default behavior is to lazy load the global ordinals for keyword and text fields.
Explicitly flagging in your mappings the fields that your application or clients commonly aggregate on can yield a significant performance increase.
Note: this does, however, shift the cost of building global ordinals from search time to refresh time. For many use cases, that is an acceptable price to pay for the improvements at search time.
This is a simple example, so please share your findings and the performance impact of making this change in your Elasticsearch cluster and configuration!