Elasticsearch Global Ordinals

Driven by Code
Aug 6 · 2 min read

By: Noah Matisoff

Global ordinals are an internal data structure that Elasticsearch uses to pre-compute and speed up terms aggregations. They “maintain an incremental numbering for each unique term in a lexicographic order.” Because aggregations can then work on these numbers instead of the terms themselves, global ordinals play a significant role in how terms aggregations execute.


An example of a global ordinal mapping can be seen below, assuming each document has a make field indicating the manufacturer of a vehicle. With many documents and five makes across all of them, the term-to-ordinal table and the per-document ordinals would look something like this:

Ordinal | Make
----------------
0       | Audi
1       | BMW
2       | Honda
3       | Lexus
4       | Toyota

Doc ID | Make (ordinal)
------------------------
N      | 0
N + 1  | 0
N + 2  | 1
N + 3  | 2
N + 4  | 2
N + 5  | 3
N + 6  | 4

With the above mapping held in memory, Elasticsearch can aggregate on the integer ordinals rather than on the string values, which is a much simpler and more efficient operation. Once the aggregation is complete, the conversion from ordinal back to string needs to happen only once, in the final reduce phase.
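The idea can be sketched in a few lines of Python. This is only a toy illustration of the tables above, not how Elasticsearch implements ordinals internally; the sample data and variable names are made up for the example.

```python
# Hypothetical per-document values for the `make` field (illustrative data only).
doc_makes = ["Audi", "Audi", "BMW", "Honda", "Honda", "Lexus", "Toyota"]

# Ordinal table: each unique term gets an integer, assigned in lexicographic order.
terms = sorted(set(doc_makes))
ordinal_of = {term: i for i, term in enumerate(terms)}

# Per-document storage now holds small integers instead of strings.
doc_ordinals = [ordinal_of[make] for make in doc_makes]

# Aggregate on the integers: one counter slot per ordinal.
counts = [0] * len(terms)
for ordinal in doc_ordinals:
    counts[ordinal] += 1

# Resolve ordinals back to strings once, at the very end.
buckets = {terms[i]: count for i, count in enumerate(counts)}
print(buckets)
```

The counting loop only ever touches integers; the strings are looked up once per bucket rather than once per document.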

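Global ordinals are effective, but they come with a cost: because the mapping spans all segments of a shard, it has to be rebuilt whenever a new segment becomes visible. By default, that rebuild happens lazily, during the first search that needs global ordinals after a refresh. With eager loading, the rebuild happens as part of the refresh instead.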

To see the impact of eager loading global ordinals, I spun up an index and generated ~300K documents, each with a single field. Below are the results of the same terms aggregation on that field, with and without eager loading of global ordinals. The field has a cardinality of ten (10 unique values for make):

Without Eager Loading:

$ curl -X GET \
-H "Content-Type: application/json" \
-d @agg.json \
http://localhost:9200/cars/_search
{"took":152}

With Eager Loading:

$ curl -X GET \
-H "Content-Type: application/json" \
-d @agg.json \
http://localhost:9200/cars/_search
{"took":105}

That is roughly a 31% reduction in time for the terms aggregation (152 ms down to 105 ms), from making the simple change in mappings below:

{
  "properties": {
    "make": {
      "type": "keyword",
      "eager_global_ordinals": true
    }
  }
}
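For reference, here is a rough Python sketch of the two requests involved. The post does not show the contents of agg.json, so the aggregation body here is an assumption (a plain terms aggregation on make, with a made-up aggregation name). The mapping update assumes a recent Elasticsearch version with typeless mappings, where `PUT /<index>/_mapping` accepts a `properties` object directly. The requests are only built, not sent.

```python
import json
import urllib.request

# Host and index name are taken from the curl commands above.
BASE_URL = "http://localhost:9200/cars"

# Assumed contents of agg.json (not shown in the post): a terms aggregation
# on `make`. The aggregation name "makes" is hypothetical.
agg_body = {
    "size": 0,  # skip returning hits; only the aggregation is of interest
    "aggs": {"makes": {"terms": {"field": "make"}}},
}

# The mapping change from the post.
mapping_body = {
    "properties": {
        "make": {"type": "keyword", "eager_global_ordinals": True}
    }
}


def json_request(method, url, body):
    """Build (but do not send) an HTTP request carrying a JSON body."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


update_mapping = json_request("PUT", f"{BASE_URL}/_mapping", mapping_body)
run_aggregation = json_request("GET", f"{BASE_URL}/_search", agg_body)

# To run against a live cluster, send each request, for example:
#   with urllib.request.urlopen(update_mapping) as resp:
#       print(resp.read().decode("utf-8"))
```

Building the requests separately from sending them keeps the sketch runnable without a cluster and makes the bodies easy to inspect.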

Conclusion

Ultimately, Elasticsearch cannot know ahead of time which fields are going to have terms aggregations run against them. Because of that, the default behavior is to lazy load the global ordinals for keyword fields (and for text fields, where ordinals are only available if fielddata has been enabled).

By enabling eager_global_ordinals in the mappings for the fields that your application or clients commonly aggregate on, you can realize a significant performance increase for those aggregations, as the test above suggests.

Note: this does, however, shift the cost of building global ordinals from search time to refresh time. For many use cases, that is an acceptable price to pay for the improvement at search time.
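A toy model makes the trade-off concrete. The class below is purely illustrative and is not Elasticsearch code; it simply counts where the ordinal rebuild happens, under the simplifying assumption that every refresh exposes a new segment.

```python
class ShardSim:
    """Toy model of a single shard; names and behavior are hypothetical."""

    def __init__(self, eager):
        self.eager = eager
        self.ordinals_ready = False
        self.builds_at_refresh = 0
        self.builds_at_search = 0

    def refresh(self):
        # Assumption: each refresh exposes a new segment, so the previous
        # global ordinals are no longer valid.
        self.ordinals_ready = False
        if self.eager:
            # Eager loading: pay the rebuild cost as part of the refresh.
            self.builds_at_refresh += 1
            self.ordinals_ready = True

    def search(self):
        # A terms aggregation needs global ordinals; build them now if missing.
        if not self.ordinals_ready:
            self.builds_at_search += 1
            self.ordinals_ready = True


def run(eager):
    shard = ShardSim(eager)
    for _ in range(3):  # three refresh cycles
        shard.refresh()
        shard.search()  # first search after the refresh
        shard.search()  # second search reuses the existing ordinals
    return shard.builds_at_refresh, shard.builds_at_search


lazy_builds = run(eager=False)
eager_builds = run(eager=True)
print("lazy  (refresh, search):", lazy_builds)
print("eager (refresh, search):", eager_builds)
```

The total number of rebuilds is the same in both modes; only the place where the cost is paid changes. With eager loading, no search has to wait for a rebuild, but every refresh does the work even if no aggregation follows.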

This is a simple example, so please share your findings and the performance impact of making this change in your Elasticsearch cluster and configuration!
