How we reclaimed 100 TB+ of storage with a single Elasticsearch API call

Published in Ipsos Synthesio Engineering, May 3, 2024

Photo by Ervins Strauhmanis from Flickr: https://www.flickr.com/photos/ervins_strauhmanis/10135243453/

At Ipsos Synthesio, we use Elasticsearch to handle our massive amounts of data. It’s a big part of our tech setup, involving hundreds of machines and a lot of data storage. All of this isn’t cheap. The more data we store, the more machines we need, which means more money spent.

What data does Elasticsearch store exactly?

Without getting into too much detail, when you index a document, Elasticsearch splits its data into fields, some of which are metadata fields (_source, _id, _version…) and others are the fields that you have defined in (or that are dynamically added to) your mapping. Each field (depending on its type) can be composed of one or multiple data structures that are optimized for a specific role:

stored_fields stores raw, compressed data
inverted_index maps values back to documents, which makes the field searchable
doc_values stores values in a way that speeds up aggregations
etc.

Which data structures are used for which field depends on the type of the field. For example:

the _source field, which contains the full JSON document, uses stored_fields
a field that you define as "type": "keyword" in your mapping uses both inverted_index and doc_values by default

You can also tweak this to some extent with mapping parameters. For example, you can add stored_fields for a specific field of your document, or deactivate doc_values for another.

Of course, the data structures you choose in your mapping configuration determine what you can and cannot do with a field. If you deactivate the inverted_index on a field (with the option "index": false in the mapping), you won’t be able to search on it anymore.
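To make these mapping parameters concrete, here is a minimal sketch of an index creation body written as a Python dict. The field names (author, title, session_id, raw_payload) are hypothetical and only serve the illustration; the parameters themselves (store, doc_values, index) are the ones discussed above.

```python
import json


def build_example_mapping():
    """Return an index creation body illustrating per-field data structures.

    All field names are hypothetical and chosen only for illustration.
    """
    return {
        "mappings": {
            "properties": {
                # keyword: inverted_index + doc_values by default
                "author": {"type": "keyword"},
                # "store": true adds stored_fields for this specific field
                "title": {"type": "text", "store": True},
                # no doc_values: still searchable, but no sorting or aggregations
                "session_id": {"type": "keyword", "doc_values": False},
                # no inverted_index: this text field can no longer be searched
                "raw_payload": {"type": "text", "index": False},
            }
        }
    }


if __name__ == "__main__":
    # This JSON is what would be sent as the body of an index creation request.
    print(json.dumps(build_example_mapping(), indent=2))
```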

Shedding light on the disk usage

Historically, there was no easy way to know how much storage space was used by each of these data structures and fields.

But Elasticsearch 7.15 introduced a new feature called the “Analyze index disk usage” API. This endpoint lets you know exactly how much space is taken by each data structure of each field in your index.
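As a rough sketch of what that call looks like, the snippet below builds the request with Python's standard library. The base URL and index name are placeholders, and authentication and TLS settings are left out, so treat it as a starting point rather than production code. The run_expensive_tasks=true parameter is required by the API, which is a hint that the analysis is resource-intensive and is best run when the cluster is not under heavy load.

```python
import json
import urllib.request


def disk_usage_request(base_url, index):
    """Build (without sending) the POST request for the disk usage API."""
    url = f"{base_url.rstrip('/')}/{index}/_disk_usage?run_expensive_tasks=true"
    return urllib.request.Request(url, method="POST")


def fetch_disk_usage(base_url, index):
    """Send the request and return the decoded JSON response.

    Assumes an unauthenticated cluster reachable at base_url.
    """
    with urllib.request.urlopen(disk_usage_request(base_url, index)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Placeholder URL and index name; nothing is sent here.
    req = disk_usage_request("http://localhost:9200", "my-index")
    print(req.get_method(), req.full_url)
```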

It also gives a total for each type of data structure across all fields. When we called this endpoint on one of our indices, we got these results:

We expected stored_fields (used by the _source field) and inverted_index (which makes all fields searchable) to be quite big. What we didn’t expect was for the term_vectors structure to be as big as (or actually bigger than) the inverted_index!
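For readers who want to run the same comparison, here is a small sketch that ranks the per-structure totals found under the all_fields key of the response. It assumes the response shape described in the Elasticsearch documentation for this API. The numbers in sample_response are invented for the example and are not our actual results.

```python
def structure_totals(response, index):
    """Return (structure, bytes) pairs from "all_fields", largest first."""
    all_fields = response[index]["all_fields"]
    totals = {}
    for key, value in all_fields.items():
        if key == "total_in_bytes":
            continue  # grand total, not a data structure
        if key.endswith("_in_bytes"):
            # e.g. "stored_fields_in_bytes" -> "stored_fields"
            totals[key[: -len("_in_bytes")]] = value
        elif isinstance(value, dict) and "total_in_bytes" in value:
            # inverted_index is reported as a nested object
            totals[key] = value["total_in_bytes"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Invented numbers, for illustration only.
sample_response = {
    "my-index": {
        "store_size_in_bytes": 1180,
        "all_fields": {
            "total": "1.1kb",
            "total_in_bytes": 1180,
            "inverted_index": {"total": "250b", "total_in_bytes": 250},
            "stored_fields": "500b",
            "stored_fields_in_bytes": 500,
            "doc_values": "100b",
            "doc_values_in_bytes": 100,
            "points": "20b",
            "points_in_bytes": 20,
            "norms": "10b",
            "norms_in_bytes": 10,
            "term_vectors": "300b",
            "term_vectors_in_bytes": 300,
        },
    }
}

if __name__ == "__main__":
    for name, size in structure_totals(sample_response, "my-index"):
        print(f"{name:15} {size} bytes")
```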

What are Term Vectors anyway?

Term vectors are used by the Fast Vector Highlighter. By storing the position of words within text, they make highlighting faster. Without term vectors, highlighting would still work, but the “plain” highlighter would have to re-analyze the text for every query that needs highlighting.
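As an illustration, the sketch below shows the kind of mapping and search body involved. The field name content is hypothetical. The term_vector value with_positions_offsets is the setting the Fast Vector Highlighter (type "fvh") relies on, while "plain" selects the highlighter that re-analyzes the text at query time.

```python
def text_field_mapping(with_term_vectors):
    """Return a mapping for a hypothetical "content" text field."""
    field = {"type": "text"}
    if with_term_vectors:
        # Stores terms with their positions and offsets: needed by "fvh"
        field["term_vector"] = "with_positions_offsets"
    return {"properties": {"content": field}}


def highlight_query(text, highlighter_type, field="content"):
    """Build a search body that matches `text` and highlights the hits.

    highlighter_type is "fvh" (needs term vectors) or "plain".
    """
    return {
        "query": {"match": {field: text}},
        "highlight": {"fields": {field: {"type": highlighter_type}}},
    }


if __name__ == "__main__":
    print(text_field_mapping(with_term_vectors=True))
    print(highlight_query("storage", "fvh"))
    print(highlight_query("storage", "plain"))
```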

We use highlighting quite a lot, and who wouldn’t want fast vector highlighting, right? Right? We sure did. So, many years ago, we turned on term vectors for all text fields in our mapping and were happy knowing we had the fastest highlighter.

What we didn’t know at the time was just how much space term vectors used, and thus how costly they were.

Dropping Term Vectors

To find out how much we really needed term vectors, we turned them off on a test index and measured the impact. It turned out that the difference in highlight performance was barely noticeable. The plain highlighter is fast enough that highlighting is just a tiny, tiny fraction of the overall query time.

By turning off term vectors on all text fields, we could slim down our indices by about 20%. Once all indices have been updated, this should save us more than 100 TB of storage space.
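A minimal sketch of the mapping change is shown below: a helper that walks a mapping's properties and removes every term_vector setting, including in sub-objects and multi-fields. The helper name and the sample field names are hypothetical, and this is not necessarily how our own tooling does it. It also assumes the cleaned mapping is applied to a new index that the data is then reindexed into, since to our knowledge term_vector cannot be changed in place on an existing field.

```python
def drop_term_vectors(properties):
    """Return a copy of a mapping's "properties" without any term_vector setting."""
    cleaned = {}
    for name, field in properties.items():
        field = dict(field)  # shallow copy so the input is left untouched
        field.pop("term_vector", None)
        # Recurse into object sub-fields ("properties") and multi-fields ("fields")
        for key in ("properties", "fields"):
            if key in field:
                field[key] = drop_term_vectors(field[key])
        cleaned[name] = field
    return cleaned


# Hypothetical mapping, for illustration only.
old_properties = {
    "title": {
        "type": "text",
        "term_vector": "with_positions_offsets",
        "fields": {
            "english": {"type": "text", "term_vector": "with_positions_offsets"}
        },
    },
    "author": {
        "properties": {
            "bio": {"type": "text", "term_vector": "with_positions_offsets"},
            "name": {"type": "keyword"},
        }
    },
}

if __name__ == "__main__":
    print(drop_term_vectors(old_properties))
```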

Conclusion

This saving was possible thanks to the insights provided by the Elasticsearch Disk Usage Analysis API. Without it, we might never have realized how much space term vectors were taking up, and we would have missed out on a simple yet effective way to reduce storage and, as a consequence, costs.

Sometimes, a deep dive into the details with the right tools can reveal surprising opportunities for savings. For us at Ipsos Synthesio, a single API call to Elasticsearch made a significant difference in our annual budget. It’s a good reminder of the value of regularly reviewing and optimizing our tech stack.