Improving Aggregations Performance with Elasticsearch

Published in Driven by Code · 7 min read · Mar 1, 2023

By: Kyler Stole

TrueCar’s Model Search offers a search results page (SRP) for virtual vehicle models. Combinable filters help you narrow down your vehicle needs to specific models that have what you want. This page was actually preceded by our Vehicle Rankings pages, which also filter down models but then rank the results using TrueCar’s proprietary scoring system. When we ported the Postgres filtering code over to an SRP format, the results worked just fine, but the page took several seconds to load because of the time spent calculating all of the filter options. Displaying a filters panel with a results count for each filter option is just not something Postgres handles very efficiently. Enter Elasticsearch.

TrueCar Model Search

Structuring data for our needs

Elasticsearch (ES) is built to work with documents in a way that supports the kind of aggregations we needed. Solving our aggregations problem wasn’t as simple as just switching platforms. We had to rethink how we were storing the data and how we would be querying it. In the Postgres Model Search service, we have highly normalized data; conversely, Elasticsearch stores data as flat documents.

When creating an ES index for Model Search, we loop through each relevant vehicle style (the base unit you start with when building a model) and query all of its possible data (both standard specifications and option packages). For example, a vehicle style might come with 5 or 7 seats, a gas or hybrid engine, manual or automatic transmission, and different levels of climate control. All of those values are stored with descriptive field names together in a document for the style along with the make and model names of the vehicle. Elasticsearch puts each unique value in an inverted index, a data structure that maps a value to the documents where it appears, allowing the data from each document to be counted more efficiently.
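As a rough sketch (the field names and values here are illustrative rather than our exact mapping), a single style document might look something like this:

{
  "make": "Kia",
  "model": { "name": "Soul", "year": 2023, "msrp": [22000, 24500] },
  "make_model": "kia_soul",
  "body_style": "hatchback",
  "drive_type": "FWD",
  "seats": [5],
  "engine_type": ["gas"],
  "transmission": ["manual", "automatic"],
  "climate_control": ["manual", "dual-zone automatic"]
}

Multi-valued fields capture every configuration the style supports, so a single document can match several filter options at once.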

What we didn’t need

Two common reasons to reach for Elasticsearch are high volume (working with millions of documents) and freeform text search. Neither was a concern for our use case: we are only dealing with a few hundred models (containing a few thousand styles), and we have no open text search feature on the page. In fact, we probably could have made Postgres work for us by adjusting the schema to denormalize data the way a document store does, but ultimately, Elasticsearch is still more appropriate for our aggregations use case.

First, the Aggregations

We make separate queries for results and aggregations and request them together with _msearch, which avoids multiple round trips by sending both queries in a single request. Splitting the query helps with simplicity, but it is mainly done to avoid a particular pitfall. When performing a search with parameters, we add a query key to the root of our results query with the values of the selected filters. If we tried to add aggregations for filter options to the same query as the results, we would get correct options for every filter except the ones currently applied to the query. For example, if we want to get all possible Body Style values, we can’t be filtering that aggregation down to “hatchback” only. Instead, we selectively apply parameters to each individual aggregation, excluding each aggregation’s own key from its filter.

{
  "body_style": {
    "filter": { "bool": { "filter": [] } },
    "aggregations": {
      "body_style": {
        "terms": { "field": "body_style", "size": 100, "min_doc_count": 0 }
      }
    }
  },
  "drive_type": {
    "filter": { "bool": { "filter": [
      { "terms": { "body_style": ["hatchback"] } }
    ] } },
    "aggregations": {
      "drive_type": {
        "terms": { "field": "drive_type", "size": 100, "min_doc_count": 0 }
      }
    }
  }
}
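For reference, the two searches travel together as an _msearch request, which takes newline-delimited JSON: a header line for each search followed by its body on a single line. In this sketch the header lines are empty on the assumption that the index name is part of the request URL, and the size of 0 on the aggregations search (which skips returning hits for that query) is likewise an assumption; both bodies are abbreviated.

{}
{ "size": 0, "aggregations": { "body_style": { ... }, "drive_type": { ... } } }
{}
{ "query": { ... }, "collapse": { "field": "make_model" }, "sort": [ ... ] }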

In some ways, aggregations were the simpler part of migrating Model Search to Elasticsearch. Each filter’s set of options is independent of the other filters. Range-type filters (like MSRP and MPG) don’t need to be dynamic (we want you to be able to expand to the full range anytime), so we just keep the minimum and maximum values in memory. For the select-type filters, we grab the available options using an Elasticsearch terms aggregation, which puts the distinct values into buckets for each key.
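As for the range bounds we keep in memory, one way to fetch them (a sketch; the exact field and approach are assumptions, not necessarily what our service does) is a one-off min/max aggregation whose values can then be cached:

{
  "size": 0,
  "aggregations": {
    "min_msrp": { "min": { "field": "model.msrp" } },
    "max_msrp": { "max": { "field": "model.msrp" } }
  }
}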

Setting size and min_doc_count

Configuring the size and min_doc_count settings for each terms aggregation is important for getting all possible values of a filter. By default, Elasticsearch will only return 10 terms (“buckets”) for the aggregation. That would be fine for a field like drive_type, which only has 4 options (AWD, 4WD, FWD, RWD), but some fields have more options and we’re working with a small data set, so to make life easy, we increase the limit to 100 buckets.

Setting the minimum document count to 0 returns buckets for all values, even if they didn’t match any hit. This allows us to provide a consistent filtering experience on Model Search, where filter options that don’t match any of the filtered documents can still be displayed with a zero count.
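With the size raised and min_doc_count set to 0, the bucket list for a filter comes back looking roughly like this (response trimmed, counts illustrative); note the zero-count bucket that would otherwise be dropped:

{
  "aggregations": {
    "body_style": {
      "doc_count": 309,
      "body_style": {
        "buckets": [
          { "key": "suv", "doc_count": 212 },
          { "key": "hatchback", "doc_count": 97 },
          { "key": "convertible", "doc_count": 0 }
        ]
      }
    }
  }
}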

Counting options accurately

The terms aggregation we use to get options for most of our filters returns distinct values as buckets with a count. The counts represent the number of matching documents, but that’s not exactly what we want. We have a document for each style, but this is a Model Search, and we want to display results and filters at the model level. Essentially, we want to roll up documents that represent the same model so they don’t count more than once. To accomplish that, we nest a cardinality aggregation within each filter’s terms aggregation and use that count instead of the doc_count returned by the terms aggregation. The field for the cardinality aggregation contains a unique identifier of the model.

{
  "body_style": {
    "filter": { "bool": { "filter": [] } },
    "aggregations": {
      "body_style": {
        "terms": { "field": "body_style", "size": 100, "min_doc_count": 0 },
        "aggregations": {
          "collapsed": { "cardinality": { "field": "make_model" } }
        }
      }
    }
  }
}
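Each bucket in the response then carries both numbers, and we read the nested collapsed value instead of doc_count (again trimmed, with illustrative counts):

{
  "aggregations": {
    "body_style": {
      "doc_count": 309,
      "body_style": {
        "buckets": [
          { "key": "suv", "doc_count": 212, "collapsed": { "value": 64 } },
          { "key": "hatchback", "doc_count": 97, "collapsed": { "value": 31 } }
        ]
      }
    }
  }
}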

Processing Results

While Elasticsearch offered a fix for our aggregations performance woes, it did complicate how we get results. Elasticsearch essentially achieves its performance by optimizing how data is stored to be queried in specific ways. You can achieve almost anything in a Postgres query, but ES feels much more limited by comparison. The queries for our Rankings pages go through multiple layers of subqueries to deal with model families (where a manufacturer uses different names for similar model lines) and results sorting.

Model Search and Vehicle Rankings actually cloak a level of complexity in their results. While vehicles are displayed at the model level because that’s how manufacturers market them, models serve as a grouping of possible styles that define how a vehicle could be built. Additionally, factory options can be added to some styles according to complex rules. If Model Search processed results at the model level and ignored the complexity below, it might produce invalid results for a combination of filters that could not actually be built. For example, a model might be offered with a gas or hybrid engine, but the MPG figure would be different for each. Instead, we operate at the style level and, in Postgres, use subqueries with DISTINCT to roll up to the model level.

In Elasticsearch, we have to rely on the collapse parameter to produce model-level results. Just like our approach to counts in aggregations, we use a unique make_model field for the collapse parameter, which works similarly to Postgres’s DISTINCT ON.

{
  "query": { ... },
  "sort": [ ... ],
  "collapse": {
    "field": "make_model"
  }
}

What about the subqueries?

On the Model Search page, we don’t worry about collapsing results for model families like we do for Vehicle Rankings. However, there is a bit of added complexity for sorting by price. We prefer to show the latest models, but if several years of a model are available concurrently (which happens as the model transitions to the next year), we still need to allow for the older years to be part of the search in case the user filters on something that was only available on the previous model. For the collapse to work correctly in this case, we sort first by year and then by price.

{
  "sort": [
    { "model.year": { "mode": "min", "order": "desc" } },
    { "model.msrp": { "mode": "min", "order": "asc" } }
  ]
}

This gives us the correct results, but potentially in an incorrect order. For example, if we were sorting price low to high and the results contained a 2023 Kia Soul with an MSRP of $22k and a 2022 Nissan Sentra with an MSRP of $19k, the Soul would come first due to its newer year even though the Sentra has the lower MSRP. Unfortunately, we didn’t have a grand solution for that in Elasticsearch. Its execution pipeline is strict and doesn’t leave room for the endless subqueries that flexible Postgres allows. In the end, knowing that there are only a few hundred models, we cheated a bit: we request all results and re-sort them client-side to ensure they end up in the correct order.
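Concretely, that just means the results search asks for everything in one page and the application re-orders the collapsed hits by price before rendering. The size below is an arbitrary ceiling comfortably above our few hundred models, not a value from our actual service:

{
  "query": { ... },
  "collapse": { "field": "make_model" },
  "sort": [ ... ],
  "size": 1000
}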

Next Steps

Elasticsearch can be counterintuitive at times, and the syntax is rather clunky, but it excels at performing aggregations. If you have SQL queries that suffer from slow aggregate functions in a relational database, then migrating to ES could offer a significant performance boost. Getting started with the Elasticsearch Guide is generally straightforward and requires very little knowledge of the internal data structures. Mastery may come later.

If you want to learn more about Model Search and Vehicle Rankings pages, check out this post that dives into our use of ActiveRecord with Arel to power similar vehicle filtering from the Postgres side.

We are hiring! If you love solving problems, please apply here. We would love to have you join us!
