Grouping Documents in Elasticsearch

Driven by Code
Jul 30 · 3 min read

By: Noah Matisoff

Elasticsearch is an open-source, RESTful, distributed search and analytics engine built on Apache Lucene. Since its release in 2010, Elasticsearch has become one of the most popular search engines, and is commonly used for log analytics, full-text search, security intelligence, business analytics, and operational intelligence use cases.

At TrueCar, we use Elasticsearch to power all of our search functionality, most importantly for searching amongst millions of vehicles spread across thousands of dealerships on the platform.

In Q3 of 2018, we needed to support grouping of “similar” vehicles in our search results, to keep dealerships with lots of inventory from saturating the results. The goal was to group results that come from the same dealership and share identical values across a set of predefined fields. We wanted our users to see a wider variety of dealerships among the results that met their search criteria.

Enter Aggregations

Our first thought was to use Elasticsearch aggregations to support this functionality. We are very familiar with aggregations, as they power the APIs that populate all of the filters our users interact with.

However, using aggregations for this use case would have:

  1. Required significant changes to our queries that power search.
  2. Increased the already ever-growing size of our payload for the query.
  3. Made the query much more esoteric, requiring deep knowledge of Elasticsearch and context as to why it’s not using a traditional query.
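To make these trade-offs concrete, here is a rough sketch of what an aggregation-based grouping request could look like. It is not the query we actually considered: the aggregation names `similar_vehicles` and `top_vehicle` are hypothetical, and `match_all` stands in for a real search query. It uses a `terms` aggregation with a `top_hits` sub-aggregation, which is one common way to approximate grouping.

```python
import json


def build_grouping_aggregation(group_field: str, groups: int = 10) -> dict:
    """Build a search body that groups documents with a terms aggregation.

    Each bucket carries a top_hits sub-aggregation that returns the single
    top document for that group.
    """
    return {
        "size": 0,  # the regular hits list goes unused; results come from buckets
        "query": {"match_all": {}},  # placeholder; a real search query goes here
        "aggs": {
            "similar_vehicles": {  # hypothetical aggregation name
                "terms": {"field": group_field, "size": groups},
                "aggs": {
                    "top_vehicle": {  # hypothetical sub-aggregation name
                        "top_hits": {"size": 1}
                    }
                },
            }
        },
    }


body = build_grouping_aggregation("vehicle.similarVehicleKey")
print(json.dumps(body, indent=2))
```

With a request like this, results come back under `aggregations` instead of `hits`, and `terms` buckets are ordered by document count by default rather than by relevance, so both the query and every client reading the response would need to change.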

After doing more research, we found that Elasticsearch supports a collapse argument that takes a field name to collapse documents on. This was exactly what we needed! But… we were running on an older version of Elasticsearch that didn’t support collapse. We decided to invest the effort in paying down some technical debt to upgrade our Elasticsearch cluster for numerous reasons, one of which was to support collapse for grouping documents. That warrants its own post, so I will be skipping over all of the “fun” details of that, and jumping right into how collapse works in Elasticsearch.

Enter Collapse

In its simplest form, collapse “allows to collapse search results based on field values. The collapsing is done by selecting only the top sorted document per collapse key.”

An example Elasticsearch query that we would run for searching vehicles with collapsing would look like:

{
  "query": {},
  "collapse": {
    "field": "vehicle.similarVehicleKey"
  }
}
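To illustrate what “selecting only the top sorted document per collapse key” means, here is a small pure-Python simulation. It is only a sketch of the semantics, not how Elasticsearch implements collapsing internally, and the sample documents, keys, and scores are made up.

```python
from typing import Any, Dict, List


def collapse_hits(sorted_hits: List[Dict[str, Any]], field: str) -> List[Dict[str, Any]]:
    """Keep only the first (top sorted) hit for each distinct value of `field`."""
    seen = set()
    collapsed = []
    for hit in sorted_hits:
        key = hit[field]
        if key not in seen:
            seen.add(key)
            collapsed.append(hit)
    return collapsed


# Hypothetical results, already sorted by score; all values are invented.
hits = [
    {"vin": "A1", "similarVehicleKey": "dealer1-civic-lx", "score": 9.1},
    {"vin": "A2", "similarVehicleKey": "dealer1-civic-lx", "score": 8.7},
    {"vin": "B1", "similarVehicleKey": "dealer2-civic-lx", "score": 8.5},
    {"vin": "A3", "similarVehicleKey": "dealer1-civic-lx", "score": 8.0},
]

collapsed = collapse_hits(hits, "similarVehicleKey")
print([hit["vin"] for hit in collapsed])  # prints ['A1', 'B1']
```

The three similar vehicles from the first dealership shrink to a single result, which leaves room for the second dealership higher up in the list.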

Elasticsearch collapse also supports retrieving the “inner hits”, which are the other documents that share the same value for the collapse field. However, requesting inner hits may significantly degrade performance, as each result set of inner hits executes another query on the cluster.

Below is an example collapse query that includes inner hits from the grouped documents:

{
  "query": {},
  "collapse": {
    "field": "vehicle.similarVehicleKey",
    "inner_hits": {
      "name": "similarVehicles",
      "size": 0
    },
    "max_concurrent_group_searches": 3
  }
}

As the query above shows, max_concurrent_group_searches can be specified to limit concurrency, since the inner hits portion of the query could cause N additional queries to fan out across the cluster.

Without explicitly defining max_concurrent_group_searches, Elasticsearch will set it “based on the number of data nodes and the default search thread pool size.”
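As a minimal sketch of how such a request body could be assembled in application code, the helper below builds the collapse clause and only adds the inner hits options when they are asked for. The function name `build_collapse_query` and its parameters are hypothetical, and `match_all` is a stand-in for a real search query.

```python
import json
from typing import Any, Dict, Optional


def build_collapse_query(
    query: Dict[str, Any],
    field: str,
    inner_hits_name: Optional[str] = None,
    inner_hits_size: int = 3,
    max_concurrent_group_searches: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a search body that collapses results on `field`."""
    collapse: Dict[str, Any] = {"field": field}
    if inner_hits_name is not None:
        collapse["inner_hits"] = {"name": inner_hits_name, "size": inner_hits_size}
        # The concurrency cap only matters when inner hits trigger extra
        # per-group searches, so it is left out otherwise.
        if max_concurrent_group_searches is not None:
            collapse["max_concurrent_group_searches"] = max_concurrent_group_searches
    return {"query": query, "collapse": collapse}


body = build_collapse_query(
    {"match_all": {}},  # placeholder query
    "vehicle.similarVehicleKey",
    inner_hits_name="similarVehicles",
    inner_hits_size=0,
    max_concurrent_group_searches=3,
)
print(json.dumps(body, indent=2))
```

Leaving `max_concurrent_group_searches` unset in the helper defers to the Elasticsearch default described above.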

The response from Elasticsearch when providing the collapse argument does not change significantly, and in our case did not cause any breaking changes to clients. Since the grouped data is added to each document in the response under an additional key, it was simple to expose it to any clients that needed it.

An example response can be seen below:

{
  ...
  "hits": [
    {
      "_index": "vehicles",
      "_type": "vehicle",
      "_id": "myVin",
      "_score": ...,
      "_source": {...},
      "inner_hits": {
        "similarVehicles": {
          "hits": {
            ...
            "total": 6,
            "hits": [
              {
                ...
                "_source": {...}
              }
            ]
          }
        }
      }
    }
  ]
}
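A client that wants the grouped data could read it along these lines. This is a sketch under assumptions: the function name `summarize_groups` and the sample IDs are hypothetical, and the list passed in is the array of hits (in a full response, typically found at `response["hits"]["hits"]`).

```python
from typing import Any, Dict, List


def summarize_groups(hits: List[Dict[str, Any]], inner_hits_name: str) -> List[Dict[str, Any]]:
    """Return, for each collapsed hit, the size and members of its group."""
    summaries = []
    for hit in hits:
        # Hits without inner_hits (e.g. from a plain query) yield an empty group.
        group = hit.get("inner_hits", {}).get(inner_hits_name, {}).get("hits", {})
        total = group.get("total", 0)
        # Newer Elasticsearch versions report total as an object with a
        # "value" key rather than a bare integer, so accept both shapes.
        if isinstance(total, dict):
            total = total.get("value", 0)
        summaries.append({
            "id": hit["_id"],
            "group_total": total,
            "group_ids": [inner.get("_id") for inner in group.get("hits", [])],
        })
    return summaries


# Hypothetical hits shaped like the example response above.
sample_hits = [
    {
        "_id": "myVin",
        "inner_hits": {
            "similarVehicles": {
                "hits": {"total": 6, "hits": [{"_id": "otherVin"}]}
            }
        },
    }
]

print(summarize_groups(sample_hits, "similarVehicles"))
```

Because the lookup falls back to empty values, clients that never request inner hits can run the same code unchanged.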

Conclusion

Elasticsearch is a powerful NoSQL solution for performant searching, and it also offers options for niche needs like grouping documents or data. The collapse argument is a flexible way to group similar data, improve the end-user’s experience, and avoid saturating results when documents have an informal “belongs to” association.

