By: Noah Matisoff
Elasticsearch is an open-source, RESTful, distributed search and analytics engine built on Apache Lucene. Since its release in 2010, Elasticsearch has become one of the most popular search engines, and is commonly used for log analytics, full-text search, security intelligence, business analytics, and operational intelligence use cases.
At TrueCar, we use Elasticsearch to power all of our search functionality, most importantly for searching amongst millions of vehicles spread across thousands of dealerships on the platform.
In Q3 of 2018, we needed to support grouping of “similar” vehicles in our search results, to avoid saturating the results with vehicles from a few dealerships that carry lots of inventory. The goal was to group vehicles from the same dealership whose values for a set of pre-defined fields were identical. We wanted our users to see a wider variety of dealerships among the results that met their search criteria.
Our first thought was to use Elasticsearch aggregations to support this functionality. We are very familiar with aggregations, as they power the APIs that populate all of the filters our users interact with.
However, using aggregations for this use case would have:
- Required significant changes to our queries that power search.
- Increased the already ever-growing size of our payload for the query.
- Made the query much more esoteric, requiring deep knowledge of Elasticsearch and context as to why it’s not using a traditional query.
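To make the drawbacks above concrete, here is a minimal sketch of what an aggregation-based grouping request could look like, built as a Python dict. It assumes a `terms` aggregation with a `top_hits` sub-aggregation; the field name `similarity_group_key` and the aggregation names are hypothetical, not our actual mapping or query.

```python
import json


def build_grouping_aggregation(group_field, per_group=1, max_groups=10):
    """Build a search body that groups matches with a terms aggregation.

    The grouped documents come back under "aggregations" instead of the
    usual "hits" list, so clients have to parse a different response shape.
    """
    return {
        "size": 0,  # suppress regular hits; results live in the aggregation
        "query": {"match_all": {}},
        "aggs": {
            "by_group": {
                # One bucket per distinct value of the (hypothetical) group field.
                "terms": {"field": group_field, "size": max_groups},
                "aggs": {
                    # Return the top document(s) inside each bucket.
                    "top_vehicle": {"top_hits": {"size": per_group}}
                },
            }
        },
    }


body = build_grouping_aggregation("similarity_group_key")
print(json.dumps(body, indent=2))
```

Even in this stripped-down form, the request no longer resembles a traditional query: the regular hits are switched off and the results are nested two levels deep inside the aggregation.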
After doing more research, we found that Elasticsearch supports a collapse argument that takes a field name to collapse documents on. This was exactly what we needed! But… we were running on an older version of Elasticsearch that didn’t support collapse. We decided to invest the effort in paying down some technical debt and upgrade our Elasticsearch cluster for numerous reasons, one of which was to support collapse for grouping documents. That warrants its own post, so I will skip over all of the “fun” details and jump right into how collapse works in Elasticsearch.
In its simplest form, collapse “allows to collapse search results based on field values. The collapsing is done by selecting only the top sorted document per collapse key.”
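The "top sorted document per collapse key" rule can be illustrated in plain Python. This is only a sketch of the semantics, not how Elasticsearch implements collapsing internally; the sample documents and the `group_key` field are made up for illustration.

```python
def collapse_hits(sorted_hits, field):
    """Keep only the first (top sorted) hit for each distinct value of `field`.

    `sorted_hits` must already be in the desired sort order.
    """
    seen = set()
    collapsed = []
    for hit in sorted_hits:
        key = hit[field]
        if key not in seen:
            seen.add(key)
            collapsed.append(hit)
    return collapsed


# Hypothetical hits, already sorted by price ascending.
hits = [
    {"id": 1, "group_key": "dealer-a|sedan", "price": 18000},
    {"id": 2, "group_key": "dealer-a|sedan", "price": 18500},
    {"id": 3, "group_key": "dealer-b|sedan", "price": 19000},
    {"id": 4, "group_key": "dealer-a|truck", "price": 25000},
]

collapsed = collapse_hits(hits, "group_key")
print([h["id"] for h in collapsed])  # [1, 3, 4]
```

Document 2 is dropped because it shares a collapse key with document 1, which sorts ahead of it.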
An example Elasticsearch query that we would run for searching vehicles with collapsing would look like:
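A minimal sketch of such a query, built as a Python dict that can be serialized and sent as the search request body. The field names (`make`, `price`, `similarity_group_key`) are assumptions for illustration and not our actual mapping; collapsing requires a single-valued keyword or numeric field, so the sketch assumes the grouping fields have been combined into one key at index time.

```python
import json


def build_collapse_query(collapse_field, make, size=10):
    """Build a search body that returns one top-sorted hit per collapse key."""
    return {
        "size": size,
        # Hypothetical filter: only vehicles of the requested make.
        "query": {"bool": {"filter": [{"term": {"make": make}}]}},
        # The sort decides which document represents each group.
        "sort": [{"price": "asc"}],
        # Collapse on the (assumed) precomputed grouping key.
        "collapse": {"field": collapse_field},
    }


body = build_collapse_query("similarity_group_key", make="honda")
print(json.dumps(body, indent=2))
```

The only difference from an ordinary search body is the top-level `collapse` key, which is why adopting it required so few changes to the existing query.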
collapse also supports retrieving the “inner hits”, which are the documents in each group that share the value of the specified collapse field. However, requesting inner hits may significantly degrade performance, as each set of inner hits executes another query on the cluster.
Below is an example collapse query that includes inner hits from the grouped documents:
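A minimal sketch of a collapse query with inner hits, again as a Python dict. The `inner_hits` and `max_concurrent_group_searches` options are part of the collapse syntax; the field names, the inner hits name `similar_vehicles`, and the chosen sizes are assumptions for illustration.

```python
import json


def build_collapse_query_with_inner_hits(collapse_field, inner_size=3, max_concurrent=4):
    """Build a search body that collapses on a field and returns each group's members."""
    return {
        "query": {"match_all": {}},
        "sort": [{"price": "asc"}],
        "collapse": {
            "field": collapse_field,
            "inner_hits": {
                # Hypothetical name; it becomes the key in each hit's "inner_hits".
                "name": "similar_vehicles",
                "size": inner_size,
                "sort": [{"price": "asc"}],
            },
            # Cap on how many per-group inner hits searches run at once.
            "max_concurrent_group_searches": max_concurrent,
        },
    }


body = build_collapse_query_with_inner_hits("similarity_group_key")
print(json.dumps(body, indent=2))
```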
As can be seen in the query above, max_concurrent_group_searches can be specified, since the inner hits portion of the query could cause N additional queries to fan out.
Without explicitly defining max_concurrent_group_searches, Elasticsearch will set it “based on the number of data nodes and the default search thread pool size.”
The response from Elasticsearch does not change significantly when the collapse argument is provided, and in our case it did not cause any breaking changes for clients. Since the grouped data is added to each document in the response as an additional key, it was simple to add, and any client can read it from the response if needed.
An example response can be seen below:
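A hand-written, abbreviated sketch of a collapsed response and how a client might read it. The overall shape (a `fields` entry for the collapse key and an `inner_hits` entry per hit) follows the Elasticsearch collapse response format; the document IDs, values, and the `similar_vehicles` name are made up for illustration.

```python
# Abbreviated, made-up response: each top-level hit represents one group.
response = {
    "hits": {
        "hits": [
            {
                "_id": "101",
                "_source": {"make": "honda", "price": 18000},
                "fields": {"similarity_group_key": ["dealer-a|sedan"]},
                "inner_hits": {
                    "similar_vehicles": {
                        "hits": {"hits": [{"_id": "101"}, {"_id": "102"}]}
                    }
                },
            },
            {
                "_id": "205",
                "_source": {"make": "honda", "price": 19000},
                "fields": {"similarity_group_key": ["dealer-b|sedan"]},
                "inner_hits": {
                    "similar_vehicles": {"hits": {"hits": [{"_id": "205"}]}}
                },
            },
        ]
    }
}


def summarize_collapsed_hits(response, inner_hits_name):
    """Return (representative id, [ids of grouped documents]) for each hit."""
    summary = []
    for hit in response["hits"]["hits"]:
        # Clients that ignore "inner_hits" keep working unchanged.
        inner = hit.get("inner_hits", {}).get(inner_hits_name, {})
        inner_ids = [h["_id"] for h in inner.get("hits", {}).get("hits", [])]
        summary.append((hit["_id"], inner_ids))
    return summary


summary = summarize_collapsed_hits(response, "similar_vehicles")
print(summary)  # [('101', ['101', '102']), ('205', ['205'])]
```

Because the grouped documents sit under an extra key on each hit, a client that only reads `_source` sees the same structure it always has.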
Elasticsearch is a powerful NoSQL solution for performant searching, and it also offers options for niche needs like grouping documents or data.
collapse is a great, flexible way to group similar data, improve the end user’s experience, and avoid saturating results when documents have an informal “belongs to” association.