Beyond Keywords: The Dynamic Shift to Nested Aggregations in OpenSearch

Published in SSENSE-TECH · 8 min read · Feb 16, 2024

OpenSearch is a powerful data store and search engine that enables users to ingest, search, and visualize data at scale. Derived from Elasticsearch, it is an open-source suite that offers numerous plugins to enhance its performance. At SSENSE, we have integrated OpenSearch as our product catalog, allowing customers to easily search our assortment, typically referred to as “product discovery.”

We use the terms aggregation in OpenSearch to build the navigation menus customers use to browse our assortment. However, as we were developing AI-driven features for navigation, we noticed some strange behaviors in the aggregation results using this standard approach. In this article, we look at how we adopted nested aggregations and how they differ from the traditional terms aggregation.

Context

Recent developments in AI and ML have taken the tech universe by storm. At SSENSE, we were nearing completion of a new AI-driven feature in the product discovery experience when we noticed a puzzling outcome. The goal here was to leverage an ML model that could assign thematic, physical, or application attributes to products by using text and image data to generate inferences. This is also referred to as “product DNA.”

The finish line was almost within reach after completing the work to expose this feature in the OpenSearch index using the traditional terms aggregation. Everything seemed to be going smoothly as we tested the UI / UX on the website and app. However, we soon discovered that the aggregations yielded incorrect results 😕.

What Happened?

Using the terms aggregation, we noticed aggregation buckets containing product documents that were not tagged with those attributes. We were getting incorrect doc counts and products being grouped in incorrect buckets.

If users were to click on a miscategorized link, they would be directed to an empty product listing page, which is frustrating, to say the least. But why was this happening? The answer lies in the subtleties of how terms aggregations are used as well as the document structure. Let’s take a closer look at both, with our specific issue in mind.

“terms” vs “nested” Aggregations

“keyword” mapping and “terms” aggregation

The product attributes generated by the ML model did not have the structure we normally deal with. Most of the fields on our product documents are simple scalars, or objects made up of scalars:

{
 "id": 1234,
 "name": {
  "en": "sample product"
 },
 "category": {
  "id": 3,
  "parent_id": 1
  ...
 }
 ...
}

So the mapping template (mapping is equivalent to a schema definition) looked something like this:

{
 "id": {
        "type": "keyword"
   },
 "name": {
  "properties": {
   "en": {
    "type": "keyword"
   }
  }
 },
 "category": {
  "properties": {
   "id": {
    "type": "keyword"
   },
   "parent_id": {
    "type": "keyword"
   }
  }
 } 
}

These are easy to run a simple terms aggregation on, because the keyword type stores each value as a single unanalyzed term, so OpenSearch can bucket documents by exact value.

GET /my-index/_search
{
    "aggs": {
        "category_parent_bucket": {
            "terms": {
                "field": "category.parent_id",
                "size": 10
            },
            "aggs": {
                "category_bucket": {
                    "terms": {
                        "field": "category.id",
                        "size": 10
                    }
                }
            }
        }
    }
}

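Running this aggregation gives the desired response, with the expected doc counts per bucket. (The outer filtered level in the sample response comes from a wrapping aggregation that is not shown in the request above.)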

{
    "aggregations": {
        "filtered": {
            "meta": {},
            "doc_count": 35,
            "category_parent_bucket": {
                "buckets": [
                    {
                        "key": "1",
                        "doc_count": 29, // parent buckets "1" and "2" add up to 35 docs
                        "category_bucket": {
                            "buckets": [
                                {
                                    "key": "3",
                                    "doc_count": 28 // child buckets in parent "1" add up to 29
                                },
                                {
                                    "key": "5",
                                    "doc_count": 1
                                }
                            ]
                        }
                    },
                    {
                        "key": "2",
                        "doc_count": 6,
                        "category_bucket": {
                            "buckets": [
                                {
                                    "key": "4",
                                    "doc_count": 6
                                }
                            ]
                        }
                    }
                ]
            }
        }
    }
}
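
For single-valued fields like these, the counts are easy to sanity-check: the parent buckets should add up to the total, and each parent's child buckets should add up to that parent. Below is a minimal Python sketch of that check, assuming the response has already been parsed into a dict. The helper name `bucket_sums_ok` is ours for illustration, and the check only holds when every document has exactly one value per field and `size` is large enough to return every bucket.

```python
def bucket_sums_ok(agg: dict, total: int, child_name: str) -> bool:
    """Return True if the parent buckets sum to `total` and each parent's
    child buckets sum to that parent's doc_count."""
    buckets = agg["buckets"]
    if sum(b["doc_count"] for b in buckets) != total:
        return False
    return all(
        sum(c["doc_count"] for c in b[child_name]["buckets"]) == b["doc_count"]
        for b in buckets
    )


# The category response shown above, trimmed to the relevant keys.
filtered = {
    "doc_count": 35,
    "category_parent_bucket": {"buckets": [
        {"key": "1", "doc_count": 29, "category_bucket": {"buckets": [
            {"key": "3", "doc_count": 28},
            {"key": "5", "doc_count": 1},
        ]}},
        {"key": "2", "doc_count": 6, "category_bucket": {"buckets": [
            {"key": "4", "doc_count": 6},
        ]}},
    ]},
}

consistent = bucket_sums_ok(
    filtered["category_parent_bucket"], filtered["doc_count"], "category_bucket"
)
print(consistent)
```

A check like this is cheap to run in a test suite against a small fixture index, and it is the kind of invariant that breaks in the scenario described next.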

Complex Data Structures

The data structure of the attributes from the ML model was not as simple. We were receiving an array of objects, which looked something like this:

{
    "attributes": [
        {
            "id": "attribute_1",
            "value": "attribute 1"
        },
        {
            "id": "attribute_2",
            "value": "attribute 2"
        }
    ]
}

Say we used a mapping similar to the one in the previous scenario:

{
    "attributes": {
        "properties": {
            "id": {
                "type": "keyword"
            },
            "value": {
                "type": "keyword"
            }
        }
    }
}

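If we run a terms aggregation on attributes.id with a sub-aggregation on attributes.value, every id bucket comes back with identical contents! With a plain object mapping, OpenSearch flattens the array into two multi-valued fields (attributes.id and attributes.value), so the association between each id and its own value is lost: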

{
    "aggregations": {
        "filtered": {
            "doc_count": 10, // the bucket doc counts below add up to 30, not 10
            "attribute_id_bucket": {
                "buckets": [
                    {
                        "key": "attribute_1",
                        "doc_count": 10,
                        "attribute_value_bucket": {
                            "buckets": [
                                {
                                    "key": "attribute 1",
                                    "doc_count": 5 // should only exist in "attribute_1" bucket
                                },
                                {
                                    "key": "attribute 2",
                                    "doc_count": 3
                                },
                                {
                                    "key": "attribute 3",
                                    "doc_count": 2
                                }
                            ]
                        }
                    },
                    {
                        "key": "attribute_2",
                        "doc_count": 10,
                        "attribute_value_bucket": {
                            "buckets": [
                                {
                                    "key": "attribute 1",
                                    "doc_count": 5
                                },
                                {
                                    "key": "attribute 2",
                                    "doc_count": 3 // should only exist in "attribute_2" bucket
                                },
                                {
                                    "key": "attribute 3",
                                    "doc_count": 2
                                }
                            ]
                        }
                    },
                    {
                        "key": "attribute_3",
                        "doc_count": 10,
                        "attribute_value_bucket": {
                            "buckets": [
                                {
                                    "key": "attribute 1",
                                    "doc_count": 5
                                },
                                {
                                    "key": "attribute 2",
                                    "doc_count": 3
                                },
                                {
                                    "key": "attribute 3",
                                    "doc_count": 2 // should only exist in "attribute_3" bucket
                                }
                            ]
                        }
                    }
                ]
            }
        }
    }
}
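
To make the cause concrete, here is a small Python simulation of what happens to an array of objects under a plain object mapping. It is only a sketch of the behavior, not OpenSearch's implementation; the sample documents and the function names `flatten` and `terms_over_flattened` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical sample documents: each product carries two attributes,
# and each attribute id always pairs with the same value.
docs = [
    {"attributes": [{"id": "attribute_1", "value": "attribute 1"},
                    {"id": "attribute_2", "value": "attribute 2"}]},
    {"attributes": [{"id": "attribute_1", "value": "attribute 1"},
                    {"id": "attribute_3", "value": "attribute 3"}]},
]


def flatten(doc: dict) -> dict:
    """Mimic an object mapping: one multi-valued field per property."""
    return {
        "attributes.id": [a["id"] for a in doc["attributes"]],
        "attributes.value": [a["value"] for a in doc["attributes"]],
    }


def terms_over_flattened(documents: list) -> dict:
    """Terms on attributes.id with a sub-terms on attributes.value,
    counting each document once per distinct term it contains."""
    buckets = defaultdict(Counter)
    for flat in map(flatten, documents):
        for attr_id in set(flat["attributes.id"]):
            for value in set(flat["attributes.value"]):
                buckets[attr_id][value] += 1
    return {k: dict(v) for k, v in buckets.items()}


flat_buckets = terms_over_flattened(docs)
for attr_id in sorted(flat_buckets):
    print(attr_id, sorted(flat_buckets[attr_id].items()))
```

In this simulation, "attribute 2" shows up inside the attribute_1 bucket even though no single object pairs them. The document as a whole contains both terms, and that is all a terms aggregation over flattened fields can see.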

“nested” Mapping and Aggregation

Not all products will have the same attributes, so the buckets should not all have the same doc count.

This is where nested mappings come into play. The nested type was designed for arrays of objects whose fields need to be queried together. It indexes each object in the array as its own entity instead of flattening it into the parent document. Internally, OpenSearch indexes each nested object as a separate hidden document, meaning that each one can be queried independently of the others. You can read more about nested properties and aggregation in the OpenSearch documentation.
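
One way to picture the hidden-document model is the following Python sketch. It only imitates the idea described above and is not how OpenSearch is implemented; the sample documents and the helper names `hidden_docs` and `nested_terms` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical sample documents, each carrying two attribute objects.
docs = [
    {"id": 1, "attributes": [{"id": "attribute_1", "value": "attribute 1"},
                             {"id": "attribute_2", "value": "attribute 2"}]},
    {"id": 2, "attributes": [{"id": "attribute_1", "value": "attribute 1"},
                             {"id": "attribute_3", "value": "attribute 3"}]},
]


def hidden_docs(documents: list, path: str) -> list:
    """Mimic a nested mapping: every object in the array at `path`
    becomes its own small document, so its fields stay together."""
    return [dict(obj) for doc in documents for obj in doc[path]]


def nested_terms(documents: list, path: str) -> dict:
    """Terms on id with a sub-terms on value, evaluated per nested object."""
    buckets = defaultdict(Counter)
    for obj in hidden_docs(documents, path):
        buckets[obj["id"]][obj["value"]] += 1
    return {k: dict(v) for k, v in buckets.items()}


nested_buckets = nested_terms(docs, "attributes")
for attr_id in sorted(nested_buckets):
    print(attr_id, nested_buckets[attr_id])
```

Because each object is evaluated on its own, a value can only land in the bucket of the id it was actually paired with.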

To use nested mapping and aggregation, change the mapping to look like this:

{
    "attributes": {
        "type": "nested", // notice that the nested property is at the root level of the field
        "properties": {
            "id": {
                "type": "keyword"
            },
            "value": {
                "type": "keyword"
            }
        }
    }
}

And then update the aggregation to look like the following:

GET /my-index/_search
{
    "aggs": {
        "nested_bucket": {
            "nested": {
                "path": "attributes"
            },
            "aggs": {
                "attribute_id_bucket": {
                    "terms": {
                        "field": "attributes.id",
                        "size": 10
                    },
                    "aggs": {
                        "attribute_value_bucket": {
                            "terms": {
                                "field": "attributes.value",
                                "size": 10
                            }
                        }
                    }
                }
            }
        }
    }
}

Using this nested aggregation results in the correct buckets.

{
    "aggregations": {
        "nested_bucket": {
            "doc_count": 10,
            "attribute_id_bucket": {
                "buckets": [
                    {
                        "key": "attribute_1",
                        "doc_count": 5,
                        "attribute_value_bucket": {
                            "buckets": [
                                {
                                    "key": "attribute 1",
                                    "doc_count": 5
                                }
                            ]
                        }
                    },
                    {
                        "key": "attribute_2",
                        "doc_count": 3,
                        "attribute_value_bucket": {
                            "buckets": [
                                {
                                    "key": "attribute 2",
                                    "doc_count": 3
                                }
                            ]
                        }
                    },
                    {
                        "key": "attribute_3",
                        "doc_count": 2,
                        "attribute_value_bucket": {
                            "buckets": [
                                {
                                    "key": "attribute 3",
                                    "doc_count": 2
                                }
                            ]
                        }
                    }
                ]
            }
        }
    }
}
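
To turn a response like this into navigation entries, we can walk the buckets and emit one row per id and value pair. The Python sketch below assumes the aggregation names used in this article and a response already parsed into a dict; the helper name `menu_entries` is ours for illustration.

```python
def menu_entries(response: dict) -> list:
    """Flatten the nested aggregation response into (id, value, count) rows."""
    nested = response["aggregations"]["nested_bucket"]
    rows = []
    for id_bucket in nested["attribute_id_bucket"]["buckets"]:
        for value_bucket in id_bucket["attribute_value_bucket"]["buckets"]:
            rows.append(
                (id_bucket["key"], value_bucket["key"], value_bucket["doc_count"])
            )
    return rows


# The response shown above, written compactly.
response = {"aggregations": {"nested_bucket": {
    "doc_count": 10,
    "attribute_id_bucket": {"buckets": [
        {"key": "attribute_1", "doc_count": 5,
         "attribute_value_bucket": {"buckets": [{"key": "attribute 1", "doc_count": 5}]}},
        {"key": "attribute_2", "doc_count": 3,
         "attribute_value_bucket": {"buckets": [{"key": "attribute 2", "doc_count": 3}]}},
        {"key": "attribute_3", "doc_count": 2,
         "attribute_value_bucket": {"buckets": [{"key": "attribute 3", "doc_count": 2}]}},
    ]},
}}}

rows = menu_entries(response)
for row in rows:
    print(row)
```

One thing to keep in mind when displaying these numbers: inside a nested aggregation, doc_count counts nested objects rather than parent product documents. If per-product counts are needed, the reverse_nested aggregation is the usual way to join back to the parent documents.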

But leveraging this new field type was not as simple as we thought…

Index Sorting and “nested” Aggregations

We determined that nested mappings and aggregations give us the correct bucketing. It should have been a quick and easy fix: just create a new index with the updated mapping and the same settings. But then this error occurred:

{
    "error": {
        "root_cause": [
            {
                "type": "illegal_argument_exception",
                "reason": "cannot have nested fields when index sort is activated"
            }
        ],
        "type": "illegal_argument_exception",
        "reason": "cannot have nested fields when index sort is activated"
    },
    "status": 400
}

On our current index, we had enabled index sorting, which is not compatible with nested properties. Here’s the OpenSearch explanation of why both cannot be on at the same time:

Nested fields are not compatible with index sorting because they rely on the assumption that nested documents are stored in contiguous doc ids, which can be broken by index sorting. An error will be thrown if index sorting is activated on an index that contains nested fields.

Well, this is a problem, especially since we’ve had this setting on for performance reasons, and turning it off could cause a regression. The good news is we always pass sorting parameters to our OpenSearch queries, meaning we could turn this off without any customer impact.

It is important to ensure index sorting is turned off before using nested mappings. If sorting is required, this can be handled in multiple ways. For example, if you have a GraphQL server that accepts sorting parameters to query your OpenSearch index, you can enforce sorting at the GraphQL level. OpenSearch is capable of performing sorting on nested fields, just not index sorting. But we will save that for another discussion.
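
Sorting at query time simply means that every search request carries its own sort clause. A minimal Python sketch of building such a request body follows; the field name `price` and the helper name `search_body` are hypothetical and only serve the illustration.

```python
import json


def search_body(query: dict, sort_field: str, order: str = "asc", size: int = 10) -> dict:
    """Build a search request body that sorts at query time, so the
    result order does not depend on any index-level sort setting."""
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    return {
        "size": size,
        "query": query,
        "sort": [{sort_field: {"order": order}}],
    }


# Hypothetical usage: newest-first or price-sorted listings would each
# pass their own sort field here.
body = search_body({"match_all": {}}, sort_field="price", order="desc")
print(json.dumps(body, indent=2))
```

A layer in front of OpenSearch, such as the GraphQL server mentioned above, could call a helper like this so that no request reaches the index without an explicit sort.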

Creating the New Index

We now have a clear plan of what we need to do to enable the new attributes assigned by the ML model:

1. Create a new index

Index sorting SHOULD NOT be enabled
Add the nested property to the new field mapping

PUT /my-new-index
{
 "settings": {
  // do not set the 'sort.field' and 'sort.order' for the index
 },
 "mappings": {
  // add `"type": "nested"` to your new field
 }
}
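
If the new index body is derived from the old one in code, the two changes above can be applied mechanically. The Python sketch below assumes the settings are grouped under an `index` key and that the mapping is passed as a dict of field properties; the helper name `new_index_body` and the sample settings are hypothetical.

```python
import copy


def new_index_body(settings: dict, properties: dict, nested_field: str) -> dict:
    """Copy index settings without index sorting and mark one field as nested."""
    settings = copy.deepcopy(settings)
    properties = copy.deepcopy(properties)

    index_settings = settings.get("index", {})
    # Nested form: {"index": {"sort": {"field": ..., "order": ...}}}
    index_settings.pop("sort", None)
    # Dotted form: {"index": {"sort.field": ..., "sort.order": ...}}
    for key in [k for k in index_settings if k.startswith("sort.")]:
        del index_settings[key]

    properties[nested_field]["type"] = "nested"
    return {"settings": settings, "mappings": {"properties": properties}}


# Hypothetical old index definition with index sorting enabled.
old_settings = {"index": {"number_of_shards": 1, "sort.field": "id", "sort.order": "asc"}}
old_properties = {
    "id": {"type": "keyword"},
    "attributes": {"properties": {"id": {"type": "keyword"},
                                  "value": {"type": "keyword"}}},
}

body = new_index_body(old_settings, old_properties, "attributes")
print(body["settings"])
print(body["mappings"]["properties"]["attributes"]["type"])
```

The helper works on copies, so the old definition stays untouched and can still be used to recreate the original index if a rollback is needed.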

2. Transfer documents from the old index to the new index

This can be done using the reindex API, for example:

// `wait_for_completion=false` will generate a task ID so the re-index runs asynchronously and can be canceled on demand
POST _reindex?wait_for_completion=false
{
    "conflicts": "proceed", // so the task does not fail due to doc conflicts
    "source": {
        "index": "my-index"
    },
    "dest": {
        "index": "my-new-index",
        "op_type": "create" // the index is empty, so it should only create docs
    }
}
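
Since the reindex runs asynchronously, it helps to track how far along it is. The Python sketch below builds the same request body and estimates progress from a tasks API response (`GET _tasks/<task_id>`), assuming the status object exposes `total`, `created`, `updated`, and `deleted` counters. The helper names and the sample task response are hypothetical.

```python
def reindex_body(source: str, dest: str) -> dict:
    """Build the reindex request body used above."""
    return {
        "conflicts": "proceed",
        "source": {"index": source},
        "dest": {"index": dest, "op_type": "create"},
    }


def reindex_progress(task_response: dict) -> float:
    """Estimate the fraction of documents processed so far by comparing
    created + updated + deleted against the task's total."""
    status = task_response["task"]["status"]
    total = status.get("total", 0)
    if total == 0:
        return 1.0 if task_response.get("completed") else 0.0
    done = (status.get("created", 0)
            + status.get("updated", 0)
            + status.get("deleted", 0))
    return min(done / total, 1.0)


request = reindex_body("my-index", "my-new-index")

# Hypothetical task response, trimmed to the fields the sketch reads.
sample_task = {
    "completed": False,
    "task": {"status": {"total": 1000, "created": 250, "updated": 0, "deleted": 0}},
}
progress = reindex_progress(sample_task)
print(f"{progress:.0%}")
```

This is only an estimate. With `"conflicts": "proceed"`, documents skipped because of version conflicts are not counted as created, so it is worth comparing document counts between the two indexes once the task completes.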

3. Point consumers to read from and write to the new index

You can point consumers to your new index with the alias API, for example:

POST _aliases
{
    "actions": [
        {
            "remove": {
                "index": "my-index",
                "alias": "my-alias"
            }
        },
        {
            "add": {
                "index": "my-new-index",
                "alias": "my-alias"
            }
        }
    ]
}
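
Sending both actions in a single request matters: the remove and the add are applied together, so consumers never see the alias pointing at nothing. A small Python sketch of a helper that builds this request body is shown below; the name `alias_swap_actions` is ours for illustration.

```python
def alias_swap_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Build one _aliases request body that moves `alias` from the old
    index to the new index."""
    if old_index == new_index:
        raise ValueError("old and new index must differ")
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


swap = alias_swap_actions("my-alias", "my-index", "my-new-index")
print(swap)
```

Keeping the old index around for a while after the swap makes a rollback as simple as calling the helper again with the index names reversed.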

And now we have our documents correctly bucketed 🙌.

What’s Next?

In this article, we discussed the key differences between keyword and nested field mappings, along with terms vs. nested aggregations. Although we have outlined how to transition from one to the other and why it mattered for this particular use case, you should always run benchmark tests and analyze the impact on your consumers before making these changes. This is mainly because most of these configurations on an OpenSearch index can only be set at index creation.

For example, aggregation queries can be much more memory- and CPU-intensive than plain search queries, so performance should be thoroughly tested before adopting nested mappings and aggregations. In many cases, you may be better off simplifying the data structure of your OpenSearch documents instead, eliminating the need for complex mappings.

We have also seen that index sorting is not compatible with the nested type. However, index sorting can greatly improve search performance, especially when the index sort matches the sort used at search time. If sorting and search performance are critical metrics, then using nested types may no longer be an option.

There are plenty of other key performance indicators (KPIs) to consider when defining your index settings and mappings, but this topic deserves its own discussion.

Editorial reviews by Catherine Heim, Luba Mikhnovsky & Mario Bittencourt.
