Elasticsearch — handling mappings for dynamic document structures

Dan King
John Lewis Partnership Software Engineering
9 min read · Apr 1, 2022

I’m Dan, an Engineer within the Cloud Search Team at John Lewis & Partners. We utilise Elasticsearch to provide search functionality across the John Lewis website and mobile applications.

Dynamic structures

The John Lewis search experience is underpinned by an Elasticsearch index, surrounded by a range of self-built, Google Cloud Platform-hosted microservices which handle both populating the index and providing the customer-facing APIs used to search for products within it.

Simplified architecture diagram for John Lewis Search Service

In Elasticsearch terms, our dataset would be considered static: products can be added, removed and have their information updated, but overall the data changes relatively little (compared to, say, an index used to store logging data) and the size of the index is consistent at around 100k documents.

Despite the fact that our data changes infrequently, the set of fields which may appear in any one document versus another is dynamic. Sure, some core attributes will be consistent across all products, e.g. title, price, productId, but many attributes will be specific to the product. For example, a “maingemstone” attribute may be common for jewellery products, but is unlikely to feature for a range of towels. Moreover, we do not have an upfront view of what all of the possible attributes for a product could be; if a new type of product is added to our assortment, the product data team may have to define a new set of attributes to capture its information. Considering the large assortment of products we sell, you quickly end up with a huge list of possible fields a document could contain.
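To make this concrete, here are two hypothetical product documents (fields and values invented for illustration) which share the core attributes but carry entirely different attribute sets:

{
  "title": "Sapphire Pendant Necklace",
  "price": 150.00,
  "productId": "11112222",
  "productAttributes": {
    "maingemstone": "Sapphire",
    "chainlength": "42cm"
  }
}

{
  "title": "Egyptian Cotton Bath Towel",
  "price": 18.00,
  "productId": "33334444",
  "productAttributes": {
    "material": "Egyptian cotton",
    "towelweight": "700gsm"
  }
}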

All of this leads to complications when it comes to defining our index mappings. A well-defined set of mappings is imperative for optimising indexing and search performance: it allows you to store each data field in the most efficient way for the functions you wish to perform on it. For smaller, well-defined document structures, a good mapping is simple to construct. But for larger, more dynamic document structures, the situation becomes complicated. How can you construct a mapping without knowing all of the possible fields upfront?

Flattened data type

One potential solution is the flattened data type, intended for objects with an unknown set of subfields underneath. Unlike normal object fields, where subfields are mapped and indexed separately, using the flattened type means that the entire object (i.e. all of its subfields) is mapped as a single field. This allows you to define the mapping of an entire object in one single mapping entry:

{
  "dynamic": false,
  "properties": {
    "title": {
      "type": "text"
    },
    "productCode": {
      "type": "keyword"
    },
    "productAttributes": {
      "type": "flattened"
    }
  }
}

See above how the entire mapping for ‘productAttributes’ and all of its subfields is defined in a single mapping entry.

The benefit of this approach is that a field with an unknown (and potentially very large) set of subfields is indexed as a single, memory-efficient field. The trade-off is that, because all subfields are treated as keywords under the hood, search and aggregation functionality on flattened fields is more limited than if each subfield was individually mapped.
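For example, an individual key within a flattened field can still be targeted with a term query using dot notation, but it behaves as an exact keyword match; there is no per-subfield analysis and no numeric or date semantics (query sketch, values illustrative):

{
  "query": {
    "term": {
      "productAttributes.maingemstone": "Sapphire"
    }
  }
}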

Also, by defining a field as flattened, you completely lose the ability to define more specific mappings for any of its subfields. This could be a problem if the functionality available on flattened fields is sufficient for the majority of subfields, but certain subfields require a more specific mapping; Elasticsearch will not allow you to provide a specific mapping for a subfield once the parent field has been defined as flattened. For that, you need an approach which gives you more flexibility.

Dynamic mapping

Enter dynamic mapping. If switched on, the dynamic mapping feature of Elasticsearch allows you to load documents into your index without defining mappings for all fields upfront. When it encounters a field you have not provided an explicit mapping for, it generates a mapping entry for you based upon the data within it. The mapping is unlikely to be correctly optimised for the functions you wish to perform on the field, but the field will be indexed and therefore searchable, allowing you to quickly dive into your data. Dynamic mapping can be enabled or disabled at the top level, i.e. for everything within the document, but can also be overridden on a per-field basis, e.g. disabling dynamic mapping in general while still allowing it for specific fields within your document.

{
  "dynamic": false,
  "properties": {
    "title": {
      "type": "text"
    },
    "productCode": {
      "type": "keyword"
    },
    "productAttributes": {
      "type": "object",
      "dynamic": true
    }
  }
}

See above how dynamic mapping is switched off at the top level, but overridden to allow dynamic mapping to occur for subfields of the ‘productAttributes’ object.

Dynamic templates take dynamic mapping one step further, allowing you to pattern-match on specific fields within your document (e.g. match all fields with the path “productAttributes.*”) and apply an explicit mapping to them. This lets you define the optimum mapping for these fields based upon how you intend to use them, without having to write an explicit mapping for every single field which matches the pattern (i.e. if you had 1,000 fields matching the pattern, you would write just one dynamic template instead of 1,000 explicit mappings).

Given our varying document structure, dynamic mappings and dynamic templates seem like sensible options for us; we know we want to index the product attributes, we just don’t have a complete list of what all the attributes could be. We therefore cannot (and, given the number, would not want to) manually define explicit mappings for them all upfront.

But you have to be careful. Turning dynamic mapping on at a global level is dangerous; you end up with lots of fields being indexed which you may never have any requirement to search or aggregate on. This can lead to a mapping explosion, both increasing your memory footprint and significantly hindering your indexing performance, as there are more field structures to be built and updated. Furthermore, it’s unlikely that a dynamic mapping will ever be correctly optimised for the functions you wish to perform on the field, which can also mean extra structures are created which you do not require (dynamically mapped strings, for example, are indexed as both text and a keyword sub-field by default).
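It is worth knowing that Elasticsearch provides a guard rail here: the index.mapping.total_fields.limit setting (1,000 fields by default) causes indexing to fail fast once a mapping explosion reaches the cap, rather than letting it silently degrade the cluster. A sketch of deliberately raising the cap on a hypothetical index:

PUT /products
{
  "settings": {
    "index.mapping.total_fields.limit": 2000
  }
}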

Solution 1 — Dynamic mapping

Perhaps then a combination of the two is the ideal solution — turn dynamic mapping off at the global level, but switch it on for the fields within the document which you know will have a constantly changing set of subfields. Even better, use a dynamic template to provide a more specific mapping for each of these unknown subfields, to ensure their mappings are optimised for your use case.

{
  "dynamic": false,
  "dynamic_templates": [
    {
      "keywords_for_product_attributes": {
        "match_mapping_type": "string",
        "path_match": "productAttributes.*",
        "mapping": {
          "type": "keyword"
        }
      }
    }
  ],
  "properties": {
    "title": {
      "type": "text"
    },
    "productCode": {
      "type": "keyword"
    },
    "productAttributes": {
      "type": "object",
      "dynamic": true
    }
  }
}

See above how dynamic mapping is switched on for subfields of the “productAttributes” object, with a dynamic template matching the “productAttributes.*” path.

Initially, this is the approach we implemented. And although in principle it does work, it isn’t without limitations.

As a team which invests significant time in improving product ‘findability’ for its customers, we frequently experiment with different analyzers and different document structures. By and large, this experimentation requires changes to the mappings and therefore the creation of new indices, which we populate using the Reindex API. With this solution in place, we noticed that reindexing performance was poor. When attempting to speed things up by increasing the reindexing batch size, we actually started encountering failures. Upon investigation through the Task Management API, we found the offending issue to be:

{
  "failures": [
    {
      "index": "products-xxxx-xxxxx",
      "type": "_doc",
      "id": "4237123",
      "cause": {
        "type": "mapper_exception",
        "reason": "timed out while waiting for a dynamic mapping update"
      },
      "status": 500
    }
  ]
}

Essentially, Elasticsearch was unable to cope with the amount of dynamic mapping required of it, meaning timeout limits were hit and the reindexing failed (we have ~1500 different “productAttributes.X” fields across our assortment). So this left us in a tricky situation: dynamic mapping enabled us to index documents without explicitly defining all of the mappings upfront, but it could not be made particularly performant. Perhaps this is obvious in hindsight, but at the time it was not an issue we were expecting to encounter.
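For reference, the batch size in question is the size parameter on the source of a Reindex API request, i.e. the number of documents pulled per batch (1,000 by default); a sketch with illustrative index names:

POST _reindex
{
  "source": {
    "index": "products-v1",
    "size": 100
  },
  "dest": {
    "index": "products-v2"
  }
}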

So with solution one being insufficient, where to next? We needed to improve our indexing performance, but were limited to small batch sizes due to our reliance on our dynamic mapping template. This got us thinking — is there a way we can generate a more complete list of explicit mappings upfront and therefore only rely on our dynamic template for when new attributes are created?

As it turns out, yes.

Solution 2 — Automating explicit mappings generation

An advantage we had on our side was the existence of another field within our index: the “dimensions” field. We augment each document with this field and use it to list all of the specific attributes associated with that product, i.e. it contains a list of ALL of the “productAttributes.X” fields in the data for that product. We utilise this elsewhere in our service for bucket aggregations to determine facets for search results.
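Extending the earlier jewellery example (values still hypothetical), a document carrying this field might look like:

{
  "title": "Sapphire Pendant Necklace",
  "productCode": "11112222",
  "dimensions": ["maingemstone", "chainlength"],
  "productAttributes": {
    "maingemstone": "Sapphire",
    "chainlength": "42cm"
  }
}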

At any given time then, performing a Terms Aggregation on the “dimensions” field across our entire index gives a complete list of all of the unique product attributes currently within the dataset. Utilising some basic JSON libraries, the result of the terms aggregation can be parsed and used as a basis for generating explicit mappings for all of the possible product attributes. Combine those with a JSON file containing the base mappings for the other fields within the documents we wish to index, and we have a complete set of mappings for our documents.
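The aggregation itself is straightforward; a sketch of the request (index name and bucket size illustrative, the size simply needs to exceed the number of distinct attributes):

GET /products/_search
{
  "size": 0,
  "aggs": {
    "all_product_attributes": {
      "terms": {
        "field": "dimensions",
        "size": 5000
      }
    }
  }
}

Each bucket key in the response is then an attribute name for which an explicit mapping can be generated.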

Flow for combining base mappings JSON with generated mappings JSON for product attributes derived from the terms aggregation. Achieved using a Kotlin script as part of a Gitlab CI pipeline.

When combining the base mappings JSON with the generated JSON, we check for duplication of “productAttributes.X” keys. If a duplicate is encountered, the base mappings JSON is used as the source of truth. This is important, as it allows us to define different mappings for specific product attribute fields where required, without having to change the mapping we generate for attributes in general.
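For instance (attribute names hypothetical), if the base mappings declare “productAttributes.careinstructions” as text for full-text search while the generated mappings list every attribute as a plain keyword, the merged output keeps the base entry for that field and the generated entries for the rest:

{
  "properties": {
    "productAttributes": {
      "properties": {
        "maingemstone": {
          "type": "keyword"
        },
        "careinstructions": {
          "type": "text"
        }
      }
    }
  }
}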

Overall, the result is a near-complete set of mappings and minimal reliance on our dynamic templates. This puts less pressure on Elasticsearch during a reindex and allows us to run at greater batch sizes, meaning our indexing performance is improved. We are even able to save the generated mappings JSON as an artefact of our Gitlab pipeline, giving us better debugging capabilities and a “last known good” version of our mappings, which would be beneficial in the event of our Elasticsearch cluster falling over.

The solution is not bulletproof, as we do still require the dynamic mapping template for when new product attributes are created. However, this only comes into play as and when small numbers of new products are added to our assortment and require indexing, which is highly unlikely to put strain on Elasticsearch.

Conclusion

Handling unknown mappings is challenging. On the face of it, dynamic mapping capabilities solve many of the issues, but they come with pitfalls when dealing with large numbers of dynamic fields. If indexing performance is important to you, having a well-defined mapping upfront is key and cannot be avoided.

The solution described here works for us, but it does add considerable complexity to our reindexing process. Longer term, we may consider changing our data structure to enable us to utilise the flattened data type. This should significantly simplify our mappings, but would no doubt bring about new complexities of its own.

At the John Lewis Partnership we value the creativity of our engineers to discover innovative solutions. We craft the future of two of Britain’s best loved brands (John Lewis & Waitrose).

We are currently recruiting across a range of software engineering specialisms. If you like what you have read and want to learn how to join us, take the first steps here.
