Index mapping in Elasticsearch

Eleonora Fontana
Published in Betacom
Dec 21, 2020

Introduction

Welcome to our third article about the Elasticsearch engine. If you are not familiar with it, we recommend taking a look at the previous articles, which are available on the Betacom page.

This article focuses on a crucial topic: index mapping.

Since Elasticsearch can work without a predefined schema, the data type we choose may seem irrelevant. However, that is not the case at all: data types directly affect search performance, for better or for worse. It is therefore crucial to choose the correct data type for every field of an index.

Data types

Mapping defines the structure of documents and how they are indexed and stored in Elasticsearch. It corresponds to a table schema in a relational database.

Elasticsearch supports several data types, which are listed in Field data types | Elasticsearch Reference [7.10]. The most common are objects (represented in JSON format), text, floats, integers and dates. Let’s see some details about them.

Before discussing some specifics, let’s create a new index in order to use it for today’s examples. The index will be named “recipes” and will contain some recipes and their ingredients.

PUT recipes

Objects are not natively supported by Apache Lucene, on which Elasticsearch is built: during indexing they are flattened using the dot notation. The nested type is a specialized version of the object data type which maintains the relationships between the values of each object when we index a list of objects; without it, the values of different objects get mixed together and AND conditions end up behaving like OR conditions. Nested objects are stored as hidden documents.
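As a minimal sketch, a nested mapping could look like this (the recipes_nested index and its fields are hypothetical, used only for illustration):

PUT recipes_nested
{
  "mappings": {
    "properties": {
      "ingredients": {
        "type": "nested",
        "properties": {
          "name": { "type": "text" },
          "quantity_grams": { "type": "integer" }
        }
      }
    }
  }
}

With this mapping, a query for an ingredient whose name is "eggs" and whose quantity_grams is 100 only matches documents in which a single ingredient object satisfies both conditions, instead of mixing values from different objects.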

The keyword type is a text specialization used for fields on which we need to search for exact values. Keyword fields are used to sort, aggregate and filter documents.

Type coercion
What happens if we try to insert the wrong data type into a field? Usually the new value is rejected, but there are some exceptions. For example, if a field expects numeric values, we can still insert a string containing a number: "8" is accepted, whereas "8m" is rejected. Coercion means that the string is inspected and, if it contains a valid number, the value is accepted and indexed. The value is indexed using the mapped numeric type (a float, for instance), but the _source object still contains it in the original form we wrote.
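A minimal sketch of this behaviour, assuming a hypothetical index named coercion_test with a float field:

PUT coercion_test
{
  "mappings": {
    "properties": {
      "price": { "type": "float" }
    }
  }
}

POST coercion_test/_doc
{ "price": "7.4" }

POST coercion_test/_doc
{ "price": "7.4m" }

The first document is accepted and indexed with the numeric value 7.4 (while _source keeps the string "7.4"); the second one is rejected with a mapping exception.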

Arrays
What if we want to store multiple values for a field? Elasticsearch does not have an array data type because any field may contain zero or more values by default: we can index an array of values without declaring it in the field’s mapping. Just remember that all values within an array must be of the same data type, or at least coercible into the data type used in the mapping.
We can also index an array of objects, using the "nested" type if we want to query the objects independently. If we don’t need to query the objects independently, we can just use the "object" data type, with which an array such as [1, [2, 3]] is stored as [1, 2, 3].
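For example, nothing needs to be declared in the mapping beforehand to index the following document into "recipes" (the values are made up; the string "20" in the numeric array is accepted only because it can be coerced into a number):

POST recipes/_doc
{
  "title": "Cacio e pepe",
  "ingredients": ["spaghetti", "pecorino cheese", "black pepper"],
  "preparation_time_minutes": [15, "20"]
}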

Dates
Date values can be specified in one of three ways:
• specially formatted strings,
• a long number (the number of milliseconds passed since the epoch),
• an integer (also known as a UNIX timestamp), representing the number of seconds passed since epoch,
where epoch refers to the 1st of January 1970.
Let’s take a look at how "date" fields behave by default, i.e. when no custom format is specified. Elasticsearch expects one of three formats: a date without time, a date with time, or a long representing the milliseconds since the epoch. If we supply a string value, the date needs to be in ISO 8601 format: [YYYY]-[MM]-[DD]T[hh]:[mm]:[ss] followed by either Z or a ±[hh]:[mm] UTC offset. Dates are then stored as a long number representing the number of milliseconds since the epoch.
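For instance, all of the following are valid values for a default "date" field (the long value corresponds to 2020-12-21T10:30:00Z expressed in milliseconds since the epoch):

POST recipes/_doc
{ "inserted_at": "2020-12-21" }

POST recipes/_doc
{ "inserted_at": "2020-12-21T10:30:00Z" }

POST recipes/_doc
{ "inserted_at": 1608546600000 }

Internally all three are stored as longs; the original value is still returned in _source.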

Alias
An alias mapping defines an alternate name for a field in the index. Check Alias field type | Elasticsearch Reference [7.10] for a full documentation about it.
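As a minimal sketch, an alias could be defined like this (the recipes_with_alias index and the prep_time alias name are hypothetical):

PUT recipes_with_alias
{
  "mappings": {
    "properties": {
      "preparation_time_minutes": { "type": "integer" },
      "prep_time": {
        "type": "alias",
        "path": "preparation_time_minutes"
      }
    }
  }
}

Searches and aggregations on prep_time are then resolved against the preparation_time_minutes field.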

Multi-field mapping
It is often useful to index the same field in different ways for different purposes. This is the purpose of multi-fields. For instance, a string field could be mapped as a text field for full-text search and as a keyword field for sorting or aggregations. In such a situation, it is good practice to name the sub-field "keyword". We can access it using the dot notation (fieldName.subFieldName). Note that sub-fields do not appear in the _source object.
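For example, once the "ingredients" field is mapped with a "keyword" sub-field (as in the explicit mapping shown later), we can aggregate on the exact values; the aggregation name top_ingredients below is arbitrary:

GET recipes/_search
{
  "size": 0,
  "aggs": {
    "top_ingredients": {
      "terms": { "field": "ingredients.keyword" }
    }
  }
}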

Let’s now index a document which contains all the data types described before:

POST recipes/_doc
{
  "title": "Spaghetti carbonara",
  "description": "The first thing you need to do is…",
  "ingredients": ["spaghetti", "pig cheek", "eggs", "pecorino cheese"],
  "preparation_time_minutes": 15,
  "servings": {
    "min": 4,
    "max": 4
  },
  "inserted_at": "2020-12-21"
}

There are two basic mapping approaches:

  • explicit mapping, where fields and their types are defined by the user, usually at index creation time;
  • dynamic mapping, where Elasticsearch infers the type automatically every time a new field is encountered.

Explicit mapping

We can add field mappings both when creating an index and afterwards. It is however good practice to do it at creation time. Let’s delete and recreate the "recipes" index, this time specifying the mapping:

DELETE recipes

PUT recipes
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "description": { "type": "text" },
      "ingredients": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      },
      "preparation_time_minutes": { "type": "integer" },
      "inserted_at": { "type": "date" }
    }
  }
}

In Elasticsearch, objects are mapped implicitly by using the “properties” mapping parameter at each level of the hierarchy:

DELETE recipes

PUT recipes
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "description": { "type": "text" },
      "ingredients": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      },
      "preparation_time_minutes": { "type": "integer" },
      "servings": {
        "properties": {
          "min": { "type": "integer" },
          "max": { "type": "integer" }
        }
      },
      "inserted_at": { "type": "date" }
    }
  }
}

In the previous example, since the "servings" field doesn’t contain any nested objects, the mapping doesn’t look too bad; for more deeply nested structures, though, it can get messy. Instead of defining objects this way, we can use an alternative approach: the dot notation, where a dot is used as a separator between each level of the hierarchy:

DELETE recipes

PUT recipes
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "description": { "type": "text" },
      "ingredients": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      },
      "preparation_time_minutes": { "type": "integer" },
      "servings.min": { "type": "integer" },
      "servings.max": { "type": "integer" },
      "inserted_at": { "type": "date" }
    }
  }
}

We can check the index mapping at any time by executing the following commands:

GET recipes/_mapping
GET recipes/_mapping/field/title
GET recipes/_mapping/field/servings.max

If the index has already been created and we need to modify the mapping, we can execute the following command:

PUT indexName/_mapping
{
  "properties": {...}
}

If we add a new field after documents have already been indexed, the existing documents will not contain it. All fields are indeed optional, even when they are specified in the mapping.

The most common mapping parameters are listed below.

  • "format" is used to specify the format of a date field when the default one is not used.
  • As already seen, "properties" is the object in which the data type of each field is defined.
  • "coerce" can be set to true or false and enables or disables type coercion. It can be set at field or index level.
  • "doc_values" is an Apache Lucene data structure used to optimize sorting, aggregations and access to field values via scripts. Note that it is an additional data structure and does not replace the others: Elasticsearch picks the best one for the operation it has to perform. We can disable it in the mapping to save disk space and speed up indexing, but once documents have been indexed the setting cannot be changed without reindexing.
  • "norms" refers to the normalization factors used at query time to compute the relevance score of a document. Disabling them saves disk space, at the cost of losing those factors when computing scores.
  • "index" is used to disable indexing for a field so that it cannot be searched. Such fields can however still be used in aggregations.
  • By default, a field containing NULL is ignored. Setting the "null_value" parameter makes Elasticsearch replace explicit NULL values with the specified value, so that they can be indexed and searched. The replacement value must be consistent with the field data type.
  • "copy_to" copies the values of multiple fields into a single field. The values (not the tokens) are copied, and the target field does not appear in _source.
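A minimal sketch combining a few of these parameters (the mapping_params_example index and its fields are hypothetical):

PUT mapping_params_example
{
  "mappings": {
    "properties": {
      "created_at": { "type": "date", "format": "dd/MM/yyyy" },
      "author": { "type": "keyword", "null_value": "unknown" },
      "first_name": { "type": "text", "copy_to": "full_name" },
      "last_name": { "type": "text", "copy_to": "full_name" },
      "full_name": { "type": "text" },
      "internal_notes": { "type": "text", "index": false }
    }
  }
}

Here dates must be supplied as dd/MM/yyyy, documents with "author": null are indexed as if the author were "unknown", the two name fields are searchable together through full_name, and internal_notes is kept in _source but cannot be searched.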

How can we update an existing mapping?
Suppose, for example, that we want to turn a text field into a keyword field. We cannot do it. For an existing field it is only possible to add a few parameters, such as "ignore_above" (which makes keyword values longer than N characters not be indexed). Since it is not even possible to delete the mapping, the only way to proceed is to create a new index and re-index the documents into it. The steps to do so are as follows:

  1. create the new index,
  2. insert the mapping into it,
  3. index documents through a script or using the Reindex API.

Let’s see an example:

PUT newIndexName
{
  "mappings": {
    "properties": {...}
  }
}

POST _reindex
{
  "source": { "index": "oldIndexName" },
  "dest": { "index": "newIndexName" }
}

An index template is a way to tell Elasticsearch how to configure an index when it is created. It defines settings and/or mappings for indices whose names match a given pattern. We can define an index template with the following command:

PUT _template/templateName
{
  "index_patterns": [...],
  "settings": {...},
  "mappings": {...}
}

If during the index creation we specify other settings, they are merged with the ones from the template.
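For instance, a hypothetical template for recipe indices could look like this (the template name and index pattern are made up):

PUT _template/recipes_template
{
  "index_patterns": ["recipes-*"],
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "inserted_at": { "type": "date" }
    }
  }
}

Any new index whose name matches recipes-* (for example recipes-2021) automatically receives these settings and mappings.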

Dynamic Mapping

As already stated, the automatic detection and addition of new fields is called dynamic mapping. Elasticsearch uses a few simple rules to determine the data type of a new field:

  • true or false becomes boolean,
  • a floating point number becomes float,
  • an integer becomes long,
  • an object becomes object,
  • for an array, the type depends on the first non-null value it contains,
  • a string becomes date if it passes date detection, otherwise text with a keyword sub-field.

Dynamic mapping is enabled by default, but can be turned off via the following command:

PUT indexName
{
  "mappings": { "dynamic": false }
}

If dynamic mapping is disabled, it is still possible to index documents containing fields for which no mapping is defined, but search operations on such fields will not work: the new fields are simply ignored rather than rejected, meaning they are not indexed but are still part of the _source object. To have new fields rejected instead, we need to set the "dynamic" parameter to "strict".
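A minimal sketch of the difference, using a hypothetical index named strict_test:

PUT strict_test
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "title": { "type": "text" }
    }
  }
}

POST strict_test/_doc
{
  "title": "ok",
  "unknown_field": 42
}

With "dynamic": "strict" the second request is rejected with a mapping error; with "dynamic": false it would be accepted, but unknown_field would not be searchable.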

Dynamic templates allow us to define custom mappings that can be applied to dynamically added fields which satisfy one or more given conditions. They are used when dynamic mapping is enabled and there is a new field for which no mapping is specified. Let’s see an example:

PUT indexName
{
  "mappings": {
    "dynamic_templates": [
      {
        "integers": {
          "match_mapping_type": "long",
          "mapping": { "type": "integer" }
        }
      }
    ]
  }
}

The dynamic templates conditions can be the following:

  • “match_mapping_type” is the data type detected by the JSON parser,
  • “match” and “unmatch” refer to the field name,
  • “match_pattern” checks if the field name matches a given regex,
  • “path_match” and “path_unmatch” look at the full path of the field, using dot notation.

Dynamic templates are evaluated in order and the first matching template is applied, so they should be listed from highest to lowest priority.

The "{dynamic_type}" placeholder is replaced in the mapping with the detected dynamic type. It is usually combined with "match_mapping_type":"*" when we need to add new settings.

Conclusion

Here are some recommendations to conclude:

  • use only the explicit mapping and set the dynamic one to “strict”,
  • do not automatically map every string field as both text and keyword; choose based on how the field will be queried,
  • disable type coercion,
  • use the correct numeric types (long and double use much more space than integer and float),
  • set the mapping parameters according to which operations will be performed on each field.
