Elasticsearch Tips & Tricks, Part 2: Risks of Using Dynamic Mappings

Yuriy Bash
4 min read · Aug 24, 2018


In my previous post, I discussed speeding up Elasticsearch migrations and reindexing operations. In this one, I’ll discuss one of the most important parts of using ES effectively: mappings.

Mappings define the structure of, and fields in, documents in a given index. All fields have a mapping, and this mapping determines both how the field gets stored, and how it is indexed.

Elasticsearch is able to infer the type of field it should use based on the data, but this carries several risks one should be aware of. This post describes some of those risks.

tl;dr — don’t use dynamic mappings in prod; see the bottom of this post for how to turn them off.

Risk #1: Incorrect inference of data type

By default, when you index a document with a new field whose type is not explicitly set, ES detects the type of field and adds that field to its mapping table. For example, if you add the following document:

```
{
  "user": "yuriy",
  "age": 28,
  "location": "brooklyn"
}
```

it infers the field type and creates a mapping that will look something like this:

```
{
  "people": {
    "mappings": {
      "_doc": {
        "user": {
          "full_name": "user",
          "mapping": {
            "user": {
              "type": "text"
            }
          }
        },
        ...
      }
    }
  }
}
```

and something similar for the other fields. This is called dynamic mapping and can be a useful feature but poses two problems: incorrect field types, and mapping explosions.
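
You can inspect what ES inferred at any time; the output above is, in fact, the shape returned by the field mapping API:

```
# full mapping for the index
GET /people/_mapping

# mapping for a single field
GET /people/_mapping/field/user
```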

In any production environment, using dynamic mappings is probably too risky: there is a significant chance ES will infer the type incorrectly, and you’ll face problems down the line. For example, let’s say you add a document with a new field, conversion_factor, that looks like this:

```
{
  "conversion_factor": 16.1
}
```

A half_float offers sufficient precision for this value, and ES could conceivably save it as that, though in practice it infers floating-point numbers as float. You end up with a mapping that looks like this:

```
{
  "index_name": {
    "mappings": {
      "_doc": {
        "conversion_factor": {
          "full_name": "conversion_factor",
          "mapping": {
            "conversion_factor": {
              "type": "float"
            }
          }
        },
        ...
      }
    }
  }
}
```

If you try to save a subsequent document with a conversion_factor value of 3.5e+38, you will get an error (hopefully), because it does not fit in a 32-bit float, whose maximum is roughly 3.4e+38. Or, worse, the value is saved but loses precision, and you don’t even notice the problem.
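
The safest fix is to declare the type yourself before the first document arrives. Here’s a minimal sketch — index_name is hypothetical, and double is just one reasonable choice; pick whichever numeric type actually matches your data’s range and precision:

```
PUT /index_name
{
  "mappings": {
    "_doc": {
      "properties": {
        "conversion_factor": { "type": "double" }
      }
    }
  }
}
```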

Risk #2: Mapping explosion

Let’s say the data you are indexing does not always follow the same schema. For example, let’s say this is document #1:

```
{
  "user": "john",
  "age": 35,
  "animal": "dog"
}
```

and this is document #2:

```
{
  "user": "yuriy",
  "age": 28,
  "color": "blue"
}
```

This is dangerous. Note that the first two fields are the same in both documents, but the third is unique to each, so ES creates a separate mapping entry for each one. If more documents keep arriving, each with its own custom field, every one adds a new entry to the mapping: its size explodes, ES may start OOM’ing, and all sorts of other problems begin happening.

What you probably want to do instead is add documents in the following format:

```
{
  "user": "yuriy",
  "age": 28,
  "custom_field": "color",
  "custom_field_value": "blue"
}
```

and

```
{
  "user": "yuriy",
  "age": 28,
  "custom_field": "animal",
  "custom_field_value": "dog"
}
```

This creates only two new fields in the mapping, something along the lines of:

```
{
  "index": {
    "mappings": {
      "_doc": {
        "custom_field": {
          "full_name": "custom_field",
          "mapping": {
            "custom_field": {
              "type": "text"
            }
          }
        },
        "custom_field_value": {
          "full_name": "custom_field_value",
          "mapping": {
            "custom_field_value": {
              "type": "text"
            }
          }
        }
      }
    }
  }
}
```

By default, index.mapping.total_fields.limit caps the total number of fields in an index at 1000, and a companion setting, index.mapping.depth.limit, still allows objects nested up to 20 levels deep, so this is something to watch out for.
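
Both limits are dynamic index settings, so you can adjust them if you genuinely need more fields — though raising the cap usually just postpones the explosion. A sketch, against a hypothetical index:

```
PUT /index_name/_settings
{
  "index.mapping.total_fields.limit": 2000
}
```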

Risk #3: Inability to use ES aggregations

This risk is a bit more subtle, but is an extension of Risk #2.

One of my personal favorite features in Elasticsearch is Aggregations. Aggregations let you bucket your data by any field (including subfields), and you can run several of them nested or in sequence. We used them extensively at Percolate for collecting metrics on social data.

An aggregation query typically looks something like:

```
GET /_search
{
  "aggs" : {
    "age" : {
      "terms" : { "field" : "age" }
    }
  }
}
```

and you get a result that looks something like:

```
{
  ...
  "aggregations" : {
    "age" : {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets" : [
        {
          "key" : 28,
          "doc_count" : 6
        },
        {
          "key" : 29,
          "doc_count" : 3
        },
        {
          "key" : 30,
          "doc_count" : 25
        }
      ]
    }
  }
}
```

This is a really useful feature for analyzing your data.

The problem is, if you have many fields (as described in Risk #2), it becomes difficult to run aggregations because you have no a priori knowledge of what the fields are — i.e. in "terms" : { "field" : "age" }, you don’t know what to put in place of age.

Whereas, if you use the mapping technique described above, you can run an aggregation with "terms" : { "field" : "custom_field" } and receive a response in this form:

```
{
  ...
  "aggregations" : {
    "custom_field" : {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets" : [
        {
          "key" : "color",
          "doc_count" : 6
        },
        {
          "key" : "animal",
          "doc_count" : 3
        }
      ]
    }
  }
}
```
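
One caveat: a terms aggregation like this needs the field mapped as keyword (or fielddata enabled on a text field, which is memory-hungry); the text mapping shown earlier would actually reject the aggregation. If you’re defining the mapping explicitly anyway, something along these lines is what you’d want (a sketch, using the 6.x-style _doc type from the earlier examples):

```
PUT /index/_mapping/_doc
{
  "properties": {
    "custom_field": { "type": "keyword" },
    "custom_field_value": { "type": "keyword" }
  }
}
```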

Turning off dynamic mapping

Existing field mappings can’t be updated once they are created — so avoid the problem altogether by turning dynamic mapping off.

Dynamic mapping can be turned off at both the document and object level by setting dynamic to false or strict. false merely ignores new fields (they stay in _source but are never indexed or searchable), which can amount to silent data loss, so set it to strict instead, which rejects documents containing unmapped fields.
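
On an existing index, that’s a one-line mapping update (dynamic is one of the few mapping parameters that can be changed after the fact); a sketch, again with a hypothetical index name:

```
PUT /index_name/_mapping/_doc
{
  "dynamic": "strict"
}
```

With this in place, indexing a document that contains an unmapped field fails with a strict_dynamic_mapping_exception instead of silently growing the mapping.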

Defining mappings explicitly

Instead, define your mappings explicitly, as shown below. There are a few subtleties here; for more detail, see the Elasticsearch documentation on mappings.
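
For the people index from the start of this post, an explicit mapping might look like this sketch. The types are illustrative — keyword vs. text, for instance, depends on whether you need exact matches and aggregations or full-text search on a field:

```
PUT /people
{
  "mappings": {
    "_doc": {
      "dynamic": "strict",
      "properties": {
        "user":     { "type": "keyword" },
        "age":      { "type": "integer" },
        "location": { "type": "text" }
      }
    }
  }
}
```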
