Elasticsearch 6.0: Removal of mapping types

Series: Elasticsearch 6.0 is coming
Disclaimer: Elasticsearch 6.0 has not been released yet, so use it carefully and certainly NOT in a production environment. This article is intended for testing and understanding the new improvements coming in ES 6.0.

What are Elasticsearch Types?

A Type in Elasticsearch is defined as "a class of similar documents". It's a logical aggregation we can use to identify groups of similar documents. The official documentation used to compare relational databases to Elasticsearch Indices:

  1. Indices are similar to Databases/Schemas
  2. Types are the relational-world equivalent of Tables

Types are more than this and a bit more complex, which is why we will discover at the end of this article that in the next major release of Elasticsearch (6.0!) Types will be removed… yes, that is the harsh reality :) So you may ask me: "Why are you explaining what Types are if they will be removed soon?".

This is a very good question, but you have to know and understand that for the majority of Elasticsearch users, TYPES are heavily used (and sometimes misused…). So I would like to remind all of us what Types are and how they are used, and then finally discover why they have been removed and how to work with Elasticsearch without TYPES.

Every Index can have multiple Types, for example "user" and "blogpost", and every Type can have its own fields. A field defines its own name and a datatype, which can be for example text, numeric (integer, long, short…), keyword, array and many others. The complete list of field datatypes is:

  • array
  • binary
  • range
  • boolean
  • date
  • geo-point
  • geo-shape
  • ip
  • keyword
  • nested
  • numeric
  • object
  • text
  • token_count
  • percolator
  • join

You can find a detailed explanation of the field datatypes here.

In Elasticsearch you can search across the whole Index, or specify one or multiple Types of the same Index.

For example, consider having an Index called articles which has two Types: user and blogpost.

GET articles/_search
{
  "query": {
    "match": {
      "user_name": "kimchy"
    }
  }
}

This query searches across the whole Index named articles, and returns all documents whose user_name field matches the value "kimchy".

GET articles/user,blogpost/_search
{
  "query": {
    "match": {
      "user_name": "kimchy"
    }
  }
}

The previous query searches on both Types (user and blogpost) of the Index named articles, and the result is identical to the previous query.

GET articles/blogpost/_search
{
  "query": {
    "match": {
      "user_name": "kimchy"
    }
  }
}

The previous query searches only in the Type blogpost, and returns only documents belonging to this Type whose user_name field matches "kimchy". Each document has a _type meta-field which contains the Type name; you can filter searches to one or more Types by specifying the Type name.
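The same Type restriction can also be expressed inside the query body through the _type meta-field just mentioned. A minimal sketch (index and field names follow the examples above), building the request body in Python:

```python
import json

# Equivalent to GET articles/blogpost/_search: instead of naming the type
# in the URL path, filter on the _type meta-field inside a bool query.
query = {
    "query": {
        "bool": {
            "must": [{"match": {"user_name": "kimchy"}}],
            "filter": [{"term": {"_type": "blogpost"}}],
        }
    }
}
body = json.dumps(query, indent=2)
print(body)
```

Sending this body to GET articles/_search returns the same documents as putting blogpost in the URL path.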

In a previous paragraph I defined the Type as a "logical aggregation we can use to identify clusters of similar documents". This is true, but fields defined in one Type are shared by all Types of the same Index: a field and its datatype are shared across the whole Index, so you cannot have a field with the same name but a different datatype in different Types.

user_name is the same field in both the user and blogpost Types.

The datatype of a field shared within an Index MUST be the same, since the field is just one and is always the same. If you try to set up mappings for two Types belonging to the same Index with different datatypes for the same field, you will get an error from Elasticsearch.

If, for example, you try to force a change in the mapping of the blogpost user_name field from text to integer, you will get this error:

{
  "root_cause": [{
    "type": "illegal_argument_exception",
    "reason": "mapper [user_name] cannot be changed from type [text] to [integer]"
  }],
  "type": "illegal_argument_exception",
  "reason": "mapper [user_name] cannot be changed from type [text] to [integer]"
}

The exception is raised inside Elasticsearch here: the mapper raises an error whenever you try to change a field's type.


Why have Types been removed?

The story began a long time ago, and it is strictly related to Apache Lucene, on top of which Elasticsearch is built.

Apache Lucene is the search engine at the core of Elasticsearch. It's completely open source, and it is the de facto standard for indexing documents and searching them as a full-featured search engine. With Lucene you can search for documents that match a word or a sentence; the difference between a full-text search engine and an RDBMS is obvious:

An RDBMS answers the question: which documents match exactly the values in the query?

A search engine (such as Lucene) answers the question: which documents are the most similar to the query (ordered by a relevance score)?

The job that Elasticsearch, and therefore Lucene, is asked to do is a little more complex. One of the algorithms used to rank documents is known as TF/IDF; there are many websites that explain how it works, and I suggest this brief introduction from Elastic, which explains how it is implemented.
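To give an intuition of the idea behind TF/IDF, here is a didactic sketch (not Lucene's actual implementation, which adds several normalizations and has defaulted to BM25 since Elasticsearch 5.0):

```python
import math

# A minimal TF/IDF sketch: score how relevant a term is to each document.
docs = [
    "elasticsearch is built on lucene",
    "lucene is a search engine",
    "relational databases use tables",
]
tokenized = [d.split() for d in docs]

def tf_idf(term, doc_tokens, all_docs):
    tf = doc_tokens.count(term) / len(doc_tokens)   # term frequency in this doc
    df = sum(1 for d in all_docs if term in d)      # how many docs contain the term
    idf = math.log(len(all_docs) / (1 + df)) + 1    # smoothed inverse document frequency
    return tf * idf

scores = [tf_idf("lucene", d, tokenized) for d in tokenized]
# "lucene" appears in the first two documents but not in the third,
# so the third document scores 0.
print(scores)
```

Terms that appear in few documents get a higher idf, so rare terms weigh more in the relevance score.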

OK "Federico", thank you for explaining a bit more of the core of Elasticsearch… but you are still not answering the question: why have Types been removed?

The problem is related to DATA. The kind of data in your Indices and in your Types can negatively impact the overall performance of Lucene, and so of your Elasticsearch cluster.

The problem is known as sparsity, and Apache Lucene suffers a lot from it. But what is sparsity? Consider having an Index:

Index: products
  Type: item
    field1: 'model_id'       - type: integer
    field2: 'name'           - type: text
    field3: 'country'        - type: text
    field4: 'country_name'   - type: text
    field5: 'color'          - type: array
    field6: 'date'           - type: date
  Type: item_price
    field1: 'model_id'       - type: integer
    field2: 'price'          - type: float
    field3: 'model_discount' - type: integer
    field4: 'currency'       - type: text

The two Types have model_id in common; all the other fields have no relationship. This is just an example, but imagine having hundreds of thousands of documents in both Types: inside Lucene you would have many empty fields.

Sparsity happens when your data is not dense: you have lots of documents, and many of them have empty or non-valued fields (this can happen even if you have only one Type).


Let's try to understand what sparsity is and go a bit deeper into the problem, with an example visualising the previous Index products with its two Types item and item_price.

Type: item
Type: item_price

Remember that the Type implementation is just a meta-data field, _type, holding the Type name; when you specify a Type name, Elasticsearch filters by that Type only. From the Lucene point of view it is all the same data structure: one Index.

Visualise Sparsity

One of the main issues with sparsity is disk space.
Let's take an example from Lucene issue #6863:

For both NUMERIC fields and ordinals of SORTED fields, we store data in a dense way. As a consequence, if you have only 1000 documents out of 1B that have a value, and 8 bits are required to store those 1000 numbers, we will not require 1KB of storage, but 1GB.
I suspect this mostly happens in abuse cases, but still it’s a pity that we explode storage requirements. We could try to detect sparsity and compress accordingly.
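The arithmetic in the quote is easy to verify. A back-of-the-envelope sketch of dense versus sparse storage:

```python
# A dense column stores a value slot for EVERY document in the index,
# even when only a handful of documents actually carry the field.
total_docs = 1_000_000_000   # 1B documents in the index
docs_with_value = 1_000      # only 1000 of them have a value
bits_per_value = 8           # 8 bits are enough for those values

dense_bytes = total_docs * bits_per_value // 8        # one slot per document
sparse_bytes = docs_with_value * bits_per_value // 8  # one slot per real value

print(dense_bytes)   # about 1 GB
print(sparse_bytes)  # about 1 KB
```

So the dense encoding pays for a billion slots to hold a thousand values, which is exactly the storage explosion the Lucene issue describes.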

The problem is related to how Lucene stores information. The Lucene team has worked a lot on this problem and mitigated it, but it is still a problem. They have also prepared a nightly benchmark to track how sparsity behaves as they make code improvements. Let's have a look here.

Sparse Lucene benchmarks

That's why, at more or less the same time, the Elasticsearch development team started working on this issue.

In that issue the development team started from the idea that removing Types would bring more pros than cons. In the coming major version (Elasticsearch 6.0) you can have only one Type per Index.


How can I use Types in ES 6.0?

In Elasticsearch 6.0, Indices can contain only one Type. Indices created with Elasticsearch 5.x with multiple Types will continue to work, but my suggestion is to migrate your Indices to a one-type-per-Index strategy. Types will be completely removed in Elasticsearch 7.0.

Follow the roadmap they have planned (you can read the complete article from Elastic; the link is in the references section):


Testing Elasticsearch 6.0

Let's make a full test: we create an Index with multiple Types in Elasticsearch 5.6, then create a snapshot and restore it in Elasticsearch 6.0. What will happen to our Indices created with 5.x in the new major version of Elasticsearch? No more words ;) but CURLs!

curl -XDELETE localhost:9200/products
curl -XPUT 'localhost:9200/products?pretty' -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "item": {
      "properties": {
        "model_id": { "type": "integer" },
        "name": { "type": "text" },
        "country": { "type": "integer" },
        "country_name": { "type": "text" },
        "color": { "type": "text" },
        "date": { "type": "date" }
      }
    },
    "item_price": {
      "properties": {
        "model_id": { "type": "integer" },
        "price": { "type": "float" },
        "model_discount": { "type": "float" },
        "currency": { "type": "text" }
      }
    }
  }
}
'

Following the previous example, we create an Index named products in Elasticsearch 5.6, with 2 Types: item and item_price. This is the resulting mapping:

{
  "products": {
    "aliases": {},
    "mappings": {
      "item": {
        "properties": {
          "color": { "type": "text" },
          "country": { "type": "integer" },
          "country_name": { "type": "text" },
          "date": { "type": "date" },
          "model_id": { "type": "integer" },
          "name": { "type": "text" }
        }
      },
      "item_price": {
        "properties": {
          "currency": { "type": "text" },
          "model_discount": { "type": "float" },
          "model_id": { "type": "integer" },
          "price": { "type": "float" }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1506334542346",
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "uuid": "VgmCm2CbQZyS0ZtDz4GpTA",
        "version": { "created": "5060199" },
        "provided_name": "products"
      }
    }
  }
}

Let’s populate the Index and the types with a bulk insert:

curl -XPOST 'localhost:9200/_bulk?pretty' -H 'Content-Type: application/json' -d'
{"index":{"_index":"products","_type":"item","_id":1}}
{"model_id" : 33, "name" : "model1", "country" : 1, "country_name" : "it_IT", "color": "red", "date" : "2017-09-21"}
{"index":{"_index":"products","_type":"item","_id":2}}
{"model_id" : 34, "name" : "model2", "country" : 1, "country_name" : "it_IT", "color": "red", "date" : "2017-09-21"}
{"index":{"_index":"products","_type":"item","_id":3}}
{"model_id" : 35, "name" : "model3", "country" : 2, "country_name" : "de_DE", "color": "yellow", "date" : "2017-09-20"}
{"index":{"_index":"products","_type":"item","_id":4}}
{"model_id" : 36, "name" : "model4", "country" : 1, "country_name" : "es_ES", "color": "green", "date" : "2017-09-19"}
{"index":{"_index":"products","_type":"item","_id":5}}
{"model_id" : 37, "name" : "model5", "country" : 1, "country_name" : "it_IT", "color": "yellow", "date" : "2017-09-21"}
'
curl -XPOST 'localhost:9200/_bulk?pretty' -H 'Content-Type: application/json' -d'
{"index":{"_index":"products","_type":"item_price","_id":1}}
{"model_id" : 33, "price" : "9000", "model_discount" : 10, "currency" : "euro"}
{"index":{"_index":"products","_type":"item_price","_id":2}}
{"model_id" : 33, "price" : "5500", "model_discount" : 0, "currency" : "euro"}
{"index":{"_index":"products","_type":"item_price","_id":3}}
{"model_id" : 33, "price" : "16000", "model_discount" : 0, "currency" : "us dollar"}
{"index":{"_index":"products","_type":"item_price","_id":4}}
{"model_id" : 33, "price" : "3500", "model_discount" : 4, "currency" : "english pound"}
{"index":{"_index":"products","_type":"item_price","_id":5}}
{"model_id" : 33, "price" : "8970", "model_discount" : 3, "currency" : "euro"}
'

Now this is the situation:

  • Index name : products
  • type : item with 5 documents
  • type : item_price with 5 documents

Create a snapshot of the products Index:

Before creating a snapshot of the Index, you should define path.repo inside elasticsearch.yml to tell Elasticsearch where to create the repository and the snapshots on the filesystem. In this example we will use the filesystem for simplicity; if you want to learn how to snapshot and restore from AWS S3, have a look here.
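For example, elasticsearch.yml could contain something like this (the directory is just an assumption for this test setup; use any path the Elasticsearch process can write to):

```yaml
# elasticsearch.yml: whitelist the directory used for snapshot repositories
path.repo: ["/mount/backups"]
```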

curl -XPUT localhost:9200/_snapshot/my_backup -H 'Content-Type: application/json' -d '{
  "type": "fs",
  "settings": {
    "location": "my_backup",
    "compress": true
  }
}'
curl -XPUT 'localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true&pretty' -H 'Content-Type: application/json' -d '{
  "indices": "products",
  "ignore_unavailable": true,
  "include_global_state": false
}
'

Now the idea is to start a new instance of Elasticsearch 6.0-beta2 and restore products from the snapshot created with Elasticsearch 5.6.1…

curl -XPUT localhost:9200/_snapshot/my_backup -H 'Content-Type: application/json' -d '{
  "type": "fs",
  "settings": {
    "location": "my_backup",
    "compress": true
  }
}'
curl -XPOST localhost:9200/_snapshot/my_backup/snapshot_1/_restore?pretty

Now you can use Elasticsearch 6.0-beta2 with a multi-type Index as before… but there are a lot of limitations, and you should be careful doing this… :)

Elasticsearch 6.x dropped the type-based parent-child relationship, since Types have been removed. The advice before moving to the new major version is to use Elasticsearch 5.6.1 and upgrade your Indices so they work properly in ES 6.0.

How to upgrade multi-type Indices to a single type

There are a couple of ways to do this:

  • Split your index in a single type Index
  • Use a custom field and filter by that field to discriminate a type

Split your Index into single-type Indices:

Reindex API — the Swiss army knife you have to use for updating your Index

Your Swiss army knife in this case is the Reindex API. Its endpoint allows you to split a multi-type Index into single-type Indices.

In our example we have the products Index with two Types, item and item_price.
The idea is to create two different Indices, item and itemprice, through the Reindex API.

yellow open products iP3eHQ7lS4CBrcTjROMCEA 5 1 10 0 32.6kb 32.6kb

For now we just have one Index, products, with two Types.

curl -XPOST 'localhost:9200/_reindex?pretty' -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "products",
    "type": "item"
  },
  "dest": {
    "index": "item"
  }
}
'

This way Reindex creates a new Index with just one Type, item. This is the mapping of the newly created Index:

{
  "item" : {
    "aliases" : { },
    "mappings" : {
      "item" : {
        "properties" : {
          "color" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          ...
          ...

As you can see, the newly created Index has only ONE Type.
This is one possible strategy to avoid having multiple Types in your Index, and it mitigates the sparsity problem, since the final Index contains only data belonging to the same entity. Repeating the reindex for the Type item_price, you will finally have three Indices:

yellow open products  ...
yellow open itemprice ...
yellow open item ...

Now you can delete the products Index and start working with Elasticsearch 6.0 with item and itemprice.

Use a custom field and filter by that field to discriminate a type:

"I have an Index with 3 Types, and all the Types have the same fields. So why should I split this Index into 3 new Indices, wasting my time and making my application more complex, since it would have to be updated to use 3 Indices instead of 1?"

This is a specific case where Type was used purely as a discriminator on an Index whose data is dense, with the same fields for each Type (this can apply even when the Types' fields do not match perfectly). If you think about what Types are, this is exactly their definition: a discriminator field used to filter an Index by type.

In this case we can just reindex all the Types into a new Index with a new field, which we will use to filter by type.

Let's make an example:

geoshapes:
  types:
    administrative_level_1:
      name: { type: string }
      slug: { type: string, index: not_analyzed }
      polygon:
        type: geo_shape
        tree: quadtree
        precision: 50m
      center: { type: geo_point }
      population: { type: integer, index: not_analyzed }
    administrative_level_2:
      name: { type: string }
      slug: { type: string, index: not_analyzed }
      polygon:
        type: geo_shape
        tree: quadtree
        precision: 50m
      center: { type: geo_point }
      population: { type: integer, index: not_analyzed }
    administrative_level_3:
      name: { type: string }
      slug: { type: string, index: not_analyzed }
      polygon:
        type: geo_shape
        tree: quadtree
        precision: 50m
      center: { type: geo_point }
      population: { type: integer, index: not_analyzed }

In this example (a meta-representation of an Index) I'm showing an Index geoshapes with 3 different Types: administrative_level_1, administrative_level_2, administrative_level_3.

All the Types in the Index have the same fields. The strategy is to create a new field, type, which will take 3 different values: level1, level2, level3. We use the Reindex API also in this case, to create a new Index with just one Type:

geoshapes:
  administrative_level:
    type: { type: string }
    name: { type: string }
    slug: { type: string, index: not_analyzed }
    polygon:
      type: geo_shape
      tree: quadtree
      precision: 50m
    center: { type: geo_point }
    population: { type: integer, index: not_analyzed }
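To perform the merge, the Reindex call can carry a script that writes the discriminator value while copying each document. A sketch of the request body built in Python (the destination index name geoshapes_v2 and the field/value names are assumptions following the example above; in ES 6.x the script body key is "source" and the default script language is painless):

```python
import json

def reindex_body(source_type, level_value):
    # One _reindex request per old type: copy its documents into the
    # single-type index, tagging each one with the new "type" field.
    return {
        "source": {"index": "geoshapes", "type": source_type},
        "dest": {"index": "geoshapes_v2"},
        "script": {
            "lang": "painless",
            "source": 'ctx._source.type = "%s"' % level_value,
        },
    }

body = json.dumps(reindex_body("administrative_level_2", "level2"), indent=2)
print(body)
```

POSTing one such body to localhost:9200/_reindex for each of the three levels fills geoshapes_v2 with all the documents, each carrying its level in the new field.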

Every time you need to get information for administrative_level_2, you will have to add to your query a condition on the newly created "type" field. This strategy is essentially identical to the _type meta-field that was used to filter by Type.
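Such a query could look like the following sketch (the searched name value is purely illustrative), combining the usual match with a term filter on the new field:

```python
import json

# Search level-2 shapes only: the term filter on the custom "type" field
# plays the role the _type meta-field used to play.
query = {
    "query": {
        "bool": {
            "must": [{"match": {"name": "some region name"}}],
            "filter": [{"term": {"type": "level2"}}],
        }
    }
}
print(json.dumps(query, indent=2))
```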

Conclusion

So that's all for now. In this long article I've tried to explain what changed in Elasticsearch 6.0, and how you can update your Indices so they keep working properly. I focused a lot on why Elastic changed Types, because I believe everyone working with Elasticsearch should understand it well.

This blog post belongs to a series of articles I would like to write to explore the new major version coming from Elastic.co.

References