Elasticsearch: mapping and analyzers
Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze large volumes of data quickly and in near real time. It is generally used as the underlying engine that powers applications with complex search features and requirements.
Elasticsearch is reputed to be schema-less, which means you can quickly insert documents into an index without specifying a schema beforehand. This differs from traditional relational databases, where you have to explicitly specify tables, fields, and field types.
Under the hood though, Elasticsearch enforces a schema called mapping which describes the fields in the documents along with their data types and how they should be indexed by Lucene.
If you do not specify an explicit mapping, Elasticsearch will generate one for you based on the first document you insert. The mapping will also be automatically updated as new fields are added to the documents. Let’s see how it works.
Let’s index some user profiles in Elasticsearch. Each user profile has, among other fields, a date_of_birth and a description. The date_of_birth is formatted as a timestamp in milliseconds.
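The indexing request itself is not reproduced here, so here is a minimal Python sketch of what it could look like. The cluster URL, the index name users, the document id, the field called name, and the description text are all assumptions made for illustration; only the user's name and the timestamp value come from this article. The sketch assumes Elasticsearch 7 or later, where documents are indexed through the _doc endpoint.

```python
import json
import urllib.request

# Assumptions for illustration: cluster URL, index name and document id.
ES_URL = "http://localhost:9200"
INDEX = "users"

# The field "name" and the description text are invented for this sketch.
user = {
    "name": "Bobby Kennedy",
    "date_of_birth": 553941680000,  # milliseconds since the Unix epoch
    "description": "Passionate about search engines",
}


def build_index_request(doc_id, document):
    """Build (but do not send) the PUT request that indexes one document."""
    return urllib.request.Request(
        url=f"{ES_URL}/{INDEX}/_doc/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_index_request(1, user)
print(request.get_method(), request.full_url)
# Sending it requires a running cluster:
# urllib.request.urlopen(request)
```

Note that date_of_birth is sent as a plain JSON number, which matters for the type Elasticsearch infers.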
Let’s verify that our document has actually been created:
Let’s inspect the mapping that has been created dynamically by Elasticsearch:
Elasticsearch has automatically inferred the type of each field based on a set of simple rules.
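To get a feel for these rules, here is a rough Python approximation of the default dynamic mapping behavior for a single JSON value. It is a simplified sketch, not Elasticsearch's actual implementation: date detection on strings, arrays, and null values are deliberately left out, and the function name infer_dynamic_type is hypothetical.

```python
def infer_dynamic_type(value):
    """Approximate the default dynamic mapping rule for one JSON value."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str):
        # Strings become full-text fields with a keyword sub-field.
        # Date detection on strings is left out of this sketch.
        return {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        }
    if isinstance(value, dict):
        return {"properties": {k: infer_dynamic_type(v) for k, v in value.items()}}
    raise ValueError(f"unsupported value in this sketch: {value!r}")


# Illustrative profile; the description text is invented.
profile = {
    "date_of_birth": 553941680000,
    "description": "Passionate about search engines",
}
mapping = infer_dynamic_type(profile)["properties"]
print(mapping["date_of_birth"])  # {'type': 'long'}
```

A numeric timestamp is just a number as far as these rules are concerned, so it ends up mapped as long rather than date.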
Dynamic mapping is great, but it can lead to unexpected search results.
Let’s suppose we want to search for all users born in the 80s. Our user Bobby Kennedy was born at epoch 553941680000, which corresponds to 1987-07-22, so he should show up in the results. We can issue a date range query on the date_of_birth field:
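As a sanity check, the conversion from epoch milliseconds to a calendar date can be reproduced in Python, alongside a sketch of what such a range query body could look like. The exact bounds used in the original query are not shown here, so the gte and lt values below are assumptions covering the 1980s.

```python
from datetime import datetime, timezone


def epoch_millis_to_datetime(millis):
    """Convert milliseconds since the Unix epoch to an aware UTC datetime."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


born = epoch_millis_to_datetime(553941680000)
print(born.date().isoformat())  # 1987-07-22

# Assumed bounds: everyone born from 1980-01-01 up to, but excluding, 1990-01-01.
range_query = {
    "query": {
        "range": {
            "date_of_birth": {
                "gte": "1980-01-01",
                "lt": "1990-01-01",
            }
        }
    }
}
```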
No hits?
If you had looked closely at the mapping, this would not be a surprise. Our date_of_birth field has a long type because we did not tell Elasticsearch to store it as a date. This is where explicit mapping comes into play. Let’s delete our index, set a mapping with an explicit date type for date_of_birth, and reindex our document.
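You cannot change the type of a field that is already mapped in an existing index; you can only add new fields. This is why we have to delete the index and reindex the document.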
Note that you can specify the mapping only partially. Fields you leave out will still be mapped dynamically.
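Here is a sketch of the three steps in Python, expressed as plain request descriptions rather than live calls. The index name users and the document id are assumptions, and the mapping body uses the typeless format of Elasticsearch 7 and later. Only date_of_birth is mapped explicitly; with no format given, the date type falls back to its default format, which accepts both ISO date strings and epoch milliseconds.

```python
import json

INDEX = "users"  # assumed index name

# Partial explicit mapping: only date_of_birth is declared.
explicit_mapping = {
    "mappings": {
        "properties": {
            "date_of_birth": {"type": "date"}
        }
    }
}


def recreate_index_steps(document):
    """Return the (method, path, body) requests to issue, in order."""
    return [
        ("DELETE", f"/{INDEX}", None),            # drop the old index
        ("PUT", f"/{INDEX}", explicit_mapping),   # create it with the mapping
        ("PUT", f"/{INDEX}/_doc/1", document),    # reindex the document
    ]


for method, path, body in recreate_index_steps({"date_of_birth": 553941680000}):
    print(method, path, json.dumps(body) if body is not None else "")
```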
Let’s reissue our range query:
Hurrah, it worked!
You know more about your data than Elasticsearch does
The description field has been indexed as type text. The text type is suited for full-text values like descriptions, bodies of emails, etc. Text values are passed through an analyzer that converts the original string into a list of individual terms that can be indexed. The default analyzer simply splits the string into tokens on word boundaries and lowercases each term.
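The tokenize-then-lowercase idea can be imitated in a few lines of Python. This is only a rough approximation: the real default analyzer relies on Unicode text segmentation rules, whereas this sketch splits on any run of non-word characters. The function name simple_analyze and the sample sentence are made up for illustration.

```python
import re


def simple_analyze(text):
    """Split text into word tokens and lowercase each one."""
    return [token.lower() for token in re.findall(r"\w+", text)]


print(simple_analyze("Passionate about Search Engines!"))
# ['passionate', 'about', 'search', 'engines']
```

The resulting lowercase terms are what ends up in the inverted index, not the original string.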
We can inspect how the analyzer works thanks to the _analyze API.
Let’s verify that a search for the word “passionate” matches:
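A match query on the description field could look like the body below. To illustrate why it matches, the sketch also mimics the matching logic locally: the query text goes through the same analysis as the indexed text, and the document matches when at least one query term is among the indexed terms. The analyze and matches helpers are simplified stand-ins written for this example, and the description text is invented.

```python
import re

# Request body for a match query on the description field.
match_query = {"query": {"match": {"description": "passionate"}}}


def analyze(text):
    """Rough stand-in for the default analyzer: tokenize and lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def matches(query_text, field_text):
    """True if any analyzed query term is among the indexed terms."""
    indexed_terms = set(analyze(field_text))
    return any(term in indexed_terms for term in analyze(query_text))


description = "Passionate about search engines"  # invented sample text
print(matches("passionate", description))  # True
print(matches("passion", description))     # False
```

The second call shows the limit of plain tokenization: the term passion is not in the index, so a search for it finds nothing.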
This is great, but we would like Elasticsearch to be tolerant of the different inflected forms of a word. In our example, we would like the document to match if we search for the word passion. This can be achieved with a language analyzer that stems the tokens. This process is called stemming: it reduces the inflected forms of a word to a common root so that they all match each other. Ex: passionate → passion, foxes → fox, jumped → jump, etc.
Let’s remove the index, apply the english analyzer to our description field, and reindex our document.
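As a sketch, the index creation body with the english analyzer applied to description could be written as follows, together with an _analyze request body that can be used to check how a word is stemmed. Keeping date_of_birth mapped as a date is an assumption carried over from the earlier step; the bodies use the typeless mapping format of Elasticsearch 7 and later.

```python
# Index creation body: description is analyzed with the built-in english analyzer.
english_mapping = {
    "mappings": {
        "properties": {
            "date_of_birth": {"type": "date"},
            "description": {"type": "text", "analyzer": "english"},
        }
    }
}

# Body for the _analyze API, to see which terms the english analyzer emits.
analyze_request = {"analyzer": "english", "text": "passionate"}

print(english_mapping["mappings"]["properties"]["description"])
```

Because the same analyzer is applied at index time and at search time by default, both passionate in the document and passion in the query are reduced to the same term.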
Let’s verify a search for the word passion matches:
Analysis is very flexible: you can even define your own analyzer. There are also plenty of analysis plugins, both official and community-supported. A very interesting one is the phonetic analysis plugin, which converts tokens into their phonetic equivalent using Soundex, Metaphone, and other encoders.
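For instance, a custom analyzer is declared in the index settings by combining a tokenizer with a chain of token filters. The sketch below assembles one from built-in components (the standard tokenizer, then the lowercase and porter_stem filters). The analyzer name my_english is hypothetical, and this particular combination is just one possible choice, not a recommendation.

```python
# Index creation body declaring a custom analyzer and using it on a field.
custom_analysis_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_english": {  # hypothetical analyzer name
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "porter_stem"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "description": {"type": "text", "analyzer": "my_english"}
        }
    },
}

analyzer = custom_analysis_settings["settings"]["analysis"]["analyzer"]["my_english"]
print(analyzer["tokenizer"], analyzer["filter"])
```

Filters run in the order they are listed, so tokens are lowercased before they are stemmed.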